You’d think there might be a one-step approach to this, as it would be a common need, but because we’re going from a database version that didn’t know about character encodings at all, its contents could be in any old encoding. These steps assume that the database was used to store Latin1 (ISO 8859-1) characters before. In the examples below, we’re going to convert a database called “libris2″ (because that’s a real one I had to convert – these steps are tried and tested!)
1. Dump the old database
The standard way of dumping a database from MySQL is
mysqldump --opt libris2 > libris2.sql
That dumps all the tables from a database and all the data in a form designed for efficient insertion into a new database, with very long INSERT statements. However, I want to be able to check the conversions that I’m about to do, so I’ll dump the database slightly differently:
mysqldump --add-drop-table --add-locks --all
--quick --lock-tables libris2 > libris2.sql
Here, I’ve used all the options that “–opt” would give me, with the exception of “–extended-insert”. Now, I have the complete database in a text file, libris2.sql.
Note: Perform this dump on the machine running the old MySQL server, so that you get an old (matching) version of mysqldump. If mysqldump doesn’t match the server version, you may not get any output at all. I have been in the situation where my database was automatically migrated from an old server to a newer version, and convincing mysqldump to ignore character encoding details in the dump on newer versions is a pain. Migrate your data while you still have the old server running.
2. Convert the character encodings
If I’m sure of the encoding of data in the database, I can do this:
iconv -f latin1 -t utf8 < libris2.sql > libris-utf8.sql
Another way of doing this is to edit the file in vim, which will automatically convert the character encoding of a file into one appropriate for the shell its running in. For example, if I do
vim libris2.sql
vim will open the file and display a status line saying:
"libris2.sql" [converted] 234L, 733981C
showing that the encoding of the file wasn’t already UTF-8. If I type the command “:set fileencoding” (or “:set fenc”), vim will report
fileencoding=latin1
If I wanted to, I could get vim to reencode the file for me:
:set fenc=utf8
:w libris2-utf8.sql
:q
Although using vim takes a bit longer because it’s interactive, vim will at least make a good guess at the original encoding, in case I couldn’t remember.
At this point, I like to see what conversions have been made, which is why I didn’t want super-long lines in the database dump:
diff libris2.sql libris2-utf8 | more
As my shell uses UTF-8, some characters in the original file will appear as junk, but they should appear correctly in the output file. If I’ve made any mistakes in guessing the original encoding when I used iconv, I can have another go.
3. Update SQL definitions
For some reason, mysqldump may produce table definitions that are invalid in later versions of MySQL if you are using auto incrementing key fields. If mysqldump produces a table definition that looks like this:
CREATE TABLE DIRECTORS (
DIRECTORS.id int(11) DEFAULT '0' NOT NULL auto_increment,
PRIMARY KEY (id)
);
then the default clause will need to be filtered before importing into MySQL 5.0. I do this for all table definitions in one go with a simple sed script:
sed -e "/auto_increment/ s/DEFAULT '0'//" database.sql > database-filt.sql
4. Import the database dump
Now we have a database dump with the correct encoding, so it only remains to import it to MySQL 5.0 correctly. If we were re-importing the dump to the same version of MySQL, that would be as simple as:
mysql libris2 < libris2.sql
but we need to make sure that the default encoding of the new database is correct, and we need to set the encoding used for our connection to the database. To do that, I’m going to create a small file containing some extra commands that I’ll place in front of the commands in the database dump. Create a file called libris2-header containing these lines:
set names utf8;
drop database if exists libris2;
create database libris2 character set utf8;
use libris2;
The beauty of putting these lines in a separate file, rather than adding them to the top of the database dump, is that I can re-use them if I need to dump the database from the old server again.
Now, I use my little header file and the converted database dump to create a new database on my MySQL 5.0 server:
cat libris2-header libris2-utf8.sql | mysql
Note that I haven’t specified the database name as an option to mysql: that’s because it doesn’t exist yet.