Archive

Archive for February, 2009

My Perl and MySQL UTF-8 crib

February 11th, 2009 Paul Flo Williams 1 comment

Over the years I’ve had various ways of dealing with data beyond the ASCII range in web applications. I’ve had horrible things go wrong when maintaining a “home” and a “live” version of Manx, when I had machines with different versions of MySQL, and I never understood why dealing with UTF-8 across the Perl–MySQL bridge went wrong so much. However, time has healed these wounds, so here is my little crib sheet for getting things right.

MySQL

We started off with MySQL version 3.23, which had no idea of character encodings. If you created a CHAR, VARCHAR, or TEXT column, MySQL still treated the characters as if they were bytes. You were expected to know which character encoding you were using on the way in, and use the same one on the way out. MySQL couldn’t label character columns as having a particular encoding, so its idea of sorting strings was also restricted to numerical comparisons of bytes.

Character encodings were introduced in MySQL version 4.1 and today, MySQL version 5.0 is present on modern Linux distributions.

If you’re starting the database from scratch with MySQL 5.0, things are easy, because all you have to do is to label the character encoding of the database uniformly as UTF-8.

When you create the database, use the statement CREATE DATABASE foo CHARACTER SET utf8; This sets the default encoding for all tables in the database, and all columns in those tables.

Database connections

Having labelled the character encoding for the database itself, we now need to make sure that connections to the database are labelled with the same encoding, or transcoding of character sets will happen. If you’re using the MySQL client, mysql, to connect in a shell that is using UTF-8 (i.e. from a modern Linux box), you can issue the statement

SET NAMES utf8;

before you do anything else, and several internal variables to do with the connection encoding will be set correctly.

If you’re using Perl’s DBI and DBD::mysql modules to connect, you can define the character encoding in an attribute on the session handle:

my $dbh = DBI->connect($data_source, $user, $pass, { mysql_enable_utf8 => 1} );

This is the trick that I was missing for quite a while, and I was plugging the gap by using a module called UTF8DBI.pm to wrap DBI, but this is no longer necessary.

Perl

Since about version 5.8.0, Perl now knows the character encoding of strings that it uses, and the encoding of file streams. Web applications using the CGI interface will send their output to STDOUT, so we need to label the encoding of STDOUT to be the same as our internal encoding so that Perl doesn’t transcode. Somewhere near the top of the program, before any output is produced, do:

binmode STDOUT, ':utf8';

If you’re using the CGI module, you will need to specify the encoding in the HTTP headers that go to the browser, because the default is latin1:

print header(-type => 'text/html', -charset => 'utf-8');

Note the difference between the labelling of the stream, utf8 and the labelling of the web content, utf-8. Grrrrr

Categories: MySQL, Perl Tags:

Converting a MySQL version 3.23 database to version 5.0

February 5th, 2009 Paul Flo Williams 5 comments

You’d think there might be a one-step approach to this, as it would be a common need, but because we’re going from a database version that didn’t know about character encodings at all, its contents could be in any old encoding. These steps assume that the database was used to store Latin1 (ISO 8859-1) characters before. In the examples below, we’re going to convert a database called “libris2″ (because that’s a real one I had to convert – these steps are tried and tested!)

1. Dump the old database

The standard way of dumping a database from MySQL is

mysqldump --opt libris2 > libris2.sql

That dumps all the tables from a database and all the data in a form designed for efficient insertion into a new database, with very long INSERT statements. However, I want to be able to check the conversions that I’m about to do, so I’ll dump the database slightly differently:

mysqldump --add-drop-table --add-locks --all
          --quick --lock-tables libris2 > libris2.sql

Here, I’ve used all the options that “–opt” would give me, with the exception of “–extended-insert”. Now, I have the complete database in a text file, libris2.sql.

Note: Perform this dump on the machine running the old MySQL server, so that you get an old (matching) version of mysqldump. If mysqldump doesn’t match the server version, you may not get any output at all. I have been in the situation where my database was automatically migrated from an old server to a newer version, and convincing mysqldump to ignore character encoding details in the dump on newer versions is a pain. Migrate your data while you still have the old server running.

2. Convert the character encodings

If I’m sure of the encoding of data in the database, I can do this:

iconv -f latin1 -t utf8 < libris2.sql > libris-utf8.sql

Another way of doing this is to edit the file in vim, which will automatically convert the character encoding of a file into one appropriate for the shell its running in. For example, if I do

vim libris2.sql

vim will open the file and display a status line saying:

"libris2.sql" [converted] 234L, 733981C

showing that the encoding of the file wasn’t already UTF-8. If I type the command “:set fileencoding” (or “:set fenc”), vim will report

fileencoding=latin1

If I wanted to, I could get vim to reencode the file for me:

:set fenc=utf8
:w libris2-utf8.sql
:q

Although using vim takes a bit longer because it’s interactive, vim will at least make a good guess at the original encoding, in case I couldn’t remember.

At this point, I like to see what conversions have been made, which is why I didn’t want super-long lines in the database dump:

diff libris2.sql libris2-utf8 | more

As my shell uses UTF-8, some characters in the original file will appear as junk, but they should appear correctly in the output file. If I’ve made any mistakes in guessing the original encoding when I used iconv, I can have another go.

3. Update SQL definitions

For some reason, mysqldump may produce table definitions that are invalid in later versions of MySQL if you are using auto incrementing key fields. If mysqldump produces a table definition that looks like this:

CREATE TABLE DIRECTORS (
   DIRECTORS.id int(11) DEFAULT '0' NOT NULL auto_increment,
   PRIMARY KEY (id)
);

then the default clause will need to be filtered before importing into MySQL 5.0. I do this for all table definitions in one go with a simple sed script:

sed -e "/auto_increment/ s/DEFAULT '0'//" database.sql > database-filt.sql

4. Import the database dump

Now we have a database dump with the correct encoding, so it only remains to import it to MySQL 5.0 correctly. If we were re-importing the dump to the same version of MySQL, that would be as simple as:

mysql libris2 < libris2.sql

but we need to make sure that the default encoding of the new database is correct, and we need to set the encoding used for our connection to the database. To do that, I’m going to create a small file containing some extra commands that I’ll place in front of the commands in the database dump. Create a file called libris2-header containing these lines:

set names utf8;
drop database if exists libris2;
create database libris2 character set utf8;
use libris2;

The beauty of putting these lines in a separate file, rather than adding them to the top of the database dump, is that I can re-use them if I need to dump the database from the old server again.

Now, I use my little header file and the converted database dump to create a new database on my MySQL 5.0 server:

cat libris2-header libris2-utf8.sql | mysql

Note that I haven’t specified the database name as an option to mysql: that’s because it doesn’t exist yet.

Categories: MySQL Tags:

Adding history to a database

February 5th, 2009 Paul Flo Williams Comments off

I’ve been wondering how to let other people collaborate on a database without it turning to crap. You see, I’ve been updating Manx, a catalogue of old computer manuals, for a few years now by myself.

Manx lists the manuals produced by a bunch of old computer companies, and records scanned copies that have been put online. On the surface, the database is very simple. The records of each publication can be objectively correct; if you have the manual in front of you and the title, part number and publication date match the database, your work is done. However, Manx attempts to catalogue manuals that we don’t yet have copies of. These entries have come from documentation indexes, and are likely partial. Entries pulled from other databases or sources online are also likely to be partial, or contain errors.

At the moment, the database doesn’t store the history of individual records, as the assumption is that each record will become more correct over time, and there is no point recording how poor each entry used to be. However, collaboration changes that requirement. Even the best intentioned of contributors will make mistakes, and I now need a way of finding these and rolling them back.

I have been looking at different approaches at dealing with history in databases, and it is obvious that the plan of attack depends on why you need to store history. The term for this need seems to be “slow changing dimensions“. My application is rather like Amazon’s book catalogue. They accept corrections, and they go into the catalogue after a human reviewer has taken a look.

The approach I’m going to take is rather like the “Type 4″ methodology mentioned in the Wikipedia article, which happens to be the approach used for tracking changes to Wikipedia articles: history tables.

The current PUB table that keeps the details of each publication will be split into PUB and PUBHISTORY. PUBHISTORY will record every version of every publication record, along with some new details about when the change was made, and by whom. PUB will now store an index into PUBHISTORY for the current version of each publication record, to speed up searching. Once a row goes into PUBHISTORY, it will never be modified. Even deletion of a publication (because it was created by mistake) will be recorded at PUB level, rather than by deleting history.

Categories: Manx Tags: