My Perl and MySQL UTF-8 crib
By Paul Flo Williams
Over the years I’ve had various ways of dealing with data beyond the ASCII range in web applications. I’ve had horrible things go wrong when maintaining a “home” and a “live” version of Manx, when I had machines with different versions of MySQL, and I never understood why dealing with UTF-8 across the Perl–MySQL bridge went wrong so much. However, time has healed these wounds, so here is my little crib sheet for getting things right.
MySQL
We started off with MySQL version 3.23, which had no idea of character encodings. If you created a CHAR, VARCHAR, or TEXT column, MySQL still treated the characters as if they were bytes. You were expected to know which character encoding you were using on the way in, and use the same one on the way out. MySQL couldn’t label character columns as having a particular encoding, so its idea of sorting strings was also restricted to numerical comparisons of bytes.
Character encodings were introduced in MySQL version 4.1 and today, MySQL version 5.0 is present on modern Linux distributions.
If you’re starting the database from scratch with MySQL 5.0, things are easy, because all you have to do is to label the character encoding of the database uniformly as UTF-8.
When you create the database, use the statement CREATE DATABASE foo
CHARACTER SET utf8;
This sets the default encoding for all tables in the
database, and all columns in those tables.
Database connections
Having labelled the character encoding for the database itself, we now need to make sure that connections to the database are labelled with the same encoding, or transcoding of character sets will happen. If you’re using the MySQL client, mysql, to connect in a shell that is using UTF-8 (i.e. from a modern Linux box), you can issue the statement
SET NAMES utf8;
before you do anything else, and several internal variables to do with the connection encoding will be set correctly.
If you’re using Perl’s DBI and DBD::mysql modules to connect, you can define the character encoding in an attribute on the session handle:
my $dbh = DBI->connect($data_source, $user, $pass, { mysql_enable_utf8 => 1} );
This is the trick that I was missing for quite a while, and I was plugging the gap by using a module called UTF8DBI.pm to wrap DBI, but this is no longer necessary.
Perl
Since about version 5.8.0, Perl now knows the character encoding of strings that it uses, and the encoding of file streams. Web applications using the CGI interface will send their output to STDOUT, so we need to label the encoding of STDOUT to be the same as our internal encoding so that Perl doesn’t transcode. Somewhere near the top of the program, before any output is produced, do:
binmode STDOUT, ':utf8';
If you’re using the CGI module, you will need to specify the encoding in the HTTP headers that go to the browser, because the default is latin1:
print header(-type => 'text/html', -charset => 'utf-8');
Note the difference between the labelling of the stream, utf8 and the labelling of the web content, utf-8. Grrrrr