<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>His Deeds Are Dust &#187; MySQL</title>
	<atom:link href="http://hisdeedsaredust.com/category/mysql/feed/" rel="self" type="application/rss+xml" />
	<link>http://hisdeedsaredust.com</link>
	<description>surveying sub-optimal solutions</description>
	<lastBuildDate>Tue, 17 Jan 2012 15:52:58 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
		<item>
		<title>My Perl and MySQL UTF-8 crib</title>
		<link>http://hisdeedsaredust.com/2009/02/my-perl-and-mysql-utf-8-crib/</link>
		<comments>http://hisdeedsaredust.com/2009/02/my-perl-and-mysql-utf-8-crib/#comments</comments>
		<pubDate>Wed, 11 Feb 2009 14:09:37 +0000</pubDate>
		<dc:creator>Paul Flo Williams</dc:creator>
				<category><![CDATA[MySQL]]></category>
		<category><![CDATA[Perl]]></category>
		<category><![CDATA[database]]></category>

		<guid isPermaLink="false">http://hisdeedsaredust.com/?p=17</guid>
		<description><![CDATA[Over the years I&#8217;ve had various ways of dealing with data beyond the ASCII range in web applications. I&#8217;ve had horrible things go wrong when maintaining a &#8220;home&#8221; and a &#8220;live&#8221; version of Manx, when I had machines with different versions of MySQL, and I never understood why dealing with UTF-8 across the Perl&#8211;MySQL bridge [...]]]></description>
			<content:encoded><![CDATA[<p>Over the years I&#8217;ve had various ways of dealing with data beyond the ASCII range in web applications. I&#8217;ve had horrible things go wrong when maintaining a &#8220;home&#8221; and a &#8220;live&#8221; version of <a href="http://vt100.net/manx">Manx</a>, when I had machines with different versions of MySQL, and I never understood why dealing with UTF-8 across the Perl&ndash;MySQL bridge went wrong so much. However, time has healed these wounds, so here is my little crib sheet for getting things right.</p>
<h4>MySQL</h4>
<p>We started off with MySQL version 3.23, which had no idea of character encodings. If you created a CHAR, VARCHAR, or TEXT column, MySQL still treated the characters as if they were bytes. You were expected to know which character encoding you were using on the way in, and use the same one on the way out. MySQL couldn&#8217;t label character columns as having a particular encoding, so its idea of sorting strings was also restricted to numerical comparisons of bytes.</p>
<p>Character encodings were introduced in MySQL version 4.1, and today MySQL version 5.0 is present on modern Linux distributions.</p>
<p>If you&#8217;re starting the database from scratch with MySQL 5.0, things are easy, because all you have to do is to label the character encoding of the database uniformly as UTF-8.</p>
<p>When you create the database, use the statement <code>CREATE DATABASE foo CHARACTER SET utf8;</code> This sets the default encoding for all tables in the database, and all columns in those tables.</p>
<h4>Database connections</h4>
<p>Having labelled the character encoding for the database itself, we now need to make sure that connections to the database are labelled with the same encoding, or transcoding of character sets will happen. If you&#8217;re using the MySQL client, <tt>mysql</tt>, to connect in a shell that is using UTF-8 (i.e. from a modern Linux box), you can issue the statement</p>
<p><code>SET NAMES utf8;</code></p>
<p>before you do anything else, and several internal variables to do with the connection encoding will be set correctly.</p>
<p>If you&#8217;re using Perl&#8217;s DBI and DBD::mysql modules to connect, you can define the character encoding in an attribute on the session handle:</p>
<p><code>my $dbh = DBI->connect($data_source, $user, $pass, { mysql_enable_utf8 => 1} );</code></p>
<p>This is the trick that I was missing for quite a while, and I was plugging the gap by using a module called UTF8DBI.pm to wrap DBI, but this is no longer necessary.</p>
<h4>Perl</h4>
<p>Since about version 5.8.0, Perl has known the character encoding of the strings it uses, and the encoding of file streams. Web applications using the CGI interface send their output to STDOUT, so we need to label the encoding of STDOUT to be the same as our internal encoding so that Perl doesn&#8217;t transcode. Somewhere near the top of the program, before any output is produced, do:</p>
<p><code>binmode STDOUT, ':utf8';</code></p>
<p>If you&#8217;re using the CGI module, you will need to specify the encoding in the HTTP headers that go to the browser, because the default is latin1:</p>
<p><code>print header(-type => 'text/html', -charset => 'utf-8');</code></p>
<p>Note the difference between the labelling of the stream, <b>utf8</b>, and the labelling of the web content, <b>utf-8</b>. <em>Grrrrr</em></p>
]]></content:encoded>
			<wfw:commentRss>http://hisdeedsaredust.com/2009/02/my-perl-and-mysql-utf-8-crib/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Converting a MySQL version 3.23 database to version 5.0</title>
		<link>http://hisdeedsaredust.com/2009/02/mysql-version-3-to-version-5/</link>
		<comments>http://hisdeedsaredust.com/2009/02/mysql-version-3-to-version-5/#comments</comments>
		<pubDate>Thu, 05 Feb 2009 13:03:51 +0000</pubDate>
		<dc:creator>Paul Flo Williams</dc:creator>
				<category><![CDATA[MySQL]]></category>
		<category><![CDATA[database]]></category>

		<guid isPermaLink="false">http://hisdeedsaredust.com/?p=11</guid>
		<description><![CDATA[You&#8217;d think there might be a one-step approach to this, as it would be a common need, but because we&#8217;re going from a database version that didn&#8217;t know about character encodings at all, its contents could be in any old encoding. These steps assume that the database was used to store Latin1 (ISO&#160;8859-1) characters before. [...]]]></description>
			<content:encoded><![CDATA[<p>You&#8217;d think there might be a one-step approach to this, as it would be a common need, but because we&#8217;re going from a database version that didn&#8217;t know about character encodings at all, its contents could be in any old encoding. These steps assume that the database was used to store Latin1 (ISO&nbsp;8859-1) characters before. In the examples below, we&#8217;re going to convert a database called &#8220;libris2&#8221; (because that&#8217;s a real one I had to convert &ndash; these steps are tried and tested!)</p>
<h4>1. Dump the old database</h4>
<p>The standard way of dumping a database from MySQL is</p>
<pre>mysqldump --opt libris2 &gt; libris2.sql</pre>
<p>That dumps all the tables from a database and all the data in a form designed for efficient insertion into a new database, with very long INSERT statements. However, I want to be able to check the conversions that I&#8217;m about to do, so I&#8217;ll dump the database slightly differently:</p>
<pre>mysqldump --add-drop-table --add-locks --all \
          --quick --lock-tables libris2 &gt; libris2.sql</pre>
<p>Here, I&#8217;ve used all the options that &#8220;--opt&#8221; would give me, with the exception of &#8220;--extended-insert&#8221;. Now I have the complete database in a text file, <tt>libris2.sql</tt>.</p>
<p><strong>Note</strong>: Perform this dump on the machine running the old MySQL server, so that you get an old (matching) version of <tt>mysqldump</tt>. If <tt>mysqldump</tt> doesn&#8217;t match the server version, you may not get any output at all. I have been in the situation where my database was automatically migrated from an old server to a newer version, and convincing <tt>mysqldump</tt> to ignore character encoding details in the dump on newer versions is a pain. Migrate your data while you still have the old server running.</p>
<h4>2. Convert the character encodings</h4>
<p>If I&#8217;m sure of the encoding of data in the database, I can do this:</p>
<pre>iconv -f latin1 -t utf8 &lt; libris2.sql &gt; libris2-utf8.sql</pre>
<p>Another way of doing this is to edit the file in <tt>vim</tt>, which will automatically convert the character encoding of a file into one appropriate for the shell it&#8217;s running in. For example, if I do</p>
<pre>vim libris2.sql</pre>
<p><tt>vim</tt> will open the file and display a status line saying:</p>
<pre>"libris2.sql" [converted] 234L, 733981C</pre>
<p>showing that the encoding of the file wasn&#8217;t already UTF-8. If I type the command &#8220;:set fileencoding&#8221; (or &#8220;:set fenc&#8221;), <tt>vim</tt> will report</p>
<pre>fileencoding=latin1</pre>
<p>If I wanted to, I could get <tt>vim</tt> to reencode the file for me:</p>
<pre>:set fenc=utf8
:w libris2-utf8.sql
:q
</pre>
<p>Although using <tt>vim</tt> takes a bit longer because it&#8217;s interactive, <tt>vim</tt> will at least make a good guess at the original encoding, in case I couldn&#8217;t remember.</p>
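<p>To make the conversion concrete, here is a minimal shell sketch of what <tt>iconv</tt> does to a single Latin1 character. The scratch file names and the <tt>hex_of</tt> helper are made up for illustration, and the encodings are spelled <tt>ISO-8859-1</tt> and <tt>UTF-8</tt> on the assumption that those names are accepted more widely than the <tt>latin1</tt> and <tt>utf8</tt> aliases.</p>

```shell
#!/bin/sh
# Scratch files for illustration only (hypothetical names).
workdir=$(mktemp -d)
latin1_file="$workdir/sample-latin1.txt"
utf8_file="$workdir/sample-utf8.txt"

# Print the bytes of a file as one continuous hex string.
hex_of() {
    od -An -tx1 "$1" | tr -d ' \n'
}

# The word "cafe" with an acute accent, in Latin1:
# the accented e is the single byte 0xE9 (octal 351).
printf 'caf\351\n' > "$latin1_file"

# The same kind of conversion as in the text, with full encoding names.
iconv -f ISO-8859-1 -t UTF-8 "$latin1_file" > "$utf8_file"

latin1_hex=$(hex_of "$latin1_file")
utf8_hex=$(hex_of "$utf8_file")

echo "latin1: $latin1_hex"   # 636166e90a
echo "utf8:   $utf8_hex"     # 636166c3a90a (0xE9 became 0xC3 0xA9)
```

<p>Every byte in the ASCII range passes through untouched; only bytes above 0x7F grow into multi-byte sequences, which is why the converted dump is slightly larger than the original.</p>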
<p>At this point, I like to see what conversions have been made, which is why I didn&#8217;t want super-long lines in the database dump:</p>
<pre>diff libris2.sql libris2-utf8.sql | more</pre>
<p>As my shell uses UTF-8, some characters in the original file will appear as junk, but they should appear correctly in the output file. If I&#8217;ve made any mistakes in guessing the original encoding when I used <tt>iconv</tt>, I can have another go.</p>
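<p>Eyeballing the diff can be backed up with a mechanical check. The sketch below assumes that asking <tt>iconv</tt> to convert a file from UTF-8 to UTF-8 fails when the input contains bytes that are not valid UTF-8; the function name <tt>is_valid_utf8</tt> is my own invention. It will catch a dump that was never converted, but not a wrong guess at the source encoding, since converting from the wrong single-byte encoding still yields valid UTF-8 (just the wrong characters).</p>

```shell
#!/bin/sh
# Succeeds only if the file decodes cleanly as UTF-8.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2> /dev/null
}

# Report on each file named on the command line (does nothing if none given).
for dump in "$@"; do
    if is_valid_utf8 "$dump"; then
        echo "$dump: valid UTF-8"
    else
        echo "$dump: contains bytes that are not valid UTF-8"
    fi
done
```

<p>Saved as, say, <tt>check-utf8.sh</tt> (a hypothetical name), it could be run as <tt>sh check-utf8.sh libris2.sql libris2-utf8.sql</tt>. A dump containing nothing beyond ASCII will pass too, because ASCII is a subset of UTF-8.</p>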
<h4>3. Update SQL definitions</h4>
<p>For some reason, <tt>mysqldump</tt> may produce table definitions that are invalid in later versions of MySQL if you are using auto-incrementing key fields. If <tt>mysqldump</tt> produces a table definition that looks like this:</p>
<pre>CREATE TABLE DIRECTORS (
   DIRECTORS.id int(11) DEFAULT '0' NOT NULL auto_increment,
   PRIMARY KEY (id)
);
</pre>
<p>then the default clause will need to be filtered before importing into MySQL 5.0. I do this for all table definitions in one go with a simple <tt>sed</tt> script:</p>
<pre>sed -e "/auto_increment/ s/DEFAULT '0'//" database.sql &gt; database-filt.sql</pre>
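<p>Here is a small sketch of that filter applied to a made-up table definition, to show that the address <tt>/auto_increment/</tt> restricts the substitution to the lines that need it. The sample table, the <tt>films</tt> column and the function name are assumptions for illustration only.</p>

```shell
#!/bin/sh
# Remove DEFAULT '0' only from lines that also mention auto_increment.
strip_auto_increment_default() {
    sed -e "/auto_increment/ s/DEFAULT '0'//" "$1"
}

# A hypothetical dump fragment: one auto_increment column, one ordinary column.
sample=$(mktemp)
printf '%s\n' \
    "CREATE TABLE DIRECTORS (" \
    "   id int(11) DEFAULT '0' NOT NULL auto_increment," \
    "   films int(11) DEFAULT '0' NOT NULL," \
    "   PRIMARY KEY (id)" \
    ");" > "$sample"

filtered=$(strip_auto_increment_default "$sample")

# The id line loses its default; the films line keeps it.
printf '%s\n' "$filtered"
```

<p>The filter leaves a double space where the clause used to be, which MySQL ignores.</p>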
<h4>4. Import the database dump</h4>
<p>Now we have a database dump with the correct encoding, so it only remains to import it into MySQL 5.0 correctly. If we were re-importing the dump to the same version of MySQL, that would be as simple as:</p>
<pre>mysql libris2 &lt; libris2.sql</pre>
<p>but we need to make sure that the default encoding of the new database is correct, and we need to set the encoding used for our connection to the database. To do that, I&#8217;m going to create a small file containing some extra commands that I&#8217;ll place in front of the commands in the database dump. Create a file called <tt>libris2-header</tt> containing these lines:</p>
<pre>set names utf8;
drop database if exists libris2;
create database libris2 character set utf8;
use libris2;
</pre>
<p>The beauty of putting these lines in a separate file, rather than adding them to the top of the database dump, is that I can re-use them if I need to dump the database from the old server again.</p>
<p>Now, I use my little header file and the converted database dump to create a new database on my MySQL 5.0 server:</p>
<pre>cat libris2-header libris2-utf8.sql | mysql
</pre>
<p>Note that I haven&#8217;t specified the database name as an option to <tt>mysql</tt>: that&#8217;s because it doesn&#8217;t exist yet.</p>
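<p>Before pointing the pipeline at a real server, it can be worth previewing exactly what <tt>mysql</tt> is going to receive. The sketch below rebuilds the header file and concatenates it with a stand-in dump; the function names and the one-line dump are made up for illustration, and nothing here talks to a server.</p>

```shell
#!/bin/sh
# Write the four header statements from the text to the named file.
write_header() {
    printf '%s\n' \
        "set names utf8;" \
        "drop database if exists libris2;" \
        "create database libris2 character set utf8;" \
        "use libris2;" > "$1"
}

# Emit the header followed by the dump, ready to pipe into mysql.
import_stream() {
    cat "$1" "$2"
}

workdir=$(mktemp -d)
write_header "$workdir/libris2-header"

# Stand-in for the converted dump (a single made-up statement).
printf '%s\n' "CREATE TABLE example (id int);" > "$workdir/libris2-utf8.sql"

# Preview the combined stream instead of sending it to a server.
import_stream "$workdir/libris2-header" "$workdir/libris2-utf8.sql"
```

<p>Once the preview starts with <tt>set names utf8;</tt> and the rest looks right, the same stream can be piped into <tt>mysql</tt> as shown above.</p>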
]]></content:encoded>
			<wfw:commentRss>http://hisdeedsaredust.com/2009/02/mysql-version-3-to-version-5/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
	</channel>
</rss>

