<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>His Deeds Are Dust &#187; database</title>
	<atom:link href="http://hisdeedsaredust.com/tag/database/feed/" rel="self" type="application/rss+xml" />
	<link>http://hisdeedsaredust.com</link>
	<description>surveying sub-optimal solutions</description>
	<lastBuildDate>Wed, 02 May 2012 13:16:04 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
		<item>
		<title>Database dumps from Manx</title>
		<link>http://hisdeedsaredust.com/2009/04/database-dumps-from-manx/</link>
		<comments>http://hisdeedsaredust.com/2009/04/database-dumps-from-manx/#comments</comments>
		<pubDate>Fri, 03 Apr 2009 21:09:58 +0000</pubDate>
		<dc:creator>Paul Flo Williams</dc:creator>
				<category><![CDATA[Manx]]></category>
		<category><![CDATA[database]]></category>

		<guid isPermaLink="false">http://hisdeedsaredust.com/?p=28</guid>
		<description><![CDATA[I had an email yesterday from someone who&#8217;d been scanning documents and contributing them to some of the sites covered by Manx, including Bitsavers. Having a large stash of DEC documentation, he wanted to make sure he directed his efforts towards those documents that aren&#8217;t already scanned, so he asked for a list of all [...]]]></description>
			<content:encoded><![CDATA[<p>I had an email yesterday from someone who&#8217;d been scanning documents and contributing them to some of the sites covered by Manx, including <a href="http://bitsavers.org">Bitsavers</a>. Having a large stash of DEC documentation, he wanted to make sure he directed his efforts towards those documents that aren&#8217;t already scanned, so he asked for a list of all the DEC documents in <a href="http://vt100.net/manx/">Manx</a>, and whether any copies were known to exist online.</p>
<p>I&#8217;ve put the dump online, in case others find it useful: <a href="http://vt100.net/manx/dump/dec-all-20090403.tsv">DEC documentation status</a>.</p>
<p>This is likely out of date, because I don&#8217;t know what proportion of the DEC documentation online is not yet catalogued in Manx. I don&#8217;t even know what proportion of Bitsavers&#8217; holdings are not covered, and I <em>could</em> at least calculate that.</p>
<p>I used to put fuller dumps of Manx online, and supplied complete dumps by email to a number of other archivists, but I no longer generate them as I never heard from anyone who&#8217;d found them useful. Nevertheless, if you&#8217;d like a particular report from Manx, please contact me, as it may help in planning the new facilities for users.</p>
]]></content:encoded>
			<wfw:commentRss>http://hisdeedsaredust.com/2009/04/database-dumps-from-manx/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>My Perl and MySQL UTF-8 crib</title>
		<link>http://hisdeedsaredust.com/2009/02/my-perl-and-mysql-utf-8-crib/</link>
		<comments>http://hisdeedsaredust.com/2009/02/my-perl-and-mysql-utf-8-crib/#comments</comments>
		<pubDate>Wed, 11 Feb 2009 14:09:37 +0000</pubDate>
		<dc:creator>Paul Flo Williams</dc:creator>
				<category><![CDATA[MySQL]]></category>
		<category><![CDATA[Perl]]></category>
		<category><![CDATA[database]]></category>

		<guid isPermaLink="false">http://hisdeedsaredust.com/?p=17</guid>
		<description><![CDATA[Over the years I&#8217;ve had various ways of dealing with data beyond the ASCII range in web applications. I&#8217;ve had horrible things go wrong when maintaining a &#8220;home&#8221; and a &#8220;live&#8221; version of Manx, when I had machines with different versions of MySQL, and I never understood why dealing with UTF-8 across the Perl&#8211;MySQL bridge [...]]]></description>
			<content:encoded><![CDATA[<p>Over the years I&#8217;ve had various ways of dealing with data beyond the ASCII range in web applications. I&#8217;ve had horrible things go wrong when maintaining a &#8220;home&#8221; and a &#8220;live&#8221; version of <a href="http://vt100.net/manx">Manx</a>, when I had machines with different versions of MySQL, and I never understood why dealing with UTF-8 across the Perl&ndash;MySQL bridge went wrong so much. However, time has healed these wounds, so here is my little crib sheet for getting things right.</p>
<h4>MySQL</h4>
<p>We started off with MySQL version 3.23, which had no idea of character encodings. If you created a CHAR, VARCHAR, or TEXT column, MySQL still treated the characters as if they were bytes. You were expected to know which character encoding you were using on the way in, and use the same one on the way out. MySQL couldn&#8217;t label character columns as having a particular encoding, so its idea of sorting strings was also restricted to numerical comparisons of bytes.</p>
<p>Character encodings were introduced in MySQL version 4.1, and today MySQL version 5.0 is present on modern Linux distributions.</p>
<p>If you&#8217;re starting the database from scratch with MySQL 5.0, things are easy, because all you have to do is to label the character encoding of the database uniformly as UTF-8.</p>
<p>When you create the database, use the statement <code>CREATE DATABASE foo CHARACTER SET utf8;</code> This sets the default encoding for all tables in the database, and all columns in those tables.</p>
<h4>Database connections</h4>
<p>Having labelled the character encoding for the database itself, we now need to make sure that connections to the database are labelled with the same encoding, or transcoding of character sets will happen. If you&#8217;re using the MySQL client, <tt>mysql</tt>, to connect in a shell that is using UTF-8 (i.e. from a modern Linux box), you can issue the statement</p>
<p><code>SET NAMES utf8;</code></p>
<p>before you do anything else, and several internal variables to do with the connection encoding will be set correctly.</p>
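<p>To see what that statement changed, you can ask the server for the character set variables of the current connection. This is a minimal sketch: as far as I know, <code>SET NAMES</code> sets <code>character_set_client</code>, <code>character_set_connection</code> and <code>character_set_results</code>, but the full list of variables reported will vary between MySQL versions.</p>
<pre>SET NAMES utf8;
-- List the character set variables that apply to this connection.
SHOW VARIABLES LIKE 'character_set%';
</pre>
<p>If the client, connection and results variables don&#8217;t all report <b>utf8</b>, transcoding is probably still happening somewhere between the shell and the database.</p>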
<p>If you&#8217;re using Perl&#8217;s DBI and DBD::mysql modules to connect, you can define the character encoding in an attribute passed when the database handle is created:</p>
<p><code>my $dbh = DBI->connect($data_source, $user, $pass, { mysql_enable_utf8 => 1} );</code></p>
<p>This is the trick that I was missing for quite a while, and I was plugging the gap by using a module called UTF8DBI.pm to wrap DBI, but this is no longer necessary.</p>
<h4>Perl</h4>
<p>Since about version 5.8.0, Perl has known the character encoding of the strings that it uses, and the encoding of file streams. Web applications using the CGI interface will send their output to STDOUT, so we need to label the encoding of STDOUT to be the same as our internal encoding so that Perl doesn&#8217;t transcode. Somewhere near the top of the program, before any output is produced, do:</p>
<p><code>binmode STDOUT, ':utf8';</code></p>
<p>If you&#8217;re using the CGI module, you will need to specify the encoding in the HTTP headers that go to the browser, because the default is latin1:</p>
<p><code>print header(-type => 'text/html', -charset => 'utf-8');</code></p>
<p>Note the difference between the labelling of the stream, <b>utf8</b>, and the labelling of the web content, <b>utf-8</b>. <em>Grrrrr</em></p>
]]></content:encoded>
			<wfw:commentRss>http://hisdeedsaredust.com/2009/02/my-perl-and-mysql-utf-8-crib/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Converting a MySQL version 3.23 database to version 5.0</title>
		<link>http://hisdeedsaredust.com/2009/02/mysql-version-3-to-version-5/</link>
		<comments>http://hisdeedsaredust.com/2009/02/mysql-version-3-to-version-5/#comments</comments>
		<pubDate>Thu, 05 Feb 2009 13:03:51 +0000</pubDate>
		<dc:creator>Paul Flo Williams</dc:creator>
				<category><![CDATA[MySQL]]></category>
		<category><![CDATA[database]]></category>

		<guid isPermaLink="false">http://hisdeedsaredust.com/?p=11</guid>
		<description><![CDATA[You&#8217;d think there might be a one-step approach to this, as it would be a common need, but because we&#8217;re going from a database version that didn&#8217;t know about character encodings at all, its contents could be in any old encoding. These steps assume that the database was used to store Latin1 (ISO&#160;8859-1) characters before. [...]]]></description>
			<content:encoded><![CDATA[<p>You&#8217;d think there might be a one-step approach to this, as it would be a common need, but because we&#8217;re going from a database version that didn&#8217;t know about character encodings at all, its contents could be in any old encoding. These steps assume that the database was used to store Latin1 (ISO&nbsp;8859-1) characters before. In the examples below, we&#8217;re going to convert a database called &#8220;libris2&#8221; (because that&#8217;s a real one I had to convert &ndash; these steps are tried and tested!).</p>
<h4>1. Dump the old database</h4>
<p>The standard way of dumping a database from MySQL is</p>
<pre>mysqldump --opt libris2 &gt; libris2.sql</pre>
<p>That dumps all the tables from a database and all the data in a form designed for efficient insertion into a new database, with very long INSERT statements. However, I want to be able to check the conversions that I&#8217;m about to do, so I&#8217;ll dump the database slightly differently:</p>
<pre>mysqldump --add-drop-table --add-locks --all \
          --quick --lock-tables libris2 &gt; libris2.sql</pre>
<p>Here, I&#8217;ve used all the options that &#8220;--opt&#8221; would give me, with the exception of &#8220;--extended-insert&#8221;. Now, I have the complete database in a text file, <tt>libris2.sql</tt>.</p>
<p><strong>Note</strong>: Perform this dump on the machine running the old MySQL server, so that you get an old (matching) version of <tt>mysqldump</tt>. If <tt>mysqldump</tt> doesn&#8217;t match the server version, you may not get any output at all. I have been in the situation where my database was automatically migrated from an old server to a newer version, and convincing <tt>mysqldump</tt> to ignore character encoding details in the dump on newer versions is a pain. Migrate your data while you still have the old server running.</p>
<h4>2. Convert the character encodings</h4>
<p>If I&#8217;m sure of the encoding of data in the database, I can do this:</p>
<pre>iconv -f latin1 -t utf8 &lt; libris2.sql &gt; libris2-utf8.sql</pre>
<p>Another way of doing this is to edit the file in <tt>vim</tt>, which will automatically convert the character encoding of a file into one appropriate for the shell it&#8217;s running in. For example, if I do</p>
<pre>vim libris2.sql</pre>
<p><tt>vim</tt> will open the file and display a status line saying:</p>
<pre>"libris2.sql" [converted] 234L, 733981C</pre>
<p>showing that the encoding of the file wasn&#8217;t already UTF-8. If I type the command &#8220;:set fileencoding&#8221; (or &#8220;:set fenc&#8221;), <tt>vim</tt> will report</p>
<pre>fileencoding=latin1</pre>
<p>If I wanted to, I could get <tt>vim</tt> to reencode the file for me:</p>
<pre>:set fenc=utf8
:w libris2-utf8.sql
:q
</pre>
<p>Although using <tt>vim</tt> takes a bit longer because it&#8217;s interactive, <tt>vim</tt> will at least make a good guess at the original encoding, in case I couldn&#8217;t remember.</p>
<p>At this point, I like to see what conversions have been made, which is why I didn&#8217;t want super-long lines in the database dump:</p>
<pre>diff libris2.sql libris2-utf8.sql | more</pre>
<p>As my shell uses UTF-8, some characters in the original file will appear as junk, but they should appear correctly in the output file. If I&#8217;ve made any mistakes in guessing the original encoding when I used <tt>iconv</tt>, I can have another go.</p>
<h4>3. Update SQL definitions</h4>
<p>For some reason, <tt>mysqldump</tt> may produce table definitions that are invalid in later versions of MySQL if you are using auto-incrementing key fields. If <tt>mysqldump</tt> produces a table definition that looks like this:</p>
<pre>CREATE TABLE DIRECTORS (
   DIRECTORS.id int(11) DEFAULT '0' NOT NULL auto_increment,
   PRIMARY KEY (id)
);
</pre>
<p>then the default clause will need to be filtered out before importing into MySQL 5.0. I do this for all table definitions in one go with a simple <tt>sed</tt> script (here <tt>database.sql</tt> stands for whichever dump file you are filtering):</p>
<pre>sed -e "/auto_increment/ s/DEFAULT '0'//" database.sql &gt; database-filt.sql</pre>
<h4>4. Import the database dump</h4>
<p>Now we have a database dump with the correct encoding, so it only remains to import it into MySQL 5.0 correctly. If we were re-importing the dump into the same version of MySQL, that would be as simple as:</p>
<pre>mysql libris2 &lt; libris2.sql</pre>
<p>but we need to make sure that the default encoding of the new database is correct, and we need to set the encoding used for our connection to the database. To do that, I&#8217;m going to create a small file containing some extra commands that I&#8217;ll place in front of the commands in the database dump. Create a file called <tt>libris2-header</tt> containing these lines:</p>
<pre>set names utf8;
drop database if exists libris2;
create database libris2 character set utf8;
use libris2;
</pre>
<p>The beauty of putting these lines in a separate file, rather than adding them to the top of the database dump, is that I can re-use them if I need to dump the database from the old server again.</p>
<p>Now, I use my little header file and the converted database dump to create a new database on my MySQL 5.0 server:</p>
<pre>cat libris2-header libris2-utf8.sql | mysql
</pre>
<p>Note that I haven&#8217;t specified the database name as an option to <tt>mysql</tt>: that&#8217;s because it doesn&#8217;t exist yet.</p>
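<p>Once the import has finished, a quick sanity check is worthwhile. What follows is only a sketch: <code>SHOW CREATE DATABASE</code>, <code>LENGTH</code> and <code>CHAR_LENGTH</code> are standard MySQL, but the table <tt>authors</tt> and its column <tt>name</tt> are made-up names, so substitute a table of your own that holds non-ASCII text.</p>
<pre>set names utf8;
-- The output should mention: DEFAULT CHARACTER SET utf8
show create database libris2;
use libris2;
-- Hypothetical table and column names.
-- Rows holding multi-byte characters have more bytes than characters.
select name, length(name) as bytes, char_length(name) as chars
  from authors
 where length(name) != char_length(name)
 limit 10;
</pre>
<p>The rows returned are the ones containing characters beyond ASCII, and in a UTF-8 shell they should display correctly. If they appear as junk, the likeliest culprits are the encoding guessed in step 2 or the connection encoding.</p>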
]]></content:encoded>
			<wfw:commentRss>http://hisdeedsaredust.com/2009/02/mysql-version-3-to-version-5/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Adding history to a database</title>
		<link>http://hisdeedsaredust.com/2009/02/adding-history-to-a-database/</link>
		<comments>http://hisdeedsaredust.com/2009/02/adding-history-to-a-database/#comments</comments>
		<pubDate>Thu, 05 Feb 2009 12:12:08 +0000</pubDate>
		<dc:creator>Paul Flo Williams</dc:creator>
				<category><![CDATA[Manx]]></category>
		<category><![CDATA[database]]></category>

		<guid isPermaLink="false">http://hisdeedsaredust.com/?p=8</guid>
		<description><![CDATA[I&#8217;ve been wondering how to let other people collaborate on a database without it turning to crap. You see, I&#8217;ve been updating Manx, a catalogue of old computer manuals, for a few years now by myself. Manx lists the manuals produced by a bunch of old computer companies, and records scanned copies that have been [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve been wondering how to let other people collaborate on a database without it turning to crap. You see, I&#8217;ve been updating <a href="http://vt100.net/manx/">Manx</a>, a catalogue of old computer manuals, for a few years now by myself.</p>
<p>Manx lists the manuals produced by a bunch of old computer companies, and records scanned copies that have been put online. On the surface, the database is very simple. The records of each publication can be objectively correct; if you have the manual in front of you and the title, part number and publication date match the database, your work is done. However, Manx attempts to catalogue manuals that we don&#8217;t yet have copies of. These entries have come from documentation indexes, and are likely partial. Entries pulled from other databases or sources online are also likely to be partial, or contain errors.</p>
<p>At the moment, the database doesn&#8217;t store the history of individual records, as the assumption is that each record will become more correct over time, and there is no point recording how poor each entry used to be. However, collaboration changes that requirement. Even the best-intentioned of contributors will make mistakes, and I now need a way of finding these and rolling them back.</p>
<p>I have been looking at different approaches to dealing with history in databases, and it is obvious that the plan of attack depends on <em>why</em> you need to store history. The term for this need seems to be &#8220;<a href="http://en.wikipedia.org/wiki/Slowly_changing_dimension">slowly changing dimensions</a>&#8221;. My application is rather like Amazon&#8217;s book catalogue. They accept corrections, which go into the catalogue after a human reviewer has taken a look.</p>
<p>The approach I&#8217;m going to take is rather like the &#8220;Type 4&#8221; methodology mentioned in the Wikipedia article, which happens to be the approach used for tracking changes to Wikipedia articles: <strong>history tables</strong>.</p>
<p>The current PUB table that keeps the details of each publication will be split into PUB and PUBHISTORY. PUBHISTORY will record every version of every publication record, along with some new details about when the change was made, and by whom. PUB will now store an index into PUBHISTORY for the current version of each publication record, to speed up searching. Once a row goes into PUBHISTORY, it will never be modified. Even deletion of a publication (because it was created by mistake) will be recorded at PUB level, rather than by deleting history.</p>
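<p>As a rough sketch of how that split might look in MySQL 5.0: only the table names PUB and PUBHISTORY come from the plan above, while every column name, type and sample value here is an assumption made for illustration, not the real Manx schema.</p>
<pre>-- Hypothetical columns throughout.
CREATE TABLE PUBHISTORY (
   ph_id int NOT NULL auto_increment,   -- one row per version, never modified
   ph_pub int NOT NULL,                 -- the publication this version describes
   ph_title varchar(255) NOT NULL,
   ph_part varchar(50),
   ph_pub_date varchar(10),
   ph_edited_by varchar(50) NOT NULL,   -- who made the change
   ph_edited_at datetime NOT NULL,      -- when the change was made
   PRIMARY KEY (ph_id),
   KEY (ph_pub)
);

CREATE TABLE PUB (
   pub_id int NOT NULL auto_increment,
   pub_history int NOT NULL,            -- current version: a PUBHISTORY.ph_id
   pub_deleted tinyint NOT NULL DEFAULT 0,  -- deletion is recorded here
   PRIMARY KEY (pub_id)
);

-- An edit adds a new version, then repoints PUB at it.
INSERT INTO PUBHISTORY
   (ph_pub, ph_title, ph_part, ph_pub_date, ph_edited_by, ph_edited_at)
VALUES (42, 'Corrected title', 'PART-0001', '1978-08', 'someuser', NOW());
UPDATE PUB SET pub_history = LAST_INSERT_ID() WHERE pub_id = 42;
</pre>
<p>With a layout like this, rolling back a bad edit could be a matter of pointing <tt>pub_history</tt> at an earlier row, or of inserting a fresh row that copies the earlier version so that the rollback itself is recorded.</p>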
]]></content:encoded>
			<wfw:commentRss>http://hisdeedsaredust.com/2009/02/adding-history-to-a-database/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

