Adding history to a database
By Paul Flo Williams
I’ve been wondering how to let other people collaborate on a database without it turning to crap. You see, I’ve been updating Manx, a catalogue of old computer manuals, for a few years now by myself.
Manx lists the manuals produced by a bunch of old computer companies, and records scanned copies that have been put online. On the surface, the database is very simple. The records of each publication can be objectively correct; if you have the manual in front of you and the title, part number and publication date match the database, your work is done. However, Manx attempts to catalogue manuals that we don’t yet have copies of. These entries have come from documentation indexes, and are likely partial. Entries pulled from other databases or sources online are also likely to be partial, or contain errors.
At the moment, the database doesn’t store the history of individual records, as the assumption is that each record will become more correct over time, and there is no point recording how poor each entry used to be. However, collaboration changes that requirement. Even the best intentioned of contributors will make mistakes, and I now need a way of finding these and rolling them back.
I have been looking at different approaches at dealing with history in databases, and it is obvious that the plan of attack depends on why you need to store history. The term for this need seems to be “slow changing dimensions”. My application is rather like Amazon’s book catalogue. They accept corrections, and they go into the catalogue after a human reviewer has taken a look.
The approach I’m going to take is rather like the “Type 4” methodology mentioned in the Wikipedia article, which happens to be the approach used for tracking changes to Wikipedia articles: history tables.
The current PUB table that keeps the details of each publication will be split into PUB and PUBHISTORY. PUBHISTORY will record every version of every publication record, along with some new details about when the change was made, and by whom. PUB will now store an index into PUBHISTORY for the current version of each publication record, to speed up searching. Once a row goes into PUBHISTORY, it will never be modified. Even deletion of a publication (because it was created by mistake) will be recorded at PUB level, rather than by deleting history.