After years of preparing ebooks in HTML and growing frustrated with a morass of divs and spans with classes, I’ve decided to experiment with preparing texts in the vocabulary of the Text Encoding Initiative (TEI). Conversion to XHTML for web, EPUB and Kindle formats is handled by scripts; these are Perl for now, and may become XSLT later.
As I’m preparing books from OCRed scans, I’d like to keep my marked-up text as close as possible to the original layout of the printed book, because it helps me spot errors. I’ve recently made two major leaps forward that allow me to work through and correct text a lot faster.
The first is to keep all end-of-line hyphens intact, without even marking them as “hard” or “soft”. The TEI-to-XHTML script decides whether to remove or keep each hyphen by consulting a spell checker. I’m using the Perl module Text::Hunspell, which can load not only multiple dictionaries (essential when recent works contain words in English, French, German, Latin and Hindi) but also a book-specific dictionary of proper names and unusual or archaic words.
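As a rough illustration of the idea (in Python rather than Perl, and not the actual script), the sketch below rejoins words split across a line break. It assumes a simple rule: if the two halves joined without a hyphen form a known word, the hyphen is dropped; otherwise it is kept. The plain set KNOWN_WORDS is a hypothetical stand-in for the spell checker and its dictionaries, and the function names are made up for the example.

```python
import re

# Hypothetical stand-in for the spell checker: a plain set of known words.
# A real script would query Hunspell dictionaries plus a book-specific list.
KNOWN_WORDS = {"a", "wonderful", "story", "of", "the", "well", "known", "author"}


def is_word(word, known=KNOWN_WORDS):
    """Return True if the word is in the (stand-in) dictionary."""
    return word.lower() in known


def resolve_line_end_hyphens(text, known=KNOWN_WORDS):
    """Rejoin words that were split by a hyphen at the end of a line.

    Assumed rule: drop the hyphen when the joined form is a known word,
    keep it otherwise. The line break after the hyphen is removed either way.
    """
    def resolve(match):
        head, tail = match.group(1), match.group(2)
        if is_word(head + tail, known):
            return head + tail          # soft hyphen: an artefact of line breaking
        return head + "-" + tail        # hard hyphen: part of the word itself

    # Letters, a hyphen, a newline, then more letters.
    return re.sub(r"([^\W\d_]+)-\n([^\W\d_]+)", resolve, text)


if __name__ == "__main__":
    sample = "a wonder-\nful story of the well-\nknown author"
    print(resolve_line_end_hyphens(sample))
```

Here “wonderful” is in the word list, so its hyphen is dropped, while “wellknown” is not, so “well-known” keeps its hyphen. Words the dictionary does not know at all would also keep their hyphen under this rule, which is where a book-specific dictionary earns its keep.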
The second speed-up concerns quotation marks. Most quotation marks are removed from the text entirely and replaced by one of the TEI elements <q> or <soCalled>. The quote marks that remain are all apostrophes, and they are kept as the ASCII single quote character, because the script can then change them unambiguously to the Unicode right single quote U+2019. The script also generates the quote marks for the <q> and <soCalled> elements (doubles and singles, nested as required).
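A minimal sketch of that rendering step, again in Python and not the actual script, could look like the following. It assumes that the outermost quotation gets double quote marks and that each nested level alternates between doubles and singles; it also assumes the markup carries no XML namespace, which a real TEI document normally would. The names render and curl_apostrophes are made up for the example.

```python
import xml.etree.ElementTree as ET

# Elements whose content should be wrapped in generated quote marks.
QUOTED = {"q", "soCalled"}

# Assumed convention: doubles outermost, singles inside, alternating by depth.
PAIRS = [("\u201c", "\u201d"), ("\u2018", "\u2019")]


def curl_apostrophes(text):
    """Turn every remaining ASCII single quote into U+2019.

    This is safe only because real quotation marks have already been
    replaced by markup, so any ' left in the text is an apostrophe.
    """
    return (text or "").replace("'", "\u2019")


def render(elem, depth=0):
    """Flatten an element to text, adding quote marks around quoted elements."""
    quoted = elem.tag in QUOTED
    open_mark, close_mark = PAIRS[depth % 2] if quoted else ("", "")
    inner_depth = depth + 1 if quoted else depth

    parts = [open_mark, curl_apostrophes(elem.text)]
    for child in elem:
        parts.append(render(child, inner_depth))
        parts.append(curl_apostrophes(child.tail))  # text following the child
    parts.append(close_mark)
    return "".join(parts)


if __name__ == "__main__":
    source = "<p><q>I don't know what <q>soon</q> means,</q> she said.</p>"
    print(render(ET.fromstring(source)))
```

The nesting depth is counted only through quoted elements, so a <q> inside some unrelated inline element still gets the right kind of mark.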
There’s a lot more work to do, but I’ve put the results of some experiments online so I can test reading through them. So far, I’ve put up two of Charles E. Pearce’s works, A Star of the East and Dragged from the Dark!. The sources for those aren’t online yet, but I’ll put them up shortly.