TEI with light markup

March 14th, 2013 Paul Flo Williams

After preparing ebooks for years with HTML and getting frustrated with a morass of divs and spans with classes, I’ve decided to experiment with preparing texts in the vocabulary of the Text Encoding Initiative. Scripts will take care of conversion to XHTML for web, EPUB and Kindle formats; they may become XSLT later, but for now they are written in Perl.

As I’m preparing books from OCRed scans, I’d like to keep my marked-up text as close as possible to the original layout of the printed book, because it helps me spot errors. I’ve recently made two major leaps forward that allow me to work through and correct text a lot faster.

The first is to keep all of the end-of-line hyphens intact, without even marking them as “hard” or “soft” hyphens. The TEI to XHTML script decides whether to remove or keep each hyphen by using a spell checker. I’m using the Perl module Text::Hunspell, which can use not only multiple dictionaries (essential when recent works contain words in English, French, German, Latin and Hindi), but also a book-specific dictionary containing proper names and unusual or archaic words.
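As a rough illustration of the hyphen decision, here is a minimal sketch in Python rather than the Perl the real script uses. It is not the author's code: the function name `join_hyphenated_lines` is made up for this example, and a plain set of words stands in for the Text::Hunspell lookup. The rule it assumes is simple: if the two halves joined without a hyphen form a known word, drop the hyphen; otherwise keep it.

```python
import re

def join_hyphenated_lines(lines, is_word):
    """Join lines into one string, resolving end-of-line hyphens.

    is_word is any callable returning True for a correctly spelled word;
    here it stands in for a real spell checker (an assumption of this sketch).
    """
    out = []
    carry = ""  # first half of a word split across a line break
    for line in lines:
        line = line.strip()
        if carry:
            # The first token of this line completes the carried half.
            head, _, rest = line.partition(" ")
            core = re.sub(r"\W+$", "", head)  # ignore trailing punctuation for lookup
            if is_word(carry + core):
                out.append(carry + head)        # soft hyphen: remove it
            else:
                out.append(carry + "-" + head)  # hard hyphen: keep it
            line = rest
            carry = ""
        if line.endswith("-"):
            body, _, last = line[:-1].rpartition(" ")
            if body:
                out.append(body)
            carry = last
        elif line:
            out.append(line)
    if carry:
        out.append(carry + "-")  # text ended on a hyphen; leave it alone
    return " ".join(out)

# Hypothetical word list standing in for the dictionaries.
known_words = {"professor", "well", "known"}
text = join_hyphenated_lines(
    ["The profes-", "sor made a well-", "known jump."],
    lambda w: w.lower() in known_words,
)
print(text)
```

With a real spell checker the callback would consult every loaded dictionary, including the book-specific one, before deciding that a joined form is not a word.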

The second speed-up concerns quotation marks. Most quotation marks are removed from the text entirely, and replaced by one of the TEI elements <q> or <soCalled>. The remaining quote marks are all apostrophes, and they are retained as the ASCII single quote character, because they can be unambiguously changed to the Unicode right single quote U+2019 by the script. The script then produces the quote marks for the <q> and <soCalled> elements (doubles and singles, nested as required).
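The quote handling can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the conversion script itself: the function name `render_quotes` is invented, the elements are assumed to carry no attributes, and the choice of doubles for the outer level and singles for the inner level is a guess at the house style.

```python
import re

# (opening, closing) marks by nesting level; outer doubles, inner singles.
# The order is an assumption of this sketch, not taken from the post.
QUOTES = [("\u201c", "\u201d"), ("\u2018", "\u2019")]

def render_quotes(tei_text):
    """Replace <q>/<soCalled> tags with nested quote marks and fix apostrophes."""
    depth = 0

    def replace_tag(match):
        nonlocal depth
        if match.group(1) == "/":
            depth -= 1
            return QUOTES[depth % 2][1]
        mark = QUOTES[depth % 2][0]
        depth += 1
        return mark

    # Every ASCII single quote left in the source is an apostrophe.
    text = tei_text.replace("'", "\u2019")
    # Only attribute-free tags are matched (a simplification).
    return re.sub(r"<(/?)(?:q|soCalled)>", replace_tag, text)

rendered = render_quotes(
    "<q>I can't see the <soCalled>professor</soCalled>,</q> he said."
)
print(rendered)
```

Because the apostrophes are converted before any quote marks are generated, the two kinds of mark never need to be told apart afterwards.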

There’s a lot more work to do, but I’ve put the results of some experiments online so I can test reading through them. So far, I’ve put up two of Charles E. Pearce’s works, A Star of the East and Dragged from the Dark!. The sources for those aren’t online yet, but I’ll put them up shortly.

Categories: Ebooks

A “Sensation” Diver in Jeopardy

November 16th, 2012 Paul Flo Williams

I’ve been having fun searching the British Newspaper Archive and came up with this snippet about the Brighton Chain Pier, originally from the Brighton Guardian, but reproduced in the Kentish Gazette of Tuesday 10 September 1867.

A “Sensation” Diver in Jeopardy.—A certain “Professor” Worthington, a young man who announces himself as a “sensation” and “star” diver, has recently been making “terrific plunges” from the head of the Brighton Chain Pier. He jumps from a height of between 100 and 130 feet, and, turning completely over as he falls, enters the water (professedly) head first. But those who have seen the “professor” aver that he does not always do so; and, of course, the shock of falling flat on the water from such a height would be of the most severe character. The correspondent of a contemporary alleges that on one occasion the “professor’s” face was swollen and discoloured from contact with the water,—no doubt, from the diver having entered the sea in a manner which did not give him the protection of his arms as a “cutwater.” On Saturday week he, oddly enough, advertised his “last sensation dive,” and an immense concourse of people assembled to witness it. The hour for the jump to be made was fixed for six, but it was three-quarters of an hour later before the “professor” was ready. It was then nearly dead low water of spring-tides, and it is said that because this was the case the “professor” determined to throw himself full length on the water. Either excessive ignorance or rashness must have dictated such a resolve, and the result was what might have been expected. Almost before he seemed lost to sight in the eddy of the plunge the diver appeared again floating “like a log” in the water, his head underneath. The boat which is always in attendance at once rowed to him, and on being pulled in he was found to be thoroughly insensible. He was treated as well as the circumstances would allow, and on being put upon the pier soon recovered sufficient strength to be able to walk with assistance. We have heard that his face was again injured. 
The water at the pier head, when his jump was made, was only about seven feet deep; but we cannot say whether “Professor” Worthington received his injuries from the concussion with the surface or from striking the bottom. In either case he could not apparently discern the evident danger.—Brighton Guardian

Categories: History

PANOSE in the wild

October 9th, 2012 Paul Flo Williams

I am considering working on the PANOSE font matching part of Fontmatrix because I enjoy playing with Fontmatrix, but its idea of how PANOSE’s individual facets[*] are named or work seems to me to be a bit wonky. For instance, it only understands the names for Latin Text facets, and uses them even for Latin Decorative or Pictorial fonts.

The first step (apart from trying to persuade my one-year-old son to go to sleep long enough for me to even turn on the computer) is to find out whether improved matching or re-classifying facilities would do any good at all, and for that I need to look at font classifications in the wild.

Turning to my Font Corpus database, I’ve extracted the following bare facts about PANOSE usage, and I’m quite buoyed up by the results.

From the 35420 fonts in the Corpus, I first get rid of fonts that have complete rubbish in the PANOSE field, which means discarding fonts with:

  1. Family Kind of “Any”(0), which means no attempt at all was made at classification. (13863 fonts).
  2. Facet values out of range. This is generally Weight, which for some reason, perhaps tool error, tends to have the values of 114 or 226. (409 fonts).
  3. Family Kind > 5. Family Kind values up here, which would be used for non-Latin fonts, aren’t formalised in any document I can find. (12 fonts).
  4. Weight of “Any”(0). Weight is the only facet that is present for all values of Family Kind, so it really ought to be set to something, even if “No Fit”(1) is the only appropriate value. (1540 fonts).
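The four discard rules above can be written down as a small filter. The following Python sketch is an assumption-laden illustration, not the query actually run against the Font Corpus: it treats a PANOSE number as a tuple of ten integers with Family Kind at index 0 and Weight at index 2, checks only Weight for range (with 11 assumed as the highest defined Weight value), and the name `rejection_reason` is invented. The counts in `discarded` are the ones quoted in the list.

```python
MAX_FAMILY_KIND = 5  # higher Family Kind values are rejected (rule 3)
MAX_WEIGHT = 11      # assumed highest defined Weight value

def rejection_reason(panose):
    """Return why a 10-number PANOSE tuple is discarded, or None to keep it."""
    family_kind, weight = panose[0], panose[2]
    if family_kind == 0:
        return "Family Kind is Any"
    if weight > MAX_WEIGHT:
        # Only Weight is range-checked here; other facets would need
        # their own per-family limits.
        return "facet out of range"
    if family_kind > MAX_FAMILY_KIND:
        return "Family Kind > 5"
    if weight == 0:
        return "Weight is Any"
    return None

# Counts quoted in the list above.
TOTAL_FONTS = 35420
discarded = {
    "Family Kind is Any": 13863,
    "facet out of range": 409,
    "Family Kind > 5": 12,
    "Weight is Any": 1540,
}
remaining = TOTAL_FONTS - sum(discarded.values())
share = round(100 * remaining / TOTAL_FONTS)
print(remaining, share)
```

The rules are applied in the same order as the list, so a font that breaks more than one rule is counted only under the first one it fails.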

Having cleared out the rubbish, we are left with 19596 fonts, 55% of the total collection. Of these, just over 90% are Latin Text fonts.

For the Latin Text fonts, a more complete classification means that more of the individual facets are set to a value other than “Any”(0). Even “No Fit”(1) gives us some information about the limitations of the classification system.

So, how many facets are set to non-zero in our remaining fonts?

No. of non-zero facets    No. of fonts
3                         349
4                         1718
5                         135
6                         1391
7                         6712
8                         226
9                         66
10                        8689

Some of these facets are derived from measured values, and some of them are picked by judgement, which may explain the somewhat uneven coverage of classifications. (And who’s going to classify nine of the facets without doing one extra to complete the job?)
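A tally like the one above is easy to reproduce. This Python sketch assumes each PANOSE number is a tuple of ten integers and that all ten positions, Family Kind included, count towards the total; the sample tuples are made up for illustration and are not taken from the Font Corpus.

```python
from collections import Counter

def non_zero_facets(panose):
    """Count the facets set to anything other than Any (0)."""
    return sum(1 for facet in panose if facet != 0)

def tally(fonts):
    """Map each non-zero facet count to the number of fonts that have it."""
    return Counter(non_zero_facets(panose) for panose in fonts)

# Made-up PANOSE numbers, for illustration only.
sample_fonts = [
    (2, 11, 6, 3, 5, 2, 4, 2, 2, 4),  # fully classified
    (2, 2, 5, 3, 4, 2, 2, 2, 2, 3),   # fully classified
    (2, 0, 5, 3, 0, 0, 0, 0, 0, 0),   # three facets set
    (2, 2, 5, 3, 0, 0, 0, 0, 0, 0),   # four facets set
]
histogram = tally(sample_fonts)
for count in sorted(histogram):
    print(count, histogram[count])
```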

I think the end result is that there are enough fonts with decent classification in the wild to make this something worth working on. Go to sleep, baby boy!

[*] I call the individual numbers of a PANOSE number “facets” to keep me from over-using the word “value”.

Categories: Fonts