Working around Kindlegen quirks with document transformations

By Paul Flo Williams

13 Jun 2012

While working on the Kindle (Mobi) version of Antigua and the Antiguans, I found that I was spending too much time attempting to carefully craft some XHTML markup and a stylesheet that would get the final book to look the way I expected it to. Because the Mobi format is only HTML 3.2 with a few extra elements, and no stylesheet support at all, Kindlegen has to do a lot of wizardry to down-convert any EPUB (XHTML + CSS) you are starting with, and the results don’t appear to be consistent across an entire book.

It was while attempting to mark up the genealogical information in the appendices of the book that Kindlegen really came unstuck, and I had to change tack completely. My markup in the EPUB looked like this:

<p class="offspring">Thomas Howard, 2nd Duke of Norfolk...</p>
<p class="offspring">Elizabeth, m. Thomas Boleyne, Viscount...</p>
<div class="generation">
  <p class="offspring">1. George Boleyne, Viscount Rochford,....</p>
  <p class="offspring">2. Anne, youngest ... left issue,</p>
  <div class="generation">
    <p class="offspring">Elizabeth, Queen of England</p>
  </div>
  <p class="offspring">3. Mary, eldest dau. of Thomas,...</p>
</div>

This struck me as sensible, semantic markup, with each successive generation being further indented, as in the original print book. I wished it to look as below, but Kindlegen resolutely refused to play ball:

It struck me that I could assign an individual indent to each of the paragraphs, and get the hanging indent to work with an appropriate number of non-breaking spaces at beginning of each paragraph, but why should I compromise the clean markup, or maintain separate source documents for EPUB and Mobi?

The strategy

The way I’ve decided to approach this is to write a script that will take my chosen markup vocabulary (XHTML elements with classes of my choosing) and transform it into Mobi’s HTML 3.2 + the few mbp-namespace elements, such that the resulting document passes through Kindlegen without any further changes taking place.

Pros

I still use Amazon’s chosen packaging program. (This is a big one, as Amazon seem to have recently rejected some books built with Calibre.)
The transformations performed by the script can handle things that Kindlegen won’t even touch, such as small caps support.
Predictability of output, without having to put odd workarounds for Kindlegen failings in the source document. (Want a top margin and a left margin without putting a <div> inside a <div>? Easy.)

Cons

This approach won’t work for KF8 documents with new features. That doesn’t matter for me, as I’m producing text-heavy books, but anyone wanting to take advantage of KF8 features would have to run Kindlegen on their KF8 source, and then insert the output from a script like mine in place of the mangled Mobi that Kindlegen had produced. (In effect, you’d need the hypothetical mobipack.py to match the existing mobiunpack.py, or try Peter Hatch’s ingenious hack
This is definitely in the “some assembly required” category. It works for my vocabulary for my documents. The transformations from the input vocabulary to the mobi output may vary somewhat depending on the book, but I expect the vocabulary to remain consistent over a long period.
I have to perform any transformations that Kindlegen would normally handle well, such as centring elements.

OK, what do I actually do?

Enough of the theory. What transformations do I perform on documents? Here is a selection of recipes that you may want to try. I’ll use the CSS element.class notation.

Remove any existing <link rel="stylesheet" ... /> elements. We don’t want Kindlegen to have any CSS information to allow it to further affect our document.
Delete all comments as they are pointless and have provoked Kindlegen bugs.
Process all children of span.sc to make a small caps effect, by inserting <font size="-2"> around lowercase characters.
Move all caption elements into centred paragraphs, immediately before their table.
Kindlegen doesn’t handle p elements inside blockquote, so just remove the blockquote and indent all paragraphs appropriately.
Convert ZWSP characters to ZWNJ to support line wrapping at em dashes without unduly spacing them out.
Insert page breaks before div.chapter
Produce nicely-indented lines of poetry.

Implementation

My understanding of XSLT is not good enough to implement all of my recipes in the fashion that I think would be neatest, so I use a Perl script and the XML::LibXML module. It lets me use XPath expressions to quickly select nodes of interest, and then add and replace nodes at will.