I have an old manual that I’d like to convert to PDF. Most of the documents I scan are just black and white. However, this one has a lot of pictures with blue highlights and table backgrounds in the same blue, and I’d like to preserve that limited use of colour without keeping every page in full colour. It strikes me that I should be able to do that by separating out a layer for each colour used, and overlaying them to make a neat PDF.

To do this, I’m going to experiment with GraphicsMagick to produce the colour separations and the Perl module PDF::Builder to make the final PDF. Eric Smith’s tumble can also be used to construct a PDF with overlays.

Assessing the pages

Firstly, let’s take a look at the pages of the manual in question. It has 142 pages, but there appear to be just a few different page types that we may have to treat differently.

Here are the different page types I can see:

[Images: section heading; text page with image; text page with table]

The first one is a section heading. This one will be dealt with differently than the others, because I’ll just produce a black image by straightforward thresholding, and make the background blue using PDF::Builder.

The second page, an image with blue highlights, is the most important case, because the image wouldn’t make sense without colour.

The third page is a text page with a blue background to the table. It wouldn’t be the worst crime in the world to simply threshold this page, dropping the background colour, but we are aiming to handle this too, as it provides the real test for whether the final composited page will be legible. An alternative approach to this kind of page, seen in many scanned manuals, is to treat it as greyscale and dither the table background, producing a 1-bit-per-pixel TIFF image. However, this is much worse for legibility and compression than simply dropping the colour altogether.

(Incidentally, if you think that the blue of the first image is slightly different from the others, you’d be right. Our treatment of the images will take care of subtle differences.)

Scanning the original document

I’ve scanned the original document in 24-bit colour at 600 dpi and saved each page as a separate lossless TIFF file. The only other options for colour scanning on my Epson WorkForce-series scanner are to save to PDF, giving me no control over the output, or JPEG, which is a lossy format. My suspicion is that PDF output will actually embed a JPEG.

The resultant image files are huge. I haven’t investigated what support the TIFF format has for compressing colour files, but the scanner certainly doesn’t attempt any compression, giving me files that are nearly 100 MiB per page! The entire 142-page manual takes up 13 GiB on disk: I’m hoping to produce a PDF that is a thousandth of this size.
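As a rough sanity check on those numbers, here is a minimal sketch of the arithmetic in shell. The page dimensions are an assumption: I've used an A4-sized scan area, which is about 4960 × 7016 pixels at 600 dpi; the real pixel dimensions of the scans may differ.

```shell
# Assumption: an A4 scan area (210 x 297 mm) at 600 dpi is about 4960 x 7016 pixels.
width_px=4960
height_px=7016
bytes_per_pixel=3   # 24-bit colour, uncompressed
pages=142

page_bytes=$(( width_px * height_px * bytes_per_pixel ))
total_bytes=$(( page_bytes * pages ))

# Integer division, so the results are rounded down.
echo "per page:       $(( page_bytes / 1048576 )) MiB"
echo "whole manual:   $(( total_bytes / 1073741824 )) GiB"
echo "one thousandth: $(( total_bytes / 1000 / 1048576 )) MiB"
```

Under that assumption, an uncompressed page comes to just under 100 MiB and the whole manual to a little over 13 GiB, so the target of a thousandth is around 14 MiB.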

My scanner has additional options for the type of item being scanned: “text,” “text and images,” or “photos.” There doesn’t seem to be any information available on what the scanner does differently for each option. It is apparent that the scanner does a quick first pass for photographic items, probably adjusting colour balance, which it doesn’t do for the other two. From examination of scans, colours are somewhat more smudged on the first two settings; this works in our favour, so I always pick “text and images” mode.

How would colour separations work?

My standard technique for preserving a black and white document is to save every page in TIFF Group 4 format, one of the fax standard compression methods. This is lossless and produces very good results for text. So, my starting point was thinking that all the separations I produced for each page (black and blue in this case) would be compressed the same way and “masked” onto the page, so that the “off” pixels of each layer would become transparent, and the “on” pixels would be rendered in the appropriate colour.

However, Group 4 encoding can only handle black and white images, not white and some other colour. An initial reading of the specification says that Group 4 is for “bilevel” images, which I took to mean two colours, but then it becomes clear that only black and white can be represented, and PDF readers won’t apply a foreground colour to them on the page.

Still, never mind. The black layers of each page can use this compression method, and I’ll use PNG images for the other layers, which use Deflate compression. In order to use PDF’s masking feature, to make white pixels render transparently, I have to produce indexed (paletted) PNGs, which GraphicsMagick will quite happily do if you prefix your output filename with PNG8:.

To summarise the image production, then, I’m looking to take each colour TIFF scan and produce two files:

  1. Take the black pixels from each page and produce a Group 4-compressed TIFF.
  2. Take the blue pixels from each page and produce an indexed PNG.

Each page of the PDF will be produced by rendering the blue PNG first, followed by the black TIFF, masked to allow blue pixels to show through.

Why am I bothering to produce an indexed PNG for the blue layer, when it will be rendered first, with nothing below to show through? The answer is simply that this is a generic technique that will work for multiple colour separations. This document also contains some small areas of grey highlighting that I could process using another overlay but, as it happens, I’ll handle them differently.

Black layer

The general technique for producing each layer is going to be:

  1. Change the colours you don’t want to capture to white.
  2. Change the colour you do want to black.
  3. Threshold.
  4. Change the colour you want back to its original colour.

In this case, the blue pixels are light enough that simply thresholding the page at 50% will cause them to threshold to white, so step 1 is redundant.

Higher threshold percentages cause more pixels to turn black, so your thresholding value will be determined by the quality of the text in your original document. More lightly-printed material will need a higher threshold to thicken up letter stems. Photocopied documentation may also need a higher threshold but with an additional despeckling step. Experimentation is necessary, but at least the same threshold value will apply across the entire source document.

Varying threshold values from 40% to 80% – clearly 80% is too high
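To make that experimentation less tedious, a small loop can write one file per threshold value for comparison by eye. This is only a sketch: the function name, the output naming scheme and the `GM` override variable are my own inventions for illustration, not part of GraphicsMagick.

```shell
# GM is a hypothetical override for the GraphicsMagick binary; it defaults to "gm".

# Write one Group 4 TIFF per threshold percentage, named after the source file.
# Usage: threshold_sweep scan.tiff 40 50 60 70 80
threshold_sweep() {
    src=$1
    shift
    base=${src%.*}   # strip the file extension
    for pct in "$@"; do
        "${GM:-gm}" convert "$src" -threshold "${pct}%" \
            -compress Group4 "${base}-t${pct}.tiff"
    done
}
```

Calling `threshold_sweep scan.tiff 40 50 60 70 80` should leave scan-t40.tiff to scan-t80.tiff alongside the original, ready to be flicked through in an image viewer.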

For this document, each black layer can be produced like this:

gm convert scan.tiff -threshold 50% -compress Group4 black-layer.tiff

Blue layer

For the colour layers, we need to drop out as much black (and other colours) as possible, meaning converting them to white, and then we turn our colour to black for thresholding.

To convert all pixels from one colour to another, you use GM’s -opaque option:

gm convert source.tiff -fill green1 -opaque black destination.tiff

This converts all black pixels in the image to a particular shade of green. But this isn’t going to work by itself: there is some natural variation in colours in the scan, so only the pixels that are exactly black would change.

We use the -fuzz option to capture colours close to the original colour:

gm convert source.tiff -fuzz 70% -fill green1 -opaque black destination.tiff

You must specify -fuzz and -fill before -opaque, as the operation occurs as soon as -opaque is parsed, using the current values of fuzz and fill. Putting the steps above together, we produce the blue layer like this:

gm convert source.tiff \
    -fuzz 99% -fill white -opaque black \
    -fuzz 10% -fill black -opaque '#b2cdd9' \
    -threshold 20% \
    -fill '#b2cdd9' -opaque black destination.png

Here we see the step-by-step production of the blue layer, starting with the source image on the left (1) and ending with the blue overlay (5). The last image (6) shows the blue and black layers composited together as they will appear in the final PDF. So we have moved from a true colour source image to one in which every pixel is just one of three colours: white, black or blue.

Note that the thresholding step produces an image with just two colours. Saving this image as a PNG will automatically produce a paletted image with just two colours, rather than a 24-bit (true colour) image.
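The two commands above can be wrapped up so that every page is processed the same way. The following is a sketch rather than the script I actually use: the function names, the output file names and the `GM` override variable are all hypothetical, and the threshold, fuzz and colour values are simply the ones that suit this particular manual.

```shell
# GM is a hypothetical override for the GraphicsMagick binary; it defaults to "gm".

# Black layer: threshold, then save as a Group 4 TIFF.
# Usage: black_layer scan.tiff 50% black-layer.tiff
black_layer() {
    "${GM:-gm}" convert "$1" -threshold "$2" -compress Group4 "$3"
}

# Colour layer: the same four steps as the blue layer command,
# with the colour and its fuzz passed in as arguments.
# Usage: colour_layer scan.tiff '#b2cdd9' 10% blue-layer.png
colour_layer() {
    "${GM:-gm}" convert "$1" \
        -fuzz 99% -fill white -opaque black \
        -fuzz "$3" -fill black -opaque "$2" \
        -threshold 20% \
        -fill "$2" -opaque black "$4"
}

# Produce both layers for every scan passed as an argument.
# Usage: separate_all page-*.tiff
separate_all() {
    for scan in "$@"; do
        base=${scan%.*}   # strip the file extension
        black_layer "$scan" 50% "${base}-black.tiff"
        colour_layer "$scan" '#b2cdd9' 10% "${base}-blue.png"
    done
}
```

Keeping the colour and fuzz as arguments means a further separation, such as the grey highlighting, would only need one more `colour_layer` call with a different colour.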

The step that I am missing automation for is choosing the hex value of the colour that I wish to extract. Given that fuzz is a percentage of possible distance in RGB space, picking a value too far from the “mean blue” will result in you either needing to increase fuzz or risk leaving white holes in your coloured image. At the moment, I pick the colour by choosing an image with a large area of blue, pulling it into GIMP and successively resampling down until I’ve got a tiny image on which I use the colour dropper tool. There may be a way of colour “clustering” in GraphicsMagick that I’m currently missing; I’d love to hear about it.
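One possible way to script the manual resampling, which I have not tested across the whole document, is to crop a region that is known to be solid blue and scale it down to a single pixel, printing the result as text. This assumes that your GraphicsMagick build can write the txt: format; the function name, the `GM` override variable and the crop coordinates in the usage line are hypothetical. It gives a simple average, not true colour clustering.

```shell
# GM is a hypothetical override for the GraphicsMagick binary; it defaults to "gm".

# Average a rectangular region down to one pixel and print it as text.
# region is a geometry of the form WIDTHxHEIGHT+X+Y.
# Usage (made-up coordinates): mean_colour scan.tiff 400x400+1200+2600
mean_colour() {
    "${GM:-gm}" convert "$1" -crop "$2" -scale '1x1!' -depth 8 txt:-
}
```

The output should be a single line describing the one remaining pixel, including its hex value, which can then be used as the argument to -opaque.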

Final results

The PDF version of this manual, Installing and Using the LA75 Companion Printer, is just 11 MiB.

Over the next few weeks, I intend to cover more topics along the same lines, showing Perl scripts for driving the GraphicsMagick conversions, adding cropping and deskewing of scans and the production of the PDF with PDF::Builder.