I have an old manual that I’d like to convert to PDF. Most of the documents I
scan are just black and white. However, this one has a lot of pictures with blue
highlights, and table backgrounds in the same blue, and I’d like to preserve that
limited use of colour without keeping every page in full colour. It strikes me
that I should be able to do that by separating layers for each colour used, and
overlaying them to make a neat PDF.
To do this, I’m going to experiment with GraphicsMagick to produce the
colour separations and the Perl module PDF::Builder to make the
final PDF. Eric Smith’s tumble can also be used to construct a PDF.
Assessing the pages
Firstly, let’s take a look at the pages of the manual in question. It has 142
pages, but there appear to be just a few different page types that we may have
to treat differently.
Here are the different page types I can see:
The first one is a section heading. This one will be dealt with differently
than the others, because I’ll just produce a black image by straightforward
thresholding, and make the background blue using PDF::Builder.
The second page, an image with blue highlights, is the most important case,
because the image wouldn’t make sense without colour.
The third image is a text page with a blue background to the table. It wouldn’t
be the worst crime in the world to simply threshold this page, dropping the
background colour, but we are aiming to handle this too, as it provides the real
test for whether the final composited page will be legible. An alternative
approach to this kind of page, seen in many scanned manuals, is to process this
to a black and white image treating it as greyscale and dithering the table
background to make a 1-bit-per-pixel TIFF image. However, this is much worse for
legibility and compression than simply dropping the colour altogether.
(Incidentally, if you think that the blue of the first image is slightly
different from the others, you’d be right. Our treatment of the images will take
care of subtle differences.)
Scanning the original document
I’ve scanned the original document in 24-bit colour at 600 dpi and saved
the pages as single lossless TIFF files. The only other options for colour
scanning on my Epson WorkForce-series are to save to PDF, giving me no control
over the output, or JPEG, which is a lossy format. My suspicion is that PDF
output will actually embed a JPEG.
The resultant image files are huge. I haven’t investigated what support the TIFF
format has for compressing colour files, but the scanner certainly doesn’t
attempt any compression, giving me files that are nearly 100 MiB per page!
The entire 142-page manual takes up 13 GiB on disk: I’m hoping to produce a
PDF that is a thousandth of this size.
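As a rough check on those numbers, the arithmetic can be sketched in shell. The helper name scan_budget is purely illustrative; the page count and total size are the figures quoted above.

```shell
#!/bin/sh
# Rough size arithmetic for the scan set, using the figures from the text.
# scan_budget is a hypothetical helper name, not part of any tool.
scan_budget() {
    pages=$1        # number of scanned pages
    total_gib=$2    # size of all the scans on disk, in GiB
    awk -v p="$pages" -v g="$total_gib" 'BEGIN {
        printf "average page: %.1f MiB\n", g * 1024 / p
        printf "target PDF: %.1f MiB\n", g * 1024 / 1000
    }'
}

scan_budget 142 13
```

For 142 pages and 13 GiB this gives an average of about 93.7 MiB per page, and a target PDF size of about 13.3 MiB.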
My scanner has additional options for the type of item being scanned: “text,”
“text and images,” or “photos.” There doesn’t seem to be any information
available on what the scanner does differently for each option. It is apparent
that the scanner does a quick first pass for photographic items, probably
adjusting colour balance, which it doesn’t do for the other two. From
examination of scans, colours are somewhat more smudged on the first two
settings; this works in our favour, so I always pick “text and images” mode.
How would colour separations work?
My standard technique for preserving a black and white document is to save every
page in TIFF Group 4 format, one of the fax standard compression methods. This
is lossless and produces very good results for text. So, my starting point was
thinking that all the separations I produced for each page (black and blue in
this case) would be compressed the same way and “masked” onto the page, so that
the “off” pixels of each layer would become transparent, and the “on” pixels
would be rendered in the appropriate colour.
However, Group 4 encoding can only handle black and white images, not white and
some other colour. The specification describes Group 4 as being for “bilevel”
images, which I initially took to mean any two colours, but it becomes clear
that only black and white can be represented, and PDF readers won’t apply
a foreground colour to them on the page.
Still, never mind. The black layers of each page can use this compression
method, and I’ll use PNG images for the other layers, which use Deflate
compression. In order to use PDF’s masking feature, to make white pixels render
transparently, I have to produce indexed (paletted) PNGs, which GraphicsMagick
will quite happily do if you prefix your output filename with
To summarise the image production, then: I’m looking to take each colour TIFF
scan and produce two files:
- Take the black pixels from each page and produce a Group 4-compressed TIFF.
- Take the blue pixels from each page and produce an indexed PNG.
Each page of the PDF will be produced by rendering the blue PNG first, followed
by the black TIFF, masked to allow blue pixels to show through.
Why am I bothering to produce an indexed PNG for the blue layer, when it will
be rendered first, with nothing below to show through? The answer is simply that
this is a generic technique that will work for multiple colour separations. This
document also contains some small areas of grey highlighting that I could
process using another overlay but as it happens I’ll handle them differently.
The general technique for producing each layer is going to be:
1. Change the colours you don’t want to capture to white.
2. Change the colour you do want to black.
3. Change the colour you want back to its original colour.
In this case, the blue pixels are light enough that simply thresholding the page
at 50% will cause them to threshold to white, so step 1 is redundant.
Higher threshold percentages cause more pixels to turn black, so your
thresholding value will be determined by the quality of the text in your
original document. More lightly-printed material will need a higher threshold
to thicken up letter stems. Photocopied documentation may also need a higher
threshold but with an additional despeckling step. Experimentation is necessary,
but at least the same threshold value will apply across the entire source document.
Varying threshold values from 40% to 80% – clearly 80% is too high
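A comparison like this can be generated with a small loop. This is only a sketch: the function name threshold_sweep and the output filenames are my own invention, and the GM variable exists simply so that the commands can be printed (GM=echo) rather than run.

```shell
#!/bin/sh
# Produce one trial black layer per threshold value, for comparison by eye.
# threshold_sweep and the threshold-NN.tiff names are hypothetical.
# Set GM=echo to print the commands instead of running GraphicsMagick.
threshold_sweep() {
    src=$1
    for t in 40 50 60 70 80; do
        "${GM:-gm}" convert "$src" -threshold "${t}%" \
            -compress Group4 "threshold-${t}.tiff"
    done
}

# Example: threshold_sweep scan.tiff
```

Running it against a single representative page should be enough, since the same threshold is then used for the rest of the scans.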
For this document, each black layer can be produced like this:
gm convert scan.tiff -threshold 50% -compress Group4 black-layer.tiff
For the colour layers, we need to drop out as much black (and other colours) as
possible, meaning converting them to white, and then we turn our colour to black.
To convert all pixels from one colour to another, you use GM’s -opaque option, with -fill giving the replacement colour:
gm convert source.tiff -fill green1 -opaque black destination.tiff
This converts all black pixels in the image to a particular shade of green. But
this isn’t going to work by itself, as there is some natural variation in
colours in the scan: we’d just get this:
We use the -fuzz option to capture colours close to the original colour:
gm convert source.tiff -fuzz 70% -fill green1 -opaque black destination.tiff
You must specify -fuzz and -fill before -opaque, as the operation occurs
as soon as -opaque is parsed, using the current values of fuzz and fill.
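Because the option order matters, it can help to hide it inside a small wrapper so that it can’t be got wrong. A minimal sketch follows; recolour is a hypothetical name of mine, and setting GM=echo prints the command instead of running it.

```shell
#!/bin/sh
# recolour SRC FROM TO FUZZ DST
# Hypothetical wrapper that always emits -fuzz and -fill before -opaque,
# so the replacement uses the intended fuzz and fill values.
recolour() {
    "${GM:-gm}" convert "$1" -fuzz "$4" -fill "$3" -opaque "$2" "$5"
}

# Example: recolour source.tiff black green1 70% destination.tiff
```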
Putting the steps above together, we produce the blue layer like this:
gm convert source.tiff \
-fuzz 99% -fill white -opaque black \
-fuzz 10% -fill black -opaque '#b2cdd9' \
-threshold 20% \
-fill '#b2cdd9' -opaque black destination.png
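Since the technique is meant to be generic, the same command can be wrapped up with the colour as a parameter. This is a sketch rather than a finished script: colour_layer and the page-*.tiff naming are assumptions of mine, and the fuzz and threshold values are simply the ones that suit this document.

```shell
#!/bin/sh
# colour_layer SRC COLOUR DST
# Hypothetical wrapper around the blue-layer command, with the colour as a
# parameter. Set GM=echo to print the command instead of running it.
colour_layer() {
    src=$1; colour=$2; dst=$3
    "${GM:-gm}" convert "$src" \
        -fuzz 99% -fill white -opaque black \
        -fuzz 10% -fill black -opaque "$colour" \
        -threshold 20% \
        -fill "$colour" -opaque black "$dst"
}

# Example, assuming the scans are named page-001.tiff, page-002.tiff, ...:
# for page in page-*.tiff; do
#     colour_layer "$page" '#b2cdd9' "${page%.tiff}-blue.png"
# done
```

A second colour would then just be another call with a different hex value and output name, though the fuzz percentages would probably need retuning.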
Here we see the step-by-step production of the blue layer, starting with the
source image on the left (1), ending with the blue overlay (5). The
last image (6) shows the blue and black layers composited together as they
will appear in the final PDF. So we have moved from a true colour source image
to one in which every pixel is just one of three colours: white, black or blue.
Note that the thresholding step produces an image with just two colours. Saving
this image as a PNG will automatically produce a paletted image with just two
colours, rather than a 24-bit (true colour) image.
The step that I am missing automation for is choosing the hex value of the
colour that I wish to extract. Given that fuzz is a percentage of possible
distance in RGB space, picking a value too far from the “mean blue” will result
in you either needing to increase fuzz or risk leaving white holes in your
coloured image. At the moment, I’ve been picking the colour by choosing an image
with a large area of blue, pulling it into GIMP and successively resampling down
until I’ve got a tiny image on which I use the colour dropper tool. There may be
a way of colour “clustering” in GraphicsMagick that I’m currently missing; I’d
love to hear about it.
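In the meantime, the manual resampling trick could probably be scripted. The sketch below is untested guesswork on my part: it crops a region that you know to be solidly blue, shrinks it to a single pixel, and dumps that pixel as text. It assumes that your GraphicsMagick build can write the txt: format; mean_colour and the example geometry are hypothetical.

```shell
#!/bin/sh
# mean_colour SRC REGION
# Hypothetical helper: crop REGION (a WxH+X+Y geometry covering only the
# colour of interest), resize it to one pixel and print that pixel as text.
# Assumes the local gm build supports txt: output; set GM=echo to dry-run.
mean_colour() {
    "${GM:-gm}" convert "$1" -crop "$2" -resize '1x1!' txt:-
}

# Example: mean_colour page.tiff 600x600+1000+2000
```

The region still has to be chosen by hand, so this automates the colour dropper step rather than finding the colours for you.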
The PDF version of this manual, Installing and Using the LA75 Companion
Printer, is just 11 MiB.
Over the next few weeks, I intend to cover more topics along the same lines,
showing Perl scripts for driving the GraphicsMagick conversions, adding cropping
and deskewing of scans and the production of the PDF with PDF::Builder.