In recent days I have experimented further with tesseract version 3.03 and, as a result, improved the PDF creation script that I use. Here are my thoughts:
Firstly, I decided to make use of the fact that tesseract can now accurately line up the majority of recognized words with the image of the word in the underlying scan. This was not the case in the past. The benefit of this improvement is that when you find a word in the text, you can see the part of the image which the tesseract OCR algorithm associated with that word. The downside is that sometimes the spacing between the words is not correct. With plain text output, I found far fewer mistakes where words are run together. However, in my view, the benefit of being able to see what was hit in the search outweighs the drawback of not being able to cut and paste the scanned text freely.
Secondly, I decided to drop the resolution of scanned color images. I use Group 4 compression for black and white images, which gives a page of around 100k in the final OCRd pdf. That works out to a pdf of about 20MB for around 200 pages, which is manageable. (100000*200/1000000=20, not accounting for different ideas about the size of a megabyte, etc.) Such files could be much smaller if one were to use jbig2, but jbig2 isn't yet widely supported by pdf viewers, so I decided to hold off on jbig2 for now.
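The size estimate above is easy to redo for other page counts. Here is a minimal sketch in shell; the function name estimate_mb is made up for illustration, and it assumes decimal megabytes (1 MB = 1,000,000 bytes), as in the arithmetic above.

```shell
#!/bin/sh
# estimate_mb is a hypothetical helper, not part of the script in this post.
# Usage: estimate_mb BYTES_PER_PAGE PAGES
# Prints the estimated pdf size in whole decimal megabytes.
estimate_mb() {
    bytes_per_page=$1
    pages=$2
    echo $(( bytes_per_page * pages / 1000000 ))
}

# Roughly 100k per Group 4 page, 200 pages.
estimate_mb 100000 200    # prints 20
```

Integer division rounds down, which is good enough for a rough estimate like this one.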
For color images, reducing the resolution does not seem to hurt too much, and the improvement in viewability for color images is dramatic.
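To get a feel for how much data the resolution drop removes, here is a rough sketch. It assumes the color pages were scanned at 600 dpi (my reading of the density settings in the script, not something measured here) and that they are shrunk to 25% in each dimension, as the script's color branch does. The function names are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only.

# scaled_dpi DPI PERCENT: resolution after shrinking each dimension to PERCENT.
scaled_dpi() {
    echo $(( $1 * $2 / 100 ))
}

# pixel_reduction PERCENT: how many times fewer pixels the shrunken page has.
# Both width and height shrink, so the pixel count falls with the square.
pixel_reduction() {
    echo $(( (100 * 100) / ($1 * $1) ))
}

scaled_dpi 600 25      # prints 150
pixel_reduction 25     # prints 16
```

So a color page shrunk this way carries one sixteenth of the pixels, which is where most of the size saving comes from.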
Here is the script...it assumes that you have tesseract installed correctly and TESSDATA_PREFIX set up correctly.
#!/bin/sh
# Convert the scanned pages (0*.tif) in this directory into one searchable pdf.
MAX=99999    # stop after this many pages
CURRENTDIR=`pwd | sed 's#/home/person/scans/##' | sed 's#/new_method##'`
NPAGES=`ls 0*.tif | wc | awk '{print $1}'`
i=0
for FILE in 0*.tif
do
    i=`expr $i + 1`
    d=`printf "%05d" $i`    # zero-padded page number
    echo "$d $CURRENTDIR $NPAGES"    # progress report
    tifftopnm "$FILE" > tmp.pnm
    TYPE=`pnmfile tmp.pnm | awk '{print $2}'`
    if [ "$TYPE" = "PPM" ]
    then
        # Color page: reduce to 256 colors, then shrink to 25% (150 dpi).
        pnmquant 256 tmp.pnm | pnmtotiff -lzw > tmp.tif
        convert tmp.tif -adaptive-resize 25% -density 150 new$d.tif
    else
        # Non-color page: Group 4 compression at full resolution (600 dpi).
        pnmtotiff -g4 tmp.pnm > tmp.tif
        convert tmp.tif -density 600 new$d.tif
    fi
    # OCR the page; tesseract writes newpage$d.pdf.
    tesseract new$d.tif newpage$d pdf
    rm new$d.tif
    if [ $i -eq $MAX ]
    then
        break
    fi
done
# Join the per-page pdfs and clean up.
pdftk newpage*.pdf cat output output.pdf
rm newpage0*.pdf
rm tmp.pnm tmp.tif
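Since the script leans on several external programs, it may be worth checking for them before starting a long run. Here is a minimal sketch of such a check; the function name missing_tools is made up, and the list of tools is simply what the script above calls.

```shell
#!/bin/sh
# missing_tools is a hypothetical helper, not part of the script above.
# It prints the name of every listed command that is not found on the PATH.
missing_tools() {
    for tool in "$@"
    do
        command -v "$tool" > /dev/null 2>&1 || echo "$tool"
    done
}

# The programs the conversion script calls.
missing_tools tifftopnm pnmfile pnmquant pnmtotiff convert tesseract pdftk

# The script also assumes TESSDATA_PREFIX is set.
[ -n "${TESSDATA_PREFIX:-}" ] || echo "TESSDATA_PREFIX is not set"
```

If this prints nothing, the tools were found and the variable is set. It does not check that tesseract itself is configured correctly, only that it can be found.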
And here is a comparison of the new script's results: Macaulay's Lord Clive (new script) versus the older version, Macaulay's Lord Clive (old script). The new version is 18MB and the old version is 23MB, and the new version looks nicer because it has color...!