Here is a script that I have used to create searchable PDFs on a number of occasions. The input is a set of sequentially numbered .tif files. The output is a searchable .pdf file which contains, in addition to the original .tif images, searchable ASCII text.
To keep the output .pdf file size down, large input images (over 1000000 bytes, typically pages with many colors) are reduced to just 4 shades of grey. It could be 50 shades of grey, and that might help provide a few more hits, but that is left as an exercise for the user.
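The script decides which pages to shrink by file size rather than by counting colors. A minimal sketch of that test, assuming a hypothetical helper name is_large and using wc -c instead of parsing ls output; the 1000000-byte threshold is the one the script uses:

```shell
#!/bin/sh
# is_large FILE LIMIT: succeed when FILE is bigger than LIMIT bytes.
# (is_large is a hypothetical helper name, not part of the original script.)
is_large() {
    size=$(wc -c < "$1" | tr -d ' ')   # tr strips padding some wc versions add
    [ "$size" -gt "$2" ]
}

# Example: report which pages are above the script's 1000000-byte threshold.
for f in 0*.tif
do
    [ -e "$f" ] || continue            # no matching files: the glob stays literal
    if is_large "$f" 1000000
    then
        echo "$f would be reduced to 4 shades of grey"
    fi
done
```

Raising or lowering the second argument changes which pages get reduced; the reduction itself is still done by pnmquant in the script.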
The input files are assumed to be called 00001.tif, 00002.tif, and so on, and the output file is called output.pdf. Optical character recognition is carried out using tesseract, which seems to do a good job.
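If the scans are not already named this way, something like the following could generate the expected names. This is only a sketch: page_name is a hypothetical helper, and the 5-digit width is taken from the 00001.tif example.

```shell
#!/bin/sh
# page_name N: print the zero-padded file name for page N.
# (page_name is a hypothetical helper; the 5-digit width follows 00001.tif.)
page_name() {
    printf '%05d.tif\n' "$1"
}

# Print the rename commands for the scans given as arguments, in order.
# The echo makes this a dry run; remove it to actually rename the files.
n=0
for f in "$@"
do
    n=$((n + 1))
    echo mv "$f" "$(page_name "$n")"
done
```

The pages are numbered in the order the file names are given on the command line, so it is worth checking the dry-run output before removing the echo.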
Use at your own risk!
#!/bin/sh
# Build a searchable PDF (output.pdf) from sequentially numbered .tif pages.
MAX=99999
CURRENTDIR=`pwd | sed 's#/home/users/Papers/##'`
NPAGES=`ls 0*.tif | wc | awk '{print $1}'`
i=0
for FILE in 0*.tif
do
    BASE=`basename $FILE .tif`
    i=`expr $i + 1`
    d=`echo $i | awk '{printf "%05d",$1}'`
    echo $d " $CURRENTDIR $NPAGES"

    # Reduce files larger than 1000000 bytes to 4 shades of grey.
    SIZE=`ls -ltra $FILE | awk '{print $5}'`
    if [ $SIZE -gt 1000000 ]
    then
        echo "FILE $FILE IS LARGE SO MINIMIZING"
        tifftopnm $FILE | ppmtopgm | pnmquant 4 | pnmtotiff -lzw > new.tif
        FILE=new.tif
    fi

    # Convert the page image to PostScript, then to tmp.pdf.
    tifftopnm $FILE 2> tifftopnm.err | ppmtopgm | \
        pnmtops -noturn -rle 2> pnmtops.err > tmp.ps
    status=$?
    if [ $status -ne 0 ]
    then
        echo "Initial tifftopnm pipeline failed"
        cat tifftopnm.err
        cat pnmtops.err
    fi
    ps2pdf -dEPSCrop tmp.ps

    # Run OCR; the recognized text is written to $BASE.txt.
    tesseract $FILE $BASE 2> tesseract.err > tesseract.txt
    status=$?
    if [ $status -ne 0 ]
    then
        echo "TESSERACT FAILED"
        cat tesseract.err
        cat tesseract.txt
    fi

    # Replace literal < and > in the recognized text before passing it to hocr2pdf.
    sed 's/</lt/g' $BASE.txt | sed 's/>/gt/g' > tmp.txt
    mv tmp.txt $BASE.txt

    # Build the text-only page (tmp2.pdf); on failure, retry with hOCR output.
    hocr2pdf -n -i $FILE -o tmp2.pdf < $BASE.txt 2> hocr2pdf.err > hocr2pdf.txt
    status=$?
    if [ $status -ne 0 ]
    then
        echo "hocr2pdf failed"
        cat hocr2pdf.err
        cat hocr2pdf.txt
        echo "Retrying using hocr"
        tesseract $FILE $BASE hocr
        hocr2pdf -n -s -i $FILE -o tmp2.pdf < $BASE.html
    fi

    # Combine the image page (tmp.pdf) with the text page (tmp2.pdf) as its background.
    pdftk tmp.pdf background tmp2.pdf output newpage$d.pdf
    if [ $i -eq $MAX ]
    then
        break
    fi
done

# Join all pages into output.pdf and remove the temporary files.
pdftk newpage*.pdf cat output output.pdf
rm newpage0*.pdf
rm tmp.pdf tmp2.pdf hocr2pdf.txt tesseract.txt
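
Since the script depends on several external programs, a quick check before a long run may save some time. A sketch, assuming a hypothetical helper called missing_tools; the tool list is simply the set of commands the script calls:

```shell
#!/bin/sh
# missing_tools NAME...: print each NAME that is not found on the PATH.
# (missing_tools is a hypothetical helper, not part of the original script.)
missing_tools() {
    for tool in "$@"
    do
        command -v "$tool" > /dev/null 2>&1 || echo "$tool"
    done
}

# The commands used by the script above.
missing=$(missing_tools tifftopnm ppmtopgm pnmquant pnmtotiff pnmtops \
                        ps2pdf tesseract hocr2pdf pdftk)
if [ -n "$missing" ]
then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "Not found on PATH:" $missing
fi
```

This only reports whether the commands exist; it does not check their versions or that they accept the options the script passes.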