I found that my site's Google search box (implemented a while back using a form to employ a Google search) returned results that looked promising but actually all pointed to URLs that contained the initial string 'blog' rather than 'data'. Evidently the Google index was elderly...
...I had recently decided to replace the string 'blog' with the string 'data' (the idea behind this change was to disguise the fact that I use Nano-Blogger to prepare the site's pages!). Anyway, my initial way of handling the switch over from 'blog' to 'data' was to add a line in my .htaccess file to send requests for 'blog' pages to the index for the 'data' section of the web site. I hadn't realized that the line in question was sending everything only to the main index of the site - not very useful.
A little experimentation with .htaccess rules, though, yielded a more correct rule, which changes URLs containing 'blog/' into the equivalent URLs containing 'data/'. This is what I needed. Here is the necessary line for the .htaccess file:
RewriteRule ^blog/(.*) http://www.themolecularuniverse.com/data/$1 [L]
For the record the previous (failing) .htaccess line was:
RewriteRule ^blog/* http://www.themolecularuniverse.com/data/$1
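I am not sure where the incorrect line came from - but I am learning that things in parentheses in .htaccess files are important: the parentheses capture the rest of the requested path, and $1 refers to whatever they captured. Without them $1 is empty, so the part of the URL after 'blog/' is lost...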
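To see the difference between the two rules without touching a live server, here is a rough sketch in bash. It is not Apache: the rewrite function is a hypothetical helper of mine that only imitates the way $1 is filled in from the first parenthesised group, and it assumes bash rather than plain sh because it relies on BASH_REMATCH.

```shell
#!/bin/bash
# Hypothetical helper (not part of Apache): imitate how a RewriteRule
# fills in $1. Arguments: pattern, target prefix, requested path.
rewrite() {
    pattern=$1
    target=$2
    path=$3
    if [[ $path =~ $pattern ]]; then
        # BASH_REMATCH[1] holds the first parenthesised group,
        # and is empty when the pattern has no parentheses
        printf '%s%s\n' "$target" "${BASH_REMATCH[1]}"
    else
        # No match: the path is left alone
        printf '%s\n' "$path"
    fi
}

# Working rule: prints http://www.themolecularuniverse.com/data/2015/11/post.html
rewrite '^blog/(.*)' 'http://www.themolecularuniverse.com/data/' 'blog/2015/11/post.html'

# Failing rule: prints http://www.themolecularuniverse.com/data/
rewrite '^blog/*' 'http://www.themolecularuniverse.com/data/' 'blog/2015/11/post.html'
```

The path 'blog/2015/11/post.html' is just an invented example; any path beginning with 'blog/' shows the same behaviour.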
I thought that these two pictures were interesting. The first photograph shows an anarchist attacking a police car on November 5, 2015. The second photograph shows a row of photographers taking a photograph of the same vehicle on the same day. It looks as though the photographers significantly outnumbered the anarchists ... and the value of a photograph outweighs the honour of performing a citizen's arrest...
In recent days I have experimented further with tesseract version 3.03 and improved the pdf creation script that I use as a result. Here are my thoughts:
Firstly, I decided to make use of the fact that tesseract can now accurately position the majority of words over the underlying image of the word in the scan. This was not the case in the past. The benefit of this improvement is that when you find a word in the text, you can see the part of the image which the tesseract OCR algorithm associated with the word. The downside is that sometimes the spacing between the words is not correct. With a scan to just text, I found many fewer mistakes where words are run together. However, the benefit of being able to see what was hit in the search outweighs the deficit of not being able to cut and paste the scanned text freely, in my view.
Secondly, I decided to drop the resolution of scanned color images. I use Group 4 compression for black and white images, and this leads to a page of around 100k in a final OCRd pdf. This leads to a pdf size of about 20MB for around 200 pages, which is manageable. (100000*200/1000000=20, not accounting for different ideas about the size of a mega, etc.). Such files could be much smaller if one were to use jbig2, but jbig2 isn't yet widely supported by pdf viewers, so I decided to hold off on jbig2 for now.
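The size arithmetic above can be checked from the shell. This is only a sketch: estimate_mb is a hypothetical helper, and the 100000 bytes per page is the rough figure quoted above rather than a measurement.

```shell
#!/bin/sh
# Hypothetical helper: estimated pdf size in decimal megabytes,
# given a page count and an average page size in bytes.
estimate_mb() {
    pages=$1
    bytes_per_page=$2
    echo $(( pages * bytes_per_page / 1000000 ))
}

# 200 pages at roughly 100000 bytes each: prints 20
estimate_mb 200 100000

# The same total counted in units of 1048576 bytes: prints 19
echo $(( 200 * 100000 / 1048576 ))
```

The second figure shows how little the "size of a mega" question matters at this scale.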
For color images, reducing the resolution does not seem to hurt too much, and the improvement in viewability for color images is dramatic.
Here is the script...it assumes that you have tesseract installed correctly and TESSDATA_PREFIX set up correctly.
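#!/bin/sh
# Stop after this many pages (lower it for a quick test run)
MAX=99999
CURRENTDIR=`pwd | sed 's#/home/person/scans/##' | sed 's#/new_method##'`
NPAGES=`ls 0*.tif | wc | awk '{print $1}'`
i=0
for FILE in 0*.tif
do
    BASE=`basename $FILE .tif`
    i=`expr $i + 1`
    # Zero-padded page number, so that newpage*.pdf sorts in page order
    d=`echo $i | awk '{printf "%05d",$1}'`
    echo $d " $CURRENTDIR $NPAGES"
    tifftopnm $FILE > tmp.pnm
    TYPE=`pnmfile tmp.pnm | awk '{print $2}'`
    if [ "$TYPE" = "PPM" ]
    then
        # Color page: quantize to 256 colors and reduce the resolution
        pnmquant 256 tmp.pnm | pnmtotiff -lzw > tmp.tif
        convert tmp.tif -adaptive-resize 25% -density 150 new$d.tif
    else
        # Anything else is treated as black and white: Group 4 compression
        pnmtotiff -g4 tmp.pnm > tmp.tif
        convert tmp.tif -density 600 new$d.tif
    fi
    # OCR the page and write a one-page pdf with a text layer
    tesseract new$d.tif newpage$d pdf
    rm new$d.tif
    if [ $i -eq $MAX ]
    then
        break
    fi
done
# Join the single-page pdfs into one document and tidy up
pdftk newpage*.pdf cat output output.pdf
rm newpage0*.pdf
rm tmp.pnm tmp.tif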
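Since the script leans on quite a few external programs, a quick check along the following lines may save a failed run. It is only a sketch: missing_tools is a hypothetical helper, and the list of names is simply the set of commands the script above calls.

```shell
#!/bin/sh
# Hypothetical helper: print each named command that is not found on PATH.
missing_tools() {
    for tool in "$@"
    do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# The commands used by the scanning script
missing_tools tifftopnm pnmfile pnmquant pnmtotiff convert tesseract pdftk

# The script also relies on TESSDATA_PREFIX being set
if [ -z "${TESSDATA_PREFIX:-}" ]
then
    echo "TESSDATA_PREFIX is not set"
fi
```

If nothing is printed, every command was found and TESSDATA_PREFIX has a value. This does not check that the value points at usable trained data.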
And a comparison of the new script's results. Here is: Macaulay's Lord Clive (new script) versus the older version: Macaulay's Lord Clive (old script). The new version is 18MB and the old version 23MB - and the new version looks nicer because it has color...!
I decided to install Tesseract 3.03 on my Ubuntu box recently. (I wanted to have the text layer on my scanned PDFs correctly lined up with the underlying page image - Tesseract 3.03 does this.) So I downloaded the appropriate source and set about building.
I had to build and install leptonica first - I used version 1.72. Thereafter, there was a problem with make in the tesseract 'api' directory. I resolved this by simply executing the required command by hand:
# /bin/bash ../libtool --tag=CXX --mode=link g++ -o tesseract tesseract-tesseractmain.o libtesseract.la -lrt -lpthread /usr/local/lib/liblept.a
This is just the original line emitted by the Makefile with the location of the leptonica library (i.e. /usr/local/lib/liblept.a) corrected.
Thereafter everything was relatively straightforward. I had to download the English 'trained' data from the appropriate site, and then tesseract was ready to use.