[time-nuts] Gentlemen, unlimber your Broadband Connections

Joseph Gray jgray at zianet.com
Mon Apr 30 04:04:15 UTC 2007


> Yeah, "bookmark" is Adobe-ese for what I was referring to as a PDF
> index.  Even better would be the ultimate tool for finding stuff in
> doc files - the permuted index.  Unfortunately that would mean a LOT
> of OCR work.
>
> John

Coincidentally, I was just trying out a copy of Acrobat 8 Pro for the first 
time. I downloaded the PDF of "Inside The Vacuum Tube" by John F. Rider from 
that web site. The PDF already contains bookmarks, which is good, as 
creating them is tedious.

As a test, I opened that PDF in Acrobat 8 Pro and OCR'ed it. I didn't time 
the process, but it seemed to take about 10-15 minutes on an old 3.4GHz P4. That 
book is 424 pages. I saved the result to a file containing the original page 
images with the OCR'ed text underneath. This way, you see the exact scanned 
page, but can search on the hidden text.

Frankly, I was amazed at how well the Acrobat built-in OCR performed. I 
tried searching on several words/phrases throughout the document. The only 
times the search failed were on things like large caps at the start of a 
word, the handwritten text in the illustrations, and some words that had a 
line across them.

I also tried saving just the OCR'ed text with the inline images (like a 
normal text PDF). In this case, the text was still very good, but the images 
suffered quite a bit. In this particular instance, it would take a fair amount 
of cleanup to recreate the original look and content of the book. If you 
were only interested in the text portion for export to some other format, 
the OCR results were very good, considering what the printed page looks 
like.

A high-end OCR program like FineReader or OmniPage would do a better job of 
recognizing the text, but would also take longer.
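
Once the text is exported, the permuted index John mentioned is not much 
code. Here is a rough sketch in Python, assuming the OCR'ed text has already 
been exported as one string per page. The names permuted_index, STOP_WORDS 
and the sample pages are my own inventions for illustration and have nothing 
to do with Acrobat. It is only a minimal keyword-in-context listing, not a 
finished indexing tool.

```python
import re

# Hypothetical stop list; a real index would want a much longer one.
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}


def permuted_index(pages, width=30):
    """Return sorted (keyword, page_number, context_line) entries.

    pages is a list of strings, one per page of exported OCR text.
    Each non-stop word becomes a keyword, shown with up to `width`
    characters of context on either side.
    """
    entries = []
    for page_no, text in enumerate(pages, start=1):
        words = re.findall(r"[A-Za-z0-9']+", text)
        for i, word in enumerate(words):
            key = word.lower()
            if key in STOP_WORDS:
                continue
            # Context before the keyword, right-aligned so keywords line up.
            left = " ".join(words[:i])[-width:]
            # The keyword itself plus the context that follows it.
            right = " ".join(words[i:])[:width]
            entries.append((key, page_no, f"{left:>{width}}  {right}"))
    entries.sort()
    return entries


if __name__ == "__main__":
    # Made-up sample text standing in for two pages of OCR output.
    pages = [
        "The plate current in the vacuum tube",
        "Grid bias and plate voltage",
    ]
    for key, page_no, line in permuted_index(pages):
        print(f"{line}  p.{page_no}")
```

Since the index is only as good as the text it is built from, OCR misses 
like the ones above (large initial caps, handwritten labels) would simply not 
show up in it.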

For an exact replica of a textbook such as this one, having the actual page 
scans combined with hidden, searchable text is very useful. The downside to 
this is that the resultant file was almost twice as large as the original 
that I downloaded. Of course, we all have huge hard drives these days, so a 
16MB book isn't such a big deal. BTW, saving the PDF as an Acrobat 5 (PDF 
v1.4) file made things a few hundred KB smaller, as well as making it 
compatible with older versions of Acrobat Reader. The default was to save as 
Acrobat 7 format.

BTW, if anyone wants the OCR'ed version of "Inside The Vacuum Tube", I can 
upload it somewhere.




