[time-nuts] Gentlemen, unlimber your Broadband Connections
Joseph Gray
jgray at zianet.com
Mon Apr 30 04:04:15 UTC 2007
> Yeah, "bookmark" is Adobe-ese for what I was referring to as a PDF
> index. Even better would be the ultimate tool for finding stuff in
> doc files - the permuted index. Unfortunately that would mean a LOT
> of OCR work.
>
> John
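For anyone who hasn't run into the term, a permuted (KWIC) index lists each title once under every significant word it contains, so you can find a document by any keyword. Below is a rough Python sketch of the idea, not a description of any particular tool. The stop-word list and the second sample title are made up for illustration.

```python
# Sketch of a permuted (KWIC) index.
# STOP_WORDS and the second sample title are illustrative assumptions.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}


def permuted_index(titles):
    """Return (keyword, rotated title) pairs sorted by keyword."""
    entries = []
    for title in titles:
        words = title.split()
        for i, word in enumerate(words):
            key = word.lower()
            if key in STOP_WORDS:
                continue  # skip words nobody would look up
            # Rotate the title so the keyword comes first; " / " marks
            # the point where the original title wraps around.
            rotated = " ".join(words[i:])
            if i:
                rotated += " / " + " ".join(words[:i])
            entries.append((key, rotated))
    return sorted(entries)


if __name__ == "__main__":
    sample = ["Inside the Vacuum Tube", "Testing the Vacuum Tube"]
    for key, line in permuted_index(sample):
        print(f"{key:<10}{line}")
```

The hard part for scanned manuals is not this step but getting clean text to feed it, which is where the OCR work comes in.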
Coincidentally, I was just trying out a copy of Acrobat 8 Pro for the first
time. I downloaded the PDF of "Inside The Vacuum Tube" by John F. Rider from
that web site. The PDF already contains bookmarks, which is good, as
creating them is tedious.
As a test, I opened that PDF in Acrobat 8 Pro and OCR'ed it. I didn't time
the process, but it seemed to take about 10-15 minutes on an old 3.4GHz P4. That
book is 424 pages. I saved the result to a file containing the original page
images with the OCR'ed text underneath. This way, you see the exact scanned
page, but can search on the hidden text.
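To put that untimed estimate in per-page terms, here is a quick back-of-the-envelope calculation in Python. The 10 and 15 minute figures are only the rough guesses given above, not measurements.

```python
# Rough OCR throughput from the figures quoted above.
# The 10-15 minute range is an untimed estimate, so treat the result loosely.
PAGES = 424


def seconds_per_page(minutes, pages=PAGES):
    """Convert a total run time in minutes to seconds per page."""
    return minutes * 60 / pages


if __name__ == "__main__":
    for minutes in (10, 15):
        print(f"{minutes} min -> {seconds_per_page(minutes):.1f} s/page")
```

That works out to somewhere between about 1.4 and 2.1 seconds per page.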
Frankly, I was amazed at how well the Acrobat built-in OCR performed. I
tried searching on several words/phrases throughout the document. The only
time the search failed was on things like large caps at the start of a word,
the handwritten text in the illustrations, and some words that had a line
across them.
I also tried saving just the OCR'ed text with the inline images (like a
normal text PDF). In this case, the text was still very good, but the images
suffered quite a bit. In this particular instance, it would take quite a bit
of cleanup to recreate the original look and content of the book. If you
were only interested in the text portion for export to some other format,
the OCR results were very good, considering what the printed page looks
like.
A high-end OCR program like FineReader or OmniPage would probably do a better
job of recognizing the text, but would also take longer.
For an exact replica of a textbook such as this one, having the actual page
scans combined with hidden, searchable text is very useful. The downside to
this is that the resultant file was almost twice as large as the original
that I downloaded. Of course, we all have huge hard drives these days, so a
16MB book isn't such a big deal. BTW, saving the PDF as an Acrobat 5 (PDF
v1.4) file made things a few hundred KB smaller, as well as making it
compatible with older versions of Acrobat Reader. The default was to save as
Acrobat 7 format.
BTW, if anyone wants the OCR'ed version of "Inside The Vacuum Tube", I can
upload it somewhere.