[olug] Convert TIFF to PDF with OCR?

Dan Linder dan at linder.org
Sat Feb 20 15:04:48 UTC 2010


On Fri, Feb 19, 2010 at 22:20, Obi-Wan <obiwan at jedi.com> wrote:
> TIFF supports numerous methods of compression, including JPEG, LZW,
> and CCITT.  I'm pretty familiar with the TIFF format spec, but I'm not
> familiar with a standard method of attaching text in a TIFF, so it's
> probably done with a custom tag (the "T" in TIFF).  Therefore, I think
> you're unlikely to find any app (other than the one that did the
> initial encoding) that will retain your text when converting to PDF.
> I think you've got two options to pursue:

Thanks.  For everyone's information, it was the "Microsoft Document
Scanner" software.  It worked well enough to scan/OCR/save, but it was
buggy as heck and would cause a Dr. Watson error if you went to the
"File" menu.  Thankfully there was a "Save" button in the button bar.

> 1) Re-OCR the images using something that will write directly to a PDF.

The tiff2pdf that Ed suggested was able to convert the images, but it
kept warning about unknown fields -- the text, I presume.  Since the
resulting PDF was the same size as the original TIFF files (both
18MB), I'm thinking that the original TIFF images already had
compression applied.
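
For anyone who wants to check that guess rather than infer it from
file sizes, here is a minimal Python sketch.  It is my own
illustration, not something tiff2pdf does: it reads only the first
image directory (IFD) of a classic TIFF, reports the Compression tag
(259), and lists any private tags (32768 and above), which is where
I would expect a vendor-specific text field to live.  The function
names are made up for this example, and multi-page files would need
the remaining IFDs walked as well.

```python
import struct

# Compression tag (259) values from the TIFF 6.0 spec and common extensions.
COMPRESSION_NAMES = {
    1: "none",
    2: "CCITT RLE",
    3: "CCITT Group 3",
    4: "CCITT Group 4",
    5: "LZW",
    6: "JPEG (old-style)",
    7: "JPEG",
    8: "Deflate",
    32773: "PackBits",
}


def read_first_ifd(data):
    """Return (endian, {tag: (type, count, raw 4-byte value field)})."""
    if data[:2] == b"II":
        endian = "<"   # little-endian ("Intel") byte order
    elif data[:2] == b"MM":
        endian = ">"   # big-endian ("Motorola") byte order
    else:
        raise ValueError("not a TIFF file")
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF (BigTIFF is not handled)")
    (count,) = struct.unpack_from(endian + "H", data, ifd_offset)
    entries = {}
    for i in range(count):
        # Each IFD entry is 12 bytes: tag, type, count, value-or-offset.
        tag, typ, n, raw = struct.unpack_from(
            endian + "HHI4s", data, ifd_offset + 2 + 12 * i)
        entries[tag] = (typ, n, raw)
    return endian, entries


def describe(data):
    """Return (compression name, sorted list of private tag numbers)."""
    endian, entries = read_first_ifd(data)
    code = 1  # the spec's default when tag 259 is absent: no compression
    if 259 in entries:
        # Compression is a single SHORT stored inline in the value field.
        (code,) = struct.unpack(endian + "H", entries[259][2][:2])
    name = COMPRESSION_NAMES.get(code, "unknown (%d)" % code)
    private = sorted(tag for tag in entries if tag >= 32768)
    return name, private
```

To try it on a real scan, something like
describe(open("scan.tif", "rb").read()) should do, where "scan.tif"
is a placeholder filename.  A result other than "none" would back up
the idea that recompressing the files will not save much space.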

A quick search on "pdf ocr" returns some interesting results:
 * Make the PDF files accessible via a website, then let the Google
search engine find them.  Google does the PDF-to-text conversion, and
you can then view the results by searching for
"site:yoursite.com filetype:pdf".
(http://www.labnol.org/software/convert-scanned-pdf-images-to-text-with-google-ocr/5158/)
 * Google has an open source OCR tool: http://code.google.com/p/ocropus/

> 2) Open the TIFFs in an image processing package and then save them
>   as compressed TIFFs, in the hope that the package will pass any
>   unrecognized tags along verbatim to the output file.  You may have
>   to try or research several packages to find one that does this.

Thanks for all the help.  I think I'll probably leave them as TIFF
files since they appear to already be compressed.

Dan

-- 
***************** ************* *********** ******* ***** *** **
"Quis custodiet ipsos custodes?"
    (Who can watch the watchmen?)
    -- from the Satires of Juvenal
"I do not fear computers, I fear the lack of them."
    -- Isaac Asimov (Author)
** *** ***** ******* *********** ************* *****************


