[olug] Convert TIFF to PDF with OCR?

Rob Townley rob.townley at gmail.com
Sat Feb 20 22:38:21 UTC 2010


On Sat, Feb 20, 2010 at 9:04 AM, Dan Linder <dan at linder.org> wrote:
>
> On Fri, Feb 19, 2010 at 22:20, Obi-Wan <obiwan at jedi.com> wrote:
> > TIFF supports numerous methods of compression, including JPEG, LZW,
> > and CCITT.  I'm pretty familiar with the TIFF format spec, but I'm not
> > familiar with a standard method of attaching text in a TIFF, so it's
> > probably done with a custom tag (the "T" in TIFF).  Therefore, I think
> > you're unlikely to find any app (other than the one that did the
> > initial encoding) that will retain your text when converting to PDF.
> > I think you've got two options to pursue:
>
> Thanks - for everyone's information it was the "Microsoft Document
> Scanner" software.  It worked well enough to scan/OCR/save, but it was
> buggy as heck and would cause a Dr. Watson error if you went to the
> "File" menu.  Thankfully there was a "Save" button in the button bar.
>
> > 1) Re-OCR the images using something that will write directly to a PDF.
>
> The tiff2pdf that Ed suggested was able to convert the images but it
> kept warning about unknown fields -- the text I presume.  Since the
> resulting .PDF was the exact same size as the original TIFF files
> (both 18MB), I'm thinking that the original TIFF images might have had
> compression applied.
>
> A quick search on "pdf ocr" returns some interesting results:
>  * Make the PDF files accessible via a website, then let the Google
> search engine find them.  They do the PDF-to-text, then you can view
> them by searching for "site:yoursite.com filetype:pdf".
> (http://www.labnol.org/software/convert-scanned-pdf-images-to-text-with-google-ocr/5158/)
>  * Google has an open-source PDF-to-OCR tool: http://code.google.com/p/ocropus/
>
> > 2) Open the TIFFs in an image processing package and then save them
> >   as compressed TIFFs, in the hope that the package will pass any
> >   unrecognized tags along verbatim to the output file.  You may have
> >   to try or research several packages to find one that does this.
>
> Thanks for all the help.  I think I'll probably leave them in the TIFF
> file since they appear to already be compressed.
>
> Dan
>

fyi,

Be very careful when experimenting with _multipage_ TIFFs: many
_graphics_-oriented software packages, including ones from Adobe, the
GIMP, and Paint.NET, will delete all but the first page of a multipage
TIFF with no warning.
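
One way to catch a silently dropped page is to count the pages before and
after a conversion.  The following is a minimal sketch, not a full TIFF
reader: it assumes a classic (non-BigTIFF) file whose pages are chained
top-level IFDs, and the name count_tiff_pages is made up for this example.

```python
import struct


def count_tiff_pages(data: bytes) -> int:
    """Count the chained top-level IFDs (pages) in a classic TIFF."""
    if len(data) < 8:
        raise ValueError("too short to be a TIFF")
    # The first two bytes give the byte order used by the rest of the file.
    if data[:2] == b"II":
        endian = "<"   # little-endian
    elif data[:2] == b"MM":
        endian = ">"   # big-endian
    else:
        raise ValueError("not a TIFF: bad byte-order mark")
    magic, offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF (BigTIFF is not handled here)")

    pages = 0
    seen = set()
    while offset != 0:
        if offset in seen:
            raise ValueError("IFD chain loops back on itself")
        seen.add(offset)
        # Each IFD is: 2-byte entry count, 12 bytes per entry,
        # then a 4-byte offset to the next IFD (0 marks the last one).
        (n_entries,) = struct.unpack_from(endian + "H", data, offset)
        (offset,) = struct.unpack_from(endian + "I", data, offset + 2 + 12 * n_entries)
        pages += 1
    return pages
```

Reading the original and the re-saved file as bytes (for example with
open(path, "rb").read()) and comparing the two counts shows whether an
editor kept every page.  Files that store extra pages in SubIFDs would
need more than this sketch.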

TIFF has many options.  Probably no piece of software will support
them all.  The same file may look great in one program, terrible in
another, and not even openable by a third.

Tesseract is a very active open-source, cross-platform OCR package.

ImageMagick's convert program is also open source and cross-platform.

ImageMagick's website also has an online tool.
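
As a rough sketch of the re-OCR route with Tesseract, the snippet below
builds and runs the command "tesseract <input> <outputbase> pdf".  It
assumes a Tesseract release recent enough to ship the "pdf" output
config; the function names and file names are made up for this example.

```python
import shutil
import subprocess


def tesseract_pdf_command(tiff_path, output_base):
    """Build the argument list for a searchable-PDF run of Tesseract."""
    # "pdf" names the output config; Tesseract appends ".pdf" to
    # output_base on its own.
    return ["tesseract", tiff_path, output_base, "pdf"]


def ocr_to_pdf(tiff_path, output_base):
    """Run Tesseract and return the name of the PDF it should have written."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract is not on PATH")
    subprocess.run(tesseract_pdf_command(tiff_path, output_base), check=True)
    return output_base + ".pdf"
```

Calling ocr_to_pdf("scan.tif", "scan") would be expected to leave
scan.pdf next to the input.  With a multipage TIFF it is worth checking
the page count of the result before deleting anything.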

TIFF with CCITT Group 4 fax compression can give the smallest b&w image
files you can get.  It is closely related to the Group 3 encoding that
9600 baud fax machines use.  Pages can be as small as a few kB each.
Adding OCR text will make the file bigger.  Unless there are pictures
or hundreds of pages in your PDFs, 18MB seems very large.
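
A quick back-of-the-envelope check of that size, as a sketch: the page
size (US Letter), the 200 dpi resolution, and the 50 KiB "compressed
page" figure are all assumptions chosen for illustration, not numbers
from Dan's files.

```python
def raw_bilevel_bytes(width_in, height_in, dpi):
    """Uncompressed size of a 1-bit-per-pixel page, rows padded to bytes."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    row_bytes = (width_px + 7) // 8   # 8 pixels per byte, rounded up
    return row_bytes * height_px


def pages_in_budget(total_bytes, bytes_per_page):
    """How many whole pages of a given size fit in total_bytes."""
    return total_bytes // bytes_per_page


total = 18 * 1024 * 1024                      # the 18MB file
letter_raw = raw_bilevel_bytes(8.5, 11, 200)  # assumed scan settings
print("raw bytes per page:", letter_raw)
print("uncompressed pages in 18MB:", pages_in_budget(total, letter_raw))
print("pages at 50 KiB each:", pages_in_budget(total, 50 * 1024))
```

Under these assumptions 18MB holds only about 40 uncompressed b&w pages
but several hundred compressed ones, so a short document at that size
suggests grayscale or color scans, or little effective compression.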


