[olug] Convert TIFF to PDF with OCR?
Dan Linder
dan at linder.org
Sat Feb 20 15:04:48 UTC 2010
On Fri, Feb 19, 2010 at 22:20, Obi-Wan <obiwan at jedi.com> wrote:
> TIFF supports numerous methods of compression, including JPEG, LZW,
> and CCITT. I'm pretty familiar with the TIFF format spec, but I'm not
> familiar with a standard method of attaching text in a TIFF, so it's
> probably done with a custom tag (the "T" in TIFF). Therefore, I think
> you're unlikely to find any app (other than the one that did the
> initial encoding) that will retain your text when converting to PDF.
> I think you've got two options to pursue:
Thanks. For everyone's information, the app that did the initial
encoding was the "Microsoft Document Scanner" software. It worked
well enough to scan/OCR/save, but it was
buggy as heck and would cause a Dr. Watson error if you went to the
"File" menu. Thankfully there was a "Save" button in the button bar.
> 1) Re-OCR the images using something that will write directly to a PDF.
The tiff2pdf that Ed suggested was able to convert the images, but it
kept warning about unknown fields -- the text, I presume. Since the
resulting PDF was the same size as the original TIFF files
(both 18MB), I'm thinking that the original TIFF images already had
compression applied.
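One way to check the compression guess without extra tools is to read
the Compression tag (tag 259) straight out of the TIFF header. Below is
a minimal Python sketch, assuming classic (non-BigTIFF) files and
looking only at the first page. The function name tiff_compression and
the COMPRESSION_NAMES table are my own, made up for illustration:

```python
import struct

# Names for a few common values of the TIFF Compression tag (tag 259).
COMPRESSION_NAMES = {
    1: "none",
    2: "CCITT RLE",
    3: "CCITT Group 3",
    4: "CCITT Group 4",
    5: "LZW",
    6: "JPEG (old-style)",
    7: "JPEG",
    8: "Deflate",
    32773: "PackBits",
}


def tiff_compression(data: bytes) -> int:
    """Return the Compression tag value from the first IFD of a classic TIFF."""
    if data[:2] == b"II":
        endian = "<"  # little-endian ("Intel") byte order
    elif data[:2] == b"MM":
        endian = ">"  # big-endian ("Motorola") byte order
    else:
        raise ValueError("not a TIFF file")

    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF (BigTIFF is not handled here)")

    (count,) = struct.unpack(endian + "H", data[ifd_offset:ifd_offset + 2])
    for i in range(count):
        start = ifd_offset + 2 + 12 * i  # each IFD entry is 12 bytes
        tag, field_type, n = struct.unpack(endian + "HHI", data[start:start + 8])
        if tag == 259:
            # Compression is a SHORT stored at the start of the value field.
            (value,) = struct.unpack(endian + "H", data[start + 8:start + 10])
            return value
    return 1  # the TIFF spec defaults to "no compression" if the tag is absent
```

To try it, pass the file's bytes, e.g. the result of
open("scan.tif", "rb").read() (scan.tif being a placeholder name), and
look the returned code up in COMPRESSION_NAMES. Anything other than 1
means the image data is already compressed.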
A quick search on "pdf ocr" returns some interesting results:
* Make the PDF files accessible via a website, then let the Google
search engine find them. They do the PDF-to-text, then you can view
them by searching for "site:yoursite.com filetype:pdf".
(http://www.labnol.org/software/convert-scanned-pdf-images-to-text-with-google-ocr/5158/)
* Google has an open source OCR tool (OCRopus): http://code.google.com/p/ocropus/
> 2) Open the TIFFs in an image processing package and then save them
> as compressed TIFFs, in the hope that the package will pass any
> unrecognized tags along verbatim to the output file. You may have
> to try or research several packages to find one that does this.
Thanks for all the help. I think I'll probably leave them as TIFF
files since they appear to already be compressed.
Dan
--
***************** ************* *********** ******* ***** *** **
"Quis custodiet ipsos custodes?"
(Who can watch the watchmen?)
-- from the Satires of Juvenal
"I do not fear computers, I fear the lack of them."
-- Isaac Asimov (Author)
** *** ***** ******* *********** ************* *****************