[olug] Trying to extract bar codes from a pdf file
Adam Haeder
adam at adamhaeder.com
Wed Feb 8 01:05:21 CST 2017
I figured it out. The key was to stop trying to parse the file as a PDF:
convert it to an image and work from that instead. Once it was converted to
a PNG (via pdftoppm), I used tesseract to OCR the barcode ids (complete
with coordinate info), then used 'convert -crop' from ImageMagick to cut
out each barcode. Code is at
http://adamhaeder.com/extract_barcodes_from_pdf.sh if you're interested.
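For anyone who just wants the shape of the cropping step, here is a rough
sketch. It is not the script at the URL above. It assumes the OCR step
yields, for each id, a box in pixels as left/top/width/height with the origin
at the upper left of the page image, and that the bars sit directly above the
id text. The function name crop_cmd, the file name page.png, and the
50-pixel upward padding are all made up for illustration. It only prints the
convert command rather than running it.

```shell
#!/bin/sh
# crop_cmd PAGE_PNG ID LEFT TOP WIDTH HEIGHT [PAD_UP]
# Hypothetical helper: prints an ImageMagick command that crops the box
# around one OCR'd id, extended upward by PAD_UP pixels to take in the bars.
# PAD_UP defaults to 50, an arbitrary guess that must be tuned per document.
crop_cmd() {
    page=$1
    id=$2
    left=$3
    top=$4
    width=$5
    height=$6
    pad_up=${7:-50}

    # Move the top edge up, but never above the top of the image.
    y=$((top - pad_up))
    if [ "$y" -lt 0 ]; then
        y=0
    fi
    # New height runs from the raised top edge to the bottom of the text box.
    h=$((top + height - y))

    # -crop takes WIDTHxHEIGHT+X+Y; +repage drops the leftover page offset.
    printf 'convert %s -crop %dx%d+%d+%d +repage %s.png\n' \
        "$page" "$width" "$h" "$left" "$y" "$id"
}

# Example with made-up pixel coordinates for one id.
crop_cmd page.png A33432008009636B 40 160 270 20
```

Piping the printed lines to sh would run the crops. Printing first makes it
easy to eyeball the geometry before touching any files.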
On Tue, Feb 7, 2017 at 10:54 PM, Adam Haeder <adam at adamhaeder.com> wrote:
> I'm running into some challenges with this and I'm wondering if any of you
> smarter people have some insight.
>
> I have a pdf file of bar codes <http://adamhaeder.com/out.pdf>. I want to
> extract the barcodes (both the actual bar code part itself, and the text
> directly beneath it which tells me what's encoded in the barcode) into
> individual image files named after the id. So for example, in the file, the
> first barcode is the data A33432008009636B. I want to extract this barcode
> into an image file named A33432008009636B.png
>
> At first glance, this file looks pretty well laid out, enough so that I
> thought it would be easy to just give static coordinates for each barcode,
> and use the tool pdftoppm to pull out a certain chunk of the file. For
> example, this command:
>
> $ pdftoppm -f 1 -l 1 -r 150 -x 0 -y 160 -W 330 -H 65 -png out.pdf > foo.png
>
> will extract the section starting at x coordinate 0, y coordinate 160,
> with a width of 330 pixels and a height of 65 pixels (assuming the dpi is
> 150) and save it to the file foo.png. I can then pull the id out of foo.png
> with an OCR command like tesseract and rename the png file accordingly.
> This works, I tried it. The problem is that the file isn't regular enough:
> too many bar codes are not quite lined up right (because I can't guarantee
> how many lines of text will be above each one). The sample pdf file is only
> 1 page of 89, so there are a lot of barcodes in this. I wrote a script to
> try the static coordinate method, and then go and walk through each image I
> created, searching for the A[0-9]+B string to see how many I missed, and I
> got about a 20% error rate. Way too high.
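The check described above (OCR each crop, look for the A[0-9]+B string) can
be sketched roughly as follows. The function name extract_id is made up, and
the example feeds it canned text instead of real OCR output, so it runs
without tesseract installed. It relies on grep -o, which GNU and BSD grep
both provide.

```shell
#!/bin/sh
# extract_id: read OCR text on stdin and print the first token that looks
# like a barcode id (an A, one or more digits, then a B). Prints nothing
# if no such token is present, which is how a missed crop would show up.
extract_id() {
    grep -oE 'A[0-9]+B' | head -n 1
}

# Canned stand-in for OCR output. With tesseract installed, the input
# would come from something like: tesseract foo.png stdout
printf 'stray text\nA33432008009636B\n' | extract_id
```

In a loop over the cropped images, a non-empty result would be used as the
new file name (mv foo.png "$id.png"), and an empty one counted as a miss.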
>
> So my next step is to get something that will give me an approximate
> coordinate system for each barcode. I found the program pstotext which will
> do just that, if I run it with the -bboxes option, like so:
>
> # pstotext -bboxes out.pdf | egrep "A[0-9]{6,}"
> GPL Ghostscript 9.18: Some glyphs of the font GGICBJ+ArialMT requires a
> patented True Type interpreter.
> 19 689 147 714 A33432008009636B
> 217 689 345 714 A33432008009637B
> 415 689 543 714 A33432008009638B
> 19 617 147 642 A33432008515260B
> 217 612 345 636 A33432009199973B
> 415 612 543 636 A33432009200037B
> 19 540 147 564 A33432009200094B
> 217 540 345 564 A33432008515245B
> 415 540 543 564 A33432008513679B
> 19 473 147 498 A33432008009681B
> 217 468 345 492 A33432008009682B
> 415 468 543 492 A33432008009698B
> 19 396 147 420 A33432008898682B
> 217 401 345 426 A33432008009700B
> 415 396 543 420 A33432008512549B
> 19 329 147 354 A33432009196094B
> 217 329 345 354 A33432009196102B
> 415 329 543 354 A33432008009720B
> 19 257 147 282 A33432009196342B
> 217 257 345 282 A33432009196359B
> 415 257 543 282 A33432009196367B
> 19 185 147 210 A33432009196375B
> 217 185 345 210 A33432009196383B
> 415 185 543 210 A33432009196391B
> 19 113 147 138 A33432009196409B
> 217 113 345 138 A33432009196417B
> 415 113 543 138 A33432009196425B
> 19 41 147 66 A33432009196433B
> 217 41 345 66 A33432009196441B
> 415 41 543 66 A33432009196458B
>
> Success, we think! I'll just take those coordinates, feed them to pdftoppm,
> and Bob's your uncle. However.... pdftoppm uses a different coordinate
> system. Firstly, it starts counting from upper left, unlike pstotext which
> counts from lower left. Also, while pdftoppm works in pixels (which is why
> I had to pass a dpi value), pstotext works in 'points', which I honestly
> haven't been able to figure out yet.
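On the units: a PostScript point is 1/72 of an inch, so points convert to
pixels by multiplying by dpi/72, and the y axis flips by subtracting from the
page height. Below is a rough sketch of that conversion. It assumes the
bboxes lines are "llx lly urx ury text" with a lower-left origin, and that
the page is US Letter (792 points tall). That page height is a guess, not
something read from the PDF, so it should be checked (pdfinfo reports the
page size). The function name bbox_to_pdftoppm is made up, and results are
truncated to whole pixels.

```shell
#!/bin/sh
# bbox_to_pdftoppm DPI PAGE_HEIGHT_PT
# Hypothetical helper: reads "llx lly urx ury text" lines on stdin (points,
# origin at lower left) and prints "x y W H text" in pixels with the origin
# at upper left, matching pdftoppm's -x -y -W -H options.
# Only lines whose fifth field looks like A<digits>B are converted, so
# stray Ghostscript warnings are skipped.
bbox_to_pdftoppm() {
    awk -v dpi="$1" -v page_h="$2" '
        $5 ~ /^A[0-9]+B$/ {
            s = dpi / 72                # pixels per point
            x = $1 * s                  # left edge is unchanged by the flip
            y = (page_h - $4) * s       # top edge, measured from the top
            w = ($3 - $1) * s
            h = ($4 - $2) * s
            # %d truncates each value to a whole pixel.
            printf "%d %d %d %d %s\n", x, y, w, h, $5
        }'
}

# First line of the pstotext output above, at 150 dpi on an assumed
# 792-point-tall page.
echo "19 689 147 714 A33432008009636B" | bbox_to_pdftoppm 150 792
# prints: 39 162 266 52 A33432008009636B
```

The y value of 162 lands close to the hand-picked -y 160 in the static
command, which suggests the Letter assumption is at least plausible for this
file. The box only covers what pstotext reports for the text run, so some
padding may still be needed before passing the numbers to pdftoppm.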
>
> So it seems my 2 options are:
> - Somehow convert the above coordinate system output of pstotext to a
> format that pdftoppm will be happy to read, or
> - do something completely different to programmatically get these barcodes
> out of this pdf
>
> Thanks for any advice!
>
>
> --
> Adam Haeder
> adam at adamhaeder.com
>
> Check out my latest book: LPI Linux Certification in a Nutshell from
> O'Reilly: http://bit.ly/bvQQ0I
>