[olug] Trying to extract bar codes from a pdf file

Wed Feb 8 01:05:21 CST 2017

I figured it out. The key was to stop trying to parse it as a PDF and
instead convert it to an image and work from there. Once it was converted
to a png (via pdftoppm), I used tesseract to OCR out the barcode ids
(complete with coordinate info), then used 'convert -crop' from imagemagick
to pull out each barcode. Code at
http://adamhaeder.com/extract_barcodes_from_pdf.sh if you're interested.
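
In rough outline, the OCR-then-crop step can be sketched as below. This is
a simplified illustration, not the linked script. It assumes tesseract's
tsv output (left, top, width, height in fields 7-10, the word in field 12).
The function name crop_cmds, the file names page.png and page.tsv, and the
"above" and "pad" values are made up for the example and would need tuning
against the real page.

```shell
#!/bin/sh
# Assumed to have been run beforehand (not executed by this sketch):
#   pdftoppm -f 1 -singlefile -r 150 -png out.pdf page   # writes page.png
#   tesseract page.png page tsv                          # writes page.tsv

# crop_cmds TSV IMAGE [PAD] [ABOVE]
# Prints one ImageMagick command per OCR'd word that looks like a barcode id.
# PAD   = margin in pixels added on every side of the id text (hypothetical)
# ABOVE = extra pixels above the text, to take in the bars (hypothetical)
crop_cmds() {
    awk -F '\t' -v img="$2" -v pad="${3:-10}" -v above="${4:-60}" '
        $12 ~ /^A[0-9]+B$/ {
            x = $7 - pad;          if (x < 0) x = 0   # clamp at the left edge
            y = $8 - pad - above;  if (y < 0) y = 0   # clamp at the top edge
            w = $9 + 2 * pad
            h = $10 + above + 2 * pad
            printf "convert %s -crop %dx%d+%d+%d +repage %s.png\n", img, w, h, x, y, $12
        }' "$1"
}
```

Something like "crop_cmds page.tsv page.png" prints the commands for
review, and piping that output to sh runs them. Printing first makes it
easy to eyeball a few crops before committing to all 89 pages.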

On Tue, Feb 7, 2017 at 10:54 PM, Adam Haeder <adam at adamhaeder.com> wrote:

> I'm running into some challenges with this and I'm wondering if any of you
> smarter people have some insight.
>
> I have a pdf file of bar codes <http://adamhaeder.com/out.pdf>. I want to
> extract the barcodes (both the actual bar code part itself, and the text
> directly beneath it which tells me what's encoded in the barcode) into
> individual image files named after the id. So for example, in the file, the
> first barcode is the data A33432008009636B. I want to extract this barcode
> into an image file named A33432008009636B.png
>
> At first glance, this file looks pretty well laid out, enough so that I
> thought it would be easy to just give static coordinates for each barcode,
> and use the tool pdftoppm to pull out a certain chunk of the file. For
> example, this command:
>
> $ pdftoppm -f 1 -l 1 -r 150 -x 0 -y 160 -W 330 -H 65 -png out.pdf > foo.png
>
> will extract the section starting at x coordinate 0, y coordinate 160,
> with a width of 330 pixels and a height of 65 pixels (assuming the dpi is
> 150) and save it to the file foo.png. I can then pull the id out of foo.png
> with an OCR command like tesseract and rename the png file accordingly.
> This works, I tried it. The problem is that the file isn't regular enough:
> too many bar codes are not quite lined up right (because I can't guarantee
> how many lines of text will be above each one). The sample pdf file is only
> 1 page of 89, so there are a lot of barcodes in this. I wrote a script to
> try the static coordinate method, and then go and walk through each image I
> created, searching for the A[0-9]+B string to see how many I missed, and I
> got about a 20% error rate. Way too high.
>
> So my next step is to get something that will give me an approximate
> coordinate system for each barcode. I found the program pstotext which will
> do just that, if I run it with the -bboxes option, like so:
>
> # pstotext -bboxes out.pdf | egrep "A[0-9]{6,}"
> GPL Ghostscript 9.18: Some glyphs of the font GGICBJ+ArialMT requires a
> patented True Type interpreter.
>     19     689     147     714  A33432008009636B
>    217     689     345     714  A33432008009637B
>    415     689     543     714  A33432008009638B
>     19     617     147     642  A33432008515260B
>    217     612     345     636  A33432009199973B
>    415     612     543     636  A33432009200037B
>     19     540     147     564  A33432009200094B
>    217     540     345     564  A33432008515245B
>    415     540     543     564  A33432008513679B
>     19     473     147     498  A33432008009681B
>    217     468     345     492  A33432008009682B
>    415     468     543     492  A33432008009698B
>     19     396     147     420  A33432008898682B
>    217     401     345     426  A33432008009700B
>    415     396     543     420  A33432008512549B
>     19     329     147     354  A33432009196094B
>    217     329     345     354  A33432009196102B
>    415     329     543     354  A33432008009720B
>     19     257     147     282  A33432009196342B
>    217     257     345     282  A33432009196359B
>    415     257     543     282  A33432009196367B
>     19     185     147     210  A33432009196375B
>    217     185     345     210  A33432009196383B
>    415     185     543     210  A33432009196391B
>     19     113     147     138  A33432009196409B
>    217     113     345     138  A33432009196417B
>    415     113     543     138  A33432009196425B
>     19      41     147      66  A33432009196433B
>    217      41     345      66  A33432009196441B
>    415      41     543      66  A33432009196458B
>
> Success, we think! I'll just take those coordinates, feed them to pdftoppm,
> and Bob's your uncle. However... pdftoppm uses a different coordinate
> system. Firstly, it starts counting from upper left, unlike pstotext which
> counts from lower left. Also, while pdftoppm works in pixels (which is why
> I had to pass a dpi value), pstotext works in 'points', which I honestly
> haven't been able to figure out yet.
>
> So it seems my 2 options are:
> - Somehow convert the above coordinate system output of pstotext to a
> format that pdftoppm will be happy to read, or
> - Do something completely different to programmatically get these barcodes
> out of this pdf
>
> Thanks for any advice!
>
>
> --
> Adam Haeder
> adam at adamhaeder.com
>
> Check out my latest book: LPI Linux Certification in a Nutshell from
> O'Reilly: http://bit.ly/bvQQ0I
>
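
On the units question in the quoted message: a PostScript/PDF point is
1/72 inch, so pixels = points * dpi / 72, and the y axis flips by
subtracting from the page height. A minimal sketch of that conversion
follows. It assumes the pstotext columns are llx lly urx ury word (which
the numbers above are consistent with) and that the page is US Letter, 792
points tall (pdfinfo reports the real size). The function name
bbox_to_pdftoppm is invented for the example.

```shell
#!/bin/sh
# bbox_to_pdftoppm DPI PAGE_HEIGHT_PT
# Reads "llx lly urx ury word" lines on stdin (points, origin at lower left)
# and prints pdftoppm crop options (pixels, origin at upper left) followed
# by the word. Lines whose fifth field is not a barcode id are skipped.
bbox_to_pdftoppm() {
    awk -v dpi="$1" -v ph="$2" '
        # 1 point = 1/72 inch; round to the nearest pixel
        function px(pt) { return int(pt * dpi / 72 + 0.5) }
        $5 ~ /^A[0-9]+B$/ {
            printf "-x %d -y %d -W %d -H %d %s\n",
                   px($1), px(ph - $4), px($3 - $1), px($4 - $2), $5
        }'
}
```

Usage would be along the lines of
"pstotext -bboxes out.pdf | bbox_to_pdftoppm 150 792". For the first bbox
above (19 689 147 714) this gives -y 163 at 150 dpi, close to the -y 160 in
the hand-tuned pdftoppm command.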

-- 
Adam Haeder
adam at adamhaeder.com
