Sunday, May 10, 2009

Scraping PDFs in Python

So, in the course of grabbing some additional data sources for GovCheck, I needed to scrape a few PDFs and insert the information into my database. After looking high and low, I found an acceptable solution for this in Python - pdfminer. It's not perfect, but it's much better than the rest of the PDF-to-HTML/TXT converter tools - at least as far as scraping goes. So I figured I'd note here how I wrote my scraping code. As a reference point, I was parsing election data for the past election using this PDF file.
You start off by running the PDF through pdfminer's pdf2txt.py and loading the resulting HTML into BeautifulSoup:

import os
from BeautifulSoup import BeautifulSoup

for page in range(9, 552):
    # Convert one page of the PDF to HTML and parse it
    soup = BeautifulSoup(os.popen('python ~/dev/pdfminer-dist-20090330/pdflib/pdf2txt.py -w -p %d Vol_II_LS_2004.pdf' % page).read())

Once you have said soup object, the next steps are pretty much the same as for scraping any HTML page you'd grab off the web, except that since the HTML generated by pdfminer uses absolute positioning, you need to take care of those offsets. A couple of ways I found to take them into account are listed below:

  1. You can analyse a few pages of the HTML generated from the PDF, find the various offsets that show up, and try each in turn:

     from string import strip

     for left_margin in (267, 271, 275):
         try:
             electors_total = map(int, map(strip, soup.find('span', style = 'position:absolute; writing-mode:lr-tb; left:%dpx; top:352px; font-size:8px;' % left_margin).string.split(' ')))
             break
         except AttributeError:
             pass

  2. The previous method is actually a pretty bad way of doing things, not to mention unreliable, because your code breaks whenever an offset you didn't know of turns up. A better way is to find a nearby constant piece of text (such as "MARGIN ") and backtrack from there to your relevant tag. This allows you to completely avoid hard-coding offsets - unfortunately it isn't usable in all cases. Here's an example of how you can use this method:

     self._data['Margin']['Number'], self._data['Margin']['Percentage'] = map(strip, soup.find(text = 'MARGIN ').parent.previousSibling.previousSibling.string.split(' '))
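Before committing to a hard-coded tuple like (267, 271, 275), it can help to dump every left offset a page actually uses and eyeball the result. Here's a minimal sketch of that idea using only the standard library; the sample HTML (offsets and cell text) is invented for illustration, but it mimics the absolutely-positioned spans pdf2txt.py emits:

```python
import re

# Invented sample of the absolutely-positioned spans pdfminer emits;
# the offsets and numbers here are made up for illustration.
html = '''
<span style="position:absolute; writing-mode:lr-tb; left:267px; top:352px; font-size:8px;">123 456</span>
<span style="position:absolute; writing-mode:lr-tb; left:271px; top:352px; font-size:8px;">789 12</span>
<span style="position:absolute; writing-mode:lr-tb; left:275px; top:400px; font-size:8px;">34 56</span>
'''

def left_offsets(page_html):
    """Return the sorted distinct left: offsets (in px) used on a page."""
    return sorted(set(int(px) for px in re.findall(r'left:(\d+)px', page_html)))

print(left_offsets(html))  # -> [267, 271, 275]
```

Running this over a handful of pages tells you up front which offsets your fallback loop needs to cover, instead of finding out when a lookup raises AttributeError.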


So there you go, a quick example of how to scrape a PDF using Python. It's not perfect, but it works for me, for now.
