
Saturday, May 16, 2009

Using the Twitter and Google Search APIs with jQuery

Google and Twitter both have APIs (see their API docs) available for web developers to create mashups with their data. To give users a broader view of what is happening around the constituency, state, party or representative they are viewing, GovCheck shows four searches on each of these pages: News, Web and Video search from Google, and real-time search from Twitter. These searches are all performed with the jQuery JavaScript framework. This has two advantages: the searches don't hold up page loading (they happen asynchronously - AJAX :)), and they run on the client side, which means fewer resources (and hence less load) used on the server. So how exactly is this done?
The technology making this possible (besides the languages, TCP and so on) is JSONP. For security reasons, browsers do not allow cross-domain AJAX requests; JSONP is a convention that lets us send such requests to remote sites anyway, although it has its own issues too. Both Google and Twitter search support JSONP, as does jQuery, which makes querying those services from the client side very easy. Here's how to go about it.

function getGoogleContent(elem, searchtype, query) {
    // Note: the element tags ('<div/>', '<ul/>', '<li/>', '<a/>') were stripped from
    // the original listing; they are restored here based on the styles and bullets used.
    $.getJSON('http://ajax.googleapis.com/ajax/services/search/' + searchtype + '?v=1.0&q=' + query + '&callback=?',
        function(data) {
            if (data.responseData.results.length == 0)
                elem.html($('<div/>').html('No results').attr('style', 'text-align: center; font-weight: bold; font-size: 1.2em;'));
            else {
                var elemliststr = elem.attr('id') + '-list';
                elem.html($('<ul/>').attr('id', elemliststr).attr('style', 'list-style-type: disc; padding-left: 30px;'));
                var elemlistobj = $('#' + elemliststr);
                $.each(data.responseData.results, function(i, item) {
                    elemlistobj.append($('<li/>').html($('<a/>').html(item.title).attr('href', (searchtype == 'video') ? item.url : item.unescapedUrl).attr('target', '_blank')).append((item.publisher) ? ' (' + item.publisher + ')' : ' '));
                });
                elem.append($('<div/>').html($('<a/>').html('More Results').attr('href', data.responseData.cursor.moreResultsUrl).attr('target', '_blank')).attr('style', 'padding-top: 10px; font-weight: bold; text-align: center;'));
            }
        }
    );
}

function getTwitterContent(query) {
    var elem = $('#twitter-tab');
    $.getJSON('http://search.twitter.com/search.json?q=' + query + '&rpp=4&callback=?',
        function(data) {
            if (data.results.length == 0)
                elem.html($('<div/>').html('No results').attr('style', 'text-align: center; font-weight: bold; font-size: 1.2em;'));
            else {
                var elemliststr = elem.attr('id') + '-list';
                elem.html($('<ul/>').attr('id', elemliststr).attr('style', 'list-style-type: disc; padding-left: 30px;'));
                var elemlistobj = $('#' + elemliststr);
                $.each(data.results, function(i, item) {
                    elemlistobj.append($('<li/>').html($('<a/>').html(item.from_user).attr('href', 'http://twitter.com/' + item.from_user).attr('target', '_blank')).append(': ' + item.text));
                });
                elem.append($('<div/>').html($('<a/>').html('More Results').attr('href', 'http://search.twitter.com/search?q=' + query).attr('target', '_blank')).attr('style', 'padding-top: 10px; font-weight: bold; text-align: center;'));
            }
        }
    );
}

We use jQuery's getJSON call to fetch the JSON results from these services. The "&callback=?" at the end of both query URLs tells getJSON that these are JSONP requests; jQuery automagically substitutes the name of the callback function that the results should be fed to.
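To make that substitution concrete, here is a small illustrative sketch, written in Python rather than JavaScript, of what a JSONP-enabled endpoint sends back once a callback name has been filled in; the callback name and payload below are made up:

    import json

    def jsonp_response(callback, payload):
        # A JSONP endpoint wraps its JSON payload in a call to the
        # client-supplied callback name, so the browser can load the
        # response as an ordinary script and have the callback run.
        return '%s(%s);' % (callback, json.dumps(payload))

    # jQuery replaces the '?' in 'callback=?' with a generated name:
    print(jsonp_response('jsonp1242513987', {'results': [{'title': 'Example'}]}))
    # -> jsonp1242513987({"results": [{"title": "Example"}]});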
The "elem" and "searchtype" arguments passed to the "getGoogleContent" function allow me to call the same function for the News, Web and Video searches like so:

      getGoogleContent($('#news-tab'), 'news', query);
      getGoogleContent($('#youtube-tab'), 'video', query);
      getGoogleContent($('#web-tab'), 'web', query);

Another thing to note here is the usefulness of Firebug. When I was starting out with these queries, I relied on console.log to see the structure of the returned objects (since neither Google nor Twitter spell it out), which then let me make the right calls for the relevant data.

So there you go - a quick, easy and scalable way of using the Google and Twitter APIs to bring information into your website.

Tuesday, May 12, 2009

Store pickled data in a database using SQLAlchemy

I've been using SQLAlchemy (and Elixir) to write the data collection code for GovCheck. My first choice would have been to use Django's ORM itself, but that proved harder and more time consuming than I thought worth it.

One of the problems I've had to solve is how to store lots of static data in a SQL database without adding more than a hundred columns to a table. A good solution is to store this data as a pickled dictionary. The catch is that the built-in PickleType defined by SQLAlchemy does not also base64 encode the result of pickling the dictionary (I wanted that because the custom field type I'd defined within the Django app was something similar to this, which assumes base64 encoded data coming in). Fortunately, it's rather easy to create custom types for SQLAlchemy and use them in your models. Here is the custom type I created to store the data.

import base64
try:
    import cPickle as pickle
except ImportError:
    import pickle

from sqlalchemy import types


class EncodedPickleType(types.TypeDecorator):
    """
    This class should be used whenever pickled data needs to be stored
    (instead of using the in-built PickleType). The reason for this is
    that the in-built type does not encode the pickled string using
    base64, which is what the Django field type expects.
    """
    impl = types.Text

    def process_bind_param(self, value, dialect):
        # Pickle the Python object, then base64 encode it for storage.
        dumps = pickle.dumps
        if value is None:
            return None
        return base64.b64encode(dumps(value))

    def process_result_value(self, value, dialect):
        # Decode and unpickle the stored string on the way back out.
        loads = pickle.loads
        if value is None:
            return None
        if not isinstance(value, basestring):
            return value
        return loads(base64.b64decode(value))


You can then use this type within your model definitions and feed it a dictionary. The field takes care of pickling and encoding the data on the way in, and of decoding and unpickling it when you read it back.
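For example, here is a minimal sketch of how the type might be wired into a table; the table and column names are hypothetical, and it assumes the Python 2 / SQLAlchemy behaviour of the time (implicit autocommit, string-keyed result rows):

    from sqlalchemy import create_engine, MetaData, Table, Column, Integer

    metadata = MetaData()

    # Hypothetical table: 'details' holds an arbitrary dict, stored as a
    # base64 encoded pickle by the EncodedPickleType defined above.
    representatives = Table('representatives', metadata,
        Column('id', Integer, primary_key=True),
        Column('details', EncodedPickleType),
    )

    engine = create_engine('sqlite:///:memory:')
    metadata.create_all(engine)
    conn = engine.connect()

    # Feed it a plain dictionary; a dictionary comes back out.
    conn.execute(representatives.insert().values(
        id=1, details={'education': 'B.A.', 'assets': 1234567}))
    print(conn.execute(representatives.select()).fetchone()['details'])

The same type can also be handed to an Elixir Field, since Elixir builds on SQLAlchemy's column types.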

Sunday, May 10, 2009

Scraping PDFs in Python

So, in the course of grabbing some additional data sources for GovCheck, I needed to scrape a few PDFs and insert the information into my database. After looking high and low, I found an acceptable solution in Python - pdfminer. It's not perfect, but it's much better than the rest of the PDF to HTML/text converter tools, at least as far as scraping goes. So I figured I'd note here how I wrote my scraping code. As a reference point, I was parsing election data for the past election using this PDF file.
You start off by running the PDF through pdfminer and getting the resulting HTML back.

import os
from BeautifulSoup import BeautifulSoup

for page in range(9, 552):
    # Convert one page of the PDF to HTML and parse it with BeautifulSoup.
    soup = BeautifulSoup(os.popen('python ~/dev/pdfminer-dist-20090330/pdflib/pdf2txt.py -w -p %d Vol_II_LS_2004.pdf' % page).read())

Once you have said soup object, the next steps are pretty much the same as scraping any HTML page you would grab from the web, except that the HTML generated by pdfminer uses absolute positioning, so you need to take care of those offsets. A couple of ways I found to take them into account are listed below:

1. You can analyse a few pages of the HTML generated from the PDF, find the various offsets that appear, and try each of them:

    from string import strip  # the map() calls below use string.strip

    for left_margin in (267, 271, 275):
        try:
            electors_total = map(int, map(strip, soup.find('span', style='position:absolute; writing-mode:lr-tb; left:%dpx; top:352px; font-size:8px;' % left_margin).string.split(' ')))
            break
        except AttributeError:
            pass

2. The previous method is actually a pretty bad way of doing things, not to mention unreliable, because your code can break whenever an offset you did not know of turns up. A better way is to find a nearby constant piece of text (such as "MARGIN ") and then backtrack from there to the tag you want. This lets you avoid hard-coding offsets entirely, although unfortunately it's not usable in all cases. Here's an example of this method (a consolidated sketch follows below):

    self._data['Margin']['Number'], self._data['Margin']['Percentage'] = map(strip, soup.find(text='MARGIN ').parent.previousSibling.previousSibling.string.split(' '))
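Putting the page loop and the second method together, here is a rough sketch of what one iteration of the scraping loop can look like; the page range and command are the ones from above, but the record structure and the field pulled out are simplified, illustrative stand-ins rather than the full GovCheck scraper:

    import os
    from string import strip
    from BeautifulSoup import BeautifulSoup

    records = []
    for page in range(9, 552):
        soup = BeautifulSoup(os.popen('python ~/dev/pdfminer-dist-20090330/pdflib/pdf2txt.py -w -p %d Vol_II_LS_2004.pdf' % page).read())

        record = {}
        # Anchor on a constant label and backtrack to the neighbouring data.
        margin_label = soup.find(text='MARGIN ')
        if margin_label:
            number, percentage = map(strip, margin_label.parent.previousSibling.previousSibling.string.split(' '))
            record['Margin'] = {'Number': number, 'Percentage': percentage}
        records.append(record)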


So there you go, a quick example of how to scrape a PDF using Python. It's not perfect, but it works for me, for now.

Thursday, September 4, 2008

Why I use vim!!

It's been a while - life and other things have kept me busy. Anyways, what better way to return than to write about something I have not talked much about in the past - technology.

So, coding is my profession - I enjoy coding - nay - love it (given the right circumstances). And one of the biggest weapons in a coder's arsenal is his/her IDE. Well, mine is vim (vi improved). I have been using vim on and off for a few years now - but mainly as a text editor. I had never considered using it as my main development tool until recently (2-3 months back). Boy was I missing out. I've been using it to write everything from Python to HTML/Javascript now and been loving every bit of it.

The main (and rather obvious) reason is its amazing ability to make text editing fast. Visual mode is the best invention mankind has made since the wheel. It allows me to remove and add text about as fast as (maybe even faster than) I can think of doing it. Want a word gone - done. Want a line gone - done. Want to replace a line - line gone, cursor at the beginning of the line - done. I could keep going. Gone are the days when you have to use your mouse to select a word, or a whole line. w, b, j, k, h, l let me move around in less time than it would take to move my hand off the keyboard, onto the mouse, find the pointer, bring it down to the text I need gone, select the whole thing and hit delete. In fact, now when I edit text in a browser (like this Blogger textbox, for example), I find it tedious to delete or add things.

But more than that, its ability to double up as an IDE is what has me truly amazed. The amazing set of plugins around it (for example snippetsEmu) makes it a breeze to write my code and get the right text in the right place ASAP. Other things like omnicompletion, syntax highlighting, line numbering, syntax checkers, class definitions etc. are icing on the cake. And the biggest win - it works on all three of my development platforms (Windows, Mac and Linux - yes, I work on all three). Setting it up on my Windows machine took some doing (I had to compile vim from source), but once it was done, everything worked exactly as it did on my other machines. No more switching between IDEs and no more learning new commands (thanks, but no thanks, TextMate).

I do have a couple of gripes though - I would like a straightforward way to refactor my code, and I would like integration with my VCS (git). I have tried out a plugin for the latter, but couldn't get it working in the ten minutes I gave it.

In any case, vim is an amazing text editor/IDE, and anybody using anything else (with the exception of Eclipse for Java) should consider switching. I haven't tried Emacs, but after learning vim I don't see any value in it - I've got all I need (and more).