Parsing HTML with Python – Mark's IT Blog

Came across an interesting scenario this week where I decided to try and use Python for parsing HTML. Turns out to quite straight forward and something I could imagine using more often in the future.

We were trying to grab the a dynamic link from this page at twitch http://www.twitch.tv/directory/StarCraft%20II:%20Wings%20of%20Liberty.

The link we were after was the most viewed channel at any particular time:

twitch.tv -SC2 — The most viewed channel is always in this location/element

First attempt was to load up that page in an <iframe> allowing our own JavaScript to grab the relevant URL then forward. Twitch.tv however forces forwarding when iframes are detected, presumably for proprietary reasons. So javascript was not really an option.

After experimenting with Python’s lxml http://lxml.de/ it was really straight forward and effective. It is apparently quite efficient using C core (http://stackoverflow.com/a/6494811/692180) too. The module I used is documented quite well here: http://lxml.de/lxmlhtml.html. With a little bit of trial an error a brief python script successfully grabs the relevant information.

Then using PHP’s popen() method I can simply call the python script and use the return value as a php variable for a header redirect.

Links to source:

get_link.py – Python script

get_link.php – PHP calling python and using return value :

error_reporting(E_ALL);
$handle = popen('python -i ./get_link.py 2>&1', 'r');
$read = fread($handle, 2096);
pclose($handle);
header('Location: '.$read);

Python source:

import urllib
from lxml import html

def getElement():
        f = urllib.urlopen("http://www.twitch.tv/directory/StarCraft%20II:%20Wings%20of%20Liberty")
        # Read from the object, storing the page's contents in 's'.
        s = f.read()
        f.close()# Read from the object, storing the page's contents in 's'.
        doc = html.document_fromstring(s)
        doc = doc.get_element_by_id('directory_channels')
        #rVal = trimDoc.text_content()
        doc = doc.find_class('thumb')
        rVal = html.tostring(doc[0]).split()[2]
        return makeUrl(rVal)

def makeUrl(thumbString):
        return "http://twitch.tv" +  thumbString[6:-2]

if __name__ == "__main__":
        print getElement()

Leave a Reply