I came across an interesting scenario this week and decided to try using Python for parsing HTML. It turned out to be quite straightforward, and something I can imagine using more often in the future.
We were trying to grab a dynamic link from this page on Twitch: http://www.twitch.tv/directory/StarCraft%20II:%20Wings%20of%20Liberty
The link we were after pointed to the most viewed channel at any given time.
After some experimenting, Python’s lxml (http://lxml.de/) turned out to be really straightforward and effective. It is apparently also quite efficient thanks to its C core (http://stackoverflow.com/a/6494811/692180). The module I used, lxml.html, is documented quite well here: http://lxml.de/lxmlhtml.html. With a little trial and error, a brief Python script successfully grabs the relevant information.
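To give a feel for the kind of lookup involved, here is a minimal sketch using lxml.html on an inline sample. The markup below is hypothetical: only the id 'directory_channels' and the class 'thumb' come from the script in this post, and the helper name first_thumb_href is made up for illustration. It reads the href attribute directly, which is one possible approach rather than necessarily what the original script did.

```python
from lxml import html

# Hypothetical markup standing in for the real page; only the id and
# class names mirror the ones used in this post, the rest is made up.
SAMPLE = """
<html><body>
  <div id="directory_channels">
    <a class="thumb" href="/example_channel_one"><img src="one.jpg"></a>
    <a class="thumb" href="/example_channel_two"><img src="two.jpg"></a>
  </div>
</body></html>
"""

def first_thumb_href(page_source):
    """Return the href of the first 'thumb' element inside #directory_channels."""
    doc = html.document_fromstring(page_source)
    # Raises KeyError if no element carries this id.
    listing = doc.get_element_by_id("directory_channels")
    # find_class() returns a list of matching elements in document order.
    thumbs = listing.find_class("thumb")
    if not thumbs:
        return None
    return thumbs[0].get("href")

print(first_thumb_href(SAMPLE))
```

Reading the attribute with get() avoids depending on how the element happens to be serialized, at the cost of assuming the 'thumb' element itself carries the href.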
Then, using PHP’s popen() function, I can simply call the Python script and use its output as a PHP variable for a header redirect.
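One detail worth spelling out: a pipe like this hands back whatever the child process writes to standard output, so the Python side has to print the link. A rough Python sketch of the same idea, using subprocess as a stand-in for PHP's popen() and an inline one-liner with a made-up URL in place of the real script (both are assumptions for illustration):

```python
import subprocess
import sys

# Stand-in for get_link.py: a child Python process that prints a made-up URL.
CHILD_CODE = "print('http://twitch.tv/example_channel')"

def read_child_output():
    """Run the child process and return what it wrote to stdout, trimmed."""
    raw = subprocess.check_output([sys.executable, "-c", CHILD_CODE])
    # check_output returns bytes; decode and strip the trailing newline.
    return raw.decode("utf-8").strip()

url = read_child_output()
print(url)
```

Trimming matters for the redirect as well, since the printed line ends with a newline.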
Links to source:
get_link.php – PHP calling Python and using its output:
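$handle = popen('python ./get_link.py 2>&1', 'r'); // no -i, so Python exits once the script finishes
$read = fread($handle, 2096); // the link printed by the script
pclose($handle);
header('Location: ' . trim($read)); // redirect to the channel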
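get_link.py – Python fetching the page and extracting the link:
import urllib  # Python 2; on Python 3 urlopen lives in urllib.request
from lxml import html

def get_link():
    f = urllib.urlopen("http://www.twitch.tv/directory/StarCraft%20II:%20Wings%20of%20Liberty")
    # Read from the object, storing the page's contents in 's'.
    s = f.read()
    f.close()
    doc = html.document_fromstring(s)
    # Narrow the tree down to the channel listing.
    doc = doc.get_element_by_id('directory_channels')
    #rVal = trimDoc.text_content()
    # find_class() returns a list; the first 'thumb' is assumed to be the top channel.
    doc = doc.find_class('thumb')[0]
    rVal = html.tostring(doc).split()
    # Assumption: the line defining thumbString is missing from the source; here it
    # is taken to be the href="..."> token of the serialized element.
    thumbString = [t for t in rVal if t.startswith('href="')][0]
    return "http://twitch.tv" + thumbString[6:-2]

if __name__ == "__main__":
    print(get_link())  # printed so the calling PHP script can read it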