Came across an interesting scenario this week where I decided to try using Python for parsing HTML. It turns out to be quite straightforward, and something I could imagine using more often in the future.
We were trying to grab a dynamic link from this page at Twitch: http://www.twitch.tv/directory/StarCraft%20II:%20Wings%20of%20Liberty.
The link we were after was the one for the most viewed channel at any particular time.
After some experimenting, Python’s lxml (http://lxml.de/) turned out to be really straightforward and effective. It is apparently quite efficient too, since it is built on a C core (http://stackoverflow.com/a/6494811/692180). The module I used, lxml.html, is documented quite well here: http://lxml.de/lxmlhtml.html. With a little bit of trial and error, a brief Python script successfully grabs the relevant information.
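To give a feel for the lxml.html calls involved, here is a minimal sketch that runs against an inline HTML string instead of the live page. The sample markup, the SAMPLE_PAGE constant and the first_thumb_href() helper are made up for illustration; the real page's structure may well differ.

```python
from lxml import html

# Hypothetical markup standing in for the directory page.
SAMPLE_PAGE = """
<html><body>
  <div id="directory_channels">
    <a class="thumb" href="/first_channel"><img src="a.jpg"></a>
    <a class="thumb" href="/second_channel"><img src="b.jpg"></a>
  </div>
</body></html>
"""


def first_thumb_href(page_source):
    """Return the href of the first 'thumb' element inside #directory_channels."""
    doc = html.document_fromstring(page_source)
    container = doc.get_element_by_id("directory_channels")
    thumbs = container.find_class("thumb")  # a list of matching elements
    if not thumbs:
        return None
    return thumbs[0].get("href")


if __name__ == "__main__":
    print(first_thumb_href(SAMPLE_PAGE))
```

Note that find_class() hands back a list, so the first match has to be picked out before reading its attributes.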
Then, using PHP’s popen() function, I can simply call the Python script and use its output as a PHP variable for a header redirect.
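One detail worth keeping in mind with this approach is that print appends a newline, so whatever reads the script's output should trim it before putting it in a header. Here is a rough sketch of the same capture-and-trim pattern, written in Python rather than PHP purely for illustration. The read_child_output() helper and the example URL are hypothetical, and the child process is a stand-in for the real script.

```python
import subprocess
import sys


def read_child_output(command):
    """Run a command and return its stdout with surrounding whitespace removed."""
    raw = subprocess.check_output(command)
    return raw.decode("utf-8").strip()


if __name__ == "__main__":
    # Stand-in child process that prints a made-up channel URL.
    child = [sys.executable, "-c", "print('http://twitch.tv/example_channel')"]
    print(read_child_output(child))
```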
Links to source:
get_link.php – PHP calling Python and using its output:
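<?php
error_reporting(E_ALL);
// Run the script and read what it prints. The -i flag and 2>&1 are dropped
// so that prompts and error text cannot end up in the redirect.
$handle = popen('python ./get_link.py', 'r');
$read = fread($handle, 2096);
pclose($handle);
// print adds a trailing newline, so trim before building the header.
header('Location: ' . trim($read));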
get_link.py – Python script that fetches the page and prints the link:
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib import urlopen  # Python 2, as originally written

from lxml import html

PAGE_URL = "http://www.twitch.tv/directory/StarCraft%20II:%20Wings%20of%20Liberty"

def getElement():
    # Read the page's contents into 's'.
    f = urlopen(PAGE_URL)
    s = f.read()
    f.close()
    doc = html.document_fromstring(s)
    channels = doc.get_element_by_id('directory_channels')
    # find_class() returns a list; the first 'thumb' is the top channel.
    thumb = channels.find_class('thumb')[0]
    # Take the href from the thumb itself or from the first link inside it
    # (an assumption about the page's markup).
    href = thumb.xpath('descendant-or-self::a/@href')[0]
    return makeUrl(href)

def makeUrl(path):
    return "http://twitch.tv" + path

if __name__ == "__main__":
    print(getElement())