Categories
Random

MapReduce Programming Model and Hadoop

After hearing the praises of Hadoop, I had a brief look at it and the Map/Reduce paradigm. Most of the information came from Wikipedia and the references in its articles, in particular this paper from Jeffrey Dean and Sanjay Ghemawat.

Hadoop is an open-source software framework that implements MapReduce and a distributed file system (HDFS) modelled on the Google File System. The aim is to enable easy and robust deployment of highly parallel data set processing programs. It is important to note that the MapReduce model is applicable to embarrassingly parallel problems. Processing can occur on data that is stored in a database or filesystem.

Map/Reduce refers to the steps in the model:

Map: A master node takes an input problem and divides it into sub-problems, passing them to worker nodes. Worker nodes can further divide sub-problems. For example, in the problem of counting word occurrences in a document, the Map function outputs a key/value pair every time it sees a specified word, i.e. (“searchterm”, 1).

Reduce: The reduce function takes the list of word (key)/value pairs and sums the occurrences:

(foo, 1)
(bar, 1)
(foo, 1)
(bar, 1)
(foo, 1)

The output of reduce in this case could be: (foo, 3) and (bar, 2).
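The word-count example can be sketched in a few lines of plain Python. This is only an illustration of the two steps on a single machine, not Hadoop's API; the function names `map_words` and `reduce_counts` are made up for the example.

```python
from collections import defaultdict

def map_words(document):
    """Map step: emit a (word, 1) pair for every word in the document."""
    return [(word, 1) for word in document.split()]

def reduce_counts(pairs):
    """Reduce step: sum the values emitted for each key."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

pairs = map_words("foo bar foo bar foo")
counts = reduce_counts(pairs)
print(counts)  # {'foo': 3, 'bar': 2}
```

In a real cluster the pairs would be produced on many workers and grouped by key before being reduced; here both steps simply run one after the other.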

The MapReduce model becomes powerful when you consider that giant datasets can be processed very efficiently on large clusters.

The Hadoop runtime takes care of:

  • Parallelization
  • Partitioning of input data
  • Scheduling program execution across machines
  • Handling machine failures
  • Managing inter-machine communication
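To give a feel for what the runtime automates, here is a rough single-machine sketch of partitioning the input, running the work in parallel and merging the results. It assumes threads in one process as a stand-in for worker machines and leaves out failure handling entirely; `partition`, `count_words` and `run_job` are hypothetical names, not part of Hadoop.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def partition(lines, num_workers):
    """Split the input into num_workers roughly equal chunks."""
    return [lines[i::num_workers] for i in range(num_workers)]

def count_words(chunk):
    """Work done by one worker: count the words in its chunk."""
    counts = Counter()
    for line in chunk:
        counts.update(line.split())
    return counts

def run_job(lines, num_workers=3):
    chunks = partition(lines, num_workers)
    # Each chunk is handed to a worker; pool.map preserves chunk order.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(count_words, chunks))
    # Merge the partial results from every worker.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

lines = ["foo bar", "foo", "bar foo", "baz"]
result = run_job(lines)
print(dict(result))
```

With Hadoop, the developer writes only the equivalent of `count_words` and the merge logic; the splitting, scheduling and communication are handled by the framework.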

Hadoop aims to enable developers with little distributed programming experience to utilize compute resources such as EC2.

With the emergence of ‘big data’ and the apparent value that can be extracted from massive databases/datastores, many organisations have found the limits of traditional relational databases. Hadoop has generated so much buzz because it can push past the processing limits of relational database software and enable that value to be extracted. The video below is a decent explanation of this point by data scientists at SalesForce.com.



Parsing HTML with Python

I came across an interesting scenario this week where I decided to try using Python for parsing HTML. It turns out to be quite straightforward and something I could imagine using more often in the future.

We were trying to grab a dynamic link from this page at Twitch: http://www.twitch.tv/directory/StarCraft%20II:%20Wings%20of%20Liberty.

The link we were after was the most viewed channel at any particular time:

The most viewed channel is always in this location/element

The first attempt was to load that page in an <iframe>, allowing our own JavaScript to grab the relevant URL and then forward to it. Twitch.tv, however, forces a redirect when it detects iframes, presumably for proprietary reasons, so JavaScript was not really an option.

After experimenting with Python’s lxml (http://lxml.de/), it turned out to be really straightforward and effective. It is apparently quite efficient too, thanks to its C core (http://stackoverflow.com/a/6494811/692180). The module I used is documented quite well here: http://lxml.de/lxmlhtml.html. With a little bit of trial and error, a brief Python script successfully grabs the relevant information.

Then, using PHP’s popen() function, I can simply call the Python script and use its output as a PHP variable for a header redirect.

Links to source:

get_link.py – Python script

get_link.php – PHP calling Python and using the return value:

 

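<?php
error_reporting(E_ALL);
// Run the Python script and read the URL it prints on stdout.
// No -i (interactive mode) and no 2>&1, so prompts and errors cannot end up in the URL.
$handle = popen('python ./get_link.py', 'r');
// trim() removes the trailing newline from print, which header() would reject.
$read = trim(fread($handle, 2096));
pclose($handle);
header('Location: ' . $read);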
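For comparison, the same call-a-script-and-read-its-output pattern can be sketched in Python with the standard `subprocess` module. This is not part of the original setup: the helper name `read_script_output` and the example channel URL are made up, and a one-line child process stands in for get_link.py so the sketch runs on its own.

```python
import subprocess
import sys

def read_script_output(command):
    """Run a command and return its stdout with surrounding whitespace removed."""
    completed = subprocess.run(command, capture_output=True, text=True, check=True)
    return completed.stdout.strip()

# Stand-in for get_link.py: a child Python process that prints a URL.
child = [sys.executable, "-c", "print('http://twitch.tv/example_channel')"]
url = read_script_output(child)
print(repr(url))
```

The `strip()` matters for the same reason as in the PHP version: `print` appends a newline, and a redirect target should not contain one.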

 

Python source:

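from urllib.request import urlopen
from lxml import html

def getElement():
    f = urlopen("http://www.twitch.tv/directory/StarCraft%20II:%20Wings%20of%20Liberty")
    # Read from the object, storing the page's contents in 's'.
    s = f.read()
    f.close()
    doc = html.document_fromstring(s)
    # Narrow the document down to the channel directory container.
    doc = doc.get_element_by_id('directory_channels')
    # find_class() returns every element with class 'thumb', in document order.
    doc = doc.find_class('thumb')
    # Serialize the first match and take its third whitespace-separated token,
    # which is expected to hold the href attribute.
    rVal = html.tostring(doc[0], encoding='unicode').split()[2]
    return makeUrl(rVal)

def makeUrl(thumbString):
    # Assumes a token of the form href="/channel"> and strips the
    # leading 'href="' and the trailing '">'.
    return "http://twitch.tv" + thumbString[6:-2]

if __name__ == "__main__":
    print(getElement())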
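The lookup the script performs (find a container by id, then take the first link with class `thumb` inside it) can also be tried offline. The sketch below deliberately uses the standard library's `html.parser` instead of lxml so that it runs without extra packages. The HTML snippet is invented for illustration and only reuses the `directory_channels` id and `thumb` class from the script above; the real page's markup may differ.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}

class FirstThumbLink(HTMLParser):
    """Record the href of the first <a class="thumb"> inside a given container id."""

    def __init__(self, container_id):
        super().__init__()
        self.container_id = container_id
        self.depth = 0      # nesting depth inside the container (0 = outside)
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        if self.depth:
            self.depth += 1
        elif attrs.get("id") == self.container_id:
            self.depth = 1
        if self.depth and self.href is None and tag == "a":
            classes = (attrs.get("class") or "").split()
            if "thumb" in classes:
                self.href = attrs.get("href")

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1

# Invented sample markup, not copied from the real page.
SAMPLE = """
<div id="other"><a class="thumb" href="/ignored">x</a></div>
<div id="directory_channels">
  <a class="thumb" href="/first_channel"><img src="a.jpg"></a>
  <a class="thumb" href="/second_channel"><img src="b.jpg"></a>
</div>
"""

parser = FirstThumbLink("directory_channels")
parser.feed(SAMPLE)
url = "http://twitch.tv" + parser.href
print(url)
```

Reading the `href` attribute directly avoids the string slicing used in `makeUrl`, at the cost of more code than the lxml version needs.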

 


Pascal's Triangle – Python

I wrote a quick script that attempts to complete Pascal's triangle recursively, thanks Feng.
**Note: after getting some feedback I made a number of changes, thanks Fry.**
See source: pascals.py

#########################################################
# Python Script: Pascals triangle #
# Enter How many rows you want... #
# Mark Culhane 2011, straightIT.com #
#########################################################

# Initialize
tri = [[1], [1]]
n = 0

def pascals():
    # Each call adds one entry to the last row, then recurses
    # until n rows are complete.
    if len(tri) <= n:
        rowNum = len(tri) - 1
        colCnt = len(tri[rowNum])

        if colCnt == rowNum:
            # Row is one entry short: close it with a 1 and start the next row.
            tri[rowNum].extend([1])
            tri.append([1])
        else:
            # The next entry is the sum of the two entries above it.
            p1 = tri[rowNum - 1][colCnt - 1]
            p2 = tri[rowNum - 1][colCnt]
            tri[rowNum].extend([p1 + p2])
        pascals()
    return

def displaySol():
    # Print the first n rows; the trailing partial row is skipped.
    for row in tri[:n]:
        print(row)

if __name__ == "__main__":
    print("Pascals Triangle")
    n = int(input("How many rows do you want?:"))
    print("Result: ")
    pascals()
    displaySol()
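Because the recursive script makes roughly one call per entry, Python's recursion limit caps how many rows it can produce. For comparison, here is an iterative sketch that builds each row from the previous one. It is an alternative for illustration, not the script's method, and `pascal_rows` is a made-up name.

```python
def pascal_rows(n):
    """Return the first n rows of Pascal's triangle as a list of lists."""
    rows = []
    for _ in range(n):
        if not rows:
            row = [1]
        else:
            prev = rows[-1]
            # Interior entries are sums of adjacent pairs from the previous row.
            row = [1] + [prev[i] + prev[i + 1] for i in range(len(prev) - 1)] + [1]
        rows.append(row)
    return rows

for row in pascal_rows(5):
    print(row)
```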

C Programming language

Over the summer break I wanted to get a better grounding in the C programming language. I was looking for a decent textbook (preferably free) that was suitable for someone new to C but with an understanding of other languages and programming in general. After perusing a few I settled on:

http://publications.gbdirect.co.uk/c_book/

Working through the book was moderately interesting. Getting bogged down in memory allocation and array stepping did not really seem worthwhile, though. Considering that C++ does not require in-depth knowledge of these concepts, spending time on them was for interest's sake only, with little chance of use in the future. Below is a link to some of the simple exercises I did, for reference:

http://70.40.214.44/sourcecode/CLearning/

Some definite reasons for doing more learning in the C language are:

C


Getting iPhone working on Linux (without jailbreaking)

Steps to using iTunes and iPhone on Ubuntu (using a Windows virtual machine)

  1. Install git, with the following command: sudo apt-get install git-core
  2. Download the iPhone connectivity driver with the following command: git clone git://github.com/dgiagio/ipheth.git
  3. Install VirtualBox PUEL (http://www.virtualbox.org/wiki/Linux_Downloads); ensure it is the PUEL edition, not the OSE edition
  4. Install Samba and share the Music folder (instructions: https://help.ubuntu.com/9.04/serverguide/C/samba-fileserver.html)
  5. Open Sun VirtualBox and create a virtual machine (ensure that USB is enabled in the VM settings and the iPhone filter is selected).
  6. Install iTunes on the VM and access your songs through the shared folder.

All up, this takes about 2 hours at most and is the easiest way to get your iPhone working on Ubuntu without jailbreaking.