After hearing the praises of Hadoop I had a brief look at it and the Map/Reduce paradigm. Most of the info came from Wikipedia and the references in its articles, in particular this paper from Jeffrey Dean and Sanjay Ghemawat.
Hadoop is an open-source software framework that implements MapReduce and a distributed file system modelled on the Google File System. The aim is to enable easy and robust deployment of highly parallel data set processing programs. It is important to note that the MapReduce model applies to embarrassingly parallel problems, where the work splits into pieces that can be processed independently. Processing can occur on data that is stored in a database or filesystem.
Map/Reduce refers to the steps in the model:
Map: A master node takes an input problem and divides it into sub-problems, passing them to worker nodes. Worker nodes can further divide sub-problems. For example, in the problem of counting word occurrences in a document, the Map function will output a key/value pair every time it sees a specified word, e.g. (“searchterm”, 1).
Reduce: The reduce function takes the list of values collected for each word (the key) and sums the occurrences.
The output of reduce in this case could be: (foo, 3).
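To make the two steps concrete, here is a minimal single-process sketch of the word count example in Python. It is not Hadoop's API; the function names (`map_words`, `shuffle`, `reduce_counts`, `word_count`) are made up for illustration, and the `shuffle` step stands in for the grouping by key that a real framework does between map and reduce. Unlike the single search term above, this version emits a pair for every word.

```python
from collections import defaultdict


def map_words(document):
    """Map step: emit a (word, 1) pair for every word in the document."""
    for word in document.lower().split():
        yield (word, 1)


def shuffle(pairs):
    """Group values by key (hypothetical stand-in for the framework's shuffle)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(word, counts):
    """Reduce step: sum the list of counts collected for one word."""
    return (word, sum(counts))


def word_count(document):
    """Run map, shuffle and reduce in sequence on a single document."""
    grouped = shuffle(map_words(document))
    return dict(reduce_counts(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    print(word_count("foo bar foo baz foo"))
```

For the input "foo bar foo baz foo" the reduce step receives the list [1, 1, 1] for the key foo and returns (foo, 3), matching the output described above.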
The MapReduce model becomes powerful when you consider that giant datasets can be processed very efficiently on large clusters using it.
The Hadoop runtime takes care of:
- Partitioning of input data
- Scheduling program execution across machines
- Handling machine failures
- Managing inter-machine communication
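The duties in the list above can be imitated on one machine to get a feel for what the runtime hides from the developer. The sketch below is a toy analogue under loud assumptions: threads stand in for machines, a simple retry loop stands in for failure handling, and all names (`partition`, `map_task`, `run_with_retry`, `run_job`) are hypothetical rather than anything Hadoop provides.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def partition(lines, num_splits):
    """Partition the input into num_splits chunks, one per map task."""
    return [lines[i::num_splits] for i in range(num_splits)]


def map_task(chunk):
    """Map task: emit (word, 1) for every word in this chunk of lines."""
    return [(word, 1) for line in chunk for word in line.split()]


def run_with_retry(task, arg, attempts=3):
    """Crude failure handling: re-run a failed task up to `attempts` times."""
    for attempt in range(attempts):
        try:
            return task(arg)
        except Exception:
            if attempt == attempts - 1:
                raise


def run_job(lines, num_workers=2):
    """Partition, schedule map tasks on a pool, then group and reduce."""
    splits = partition(lines, num_workers)
    # Scheduling: the pool decides which worker thread runs which split.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        mapped = list(pool.map(lambda s: run_with_retry(map_task, s), splits))
    # "Communication": collect every worker's pairs and group them by key.
    grouped = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            grouped[key].append(value)
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in grouped.items()}


if __name__ == "__main__":
    print(run_job(["foo bar", "foo baz", "foo"]))
```

The point of the sketch is that only `map_task` and the final summing line are specific to word counting; everything else is plumbing that the developer would not want to rewrite for each job.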
Hadoop aims to enable developers with little distributed programming experience to utilize compute resources such as EC2.
With the emergence of ‘big data’ and the apparent value that can be extracted from massive databases/datastores, many organisations have found the limits of traditional relational databases. Hadoop has such a big buzz because it can push past the processing limits of relational database software and make it possible to extract that value. The video below is a decent explanation of this point by data scientists at SalesForce.com.