MapReduce: Simplified Data Processing on Large Clusters

- September 09, 2010

Steve Ko joined our department this semester from UIUC by way of Princeton. He is offering a seminar course on data center computing. I am sitting in his seminar to catch-up on these topics. I thought I might as well write some posts about the papers we discuss in that seminar.

The first post is on the "MapReduce: Simplefied Data Processing on Large Clusters , J. Dean et al., OSDI 2004" paper.

Several companies use mapreduce. Mapreduce helps companies process large data. Mapreduce is popular because it provides a simple abstraction for parallelizing, and hide the meticuluous tasks involved in parallelizing in the datacenter under the hood.

Mapreduce has been inspired by the map reduce operations in Lisp.

(map square '(1 2 3 4))

output is (1 4 9 16)

(reduce + '(1 4 9 16))

output is 30

because of (+ 16 (+ 9 (+ 4 1)))

Google noticed that they keep doing tasks that follow the same pattern: they individually process web pages (map), and then gather-reprocess-fuse (reduce) the output of these individual webpages together.

Mapreduce should actually be called map-shuffle-reduce. Shuffle is the rarely mentioned phase that runs after the map completes. Shuffle takes output of the map, group them together, so reduce can use them. Shuffle is local to the machine's data, so shuffle is local to the individual machine does not involve networking. Reduce, on the other hand, will collect intermediate data from the machines using the network, and run another groupby before running the reduce function.

Mapreduce provides a nice abstraction to the developer, but now let's look at how mapreduce is implemented under the hood. There is a single master that knows about the map-workers and reduce-workers. The master keeps track of how the tasks are progressing. To this end, "worker pull" scheme is used: worker signals it is empty to the master, and the master finds task to assign to the worker. Note that, there is nothing stopping the master to use the same machine/node for as a reduce-worker after using it first as a map worker. For fault-tolerance, the master monitors the worker, if a worker is lagging behind, the master assigns the task to another worker.

Hadoop is the opensource implementation of mapreduce. Hadoop has several subprojects. HDFS is the open source implementation of Google File System. Hbase is bigtable. Zookeeper is the Paxos protocol for membership protocols. Other projects under Hadoop include hive, pig, etc. In contrast to Amazon which provides elastic mapreduce, the Hadoop implementation requires you to specify the number of nodes/computers you will use for your mapreduce application.

There are two directions of improvements suggested for mapreduce. The first direction is improvements in terms of simplifying/extending the programming model. The biggest problem with mapreduce is that for a typical application you don't just have one mapreduce, but a chain of dozens of mapreduces. (Some Yahoo jobs have a chain of 100 mapreduces.) So mapreduce can get complicated. Projects like pig, dryad, hive, sawzall, map-reduce-merge, try to address this problem by providing even higher level abstractions (e.g., SQL) over mapreduce.

The second direction is improvements on the performance the runtime system supporting the mapreduce abstraction. Some example projects are on better scheduling (Late scheduler), better fault handling (Intermediate Storage System), and pipelining of the mapreduces (online mapreduce).

While I don't have experience in data center computing, I can see a couple low hanging fruits from the mapreduce work. One is how to optimize the networking/data-scheduling in the datacenter, provided that we have access to the mapreduce job sourcecodes that are scheduled to run next. One straightforward idea is to choose which machines the reduce-workers (that use the output of map-workers) and the map-workers (that use the output of reduce-workers in previous roun) should run to minimize the data that needs to be moved around.