I am running behind my blogging the seminar. Here are some short notes on the papers we have been discussing in the Data Center Computing seminar.
The "MapReduce Online" paper by "Tyson Condie, Neil Conway, Peter Alvaro, Joseph M. Hellerstein, Khaled Elmeleegy and Russell Sears" was a reading assignment for the first week, but we didn't have the time to discuss it. This paper is written very simple and provides a clean discussion of Hadoop's inner workings that I felt I should write about the paper just for this reason.
The paper presents a modified Hadoop mapreduce framework that allows data to be pipelined between operators, and supports online aggregation, which allows users to see "early returns" from a job as it is being computed.
In Hadoop, a downstream worker is blocked until the previous worker process the entire data and make it available in entirety. In contrast, pipelining delivers data to downstream operators more promptly. Not only the immediate downstream worker can start before the previous worker completes its work on the entire data, several other downstream workers in the chain after the immediate downstream worker can also get started as well. This increases opportunities for parallelism, improves utilization, and reduces response time. The initial performance results demonstrate that pipelining can reduce job completion times by up to 25% in some scenarios.
While this upto 25% reduction in completion time is something, online mapreduce also comes with its luggage. Adding pipelining to mapreduce introduces significant complexity. Especially, I found the fault-tolerance discussion (checkpointing) in the paper to be complicated and unclear. So, I am not sure I buy the motivation for online mapreduce. The paper mentions as a motivation that getting early approximate results before the mapreduce completes can be useful. But, somehow this feels weak. Since mapreduce is an inherently batch process, the benefits of early approximate results are not very clear to me.
The paper also mentions that online mapreduce enables answering continuous queries, and this motivation looks more promising to me. More results and "incremental results" are computed for a continuous query as more new data arrives. If we had to answer a continuous query via traditional Hadoop, we would have to start a new mapreduce job instance periodically, but then each new mapreduce job does not have access to the computational state of the last analysis run, so this state must be recomputed from scratch.
The paper's coauthor Hellerstein has worked on continuous querying in databases world. The real applications of continuous querying in the cloud are not clear to me yet. So, it will be interesting to see what type of new applications this work will enable in the cloud domain.