Intermediate storage support for mapreduce for improved fault-tolerance

- September 13, 2010

The second paper we discussed in the first week of the data center computing seminar is Steve Ko's work on improving fault-tolerance in mapreduce jobs. The notes below are just the edited version of the notes I took during Steve's presentation, so mistakes and omissions are most probably mine.

Making Cloud-Intermediate-Data Fault-Tolerant
Steven Y. Ko, Imranul Hoque, Brian Cho, and Indranil Gupta
Proceedings of the ACM Symposium on Cloud Computing 2010 (SOCC), 2010

Problem:

Yes, mapreduce is fault-tolerant in the sense that the master will reassign a failed map or reduce task to another worker to get the job done. But this approach to fault-tolerance adds latency to the jobs. Steve's experiments showed that 1 failure in 20 machines led to a 33% increase in the runtime of a single mapreduce task.

We have a worse problem on our hands when we consider mapreduce job chains. A worker failure in the middle of the chain necessitates a cascaded re-execution all the way back to the top of the chain. Since intermediate data is not saved in mapreduce, the master needs to reschedule all the previous tasks up to that point so that the job can continue. Given that Yahoo webmap is a chain of 100 mapreduce tasks and Google indexing is a chain of 24 mapreduce tasks, the latency due to failed tasks constitutes a big problem.
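To make the cascading cost concrete, here is a minimal sketch of a worst-case model. It assumes (my simplification, not the paper's model) that a failure in one stage of the chain forces every earlier stage to be re-run in full when intermediate data is not saved, and only the failed stage when it is. The function name and the unit stage times are hypothetical.

```python
def rerun_cost(stage_times, failed_stage, intermediate_saved):
    """Time spent on re-execution after a failure in `failed_stage` (0-based).

    Coarse worst-case model: assumes whole stages are re-run, which is a
    simplifying assumption and not a measurement from the paper.
    """
    if intermediate_saved:
        # Input to the failed stage is still available, so only that stage repeats.
        return stage_times[failed_stage]
    # Cascaded re-execution: everything from the top of the chain is redone.
    return sum(stage_times[:failed_stage + 1])


# A 100-stage chain with unit-length stages and a failure in the 50th stage.
chain = [1] * 100
print(rerun_cost(chain, 49, intermediate_saved=False))
print(rerun_cost(chain, 49, intermediate_saved=True))
```

In this toy model the penalty grows linearly with how deep in the chain the failure occurs, which is why long chains are hit hardest.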

And failures do happen routinely in data centers. Google reported an average of 5 worker deaths per mapreduce job in 2006. (Worker thankfully refers to a process, not an employee, in this case :-) In 2009, Yahoo reported that, on average, 50 out of 20000 machines fail.

Solution:

The above discussion shows that intermediate data in mapreduce is important, because when it is lost, it is very costly to regenerate. So, we should treat the intermediate data as a first-class citizen, and store the intermediate data as well.

But there are challenges to doing this right. Storing intermediate data requires CPU and disk access, and the challenge is to minimize the interference of replication with the overall performance of the mapreduce job. Replication, if not done carefully, can make completion time a couple of times longer (see the paragraph below).

Identifying the bottleneck in replication:

We start with step1-Hadoop (no replication) and incrementally add support for replication: step2-read added, step3-read-send added, and finally step4-full-replication. The first three were chosen as control groups to identify where the bottleneck lies along this spectrum up to full replication. The experiment results show a big jump from read to read-send (where network transfer is added), and no observable jump between the other phases. So the bottleneck is in the network transfer of the data.
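The reasoning behind this incremental experiment can be sketched as follows: measure completion time for each configuration in order, and attribute the largest increase to the component added in that step. The function name is hypothetical and the numbers in the example are made-up placeholders, not the paper's measurements.

```python
def largest_jump(measurements):
    """Return (previous_step, step, increase) for the biggest increase.

    `measurements` is an ordered list of (step_name, completion_time) pairs,
    where each step adds one more piece of the replication pipeline.
    """
    best = None
    for (prev_name, prev_time), (name, time) in zip(measurements, measurements[1:]):
        delta = time - prev_time
        if best is None or delta > best[2]:
            best = (prev_name, name, delta)
    return best


# Placeholder completion times (arbitrary units), NOT data from the paper.
runs = [
    ("hadoop", 100),
    ("read", 102),
    ("read-send", 160),
    ("full-replication", 163),
]
print(largest_jump(runs))
```

With these placeholder numbers the jump lands between read and read-send, mirroring the shape of the result described above.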

HDFS does synchronous replication. Making the replication asynchronous helps, but since the bottleneck is the network, the help is not significant. The real problem is that HDFS replicates across different racks, which is why the network bottleneck hurts performance. We should try to get rid of that.

ISS (Intermediate Storage System):

The proposed solution ISS is a combination of two techniques: local replication and rack-level replication.

The local replication technique makes the following observation about a single mapreduce task execution. The shuffle phase between the map and reduce phases provides a natural replication by copying the output of the map worker to the reduce worker machine. The only exception is when the map worker and the reduce worker are both scheduled on the same machine, so we need to replicate explicitly only in this case. For the other case, we just keep track of where the data is replicated before the reduce workers start.
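The decision described above can be sketched in a few lines. This is only an illustration under my own assumptions: partition i of a map output is assumed to go to reducer i, and the function and host names are hypothetical, not ISS code.

```python
def plan_local_replication(map_host, reducer_hosts):
    """Split a map output's partitions into two groups.

    Returns (replica_locations, explicit):
      replica_locations: partition -> host that already holds a copy via shuffle
      explicit: partitions whose reducer shares the map's machine, so the
                shuffle creates no second copy and one must be made explicitly.
    Assumes partition i is consumed by reducer_hosts[i] (illustrative only).
    """
    replica_locations = {}
    explicit = []
    for partition, reduce_host in enumerate(reducer_hosts):
        if reduce_host == map_host:
            explicit.append(partition)
        else:
            # The shuffle already placed a copy on another machine; just record it.
            replica_locations[partition] = reduce_host
    return replica_locations, explicit


locations, explicit = plan_local_replication("m1", ["m2", "m1", "m3"])
print(locations)  # partitions already replicated by the shuffle
print(explicit)   # partitions that still need an explicit copy
```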

If only local replication is used, the overhead of replication is negligible. But notice that local-only replication is applicable only within one mapreduce task. In mapreduce chains, there is no natural replication between the reduce and the next map, so local replication is not applicable.

The rack-level replication technique comes into play to address this problem. The observation in this technique is that while the shared core switches (top-level switches) are over-utilized and contended, the rack switches are under-utilized. So we should replicate within the rack to avoid the network transfers across the core switches, which we had identified as the performance bottleneck.
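As a rough illustration of the placement rule, the sketch below restricts replica candidates to machines in the same rack as the source, so the copy only crosses the rack switch. The host-to-rack map and all names are hypothetical, and how ISS actually chooses among candidates is not covered in these notes.

```python
def rack_local_targets(source, rack_of):
    """List candidate replica hosts in the same rack as `source`.

    `rack_of` maps host name -> rack name (a hypothetical topology map).
    The source itself is excluded; results are sorted for determinism.
    """
    rack = rack_of[source]
    return sorted(
        host for host, host_rack in rack_of.items()
        if host_rack == rack and host != source
    )


topology = {"a1": "rackA", "a2": "rackA", "a3": "rackA", "b1": "rackB"}
print(rack_local_targets("a2", topology))
print(rack_local_targets("b1", topology))  # no same-rack peer available
```

An empty result means the rack has no other machine, in which case some fallback (for example, a cross-rack copy) would be needed.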

Rack-level replication may have a drawback: if the entire rack fails, we lose our intermediate data and need to start from the beginning. However, Google's 2008 data shows that there are only around 20 rack failures per year (mostly planned downtime), so the mean time to failure for racks is very large compared to job execution times, and this risk is insignificant.
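A back-of-the-envelope check of this argument: 20 rack failures per year works out to one rack failure somewhere in the cluster every 438 hours on average. The sketch below also estimates the chance that any rack fails during a job, assuming failures arrive as a Poisson process (my assumption, not from the paper) and using a hypothetical one-hour job. Since this counts failures of any rack, it overestimates the risk to the one rack holding a given replica.

```python
import math

HOURS_PER_YEAR = 365 * 24  # 8760


def mean_hours_between_rack_failures(failures_per_year):
    """Average gap between rack failures anywhere in the cluster."""
    return HOURS_PER_YEAR / failures_per_year


def prob_rack_failure_during_job(job_hours, failures_per_year):
    """Chance of at least one rack failure during the job.

    Assumes a Poisson failure process, which is an illustrative assumption.
    """
    rate_per_hour = failures_per_year / HOURS_PER_YEAR
    return 1.0 - math.exp(-rate_per_hour * job_hours)


print(mean_hours_between_rack_failures(20))   # 438.0 hours
print(prob_rack_failure_during_job(1.0, 20))  # hypothetical 1-hour job
```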

Results from ISS implementation:

The paper includes experiments on the ISS implementation. Under no failure, the performance of Hadoop augmented with ISS (i.e., job completion time) turns out to be comparable to base Hadoop. Under 1 machine failure, Hadoop with ISS incurs only an 18% increase in completion time over the no-fault case, a commendable feat. In contrast, base Hadoop incurs a 59% increase in completion time under 1 machine failure over the no-fault case.

Research ideas:

While rack-level replication is suggested for reducing the overhead of networking in ISS for the sake of fault-tolerance, I am left wondering why the same rack-level data transfer idea is not used for reducing the overhead of networking in normal mapreduce operation for the sake of performance. Can we use a scheduling approach to schedule the next map task to be in the same rack as the completed reduce? Is this too constraining for normal mapreduce operation to be sustainable over long mapreduce chains?

How about a tradeoff? Some replication is also done across racks (which incurs overhead for the sake of fault-tolerance), but that overhead is compensated because the next map task is scheduled in the same rack as the replicated data and avoids network overhead?

Another question: can we use parallel TCP streams to the original and replicated data to improve the overall bandwidth of mapreduce, which could provide further reductions in latency even in normal (fault-free) mapreduce operation?