DDIA: Chp 10. Batch Processing
Batch processing allows large-scale data transformations through bulk-synchronous processing. The simplicity of this approach made it possible to build reliable, scalable, and maintainable applications. If you recall, "reliable-scalable-maintainable" was what we set out to learn when we began the DDIA book.
The story of MapReduce starts when Google engineers realized there were a lot of repetitive tasks involved in computing over large data. These tasks often involved individually processing elements and then gathering and fusing their output. Interestingly, this bears a striking resemblance to the electromechanical IBM card-sorting machines of the 1940s-50s. MapReduce also drew some inspiration from the map and reduce operations in Lisp: (map square '(1 2 3 4)) gives us (1 4 9 16), and (reduce + '(1 4 9 16)) gives us 30. The key innovation of Google's MapReduce framework was that it simplified parallel processing by abstracting away complex network communication and failure handling. By separating network communication from application logic, MapReduce shielded developers from intricate distributed computing challenges (partial failures, distributed state management, etc.): tasks could be automatically retried without disrupting the core application workflow.
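For comparison, here is the same toy computation in Python; this is just my own illustrative sketch of the functional map/reduce idea, not anything from the book or the paper.

  from functools import reduce

  # Map: apply a function to every element independently.
  squared = list(map(lambda x: x * x, [1, 2, 3, 4]))   # [1, 4, 9, 16]

  # Reduce: fold the mapped results into a single value.
  total = reduce(lambda a, b: a + b, squared)           # 30

  print(squared, total)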
The MapReduce framework's elegance lay in its simplicity: providing a straightforward mechanism for parallelizing computations through map and reduce operations. Though MapReduce had many limitations and inefficiencies, and Google retired the approach by 2014, its impact on distributed computing remains important. It is hard to deny that MapReduce paved the way for commodity distributed computing.
Before beginning to explain MapReduce, the book takes a detour through batch processing in Unix/Linux: modular commands whose output is piped as input to the next command in the sequence. That section is skippable. By the way, I think the original MapReduce paper from OSDI 2004 is worth a read itself. It is written clearly and is easy to follow.
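Just to give a flavor of that Unix philosophy, here is my own toy Python sketch of the same idea: each stage consumes a stream of records and emits a new stream, much like commands connected by pipes. The file name is made up.

  from collections import Counter

  def cat(path):
      # Stream lines from a file, like the Unix cat command.
      with open(path) as f:
          for line in f:
              yield line

  def tokenize(lines):
      # Split each line into lowercase words, like tr/awk in a pipeline.
      for line in lines:
          for word in line.split():
              yield word.lower()

  def top_n(words, n=5):
      # Count and rank, like sort | uniq -c | sort -rn | head.
      return Counter(words).most_common(n)

  print(top_n(tokenize(cat("words.txt"))))  # "words.txt" is a hypothetical input file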
MapReduce and Distributed Filesystems
Hadoop MapReduce leverages HDFS (Hadoop Distributed File System), an open-source reimplementation of the Google File System (GFS). Built on the shared-nothing principle, HDFS radically transformed data storage, moving it from expensive custom hardware to cheap, off-the-shelf nodes. HDFS also enabled indiscriminate data dumping. Unlike traditional massively parallel processing (MPP) databases, which require meticulous upfront data modeling, HDFS allows raw data to be stored with interpretation deferred to consumption. This "schema-on-read" approach, dubbed the "sushi principle" (raw data is better), enables flexible data transformation and multiple interpretations of the same data.
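A rough illustration of schema-on-read, using my own made-up field names: raw records are dumped as-is, and each consumer imposes its own interpretation at read time.

  import json

  # Raw events were dumped to storage with no upfront schema.
  raw_lines = [
      '{"user": "alice", "ts": 1700000000, "page": "/home"}',
      '{"user": "bob", "ts": 1700000042, "page": "/checkout", "amount": 30}',
  ]

  # Consumer A interprets the records as page views.
  page_views = [(json.loads(l)["user"], json.loads(l)["page"]) for l in raw_lines]

  # Consumer B interprets the same raw data as purchases, tolerating missing fields.
  purchases = [json.loads(l).get("amount", 0) for l in raw_lines]

  print(page_views, purchases)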
HDFS consists of a daemon process running on each machine to serve files stored on that machine, with a central NameNode tracking file block placements across machines. This simple design enables creation of a massive fault-tolerant filesystem by leveraging distributed storage on many machines and implementing block replication.
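Conceptually, the NameNode's metadata is little more than a map from file blocks to the DataNodes holding their replicas. A minimal sketch, using my own simplified representation rather than the real HDFS data structures:

  # Simplified NameNode metadata: which DataNodes hold each block of each file.
  block_map = {
      "/logs/2024-06-01.log": {
          "block_0": ["datanode-1", "datanode-4", "datanode-7"],  # 3-way replication
          "block_1": ["datanode-2", "datanode-5", "datanode-8"],
      },
  }

  def locate_block(path, block_id):
      # Return the machines a client should contact to read this block.
      return block_map[path][block_id]

  print(locate_block("/logs/2024-06-01.log", "block_0"))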
Ok, back to MapReduce. The MapReduce job model is centered around two callback functions: the mapper and reducer. Mappers process individual input records independently, extracting key-value pairs, while reducers aggregate and process values associated with specific keys. These two operations, when chained as multi-stage processing jobs, are sufficient to implement complex data transformations.
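The canonical example is word count. Here is a minimal single-machine sketch of the programming model in Python; real Hadoop jobs would be written against the Java MapReduce API, but the shape is the same.

  from collections import defaultdict

  def mapper(record):
      # Emit a (word, 1) pair for every word in the input record.
      for word in record.split():
          yield (word.lower(), 1)

  def reducer(key, values):
      # All values for the same key arrive together; aggregate them.
      yield (key, sum(values))

  def run_job(records):
      groups = defaultdict(list)
      for record in records:                 # map phase
          for key, value in mapper(record):
              groups[key].append(value)
      output = []
      for key, values in groups.items():     # reduce phase
          output.extend(reducer(key, values))
      return output

  print(run_job(["the cat sat", "the dog sat"]))
  # [('the', 2), ('cat', 1), ('sat', 2), ('dog', 1)]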
MapReduce Job Execution
MapReduce job execution strategically assigns map tasks to machines that store replicas of the input files, minimizing network data transfer and making good use of computational resources. First, the framework copies the necessary application code to the assigned machines. Map tasks then process input files record by record, generating key-value pair outputs. Mapper output is partitioned by a hash of the key, so that all key-value pairs for a given key end up at the same reduce task.
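The partitioning step is essentially the following. Hadoop's default partitioner hashes the key modulo the number of reduce tasks; this toy Python version uses a stable hash to make the routing deterministic across processes.

  import zlib

  NUM_REDUCERS = 4

  def partition_for(key, num_reducers=NUM_REDUCERS):
      # Every mapper applies the same deterministic hash, so all key-value
      # pairs for a given key are routed to the same reducer, regardless of
      # which mapper produced them.
      return zlib.crc32(key.encode()) % num_reducers

  print(partition_for("the"), partition_for("cat"))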
Data sorting occurs in stages because of the massive dataset sizes. Mappers partition their output by reducer and write sorted files locally, using techniques similar to SSTables. When the mappers complete, reducers fetch these sorted partition files in a process called the "shuffle". Each reducer merges the mapper files while maintaining sort order, and then applies the user-provided reduce logic to records with identical keys, which now appear sequentially.
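The reducer-side merge can be pictured as a k-way merge of already-sorted runs, followed by grouping on the key. My own sketch, with made-up mapper output:

  import heapq
  from itertools import groupby
  from operator import itemgetter

  # Each mapper has produced a run of (key, value) pairs, already sorted by key.
  run_from_mapper_1 = [("cat", 1), ("the", 1), ("the", 1)]
  run_from_mapper_2 = [("dog", 1), ("sat", 1), ("the", 1)]

  # Merge the sorted runs while preserving sort order, then hand each group
  # of identical keys to the reduce logic in one sequential pass.
  merged = heapq.merge(run_from_mapper_1, run_from_mapper_2, key=itemgetter(0))
  for key, group in groupby(merged, key=itemgetter(0)):
      print(key, sum(v for _, v in group))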
MapReduce jobs are often chained together in workflows, where the output of one job becomes the input of the next. Since Hadoop MapReduce lacks native workflow support, jobs are linked by configuring output and input directory names. Large organizations did have complex workflows with hundreds of MapReduce jobs. To manage these data flows, several higher-level tools were developed, including Pig, Hive, Cascading, Crunch, and FlumeJava.
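The chaining itself is done purely through the filesystem: one job writes to a directory, and the next job is configured to read from it. A hand-wavy sketch, with made-up job and directory names standing in for real Hadoop job submissions:

  def run_mapreduce_job(name, input_dir, output_dir):
      # Placeholder standing in for submitting a real Hadoop job.
      print(f"running {name}: {input_dir} -> {output_dir}")

  run_mapreduce_job("extract-clicks", "/raw/logs", "/tmp/clicks")
  run_mapreduce_job("sessionize", "/tmp/clicks", "/tmp/sessions")
  run_mapreduce_job("daily-report", "/tmp/sessions", "/reports/daily")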
Beyond MapReduce
Back in the 2010s I was flabbergasted as to why people would use MapReduce, because it was so inefficient: so much data movement, so much need for synchronization, and such a limited language (just map and reduce) to express computation. Yes, MapReduce provided parallelism and scalability, but at what COST? Frank McSherry argued that much of that scalability came from "poor baselines and low expectations".
But people cared about usability more, and didn't care if the MapReduce job took an extra 3 hours here and 5 hours there. They were ready to wait anyway, since this is offline batch processing. MapReduce shielded them from distributed systems challenges, and from thinking too hard to develop/program an efficient algorithm, and that was more important to them. MapReduce was too simple and coarse-grained to be efficient, but it paved the way to more efficient dataflow engines. Respect what came before.
New dataflow engines like Spark, Tez, and Flink overcame some of MapReduce's limitations. Unlike MapReduce's rigid map-reduce model, these engines enable more flexible operator connections and smarter data processing. They optimize computational resources by sorting only where needed, intelligently scheduling tasks, reducing unnecessary data movement, and improving data locality. By explicitly modeling the data flow, these systems are able to make nuanced scheduling decisions to speed up processing. The engines also introduce advanced join strategies and minimize computational overhead by supporting in-memory intermediate state, reducing I/O demands, and enabling faster task startup.
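For contrast, the same word count in Spark's RDD API chains the whole dataflow into one program and keeps intermediate state in memory rather than materializing it to HDFS between jobs. This sketch assumes a local PySpark installation, and the input path is made up.

  from operator import add
  from pyspark import SparkContext

  sc = SparkContext("local[*]", "wordcount")

  counts = (
      sc.textFile("input.txt")                 # hypothetical input path
        .flatMap(lambda line: line.split())    # one record per word
        .map(lambda word: (word, 1))           # emit (word, 1) pairs
        .reduceByKey(add)                      # the shuffle happens here, once
  )

  print(counts.take(10))
  sc.stop()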