Pregel: a system for large-scale graph processing

For large-scale graph processing one way to go is, of course, to use Hadoop and code the graph algorithm as a series of chained MapReduce invocations. MapReduce, however, is a functional language, so using MapReduce requires passing the entire state of the graph from one stage to the next, which is inefficient (as I alluded to at the end of this summary).

Google Pregel provides a simple straightforward solution to the large-scale graph processing problems. The Pregel solution is to use round(superstep)-based synchronized computation at the vertices supported with message-passing between the rounds. Pregel keeps vertices and edges on the machines that perform computation, and uses network transfers only for messages. This way Pregel avoids the communication overhead and programming complexity incurred by MapReduce chained iterations.

In Pregel, in each iteration (superstep), a vertex can receive messages sent to it in the previous iteration, send messages to other vertices, modify its own state and its outgoing edges' states, and mutate the graph's topology. This synchronized superstep model is inspired by Valiant’s Bulk Synchronous Parallel model. More specifically, at a superstep S, each active vertex V (in parallel and in isolation) reads messages sent to itself in superstep S-1, send messages to other vertices that will be received at superstep S+1, and modify the state of V and its outgoing edges.

Messages are accumulated and sent in batch-mode along outgoing edges, but a message may be sent to any vertex whose identifier is known. For example, the graph could be a clique, with well-known vertex identifiers V1 through VN, in which case there may be no need to even keep explicit edges in the graph. (This way Pregel reduces to a distributed message-passing programming system with N nodes.) A vertex can inspect and modify the values of out-edges. Conflicts can arise due to concurrent vertex add/remove requests, and are relegated to be resolved by user-defined handlers. This introduces significant complexity and could become a source of programmer errors.

It seems like default Pregel partitioning is not locality-preserving. This was surprising to me as this could cause excessive communication across nodes lead to inefficiency/waste. From the paper: "The default partitioning function is just hash(ID) mod N, where N is the number of partitions, but users can replace it. The assignment of vertices to worker machines is the main place where distribution is not transparent in Pregel. Some applications work well with the default assignment, but some benefit from defining custom assignment functions to better exploit locality inherent in the graph. For example, a typical heuristic employed for the Web graph is to colocate vertices representing pages of the same site."

The user program begins executing on a cluster of machines over the partitioned graph data. One of the machines acts as the master and coordinates worker activity. "The master determines how many partitions the graph will have, and assigns one or more partitions to each worker machine. The number may be controlled by the user. ... Each worker is also given the complete set of assignments for all workers [so that the worker knows which other worker to enqueue messages for its outgoing edges]." Fault-tolerance is achieved by checkpointing and replaying on machine failure. Note that if you write a self-stabilizing graph algorithm, then you can disable fault-tolerance and finish early.

The key to the scalability of Pregel is batch messaging. The message passing model allows Pregel to amortize latency by delivering messages asynchronously in batches between supersteps. Pregel is said to scale to billions of vertices and edges, but I am not sure what this means.  For some graphs, I reckon superhubs would limit scalability significantly. It is not clear if Pregel has mechanisms/optimizations to handle superhubs in some graphs.

Another question that comes to my mind is that how much of the work that is currently done by Hadoop can be (should be) moved to Pregel. I guess for any job where the data can be easily/naturally modeled as a graph (pagerank, social graph analysis, network analysis), Pregel is applicable and may be preferable to Hadoop. Especially, the ability to modify vertices/edges on-the-fly makes Pregel very flexible to accommodate a rich class of applications.

A major downside for Pregel is that it offloads a lot of responsibility to the programmer. The programmer has to develop code for this decentralized vertex-mode with round-based messaging. This model leads to some race-conditions as discussed above and those conflicts are also left to the programmer to deal with.

I am working on a Maestro architecture that can alleviate these problems. (I plan to write about Maestro here soon.) Maestro accepts as input a centralized program and takes care of decentralization and synchronization/locking of shared variables in an efficient manner. Maestro also uses a master for coordinating workers (unsurprisingly). But the master has more responsibility in Maestro; it is involved in synchronizing access to shared variables. (Recall that there are no shared variables in Pregel, so master does not get involved in synchronizing and locking.) In return, Maestro relieves the programmer  from writing decentralized code and handlers for data race conditions among vertices.

Pregel already has an opensource cloud implementation (Golden Orb). My plan next is to modify Golden Orb to see whether we can quickly develop a cloud implementation for Maestro.


Murat said…
Other graph processing frameworks:

- Apache Giraph
- Phoebus
- Bagel
- Hama
- Signal-Collect
- HipG

Popular posts from this blog

Learning about distributed systems: where to start?

Hints for Distributed Systems Design

Foundational distributed systems papers

Metastable failures in the wild

The demise of coding is greatly exaggerated

Scalable OLTP in the Cloud: What’s the BIG DEAL?

The end of a myth: Distributed transactions can scale

SIGMOD panel: Future of Database System Architectures

Why I blog

There is plenty of room at the bottom