Posts

Showing posts from December, 2011

Pregel: a system for large-scale graph processing

For large-scale graph processing one way to go is, of course, to use Hadoop and code the graph algorithm as a series of chained MapReduce invocations. MapReduce, however, is a functional language, so using MapReduce requires passing the entire state of the graph from one stage to the next, which is inefficient (as I alluded to at the end of this summary ). Google Pregel provides a simple straightforward solution to the large-scale graph processing problems. The Pregel solution is to use round(superstep)-based synchronized computation at the vertices supported with message-passing between the rounds. Pregel keeps vertices and edges on the machines that perform computation, and uses network transfers only for messages. This way Pregel avoids the communication overhead and programming complexity incurred by MapReduce chained iterations. Model In Pregel, in each iteration (superstep), a vertex can receive messages sent to it in the previous iteration, send messages to other vertices...

Reverse-scooping

"scooping (v): publish a news story before (a rival reporter, newspaper, or radio or television station)." Scooping is what happens when another team beats you to publishing results on a problem. Getting scooped is very frustrating and is dreaded by many PhD students. I heard stories about poor Math PhD students who worked on a problem for years only to discover that they got scooped by a couple months. OK, then what is reverse-scooping? It is a term I coined last year. (Of course, the irony is that after a Google search I discovered that I was almost reverse-scooping someone else  ;-). In reverse-scooping, you solve a problem and publish it first. Then several months later, another team (generally from a better-known university) solves the same problem and publish it at a more visible venue. They get all the credit, their work gets cited a lot, and it is as if your work doesn't exist! Congratulations, you got reverse-scooped. I got reverse-scooped more than a couple...

Scalable SPARQL Querying of Large RDF Graphs

Due to its excellent price/performance ratio, Hadoop has become the kitchen sink of big data management and analysis. Hadoop defaults are, of course, not suitable for every kind of data analysis application. For example using Hadoop on relational data processing incurs a lot of waste/inefficiencies. Similarly for graph data processing Hadoop is very inefficient. This paper by Abadi et. al. shows that Hadoop's efficiency on graph data (for semantic web subgraph matching application) can be improved 1000 times by fixing the following defaults in Hadoop. 1. Hadoop, by default, hash partitions data across nodes. For graph processing this is very inefficient. The paper advocates using a locality-preserving partitioning ( METIS partitioner ) that maps nearby vertices to the same worker as much as possible. 2. Hadoop, by default, replicates each data 3 times. "However, ... the data that is on the border of any particular partition is far more important to replicate than the data...

Popular posts from this blog

Hints for Distributed Systems Design

Learning about distributed systems: where to start?

Making database systems usable

Looming Liability Machines (LLMs)

Advice to the young

Foundational distributed systems papers

Distributed Transactions at Scale in Amazon DynamoDB

Linearizability: A Correctness Condition for Concurrent Objects

Understanding the Performance Implications of Storage-Disaggregated Databases

Designing Data Intensive Applications (DDIA) Book