Due to its excellent price/performance ratio, Hadoop has become the kitchen sink of big data management and analysis. Hadoop's defaults, of course, are not suitable for every kind of data analysis application. For example, using Hadoop for relational data processing incurs a lot of waste and inefficiency; Hadoop is similarly inefficient for graph data processing. This paper by Abadi et al. shows that Hadoop's efficiency on graph data (for a semantic web subgraph matching application) can be improved 1000-fold by fixing the following defaults in Hadoop.
1. Hadoop, by default, hash-partitions data across nodes, which is very inefficient for graph processing. The paper advocates a locality-preserving partitioning (using the METIS partitioner) that maps nearby vertices to the same worker as much as possible.
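To make the contrast concrete, here is a small sketch (my own illustration, not the paper's code) of hash partitioning versus a locality-preserving split. The paper uses the METIS partitioner; the greedy BFS grouping below is only a crude stand-in for the idea of keeping neighbors on the same worker.

```python
from collections import deque

def hash_partition(vertices, num_workers):
    """Hadoop-style default: assign each vertex by hash, ignoring edges."""
    return {v: hash(v) % num_workers for v in vertices}

def bfs_partition(graph, num_workers):
    """Grow each partition by BFS so neighbors tend to land on the same
    worker. A crude stand-in for a real partitioner such as METIS."""
    target = (len(graph) + num_workers - 1) // num_workers  # vertices per worker
    assignment, worker = {}, 0
    for start in graph:
        if start in assignment:
            continue
        queue = deque([start])
        while queue and list(assignment.values()).count(worker) < target:
            v = queue.popleft()
            if v in assignment:
                continue
            assignment[v] = worker
            queue.extend(n for n in graph[v] if n not in assignment)
        if list(assignment.values()).count(worker) >= target:
            worker = min(worker + 1, num_workers - 1)  # start filling next worker
    return assignment

def cut_edges(graph, assignment):
    """Edges crossing workers: each one costs network traffic at query time."""
    return sum(1 for v in graph for n in graph[v]
               if v < n and assignment[v] != assignment[n])

# Two triangles joined by a single bridge edge (2-3):
g = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
locality = bfs_partition(g, 2)       # keeps each triangle on one worker
hashed = hash_partition(list(g), 2)  # scatters vertices by parity of hash
```

On this toy graph the locality-preserving split cuts only the bridge edge, while hash partitioning cuts five of the seven edges.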
2. Hadoop, by default, replicates each data block 3 times. "However, ... the data that is on the border of any particular partition is far more important to replicate than the data that is internal to a partition and already has all of its neighbors stored locally. It is a good idea to replicate data on the edges of partitions so that vertexes are stored on the same physical machine as their neighbors." For this, the paper uses a custom triple replication module.
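The border-replication idea can be sketched in a few lines (my own illustration of the 1-hop case; the paper's triple replication module is more general): each worker stores its owned vertices plus replicas of any neighbor owned elsewhere, so a vertex's full neighborhood is always readable locally.

```python
def replicate_one_hop(graph, owner):
    """For each worker, store its owned vertices plus replicas of every
    neighbor owned by another worker (a 1-hop guarantee: each owned
    vertex's whole neighborhood is available locally)."""
    stored = {}
    for v, w in owner.items():
        stored.setdefault(w, set()).add(v)
        for n in graph[v]:
            stored.setdefault(w, set()).add(n)  # replica if owned elsewhere
    return stored

# Same two-triangle graph, split across two workers at the 2-3 bridge:
g = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
owner = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
stored = replicate_one_hop(g, owner)
```

Only the border vertices 2 and 3 end up replicated; interior vertices already have all their neighbors stored locally, which is exactly the point of the quote above.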
3. Hadoop, by default, uses HDFS and HBase for storing data. These are not optimal for semantic web graph data, which is in RDF form (subject-predicate-object triples). The paper uses RDF-Store for storing the semantic web graph data.
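As a rough illustration of why a dedicated RDF store helps (a minimal sketch of my own, not the paper's system): indexing triples by subject, predicate, and object lets graph-pattern lookups hit an index rather than scanning files, which is roughly what a generic HDFS/HBase layout would amount to.

```python
from collections import defaultdict

class TripleStore:
    """Toy in-memory RDF store: triples indexed by subject, predicate,
    and object so pattern lookups (s, p, o with wildcards) avoid full scans."""
    def __init__(self):
        self.by_s = defaultdict(list)
        self.by_p = defaultdict(list)
        self.by_o = defaultdict(list)

    def add(self, s, p, o):
        t = (s, p, o)
        self.by_s[s].append(t)
        self.by_p[p].append(t)
        self.by_o[o].append(t)

    def match(self, s=None, p=None, o=None):
        """Return triples matching the bound positions (None is a wildcard)."""
        if s is not None:
            candidates = self.by_s[s]
        elif p is not None:
            candidates = self.by_p[p]
        elif o is not None:
            candidates = self.by_o[o]
        else:
            candidates = [t for ts in self.by_s.values() for t in ts]
        return [t for t in candidates
                if (s is None or t[0] == s)
                and (p is None or t[1] == p)
                and (o is None or t[2] == o)]

# Hypothetical LUBM-flavored triples (the names here are illustrative only):
store = TripleStore()
store.add("prof0", "worksFor", "dept0")
store.add("student1", "advisor", "prof0")
store.add("prof0", "teacherOf", "course0")
```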
Daniel's blog mentions that each fix contributes a 10-fold improvement in efficiency, which yields the 1000-fold improvement in total. The experimental results are obtained using the Lehigh University Benchmark (LUBM) for semantic web querying. For queries that take less than a second to compute on a single machine, the single-machine solution was faster than both Hadoop-default and Hadoop-optimized. Of course, for these fast queries a lookup to another worker requires network communication and incurs a relatively large overhead, so Hadoop-default is at a big disadvantage. For slow queries (those that take from 10 to 1000 seconds on a single machine) there were still cases where Hadoop-optimized was 1000 times faster than Hadoop-default.
It would have been nice if the paper included Hadoop-default pseudocode as well as Hadoop-optimized pseudocode; I want to see what (if anything) changed in the code. Here is another noteworthy implementation detail from the paper. The paper had to resort to some customizations in vertex partitioning. "To facilitate partitioning a RDF graph by vertex, we remove triples whose predicate is rdf:type (and other similar predicates with meaning "type"). These triples may generate undesirable connections, because if we included these "type" triples, every entity of the same type would be within two hops of each other in the RDF graph (connected to each other through the shared type object). These connections make the graph more complex and reduce the quality of graph partitioning significantly, since the more connected the graph is, the harder it is to partition it."
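The rdf:type filtering described in the quote is straightforward to express. A sketch of my own; the predicate list and triple shape here are assumptions, not the paper's code:

```python
def strip_type_triples(triples, type_predicates=frozenset({"rdf:type"})):
    """Drop 'type' triples before partitioning: a shared type object would
    put every same-typed entity within two hops of each other and ruin
    partition quality, as the paper notes."""
    return [t for t in triples if t[1] not in type_predicates]

triples = [
    ("student1", "rdf:type", "Student"),  # removed before partitioning
    ("student1", "advisor", "prof0"),     # kept
]
filtered = strip_type_triples(triples)
```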
So, what are the fundamental contributions of this work? After all, it is now becoming a folk theorem that minor modifications/configurations to Hadoop yield large performance improvements. Gün Sirer puts it nicely: "if you have a Hadoop job whose performance you're not happy with, and you're unable to speed it up by a factor of 10, there is something wrong with you." The first technique, locality-preserving distribution of graph data over workers, is a pretty obvious idea; it is simply the most sensible thing to do. The second technique, replicating border vertices, is interesting and promising. However, this technique is inapplicable to graph applications that modify the graph data. The semantic web subgraph matching application did not modify the graph; it only read it. If instead of subgraph matching we had considered a graph-subcoloring application (or any application that modifies the graph), the replication would not be valid, because it would be very hard to maintain consistency among the replicas of the boundary vertices.
For applications that modify the graph, even after fixing the inefficient defaults to sensible alternatives, there would still be inherent inefficiency/waste in Hadoop due to the functional nature of MapReduce programming. For such graph-modifying applications, MapReduce is not a good fit, as it necessitates numerous iterations over the graph data and is wasteful. People don't care about this waste, because in batch execution mode the waste is not obvious/visible. Also, in return for this waste, Hadoop enables hassle-free scale-out, which makes it acceptable. However, for real-time applications that require tight synchronization, the waste manifests as unacceptable latency and has to be dealt with. Obviously, there are other data processing tools for graphs, such as Google Pregel. The paper plans to compare with Pregel, and I also plan to write a summary of Pregel soon.