Optimizations & analysis of BSP graph processing models on public cloudsMapreduce/Hadoop is not very suitable for graph processing (which requires iterating over and over on the same graph), and this led to the Pregel graph processing framework by Google. Pregel is based on the Bulk Synchronous Parallelism (BSP) model. Here is a summary of Pregel if you are not familiar with it. In short, Pregel uses a vertex-centric graph processing model, where the same code is executed on all vertices concurrently. Pregel uses message passing along edges and barrier synchronization at supersteps (i.e., rounds) to iterate over the graph. This paper looks at optimizations and analysis of BSP graph processing frameworks.
This group had access to Microsoft Azure cloud computing framework, and they wanted to experiment with Pregel there, so they implemented Pregel (following the description in the Pregel paper) from scratch in .NET environment. They early on noticed that Pregel gets fairly memory intensive as it holds on to all the messages sent to all the vertices in the worker. They started analyzing it further to see how the memory usage changes in the lifetime of a Pregel program over many supersteps. They discovered that there is a camel hump in the middle of the program lifetime (i.e., in the middle supersteps) for most traditional graph programs, such as all-pairs shortest path and betweenness centrality. This is because these graph programs tend to exchange more messages towards the middle supersteps as the computation flourishes and the number of messages exchanged subdues again as the computation comes closer to termination. (It turns out this hump is not present for PageRank.) This hump, of course, is important as it has an impact on how many workers you need to provision, because you need to provision for the worst-case memory usage of the workers.
So the group goes on to look into how they can constrain this hump to have a predictable memory usage through all supersteps and to facilitate managing the memory constraints of the workers. To this end, they come up with the swath concept. Swath is a subset of vertices of entire graph on which the algorithm is being initiated. Their goal is to pick the swath size that is the best fit into the main memory (amplitude) of the workers. They work on identifying swath initiation heuristics (when are subsequent swaths of verices activated) and swath size heuristics (how many vertices are active concurrently in a swath). They experiment with two approaches, sampling approach and adaptive approach, to determine when the next swath is initiated. By breaking computation into swaths of vertices and using our sizing heuristics, they are able to achieve up to 3.5x speedup over the maximum swath size that does not cause the a failure. Of course, a limitation of the swath approach is that it assumes that the program execution is embarassingly parallel and you can execute the program over swaths distributed in time without causing any correctness issues. So this approach is applicable only to those type of graph programs, such as all-pairs shortest path and betweenness centrality.
The hump observation in memory usage of BSP-based graph processing frameworks is a nice and useful one. We have also worked on BSP-based graph processing frameworks and focused on improving the opensource Giraph implementation of Pregel. We provided serializability to Giraph by introducing an optimization: internal vertices in a worker do not message each other but rather read each others' state directly from the memory of the worker they reside. Our optimization not only provides a stronger serializability to Giraph, but it also prevents this memory camel-hump haunting BSP programs as well. Our paper has been accepted to EuroPar13, and I will write a detailed post about our paper soon.
Multithreaded graph partitioning
This paper describes the design and development of an extension to the Metis partitioner to enable multithreading, and while doing so the paper also thoroughly explores the Metis design space.
The Metis partitioner is a tool for divide a graph into minimally connected and roughly equal parts. While partitioning the graph, the constraint is that the largest partition produced should be smaller than a given size (the worker memory size), and this makes the problem an NP-hard problem. The way Metis works is via using a multilevel paradigm of 1) coarsening, 2) initial partitioning, and 3) uncoarsening. In the first two steps an approximate partitioning is made via coarsening, and then Metis does a refinement on "the bordering vertices" to find better partitioning in the last step. Since the coarse partitioning works over all vertices instead of just border vertices it is generally the bottleneck step.
The paper investigated several alternatives for each of the 3 steps above. For the coarsening step they looked at fine-grain matching (locking based), multi-pass matching, and unprotected matching (which requires a conflict resolution at the end, but this is no problem because only a small percentage of conflicts occurs). For the initial partitioning they tried parallel recursive bisectioning, and parallel k-sectioning. For the refinement step they tried coarse-grain and fine-grain approaches. They give an experimental evaluation of all these different approaches on graph datasets (roadmap and vlsi circuit) that consist of millions of vertices. They evaluated for performance and cut quality, and showed that their multithreaded metis is a clear winner.
One of the lessons learned from the multithreaded metis is that using unprotected operations (for coarsening step) is not that dangerous or crazy, because cleaning up after race conditions turned out to be faster than preventing them. This group made their code open source at http://www.cs.umn.edu/~metis/
Finally, some ramblings
I never understood the people that go all that trouble to travel to a conference who only then sit in the lobby or the room to websurf and do emails hunching on their laptops. If there is a sensible explanation for this behavior can someone tell me, so I can stop wondering about this? Yes, presenters are not always doing a great job at explaining things, but after all that trouble to traveling to the conference, those people owe it to themselves to get the most out of the conference by being there as a whole and not just physically.
My theory about the low quality of some presentations is that the presenters often give presentations to impress and not to teach/educate/communicate. (Ted Herman once warned me about this when I was a graduate student, and I tried to do my best not to fall into that trap ever since.) I believe that by just focusing on the message to communicate and being committed to communicating it, most presenters would be able to do a decent job. Instead presenters seem to feel like they need to show off how technically through and how clever they had been, how sophisticated they are, and the result is a dry, defensive, and incomprehensible presentation. Look, you have been thinking about that problem for the last one year at least, and I am hearing/seeing about it the first time here, and you expect me to understand all the subtleties in that problem space in 10 minutes? If you are committed to teaching/educating/communicating in your allotted 25 minute time slot, you should focus on explaining the insight and the most important technical results (not all of them) in the simplest of terms. You can mention that the problem has several subtleties and you are referring the audience to the technical paper for the full details. I am not saying that you shouldn't be technical; you can be technical but not to the expense of being cryptic or exclusive.