Scalability, but at what COST

- June 14, 2017

This paper is by Frank McSherry, Michael Isard, Derek G. Murray and appeared in HotOS 2015. The authors are all listed as unaffiliated because this is around the time where Microsoft Research Silicon Valley lab was closed, where they used to work. Michael and Derek are at Google working on TensorFlow framework, but Frank McSherry is still at large and unaffiliated. Frank has a great blog, where you will learn more than you ever wanted to know about dataflow, Rust, differential privacy, and the art of influencing people and making friends.

COST, defined per system for a problem, is the configuration required before the system outperforms a competent single-threaded implementation. They show that many big data systems have surprisingly large COST, often hundreds of cores.

Let's repeat this again: some single threaded implementations were found to be more than an order of magnitude faster than published results (at SOSP/OSDI!) for systems using hundreds of cores.

The paper's goal is to shed light on this issue so that "future research is directed toward distributed systems whose scalability comes from advances in system design rather than poor baselines and low expectations." (That has gotta be one of the snarkiest lines in a computer science paper. Well, discounting those from Dijkstra, that is.)

What does better baselines mean? It means using better graph layout and better algorithms for performance. The paper gives as an example the label propagation algorithm. The paper argues that label propagation is used for graph connectivity not because it is a good algorithm, but because it fits within the "think like a vertex" computational model, whose implementations scale well. The paper claims the appealing scaling properties are largely due to the algorithm's sub-optimality, as label propagation does more work than better algorithms.

Yes, and on the other hand, I can also see the appeal in the Giraph "think like a vertex" approach (or for that matter the MapReduce and Spark approaches). Giraph optimized for simplicity and ease-of-development. If you make it simple and easy to use, people will be happy to use it, adapt it, and throw it cluster resources when needed. One may argue this is a good tradeoff. Instead of letting people think harder and make programming harder for them, make it easy but wasteful on computing resources. After all, humans are much more expensive than computers, and scalability in terms of human cost is also a factor for practical/industrial systems. A similar argument for BlockChains has been made here, arguing social scalability is more important than computational-efficiency or even computational-scalability.

Of course this can be a false dichotomy, there are (and will be) systems/frameworks that provide both scalability in terms of human cost (by being easy-to-develop-with) and also computationally efficient. And we should strive to design such systems.

The analysis and evaluation was given/studied in the context of graph algorithms: pagerank and connected components. For embarrassingly parallel algorithms, such as SGD, this analysis and the results would not apply.