Wednesday, April 13, 2016

Paper review. GraphLab: A new Framework for Parallel Machine Learning

This GraphLab paper is from 2010. It is written elegantly and it does a good job of explaining the GraphLab abstraction and foundations. A later VLDB 2012 paper presents how to extend this basic GraphLab abstraction to a distributed GraphLab implementation.

It seems like GraphLab has since took off big time. There are workshops and conferences about GraphLab. And GraphLab is now developed by the company Dato (formerly GraphLab inc.). Dato Inc. raised 6.75M$ from Madrona and New Enterprise Associates in A round, and 18.5M$ in B round from Vulcan Capital and Opus Capital, as well as Madrona and New Enterprise Associates.


The motivation for GraphLab was to hit the sweetspot in development of machine learning solutions. "(Then in 2010) Existing high-level parallel abstractions like MapReduce are insufficiently expressive, while low-level tools like MPI and Pthreads leave machine learning (ML) experts repeatedly solving the same design challenges."

GraphLab aimed to express asynchronous iterative algorithms with sparse computational dependencies while ensuring data consistency and achieving good parallel performance. This paper provides an efficient multicore parallel (but not distributed) shared-memory implementation of GraphLab. (The C++ reference implementation of the GraphLab is at

Related work

MapReduce works well on embrassingly data parallel tasks, but is not efficient for graph processing and iterative algorithms. This line from the paper gave me a chuckle. "By coercing efficient sequential ML algorithms to satisfy the restrictions imposed by MapReduce, we often produce inefficient parallel algorithms that require many processors to be competitive with comparable sequential methods." (Frank's laptop also took on against GraphLab.)

DAG abstraction represents parallel computation as a directed acyclic graph with data flowing along edges between vertices that correspond to computation/function. Dryad, Storm, Heron fit here. DAG does not naturally express iterative algorithms.

Dataflow abstraction in Naiad is a related work. This being a 2010 paper there is no comparison to Naiad in the paper. This is what Naiad (2013) has to say about GraphLab. "GraphLab and PowerGraph offer a different asynchronous programming model for graph computations, based on a shared memory abstraction. These asynchronous systems are not designed to execute dataflow graphs so the notion of completeness of an epoch or iteration is less important, but the lack of completeness notifications makes it hard to compose asynchronous computations. Although GraphLab and PowerGraph provide a global synchronization mechanism that can be used to write a program that performs one computation after another, they do not achieve task- or pipeline-parallelism between stages of a computation. Naiad allows programs to introduce coordination only where it is required, to support hybrid asynchronous and synchronous computation."

Petuum is also a closely related work. Petuum uses BSP, with SSP relaxation and nonuniform convergence. It looks like  GraphLab allows more fine-granular scheduling and supports data-graph computations and dependency modeling better. On the other hand, Petuum was designed clean-slate to support machine learning algorithms better, and introduced model-parallelism concept. GraphLab, as you will see in the summary below is principally a graph processing framework, with good applicability for graph-based machine learning tasks. Petuum paper says "Petuum ML implementations can run faster than other platforms (e.g. Spark, GraphLab4), because Petuum can exploit model dependencies, uneven convergence and error tolerance". The paper provides performance comparison experiments with GraphLab.

GraphLab abstraction

The minimalism yet generality of the GraphLab framework remind me of LISP. "The data graph G = (V, E) encodes both the problem specific sparse computational structure and directly modifiable program state. The user can associate arbitrary blocks of data (or parameters) with each vertex and directed edge in G." (Also the set scheduler that enables users to compose custom update schedules is clever and LISPy.)

Figure 3 illustrates the overall GraphLab framework. A GraphLab program is composed of the following parts:
1. A data graph which represents the data and computational dependencies.
2. Update functions which describe local computation
3. A Sync mechanism for aggregating global state
4. A data consistency model (i.e., Fully Consistent, Edge Consistent or Vertex Consistent), which determines the extent to which computation can overlap.
5. Scheduling primitives which express the order of computation and may depend dynamically on the data.

Data Model

The GraphLab data model consists of two parts: a directed data graph and a shared data table. The data graph G = (V, E) encodes both the problem specific sparse computational structure and directly modifiable program state. The user can associate arbitrary blocks of data (or parameters) with each vertex and directed edge in G. To support globally shared state, GraphLab provides a shared data table (SDT) which is an associative map, T: [Key] --> Value, between keys and arbitrary blocks of data. Here, Key can be a vertex, edge, result of a sync computation as explained below, or a global variable, say model parameter.

User defined computation

Computation in GraphLab can be performed either through an update function which defines the local computation, or through the sync mechanism which defines global aggregation. The Update Function is analogous to the Map in MapReduce, and sync mechanism is analogous to the Reduce operation. A GraphLab program may consist of multiple update functions and it is up to the scheduling model to determine which update functions are applied to which vertices and in which parallel order.


The GraphLab update schedule describes the order in which update functions are applied to vertices and is represented by a parallel data-structure called the scheduler. The scheduler abstractly represents a dynamic list of tasks (vertex-function pairs) which are to be executed by the GraphLab engine. This is minimalist yet flexible and general model. Since writing a scheduler is hard, GraphLab provides a default BSP style scheduler or a round-robin scheduler.

Since many ML algorithms (e.g., Lasso, CoEM, Residual BP) require more control over the tasks that are created and the order in which they are executed, GraphLab also allows update functions to add and reorder tasks. For this, GraphLab provides 1) FIFO schedulers that only permit task creation but do not permit task reordering, and 2) prioritized schedules that permit task reordering at the cost of increased overhead.


The paper  demonstrates the expressiveness of the GraphLab framework by designing and implementing parallel versions of belief propagation, Gibbs sampling, Co-EM, Lasso and Compressed Sensing.


GraphLab Wikipedia page

Paper Review. Petuum: A new platform for distributed machine learning on big data


Friday, April 8, 2016

Book review: The War Of Art by Steven Pressfield

I read this book recently and liked it a lot. The book is written by Steven Pressfield. He is also the writer of "The Legend of Bagger Vance" and "Gates of Fire" (arguably the best book about Spartans, and is being used as recommended reading in Army academies). Pressfield definitely knows and respects his craft.

This book is a call for all people, and creative people and writers in particular, to wake up and realize their calling. The book says: "Most of us have two lives. The life we live, and the unlived life within us." Yeah... About that... I know "self-help" books, and books that use "new-agey" language rub many people the wrong way. I am a pragmatic about that. The way I see it, if I can learn some good paradigms, tips, strategy to become more productive, effective, I can look past some of the turn-offs.

The book is organized in 3 parts. The first part talks about resistance, the enemy of creating anything meaningful and worthwhile. This part gives an extended and spot on description and analysis of resistance. You have to know the enemy to have a chance to beat it.

The second part talks about turning professional to beat resistance. I liked this part best. This part has very positivist/pragmatist advice in contrast to the romantic/mysticist theme dominating Part 3, which is also somewhat evident in Part 1.

The third part talks about beyond resistance, and living the life in a higher realm. This part is mostly mysticist --done tastefully, I think. (What do you expect? Pressfield is the author of The Legend of Bagger Vance after all.) There are still several useful spot on observations. One is about territorial versus hierarchical orientation.

Below I am going to paste quotes from the book to give you a taste of it. I liked this book, and recommend it for anyone who wants to understand resistance and improve her craft.

Part 1: Resistance, Defining The Enemy

Any act that rejects immediate gratification in favor of long-term growth, health or integrity … will elicit Resistance.

Resistance has no strength of its own. Every ounce of juice it possesses comes from us. We feed it with power by our fear of it. Master that fear, and we conquer Resistance.

Procrastination is the most common manifestation of Resistance because it's the easiest to rationalize. We don't tell ourselves, "I'm never going to write my symphony." Instead we say, "I am going to write my symphony; I'm just going to start tomorrow."

There's a secret that real writers know that wannabe writers don't, and the secret is this: It's not the writing part that's hard. What's hard is sitting down to write. What keeps us from sitting down is Resistance.

Resistance is invisible, internal, insidious, fueled by fear, only opposes in one direction.

Like a magnetized needle floating on a surface of oil, Resistance will unfailingly point to true North–meaning that calling or action it most wants to stop us from doing. We can use this. We can use it as a compass. We can navigate by Resistance, letting it guide us to that calling or action that we must follow before all others. Rule of thumb: The more important a call or action is to our soul’s evolution, the more Resistance we will feel toward pursuing it.

Part 2: Turning Pro, Combating Resistance

The conventional interpretation is that the amateur pursues his calling out of love, while the pro does it for money. Not the way I see it. In my view, the amateur does not love the game enough. If he did, he would not pursue it as a sideline, distinct from his "real" vocation. The professional loves it so much he dedicates his life to it. He commits full-time. That's what I mean when I say turning pro. Resistance hates it when we turn pro.

Grandiose fantasies are a symptom of Resistance. They're the sign of an amateur. The professional has learned that success, like happiness, comes as a by-product of work. The professional concentrates on the work and allows rewards to come or not come, whatever they like.

Professional is patient, prepared, acts in face of fear, dedicated to mastering technique.

The professional dedicates himself to mastering technique not because he believes technique is a substitute for inspiration but because he wants to be in possession of the full arsenal of skills when inspiration does come. The professional is sly. He knows that by toiling beside the front door of technique, he leaves room for genius to enter by the back.

Part 3: Beyond resistance, the higher realm

We come into this world with a specific, personal destiny. We have a job to do, a calling to enact, a self to become. We are who we are from the cradle, and we're stuck with it. Our job in this lifetime is not to shape ourselves into some ideal we imagine we ought to be, but to find out who we already are and become it.

The most important thing about art is to work. Nothing else matters except sitting down every day and trying.

When we sit down day after day and keep grinding, something mysterious starts to happen. A process is set into motion by which, inevitably and infallibly, heaven comes to our aid. Unseen forces enlist in our cause; serendipity reinforces our purpose.

The territory provides sustenance. -- Runners know what a territory is. So do rock climbers and kayakers and yogis. Artists and entrepreneurs know what a territory is. The swimmer who towels off after finishing her laps feels a hell of a lot better than the tired, cranky person who dove into the pool 30 minutes earlier.

Territory can only be claimed by work. -- When Arnold Schwarzenegger hits the gym, he's on his own turf. But what made it his own are the hours and years of sweat he put in to claim it. The territory doesn't give, it gives back.

The artist must operate territorially. He must do his work for its own sake. To labor in the arts for any reason other than love is prostitution.

Of any activity you do, ask yourself: If I were the last person on earth, would I still do it? If you're all alone on the planet, a hierarchical orientation makes no sense. There's no one to impress. So, if you’d still pursue that activity, congratulations. You're doing it territorially.

The book ends with the following. I think this is a useful perspective/paradigm to cultivate:

If you were meant to cure cancer or write a symphony or crack cold fusion and you don't do it, you not only hurt yourself, even destroy yourself. You hurt your children. You hurt me. You hurt the planet.

Creative work is not a selfish act or a bid for attention on the part of the actor. It's a gift to the world and every being in it. Don't cheat us of your contribution. Give us what you’ve got.

Wednesday, April 6, 2016

Consensus in the Cloud: Paxos Systems Demystified

This our most recent paper, still under submission. It is available as a technical report here. We felt we had to write this paper because we have seen misuses/abuses of Paxos-based coordination services. Glad this is off our chests.

Here is the short pitch for the paper. I hope you like it and find it helpful.

Coordination and consensus play an important role in datacenter and cloud computing. Examples are leader election, group membership, cluster management, service discovery, resource/access management, and consistent replication of the master nodes in services.

Paxos protocols and systems provide a fault-tolerant solution to the distributed consensus problem and have attracted significant attention but they also  generated substantial confusion. Zab, Multi-Paxos, Raft are examples of Paxos protocols.  ZooKeeper, Chubby, etcd are examples of Paxos systems. Paxos systems and Paxos protocols reside in different planes, but even that doesn't prevent these two concepts to be confused. Paxos protocols are useful for low-level components for server replication, whereas Paxos systems have been often shoehorned to that task. The proper use case for Paxos systems is in highly-available/durable metadata management, under the conditions that all metadata fit in main-memory and are not subject to frequent changes.

In order to elucidate the correct use of distributed coordination systems, we compare and contrast popular Paxos protocols and Paxos systems and present advantages and disadvantages for each. Zab and Raft protocols differ from Paxos as they divide execution into epochs: Each epoch begins with a new election, goes into the broadcast phase and ends with a leader failure. Similarly, Paxos systems also have nuances. Chubby uses the MultiPaxos algorithm to achieve linearizability, while Zab lies at the heart of ZooKeeper and provides not only linearizability, but also FIFO order for client requests, enabling the developers to build complex coordination primitives with ease. Etcd system uses Raft as the consensus protocol, and adopts a stateless design and implements certain features very differently than ZooKeeper and Chubby.

We also categorize the coordination use-patterns in cloud into nine broad categories: server replication (SR), log replication (LR), synchronization service (SS), barrier orchestration (BO), service discovery (SD), group membership (GM), leader election (LE), metadata management (MM) and distributed queues (Q). Using these categories, we examine Google and Facebook infrastructures, as well as Apache top-level projects to investigate how they use Paxos protocols and systems.

Finally, we analyze tradeoffs in the distributed coordination domain and identify promising future directions for achieving more scalable distributed coordination systems.

See the paper for more information.

Related links

Paper summary: ZooKeeper: Wait-free coordination for Internet-scale systems

Monday, April 4, 2016

Paper Review. Petuum: A new platform for distributed machine learning on big data

First there was big data. Industry saw that big data was good. Industry made big data storage systems, NoSQL datastores, to store and access the big data. Industry saw they were good. Industry made big data processing systems, Map Reduce, Spark, etc., to analyze and extract information and insights (CRM, business logistics, etc.) from big data. Industry saw they were good and popular, so machine learning libraries are added to these big data processing systems to provide support for machine learning algorithms and techniques.

And here is where this paper makes a case for a redesign for machine learning systems. The big data processing systems produced by the industry are general analytic systems, and are not specifically designed for machine learning from the start. Those are data analytics frameworks first, with some machine learning libraries as add on to tackle machine learning tasks. This paper considers the problem of a clean slate system design for a big data machine learning system: What if we designed and built a big data framework specifically for  machine learning systems, what would it look like?

This paper is from CMU and appeared in KDD 2015.

Machine Learning (ML) objectives and features

Naturally the paper starts by first identifying the objective and features of ML systems. "ML defines an explicit objective function over data where the goal is to attain optimality of this function in the space defined by the model parameters and other intermediate variables."

Thus, the paper argues, ML algorithms have an iterative-convergent structure and  share these principles:
  • nonuniform convergence: model parameters converge at different speeds
  • error tolerance: iterative-convergent algorithms are resilient against errors and converge/heal towards the solution 
  • dynamic structure dependency: iterative-convergent algorithms often have changing correlation strengths between model parameters during the course of execution

The paper proposes Petuum, a new distributed ML framework, to leverage these principles.

Stale Synchronous Parallel (SSP) consistency

Petuum leverages error-tolerance by introducing SSP (Stale Synchronous Parallel) consistency. SSP reduces network synchronization costs among workers, while maintaining bounded staleness convergence guarantees.

While the BSP (Bulk Synchronous Parallel) approach, used in Map Reduce, Giraph, etc., require the workers to synchronize state and exchange messages after each round, SSP cuts some slack. The SSP consistency model guarantees that if a worker reads from parameter server at iteration c, it is guaranteed to receive all updates from all workers computed at least at iteration c-s-1, where s is the staleness threshold. If there is a straggler more than s iterations behind, the reader will stop until the straggler catches up and sends its updates. How do you determine slack, s? That requires ML expertise and experimenting.

Big data, meet Big model!

Another significant idea in the Petuum paper is the dichotomy between big data and big model. Yes, big data is big, on the order of terabytes or petabytes. And the paper observes, we now also have a big model problem: ML programs for industrial-scale big data problems use big models that are 100s of billions of parameters (Figure 1 gives a nice summary). And this big model needs special attention as well.

Here I will use snippets from the paper to give the proper definition of these concepts.

Essentially, model-parallelism provides ability to invoke dynamic schedules that reduce model parameter dependencies across workers, leading to faster convergence. So Petuum uses model-parallelism to leverage nonuniform convergence and dynamic structure dependency to improve the efficiency/performance of ML tasks.

Petuum design

Petuum consists of three main components: Scheduler, parameter server, and workers.

The scheduler is responsible for enabling model parallelism. Scheduler sends subset of parameters to workers via parameter exchange channel. The parameter server stores and updates model paramerters., which can be accessed via a distributed shared memory API by both workers and scheduler. Each worker is responsible for performing operations defined by a user on a partitioned data set and a parameter subset specified by scheduler.

Here is the Petuum API:
schedule: specify the subset of model parameters to be updated in parallel.
push: specify how individual workers compute partial results on those parameters.
pull [optional]: specify how those partial results are aggregtaed to perform the full parameter update.

To illustrate the use of Petuum API, the paper presents the code for a data-parallel Distance Metric Learning (DML) algorithm and for a model parallel Lasso algorithm.

It looks like writing an efficient/optimized schedule would need significant expertise. This will often require running the algorithm on test dataset to see relations between model parameters, convergence speeds, etc.


The paper provides evaluation results on the performance of Petuum.

Petuum versus TensorFlow

How does Petuum compare with Google's TensorFlow? TensorFlow framework can be used to express a wide variety of algorithms, including training and inference algorithms for deep neural network models. And yet, TensorFlow has a completely different design approach than Petuum. Tensorflow uses dataflow graphs. Implementing a ML  algorithm/recipe on TensorFlow has a more distributed nature. An algorithm/recipe consists of many operations, and TensorFlow maps one or multiple operations in this algorithm to a node/worker. In Petuum the entire algorithm/recipe is mapped on a node/worker and efficiency is achieved via data-parallel and model-parallel partitioning.

There would be advantages/disadvantages for each approach, and it will be interesting to watch how this plays out in the coming years.

Related links

Short review of Google TensorFlow