SOSP19 Day 1, machine learning session

In this post, I cover the papers that appeared in the first session of the conference: machine learning. (Here is my account of Day 0 and the opening remarks if you are interested.)

I haven't read these papers yet; my notes are based on my understanding of the presentations, so I might have misunderstood some parts. The good news is that all of these papers are available as open access, and I include links to the papers in my notes. Please check the papers when in doubt about my notes.

The machine learning session contained four papers. I found all of them very interesting. They applied principled systems design techniques to machine learning and provided results with broader applicability than a single application. I wanted to enter the machine learning research area three years ago, but I was unsuccessful and concluded that the area is not very amenable to principled systems work. It looks like I had admitted defeat prematurely. After seeing the papers in this session, I am excited again about the fusion and synergy of the machine learning and principled distributed systems areas.

PipeDream: Generalized Pipeline Parallelism for DNN Training

This paper is by Deepak Narayanan (Stanford University), Aaron Harlap (Carnegie Mellon University), Amar Phanishayee (Microsoft Research), Vivek Seshadri (Microsoft Research), Nikhil R. Devanur (Microsoft Research), Gregory R. Ganger (CMU), Phillip B. Gibbons (Carnegie Mellon University), Matei Zaharia (Stanford University).

There are two prevalent approaches to parallelizing DNN training: data parallelism and model parallelism. This paper proposes pipeline-parallel training, which combines data and model parallelism with pipelining. In this approach, the model is first divided into sequential stages, and these stages are then pipelined over the workers. To optimize the pipelining over the workers, the bottleneck stages are identified, and those stages are suitably data-parallelized across multiple workers to prevent the pipeline from stalling while waiting for results from previous stages.
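To make the bottleneck idea concrete, here is a minimal sketch of my own (not PipeDream's actual profiler/optimizer): given per-stage compute times, greedily replicate the slowest stage until all workers are used, since pipeline throughput is bounded by the slowest stage's effective time.

```python
def replicate_bottlenecks(stage_times, num_workers):
    """Greedily assign spare workers to stages to minimize the slowest
    stage's effective time (stage_time / replicas), which bounds the
    pipeline's throughput."""
    replicas = [1] * len(stage_times)
    spare = num_workers - len(stage_times)
    for _ in range(spare):
        # Find the current bottleneck stage and give it one more replica.
        effective = [t / r for t, r in zip(stage_times, replicas)]
        bottleneck = effective.index(max(effective))
        replicas[bottleneck] += 1
    return replicas

# Example: the middle stage is 3x slower, so it receives the extra workers.
print(replicate_bottlenecks([10, 30, 10], num_workers=5))  # [1, 3, 1]
```

The real optimizer also accounts for communication costs between stages; this sketch only balances compute.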

The authors identify three big challenges to realize this idea and develop techniques for addressing these challenges.
  • Partitioning and load balancing operators across workers: They developed a profiler and optimizer to load-balance computation and reduce communication.
  • Scheduling of forward and backward passes of different inputs: They use a bi-directional pipeline, where an input minibatch proceeds through the computation pipeline first forward and then backward. Each active minibatch in the pipeline may be in a different stage, either in the forward pass or backward pass. Once in steady state, each stage alternates between performing its forward pass for a minibatch and its backward pass for an earlier minibatch. This one-forward-one-backward (1F1B) scheduling ensures that every GPU is occupied with a minibatch in a balanced pipeline, with each stage producing outputs in aggregate at roughly the same rate.
  • Managing weights and activation versions for effective learning: The presenter mentioned that naive pipelining leads to weight version mismatches. To prevent this, they store multiple <weight, activation> versions, which they call stashed weights.
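To illustrate the 1F1B idea, here is a toy simulation of my own (not from the paper): each stage admits a few warm-up forward passes to fill the pipeline, then alternates one backward with one forward until all minibatches drain.

```python
def one_f_one_b(num_stages, stage, num_minibatches):
    """Return the work items for one pipeline stage under 1F1B scheduling.
    Warm-up: admit (num_stages - stage) forward passes to keep downstream
    stages fed; then alternate backward/forward in steady state."""
    warmup = num_stages - stage
    schedule = []
    fwd = bwd = 0
    # Warm-up forwards fill the pipeline.
    while fwd < min(warmup, num_minibatches):
        schedule.append(("F", fwd)); fwd += 1
    # Steady state: one backward, then one forward, per step.
    while bwd < num_minibatches:
        schedule.append(("B", bwd)); bwd += 1
        if fwd < num_minibatches:
            schedule.append(("F", fwd)); fwd += 1
    return schedule

# Stage 0 of a 2-stage pipeline, 4 minibatches:
# F0 F1 B0 F2 B1 F3 B2 B3
print(one_f_one_b(num_stages=2, stage=0, num_minibatches=4))
```

Note how after the warm-up, forwards and backwards strictly interleave, which is what keeps every GPU busy in a balanced pipeline.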

They integrated PipeDream with PyTorch in 3000 lines of Python code, hooking into PyTorch's communication library. The project is open source.

Their evaluation results show that PipeDream can provide up to 5 times faster training than data-parallel training. The reason for the speedup is that PipeDream reduces communication among workers: it can incur up to an order of magnitude less communication than data parallelism.

There were some good questions from the audience following the talk.
  • "PipeDream is evaluated on models that are sequential. What about models that are branched, where multiple things need to finish before the next phase?" The presenter answered that PipeDream's techniques and setup can generalize to handle them, but added that most models are sequential.
  • "What about PipeDream's memory footprint?" The presenter said that they are looking into ways to reduce it.
  • "As sparsity changes, it may be possible to do asynchronous training faster than synchronous training. Would it be possible to beat these results using asynchronous training rather than the synchronous training PipeDream performs?" I don't have notes for the answer, but I think the answer is that for DNN training, synchronous training is the method that works best.

A Generic Communication Scheduler for Distributed DNN Training Acceleration

This paper is by  Yanghua Peng (The University of Hong Kong), Yibo Zhu (ByteDance Inc.), Yangrui Chen (The University of Hong Kong), Yixin Bao (The University of Hong Kong), Bairen Yi (ByteDance Inc.), Chang Lan (ByteDance Inc.), Chuan Wu (The University of Hong Kong), Chuanxiong Guo (ByteDance Inc.).

The paper presents a communication scheduler for deep learning training, so it has some relevance to the PipeDream paper as well. This paper focuses only on data parallelism for scaling out training and shows how to optimize it.

The FIFO scheduling strategy does not overlap communication with computation well. Although there has been work to improve scheduling, such as P3 and TicTac, these were limited because they are coupled with specific framework implementations, MXNet and TensorFlow, respectively. In contrast, this work presents a generic tensor scheduling framework, ByteScheduler, which was implemented and evaluated for MXNet, PyTorch, and TensorFlow. The project is available as open source.

The basic insights in this work are:
  • Communication for earlier layers of a neural network has higher priority and can preempt communication for later layers.
  • It is beneficial to partition large layers and merge small layers.
This work uses Bayesian optimization to auto-tune the partitioning and scheduling parameters. The presenter mentioned a technique called "dependency proxy" for getting the scheduling control right, and noted that ByteScheduler adapts to different bandwidths, environments, and tasks.
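The two insights above can be sketched with a simple priority queue (this is my own illustration, not ByteScheduler's code): gradients from earlier layers get higher priority, and large tensors are split into partitions so that a high-priority tensor does not have to wait behind one large low-priority transfer.

```python
import heapq

def schedule_tensors(tensors, partition_size):
    """tensors: list of (layer_index, size). Returns the order in which
    partitions go on the wire: lower layer index first, large tensors split
    into partition_size chunks so they can be preempted between chunks."""
    queue = []
    for layer, size in tensors:
        # Partition large tensors; small ones go out as a single chunk.
        chunks = max(1, -(-size // partition_size))  # ceiling division
        for c in range(chunks):
            # Priority = layer index: earlier layers preempt later ones.
            heapq.heappush(queue, (layer, c))
    return [heapq.heappop(queue) for _ in range(len(queue))]

# Layer 0's gradient (one chunk) is sent before layer 1's three chunks.
print(schedule_tensors([(1, 12), (0, 4)], partition_size=4))
```

In a real system the queue is drained as bandwidth becomes available, so a newly arrived high-priority chunk can jump ahead of a partially sent low-priority tensor; the partitioning is what makes that preemption possible.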

Parity Models: Erasure-Coded Resilience for Prediction Serving Systems

This work is by Jack Kosaian (Carnegie Mellon University), K. V. Rashmi (Carnegie Mellon University), Shivaram Venkataraman (University of Wisconsin-Madison).

Tail latency in inference serving is a problem, resulting in missed deadlines. This work shows how to use erasure codes to reduce tail latency in ML inference. If one replica is slow, the system invokes the parity model and quickly decodes its output to make the deadline.

In this setup, the inference servers host multiple identical models, and a query needs to talk to k of them. The system jointly codes queries from different users and sends them to different inference servers and to a parity server, which holds the parity model. In case an answer is missing, the system decodes the results of the inference servers and the parity server to reconstruct/approximate the missing answer and still make the deadline. The reconstructed output only comes into play when the original predictions are slow or fail. Meeting the deadline requires fast encoding and decoding, and the key to this is the design of the parity model.

The challenge here is that, while handcrafting erasure codes is straightforward for linear operations, it is hard for neural networks, which are nonlinear and complex. To solve this problem, the authors apply a learning-based approach to achieve erasure-coded resilience for NNs. This approach reconstructs approximations of the missing outputs, which is appropriate for machine learning inference.
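To see why the linear case is straightforward, here is a toy illustration of mine (not the paper's code) with k = 2 queries and a linear model: the parity query is the elementwise sum of the two queries, and linearity gives exact decoding by subtraction.

```python
def linear_model(x, weights=(2.0, -1.0, 0.5)):
    """A purely linear 'model': a dot product with fixed weights."""
    return sum(w * xi for w, xi in zip(weights, x))

def encode(x1, x2):
    """Parity query: elementwise sum of the two queries."""
    return [a + b for a, b in zip(x1, x2)]

x1, x2 = [1.0, 2.0, 3.0], [4.0, 0.0, -2.0]
parity_pred = linear_model(encode(x1, x2))  # runs on the parity server

# If the server holding x2's prediction is slow or failed, reconstruct it:
# linearity gives F(x1 + x2) - F(x1) == F(x2).
reconstructed = parity_pred - linear_model(x1)
print(reconstructed == linear_model(x2))  # True
```

For a nonlinear neural network F, F(x1 + x2) - F(x1) is no longer F(x2), which is exactly why the parity model has to be *learned* to produce approximate reconstructions.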

The code used for training and evaluating parity models is available as open source. The paper showcases parity models in the presence of resource contention and includes an extensive evaluation.

One of the questions for the presenter was, "For tasks that are mission-critical, such as self-driving cars, would this model accuracy difference from the parity model be confusing?" The answer is yes, it could be. So this may be more suitable for ad-serving-like applications than for mission-critical ones. I personally think the technique here is more impressive than the problem solved. The parity models idea brings the benefits of erasure codes to ML inference, and this technique should be applicable and customizable/specializable to a rich set of problems in the domain. I was wondering if it could be applied to decomposing two images from a given stereoscopic image.

Jack, a third year PhD student at CMU, presented this paper. He gave one of the best presentations of the conference, with confidence, great stage presence, and great communication skills. I later learned that this was his first presentation at any conference ever.

TASO: Optimizing Deep Learning Computation with Automated Generation of Graph Substitutions

This paper is by Zhihao Jia (Stanford University), Oded Padon (Stanford University), James Thomas (Stanford University), Todd Warszawski (Stanford University), Matei Zaharia (Stanford University), Alex Aiken (Stanford University).

Existing rule-based DNN optimizations rewrite/substitute a graph with a more efficient version. In TensorFlow this is implemented by 2000 rewrite rules in 53K lines of code. Unfortunately, with this approach, adding new operators and graph structures requires ever more rules. Moreover, these heuristics do not apply to all DNN hardware: the rewrite rules miss subtle optimizations for specific DNNs and hardware. Different hardware needs different optimizations; one optimization does not work for all. (To motivate the hardware dependence of the optimizations, the presentation gave an example with TASO where the final graph is 30% faster on V100 but 10% slower on K80.)

To address these pain points in optimizing deep learning, this work proposes TASO, a tensor algebra super-optimizer. TASO replaces manually designed graph optimizations with automated generation and verification. TASO can then feasibly produce optimizations tailored to the hardware backend, rather than relying on generic rules. While TensorFlow currently contains approximately 53,000 lines of manual optimization rules, the operator specifications needed by TASO are only 1,400 lines of code.

The challenge in developing TASO is twofold: how to generate potential substitutions, and how to verify their correctness.

There are 66 million graphs with up to 4 operators. To make this number manageable, the graph substitution generator computes output fingerprints and considers only pairs of graphs with identical output fingerprints. It finds 29K substitutions, out of which only 743 remain after pruning to eliminate redundant substitutions. These 743 substitutions are generated in 5 minutes and verified against 43 operator properties in 10 minutes. This is done per hardware and per operator, so supporting a new operator requires only a few hours of an engineer's time to provide the operator specifications.
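The fingerprinting step can be sketched as follows (my own illustration, not TASO's implementation): evaluate candidate expressions on a few probe inputs and group those whose outputs match. Only graphs with identical fingerprints are candidate substitutions, so only those pairs go on to the verifier.

```python
def fingerprint_groups(exprs, probes):
    """exprs: dict of name -> callable. Groups expressions by the tuple of
    their outputs on the probe inputs; returns groups with >1 member,
    i.e., the candidate substitution sets."""
    groups = {}
    for name, fn in exprs.items():
        # The fingerprint is the tuple of outputs on the probe inputs.
        fp = tuple(fn(*xs) for xs in probes)
        groups.setdefault(fp, []).append(name)
    return [sorted(g) for g in groups.values() if len(g) > 1]

exprs = {
    "a*(b+c)": lambda a, b, c: a * (b + c),
    "a*b+a*c": lambda a, b, c: a * b + a * c,  # equivalent: same fingerprint
    "a*b-c":   lambda a, b, c: a * b - c,      # different fingerprint
}
# Probe inputs would be random in practice; fixed here for determinism.
probes = [(1, 2, 3), (2, -1, 4), (-3, 5, 2)]
print(fingerprint_groups(exprs, probes))  # [['a*(b+c)', 'a*b+a*c']]
```

Matching fingerprints only suggest equivalence; they do not prove it, which is why the surviving substitutions still go through formal verification.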

In sum, TASO provides less engineering effort, better performance, and formal verification for optimizing deep learning computation graphs. The code for TASO is available as open source on GitHub.

