## Thursday, January 12, 2017

### Google DistBelief paper: Large Scale Distributed Deep Networks

This paper introduced the DistBelief deep neural network architecture. The paper is from NIPS 2012. If you consider the pace of progress in deep learning, that is old and it shows. DistBelief doesn't support distributed GPU training which most modern deep networks (including TensorFlow) employ. The scalability and performance of DistBelief has been long surpassed.

On the other hand, the paper is a must read if you are interested in distributed deep network platforms. This is the paper that applied the distributed parameter-server idea to Deep Learning. The parameter-server idea is still going strong as it is suitable to serve the convergent iteration nature of machine learning and deep learning tasks. The DistBelief architecture has been used by the Microsoft Adam project, Baidu Deep Image, Apache Hama, and Petuum's Bosen. Google, though, has since switched from the DistBelief parameter-server to TensorFlow's hybrid dataflow architecture, citing the difficulty of customizing/optimizing DistBelief for different machine learning tasks. And of course TensorFlow also brought support for distributed GPU execution for deep learning, which improves performance significantly.

I think another significance of this paper is that it established connections between deep-learning and distributed graph processing systems. After understanding the model-parallelism architecture in DistBelief, it is possible to transfer some distributed graph processing expertise (e.g., locality-optimized graph partitioning) to address performance optimization of deep NN platforms.

## The DistBelief architecture

DistBelief supports both data and model parallelism. I will use the Stochastic Gradient Descent (SGD) application as the example to explain both cases. Let's talk about the simple case, data parallelism first.

### Data parallelism in DistBelief

In the figure there are 3 model replicas. (You can have 10s even 100s of model replicas as the evaluation section of the paper shows in Figures 4 and 5.) Each model replica has a copy of the entire neural network (NN), i.e., the model. The model replicas execute in a data-parallel manner, meaning that each replica works at one shard of the training data, going through its shard in mini-batches to perform SGD. Before processing a mini-batch, each model replica asynchronously fetches from the parameter-server service an update copy of its model parameters $w$. And after processing a mini-batch and computing parameter gradients, $\Delta w$, each model replica asynchronously pushes these gradients to the parameter-server upon which the parameter-server applies these gradients to the current value of the model parameters.

It is OK for the model replicas work concurrently in an asynchronous fashion because the $\Delta$ gradients are commutative and additive with respect to each other. It is even acceptable for the model replicas to slack a bit in fetching an updated copy of the model parameters $w$. It is possible to reduce the communication overhead of SGD by limiting each model replica to request updated parameters only every nfetch steps and send updated gradient values only every npush steps (where nfetch might not be equal to npush). This slacking may even be advantageous in the beginning of the training when the gradients are steep, however, towards converging to an optima when the gradients become subtle, going like this may cause dithering. Fortunately, this is where Adagrad adaptive learning rate procedure helps. Rather than using a single fixed learning rate on the parameter server, Adagrad uses a separate adaptive learning rate $\eta$ for each parameter. In Figure 2 the parameter-server update rule is $w' := w - \eta \Delta w$.  An adaptive learning with large learning rate $\eta$ during convergence, and small learning rate $\eta$ closer to the convergence is most suitable.

Although the parameter-server is drawn as a single logical entity, it is itself implemented in a distributed fashion, akin to how distributed key value stores are implemented. In fact the parameter server may even be partitioned over the model replicas so each model replica becomes the primary server of one partition of the parameter-server.

### Model parallelism in DistBelief

OK now to explain model-parallelism, we need to zoom in each model replica. As shown in the Figure, a model-replica does not need to be a single machine. A five layer deep neural network with local connectivity is shown here, partitioned across four machines called model-workers (blue rectangles). Only those nodes with edges that cross partition boundaries (thick lines) will need to have their state transmitted between machines. Even in cases where a node has multiple edges crossing a partition boundary, its state is only sent to the machine on the other side of that boundary once. Within each partition, computation for individual nodes will be parallelized across all available CPU cores.

When the model replica is sharded over multiple machines as in the figure, this is called *model-parallelism*. Typically the model replica, i.e. the NN, is sharded upto 8 model-worker machines. Scalability suffers when we try to partition the model replica among more than 8 model-workers. While we were able to tolerate slack between the model-replicas and the parameter-server, inside the model-replica the model-workers need to act consistently with respect to each other as they perform forward activation propagation and backward gradient propagation.

For this reason, proper partitioning of the model-replica to the model-worker workers is critical for performance. How is the model, i.e., the NN, partitioned over the model-workers? This is where the connection to distributed graph processing occurs. The performance benefits of distributing the model, i.e., the deep NN, across multiple model-worker machines depends on the connectivity structure and computational needs of the model. Obviously, models with local connectivity structures tend to be more amenable to extensive distribution than fully-connected structures, given their lower communication requirements.

The final question that remains is the interaction of the model-workers with the parameter-server. How do the model workers, which constitute a model-replica, update the parameter-server? Since the parameter-server itself is also distributedly implemented (often over the model replicas), each model-worker needs to communicate with just the subset of parameter server shards that hold the model parameters relevant to its partition. For fetching the model from the parameter-server, I presume the model-workers need to coordinate with each other and do this in a somewhat synchronized manner before starting a new mini-batch.

[Remark: Unfortunately the presentation of the paper was unclear. For example there wasn't a clean distinction made between the term "model-replica" and "model-worker". Because of these ambiguities and the complicated design ideas involved, I spent a good portion of a day being confused and irritated with the paper. I initially thought that each model-replica has all the model (correct!), but each model-replica responsible for updating only part of the model in parameter-server (incorrect!).]

## Experiments

The paper evaluated DistBelief for a speech recognition application and for ImageNet classification application.
The speech recognition task used a deep network with five layers: four hidden layer with sigmoidal activations and 2560 nodes each, and a softmax output layer with 8192 nodes. The network was fully-connected layer-to-layer, for a total of approximately 42 million model parameters. Lack of locality in the connectivity structure is the reason why the speech recognition application did not scale for more than 8 model-worker machines inside a model-replica. When partitioning the model on more than 8 model-workers, the network overhead starts to dominate in the fully-connected network structure and there is less work for each machine to perform with more partitions.

For visual object recognition, DistBelief was used for training a larger neural network with locally-connected receptive fields on the ImageNet data set of 16 million images, each of which we scaled to 100x100 pixels. The network had three stages, each composed of filtering, pooling and local contrast normalization, where each node in the filtering layer was connected to a 10x10 patch in the layer below. (I guess this is a similar set up to convolutional NN which become an established method of image recognition more recently. Convolutional NN has good locality especially in the earlier convolutional layers.) Due to locality in the model, i.e., deep NN, this task scales better to partitioning up to 128 model-workers inside a model replica, however, the speedup efficiency is pretty poor: 12x speedup using 81 model-workers.

Using data-parallelism by running multiple model-replicas concurrently, DistBelief was shown to be deployed over 1000s of machines in total.