Paper Summary. Proteus: agile ML elasticity through tiered reliability in dynamic resource markets

- September 25, 2017

This paper proposes an elastic ML system, Proteus, that can add/remove transient workers on the fly for exploiting the transient availability of cheap but revocable resources in order to reduce costs and latency of computation. The paper appeared in Eurosys'17 and is authored by Aaron Harlap, Alexey Tumanov, Andrew Chung, Gregory R. Ganger, and Phillip B. Gibbons.

Proteus has two components: AgileML and BidBrain. AgileML extends the parameter-server ML architecture to run on a dynamic mix of stable and transient machines, taking advantage of opportunistic availability of cheap but preemptible AWS Spot instances. BidBrain is the resource allocation component that decides when to acquire and drop transient resources by monitoring current market prices and bidding on new resources when their addition would increase work-per-dollar.

Before delving into AgileML and BidBrain, let's first review the AWS Spot model.

See Spot run

AWS provides always available compute instances, called "on-demand" instances: you get them when you like and keep them as much as you like provided that you pay their fixed hourly rate.

AWS also offers transient compute-instances via the AWS Spot market. You specify a bid price, and if the current market price is under your bid, you get the instance. You only pay the market price, and not your bid-price. In other words, your bid-price is an upperbound on how much you are comfortable for paying hourly. And if the AWS spot market price for the instance goes above your upperbound rate, AWS pulls the instance from you with only a 2-minute advance warning. Even in this case, the silverlining is that the last incomplete hour of computing is not charged for you, so you get some free computing.

As seen in Figure 3, you can save a lot of money if your computing job can exploit AWS Spot instances. (It is peculiar how the peak prices are sometimes up to 10 times higher than the fixed on-demand instances. This is speculated to prevent a high bid that secures long running instances at AWS Spot at a rate lower than EC2.)

Jobs that are especially suitable for AWS Spot's transient/preemptible computing style are embarrassingly parallel data processing tasks, where pieces are not related and where there is no need to maintain long-lived state. For example, for "shallow computing", such as thumbnail generation, there is no harm done with an instance eviction, as there is no need for continuity across the computation. The question the paper investigates is how to make the AWS Spot model work for "deeper computing", such as ML jobs.

While the paper considers this question for the AWS Spot market, the motivation also applies to enterprise computing in the datacenter. As disclosed in the Google Borg paper, Google distinguishes and prioritizes its production services over analytic/batch services. If the production services need more resources, they will be given resources to the extent of preempting them from analytic/batch jobs if need be. On the other hand, when there is an excess of resources, analytic/batch jobs can enjoy them opportunistically.

Stages of AgileML

AgileML has 3 modes/stages as shown in Figure 4. To provide a shorter and more cost-effective computation, AgileML dynamically changes modes based on the availability of cheap transient instances. As the transient to on-demand ratio increases from 1:1 to beyond 15:1, AgileML shifts up from mode 1 up to mode 3. As the ratio decreases, AgileML shifts down from mode 3 down to mode 1.

Stage 1: Parameter Servers Only on Reliable Machines. Stage 1 spreads the parameter-server across reliable machines only, using transient nodes only for stateless workers. This works for most ML applications including K-means, DNN, Logistic Regression, Sparse Coding, as the workers are stateless while the parameter-servers contain the current solution state.
Stage 2: ActivePSs on Transient Machines and BackupPSs on Reliable Machines. For transient to reliable node ratio greater than 1:1, AgileML switches to stage 2. Stage 2 uses a primary-backup model for parameter servers, using transient nodes for an active server (ActivePS) and reliable nodes for the hot standby (BackupPS). This relieves the heavy network load at the few reliable resources by spreading it across the many transient resources. The model parameters are sharded across the set of ActivePS instances. Workers send all updates and reads to the ActivePSs, which push updates in bulk to the BackupPSs. The solution state affected by transient node failures or evictions is recovered from BackupPSs. (For backing up ActivePS to BackupPS, it may be possible to explore a threshold-based update mechanism as outlined in the Gaia paper.)
Stage 3: No Workers on Reliable Machines. Workers colocated with BackupPSs on reliable machines were found to cause straggler effects at transient-to-reliable ratios beyond 15:1. Stage 3 removes these workers, and acts like a sub-case of Stage 2.

Handling elasticity

The elasticity controller component is responsible for changing modes based on the transient-to-reliable ratio and the network bandwidth. It tracks which workers are participating in the computation, assigns a subset of input data to each worker, and starts new ActivePSs.

For stage 2 and stage 3, half of the transient instances are recruited as ActivePSs, as that performed best in the evaluations. This one-half ratio is likely to be specific to using transient instances, as with reliable instances the more PSs the merrier it is.

During start-up, AgileML divides the parameter state into N partitions, where N is the maximum number of ActivePSs that can exist at any one point. By using partitions in this way, AgileML avoids the need to re-shard the parameter state when adding or removing servers, instead re-assigning partitions as needed.

As the ActivePS instances increase and decrease, the elasticity controller re-assigns the parameter-server shards across the ActivePS instances appropriately. If all the ActivePSs are evicted, AgileML transfers to Stage 1. It seems like using a level of indirection was sufficient to get this working.

BidBrain

BidBrain keeps track of historical market prices for transient instances and makes allocation decisions to minimize cost-per-work. An allocation is defined as a set of instances of the same type acquired at the same time and price. Before the end of an allocation's billing hour, BidBrain compares the cost-per-work ratios to decide whether the allocation is renewed or terminated.

Evaluation

The experiments were performed with 3 ML applications.

Matrix Factorization (MF) is a technique (a.k.a. collaborative filtering) commonly used in recommendation systems, such as recommending movies to users on Netflix. The goal is to discover latent interactions between the two entities (e.g., users and movies). Given a partially filled matrix X (e.g., a matrix where entry (i, j) is user i’s rating of movie j), MF factorizes X into factor matrices L and R such that their product approximates X.
Multinomial Logistic Regression (MLR) is a popular model for multi-way classification, often used in the last layer of deep learning models for image classification or text classification. The MLR experiments use the ImageNet dataset with LLC features, containing 64k observations with a feature dimension of 21,504 and 1000 classes.
Latent Dirichlet Allocation (LDA) is an unsupervised method for discovering hidden semantic structures (topics) in an unstructured collection of documents, each consisting of a bag (multi-set) of words. The evaluated LDA solver implements collapsed Gibbs sampling.

The baseline runs all instances on Spot market machines and uses checkpointing to recover progress if evicted. The experiments show about 17% overhead for MF due to checkpointing. Figure 1 illustrates the cost and time benefits of Proteus over the MLR application. Compared to all on-demand, the baseline improves on cost significantly as expected but increases the runtime by 25%. Proteus improves on cost and also manages to achieve reduced runtime.

On average 32% of Proteus's computing is free computing. But aggressively chasing free computing by bidding very close to market price results in high overhead: 4x increase in runtime and higher costs due to frequent evictions.