Paper summary. SLAQ: Quality-driven scheduling for distributed machine learning

This paper (by Haoyu Zhang, Logan Stafman, Andrew Or, and Michael J. Freedman) appeared at SOCC'17.

When you assign a distributed machine learning (ML) application resources at the application level, those resources are allotted for many hours. However, loss improvements usually occur during the first part of the application's execution, so the application is very likely underutilizing its resources for the rest of the time. (Some ML jobs retrain an already trained DNN, or compact a DNN by removing unused parameters, etc., so blindly giving more resources at the beginning and pulling some back later may not work well.)

To avoid this, SLAQ allocates resources to ML applications at the task level, leveraging the iterative nature of ML training algorithms. Each iteration of the ML training algorithm submits tasks to the scheduler with running times of around 10ms-100ms. Spark-based systems already operate this way. (The Dorm paper criticized this iteration-based task scheduling approach, saying that it causes high scheduling overhead and introduces delays while tasks wait to be scheduled, but it offered no analysis to support those claims.)

SLAQ collects "quality" (really measured as "loss") and resource usage information from jobs, uses this to predict the quality improvement of future iterations, and schedules future iterations' tasks based on these predictions.
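To make the prediction step concrete, here is a minimal sketch in Python of how such a loss prediction could look, assuming we fit a simple 1/x-shaped curve to the loss history (weighting recent iterations more heavily) and extrapolate one step ahead. The function name, the model family, and the weighting scheme are illustrative assumptions, not necessarily what SLAQ implements.

```python
import numpy as np

def predict_next_loss(losses, weight_decay=0.9):
    """Predict the next iteration's loss from the history observed so far.

    Sketch: weighted least-squares fit of loss ~ a/iter + b, with weights
    decaying exponentially into the past so recent iterations dominate.
    """
    losses = np.asarray(losses, dtype=float)
    iters = np.arange(1, len(losses) + 1)
    weights = weight_decay ** (len(losses) - iters)      # emphasize recent points
    A = np.vstack([1.0 / iters, np.ones(len(iters))]).T  # design matrix [1/i, 1]
    W = np.diag(weights)
    (a, b), *_ = np.linalg.lstsq(W @ A, W @ losses, rcond=None)
    return a / (len(losses) + 1) + b

# Example: a typical diminishing-returns loss curve.
print(predict_next_loss([1.0, 0.55, 0.40, 0.33, 0.28]))
```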

The paper equates "quality" with "loss", and justifies this by saying:
1) "quality" cannot be defined unless at the application level; so to keep it general let's use "loss"
2) for exploratory training jobs, reaching 90% accuracy is sufficient for quality, and SLAQ enables to get there in a shorter time frame.

On the other hand, this has drawbacks. While delta improvements in loss may correspond to improvements in quality, the long tail of the computation may still be critical for "quality", even when the loss is decreasing very slowly. This is especially true for non-convex applications.

The paper normalizes quality/loss metrics as follows: For a given job, SLAQ normalizes the change in loss values in the current iteration with respect to the largest change it has seen for that job so far.
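A minimal sketch of this normalization, assuming loss_history holds a single job's per-iteration loss values (oldest first); the helper name is made up for illustration:

```python
def normalized_loss_delta(loss_history):
    """Normalize the latest loss change by the largest change seen so far for this job."""
    if len(loss_history) < 2:
        return 0.0
    deltas = [abs(loss_history[i + 1] - loss_history[i])
              for i in range(len(loss_history) - 1)]
    return deltas[-1] / max(deltas)

# Example: improvements shrink over time, so the normalized delta approaches 0.
print(normalized_loss_delta([1.0, 0.5, 0.3, 0.25]))  # 0.05 / 0.5 = 0.1
```

This puts jobs with very different absolute loss values (e.g., cross-entropy vs. squared error) on a comparable scale, so the scheduler can compare their recent progress.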


SLAQ predicts an iteration's runtime simply as the time it would take the N tasks/CPUs to process through S, the size of data processed in an iteration (the minibatch size).
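As a back-of-the-envelope sketch of this runtime model (the function and parameter names are illustrative; the per-CPU throughput would be measured from previous iterations):

```python
def predict_iteration_runtime(data_size_mb, num_cpus, mb_per_s_per_cpu):
    """Estimate an iteration's runtime as data size over aggregate processing rate."""
    return data_size_mb / (num_cpus * mb_per_s_per_cpu)

# Example: a 2 GB minibatch on 16 cores at 50 MB/s per core -> 2.56 s per iteration.
print(predict_iteration_runtime(2048, 16, 50))
```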


For scheduling based on quality improvements, the paper considers a couple of metrics, such as maximizing the total quality and maximizing the minimum quality across jobs. The paper includes a good evaluation section.
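As a rough illustration of the "maximize total quality" policy, here is a greedy sketch that hands out cores one at a time to the job with the largest predicted marginal quality gain. The Job class and its toy concave gain curve are assumptions for illustration; the actual SLAQ scheduler is more involved.

```python
class Job:
    def __init__(self, name, gain_per_core):
        self.name = name
        self.gain_per_core = gain_per_core

    def predicted_gain(self, cores):
        # Toy diminishing-returns curve: each additional core helps less.
        return sum(self.gain_per_core / (i + 1) for i in range(cores))

def allocate_cores(jobs, total_cores):
    """Greedily assign cores to maximize the total predicted quality improvement."""
    allocation = {job: 0 for job in jobs}
    for _ in range(total_cores):
        best = max(jobs, key=lambda j: j.predicted_gain(allocation[j] + 1)
                                     - j.predicted_gain(allocation[j]))
        allocation[best] += 1
    return allocation

# A job early in training (large loss improvements) crowds out a nearly converged one.
jobs = [Job("early-stage", 0.4), Job("nearly-converged", 0.01)]
print({job.name: cores for job, cores in allocate_cores(jobs, 8).items()})
```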

In conclusion, SLAQ improves the overall quality of executing ML jobs and gets them to good quality faster, particularly under resource contention, by scheduling at a finer, task-level granularity based on the observed loss improvements.
