Auto-WLM: machine learning enhanced workload management in Amazon Redshift
This paper appeared in SIGMOD'23.
What?
Auto-WLM is a machine learning based *automatic workload manager* currently used in production in Amazon Redshift.
I thought this would be a machine learning paper, you know, deep learning and stuff. But it turned out to be a practical/applied data systems paper. At its core, the paper is about improving query performance and resource utilization in data warehouses, and Auto-WLM is possibly the first ML-based workload manager deployed in a database system in production at scale.
They are not using deep learning, and rightly so! The main takeaway from the paper is that locally-trained simple models (XGBoost, a decision-tree-based model, trained on features extracted from query plan trees) outperformed globally-trained models, likely due to their ability to "instance optimize" to specific databases and workloads. They are using simple ML. And it works.
Why?
This is an important problem. If tuning is done prematurely, resources are unnecessarily wasted; if it is done too late, overall query performance suffers. Either scenario can result in customer loss!
But this is also a hard problem. Cloud data warehouses need to run mixed workloads (read/write, long/short, local and/or federated queries) which have very different resource requirements. Moreover, workloads are dynamic. In the morning when analysts start their work, there is a peak of short-running dashboard queries. ETL queries tend to run during the night, but can vary widely in complexity and size depending on the accumulated data. Urgent ad-hoc business needs can put a lot of load on a system at unexpected times. As Figure 1 shows, there are significant spikes in the number of concurrent queries. Determining the optimal time to scale horizontally (i.e., add new nodes), increase concurrency, or change scheduling policies is a hard problem. Moreover, as Auto-WLM is in the critical path of every query, it is of utmost importance that the time taken to make decisions does not add significant overhead, especially for short-running queries.
Finally, there are operational concerns! Reliably integrating ML into the critical path of any database management system is difficult. How do you ensure there are no surprises when faced with the long tail of workloads? Whenever queries queue, there is a risk of a disproportionate increase in the response time of short-running queries compared to their execution time, a phenomenon often referred to as head-of-line blocking (HLB). It seems like the query execution system can be a great breeding ground for metastable failures! How do you trust handing this over to an ML-based system?
Components
The system comprises several key components: the query performance model, a query prioritization strategy, an admission controller, a utilization monitor, and an ML trainer.
At the core of Auto-WLM is a query performance model that uses locally-trained XGBoost decision-tree models to predict query execution time and memory needs. This model transforms physical query plan trees into feature vectors, collecting and aggregating information about query operators, estimated costs, and cardinalities. As queries are executed, their observed features and latencies are added to a sliding-window training set, ensuring the model stays current without growing unbounded. Local training instance-/workload-optimizes the cost estimation, bootstrapping from the Postgres query plan trees. The Postgres cost model is still hard-coded and dates back to the 1990s, but I guess it gives good enough relative cost estimates to bootstrap the process, and local training then fits these costs to the instance/workload at hand.
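To make the featurization and sliding-window training concrete, here is a minimal Python sketch of that flavor of model. This is not Redshift's actual code: the plan-node attributes, operator set, window size, and hyperparameters are my own placeholders.

```python
# Minimal sketch of a locally-trained latency model over featurized plan trees.
# Plan nodes are assumed to expose .operator, .est_cost, .est_rows, .children (hypothetical).
from collections import deque
import numpy as np
import xgboost as xgb

OPERATORS = ["scan", "hash_join", "agg", "sort", "broadcast"]  # toy operator set

def featurize(plan_root):
    """Aggregate per-operator counts, estimated costs, and cardinalities into a flat vector."""
    feats = {op: [0, 0.0, 0.0] for op in OPERATORS}  # [count, total est_cost, total est_rows]
    stack = [plan_root]
    while stack:
        node = stack.pop()
        if node.operator in feats:
            f = feats[node.operator]
            f[0] += 1
            f[1] += node.est_cost
            f[2] += node.est_rows
        stack.extend(node.children)
    return np.array([x for op in OPERATORS for x in feats[op]])

class LocalLatencyModel:
    """Latency predictor trained locally on a sliding window of executed queries."""
    def __init__(self, window=10_000):
        self.window = deque(maxlen=window)   # (feature_vector, observed_latency) pairs
        self.model = xgb.XGBRegressor(n_estimators=100, max_depth=6)

    def record(self, plan_root, observed_latency):
        self.window.append((featurize(plan_root), observed_latency))

    def retrain(self):                       # called periodically by the ML trainer
        X = np.stack([f for f, _ in self.window])
        y = np.array([lat for _, lat in self.window])
        self.model.fit(X, y)

    def predict_latency(self, plan_root):
        return float(self.model.predict(featurize(plan_root)[None, :])[0])
```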
The other components support/complement the core query performance model. The query prioritization strategy assigns priorities based on predicted resource needs and implements a weighted round-robin algorithm for selecting queries from different priority bins, balancing priorities with starvation prevention. The admission controller determines query placement (main cluster, short query queue, or concurrency scaling cluster). The utilization monitor tracks resource usage and informs admission decisions. Finally, the ML trainer periodically retrains the performance model on the sliding window of recent query data.
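A toy sketch of the placement decision the admission controller makes, with all thresholds and method names invented for illustration:

```python
# Hedged sketch of the admission decision described above; not Redshift's actual logic.
def admit(query, predicted_latency, utilization_monitor, short_query_threshold):
    if predicted_latency <= short_query_threshold:
        return "short_query_queue"            # SQA path: reserved resources on the main cluster
    if utilization_monitor.has_headroom():    # the utilization monitor informs this check
        return "main_cluster"
    return "concurrency_scaling_cluster"      # burst onto an auxiliary cluster
```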
To provide operational stability, Redshift has put a lot of scaffolding, slack, and safeguards around Auto-WLM. They made many smart engineering decisions to simplify the problem. There is an always-on main cluster, which is augmented by an auxiliary concurrency-scaling cluster. If some queries cannot be executed using currently available resources, Auto-WLM horizontally scales database resources to execute them. Once a concurrency scaling cluster has been idle for a set period, it is detached to minimize unnecessary costs.
Auto-WLM supports 6 levels of user-specified priority and uses a weighted round-robin algorithm for query selection. This approach ensures that even low-priority queries have a non-zero probability of being selected, preventing starvation. The system also supports preemption of lower-priority queries when high-priority queries arrive, with safeguards to prevent excessive resource waste from repeated preemptions. Auto-WLM also allows users to specify query monitoring rules and automatically aborts mispredicted queries that violate them, protecting the cluster from wasting resources on runaway queries.
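Here is roughly what a weighted round-robin over priority bins looks like; the weights below are hypothetical, but they show why even the lowest-priority bin never starves:

```python
# Toy weighted round-robin over priority bins, in the spirit described above.
import random

WEIGHTS = {1: 32, 2: 16, 3: 8, 4: 4, 5: 2, 6: 1}  # priority 1 = highest (hypothetical weights)

def pick_next(queues):
    """queues: dict mapping priority -> FIFO list of waiting queries."""
    candidates = [(p, WEIGHTS[p]) for p, q in queues.items() if q]
    if not candidates:
        return None
    priorities, weights = zip(*candidates)
    chosen = random.choices(priorities, weights=weights, k=1)[0]  # every non-empty bin can win
    return queues[chosen].pop(0)
```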
Short query acceleration (SQA) prioritizes selected short-running queries ahead of longer-running queries and helps to mitigate HLB issues. SQA reserves a small amount of resources on the user's main cluster dedicated to executing short queries. Auto-WLM dynamically adjusts the definition of a "short" query based on the 70th percentile execution time observed each week, ensuring the system adapts to each customer's unique workload characteristics.
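The threshold computation itself is simple; a sketch, assuming we keep a window of recently observed execution times:

```python
# Deriving the "short query" cutoff from the 70th percentile mentioned above
# (the recomputation cadence and data collection are assumed, not shown).
import numpy as np

def short_query_threshold(recent_exec_times_sec):
    """Return the 70th percentile of observed execution times as the SQA cutoff."""
    return float(np.percentile(recent_exec_times_sec, 70))
```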
A queuing theory model (based on predicted execution time of queries) determines the right number of queries to execute concurrently, to achieve the highest throughput. This algorithm is based on Little's Law, a fundamental theorem of queuing theory. By performing simple what-if analysis, Auto-WLM can determine if admitting another query will increase overall throughput. (See Section 4.2.2)
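A back-of-the-envelope version of that what-if check, using Little's Law (L = λW, so throughput ≈ concurrency / average latency). The slowdown model below is my own simplification, not the paper's:

```python
# What-if analysis: admit one more query only if predicted throughput goes up.
def should_admit(running_predicted_latencies, candidate_predicted_latency, slowdown_factor=1.1):
    n = len(running_predicted_latencies)
    if n == 0:
        return True
    avg_now = sum(running_predicted_latencies) / n
    throughput_now = n / avg_now                      # Little's Law: lambda = L / W
    # Assume each running query's latency stretches a bit if we admit one more (hypothetical model).
    stretched = [w * slowdown_factor for w in running_predicted_latencies] + \
                [candidate_predicted_latency * slowdown_factor]
    avg_after = sum(stretched) / (n + 1)
    throughput_after = (n + 1) / avg_after
    return throughput_after > throughput_now
```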
Evaluation
Figure 3 plots average query latency, along with the 50th, 90th, and 99th percentiles of latency. For each scale factor, Auto-WLM achieves optimal or near-optimal latency at each percentile. The largest difference between Auto-WLM and a manual configuration (MP5, MP15, MP20) is in P99 latency for the largest workload, where Auto-WLM performs significantly worse than MP20. This is because Auto-WLM chooses to trade some tail latency for significantly improved median latency. Figure 3 shows that Auto-WLM can automatically set an appropriate multiprocessing level for workloads at multiple scales, relieving the user of a complex and resource-intensive tuning process.
Overall, experimental evaluations of Auto-WLM show that it can achieve optimal or near-optimal latency across various workload scales. The system's ability to differentiate between short and long-running queries and prioritize accordingly significantly reduces latency for quick queries. Although the query predictor's accuracy decreases for longer-running queries, it provides sufficient differentiation for practical applications. The scaffolding/slack/safeguarding in place helps weather mispredictions.
Lessons
I really liked the section on the lessons learned and advice for future academic research.
The findings show that simpler, lower-overhead models often suffice for practical applications, and challenge the notion that more complex models are always better. XGBoost builds decision trees as base learners and uses regularization to improve model generalization. Its computational efficiency, clear feature-importance insights, and smooth handling of missing data made XGBoost a good choice for Auto-WLM. Using XGBoost, Redshift was able to instance-optimize decisions to specific databases and workloads. So the Redshift team suggests that future research should focus on techniques for quickly adapting and efficiently learning the most important data and performance characteristics for common query patterns. They also emphasize the importance of considering resource overhead and latency, not just model accuracy, when developing ML-enhanced database components.
Another crucial lesson was the need for explainability and debugging tools. When ML-driven decisions lead to unexpected outcomes, it's essential to have insights into why a decision was made and direct ways to override or fix it. This emphasizes the importance of holistic evaluation and development of robust systems for monitoring and managing ML-enhanced components in production environments.
Finally, the Redshift team says they observed synergies/efficiencies from sharing models across different database components (I guess this would have to be within the same cluster, given that learning is workload-specific), and they call for more research in this area.
Maybe my one follow-up question to the team would be this: was there an unintuitive thing that Auto-WLM did which, when looked at under the broader system lens, made sense holistically? This would be something that shows up as suboptimal in the narrow evaluation experiments, but turns out to be a holistically optimal decision in an unintuitive manner.
One lesson I draw from this paper is that the query concurrency problem here is simpler because this is OLAP rather than OLTP. There are no lock contention or transaction abort concerns. So the workload manager can train on independent query plan trees to learn CPU and memory usage, without worrying about whether the queries will contend/conflict with each other. The multiprogramming/query concurrency problem is then solved simply by a what-if analysis on Little's Law to see if throughput would increase by admitting another query with its predicted CPU and memory needs. For OLTP, a different WLM model may be needed.
Comments