Posts

Showing posts with the label analytics

Auto-WLM: machine learning enhanced workload management in Amazon Redshift

Image
This paper appeared in Sigmod'23. What? Auto-WLM is a machine learning based *automatic workload manager* currently used in production in Amazon Redshift. I thought this would be a machine learning paper, you know deep learning and stuff. But this paper turned out to be a practical/applied data systems paper. At its core, this paper is about improving query performance and resource utilization in data warehouses, possibly the first for a database system in production at scale.  They are not using deep learning, and rightly so! The main take-away from the paper is that locally-trained simple models (using XGBoost , a decision tree-based model built from the query plan trees) outperformed globally trained models, likely due to their ability to "instance optimize" to specific databases and workloads. They are using simple ML. And it works. Why? This is an important problem. If tuning is done prematurely, resources are unnecessarily wasted, and if it is done too late, overall...

TiDB: A Raft-based HTAP Database

Image
This paper is from VLDB 2020. TiDB is an opensource Hybrid Transactional and Analytical Processing (HTAP) database, developed by PingCap. The TiDB server, written in Go, is the query/transaction processing component; it is stateless, in the sense that  it does not store data and it is for computing only. The underlying key-value store, TiKV, is written in Rust, and it uses RocksDB as the storage engine. They add a columnar store called TiFlash, which gets most of the coverage in this paper. In this figure PD stands for Placement Driver (PD), which is responsible for managing Raft ranges, and automatically moving ranges to balance workloads. PD also hosts the timestamp oracle (TSO), which provides strictly increasing and globally unique timestamps to serve as transaction IDs. Each timestamp includes the physical time and logical time. The physical time refers to the current time with millisecond accuracy, and the logical time takes 18 bits. If you know about CockroachDB/CRDB ( he...

Amazon Redshift Re-invented

Image
This paper (SIGMOD'22) discusses the evolution of Amazon Redshift since 2015 when it launched. Redshift is a cloud data warehouse. Data warehouse basically means a place where analysis/querying/reporting is done for shitload of data coming from multiple sources. Tens of thousands of customers use Redshift to process Exabytes of data daily. Redshift is fully managed to make it simple and cost-effective to efficiently analyze BIG data. The concept art in this blog post are creations of Stable Diffusion. Since its launch in 2015, the use cases for Redshift have evolved, and the teams focused on meeting the following customer needs High-performance execution of complex analytical queries using innovative query execution via C++ code generation Enabling fast scalability in response to changing workloads by disaggregating storage and compute layers Ease of use via incorporating machine learning based autonomics Seamless integration with the AWS ecosystem and other AWS purpose built serv...

Popular posts from this blog

Hints for Distributed Systems Design

My Time at MIT

Scalable OLTP in the Cloud: What’s the BIG DEAL?

Foundational distributed systems papers

Advice to the young

Learning about distributed systems: where to start?

Distributed Transactions at Scale in Amazon DynamoDB

Making database systems usable

Looming Liability Machines (LLMs)

Analyzing Metastable Failures in Distributed Systems