
Showing posts with the label aws

Auto-WLM: machine learning enhanced workload management in Amazon Redshift

This paper appeared in SIGMOD'23. What? Auto-WLM is a machine learning based *automatic workload manager* currently used in production in Amazon Redshift. I thought this would be a machine learning paper, you know, deep learning and stuff. But this paper turned out to be a practical/applied data systems paper. At its core, this paper is about improving query performance and resource utilization in data warehouses, possibly the first for a database system in production at scale. They are not using deep learning, and rightly so! The main takeaway from the paper is that locally-trained simple models (using XGBoost, a decision tree-based model built from the query plan trees) outperformed globally trained models, likely due to their ability to "instance optimize" to specific databases and workloads. They are using simple ML. And it works. Why? This is an important problem. If tuning is done prematurely, resources are unnecessarily wasted, and if it is done too late, overall...

Looming Liability Machines (LLMs)

As part of our Zoom reading group (wow, 4.5 years old now), we discussed a paper that uses LLMs for automatic root cause analysis (RCA) for cloud incidents. This was a pretty straightforward application of LLMs. The proposed system employs an LLM to match incoming incidents to incident handlers based on their alert types, predicts the incident's root cause category, and provides an explanatory narrative. The only customization is through prompt engineering. Since this is a custom domain, I think a more principled, custom-designed machine learning system would be more appropriate than adopting LLMs. Anyway, the use of LLMs for RCAs spooked me viscerally. I couldn't find the exact words during the paper discussion, but I can articulate this better now. Let me explain. RCA is serious business. Root cause analysis (RCA) is the process of identifying the underlying causes of a problem/incident, rather than just addressing its symptoms. One RCA heuristic is asking 5 Why...

Beyond the Code: TLA+ and the Art of Abstraction

I have been teaching a TLA+ miniseries inside AWS. I just finished the 10th week, with a one-hour seminar each week. I wanted to pen down my learnings from this experience. The art of abstraction: Let's start with abstraction. Abstraction is a powerful tool for avoiding distraction. The etymology of the word abstract comes from Latin for cut and draw away. With abstraction, you slice out the protocol from a complex system, omit unnecessary details, and simplify a complex system into a useful model. For example, if you are interested in the consistency model of your distributed system, you can abstract away the mechanics of communication in the system when that is an unnecessary distraction. In his 2019 talk, Leslie Lamport said: Abstraction, abstraction, abstraction! That's how you win a Turing Award. Inside Amazon, abstraction is also very valuable. We see skilled engineers look at large systems, slice out a protocol/subsystem, cut to the essence of it, and dive deep to make th...

Distributed Transactions at Scale in Amazon DynamoDB

This paper appeared in July at USENIX ATC 2023. If you haven't read about the architecture and operation of DynamoDB, please first read my summary of the DynamoDB ATC 2022 paper. The big omission in that paper was discussion about transactions. This paper amends that. It is great to see that DynamoDB, and AWS in general, is publishing/sharing more widely than before. Overview: A killer feature of DynamoDB is predictability at any scale. Do read Marc Brooker's post to fully appreciate this feature. Aligned with this predictability tenet, when adding transactions to DynamoDB, the first and primary constraint was to preserve the predictable high performance of single-key reads/writes at any scale. The second big constraint was to implement transactions using update-in-place operations without multi-version concurrency control. The reason for this was that they didn't want to muck with the storage layer, which did not support multi-versioning. Satisfying both of the above constraints may s...

Amazon DynamoDB: A Scalable, Predictably Performant, and Fully Managed NoSQL Database Service (USENIX ATC 2022)

This paper, which appeared in USENIX ATC'22 last week, describes the evolution of the design and implementation of DynamoDB in response to experiences operating it since its launch in 2012. DynamoDB has massive scale. In 2021, during the 66-hour Amazon Prime Day shopping event, Amazon systems made trillions of API calls to DynamoDB, peaking at 89.2 million requests per second. DynamoDB powers Alexa, Amazon.com sites, and all Amazon fulfillment centers. Many AWS services such as AWS Lambda, AWS Lake Formation, and Amazon SageMaker are built on DynamoDB. Moreover, hundreds of thousands of customer applications also use DynamoDB. First, some clarification is in order. DynamoDB != Dynamo. DynamoDB's architecture does not share much with that of the Dynamo system (2007). DynamoDB uses MultiPaxos for replication, for God's sake. Dynamo was a single-tenant system and teams were responsible for managing their own Dynamo installations. The resulting operational complexity becam...
