Posts

Showing posts from November, 2025

Mitigating Application Resource Overload with Targeted Task Cancellation

Image
The Atropos paper (SOSP'25) argues that overload-control systems are built on a flawed assumption. They monitor global signals (like queue length or tail latency) to adjust admission control (throttling new arrivals or dropping random requests). This works when the bottleneck is CPU or network, but it fails when the real problem is inside the application. This considers only the symptoms but not the source. As a result, it drops the victims rather than the culprits. Real systems often run into overload because one or two unlucky timed requests monopolize an internal logical resource (like buffer pools, locks, and thread-pool queues). These few rogue whales have nonlinear effects. A single ill-timed dump query can thrash the buffer pool and cut throughput in half. A single backup thread combined with a heavy table scan can stall writes in MySQL as seen in Figure 3. The CPU metrics will not show this. Atropos proposes a simple fix to this problem. Rather than throttling or dropping ...

Disaggregated Database Management Systems

Image
This paper is based on a panel discussion from the TPC Technology Conference 2022. It surveys how cloud hardware and software trends are reshaping database system architecture around the idea of disaggregation. For me, the core action is in Section 4: Disaggregated Database Management Systems. Here the paper discusses three case studies (Google AlloyDB, Rockset, and Nova-LSM) to give a taste of the software side of the movement. Of course there are many more. You can find Aurora , Socrates , and Taurus , and TaurusMM reviews in my blog. In addition, Amazon DSQL (which I worked on) is worth discussing soon. I’ll also revisit the PolarDB series of papers , which trace a fascinating arc from active log-replay storage toward simpler, compute-driven designs. Alibaba has been prolific in this space, but the direction they are ultimately advocating remains muddled across publications, which reflect conflicting goals/priorities. AlloyDB AlloyDB extends PostgreSQL with compute–storage disagg...

Taurus MM: A Cloud-Native Shared-Storage Multi-Master Database

Image
This VLDB'23 paper presents Taurus MM, Huawei's cloud-native, multi-master OLTP database built to scale write throughput in clusters between 2 to 16 masters. It extends the single-master TaurusDB design (which we reviewed yesterday) into a multi-master design while following its shared-storage architecture with separate compute and storage layers. Each master maintains its own write-ahead log (WAL) and executes transactions independently; there are no distributed transactions. All masters share the same Log Stores and Page Stores, and data is coordinated through new algorithms that reduce network traffic and preserve strong consistency. The system uses pessimistic concurrency control to avoid frequent aborts on contended workloads. Consistency is maintained through two complementary mechanisms: a new clock design that makes causal ordering efficient, and a new hybrid locking protocol that cuts coordination cost. Vector-Scalar (VS) Clocks A core contribution is the Vector-Scal...

Taurus Database: How to be Fast, Available, and Frugal in the Cloud

Image
This SIGMOD’20 paper presents TaurusDB, Huawei's disaggregated MySQL-based cloud database. TaurusDB refines the disaggregated architecture pioneered by Aurora and Socrates, and provides a simpler and cleaner separation of compute and storage.  In my writeup on Aurora , I discussed how "log is the database" approach reduces network load, since the compute primary only sends logs and the storage nodes apply them to reconstruct pages. But Aurora did conflate durability and availability somewhat and used quorum-based replication of six replicas for both logs and pages. In my review of Socrates , I explained how Socrates (Azure SQL Cloud) separates durability and availability by splitting the system into four layers: compute, log, page, and storage. Durability (logs) ensures data is not lost after a crash. Availability (pages/storage) ensures data can still be served while some replicas or nodes fail. Socrates stores pages separately from logs to improve performance but the e...

TLA+ Modeling of AWS outage DNS race condition

Image
On Oct 19–20, 2025, AWS’s N. Virginia region suffered a major DynamoDB outage triggered by a DNS automation defect that broke endpoint resolution. The issue cascaded into a region-wide failure lasting nearly a full day and disrupted many companies’ services. As with most large-scale outages, the “DNS automation defect” was only the trigger; deeper systemic fragilities ( see my post on the Metastable Failures in the Wild paper ) amplified the impact. This post focuses narrowly on the race condition at the core of the bug, which is best understood through TLA+ modeling. My TLA+ model builds on Waqas Younas’s Promela/Spin version . To get started quickly, I asked ChatGPT to translate his Promela model into TLA+, which turned out to be a helpful way to understand the system’s behavior, much more effective than reading the postmortem or prose descriptions of the race. The translation wasn’t perfect, but fixing it wasn’t hard. The translated model treated the enactor’s logic as a single atom...

Popular posts from this blog

Hints for Distributed Systems Design

My Time at MIT

Scalable OLTP in the Cloud: What’s the BIG DEAL?

Foundational distributed systems papers

Learning about distributed systems: where to start?

Advice to the young

Distributed Transactions at Scale in Amazon DynamoDB

Disaggregation: A New Architecture for Cloud Databases

Making database systems usable

Use of Time in Distributed Databases (part 1)