Posts

Disaggregated Database Management Systems

This paper is based on a panel discussion from the TPC Technology Conference 2022. It surveys how cloud hardware and software trends are reshaping database system architecture around the idea of disaggregation. For me, the core action is in Section 4: Disaggregated Database Management Systems. Here the paper discusses three case studies (Google AlloyDB, Rockset, and Nova-LSM) to give a taste of the software side of the movement. Of course there are many more. You can find my reviews of Aurora, Socrates, Taurus, and TaurusMM on this blog. In addition, Amazon DSQL (which I worked on) is worth discussing soon. I’ll also revisit the PolarDB series of papers, which trace a fascinating arc from active log-replay storage toward simpler, compute-driven designs. Alibaba has been prolific in this space, but the direction they are ultimately advocating remains muddled across publications, which reflect conflicting goals/priorities. AlloyDB AlloyDB extends PostgreSQL with compute–storage disagg...

Taurus MM: A Cloud-Native Shared-Storage Multi-Master Database

This VLDB'23 paper presents Taurus MM, Huawei's cloud-native, multi-master OLTP database built to scale write throughput in clusters of 2 to 16 masters. It extends the single-master TaurusDB design (which we reviewed yesterday) into a multi-master design while following its shared-storage architecture with separate compute and storage layers. Each master maintains its own write-ahead log (WAL) and executes transactions independently; there are no distributed transactions. All masters share the same Log Stores and Page Stores, and data is coordinated through new algorithms that reduce network traffic and preserve strong consistency. The system uses pessimistic concurrency control to avoid frequent aborts on contended workloads. Consistency is maintained through two complementary mechanisms: a new clock design that makes causal ordering efficient, and a new hybrid locking protocol that cuts coordination cost. Vector-Scalar (VS) Clocks A core contribution is the Vector-Scal...

Taurus Database: How to be Fast, Available, and Frugal in the Cloud

This SIGMOD’20 paper presents TaurusDB, Huawei's disaggregated MySQL-based cloud database. TaurusDB refines the disaggregated architecture pioneered by Aurora and Socrates, and provides a simpler and cleaner separation of compute and storage. In my writeup on Aurora, I discussed how the "log is the database" approach reduces network load, since the compute primary only sends logs and the storage nodes apply them to reconstruct pages. But Aurora did conflate durability and availability somewhat and used quorum-based replication across six replicas for both logs and pages. In my review of Socrates, I explained how Socrates (Azure SQL Cloud) separates durability and availability by splitting the system into four layers: compute, log, page, and storage. Durability (logs) ensures data is not lost after a crash. Availability (pages/storage) ensures data can still be served while some replicas or nodes fail. Socrates stores pages separately from logs to improve performance but the e...
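To make the quorum point concrete, here is a minimal sketch of the overlap conditions that a quorum scheme over six replicas has to satisfy. The write quorum of 4 and read quorum of 3 are the values reported in the Aurora paper, not something stated in the excerpt above, and the function name `quorum_is_safe` is made up for illustration.

```python
def quorum_is_safe(n: int, write_q: int, read_q: int) -> bool:
    """Return True if read and write quorums over n replicas always overlap."""
    # Any read quorum must intersect any write quorum to observe the latest write.
    reads_see_writes = read_q + write_q > n
    # Any two write quorums must intersect so conflicting writes cannot both succeed.
    writes_intersect = 2 * write_q > n
    return reads_see_writes and writes_intersect


# Assumed values: six replicas, write quorum 4, read quorum 3.
aurora_like = quorum_is_safe(6, 4, 3)
# Shrinking the write quorum to 3 breaks both overlap conditions.
too_weak = quorum_is_safe(6, 3, 3)
```

The sketch only checks the arithmetic; it says nothing about how Aurora, Socrates, or TaurusDB actually place or repair replicas.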

TLA+ Modeling of AWS outage DNS race condition

On Oct 19–20, 2025, AWS’s N. Virginia region suffered a major DynamoDB outage triggered by a DNS automation defect that broke endpoint resolution. The issue cascaded into a region-wide failure lasting nearly a full day and disrupted many companies’ services. As with most large-scale outages, the “DNS automation defect” was only the trigger; deeper systemic fragilities (see my post on the Metastable Failures in the Wild paper) amplified the impact. This post focuses narrowly on the race condition at the core of the bug, which is best understood through TLA+ modeling. My TLA+ model builds on Waqas Younas’s Promela/Spin version. To get started quickly, I asked ChatGPT to translate his Promela model into TLA+, which turned out to be a helpful way to understand the system’s behavior, much more effective than reading the postmortem or prose descriptions of the race. The translation wasn’t perfect, but fixing it wasn’t hard. The translated model treated the enactor’s logic as a single atom...

Barbarians at the Gate: How AI is Upending Systems Research

This recent paper from the Berkeley Sky Computing Lab has been making waves in the systems community. Of course, Aleksey and I did our live blind read of it, which you can watch below. My annotated copy of the paper is also available here. This is a fascinating and timely paper. It raises deep questions about how LLMs will shape the research process, and what that could look like. Below, I start with a short technical review, then move to the broader discussion topics. Technical review The paper introduces the AI-Driven Research for Systems (ADRS) framework. By leveraging the OpenEvolve framework, ADRS integrates LLMs directly into the systems research workflow to automate much of the solution-tweaking and evaluation process. As shown in Figure 3, ADRS operates as a closed feedback loop in which the LLM ensemble iteratively proposes, tests, and refines solutions to a given systems problem. This automation targets the two most labor-intensive stages of the research cycle, solution tweaking...
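The propose/test/refine loop described in the excerpt above can be sketched in a few lines. This is a toy under stated assumptions, not OpenEvolve's API or the paper's implementation: `evolve`, `propose`, and `evaluate` are hypothetical names, a random perturbation stands in for the LLM ensemble, and a simple scoring function stands in for the evaluator.

```python
import random
from typing import Callable


def evolve(propose: Callable[[float], float],
           evaluate: Callable[[float], float],
           seed_solution: float,
           rounds: int = 50) -> float:
    """Repeatedly propose and test variants, keeping the best candidate seen."""
    best = seed_solution
    best_score = evaluate(best)
    for _ in range(rounds):
        candidate = propose(best)      # stand-in for the LLM suggesting a tweak
        score = evaluate(candidate)    # stand-in for running the evaluator
        if score > best_score:         # refine: keep only strict improvements
            best, best_score = candidate, score
    return best


rng = random.Random(0)  # fixed seed so the toy run is repeatable
# Toy problem: find x that maximizes -(x - 3)^2, starting from x = 0.
result = evolve(lambda x: x + rng.uniform(-1, 1),
                lambda x: -(x - 3) ** 2,
                seed_solution=0.0)
```

Because the loop only ever accepts improvements, the final score is never worse than the seed's. Real systems of this kind typically keep a population of candidates rather than a single best one, which this sketch omits.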

Academic chat: On PhD

This week, Aleksey and I met not to dissect a research paper, but to chat about "the process of PhD". I had recently written a post titled "The Invisible Curriculum of Research", where I framed research as an iceberg, with the small visible parts (papers, conferences) resting on the hidden 5 Cs: Curiosity/Taste: what problems are worth solving. Clarity: how to ask precise and abstracting questions. Craft: writing, experimentation, presentation. Community: collaboration and contribution. Courage: resilience through setbacks. Above is the video of our chat, with a lot of personal anecdotes and a few rants. But if you want to cut to the chase, the highlight reel is below. What a PhD Really Produces The real product of a PhD is not the thesis, but you, the researcher! The thesis is just the residue of this long internal transformation. Like martial arts, the training breaks you and rebuilds you into someone who sees and thinks differently. This transformation cannot be ...

Tiga: Accelerating Geo-Distributed Transactions with Synchronized Clocks

This paper (to appear at SOSP'25) is one of the latest efforts exploring the dream of a one-round commit for geo-replicated databases. TAPIR tried to fuse concurrency control and consensus into one layer. Tempo and Detock went further using dependency graphs. Aleksey and I did our usual thing. We recorded our first blind read of the paper. I also annotated a copy while reading, which you can access here. We liked the paper overall. This is a thoughtful piece of engineering, not a conceptual breakthrough. It uses future timestamps to align replicas in a slightly new way, and the results are solid. But the presentation needs refinement and stronger formalization. (See our live-reading video for how these problems manifested themselves.) Another study to add to my survey, showing how, with modern clocks, time itself is becoming a coordination primitive. The Big Idea Tiga claims to do strictly serializable one-shot (multi-shot ok with reconnaissance queries t...

The Invisible Curriculum of Research

Courses, textbooks, and papers provide the formal curriculum of research. But there is also an invisible curriculum. Unwritten rules and skills separate the best researchers from the rest. I did get an early education on this thanks to my advisor, Anish. He kept mentioning "taste", calling some of my observations and algorithms "cute", and encouraging me to be more curious and creative and to develop my "taste". Slowly, I realized that what really shapes a research career isn't written in any textbook or taught in any course. You learn it by osmosis from mentors, and through missteps: working on the wrong problem, asking shallow questions, botching a project, giving up too soon. But if you can absorb these lessons faster, you will find research more fulfilling. The visible curriculum teaches you how to build a car. The invisible curriculum teaches you where to go, who to ride with, and how to keep going when the road turns uphill. After 25 years of exp...

Four Ivies. Two days.

This is my long-overdue trip report from last summer: July 10–11, 2024. We toured Ivy League campuses to help our rising senior son weigh his options, with our two daughters (our kids are four years apart each) tagging along for an early preview. Day one was Yale and Brown, followed by a night in New Jersey. Day two took us to Princeton and UPenn, then the long drive back to Buffalo. Of course we drove; that's how we roll. Prelude Lining up campus tours is its own sport. They are booked months in advance. Pro-tip: when your kid is born, call the colleges to reserve their campus visit. We lucked into two open slots, then hacked together a Python script to snipe cancellations and grabbed the other two. Not proud of this, but that's what it takes if you don't book months in advance. The U.S. college admissions process is Byzantine. It is a weird mix of ritual and performance. There are entire books about how to write the college essay. I have plenty to say about the so-cal...

Supporting our AI overlords: Redesigning data systems to be Agent-first

This Berkeley systems group paper opens with the thesis that LLM agents will soon dominate data system workloads. These agents, acting on behalf of users, do not query like human analysts or even like the applications written by them. Instead, the LLM agents bombard databases with a storm of exploratory requests: schema inspections, partial aggregates, speculative joins, rollback-heavy what-if updates. The authors call this behavior agentic speculation. Agentic speculation is positioned as both the problem and the opportunity. The problem is that traditional DBMSs are built for exact intermittent workloads and cannot handle the high-throughput redundant and inefficient querying of LLM agents. The opportunity also lies here. Agentic speculation has recognizable properties and features that invite new designs. Databases should adapt by offering approximate answers, sharing computation across repeated subplans, caching grounding information in an agentic memory store, and even steering...

Popular posts from this blog

Hints for Distributed Systems Design

My Time at MIT

Scalable OLTP in the Cloud: What’s the BIG DEAL?

Foundational distributed systems papers

Learning about distributed systems: where to start?

Advice to the young

Distributed Transactions at Scale in Amazon DynamoDB

Disaggregation: A New Architecture for Cloud Databases

Making database systems usable

Use of Time in Distributed Databases (part 1)