Best of Metadata in 2024

- December 11, 2024

I can't believe we wasted another good year. It is time to reflect on the best posts on the Metadata blog in 2024. (I think you guys should tip me just because I didn't call this post "Metadata wrapped".)

Distributed systems posts

Transactional storage for geo-replicated systems (SOSP11): I like this paper because it asked the right questions, and introduced parallel snapshot isolation. No individual part is novel (vector clocks, csets), but their composition and application to WAN web applications were novel. Walter showed how to limit WAN coordination while still developing useful applications.

An Hourglass Architecture for Distributed Systems (SRDS24 Keynote): This work successfully bridges theoretical research and practical implementation in large-scale distributed systems in the Facebook/Meta control plane. The shared log abstraction proposes to separate the system into two layers: the database layer (proposers/learners) and the log layer (acceptors). The shared log provides a simple API for appending entries, checking the tail, reading entries, and trimming the log. This separation allows the SMR layer to focus on higher-level concerns without dealing with the complexities of consensus protocols.
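To make the four-operation API concrete, here is a minimal single-process sketch. The names (SharedLog, KVReplica, check_tail, and so on) are made up for illustration and are not taken from the keynote or from any Meta system; a real log layer would replicate entries through a consensus protocol, which this toy version skips entirely.

```python
from typing import Any, Dict


class SharedLog:
    """Toy in-memory log with append, check_tail, read, and trim (hypothetical names)."""

    def __init__(self) -> None:
        self._entries: Dict[int, Any] = {}
        self._tail = 0      # position the next append will occupy
        self._trimmed = 0   # positions below this have been discarded

    def append(self, entry: Any) -> int:
        """Add an entry at the tail and return its position."""
        pos = self._tail
        self._entries[pos] = entry
        self._tail += 1
        return pos

    def check_tail(self) -> int:
        """Return the position the next append would get."""
        return self._tail

    def read(self, pos: int) -> Any:
        """Return the entry at pos; fail if it was trimmed or not yet written."""
        if pos < self._trimmed or pos >= self._tail:
            raise KeyError(f"position {pos} is trimmed or unwritten")
        return self._entries[pos]

    def trim(self, prefix: int) -> None:
        """Discard all entries at positions below prefix."""
        prefix = min(prefix, self._tail)
        for pos in range(self._trimmed, prefix):
            del self._entries[pos]
        self._trimmed = max(self._trimmed, prefix)


class KVReplica:
    """Database-layer replica: proposes by appending, learns by replaying the log."""

    def __init__(self, log: SharedLog) -> None:
        self.log = log
        self.state: Dict[str, Any] = {}
        self.applied = 0  # next log position to apply

    def put(self, key: str, value: Any) -> None:
        self.log.append(("put", key, value))

    def sync(self) -> None:
        """Apply every entry between our position and the current tail."""
        tail = self.log.check_tail()
        while self.applied < tail:
            _op, key, value = self.log.read(self.applied)
            self.state[key] = value
            self.applied += 1

    def get(self, key: str) -> Any:
        self.sync()
        return self.state.get(key)
```

Two KVReplica instances sharing one SharedLog converge on the same state because they apply the same entries in the same order. The replicas contain no agreement logic at all, which is the point of the layering.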

Linearizability: A Correctness Condition for Concurrent Objects (TOPLAS90): This is a foundational paper on linearizability. I gave this paper a critical read to point out things it did well versus some shortcomings.

Unanimous 2PC: Fault-tolerant Distributed Transactions Can be Fast and Simple (PAPOC24): This paper brings together the work/ideas around 2-phase commit and consensus protocols. It is thought-provoking to consider the tradeoffs between the two.

FlexiRaft: Flexible Quorums with Raft (CIDR23): The paper talks about how they applied Raft to MySQL replication, and used flexible quorums in the process. This is not a technically deep paper, but it was interesting to see a practical application of the flexible quorums idea to Raft rather than Paxos. The most technically interesting part is that adapting flexible quorums to Raft needs an extra requirement on quorums in order to guarantee Leader Completeness: "the new leader must already have all log entries replicated by a majority of nodes in the previous term."
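As a reminder of what "flexible quorums" means in its simplest form, the sketch below brute-forces the basic intersection condition: every leader-election quorum must share at least one node with every commit quorum. This is the generic size-based version of the idea, not FlexiRaft's actual quorum configuration, and the function name is hypothetical.

```python
from itertools import combinations


def quorums_intersect(n: int, election_size: int, commit_size: int) -> bool:
    """Check by enumeration that any election quorum overlaps any commit quorum."""
    nodes = range(n)
    return all(
        set(election) & set(commit)
        for election in combinations(nodes, election_size)
        for commit in combinations(nodes, commit_size)
    )


# With 5 nodes, majorities (3 and 3) intersect, and so do the
# asymmetric sizes 4 and 2, but 3 and 2 can miss each other.
print(quorums_intersect(5, 3, 3))
print(quorums_intersect(5, 4, 2))
print(quorums_intersect(5, 3, 2))
```

For size-based quorums the enumeration agrees with the arithmetic rule election_size + commit_size > n. The tradeoff is that shrinking the commit quorum makes writes cheaper while making leader election more expensive.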

Amazon MemoryDB: A Fast and Durable Memory-First Cloud Database (SIGMOD24): Amazon MemoryDB is a fully managed in-memory database service that leverages Redis's performance strengths while overcoming its durability limitations. It uses Redis as the in-memory data processing engine but offloads persistence to an AWS-internal transaction log service (internally known as the journal). This decoupled architecture provides in-memory performance with microsecond reads and single-digit millisecond writes, while ensuring durability across availability zones (AZs), 99.99% availability, and strong consistency in the face of failures.

Fault-Tolerant Replication with Pull-Based Consensus in MongoDB (NSDI21): Raft provides fault-tolerant state-machine-replication (SMR) over asynchronous networks. Raft (like most SMR protocols) uses push-based replication. But MongoDB uses a pull-based replication scheme, which caused challenges when integrating/invigorating MongoDB's SMR with Raft. The paper focuses on examining and solving these challenges, and explaining the resulting MongoSMR protocol.

Tunable Consistency in MongoDB (VLDB19): This paper discusses the tunable consistency models in MongoDB and how MongoDB's speculative execution model and data rollback protocol enable a spectrum of consistency levels efficiently.

Databases posts

I participated in reading groups that covered two database books in 2024, and blogged about the chapters. I had 7 posts related to the Transaction Processing book by Gray and Reuter and 14 posts related to the Designing Data-Intensive Applications book. Both books were great for getting a good understanding of databases.

Scalable OLTP in the Cloud: What’s the BIG DEAL? (CIDR24): In this paper Pat Helland argues that the answer lies in the joint responsibility of the database and the application. The BIG DEAL splits the scaling responsibilities between the database and the application. *Scalable DBs* don’t coordinate across disjoint TXs updating different keys. *Scalable apps* don’t concurrently update the same key. So, snapshot isolation is a BIG DEAL!

Chardonnay: Fast and General Datacenter Transactions for On-Disk Databases (OSDI23): Chardonnay provides fast, strictly serializable general read-write transactions via in-memory 2PC+2PL for a single-datacenter deployment. It also provides lock-free (contention-free) serializable read-only transactions from snapshots (Chardonnay is multiversion in that sense) that are taken every epoch (10ms).

Looking back at Postgres (2019): This article covers Postgres's origin story, and provides a nice retrospective and context about development and features.

DBSP: Automatic Incremental View Maintenance for Rich Query Languages (VLDB23): DBSP is a simplified version of differential dataflow. It assumes linear synchronous time, and in return provides powerful compositional properties. If you apply two queries in sequence and you want to incrementalize that composition, it's enough to incrementalize each query independently. This allows independent optimization of query components, enabling efficient and modular query processing.
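The compositional property is easy to see on a toy example. The sketch below is not DBSP's API: it represents a change set as a dict from row to integer weight (positive for inserts, negative for deletes) and uses two linear operators, a filter and a map, with hypothetical helper names. Because both operators are linear, running the composed query on just the delta and adding that to the old output gives the same result as recomputing from scratch. Non-linear operators need additional machinery that this sketch omits.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable

ZSet = Dict[Hashable, int]  # row -> weight; negative weight means deletion
Op = Callable[[ZSet], ZSet]


def zadd(a: ZSet, b: ZSet) -> ZSet:
    """Add two weighted sets, dropping rows whose weight cancels to zero."""
    out: Dict[Hashable, int] = defaultdict(int)
    for zset in (a, b):
        for row, weight in zset.items():
            out[row] += weight
    return {row: w for row, w in out.items() if w != 0}


def zfilter(pred: Callable[[Hashable], bool]) -> Op:
    """Linear operator: keep rows satisfying pred, weights unchanged."""
    return lambda z: {row: w for row, w in z.items() if pred(row)}


def zmap(fn: Callable[[Hashable], Hashable]) -> Op:
    """Linear operator: transform each row, summing weights that collide."""
    def op(z: ZSet) -> ZSet:
        out: Dict[Hashable, int] = defaultdict(int)
        for row, w in z.items():
            out[fn(row)] += w
        return {row: w for row, w in out.items() if w != 0}
    return op


def compose(first: Op, second: Op) -> Op:
    return lambda z: second(first(z))


evens = zfilter(lambda x: x % 2 == 0)
halve = zmap(lambda x: x // 2)
query = compose(evens, halve)

base = {2: 1, 3: 1, 4: 2}       # current input
delta = {4: -1, 6: 1, 7: 1}     # one deletion of 4, insertions of 6 and 7

recomputed = query(zadd(base, delta))             # run on the full new input
incremental = zadd(query(base), query(delta))     # old output plus query on delta
assert recomputed == incremental
```

Each operator is incrementalized on its own (for linear operators, simply by applying it to the delta), and the incremental pipeline is just the composition of those pieces.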

Understanding the Performance Implications of Storage-Disaggregated Databases (SIGMOD24): This paper conducts a comprehensive study of the performance implications of storage-disaggregated databases. The work addresses several critical performance questions that were obscured due to the closed-source nature of these systems.

AI posts

Auto-WLM: machine learning enhanced workload management in Amazon Redshift (SIGMOD23): Auto-WLM is a machine learning based automatic workload manager currently used in production in Amazon Redshift. This paper turned out to be a practical/applied data systems paper rather than a deep learning and/or advanced machine learning paper. At its core, this paper is about improving query performance and resource utilization in data warehouses, and Auto-WLM is possibly the first such workload manager for a database system in production at scale.

The demise of coding is greatly exaggerated: This is my response to NVIDIA CEO Jensen Huang's remarks: "Over the course of the last 10 years, 15 years, almost everybody who sits on a stage like this would tell you that it is vital that your children learn computer science, and everybody should learn how to program. And in fact, it’s almost exactly the opposite."

Looming Liability Machines (LLMs): The use of LLMs for automatic root cause analysis (RCA) for cloud incidents spooked me viscerally. I am not suggesting a Butlerian Jihad against LLMs. But I am worried that we are enticed too much by LLMs. Ok, let's use them, but maybe we shouldn't open the fort doors to let them in. I have worries about automation surprise, systemic failures, and getting lazy by trading thinking for superficial answers.

TLA+ posts

Exploring the NaiadClock TLA+ model in TLA-Web: I have been impressed by the usability of TLA-Web from Will Schultz. Recently I have been using it for my TLA+ modeling of MongoDB catalog protocols internally, and found it very useful to explore and understand behavior. This got me thinking that TLA-Web would be really useful when exploring and understanding an unfamiliar spec I picked up on the web.

TLA+ modeling of a single replicaset transaction modeling: For some time I had been playing with transaction modeling and most recently with replicaset modeling by way of a single log. While playing with these, I realized I can build something cool on top of these building blocks. I just finished building snapshot isolated transaction modeling that sits on top of a replicaset log. This is also a high level approximation of MongoDB-style snapshot isolated transactions on a single replicaset.

TLA+ modeling of MongoDB logless reconfiguration: This is a walkthrough of the TLA+ specs for the MongoDB logless reconfiguration protocol we've reviewed here.