Posts

Showing posts from 2022

Best of Metadata in 2022

It is time to reflect back on the year. Here are some selected papers I covered in 2022. Production systems Amazon Aurora: Design Considerations + On Avoiding Distributed Consensus for I/Os, Commits, and Membership Changes Aurora provides a a high-throughput relational database on AWS cloud by decoupling the storage to a multi-tenant scale-out storage service. Each database instance acts as a SQL endpoint and supports query processing and transactions. Many database functions, such as redo logging, materialization of data blocks, garbage collection, and backup/restore, are offloaded to the storage nodes. Aurora performs replication among the storage nodes by pushing the redo log; this reduces networking traffic and enables fault-tolerant storage that heals without database involvement. Amazon DynamoDB: A Scalable, Predictably Performant, and Fully Managed NoSQL Database Service (USENIX ATC 2022) DynamoDB has massive scale; it powers Alexa, Amazon.com sites, and all Amazon fulfillment c

Noria: dynamic, partially-stateful dataflow for high-performance web applications

Image
This paper, which appeared in OSDI 2018 , considers the question of how to design a read-optimized RDBMS database with built-in support for caching and fast lookup answers to complex queries. The resulting system, Noria is implemented in Rust, and is available as opensource at https://github.com/mit-pdos/noria . There are already two good summaries written of the paper (from Adrian Colyer and Micah Lerner ), so my job today will be simpler. Artwork created by Stable Diffusion Motivation and overview Databases (mostly relational databases) are overwhelmingly used with read-heavy workloads due to all those SQL querying they serve, but they fail to take advantage of this and improve their designs to optimize for read-heavy workloads. Some crutch solutions was added to address this: memcached, read-only replicas etc., but these have shortcomings. Many deployments employ an in-memory key-value cache (like Redis, memcached, or TAO) to speed up common-case read queries. Such a cache avoids r

NoSQL: The Hangover of Dropping ACID

Image
Created by Stable Diffusion The 70s were a time of excess and rebellion, a carefree era where people embraced their wildest impulses and let their hair down. But as the years went by, people realized that the true freedom lies not in the absence of rules, but in the discipline to follow them. And so, they put away their acid-washed jeans and neon-colored hair, and embraced a more structured, disciplined way of life. As we hit 2010, the 70s spirit of rebellion spawned a new breed of databases known as "NoSQL". The rise of NoSQL databases was a rebellion against the strict rules and constraints of SQL databases. NoSQL databases promised to free users from the rigid, structured world of SQL, with their strict schemas and complex query languages. Instead, NoSQL databases offered a simple, flexible approach to data storage and retrieval, allowing users to store and access data in whatever way they saw fit. In contrast to SQL databases, which provide strong consistency guarantees t

Vertical Paxos and Primary-Backup Replication

Image
This is a 2009 paper by Leslie Lamport, Dahlia Malkhi, and Lidong Zhou. This paper was very well written and it was clear, and easy to follow (at least for me after having read so many Paxos papers). I finished reading the paper in one hour -- a record time for a slow reader like me. I knew about this work, but had not read the paper carefully before. I think I was overindexing on the reconfiguration aspect of Vertical Paxos, but by reading this paper (by going to the source) I found that the primary-backup replication application of Vertical Paxos was as much the point of emphasis as the reconfiguration aspect. The paper talks about the connection between Paxos and primary-backup replication applications, and presents vertical Paxos as a way of bridging these two. The paper delves down in to the leader handover and reconfiguration, and shows a practical application of these in the context of primary-backup replication protocols. Although it is possible to let the state machine reconf

OceanBase: A 707 Million tpmC Distributed Relational Database System

Image
This paper appeared in VLDB2022. It presents the design of OceanBase, a scale-out multitenant relational database deployed on tens of thousands of servers at Alipay.com. The design of OceanBase is not impressive. It looks like OceanBase uses a standard run-off-the-mill Paxos groups architecture. Maybe there are some interesting work going on at integration with storage. Unfortunately, the paper is poorly written. It talks and boasts about many things, some repeatedly. But it doesn't give enough detail about any parts. For example, even the type of concurrency control protocol used is not discussed in the paper. Turns out it uses a "MVCC" based approach to concurrency control. ( DBDB's catalog on OceanBase gives a good overview of OceanBase.) The paper would be much more useful and enjoyable if it focused on a certain component of OceanBase and did a thorough walk through on that part. For those interested in diving deeper, OceanBase has a Github repo here . The open

Highly Available Transactions: Virtues and Limitations

Image
This is a VLDB 2014 paper from Peter Bailis, Aaron Davidson, Alan Fekete, Ali Ghodsi, Joseph M. Hellerstein, and Ion Stoica. This is a great paper, because it asked an important question that helped advanced our understanding of isolation properties with respect to high-availability. So here is the context. As part of his PhD, Peter Bailis started think about the benefits of relaxed consistency guarantees, and how can we leverage them. Peter's Probabilistic Bounded Staleness work is how things started. He also investigated eventual consistency limitations, and bolt-on causal consistency on that front. This paper brings that relaxed guarantees effort from the per-key consistency to the transaction isolation domain, and investigates Highly Available Transactions (HAT).  Remember this clickable chart from Jepsen I refer to often? The origin of that chart is the Figure 2 in this paper. This is what I meant when I said this paper advanced our understanding. This is a very subtle topic

Practical uses of synchronized clocks in distributed systems

This paper is from 1991, by Barbara Liskov. To set the context, viewstamped replication paper was published at 1988, and the Paxos tech report (which failed to publish) came in 1989. This paper also comes only 3 years after the NTP RFC was written. Although this is an old paper, it is a timely (!) one. The paper discusses how clocks can be used for improving the performance of distributed systems. This paper is a visionary paper, ahead of its time. It is possible to criticize the paper by saying it was too optimistic in assuming time synchronization has arrived (it took another 2 decades or so for time synchronization to make it make it). But going off with that optimistic premise, the paper made so many remarkable contributions. The next noteworthy update on the practical use of synchronized clocks came 21 years later in 2012 with Spanner . Sanjay Ghemawat on the Spanner paper was Liskov's student that worked on the Harp project. This brings me to the next impressive thing about

Timestamp-based Algorithms for Concurrency Control in Distributed Database Systems

Image
This paper by Bernstein and Goldman appeared in VLDB 1980 . Yes, 1980,  more than 40 years ago. This may be the oldest paper I wrote a summary for. The paper is written by a typewriter! I really didn't like reading 10+ pages in coarse grained typewritten font. I couldn't even search the pdf for keywords. Yet, the paper talked about distributed databases, and laid out a forward looking distributed database management system (DDBMS) architecture with distributed transaction managers (TMs) and data managers (DMs), which is very relevant and in use even today. I really loved the forward looking vision and clear thinking and presentation in the paper. There are two main approaches to concurrency control, locking based (two phase locking) or timestamp order based. This paper builds up on many previous work (all of which happened in 1976-1980) and provides a framework to reason about and categorize timestamp based concurrency control protocols (basic, Thomas Write Rule, Multiversion,

Popular posts from this blog

The end of a myth: Distributed transactions can scale

Hints for Distributed Systems Design

Foundational distributed systems papers

Learning about distributed systems: where to start?

Metastable failures in the wild

Scalable OLTP in the Cloud: What’s the BIG DEAL?

The demise of coding is greatly exaggerated

SIGMOD panel: Future of Database System Architectures

Dude, where's my Emacs?

There is plenty of room at the bottom