Best of Metadata in 2022

- December 21, 2022

It is time to reflect back on the year. Here are some selected papers I covered in 2022.

Production systems

Amazon Aurora: Design Considerations + On Avoiding Distributed Consensus for I/Os, Commits, and Membership Changes
Aurora provides a a high-throughput relational database on AWS cloud by decoupling the storage to a multi-tenant scale-out storage service. Each database instance acts as a SQL endpoint and supports query processing and transactions. Many database functions, such as redo logging, materialization of data blocks, garbage collection, and backup/restore, are offloaded to the storage nodes. Aurora performs replication among the storage nodes by pushing the redo log; this reduces networking traffic and enables fault-tolerant storage that heals without database involvement.

Amazon DynamoDB: A Scalable, Predictably Performant, and Fully Managed NoSQL Database Service (USENIX ATC 2022)
DynamoDB has massive scale; it powers Alexa, Amazon.com sites, and all Amazon fulfillment centers. Hundreds of thousands of customer applications also use DynamoDB. DynamoDB's architecture does not share much with that of the Dynamo system (2007). DynamoDB uses MultiPaxos for replication. This paper does not focus on the DynamoDB architecture anyways; it is more about achieving operational excellence while operating arguably the largest scale cloud database ever. DynamoDB needs to deal with operational issues related to fairness, traffic imbalance across partitions, monitoring, and automated system operations without impacting availability or performance. Reliability and predictability of performace is essential, as even the slightest disruption can significantly impact customers.

RAMP-TAO: Layering Atomic Transactions on Facebook’s Online TAO Data Store (VLDB'21)
This paper discusses the TAO graph datastore system used by Facebook and the motivation for adding a read transaction API to it. It also covers the design, implementation, and evaluation of the RAMP-TAO protocol, including the determination of the most appropriate isolation level for it.

SQLite: Past, Present, and Future
SQLite is the most widely deployed database engine with more than one trillion SQLite databases in active use. SQLite is a single node and (mostly) single threaded online transaction processing (OLTP) database. It has an in-process/embbedded design, and a standalone (no dependencies) codebase ...a single C library consisting of 150K lines of code. Being designed as an OLTP database, SQLite employs row-oriented execution and a B-tree storage format. However, there is a growing need for efficient in-process online analytical processing (OLAP). This paper delves into analytical data processing on SQLite, identifying key bottlenecks and implementing suitable solutions. As a result of the optimizations implemented, SQLite is now up to 4.2X faster on the Star Schema Benchmark (SSB). This is a sweet little paper (befitting SQLite's fame).

CockroachDB: The Resilient Geo-Distributed SQL Database
CockroachDB consists of a distributed SQL layer on top of a distributed key-value store. The key value store is laid out in a single monolithic ordered key space divided it into ranges. The ranges are the unit of replication. CockroachDB deploys a Raft group per each range for replication. CockroachDB uses multi-version concurrency control. This paper explains CockroachDB's 2PC+2PL based distributed transactions protocol and its serializable multiversion concurrency control protocol.

FoundationDB: A Distributed Unbundled Transactional Key Value Store (Sigmod 2021)
FoundationDB is a transactional key-value store that supports multi-key strictly serializable transactions across its entire key-space. It is used at Apple and Snowflake, due to its consistency, robustness and availability for storing user data, system metadata and configuration, and other critical information. The main idea in FDB is to decouple transaction processing from logging and storage. Such an unbundled architecture enables the separation and horizontal scaling of both read and write handling. The transaction system combines optimistic concurrency control and multi-version concurrency control and achieves strict serializability or snapshot isolation if desired.

Database Research

Decoupled Transactions: Low Tail Latency Online Transactions Atop Jittery Servers (CIDR 2022)
This is a Pat Helland paper. Pat's papers are always remarkable and full of wisdom. The paper proposes a novel design to dampen application visible jitter in transactional databases, and to scale OLTP SQL databases to tens of servers with predictable response time while running in a largely unpredictable environment. The design leverages quorums to solve the jitter problem. It is a very technical and dense paper, despite the colloquial writing style. Appendix D, Sections 15.3 and 15.4 show how to address the jitter free liveness check via log and the jitter free concurrent log fencing. I had gotten these wrong about the paper, and criticized the paper with my limited understanding of the design. It took many hours of discussion with Pat to get a better understanding of the protocols in the paper. I owe Pat a revised write-up but I have been procrastinating on this. Well, I think 2023 will be the year. This paper deserves very careful studying.

Anna: A Key-Value Store For Any Scale
This paper introduces Anna, a CALM/CRDT implementation of a distributed key-value system both at the data structure level as well as system architecture and transaction protocol levels. Conventional wisdom says that software designed for one scale point needs to be rewritten when scaling up by 10x. Anna aims to disprove this by showing how a key-value storage system can be architected to scale across many orders of magnitude. Anna is a partitioned, multi-mastered key-value system that achieves high performance and elasticity via wait-free execution and coordination-free consistency. Anna employs coordination-free actors that perform state update via merge of lattice-based composite data structures. Anna can provide consistency up to causal consistency. It cannot provide strong consistency at key level, and nothing stronger than read-committed at the multi-key level.

Calvin: Fast Distributed Transactions for Partitioned Database Systems
The name Calvin alludes to the predestination doctrine in Calvinism. Calvin is all about pre-ordering transactions in logs in batches, and then deterministically executing this log in the same order in replicas. The key technical feature that allows Calvin scalability in the face of distributed transactions is a deterministic ordering/locking mechanism that enables the elimination of distributed commit protocols. Since the schedulers get a good look of which transactions are coming next in this log, they can perform optimizations by pulling some independent transactions locally and executing them in parallel in worker partitions with no ill effect. Calvin achieves good performance thanks to the batching and optimizations pulled by the schedulers by looking ahead to transactions in the log. The downsides are the loss of conversational transaction execution as in SQL and increased latency due to batch processing. FaunaDB implements Calvin, and there has been several followup work to Calvin. It is an important technique to learn.

Fast Serializable Multi-Version Concurrency Control for Main-Memory Database Systems
The paper describes the addition of Multi-Version Concurrency Control (MVCC) to the Hyper database, and discusses at length how they implement serializability efficiently with little overhead compared to snapshot isolation. The paper says that Hyper is like Hekaton a closely related in-memory OLTP database in that, in order to preserve scan performance, new transactions are not allocated a separate place. The most uptodate version of the database is kept all in the same place. But, you can reach older versions of a record (that are still in use by active transactions) by working backwards from the record using the delta versions maintained. The biggest improvement in Hyper over Hekaton is the more efficient and tidier implementation of multiversion concurrency control, especially for providing serializability with little overhead.

Timestamp-based Algorithms for Concurrency Control in Distributed Database Systems
This paper is from 1980, yet it talked about distributed databases, and laid out a forward looking distributed database management system architecture, which is very relevant and in use even today. I really loved the forward looking vision and clear thinking and presentation in the paper. The paper presents several timestamp order based optimistic concurrency control protocols. As Doug Terry talks about in his Fast19 keynote, DynamoDB transactions employ this timestamp-ordered transactions idea.

Seeing is Believing: A Client-Centric Specification of Database Isolation
The paper provides a dense but rewarding read. It presents a new way to reason about transaction-isolation based on application-observable states, in lieu of prior history-based approaches. While history-based approaches prescribed operational reasoning on transaction isolation semantics, the proposed state-based approach opened the way for invariant-based reasoning for this. This approach freed isolation definitions from implementation-specific details (timestamps, replicas) and opens the way to find alternate implementations. A followup work provided TLA+ formalization of the state-based isolation model introduced here.

Misc

Efficient Replication via Timestamp Stability (EuroSys 2021)
This paper introduces Tempo, a simpler more streamlined leaderless Paxos variant for implementing a state machine replication protocol. Recently we have been seeing a convergence of the leaderless SMR ideas with loosely synchronized clocks. Tempo brings us closer to reaping the benefits of synchronized system model, while still being safe to the face of bouts of full asynchrony in the system. Accord protocol improves on Tempo and is implemented in Cassandra to provide distributed/leaderless transactions.

TLA+ Conference and StrangeLoop 2022
This is a writeup of the TLA+ conference held as part of Strange Loop. There were several awesome talks on recent developments in the TLA+ space.

Checking statistical properties of protocols using TLA+
The TLA+ model checker has been extended with a new mode (called "generate") to enable obtaining statistical properties from the models. Checking statistics is very useful, because ensuring correctness addresses only one part of our concerns with practical systems.