Chardonnay: Fast and General Datacenter Transactions for On-Disk Databases (OSDI'23)
This paper appeared in OSDI'23. I haven't read it in detail; I am going by the two presentations I have seen of the paper and by a quick skim. The paper has a cool idea, and I decided I should not procrastinate any longer waiting to read it carefully, and should instead just capture and communicate this idea.
Datacenter networking got very fast, much faster than SSD/disk I/O. That means cold reads from SSD/disk, rather than two-phase commit (2PC), are the dominant source of latency. Chardonnay banks on this trend in an interesting manner. It executes read/write transactions in "dry run" mode first, just to discover the readset and writeset and pin them to main memory. Then, in "definitive" mode, it executes these transactions using two-phase locking (2PL) and 2PC swiftly, because everything (mostly) is in memory. Having learned the readset and writeset during the dry run also allows the definitive execution to achieve deadlock-freedom (and prevent aborts due to deadlock avoidance) simply by acquiring locks in ascending order of object-ids.
This way, Chardonnay provides strictly serializable general read-write transactions via 2PC+2PL, in memory and quickly, for a single-datacenter deployment (read at the end how the CIDR'24 follow-up generalized this to multi-datacenter transactions). In addition, Chardonnay provides lock-free (contention-free) serializable read-only transactions from snapshots (Chardonnay is multiversion in that sense) that are taken every epoch (10ms). To explain how this works, we need to check the architecture and algorithm next.
With this brief overview, let's put Chardonnay in context. It leverages ideas explored by deterministic databases with that dry-run mode. Calvin (and its follow-up papers) also suggested scout queries and used epochs. I see Chardonnay as better operationalizing those ideas.
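Here is a minimal Python sketch of that two-phase execution, under my own assumptions: the helpers (execute, pin_to_memory, lock, commit, unlock_all) are hypothetical names, not the paper's API. It only illustrates the idea of a lock-free dry run followed by a definitive run with ordered lock acquisition.

```python
def run_transaction(txn, kv):
    # Dry run: execute without locks just to learn the read/write sets
    # and pull any cold keys from SSD/disk into main memory.
    read_set, write_set = txn.execute(kv, dry_run=True)
    for key in read_set | write_set:
        kv.pin_to_memory(key)

    # Definitive run: acquire locks in ascending key order (which rules out
    # deadlocks), execute for real, then commit via 2PC to the ranges.
    # A real implementation must also handle the case where the definitive
    # run touches keys the dry run did not see.
    for key in sorted(read_set | write_set):
        kv.lock(key, exclusive=(key in write_set))
    try:
        result = txn.execute(kv, dry_run=False)
        kv.commit(txn)   # 2PC across the participant ranges
    finally:
        kv.unlock_all(txn)
    return result
```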
Architecture and algorithm
Chardonnay has four main components.
The Epoch Service is responsible for maintaining and updating a single, monotonically increasing counter called the epoch. The epoch is only read, not incremented, by each transaction. The epoch is essential for the lock-free strongly consistent snapshot reads as we discuss next.
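As a toy model (not the paper's implementation), the Epoch Service can be thought of as a single counter that only one background loop ever advances, with transactions doing read-only accesses; the 10ms interval is an assumption carried over from the snapshot discussion below, and the real service replicates the counter with Multi-Paxos.

```python
import threading, time

class EpochService:
    """Toy epoch counter; the real service is Multi-Paxos replicated."""
    def __init__(self, interval_sec=0.010):   # ~10ms epochs
        self.epoch = 1
        self._lock = threading.Lock()
        t = threading.Thread(target=self._advance, args=(interval_sec,), daemon=True)
        t.start()

    def _advance(self, interval_sec):
        while True:
            time.sleep(interval_sec)
            with self._lock:
                self.epoch += 1   # the only writer

    def read(self):
        with self._lock:          # transactions only ever read the epoch
            return self.epoch
```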
The KV Service stores the user key-value data. It uses a replicated, shared-nothing, range-sharded architecture. Each range is assigned to 3 replicas, each holding the database and a WAL (implemented via Paxos and placed on a fast NVMe device for low latency). One of the range replicas is designated as the leader, which holds a leader lease. All reads and writes go through the leader.
The Transaction State Store is responsible for authoritatively storing the transaction coordinator state in a replicated/fault-tolerant manner so that client failures do not cause transaction blocking. Each transaction can be in one of the following states: Started, Committed, Aborted, and Done. (Note that being Prepared is not of concern here.) Using the presumed abort optimization, the service replies Aborted to a participant’s inquiry about the state of a transaction unknown to the service. Being in Done state means that all transaction participant ranges have learned about the commit outcome of the transaction so that the service can safely forget about it.
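A small sketch of the presumed-abort behavior described above, using the four states from the paper; the in-memory dictionary is just a stand-in for the replicated, fault-tolerant store.

```python
class TransactionStateStore:
    def __init__(self):
        self.states = {}   # txn_id -> "Started" | "Committed" | "Aborted" | "Done"

    def set_state(self, txn_id, state):
        if state == "Done":
            # Every participant range has learned the outcome, so the
            # service can safely forget about this transaction.
            self.states.pop(txn_id, None)
        else:
            self.states[txn_id] = state

    def query(self, txn_id):
        # Presumed abort: an unknown transaction is reported as Aborted.
        return self.states.get(txn_id, "Aborted")
```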
The Client Library is what the applications link to in order to access Chardonnay. It is the 2PC coordinator, and provides APIs for executing transactions.
Most transactions only touch keys within a single range, so they do not need 2PC. First, the client reads the epoch. Then, it sends a Commit message to the range leader, which checks that the epoch falls within its lease's epoch interval. If so, the leader appends to the WAL and, if that succeeds, returns success. If not, it aborts.
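A sketch of this single-range fast path, with hypothetical names for the lease interval and WAL; the epoch check is what ties the commit to the leader's lease.

```python
def commit_single_range(epoch_service, leader, txn):
    epoch = epoch_service.read()        # step 1: client reads the epoch
    return leader.commit(txn, epoch)    # step 2: Commit message to the range leader

class RangeLeader:
    def commit(self, txn, epoch):
        # Lease check: the observed epoch must fall in this leader's lease interval.
        if not (self.lease_start_epoch <= epoch <= self.lease_end_epoch):
            return "ABORT"
        if not self.wal.append(txn.writes):   # Paxos-replicated WAL on fast NVMe
            return "ABORT"
        return "COMMIT"
```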
If multiple shards are involved, the general workflow shown in the figure is used.
A small yet interesting point is how the client reads the epoch from the epoch service (which is Multi-Paxos replicated for fault tolerance). These reads are done as quorum reads including followers (see our Paxos Quorum Reads paper for why this is needed for linearizability and how this works). If the leader held a leader lease, this could have been just a leader read, but using leader leases could mean extra unavailability waiting for lease expiration when there is a leader fail-over.
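As intuition only, a heavily simplified sketch of the idea: contact a majority of replicas (followers included) and take the largest epoch seen. The actual Paxos Quorum Reads protocol has more machinery to guarantee linearizability, so don't take this as the real algorithm.

```python
def quorum_read_epoch(replicas):
    majority = len(replicas) // 2 + 1
    # Ask a majority (leader not required) for their locally known epoch and
    # return the maximum; the epoch is a monotonically increasing counter.
    responses = [r.local_epoch() for r in replicas[:majority]]
    return max(responses)
```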
Read-only queries
Snapshot reads are essential to efficiently support read-only (and especially long-running read-only) queries. These queries have to be declared as read-only from the start.
A transaction can get a consistent snapshot as of the beginning of the current epoch ec by ensuring it observes the effects of all committed transactions that have a lower epoch. That is realized by waiting for all the transactions with an epoch e < ec to release their write locks. Hence, the snapshot read algorithm would simply work by reading the epoch ec, then reading the appropriate key versions.
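Sketched below with hypothetical KV-service helpers (wait_for_write_locks_below, read_version_before) standing in for the lock wait and the multiversion read.

```python
def snapshot_read(keys, epoch_service, kv):
    ec = epoch_service.read()
    # Wait until every transaction tagged with an epoch e < ec has released
    # its write locks; then the snapshot as of the start of ec is stable.
    kv.wait_for_write_locks_below(ec)
    # For each key, read the latest version written in an epoch < ec.
    return {k: kv.read_version_before(k, ec) for k in keys}
```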
What is cool is that the epochs allow for lots of simultaneous snapshot reads.
This algorithm does not guarantee strict serializability, because a transaction T would not observe the effects of transactions in epoch ec that committed before T started. If desired, ensuring linearizability is easy at the cost of some latency: after T starts, it waits for the epoch to advance once and then uses the new epoch.
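Continuing the sketch above, the strict variant just waits out one epoch boundary before taking the snapshot (adding up to one epoch, ~10ms, of latency).

```python
import time

def strict_snapshot_read(keys, epoch_service, kv):
    e_start = epoch_service.read()
    while epoch_service.read() <= e_start:
        time.sleep(0.001)   # wait for the epoch to advance once
    return snapshot_read(keys, epoch_service, kv)   # now uses the new epoch
```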
Evaluation
As expected, Chardonnay achieves high performance for high-contention workloads by automatically and transparently loading and pinning data from slow storage to main memory prior to acquiring any locks, and it avoids deadlocks by ordering its lock requests. Chardonnay's throughput under extremely high contention is only 15% lower than under extremely low contention, whereas in traditional architectures that drop can be as high as 85%. The tradeoff is that the dry run phase adds overhead for low-contention workloads.
Extending to multiple datacenters
The Chablis (ah, another wine) paper (CIDR 2024) extended Chardonnay to multiple datacenters. The extension is done in a relatively straightforward manner by adding another epoch, called the global epoch, on top of Chardonnay's existing single-datacenter local epochs. Cross-datacenter read-only transactions execute against snapshot points serialized by the global epoch. They still need to wait for write locks to be released by transactions executing in the current global epoch. Single-datacenter local transactions keep executing with the Chardonnay logic explained above.
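Going by that description alone (I have not studied Chablis closely), the cross-datacenter read path would look roughly like the local snapshot read, just keyed on the global epoch; every name below is hypothetical and this is only my reading of the idea.

```python
def global_snapshot_read(keys_by_dc, global_epoch_service, datacenters):
    ge = global_epoch_service.read()
    result = {}
    for dc in datacenters:
        # Per the rule above: wait for the relevant write locks to be
        # released in this datacenter before reading at global epoch ge.
        dc.kv.wait_for_global_write_locks(ge)
        for k in keys_by_dc.get(dc.name, []):
            result[k] = dc.kv.read_version_as_of_global_epoch(k, ge)
    return result
```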
Comments