Paper summary. CORFU: A shared log design for flash clusters

- July 25, 2019

By: Mahesh Balakrishnan, Dahlia Malkhi, Vijayan Prabhakaran, Ted Wobber, Michael Wei, John D. Davis. Appeared in NSDI 2012.
This paper applies VPaxos ideas (of using an auxiliary Paxos box for reconfiguration) and chain replication ideas in the context of Flash SSDs. The vision is that Corfu's novel client-centric design eliminates storage servers in favor of simple, efficient and inexpensive flash chips that attach directly to the network. The clients directly write to storage nodes, similar to what happens in Dynamo/Cassandra/Voldemort replication, but linearizability is still guaranteed.

Previously I had summarized the Tango paper, for maintaining distributed data structures over a shared log. Tango builds on the Corfu log abstraction.

Corfu involves three main functions:

A mapping function (maintained at the VPaxos box) from logical positions in the log to flash pages on the cluster of flash units
A tail-finding mechanism (using a sequencer node) for finding the next available logical position on the log for new data
A replication protocol (chain replication!) to write a log entry consistently on multiple flash pages

Mapping in Corfu

Each Corfu client maintains a local, read-only replica of a data structure called a projection that carves the logical log into disjoint ranges. Each such range is mapped to a list of extents within the address spaces of individual flash units.

The example above maps each log position to a single flash page; for replication, each extent is associated with a replica set of flash units rather than just one unit. For example, for two-way replication the extent F0:0:20K would be replaced by F0/F0':0:20K and the extent F1:0:20K would be replaced by F1/F1':0:20K.
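To make the mapping concrete, here is a minimal Python sketch of a projection lookup. The class names (`Extent`, `Range`, `Projection`), the `locate` method, and the extents F2 and F3 for the second range are my own illustrative choices, not Corfu's API. I am also assuming that positions within a range are striped round-robin across that range's extents; treat that as an assumption of the sketch rather than something this summary establishes.

```python
from dataclasses import dataclass
from typing import Tuple

K = 1000  # "20K" in the text is read here as 20,000 pages (assumption)


@dataclass(frozen=True)
class Extent:
    replicas: Tuple[str, ...]  # replica set in chain order, e.g. ("F0", "F0'")
    start: int                 # first physical page of the extent
    length: int                # number of pages in the extent


@dataclass(frozen=True)
class Range:
    lo: int                    # first logical position (inclusive)
    hi: int                    # end of the range (exclusive)
    extents: Tuple[Extent, ...]


@dataclass(frozen=True)
class Projection:
    epoch: int
    ranges: Tuple[Range, ...]

    def locate(self, pos: int):
        """Map a logical log position to (replica set, physical page)."""
        for r in self.ranges:
            if r.lo <= pos < r.hi:
                offset = pos - r.lo
                # Assumed striping: round-robin across the range's extents.
                extent = r.extents[offset % len(r.extents)]
                page = extent.start + offset // len(r.extents)
                if page >= extent.start + extent.length:
                    raise IndexError(f"position {pos} overflows its extent")
                return extent.replicas, page
        raise KeyError(f"position {pos} is not covered by epoch {self.epoch}")


# Two-way replicated version of the example: F0/F0':0:20K and F1/F1':0:20K.
projection = Projection(
    epoch=0,
    ranges=(
        Range(0, 40 * K, (Extent(("F0", "F0'"), 0, 20 * K),
                          Extent(("F1", "F1'"), 0, 20 * K))),
        Range(40 * K, 80 * K, (Extent(("F2", "F2'"), 0, 20 * K),
                               Extent(("F3", "F3'"), 0, 20 * K))),
    ),
)
```

With this projection, `projection.locate(2)` resolves to page 1 on the replica set F0/F0', and a position past 80K raises `KeyError`, which is the client's cue that a new projection is needed.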

When some event occurs that necessitates a change in the mapping --for example, when a flash unit fails, or when the tail of the log moves past the current active range-- a new projection (a new view with a new epoch number) has to be installed on all clients in the system.

To maintain and reconfigure this mapping, Corfu uses VPaxos. The VPaxos box stores the mapping from the logical log to physical SSD extents/ranges, and updates it when a flash unit fails or when an extent fills up.

This VPaxos-based auxiliary-driven reconfiguration involves two distinct steps:
1. Sealing the current projection: When a client Cr decides to reconfigure the system from the current projection Pi to a new projection Pi+1, it first seals Pi; this involves sending a seal command to a subset of the flash units in Pi. Sealing ensures that flash units will reject in-flight messages --writes as well as reads-- sent to them in the context of the sealed projection.

2. Writing the new projection at the VPaxos box: Once the reconfiguring client Cr has successfully sealed the current projection Pi, it attempts to write the new projection Pi+1 at the (i + 1)th position in the VPaxos box. If some other client has already written to that position, client Cr aborts its own reconfiguration, reads the existing projection at position (i + 1), and uses it as its new current projection.
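The two steps above can be sketched in a few lines of Python. Everything here is a hypothetical stand-in: `ProjectionStore` plays the role of the VPaxos box as a simple in-memory write-once array, `FlashUnit` only tracks the lowest epoch it still accepts, and projections are plain strings. The point is only to show the seal-then-write-once pattern and how a losing client adopts the winner's projection.

```python
class SealedError(Exception):
    """Raised when a request carries an epoch that has been sealed."""


class FlashUnit:
    def __init__(self):
        self.min_epoch = 0  # lowest epoch this unit still accepts

    def seal(self, epoch):
        # After sealing epoch e, only requests tagged with a later epoch pass.
        self.min_epoch = max(self.min_epoch, epoch + 1)

    def check_epoch(self, epoch):
        if epoch < self.min_epoch:
            raise SealedError(f"epoch {epoch} is sealed")


class ProjectionStore:
    """Hypothetical stand-in for the VPaxos box: slot i holds projection P_i."""

    def __init__(self, initial):
        self._slots = [initial]

    def write_once(self, index, projection):
        if index < len(self._slots):
            return False  # another client already wrote this position
        if index != len(self._slots):
            raise ValueError("projections must be written in order")
        self._slots.append(projection)
        return True

    def read(self, index):
        return self._slots[index]


def reconfigure(store, units_to_seal, current_epoch, proposed):
    """Seal P_i, then try to install P_{i+1}; adopt the winner if we lose."""
    for unit in units_to_seal:           # step 1: seal the current projection
        unit.seal(current_epoch)
    next_epoch = current_epoch + 1
    if store.write_once(next_epoch, proposed):   # step 2: write P_{i+1}
        return proposed
    return store.read(next_epoch)        # lost the race: use the existing one
```

If two clients both call `reconfigure` from epoch 0, the first call installs its proposal and the second returns that same projection instead of its own.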

Finding tail in Corfu

To eliminate contention at the tail of the log, Corfu uses a dedicated sequencer that assigns clients 'tokens', corresponding to empty log positions. To append data, a client first goes to the sequencer, which returns its current value and increments itself. The sequencer is merely an optimization to reduce contention in the system and is not required for either safety or progress.
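Conceptually the sequencer is just a fetch-and-increment counter. Here is a minimal in-process sketch; the `Sequencer` class and the `next_token` name are mine, and a real deployment would expose this over the network rather than behind a lock.

```python
import threading


class Sequencer:
    """Hands out log positions (tokens); a plain fetch-and-increment counter."""

    def __init__(self, tail=0):
        self._tail = tail
        self._lock = threading.Lock()

    def next_token(self):
        # Return the current value, then advance the counter.
        with self._lock:
            token = self._tail
            self._tail += 1
            return token
```

Since the write-once flash pages are what ultimately arbitrate between competing writers, a client holding a stale or duplicate token would simply get an error from the flash unit and retry with a fresh position.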

Replication in Corfu

Corfu uses a simple chaining protocol (a client-driven variant of Chain Replication) to achieve safety-under-contention and durability. When a client wants to write to a replica set of flash pages, it updates them in a deterministic replica order, waiting for each flash unit to respond before moving to the next one. If two clients attempt to concurrently update the same replica set of flash pages, one of them will arrive second at the first unit of the chain and receive an overwrite error.

To read from the replica set, clients go to the last unit of the chain. If the last unit has not yet been updated, it will return an unwritten error.
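A small sketch of the client-driven chain write and tail read follows, treating each flash page as a write-once register. The names `FlashUnit`, `OverwriteError`, `UnwrittenError`, `chain_write`, and `chain_read` are illustrative only and do not come from the paper's interface; epochs and network failures are left out.

```python
class OverwriteError(Exception):
    """The page was already written (write-once semantics)."""


class UnwrittenError(Exception):
    """The page has not been written yet."""


class FlashUnit:
    def __init__(self):
        self._pages = {}

    def write(self, page, value):
        if page in self._pages:
            raise OverwriteError(page)
        self._pages[page] = value

    def read(self, page):
        if page not in self._pages:
            raise UnwrittenError(page)
        return self._pages[page]


def chain_write(chain, page, value):
    # Update replicas in a fixed order, one at a time. A competing writer
    # loses at the head of the chain and gets OverwriteError.
    for unit in chain:
        unit.write(page, value)


def chain_read(chain, page):
    # Read from the tail only: a value there has reached every replica.
    return chain[-1].read(page)
```

Reading from the tail is what keeps reads linearizable here: a value that is visible at the tail has already been written to every unit before it in the chain.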

To fill holes (which is important for maintaining a replicated state machine from the log), the client starts by checking the first unit of the chain to determine if a valid value exists in the prefix of the chain. If such a value exists, the client walks down the chain to find the first unwritten replica, and then completes the append by copying the value over to the remaining unwritten replicas in chain order. Alternatively, if the first unit of the chain is unwritten, the client writes the junk value to all the replicas in chain order.
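Hole filling can be sketched the same way. This is a simplified version under my own naming (`fill_hole`, `JUNK`, and a toy write-once `FlashUnit`); it ignores epochs and does not retry if the filler loses a race at the head of the chain.

```python
class OverwriteError(Exception):
    """The page was already written (write-once semantics)."""


class UnwrittenError(Exception):
    """The page has not been written yet."""


class FlashUnit:
    def __init__(self):
        self._pages = {}

    def write(self, page, value):
        if page in self._pages:
            raise OverwriteError(page)
        self._pages[page] = value

    def read(self, page):
        if page not in self._pages:
            raise UnwrittenError(page)
        return self._pages[page]


JUNK = object()  # sentinel standing in for the junk value


def fill_hole(chain, page):
    """Complete a half-finished append, or junk-fill an untouched position."""
    head, rest = chain[0], chain[1:]
    try:
        value = head.read(page)   # a writer got at least as far as the head
    except UnwrittenError:
        value = JUNK
        head.write(page, JUNK)    # raises OverwriteError if a writer races us
    for unit in rest:
        try:
            unit.write(page, value)
        except OverwriteError:
            pass                  # this replica already holds the value
    return value
```

Either way the position ends up decided: the stalled writer's value is pushed through to the tail, or the slot is permanently marked as junk so that readers of the log can skip over it.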

MAD questions

1. Do SSDs still work this way?
I am not current on my SSD knowledge. The paper makes use of properties of flash SSDs: it assumes specific error codes are returned for "no item", "item", and "junk", and in effect it treats the SSDs as write-once registers for the purpose of the log. In return, it also tries to account for some of their limitations, such as uneven wear, by load-balancing the wear across flash units.

Has anything changed in the way SSDs work that would change these assumptions/requirements?

2. How can we improve on some drawbacks?
A big drawback in Corfu is that any time a fault occurs, everything stalls and a reconfiguration is performed before reads/writes can proceed on the active extent. This is also a problem with chain-replication-based protocols in general.

Would there be some simple solutions to amend Corfu to address this?
For example, would it be possible to come up with a more clever mapping that tolerates a single-node crash? Ceph had a clever hierarchical hashing scheme called CRUSH; maybe something along those lines.

As I have mentioned in the MAD questions of the previous blog post, Cosmos DB has operationalized a fault-masking, streamlined version of replication via nested replica-sets deployed in a fan-out topology. Rather than doing offline updates from a log, Cosmos DB updates the database at the replicas online, in place, to provide strongly consistent and bounded-staleness reads, among other read consistency levels. On the other hand, Cosmos DB also maintains a change log by way of a witness replica, which serves several useful purposes, including fault-tolerance, remote storage, and snapshots for analytics workloads.