Virtual Consensus in Delos

This paper has been awarded a best paper award at OSDI20, and is authored by Mahesh Balakrishnan, Jason Flinn, Chen Shen, Mihir Dharamshi, Ahmed Jafri, Xiao Shi, Santosh Ghosh, Hazem Hassan, Aaryaman Sagar, Rhed Shi, Jingming Liu, Filip Gruszczynski, Xianan Zhang, Huy Hoang, Ahmed Yossef, Francois Richard, and Yee Jiun Song at Facebook, Inc.

Delos is the database that stores control plane state for Facebook, building over a shared log system. Each Delos server stores a local copy of a full database, which keeps materialized state of the log, as a Delos table. This DB/ is maintained over the shared log as RSM by just reading commands from the log in sequence and applying them for materializing on the table.

Virtual consensus

This high level setup is familiar, so let's dive deeper to get to the novelty in this work. Delos uses virtual consensus, which is a fancy way of saying the following. The DB is maintained over a virtual log, which consists of loglets, each of which is implemented as loglets, using ZooKeeper at first. 

The virtual log stores the mapping from the address space to the underlying loglets using a metastore. The metastore is a versioned key value store, maintained in a fault-tolerant and consistent way using a classic Multi-Paxos implementation. 

To change this mapping, we seal the current loglet, and write a new mapping to the metastore pointing to the new loglet, and then we route the new writes to the new loglet. This new loglet could be entirely different log implementation. And that is exactly what the authors did as part of this project. Instead of using ZKloglets, they switched to a native loglet implementation without disturbing the upper virtual log. 

This is achievable without stopping deployment thanks to the virtualized log design. In this design, the only source of consensus is the metastore, which stores the mapping to the loglets. Since the virtual log handles all reconfiguration,  we avoid the need for reconfiguration and leader election within the loglet.

The native loglet implementation is also simplified by this virtual log design. Since we isolate reconfiguration into the highly available metastore, we can afford to keep the loglets less available, because they are dispensible and substitutable. The loglet does not need to implement fault-tolerant consensus, instead it only implements fault-tolerant seal. When a loglet becomes unavailable the virtual log can use this fault-tolerant seal API to seal it, and switch to a new loglet. Ok, let's zoom in again. 

Native loglet

One of the servers maintaining the loglet is pre-appointed as the sequencer (aka leader). To append to the loglet, the update is sent to the sequencer, which appends it to the loglet and waits for majority response for completion. Any server can find the tail of the log by reading from the quorum of logservers. After a check tail, the client can read committed entries directly from its local log server, and then can apply the changes that are under the tail/watermark.

Since the sequencer is not involved, the checktail is fault-tolerant. Upon suspected failure of the sequencer any client can seal the log by talking to the quorum. 

That the sequencer is pre-appointed and constant/fixed is the trick to the efficiency of the native loglet implementation. We don't care about availability/fault-tolerance of the loglets, they are dispensable and substitutable.

Ok, let's talk about sealing the loglets. Unfortunately this is not very straightforward. Sealing the loglet needs to take care of filling holes in the loglet. If not, this can cause a progress issue for the virtual log above. Another problem is that sealing the loglet is not really an atomic operation, that can be done in one shot. It is best to think of sealing as closing the valve, there can still be drips, and we need to capture them using the checkTail operation at the last stage of sealing. If we see a higher progress after valve is closed, we need to do a "read repair" because that higher progress point might have been chosen in the other replica and committed unbeknownst to us, so consistency should not be jeopardized.

In theory, yes once we change to the new loglet, we can't let client store things successfully on the old loglet.  So, any updates sent after a seal finishes will not be successful but they might still be chosen and even placed on the virtual log due to the seal closing not being an instant operation. 

So in practice sealing leads to some unavailability time. When you read from the virtual log, you must read from the valid loglet defined in metastore. That means you have to wait for the sealing to conclude first. During this time,  writes may be staged to the new loglet, but they are not acknowledged or become available for reading until sealing is finished. 

Michael gave a great explanation of the sealing protocol in our zoom reading group. Please check his presentation to learn more. 

Looking on the bright side, this virtual log design allows us to implement new features quickly. Consider a a  striped loglet implementation. When the client appends to the log, we like the update go round robin to the underlying loglets.

Loglets stay alive after they have been sealed, so they can continue to serve reads (not writes though!). Old loglets can also be trimmed to be garbage collected.


The graph above shows how native loglet going live has reduced the latency of puts and gets, because the native loglet implementation is lean and is not encumbered like ZooKeeper implementation with the need for consensus.

In the converged setup, each server runs the db and loglet. This allows fast local log reads as well as  fate-sharing. On the other hand, in a disaggregated deployment, It is possible to scale each tier separately, and also achieve  less I/O contention. The paper mentions that when they disaggregate the logs, they can scale the IO bottleneck via sharding/striping and reach 1 Million appends per sec by using 30 stripes on 90 machines. 

The paper also mentions that Delos has been used in production for more than 2 years now, serving  1.8 Billion transactions per day.

I like this paper. It tries to make consensus boring, which is how it should be. We are still not there though, so our work is not done. 


Popular posts from this blog

Foundational distributed systems papers

Your attitude determines your success

Progress beats perfect

Cores that don't count

Silent data corruptions at scale

Learning about distributed systems: where to start?

Read papers, Not too much, Mostly foundational ones

Sundial: Fault-tolerant Clock Synchronization for Datacenters


Using Lightweight Formal Methods to Validate a Key-Value Storage Node in Amazon S3 (SOSP21)