Log structured protocols in Delos (SOSP21)

- October 29, 2021

This paper picks up from where the "Virtual Consensus in Delos" paper finishes, and talks about using Delos to build control plane databases at Facebook. These are my notes from Mahesh's excellent conference presentation.

These control plane databases at Facebook were required to support multiple APIs: some SQL, some key-value pairs, and some ZooKeeper namespaces. Each such API typically requires a separate database to be built. But implementing and operating even a single zero dependency control plane system is difficult.

To cut this dilemma, they observe that these databases have a similar structure: a consensus protocol at the bottom, and a replicated state machine on top. A lot of that state machine uses generic logic that can be reused across different APIs. This implies that there is an hourglass architecture at work here, where Delos platform is the narrow waist. This paper focuses on protocols which allow us to layer multiple apis on this common Delos platform.

This idea extends on the "Tango: Distributed Data Structures over a Shared Log" idea of writing to a shared log and replaying it to reconstruct/materialize state at the machines. This base idea is extended with a layered state-machine approach to make the common platform more flexible independently expandable. The paper then reports results from Facebook scale implementation of this idea.

Why did they need a layered-state machine approach? In 2020, at Facebook they started building Zelos, a Zookeeper clone control plane database that builds on Delos. They noticed that Zelos needed extra guarantees, features, performance not yet supported by the Delos platform. The platform became a bottleneck not in terms of scaling or throughput but in terms of developer productivity.

To solve this problem they came up with the idea of breaking the platform state machine into lots of fine-grained state machines, and layered these in a protocol state, much like network packet layering.

When the command is played back the bottom most engine called the base engine creates a local transactional context and applies the command to the engine above it and so on. Once the top level finishes execution, it sends a return value back down the stack to the log entry. And then if the node playing the command happens to be the one that proposed the command we also relay the return value back to application.

This allowed them to build several control plane databases independently, not bottlenecking developer productivity. Teams can build these layers on their own, and other teams can reuse these layers easily.

Delos Table= base engine + view tracking (membership tracking)

LogBackupEngine= base engine + view tracking + log backup (to cold storage)

Zelos= base engine + view tracking + log backup + session ordered

Zelos2= base engine + view tracking + log backup + batching+ session ordered

The cloud/datacenter environment allows many opportunities to extend on the basic RSM idea. (There is plenty of room at the bottom.) A takeaway for me here is that log ingestion and replaying is not that costly at the cloud/datacenter environment. This then affords us to use this simple principled approach to solve more pressing bottleneck problems, like developer productivity, control plane reliability, etc.

To my commentary above, Mahesh responded with this excellent point:

There’s also some broader point about “the log is the network”; systems are easier to understand if nodes use a durable ordered bus rather than RPCing each other willy-nilly.

I agree fully. Our job should be to slash complexity, not to add to it.