### Millions of tiny databases

This paper is by Marc Brooker, Tao Chen, and Fan Ping from Amazon Web Services. The paper appeared at USENIX NSDI 2020 at the end of February, which was held  on-site at Santa Clara. Right after that, all conferences got canceled due to the COVID-19 outbreak. Let's hope things stabilize for NSDI 2021.

## What is this paper about?

This paper is about improving the availability of Amazon Elastic Block Storage (EBS).

EBS allows users to create block devices on demand and attach them to their AWS EC2 instances. EBS is maintained using chain replication, one of my favorite distributed algorithms. Chain replication takes consensus off the data path so it does not constitute a bottleneck for throughput. Data is replicated at one server after the other in the chain, without needing a leader ---when there is a leader, there is an incast bottleneck problem. Consensus is only needed when a fault occurs (or is presumed to occur) and the chain needs to be reconfigured by means of the configuration box, which implements fault-tolerant consensus via Paxos.

This paper is about that configuration box, Physalia, which oversees the chain replication systems. The specific problem considered in this paper is this: *How do you design and operate Physalia so that the availability of the EBS is maximized?*

In other words this paper is about the second order effects of the EBS replication system. But these second order effects still become very important at the AWS scale. If you have millions of nodes in EBS that need configuration boxes, you cannot rely on a single configuration box. Secondly, yes, the configuration box should not see much traffic normally, but when it does see traffic, it is bursty traffic because things went wrong. And if the configuration box layer also caves in, things will get much much worse. The paper gives an account of "21 April 2011 cascading failure and loss of availability" as an example of this.

Rather than describing new consensus protocols, in the spirit of Paxos Made Live, this paper describes the details, choices and tradeoffs that are required to put the Physalia consensus system into production.  The higher order message from the paper is that "infrastructure aware placement and careful system design can significantly reduce the effect of network partitions, infrastructure failures, and even software bugs".

## Physalia architecture

It is especially important for Physalia to be available during partitions, because that is when the chain replication will require a configuration change. Physalia offers both consistency and high availability, even in the presence of network partitions, as well as minimized blast radius of failures.

Yeah, yeah, CAP impossibility result and all that. But CAP forbids the consistency-availability combination only at the very margins. There are many ways to circumvent CAP, and Physalia's idea is to not require all keys to be available to all clients. Each key needs to be available at only three points in the network: the AWS EC2 instance that is the client of the volume, the primary copy, and the replica copy. (I had made a similar point in September at this blog post.)

Each EBS volume is assigned a unique partition key at creation time, and all operations for that volume occur within that partition key. Within each partition key, Physalia offers a transactional store with a typed key-value schema, supporting strict serializable reads, writes and conditional writes over any combination of keys.

To realize this idea for reducing the blast radius for the configuration box implementation, Physalia divides a colony into a large number of cells. Each node is only used by a small subset of cells, and each cell is only used by a small subset of clients. This is why the paper is titled "millions of tiny databases". *The Physalia configuration store for chain replication of EBS is implemented as key-value stores maintained over a large number of these cells.*

In the EBS installation of Physalia, the cell performs Paxos over seven nodes. Seven was chosen to balance several concerns: durability, tail latency, availability, resource usage.

When a new cell is created, Physalia uses its knowledge of the power and network topology of the datacenter to choose a set of nodes for the cell. The choice of nodes balances two competing priorities. Nodes should be placed close to the clients to ensure that failures far away from their clients do not cause the cell to fail. They must also be placed with sufficient diversity to ensure that small-scale failures do not cause the cell to fail. Physalia tries to ensure that each node contains a different mix of cells, which reduces the probability of correlated failure due to load or poison pill transitions.

Physalia---the reconfiguration box for chain replication in EBS--- also reconfigures its cells. For this Physalia uses the Paxos reconfiguration approach presented in Lampson's 1996 paper. (I think there is a need for more research on reconfiguration in Paxos systems to make progress on realizing more adaptive and dynamic Paxos deployments.) A significant factor in the complexity of reconfiguration is the interaction with pipelining: configuration changes accepted at log position $i$ must not take effect logically until position $i + \alpha$, where $\alpha$ is the maximum allowed pipeline length.

Physalia employs reconfiguration frequently to move cells closer to their clients. It does this by replacing far-away nodes with close nodes using reconfiguration. The small data sizes in Physalia make cell reconfiguration an insignificant portion of overall datacenter traffic. Figure 7 illustrates this process of movement by iterative reconfiguration, which complete quickly typically within a minute.

When nodes join or re-join a cell, they are brought up to speed by teaching, implemented in three modes outside the core consensus protocol.
"In the bulk mode, most suitable for new nodes, the teacher (any existing node in the cell) transfers a bulk snapshot of its state machine to the learner. In the log-based mode, most suitable for nodes re-joining after a partition or pause, the teacher ships a segment of its log to the learner. We have found that this mode is triggered rather frequently in production, due to nodes temporarily falling behind during Java garbage collection pauses. Log-based learning is chosen when the size of the missing log segment is significantly smaller than the size of the entire dataset."
This is funny. In classes, I always give the Java garbage collection example for how  synchrony assumptions may be violated.

## Testing

The authors used three different methods for testing.

1. They built a test harness, called SimWorld, which abstracts networking, performance, and other systems concepts. The goal of this approach is to allow developers to write distributed systems tests, including tests that simulate packet loss, server failures, corruption, and other failure cases, as unit tests in the same language as the system itself. In this case, these unit tests run inside the developer’s IDE (or with junit at build time), with no need for test clusters or other infrastructure. A typical test which tests correctness under packet loss can be implemented in less than 10 lines of Java code, and executes in less than 100ms.
2. As another approach they used a suite of automatically-generated tests which run the Paxos implementation through every combination of packet loss and reordering that a node can experience. This testing approach was inspired by the TLC model checker (the model checker for TLA+), and helped them build confidence that our implementation matched the formal specification. They also used the open source Jepsen tool to test the system, and make sure that the API responses are linearizable under network failure cases. This testing, which happens at the infrastructure level, was a good complement to the lower-level tests as it could exercise some under-load cases that are hard to run in the SimWorld.
3. The team used TLA+ in three ways: writing specifications of the protocols to check that they understand them deeply, model checking specifications against correctness and liveness properties using the TLC model checker, and writing extensively commented TLA+ code to serve as the documentation of the distributed protocols. While all three of these uses added value, TLA+’s role as a sort of automatically tested (via TLC), and extremely precise, format for protocol documentation was perhaps the most useful. The code reviews, SimWorld tests, and design meetings frequently referred back to the TLA+ models of our protocols to resolve ambiguities in Java code or written communication.
Yay, for TLA+. Here are some motivation and examples of TLA+ use from my blog.

## Evaluation

The paper provides these graphs from production for evaluating the performance Physalia.

1. I am more of a protocols/algorithms guy. This paper investigates realization and application of protocols in production rather that introducing new protocols. But it was still a good read for me, and I enjoyed it. I think another very good work relevant to this is Facebook's Delos.

2. This is from the beginning of Section 2: The design of Physalia. If I was just given this paragraph, I could easily tell this paper is coming from industry rather than academia. Academia cares about novelty and intellectual merits. It is hard to find concerns for "easy and cheap to operate", "easy to use correctly" as part of priorities of academic work.
Physalia’s goals of blast radius reduction and partition tolerance  required careful attention in the design of the data model, replication mechanism, cluster management and even operational and deployment procedures. In addition to these top-level design goals, we wanted Physalia to be easy and cheap to operate, contributing negligibly to the cost of our dataplane. We wanted its data model to be flexible enough to meet future uses in similar problem spaces, and to be easy to use correctly. This goal was inspired by the concept of misuse resistance from cryptography (GCM-SIV, for example), which aims to make primitives that are safer under misuse. Finally, we wanted Physalia to be highly scalable, able to support an entire EBS availability zone in a single installation.
3. The paper provides the following discussion about why they implemented Physalia via independent cells, rather than cells coupling in a peer-to-peer manner like Scatter. Although they don't elaborate much on this, I agree on this point. I think a Scatter-like approach may still be [made] tolerant to partitions, but I very much agree on the complexity point.
We could have avoided implementing a separate control-plane and repair workflow for Physalia, by following the example of elastic replication or Scatter. We evaluated these approaches, but decided that the additional complexity, and additional communication and dependencies between shards, were at odds with our focus on blast radius. We chose to keep our cells completely independent, and implement the control plane as a separate system.
4. An opportunity here is that the cells are distributed to nodes, and it is possible to balance the load on each node by controlling how many leader/proposer versus followers are placed on that node. I think Physalia might already be doing this. To relieve the stress on the leader/proposer of the cell, our work on linearizable Paxos quorum reads may be applicable.