Fine-Grained Replicated State Machines for a Cluster Storage System

This paper appeared in NSDI 2020 and was authored by Ming Liu and Arvind Krishnamurthy, University of Washington; Harsha V. Madhyastha, University of Michigan; Rishi Bhardwaj, Karan Gupta, Chinmay Kamat, Huapeng Yuan, Aditya Jaltade, Roger Liao, Pavan Konka, and Anoop Jawahar, Nutanix.

The paper presents the design and implementation of a consistent and fault-tolerant metadata index for a scalable block storage system via distributed key-value abstraction. The key idea is to use fine-grained replicated state machines (fRSM), where every key-value pair in the index is treated as a separate RSM to reduce tail-latency in key-value access and provide robustness to key access skews.


The problem arised from Nutanix's business in building private clouds for enterprises to enable them to instantiate VMs that run legacy applications. A cluster management software determines which node to run each VM on, migrating them as necessary. And Stargate provides a virtual disk abstraction to these VMs translating virtual accesses to physical disk accesses. This work focuses on how Stargate stores the metadata index that maps virtual disk blocks to physical locations across the cluster.

To minimize the probability of data loss, any update to the metadata must be committed to stable storage on multiple nodes before Stargate acknowledges the write to the client. In the baseline/traditional design Paxos would be employed for this. More specifically, the traditional design shards the metadata index across multiple nodes, and then uses Paxos for ordering operations on any given shards. The operations are then executed on a durable data structure such as a log-structured merge tree. For example, Spanner and CockroachDB designs are similar to this.

But there are drawbacks to the baseline design. The use of a per-shard consensus operation log introduces inefficiencies, more specifically,  head-of-line (HOL) blocking which arises due to bursty operations on the same shard. Even when the operations are for different keys, the latter operations would have to wait for the earlier to complete, as requests have to commit and execute in sequence as specified in the log order for the shard. This can increase latency and cause unpredictable tail-latency. The paper reports factor of 4x between loads on shards, which turns HOL blocking into a significant problem.

I think this is application dependent, and by choosing smaller shards (rather than abolishing shards all together as fRSM suggests) the problem can be remedied. For example, CockroachDB uses small shards of 64Mb, and it looks like that granularity works good for them. In the discussion section at the end, I will revisit this shard-size discussion.

Side-quest: going beyond linearizable operations 

As a side-quest, the paper sets up this problem and it later demonstrates how this is resolved by fRSM read operation. Under linearizability, even when a read issued after a failure does not reflect a write issued before the failure, this does not mean that the write failed. It may be that the update could have been arbitrarily delayed and might get applied later, causing subsequent reads to observe the updated value. This is the ghost-writes problem in Cassandra, Mongo, and is annoying to deal with. The paper argues that they would need to provide stronger guarantees to client VMs so that they can reason about operation failures.

They require that any subsequent read of the metadata after an operation timeout must confirm whether the prior operation succeeded or not. As a result, successive reads of a piece of metadata should return the same value as long as there are no concurrent updates initiated by other agents in the system.

Paxos can satisfy this if the leader keeps track of client submission order and uses read-in-log execution. But the problem is you have to wait commit, before you acknowledge. If you do speculative acks of dirty writes, you are prone to this problem again. Below we will see how fRSM read operations resolve this problem, by fate-sealing writes before themselves.

Fine-grained replicated state machine (fRSMs)

Using fRSMs, each key-value pair is represented as a separate RSM and can operate independently. fRSM uses no operation logs and maintains only a small amount of consensus state along with the perceived value of a key.

Associated with each key is a clock attribute storing
  • epoch number to represent the generation for the key 
  • timestamp within an epoch is advanced whenever the key's value is updated. The epoch number and the timestamp together represent a Paxos instance number 
  • promised proposal number and accepted proposal number associated with the key's value
  • chosen bit as the commit flag in Paxos

CAS-Update operation

CAS updates are built using the clock the client obtained via the key read. With each read, a client also receives the current epoch (e) and timestamp (t) for the value. The client specifies the new value for timestamp t+1 having read the value previously at timestamp t. The request is routed to the leader of the replica group responsible for the key.

Here are the phases of the CAS-update operation:
  • Retrieve key's consensus state: The leader reads its local state for key k and retrieves the key's local clock: pp for promised proposal number, pa for accepted proposal number
  • Prepare request If pp is for a prepare issued by a different node, then the leader generates a higher proposal number, sends prepare messages to other nodes. (The leader skips this step if pp and pa are the same.)
  • Prepare handler: Same as in Paxos phase1b. The replicas durably store the prepare proposal number as part of the key’s clock attribute
  • Accept request  The key's value and the corresponding clock are recorded in the commit log and Memtable at each node
  • Accept response processing
    • If a quorum of successful accept responses is received at the leader, the leader considers the operation to be completed, sets chosen bit, and acks the client.
    • If the request is rejected because the (epoch, timestamp) tuple at a replica is greater than the client-supplied epoch and timestamp, then a CAS error is sent to the client. Further, accept messages are initiated to commit the newly learned value and timestamps at a quorum.
This is very much the classical Paxos protocol.

The video presentation of fRSM is misleading here, as it uses the term "Fast path" in discussing the update operation. In Paxos literature, fast path means reaching a supermajority to get a commit quickly, as in Fast Paxos or EPaxos. However, in fRSM the authors are using fast path to refer to the stable leader (aka multiPaxos) optimization of skipping the prepare phase and just doing Accept phase.

The big difference from the classical Paxos operation is that fRSM is logless, the nodes do not maintain per-key or per-shard operation logs. This required some subtle changes to the protocol. The nodes skip over missed operations and directly determine and apply the accepted value with the highest associated proposal number (with a possibly much higher timestamp). They still send a CAS error, but also adopt the value. Say a replica got update t, missed t+1, and received t+2. Since it didn't see t+1, it will send a CAS error, but will also accept t+2, and use it, so it is caught up for t+3. If the leader does not get a quorum of accepts due to CAS-errors, the leader will not be able to commit the value.

The processing logic speculatively updates the LSM tree and relies on subsequent operations to fix speculation errors. The leader checks its chosen bit, but the replicas/followers don't wait for a phase-3 commit. They just go with what they accepted. The read operation does some fate-sealing if necessary to take care of finalization in the presence of logless operation.

Read operation 

What do I mean by fate-sealing? The read needs to satisfy two properties:
  • the value returned should be the most recent chosen value for a given key
  • other accepted values with higher <epoch, timestamp> than the returned value are not chosen
This second property is required to ensure that any other CAS operations that are in-progress but aren't visible will not be committed in the future; this is similar to a view change in the Viewstamped Replication protocol, and achieved by increasing the epoch number as discussed below as part of mutating quorum reads. This way operations that are not deemed complete at the end of a view are prevented from committing in a subsequent view.

I still think a much cooler name for this would be "a fate-sealing read for Shroedinger's database". This read operation takes care of the ghost-writes problem we discussed in the side-quest as part of background section.

The reads are processed in one of three modes
  1. leader-only reads
  2. quorum reads
  3. mutating quorum reads (read-repair mode)
When the operation is routed to the leader, the leader checks whether it is operating in theleader-only mode. If the check is successful, then the leader will serve the request from its Memtable or one of the SSTables. I am assuming some leader-lease is also involved here. The system employs ZooKeeper as a hint for key to leader mapping -- which is mentioned in one passing sentence in Section 3.1.

If the leader is not operating in the leader-only mode, then it has to poll the replica set for a quorum read and identify the most recent accepted value for the key. If this value is not available on a quorum of nodes, the leader has to propagate the value to a quorum of nodes (i.e., performing a mutating quorum read as in read-repair operation in Cassandra and ABD protocol).

Further, if there is an unreachable replica that might have a more recent accepted value (i.e., the promise is made to a node that did not respond with a clock attribute), then the mutating quorum read performs an additional quorum-wide update to just the timestamp to prevent such a value from being chosen (i.e., fate sealing against ghost-writes).
  • Let pp be the promise associated with an unreachable node, and let v, e, and t be the value, epoch, and timestamp associated with the highest accepted proposal
  • The leader issues prepare commands to the replica nodes to obtain a promise greater than pp, and then sends accept commands to the replica nodes to update their value, epoch, and timestamp fields to v, e, and t+1, respectively. The higher timestamp value prevents older CAS operations from succeeding.


The paper argues that fRSM allows for flexible and dynamic scheduling of operations on the metadata service and enables effective use of the storage and compute resources, and uses the evaluation to show benefits of fine grain RSM approach over coarse-shard grained RSMs.

The evaluations show that compared with coarse-grained RSMs, fRSMs achieve 5.6× and 2.3× higher throughput for skewed and uniform scenarios in controlled testbeds. The paper argues this is because fRSM (1) allows requests accessing different keys to be reordered and committed as soon as they complete; (2) eliminates the computation cost associated with scanning the RSM log to identify and retry uncommitted entries; (3) avoids unnecessary head-of-line blocking caused by other requests; (4) achieves better load balance across SSDs and cores even in skewed workloads.


Here is the YouTube video of the discussion of the paper from our Zoom DistSys Reading Group. I include some of our discussion in the below section. Here is a link to the Google Docs for recording questions for the discussion.

Next week, we will discuss "Scalog: Seamless Reconfiguration and Total Order in a Scalable Shared Log" paper from NSDI 2020. To engage in the paper discussion and get the Zoom Link for the meeting, join our Slack channel.


1. Why not use smaller shards? Why go to per-key RSM?

Smaller shards (in the limit one-key per shard as in fRSM) helps for more hardware utilization, but it also has more overhead. The paper does not talk about shard size selection much, but I suspect choice is application dependent. The paper has this to say about smaller shards, but it should have elaborated more on this topic.

Sub-dividing the shards into even smaller shards would mitigate the load imbalance issue. However, it suffers from three drawbacks. First, it doesn’t address the request head-of-line blocking issue. Requests still have to commit and execute in sequence as specified in the log order. Second, it further reduces batching efficiency for storage devices. Third, it doesn't provide the benefit of fast node recovery, as a recovering node cannot immediately participate in the protocol.

Our previous work WPaxos also provides per-key RSM, and in an efficient manner. Moreover, it allows efficient and safe key-stealing so that the system can self-optimize and adapt to access patterns and provide local yet strongly consistent updates even in WAN settings.

2. Why does fRSM not support blind writes? 

The paper says this on that: "We do not support blind writes, i.e., operations that merely update a key’s value without providing the current value. Since all of our operations are CAS-like, we can provide at-most-once execution semantics without requiring any explicit per-client state as in RIFL."

It would be nicer, if this point is elaborated on. It is still not very clear. Hermes also had a logless design, but it allowed blind writes as well.

3. What does Mad Danny say about the paper?

I asked Mad Danny about his views on the paper. This is what he had to say.

This protocol is part of a system that has been running for 8 years in deployment. fRSM is a weird cross between CASPaxos and MultiPaxos. It is probable they may not even had Paxos in mind, when they first deployed this solution. Or maybe they started with MultiPaxos and evolved toward a logless CASPaxos like solution. The protocol is so intertwined with the application, maybe hyperoptimized to the application.

This is what you get for writing a paper after so many years. The protocol takes on a life of its own, and became huge with many details. It becomes hard for the authors to figure out how to abstract away and prioritize the ideas, having worked on implementation and devops of the protocols for many years. It becomes hard to abstract out and tell a simple story. As a result, the paper is not well written, and looks like a collection of half-finished thoughts in many places.

The nice thing about fRSM is it has no logs, which gives it an agility advantage. It uses dirty writes (followers take accept as the last word speculatively), and offloads responsibility to reads for fate sealing of the writes. Solving the ghost-write problem is a big boon. Having similar thing for Cassandra and MongoDB could be of humongous help.

For utmost reliability, I would use a very well understood, battle-tested Paxos implementation, and deploy it as a configuration service fleet like AWS did in Pysalia. This way you can shard at your heart's will as needed using cells, reconfigure them, move them. Nice things, about cells is you can still do transactions within the cell using Paxos, and this gives extra power for programmability of the service. And that solution would be more general, not  tied to an application, and can be lifted and applied for other applications.


AWerner said…
Thanks for the writeup, no comment about it specifically, just some updates on Cockroach that seem relevant.

> CockroachDB uses small shards of 64Mb, and it looks like that granularity works good for them

CockroachDB is moving to 512Mb shards starting in the next release [1]. The primary motivation for this change is to reduce the overhead of maintaining each shard independently when there is a large amount of relatively cold date.

Another, earlier prerequisite for this change was optimizing the proposal [2] and command application [3] pipelines to turn the problem of increased load on a shard to an opportunity for increased batching. One important aspect of Cockroach's design is the opportunity for multi-core parallelism even within a shard by using fine-grained latching. With these changes, the amount of synchronization between non-overlapping commands is largely reduced to a single atomic increment per proposal and. When commands are evaluated (pre-sequencing) they need to acquire a shared mutex, that mutex only now needs to be acquired exclusively around 1-2 times per batch of commands.

Nevertheless, you are correct that single-range (Cockroach's shard terminology) can lead to bottlenecks. We've added some functionality to help alleviate certain application access patterns like sequential write workloads which were especially susceptible given Cockroach's range-based sharding [4]. These workloads were just as problematic with small shards.


Popular posts from this blog

Foundational distributed systems papers

Your attitude determines your success

My Distributed Systems Seminar's reading list for Fall 2020

Cores that don't count

Silent data corruptions at scale

Learning about distributed systems: where to start?

I have seen things

Read papers, Not too much, Mostly foundational ones

Sundial: Fault-tolerant Clock Synchronization for Datacenters

PigPaxos: Devouring the communication bottlenecks in distributed consensus