Taming Consensus in the Wild (with the Shared Log Abstraction)
This paper recently appeared in ACM SIGOPS Operating Systems Review. It provides an overview of the shared log abstraction in distributed systems, particularly focusing on its application to State Machine Replication (SMR) and consensus protocols. The paper argues that this abstraction can simplify the design and implementation of distributed systems, and can make them more reliable and easier to maintain.
What is the shared log abstraction? The shared log abstraction proposes to separate the system into two layers: the database layer (proposers/learners) and the log layer (acceptors). The shared log provides a simple API for appending entries, checking the tail, reading entries, and trimming the log. This separation allows the SMR layer to focus on higher-level concerns without dealing with the complexities of consensus protocols.
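For concreteness, here is a minimal sketch of what that API could look like. The method names and the in-memory list below are illustrative, not any particular system's API:

```python
# A minimal sketch of the shared log API described in the paper.
# Method names and the in-memory list are illustrative, not a real system's API.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SharedLog:
    entries: list = field(default_factory=list)  # toy stand-in for the acceptor tier

    def append(self, payload: bytes) -> int:
        """Durably append an entry; return the position it landed at."""
        self.entries.append(payload)
        return len(self.entries) - 1

    def check_tail(self) -> int:
        """Return the position of the next empty slot (the current log length)."""
        return len(self.entries)

    def read(self, pos: int) -> Optional[bytes]:
        """Read the entry at a position, or None if nothing is there."""
        return self.entries[pos] if pos < len(self.entries) else None

    def trim(self, prefix: int) -> None:
        """Mark the prefix [0, prefix) as reclaimable (left as a no-op in this toy)."""
        pass
```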
This is a wisdom-packed paper. It approaches the problem more from software engineering and systems/operations perspectives. (Previous work, the Delos OSDI'20 paper, covered the technical/protocol parts.) It touches things at a high level and expects you to be aware of the distributed consensus literature. Here is how we will crack this nut. We will talk about the benefits of the approach (mainly through layering and disaggregation), the usage patterns for the approach, and the optimizations available. We will conclude by discussing the disadvantages/tradeoffs of the approach.
Benefits: Disaggregation
The paper breaks down the benefits of the shared log abstraction under the scalability, latency, and reliability headings. But, maybe aligning things with the features/goals of storage disaggregation is easier and more productive.
While the paper doesn't explicitly use the term "storage-compute disaggregation," the shared log abstraction highlights many of the same principles and benefits. It provides a practical example of how disaggregation can be implemented in the context of distributed consensus systems.
- Separation of Concerns: The shared log abstraction decouples the system into two layers: the database layer (compute) and the log layer (storage).
- Independent Scaling: Since the SMR layer and the shared log can reside on entirely different sets of machines, this allows for independent scaling of storage and compute resources, which is a key benefit of disaggregation.
- Flexibility in Deployment: The shared log abstraction allows for flexible deployment models, including running a small number of database nodes against a larger set of acceptors.
- Failure Isolation: The paper highlights how separating learner and acceptor roles can improve failure isolation. In disaggregated systems, failures in compute nodes don't directly impact storage, and vice versa.
- Performance Tuning: The paper discusses various optimizations that can be applied independently to the compute layer (like parallelizing playback) and the storage layer (like scaling acceptors). This independent tuning is a key advantage of disaggregation.
If you squint at it, you may also say the MemoryDB work we recently reviewed is an instance of the shared log approach.
Usage patterns
Who stores what in the shared log?
Multiple writers at the database level are OK. Consider the counter example. If the application is a simple counter, then the log entries can literally consist of “increment by 10” or “decrement by 5” written by different nodes. Then the shared log consists of unexecuted inputs – effectively, arbitrary pieces of code – that are executed deterministically by each database replica on playback. This is called input capturing.
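Here is a toy sketch of input capturing for the counter example, built on the SharedLog sketch above: two replicas append raw commands, and both converge by replaying the log deterministically.

```python
# Input capturing for the counter example, built on the toy SharedLog above:
# replicas append raw commands, and every replica replays them deterministically.
class CounterReplica:
    def __init__(self, log: SharedLog):
        self.log = log
        self.value = 0
        self.applied = 0  # next log position to play back

    def propose(self, op: str, amount: int) -> None:
        # Any replica may append; the log decides the order of concurrent proposals.
        self.log.append(f"{op} {amount}".encode())

    def playback(self) -> int:
        # Apply every entry up to the current tail, in log order.
        tail = self.log.check_tail()
        while self.applied < tail:
            entry = self.log.read(self.applied)
            if entry is not None:
                op, amount = entry.decode().split()
                self.value += int(amount) if op == "increment" else -int(amount)
            self.applied += 1
        return self.value

# Two writers, one log: both replicas converge to the same value.
log = SharedLog()
r1, r2 = CounterReplica(log), CounterReplica(log)
r1.propose("increment", 10)
r2.propose("decrement", 5)
assert r1.playback() == r2.playback() == 5
```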
On the other hand, output capturing makes sense especially for big operations, such as "scan the db and sum entries". In this case, it is wasteful to have each database replica execute the full scan, so we would like one node to write the output of the operation, rather than the operation command.
But, then we would need the writer to do a conditional write as it is no longer safe to directly apply each entry from the shared log to local state. The proposing database replica generated the log entry based on some locally materialized snapshot of database state (corresponding to prefix [0, X) of the log); but by the time it appends the entry, the log might have grown, in which case the new entry lands at some position Y, such that there are multiple intervening entries generated by other proposers between X and Y.
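Here is one hedged sketch of what that conditional write could look like, again on top of the toy SharedLog; the ConditionalLog class and the append_if_tail name are my own illustration, not the paper's API:

```python
# A sketch of the conditional-write retry loop for output capturing: the append
# succeeds only if no entries landed after the snapshot position the output was
# computed against. Built on the toy SharedLog above; names are illustrative.
class ConditionalLog(SharedLog):
    def append_if_tail(self, payload: bytes, expected_tail: int) -> bool:
        if self.check_tail() != expected_tail:
            return False              # intervening entries exist; caller must recompute
        self.append(payload)
        return True

def propose_output(log: ConditionalLog, compute_output) -> int:
    """Retry until the precomputed output lands with no intervening entries."""
    while True:
        snapshot_pos = log.check_tail()          # output is valid for prefix [0, snapshot_pos)
        payload = compute_output(snapshot_pos)   # e.g. "scan the db and sum entries" locally
        if log.append_if_tail(payload, snapshot_pos):
            return snapshot_pos                  # entry landed right after its snapshot
        # otherwise the log grew under us: play back the new entries and try again
```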
What if conditional writes run into a liveness issue due to frequent conflicts? One approach to cope with this is to use a single-leader approach, as mentioned in Section 3.2.D of the paper. This leader is elected at the database layer as the single writer. But this is a limiting option. Another option could be to fence writes for a duration at the log level by taking a lock until a future slot. In each case, you can avoid liveness problems by no-oping. What is no-oping? This is the Mencius trick: either the slot contains something proposed by the dedicated leader, or it is no-oped by recovery to get progress going.
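To make no-oping concrete, here is a sketch under the assumption that the log layer exposes a Corfu-style hole-filling call; the fill method and NOOP marker below are my own illustration. A stuck position gets sealed with a no-op, and playback simply skips it.

```python
# No-oping sketch: seal a stuck position so playback is not blocked on it.
NOOP = b"__noop__"

class FillableLog(SharedLog):
    def fill(self, pos: int) -> None:
        """Seal position pos with a no-op if it is still empty, so playback can proceed."""
        while len(self.entries) <= pos:
            self.entries.append(None)      # grow the toy log up to pos
        if self.entries[pos] is None:
            self.entries[pos] = NOOP

def playback_range(log: FillableLog, apply, start: int, end: int) -> None:
    """Apply entries in [start, end), skipping no-ops and empty slots."""
    for pos in range(start, end):
        entry = log.read(pos)
        if entry is None or entry == NOOP:
            continue                       # nothing meaningful at this position
        apply(entry)
```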
Transparent, all-cards-on-the-table view
To me the biggest advantage of the shared log abstraction is the transparent, all-cards-on-the-table perspective it provides.
Unlike traditional setups, where the committed index is private knowledge held by the primary and we need to jump through hoops to perform linearizable reads at the secondaries, this approach enables linearizable reads at any node at the database level, because the shared global log ensures that every node in the upper level sees the same committed index. Mahesh doesn't say this explicitly in this paper and earlier papers, but in my opinion this global consistency is a key advantage of the shared log approach.
This linearizable-reads feature also enables serving the same state materialized with different data structures optimized for different access patterns. The log-structured protocols SOSP'21 paper mentioned these as motivating examples.
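To make the linearizable-read claim concrete, here is a sketch built on the toy SharedLog/CounterReplica above: any replica pins the current tail, plays the log forward to it, and then answers from local state.

```python
# Linearizable read at ANY database replica, built on the sketches above:
# pin the current tail, catch up to it, then answer from local state.
def linearizable_read(replica: CounterReplica) -> int:
    target = replica.log.check_tail()   # every node sees the same committed tail
    while replica.applied < target:
        replica.playback()              # play the log forward to at least target
    return replica.value                # reflects everything committed before the read began
```

The same playback loop could feed a hash map on one replica and a sorted index on another; both can serve linearizable reads because they sync to the same tail before answering.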
Finally, another benefit of this all-cards-on-the-table approach is that the log becomes the network. In Mahesh's words from an earlier conversation, systems become easier to understand if nodes use a durable ordered bus rather than RPCing each other willy-nilly.
Usual SMR optimizations apply
As I mentioned in the introduction, the paper just gives an overview of which optimizations from the literature are applicable, but doesn't explain them. It mentions techniques for scaling both the acceptor and learner roles, such as striping the log over different nodes, using off-path sequencers, or sharding learners. Other techniques include batch-committing of entries, parallelizing playback for non-conflicting transactions, and implementing a "bus-stand" (i.e., batch request) optimization to reduce the number of checkTail operations.
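The bus-stand idea is roughly this: rather than every pending read issuing its own checkTail, concurrent readers wait briefly at a "bus stand" and share a single checkTail result. A hedged sketch (the class and its parameters are my own illustration):

```python
# Bus-stand batching of checkTail: concurrent readers share one checkTail
# round trip instead of each issuing their own. Built on the toy SharedLog.
import threading

class TailBusStand:
    def __init__(self, log: SharedLog, window_secs: float = 0.001):
        self.log = log
        self.window = window_secs       # how long the bus waits to pick up riders
        self.lock = threading.Lock()
        self.batch = None               # (event, result holder) for the bus now boarding

    def wait_for_tail(self) -> int:
        """Block until the next shared checkTail; return its result."""
        with self.lock:
            if self.batch is None:
                ev, result = threading.Event(), []
                self.batch = (ev, result)
                threading.Timer(self.window, self._depart, args=(ev, result)).start()
            ev, result = self.batch
        ev.wait()
        return result[0]

    def _depart(self, ev: threading.Event, result: list) -> None:
        with self.lock:
            self.batch = None                      # the next caller starts a new bus
        result.append(self.log.check_tail())       # one checkTail serves the whole batch
        ev.set()
```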
At the database layer, we can choose a node as a designated proposer, so that all writes and reads are directed to this replica. The paper says: "The election of this designated proposer can happen above the log itself, by appending a special election command with the semantics that players must reject all subsequent entries not generated by the specified proposer (until the next election command). If we also assign a timeout to the election command’s validity, we can enable low-latency strongly consistent reads at the designated proposer. Note that the timeout can be in real-time or in logical time defined by log positions; in the latter case, rather than wait out time, we can burn log positions via dummy entries to evict the old designated proposer."
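Here is a sketch of the logical-time variant of that lease, built on the toy SharedLog above; the entry formats and helper names are made up. The election entry's validity is a window of log positions, the designated proposer serves local reads while the tail is still inside the window, and others can evict it early by burning positions with dummy entries.

```python
# Designated-proposer lease measured in log positions rather than wall-clock time.
# Entry formats and helper names are illustrative, not the paper's.
DUMMY = b"__dummy__"

def elect(log: SharedLog, proposer_id: str, validity_positions: int) -> tuple[int, int]:
    """Append an election command; the lease covers the next validity_positions slots."""
    pos = log.append(f"elect {proposer_id} {validity_positions}".encode())
    return pos + 1, pos + 1 + validity_positions      # [lease_start, lease_end)

def can_serve_local_read(log: SharedLog, lease_end: int) -> bool:
    """The designated proposer may answer reads locally while the tail is in its lease."""
    return log.check_tail() < lease_end

def evict(log: SharedLog, lease_end: int) -> None:
    """Burn log positions with dummy entries until the old lease has expired."""
    while log.check_tail() < lease_end:
        log.append(DUMMY)
```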
Disadvantages / Tradeoffs
Ok, this sounds awesome. But, what are the tradeoffs? What are the disadvantages of the shared log approach?
With the designated leader solution, the only added cost is the extra hop from leader to log service and back. I think that is a good tradeoff to take, considering it enables you to provide linearizable reads at the secondaries.
Compared to a sharded Paxos groups approach, the shared log approach may have more challenges for blast-radius containment. In the shared log approach, you can also have shards of acceptors to improve scalability and availability. The paper also says that the state of the database can be partitioned into shards (e.g., blue, red, and green); a blue replica at the database layer would then only store the blue shard and materialize only the blue entries from the shared log. I think this should help, but it would be nice to have a quantitative comparison of both approaches in terms of fault-tolerance and fault-containment.
Comments