Paper summary. Scalable Consistency in Scatter

- May 24, 2019

Here is the pdf for the paper. It is by Lisa Glendenning, Ivan Beschastnikh, Arvind Krishnamurthy, and Thomas Anderson of the Department of Computer Science & Engineering, University of Washington.

This paper is about peer-to-peer (P2P) systems. But the paper is from 2011, way after the P2P hype had died. This makes the paper more interesting, because it had the opportunity to consider things in hindsight. The P2P corpse was cold, and Dynamo had looted the distributed hash tables (DHT) idea from P2P and applied it in the context of datacenter computing. In return, this work liberates the Paxos coordination idea from the datacenter world and employs it in the P2P world. It replaces each node (or virtual node) in a P2P overlay ring with a Paxos group that consists of a number of nodes.

Ok, what problem do Paxos groups solve in P2P systems? In the presence of high churn, DHTs in P2P systems suffer from inconsistent routing state and inconsistent name space partitioning issues (see Figure 1). By leveraging the Paxos group abstraction as a stable base on which to build its coordination operations (split, merge, migrate, repartition), Scatter achieves linearizable consistency even under adverse circumstances.

Group coordination

Scatter supports the following multi-group operations:

split: partition the state of an existing group into two groups
merge: create a new group from the union of the state of two neighboring groups
migrate: move members from one group to a different group
repartition: change the key-space partitioning between two adjacent groups
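
To make the four operations concrete, here is a minimal sketch of what each one does to a group's key range and membership. This is my own toy model, not code from the paper: the `Group` class, the half-open integer key ranges, and the midpoint split are all assumptions, and it ignores ring wrap-around and the consensus machinery entirely.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    lo: int  # inclusive start of the key range (hypothetical integer keys)
    hi: int  # exclusive end of the key range
    members: set = field(default_factory=set)


def split(g):
    """Partition g's key range and member nodes into two new groups."""
    mid = (g.lo + g.hi) // 2
    nodes = sorted(g.members)
    half = len(nodes) // 2
    return Group(g.lo, mid, set(nodes[:half])), Group(mid, g.hi, set(nodes[half:]))


def merge(a, b):
    """Create one group from the union of two neighboring groups."""
    assert a.hi == b.lo, "groups must be adjacent"
    return Group(a.lo, b.hi, a.members | b.members)


def migrate(src, dst, node):
    """Move a member node from one group to another."""
    src.members.remove(node)
    dst.members.add(node)


def repartition(a, b, boundary):
    """Shift the key-space boundary between two adjacent groups."""
    assert a.hi == b.lo and a.lo < boundary < b.hi
    a.hi = boundary
    b.lo = boundary
```

In Scatter each of these state changes would only be applied after the multi-group transaction described next has committed.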

Each multi-group operation in Scatter is structured as a distributed transaction. The paper calls this design pattern nested consensus, and says: "We believe that this general idea of structuring protocols as communication between replicated participants, rather than between individual nodes, can be applied more generally to the construction of scalable, consistent distributed systems."

Nested consensus uses a two-tiered approach. At the top tier, groups execute a two-phase commit protocol (2PC), while within each group Paxos is used for agreeing on the actions that the group takes. Provided that a majority of nodes in each group remain alive and connected, the 2PC protocol will be non-blocking and terminate. (This is the same argument Spanner uses as it employs 2PC over Paxos groups.) For individual links in the overlay to remain highly available, Scatter maintains an additional invariant: a group can always reach its adjacent groups. To maintain this connectivity, Scatter enforces that every adjacent group of a group A has up-to-date knowledge of the membership of A.

Multi-group operations are coordinated by whichever group decides to initiate the transaction as a result of some local policy. The group initiating a transaction is called the coordinator group and the other groups involved are called the participant groups. This is the overall structure of nested consensus:

The coordinator group replicates the decision to initiate the transaction.
The coordinator group broadcasts a transaction prepare message to the nodes of the participant groups.
Upon receiving the prepare message, a participant group decides whether or not to commit the proposed transaction and replicates its vote.
A participant group broadcasts a commit or abort message to the nodes of the coordinator group.
When the votes of all participant groups are known, the coordinator group replicates whether or not the transaction was committed.
The coordinator group broadcasts the outcome of the transaction to all participant groups.
Participant groups replicate the transaction outcome.
When a group learns that a transaction has been committed then it executes the steps of the proposed transaction, the particulars of which depend on the multi-group operation.
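
The steps above can be sketched as a 2PC loop in which every "write to the log" is a replication step inside a group. This is a rough illustration under strong assumptions: `PaxosGroup.replicate` is a hypothetical stand-in that simply succeeds when a majority of members are alive (it does not implement Paxos), messages are modeled as direct function calls, and there is no handling of timeouts or retries.

```python
class PaxosGroup:
    """Stand-in for a Paxos group: an entry counts as replicated only if a
    majority of members are alive. Real Paxos is NOT implemented here."""

    def __init__(self, name, size, alive=None, will_vote_commit=True):
        self.name = name
        self.size = size
        self.alive = size if alive is None else alive
        self.will_vote_commit = will_vote_commit  # hypothetical local policy
        self.log = []

    def replicate(self, entry):
        if self.alive <= self.size // 2:
            raise RuntimeError(f"{self.name}: no majority, cannot replicate")
        self.log.append(entry)


def run_transaction(coordinator, participants, txn):
    # The coordinator group replicates the decision to initiate.
    coordinator.replicate(("init", txn))
    # Each participant group decides, replicates its vote, and reports it.
    votes = []
    for p in participants:
        vote = "commit" if p.will_vote_commit else "abort"
        p.replicate(("vote", txn, vote))
        votes.append(vote)
    # The coordinator replicates the outcome once all votes are known.
    outcome = "commit" if all(v == "commit" for v in votes) else "abort"
    coordinator.replicate(("outcome", txn, outcome))
    # Participants learn and replicate the outcome.
    for p in participants:
        p.replicate(("outcome", txn, outcome))
    return outcome
```

For example, with a coordinator `G2` and participants `G1` and `G3`, `run_transaction(G2, [G1, G3], "split-G2")` returns `"commit"` as long as both participants vote to commit and every group keeps a majority. The point of the structure is that no single node failure blocks the transaction, only the loss of a majority in some group.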

Figure 5 shows an example of this template for the group-split operation. After each group has learned and replicated the outcome (committed) of the split operation at time t3, the following updates are executed by the respective groups: (1) G1 updates its successor pointer to G2a, (2) G3 updates its predecessor pointer to G2b, and (3) G2 executes a replicated state machine reconfiguration to instantiate the two new groups which partition between them G2's original key-range and set of member nodes.

The storage service (discussed next) continues to process client requests during the execution of group transactions except for a brief period of unavailability for any reconfiguration required by a committed transaction. Also, groups continue to serve lookup requests during transactions provided that the lookups are serialized with respect to the transaction commit.

Storage service

To improve throughput for put and get operations on keys, Scatter divides the key range assigned to the Paxos group into sub-ranges and assigns these sub-ranges to nodes within the Paxos group. Each key is only assigned to one primary and is serialized by that primary. The group leader replicates information regarding the assignment of keys to primaries using Paxos, as it does with the state for multi-group operations. Once an operation is routed to the correct group for a given key, then any node in the group will forward the operation to the appropriate primary. The primaries can run Paxos on the keys assigned to themselves concurrently with each other because this does not result in a conflict: it is OK to have different keys updated at the same time, since linearizability is a per key property.
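
Here is a small sketch of the forwarding step: given a key that has already been routed to the right group, find the primary that owns its sub-range. The `GroupRouting` class, the integer keys, and the equal-width sub-ranges are assumptions of mine for illustration; the paper only says the group's range is divided into sub-ranges assigned to nodes.

```python
import bisect


class GroupRouting:
    """Maps sub-ranges of a group's key range [lo, hi) to primary nodes."""

    def __init__(self, lo, hi, primaries):
        self.lo, self.hi = lo, hi
        self.primaries = list(primaries)
        n = len(self.primaries)
        # Start key of each sub-range; equal widths are an assumption.
        self.starts = [lo + (hi - lo) * i // n for i in range(n)]

    def primary_for(self, key):
        if not (self.lo <= key < self.hi):
            raise KeyError("key belongs to another group")
        # Rightmost sub-range whose start is <= key.
        return self.primaries[bisect.bisect_right(self.starts, key) - 1]
```

With `GroupRouting(0, 90, ["a", "b", "c"])`, keys 0 to 29 go to `a`, 30 to 59 to `b`, and 60 to 89 to `c`. Since every key has exactly one primary, operations on different keys can proceed in parallel without conflicting.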

Scatter provides linearizable storage within a given key and does not attempt to linearize multi-key application transactions. A read is served by a primary within the Paxos group which is responsible for that key. The primary uses a leader lease with the rest of the nodes. It is possible to provide weaker consistency reads, as is the default in ZooKeeper, by reading from one node in the group.

Figure 7 plots the probability of group failure for different group sizes for two node churn rates with node lifetimes drawn from heavy-tailed Pareto distributions observed in typical peer-to-peer systems. The plot indicates that a modest group size of 8-12 prevents group failure with high probability. The prototype implementation in the paper demonstrates that even with these very short node lifetimes, it is possible to build a scalable and consistent system with practical performance. This was surprising to me.

Evaluation

They evaluate Scatter in a variety of configurations, for both micro-benchmarks and for a Twitter-style application. Compared to OpenDHT, Scatter provides equivalent performance with much better availability, consistency (i.e. linearizability), and adaptability even in very challenging environments. For example, if average node lifetimes are as short as 180 seconds, therefore triggering very frequent reconfigurations to maintain data durability, Scatter is able to maintain overall consistency and data availability, serving its reads in an average of 1.3 seconds in a typical wide area setting.

This is good performance, but to put things in the context of datacenter computing, the evaluation is done with "small data". When you have many gigabytes (if not terabytes) of data assigned to each node, just copying that data at line speed may take longer than the typical lifetime of the nodes in a high-churn P2P environment.
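
A quick back-of-the-envelope calculation illustrates the concern. The 180-second node lifetime comes from the evaluation above; the 1 Gbps link speed and the data sizes are assumptions I picked for illustration, not numbers from the paper.

```python
def copy_seconds(data_gb, link_gbps):
    """Seconds to transfer data_gb gigabytes over a link of link_gbps gigabits/s."""
    return data_gb * 8 / link_gbps


NODE_LIFETIME_S = 180  # average node lifetime used in the evaluation

for data_gb in (1, 100, 1000):
    t = copy_seconds(data_gb, link_gbps=1.0)  # 1 Gbps is an assumed link speed
    verdict = "exceeds" if t > NODE_LIFETIME_S else "fits within"
    print(f"{data_gb} GB at 1 Gbps: {t:.0f} s, {verdict} a {NODE_LIFETIME_S} s lifetime")
```

Under these assumptions, 1 GB copies in 8 seconds, but 100 GB already takes 800 seconds, several times longer than the node is expected to stay around.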

The paper also compares Scatter against statically partitioned ZooKeeper groups. Here, the key-space partitioning was derived based on historical workload characteristics, but the inability to adapt to dynamic hotspots in the access pattern limits the scalability of the ZooKeeper-based groups deployment. Further, the variability in the throughput also increases with the number of ZooKeeper instances used in the experiment.

In contrast, Scatter's throughput scales linearly with the number of nodes, with only a small amount of variability due to uneven group sizes and temporary load skews. This is because Scatter uses ring and group operations to adapt to changes in access patterns. Based on the load balancing policy in Scatter, the groups repartition their keyspaces proportionally to their respective loads whenever a group's load is a factor of 1.6 or above that of its neighboring group. As this check is performed locally between adjacent groups, it does not require global load monitoring, but it might require multiple iterations of the load-balancing operation to disperse hotspots.
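
To see why a purely local check can need several iterations, here is a toy simulation. The 1.6 threshold is from the paper; everything else is my simplification: groups are just load numbers on a ring, and a repartition is modeled as the two neighbors evening out their loads, which assumes load is spread uniformly over the keys being moved.

```python
THRESHOLD = 1.6  # imbalance factor that triggers a repartition


def needs_repartition(load_a, load_b, threshold=THRESHOLD):
    """Local check between two adjacent groups."""
    hi, lo = max(load_a, load_b), min(load_a, load_b)
    return hi > lo and hi >= threshold * lo


def balance_pass(loads):
    """One round of pairwise checks around the ring; returns repartition count."""
    changed = 0
    n = len(loads)
    for i in range(n):
        j = (i + 1) % n  # successor on the ring
        if needs_repartition(loads[i], loads[j]):
            # Simplification: the pair splits its combined load evenly.
            loads[i] = loads[j] = (loads[i] + loads[j]) / 2
            changed += 1
    return changed


def balance(loads, max_passes=100):
    """Repeat local passes until no adjacent pair is over the threshold."""
    passes = 0
    while passes < max_passes and balance_pass(loads) > 0:
        passes += 1
    return passes
```

Starting from `[100, 10, 10, 10]`, the hotspot is not fully dispersed in a single pass; a second pass is needed before every adjacent pair is within the 1.6 factor. That is the cost of avoiding global load monitoring.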

Hat tip to @DharmaShukla for recommending the paper to me. The paper has inspired some design decisions in Cosmos DB.

MAD questions

1. What could be some alternative designs to solve this problem?
Instead of arranging the Paxos groups in a ring, why not have a vertical-Paxos group overseeing the Paxos groups? The vPaxos box would be assigning key ranges to Paxos groups, coordinating the group operations (split, merge, load-balance) and maintaining the configuration information of the Paxos groups. This would allow adapting to changes in workload and reconfiguring in reaction to node availability in a much faster manner than that of the P2P ring, where load-balancing is done by adjacent groups dispersing load to each other in multiple iterations.

Another problem with Scatter is that it lacks WAN locality optimization. A client may need to go across the globe to contact a Paxos group responsible for the keys that it interacts with the most. WPaxos can learn and adapt to these patterns. So, while we are at it, why not replace the vanilla Paxos in the Paxos group with WPaxos to achieve client access locality adaptation in an orthogonal way? Then the final setup becomes VPaxos overseeing groups of WPaxos deployments.

2. Would it ever be possible to replace datacenters with P2P technologies?
The paper in the introduction seems fairly optimistic: "Our interest is in building a storage layer for a very large scale P2P system we are designing for hosting planetary scale social networking applications. Purchasing, installing, powering up, and maintaining a very large scale set of nodes across many geographically distributed data centers is an expensive proposition; it is only feasible on an ongoing basis for those applications that can generate revenue. In much the same way that Linux offers a free alternative to commercial operating systems for researchers and developers interested in tinkering, we ask: what is the Linux analogue with respect to cloud computing?"

I am not very optimistic...

3. Why don't we invest in better visualizations/figures for writing papers?
This paper had beautiful figures for explaining concepts. Check Figure 4 below: it shows two groups considering different operations concurrently, visualized with thought bubbles. These figures go a long way. It is a shame we don't invest any effort in standardizing and teaching good illustration techniques to support exposition. Using colors is even discouraged because they look faded/blended when printed in black and white. For God's sake, it is 2019, and we should level up our illustration game.