Monday, September 17, 2012

Don’t Settle for Eventual: Scalable Causal Consistency for Wide-Area Storage with COPS

Wyatt Lloyd, Michael J. Freedman, Michael Kaminsky, David G. Andersen
Proc. 23rd ACM Symposium on Operating Systems Principles
(SOSP ’11) Cascais, Portugal, October 2011.

Geo-replicated, distributed data stores are all the rage now. These stores provide wide area network (WAN) replication of all data across geographic regions in all the datacenters. Geo-replication is very useful for global-scale web-applications, especially social networking applications. In social networking, your updates to your account (tweets, posts, etc) are replicated across several regions because you may have followers there, and they should get low-latency access to your updates. The paper refers to properties desired of such a geo-replication service with the acronym ALPS (Availability, low Latency, Partition-tolerance, and high Scalability).

Of course, in addition to ALPS, we need to add some consistency requirement to the geo-replication service, because otherwise the updates to a record can arrive at replicas in different/arbitrary orders and expose inconsistent views of the updates to the record. The eventual consistency property that Dynamo advocates is too weak for the programmer to build on to provide other services. The inconsistent views of the updates can escape and poison those other services. On the other hand, strong-consistency (linearizability) is impossible due to CAP. So, causal consistency is what the COPS paper shoots for. Actually, a later paper shows that this is the best you can get under these constraints.

Yahoo!’s PNUTS provides per-record timeline consistency (or per-key sequential consistency), which is a pretty good consistency guarantee. All the writes to a record are serialized by the per-record-primary server responsible for that record and these updates are replicated in the same order (the order the per-record primary determines) to the other replicas. PNUTS, however, does not provide any consistency across records/keys. Achieving such a consistency (more specifically, causal consistency) across records/keys introduces scalability challenges and is the aim of this paper, as we explain in the rest of this summary. (Spoiler: To achieve scalability, instead of using vector clocks, COPS explicitly encodes dependencies in metadata associated with each key's version. When keys are replicated remotely, the receiving datacenter performs dependency checks before committing the incoming version.)

Causal consistency model

Dave Anderson explains the causal-consistency model nicely in his blog post.
Consider three updates by a single thread: write A=1, write B=dog, write B=cow, write A=2. 
In Dynamo, the order at which these updates are applied is unconstrained. In a
remote datacenter, read(A), read(A) might return A=2, A=1 
In PNUTS, the value of “A” could never go backwards: A=1, A=1, … A=2, A=2, … 
But in both Dynamo and PNUTS, the relative values of A and B are unconstrained, and each of these systems could read(A), read(B): {2, dog}. Even though the originating thread set B=cow before it set A=2. 
In COPS, this latter read sequence could not happen. The allowable reads are: {1, }, {1,dog}, {1,cow}, {2, cow} 
In all three systems (in PNUTS, using Read-any), concurrent updates at different datacenters could cause conflicts that invoke conflict handlers, and in this way the three systems are not different.
And this brings us to the issue of conflict handling in COPS.

Convergent conflict handling

In causal consistency you can still have conflicts for concurrent events. (Inherent in a distributed system, two nodes may start updates concurrently, and this leads to a "logical" conflict as both updates are uninformed of each other.) For resolving these and to achieve converge at all datacenters, COPS employs convergent conflict handling, which refers to using an associative and commutative handler. Thomas's last-write-wins rule satisfies this constraint and is used by default, or developer can write application-specific conflict-resolution rules provided that they are convergent.

More concretely, convergent conflict handling requires that all conflicting puts be handled in the same manner at all replicas, using a handler function h. This handler function h must be associative and commutative, so that replicas can handle conflicting writes in the order they receive them and that the results of these handlings will converge (e.g., one replica’s h(a, h(b, c)) and another’s h(c, h(b, a)) agree).

The paper coins the term "causal+" to refer to causal-consistency plus convergence conflict handling together.

Scalability problems in causal+ model

Previous causal+ work (Bayou, Practi) was not scalable, because they achieved causal+ via log-based replay. Log-exchange-based serialization inhibits replica scalability, as it relies on a single serialization point in each replica to establish ordering. Thus, causal dependencies between keys are limited to the set of keys that can be stored on one node, or alternatively, a single node must provide a commit ordering and log for all operations across a cluster.

Inner-workings of COPS

A get operation for a record is local at the closest datacenter and is non-blocking. Since all data is replicated at each datacenter, we can have local get.

A put operation for a record is
1) translated to put-after-dependencies based on the dependencies seen in this site
2) queued for asynchronous replication to other sites/replicas
3) returns done to the client at this point (early reply)
4) asynchronous replication to other sites/datacenters occur.

Each operation maintains dependencies for operations. Replication dependencies are checked at each datacenter, and when they are satisfied the value is updated there.

Nodes in each datacenter are responsible for different partitions of the keyspace, but the system can track and enforce dependencies between keys stored on different nodes. COPS explicitly encodes dependencies in metadata associated with each key’s version. When keys are replicated remotely, the receiving datacenter performs dependency checks before committing the incoming version. The paper shows that by only checking the nearest dependencies you can reduce the number of dependency checks during replication and have a faster solution.

Subleties in definition of availability for put operation

The availability of get and put operations are defined at the datacenter scope and not at the whole system scope. The paper defines availability as: "All operations issued to the data store complete successfully. No operation can block indefinitely or return an error signifying that data is unavailable."

What this means is that the availability of put is satisfied at step 3 when early reply is returned to the client of the datacenter. However, the replication of this put operation to other datacenters in fact can be blocked until put-after-dependencies for the operation is satisfied at other datacenters. Consider a put operation originated at site C, which introduced put-after-dependencies to updates originated at site B. When this put operation (now a put-after operation) is asynchronously replicated to site A at step 4, this put-after will block at site A until site A receives the dependent-upon updates originated at site B. And if site B is partitioned away from site A, the put-after operation originated at C will block at A. Although site C and A are connected the put operation will not be able to complete step 4 as long as B is disconnected from A. I think this is a subtle point, which many readers may not notice even after reading the paper a couple of times.

COPS-GT (Get Transactions) operation

Reading a set of dependent keys using a single-key get interface cannot ensure causal+ consistency, even though the data store itself is causal+ consistent. To illustrate the problem the paper gives a photo album example, where Alice first changes her album access control list (ACL) to "friends only", and then writes a new description of her travels and adds more photos to the album. A natural (but incorrect!) implementation of code to read Alice's album might (1) fetch the ACL, (2) check permissions, and (3) fetch the album description and photos. This approach contains a straightforward "time-to check-to-time-to-use" race condition: when Eve accesses the album, her get(ACL) might return the old ACL, which permitted anyone (including Eve) to read it, but her get(album contents) might return the "friends only" version.

To address this issue, a two-round COPS-GT (Get Transactions) algorithm is introduced in the paper. "In the first round, the library issues concurrent get by version operations to the local cluster for each key the client listed in get transaction. These values, however, may not be consistent with one another, as in the example above. Each get by version operation returns a <value, version, deps> tuple, where deps is a list of keys and versions. The client library then examines every dependency entry <key, version>. The causal dependencies for that result are satisfied if either the client did not request the dependent key, or if it did, the version it retrieved was greater than or equal to the version in the dependency list. For all keys that are not satisfied, the library issues a second round of concurrent get by version operations. The version requested will be the newest version seen in any dependency list from the first round. These versions satisfy all causal dependencies from the first round."

The paper notes that COPS-GT comes with additional cost: compared to the regular version of COPS, COPS-GT is less efficient for certain workloads (e.g., write-heavy) and is less robust to network partitions and datacenter failures.

Concluding remarks

The paper is a long and dense read. While the model is formalized well, I wish some issues got better coverage in the paper. One problem I had with the paper is that just when I think the paper is going to deal with a hard problem, it backs out and starts talking about other peripheral issues. For example, in Section 3.4, the problems with log exchange-based serialization is not elaborated sufficiently for me; the paper skipped to discussing system design too quickly. In Section 4.1, when I thought the paper will be explaining composability of linearizability, the paper skipped to presenting the system components. (Here some papers by Herlihy are referenced saying that "linearizable systems can be implemented scalably by partitioning the keyspace into N linearizable partitions and having clients access each partition independently". But details are missing.)


So how big can these dependencies get? It turns out, the dependencies are tracked by the client library based on the client's history. It seems like the dependency list can grow very long as the client reads many different keys. As a result, one can end up with a pretty long list of dependencies: to every other key read by the client.

Instead, to keep dependencies sweet and short, we can use the Hybrid Logical Clocks (HLC). The loosely synchronized real-time component of HLC would help in truncating the dependency list. The logical clock component of HLC would help in maintaining this dependency list precisely even in the uncertainty region of loosely synchronized clocks.

Finally, the COPS client library solution also did not account for whether the client has backchannels. Using real-time windows can help account for the backchannels as well.

Related links

High Scalability post on COPS and followup
COPS project page
Q&A for COPS
Dave's post clarifying some issues

Friday, September 14, 2012

Scalable distributed data structures for internet service construction

I think this 2000 paper (by Gribble, Brewer, Hellerstein, and Culler) may as well be the original NoSQL paper. The paper starts off by identifying the problems with RDBMS that prohibit scalability. (This is how you would motivate a NoSQL key-value store system even today :-)
  1. RDBMSs have not been designed with Internet service workloads, service properties, and cluster environments in mind, and as a result they fail to provide the right scaling, consistency, or availability guarantees. 
  2. RDBMSs permit users to decouple the logical structure of their data from its physical layout, which is a good thing, but excessive data independence (isolation of application from modifying the layout of data definition and organization) can make parallelization and therefore scaling hard. 
  3. RDBMSs always choose consistency over availability.
The paper then advocates a design sweet-point for achieving scalable and consistent data management for web services. RDBMS is out because it provides a too-high-a-level abstraction with ACID and SQL. Filesystems, on the other hand, expose a too low-level interface with little data independence and less strictly defined consistency guarantees where filesystem elements (files and directories) are directly exposed to the clients and the clients are responsible for logically structuring their application data to using these elements. The paper aims to choose a level of abstraction that provides a well-defined and simple consistency model somewhere in between that of an RDBMS and a filesystem. As a solution, the paper proposes a distributed data structure (DDS) ---in this case a distributed hash table--- and argues that DDS interfaces, while not as general as SQL, are rich enough to successfully build sophisticated services. DDS is touted to achieve the desired web service properties: scalability, fault-tolerance, availability, consistency, durability, concurrency.

DDS and key-value stores 

\epsfbox[37 323 709 584]{arch_overview.eps} }
Figure: High-level view of a DDS: a DDS is a self-managing, cluster-based data repository. All service instances (S) in the cluster see the same consistent image of the DDS; as a result, any WAN client (C) can communicate with any service instance.

DDS is basically a key-value store as we understand it today. As the paper puts it, DDS provides a persistent data management layer designed to simplify cluster-based Internet service construction. A distributed hash-table underlies this data management layer, and it simplifies Internet service construction by decoupling service-specific logic from the complexities of persistent, consistent state management. This allows services to inherit the necessary service properties from the DDS rather than having to implement them themselves. DDS presents a conventional single-host data structure interface to service authors, but in fact it partitions and replicates the data across a cluster. DDS is a cluster-level data structure and is not designed for a WAN.

The novel aspects of a DDS are the level of abstraction it presents to service authors (by providing a data structure at the programming language level), the consistency model it supports, the access behavior (concurrency and throughput demands) that it presupposes, and its design and implementation choices that are made based on its expected runtime environment and the types of failures that it should withstand. SEDA is employed for implementing DDS to achieve high throughput and high concurrency.

DDS architecture

\epsfbox[19 148 519 587]{dds_arch.eps} }
Figure: Distributed hash table architecture: each box in the diagram represents a software process. 

Services using DDS may keep soft-state but they rely on the hash table to manage all persistent state. DDS library contains only soft-state, including metadata about the cluster's current configuration and the partitioning of data in the distributed hash tables across the bricks. The DDS library acts as the 2-phase commit coordinator for update operations on the distributed hash tables. (Dynamo forgoes this consistency step, and avoids the complications discussed next.) The paper explains recovery mechanisms for what happens when coordinator fails during this 2-phase commit. However, this unavoidably leads to many corner cases and complicated to manage and may lead to recovery-induced inconsistencies. The 2-phase commit would also slow down write operations and limit scalability.
\epsfbox[47 290 594 580]{metadata.eps} }

Figure: Distributed hash table metadata maps: The key is used to traverse the DP map trie and retrieve the name of the key's replica group. The replica group name is then used looked up in the RG map to find the group's current membership.

The DDS key-lookup uses a trie-based mapping that can deal nicely with overloaded and hot keys. (For this Dynamo employs a ring-based consistent hashing.) To find the partition that manages a particular hash table key, and to determine the list of replicas in partitions' replica groups, the DDS libraries consults two metadata maps that are replicated on each node of the cluster. First is DP map maintained as trie. And the second map is replica group membership table. These two maps are soft-state and self-cleaning. Instead of enforcing consistency synchronously, the libraries can drift out of date, but lazily updated when they are used to perform operations on the bricks.


I think this paper is very nice introduction to the NoSQL key-value store area, in that you can see the original issues and original design decisions that led to the NoSQL key-value store approach in this paper. The DDS approach of providing a simple data-structure abstraction to the service authors and enabling them to inherit scalability, consistency, fault-tolerance, availability properties from the underlying careful distributed implementation of the data structure ultimately gave us BigTable, MegaStore, and similar distributed data structures.

Two-phase commit and beyond

In this post, we model and explore the two-phase commit protocol using TLA+. The two-phase commit protocol is practical and is used in man...