Thursday, August 9, 2018

The many faces of consistency

This is a lovely paper (2016) from Marcos Aguilera and Doug Terry. It is an easy read. It manages to be technical and enlightening without involving analysis, derivation, or formulas.

The paper talks about consistency. In distributed/networked systems (which includes pretty much any practical computer system today), data sharing and replication brings up a fundamental question: What should happen if a client modifies some data items and simultaneously, or within a short time, another client reads or modifies the same items, possibly at a different replica?

The answer depends on the context/application. Sometimes eventual consistency is what is desired (e.g., DNS), and sometimes you need strong consistency (e.g., reservations, accounting).

The paper points out two different types of consistency, state consistency and operation consistency, and focuses mainly on comparing/contrasting these two approaches.

When I see the terms state versus operation consistency, without reading the rest of the paper, my gut reaction is to conclude that this is a matter of abstraction. If you are implementing the system, you work at the state consistency level. If you are exposing this to the clients, since they do not need to know the internals of the system and would only care about the operations they invoke, then you work at the operational consistency level. The state consistency can help support and provide operational consistency to the clients.

My gut reaction turns out to be pretty accurate, yet the paper goes into more details and elaborates on some subtleties. I am glad to read about the two approaches in more detail. I think, going forward, it would be important to relate/bridge/reconcile these two approaches.

1. State consistency

State consistency pertains to the state of the system (which comprises of the current values of the data items). These are properties that the system should satisfy despite concurrent access and the existence of multiple replicas.

A major subcategory of state consistency is one defined by an invariant ---a predicate on the state that must evaluate to true. For example, in a concurrent program, a singly linked list must not contain cycles. As another example, in a primary-backup system mutual consistency invariant requires that replicas have the same state when there are no outstanding updates.

It is possible to weaken/loosen invariants to include error bounds, probabilistic guarantees, and eventual satisfaction.

2. Operation consistency

State consistency is limited to the properties on state, but in many cases clients care little about the system state and more about the results that they obtain from the system. This calls for a different form of consistency, operation consistency, which pertains to the behavior that clients observe from interacting with the system.

Operation consistency has subcategories based on the different ways to define the consistency property.

2.1 Sequential equivalence

This subcategory defines the permitted operation results of a concurrent execution in terms of the permitted operation results in a sequential execution. Some examples are as follows.

Linearizability is a strong form of consistency. Each operation must appear to occur at an instantaneous point between its start time (when the client submits it) and finish time (when the client receives the response), and execution at these instantaneous points form a valid sequential execution. More precisely, there must exist a legal total order T of all operations with their results, such that (1) T is consistent with the partial order <, meaning that if op1 finishes before op2 starts then op1 appears before op2 in T, and (2) T defines a correct sequential execution.

Sequential consistency is weaker than linearizability. It requires that operations execute as if they were totally ordered in a way that respects the order in which *each client* issues operations. More precisely, the partial order < is this time defined as: op1 < op2 iff both operations are executed by *the same client* and op1 finishes before op2 starts. In other terms, while linearizability imposes across-clients system-level ordering, sequential consistency is content with imposing a client-centric ordering.

The next examples pertain to systems that support transactions. Intuitively, a transaction is a bundle of one or more operations that must be executed as a whole. The system provides an isolation property, which ensures that transactions do not significantly interfere with one another. There are many isolation properties: serializability, strong session serializability, order-preserving serializability, snapshot isolation, read committed, repeatable reads, etc. All of these are forms of operation consistency, and several of them (e.g., serializability, strong session serializability, and order-preserving serializability aka strict-serializability) are of the sequential equivalence subcategory.

2.2 Reference equivalence

Reference equivalence is a generalization of sequential equivalence. Examples of this occur often in database systems.

Snapshot isolation requires that transactions behave identically to the following reference implementation. When a transaction starts, it gets assigned a monotonic start timestamp. When the transaction reads data, it reads from a snapshot of the system as of the start timestamp. When a transaction T1 wishes to commit, the system obtains a monotonic commit timestamp and verifies whether there is some other transaction T2 such that (1) T2 updates some item that T1 also updates, and (2) T2 has committed with a commit timestamp between T1’s start and commit timestamp. If so, then T1 is aborted; otherwise, T1 is committed and all its updates are applied instantaneously as of the time of T1’s commit timestamp.

Another example is one-copy serializability where the replicated system must behave like a single node (no replicas!) and provides serializability.

2.3 Read-write centric

The read-write centric subcategory applies to systems with two very specific operations: read and write. Consistency is defined with respect to the set of writes that could have potentially affected the read.

Read-my-writes requires that a read by a client sees at least all writes previously executed by the same client, in the order in which they were executed. This is relevant when clients expect to observe their own writes, but can tolerate delays before observing the writes of others.

Bounded staleness bounds the time (or number of updates) it takes for writes to be seen by reads: A read must see at least all writes that complete d time (or number of updates) before the read started.

3. State versus operation consistency

Operation consistency is, well..., operational! I was inculcated as part of my PhD training in distributed systems to avoid operational reasoning as it fails to work for concurrent execution. This is also what I teach my students within the first week of distributed systems class, as well. Operational reasoning does not scale since there are too many corner cases to check. For reasoning about distributed systems, we use invariant-based reasoning, which lends itself better to state consistency.

However, the truth is multidimensional. While unsuitable for reasoning/verification, operational consistency has its advantages.

On which type of consistency to use, the paper suggests the following:
"First, think about the negation of consistency: what are the inconsistencies that must be avoided? If the answer is most easily described by an undesirable state (e.g., two replicas diverge), then use state consistency. If the answer is most easily described by an incorrect result to an operation (e.g., a read returns stale data), then use operation consistency. 
A second important consideration is application dependency. Many operation consistency and some state consistency properties are application independent (e.g., serializability, linearizability, mutual consistency, eventual consistency). We recommend trying to use such properties, before defining an application-specific one, because the mechanisms to enforce them are well understood. If the system requires an application specific property, and state and operation consistency are both natural choices, then we recommend using state consistency due to its simplicity."

Notice that while operation consistency section listed more than a dozen consistency levels, the state consistency section just named a couple invariant types, including a vaguely named mutual consistency invariant. This is because the state-consistency is implementation specific, a whitebox approach. It is more suitable for the distributed systems designers/builders rather than users/clients. By  restricting the domain to specific operations (such as read and write), operation consistency is able to treat the system as a blackbox and provides a reusable abstraction to the users/clients.

4. What is really behind the curtain?

Here is another advantageous usecase for operation consistency approach. It provides you an abstraction (i.e., a curtain, a veil) that you can leverage in your implementation. Behind this curtain, you can pull tricks. The paper gives this simple example.
"An interesting example is a storage system with three servers replicated using majority quorums, where (1) to write data, the system attaches a monotonic timestamp and stores the data at two (a majority of) servers, and (2) to read, the system fetches the data from two servers; if the servers return the same data, the system returns the data to the client; otherwise, the system picks the data with the highest timestamp, stores that data and its timestamp in another server (to ensure that two servers have the data), and returns the data to the client. This system violates mutual consistency, because when there are no outstanding operations, one of the servers deviates from the other two. However, this inconsistency is not observable in the results returned by reads, since a read filters out the inconsistent server by querying a majority. In fact, this storage system satisfies linearizability, one of the strongest forms of operation consistency."

This is sort of a lazy evaluation idea. You don't always need to expend work/energy to keep the database consistent, you tidy up the database only when it is queried, only when it needs to perform.

The "TAPIR: Building Consistent Transactions with Inconsistent Replication (SOSP'15)" paper also builds on this premise. Irene puts it this way:
"Today, systems that want to provide strong guarantees use Paxos (or, if you are hipper than me, RAFT), and everyone else uses something cheaper. Paxos enforces a strict serial ordering of operations across replicas, which is useful, but requires coordination across replicas on every operation, which is expensive. 
What we found in the TAPIR project is that Paxos is too strong for some strong system guarantees and, as a result, is wasting work and performance for those systems. For example, a lock server wants mutual exclusion, but Paxos provides a strict serial ordering of lock operations. This means that a lock server built using Paxos for replication is coordinating across replicas even when it is not necessary to ensure mutual exclusion. 
Even more interesting, a transactional storage system wants strictly serializable transactions, which requires a linearizable ordering of transactions but only requires a partial ordering of operations (because not all transactions touch all keys). With some careful design in TAPIR, we are able to enforce a linearizable ordering of transactions with no ordering of operations."

5. What role do these concepts play in Cosmos DB?

Cosmos DB provides 5 well defined operation consistency properties to the clients: strong, bounded, session, consistent prefix, and eventual consistency. These consistency models were chosen because they are practical and useful as signaled by their actual usage by the customers for their production workloads.

And, of course, in order not to leave performance on the table, instead of imposing strong state-based consistency, Cosmos DB global distribution protocols employ several optimizations behind the curtain while still managing to provide the requested operation consistency levels to the clients.

Thus, state-consistency still plays a major role in Cosmos DB, from the core layer developers/designers perspective. The core layer backend team uses TLA+ to identify and check weakest state-consistency invariants that  support and imply the desired operational consistency levels. These weak invariants are efficient and do not require costly state synchronization.

As I mentioned in my previous post, I will post about TLA+/PlusCal translation of consistency levels provided by Cosmos DB here. I also hope to talk about some of the state-consistency invariants and efficiency techniques employed when I start describing the global distribution at the Cosmos DB core layer in my upcoming blog posts.

MAD questions

1. Is it possible to bridge the two approaches?

These two approaches can be complementary like two sides of the same coin.

I think it is easier to go from state-consistency of the system to inferring and designing operation consistency properties.

How about the other way? While checking operation consistency requires analyzing an execution log, by restricting the domain to specific operations (such as read and write), it is possible to achieve application independence and treat the system as a blackbox. The Jepsen library achieves that for distributed systems testing. If the system is a blackbox, or was developed without invariant/state consistency, this operation consistency based testing approach can help in identifying problems. But it is still unclear, how to infer/design state consistency properties/invariants that can help in fixing the problem.

The development of better tools for observability/auditability approaches (such as Retroscope) can help in bridging this gap.

Another effort to help bridge the gap could be to identify/name and create an ontology of invariants used in state consistency approaches. I don't know much work in that direction except this one from VLDB15.

2. Do these two approaches converge at the eventual consistency position?
The paper states the following. "Operational eventual consistency is a variant of eventual consistency (a form of state consistency) defined using operation consistency. The requirement is that each write be eventually seen by all reads, and if clients stop executing writes then eventually every read returns the same latest value."

Eventual consistency is likely a natural convergence point for the state and operation consistency. This reminds of the "Conflict-free replicated data types" paper.

3. Are terms/definitions about consistency consistent yet?
Consistency is an important concept so it emerged and developed in different domains (distributed systems, databases, and computer architecture) simultaneously. And of course different domains used different terminology and confusion arose.

I am not surprised. Even in the distributed systems community, on the restricted topic of distributed consensus, researchers have been using inconsistent terminology for many decades before consensus (pun intended) arose and the terminology converged and standardized.

Consistency definitions can be overwhelming because there are many parameters involved, even without considering transactions.

Kyle Kingsbury recently provided a simplified clickable map of major consistency models (including transactional consistency models).

No comments: