Wednesday, March 30, 2016

Paper review: Building Consistent Transactions with Inconsistent Replication (SOSP'15)

This paper investigates an interesting and promising research problem: how can we implement distributed transactions more efficiently by performing a cross-layer design of the transaction protocal/layer with the underlying distributed storage protocol/layer? This idea is motivated by the observation that there is wasted/duplicated work done both at the transaction layer and at the storage layer. The transaction layer already provides ordering (i.e., linearizability), and, if you look at the distributed storage layer, it also provides strongly consistent replication which orders/linearizes updates often with Paxos. The paper argues that this is overlapping functionality, and leads to higher latency and lower throughput.

Ok, at this point, we have to take a step back and realize that we cannot avoid all coordination between the two layers. The distributed storage layer needs to be consistent with the order of updates at the transaction layer, because it serves reads and you like it to serve the most recent version, not an old version. OK, to fix this, what if we used a multiversion store and stage data for replication without coordination and consistency, and on top of this we commit transactions with 2PC? But who decides the version numbers? If it is the transaction layer, where does it learn the old version number: from the storage layer. And, this cyclic dependency will create a problem when the storage/replication layer is inconsistent and as a result inconsistent version number maybe learned and proposed back. A way to fix this is to use timestamps for version numbers. But that would require precise clock synchronization so that the timestamps also match the transaction ordering. I guess this would be possible to do with atomic clocks, or maybe with PTP.

Actually, it turned out the paper also resorted to a multiversion store, but did not need very precise clock synchronization, and went with NTP synchronization. In fact, they could have gone without any clock synchronization. Clock synchronization is only used for improving the performance, not for satisfying safety. The paper  lays out a principled cross-layer codesign of a linearizable transaction protocol on top of the  unordered/inconsistent replication. And the design is complicated as we will see below. This shows that if we had very precise and dependable clock synchronization to rely on, we can simplify many problems in distributed systems, including this one.

Of course a co-design, or a cross-layer design, has an associated drawback: now these two layers are tangled, and you cannot swap storage layer protocols and transaction layer protocols and expect any combination to work without problems. When using a consistent replication layer (such as Paxos replication), you don't have to worry about swapping transaction layer protocols; use OCC, or 2PC, etc.  But, when using an inconsistent replication (IR) layer, a special purpose transaction protocol that works on top of IR is needed. Any OCC or 2PC protocol will not work because building over IR imposes restrictions on the transaction protocol. The paper proposes TAPIR (Transactional Application Protocol for Inconsistent Replication) as  the transaction layer protocol to accompany IR.

The results are nice as they show significant improvement. They demonstrate TAPIR in a transactional key-value store, TAPIR-KV, which supports linearizable transactions over a partitioned set of keys. The experiments found that TAPIR-KV had: (1) 50% lower commit latency and (2) more than 3x better throughput compared to systems using conventional transaction protocols, including an implementation of Spanner’s transaction protocol, and (3) comparable performance to MongoDB and Redis, widely-used eventual consistency systems.

What is not nice is the complexity this consistency relaxation at the replication layer imposes on the overall system design. IR is complex, and TAPIR involves even more complex ideas/techniques. It looks like this is inevitable complexity to compensate for the inconsistent/unordered replication at the storage layer.

 Inconsistent Replication (IR) protocol for relaxing the consistency of replication

In the IR protocol, instead of ordering of operations, replicas agree on the operation results. What does this mean? The unordered operation set provides the following 3 guarantees.
P1. [Fault tolerance] At any time, every operation in the operation set is in the record of at least one replica in any quorum of f+1 non-failed replicas.
P2. [Visibility] For any two operations in the operation set, at least one is visible to the other.
P3. [Consensus results] At any time, the result returned by a successful consensus operation is in the record of at least one replica in any quorum. The only exception is if the consensus result has been explicitly modified by the application (i.e., transaction) layer protocol through Merge, after which the outcome of Merge will be recorded instead.

What is this Merge business, you ask.
"Some replicas may miss operations or need to reconcile their state if the consensus result chosen by the application (i.e., transaction) protocol does not match their result. To ensure that IR replicas eventually converge, they periodically synchronize. Similar to eventual consistency, IR relies on the application (i.e., transaction) protocol to reconcile inconsistent replicas. On synchronization, a single IR node first upcalls into the application protocol with Merge, which takes records from inconsistent replicas and merges them into a master record of successful operations and consensus results. Then, IR upcalls into the application (i.e., transaction) protocol with Sync at each replica. Sync takes the master record and reconciles application (i.e., transaction) protocol state to make the replica consistent with the chosen consensus results."

Requirements from IR clients

IR imposes some hard requirements on the transaction layer. These limit the transaction layer and complicate its design.

1. Invariant checks must be performed pairwise.
This is the most restrictive requirement. "IR's limitation is that it can only support system guarantees that depend on conflicts between pairs of operations. For example, IR can be used to replicate a lock server but not a sales app that only sells thing in stock. The lock server's mutual exclusion guarantee is a pair-wise invariant, but calculating the store's stock requires a full history of operations that only strong consistency can achieve."

2. Application  (i.e., transaction) protocols must be able to change consensus operation results.

3. Application  (i.e., transaction) protocols should not expect operations to execute in the same order.

I don't provide explanation for the last two, because they are complicated. I am not sure I understand the explanation provided in the paper. This boils down to the following: the transaction protocol needs to sort out the mess of inconsistent ordering at the replication/storage layer, by essentially resorting to retry (with reordering) or abort (due to conflict) transactions and getting it right in the next round. This retry with reordering resembles the Fast Paxos approach. Due to inconsistent ordering at the replicas, this is the price to pay.


TAPIR is designed to work on top of IR's weak guarantees to provide linearizable (strict-serializability) distributed transactions: (1) TAPIR does not have any leaders or centralized coordination, (2) TAPIR Reads always go to the closest replica, and (3) TAPIR Commit takes a single round-trip to the participants in the common case.

TAPIR uses clients as 2PC coordinators. A client Begins a transaction, then executes Reads and Writes during the transaction's execution period. During this period, the client can Abort the transaction. Once it finishes execution, the client Commits the transaction, after which it can no longer abort the transaction. The 2PC protocol will run to completion, committing or aborting the transaction based entirely on the decision of the participants.

TAPIR uses optimistic transaction ordering via NTP timestamps and OCC in order to concentrate all ordering decisions into a single set of validation checks at the proposed transaction timestamp. Under load, its abort rate can be as bad as OCC. The evaluation section shows a graph comparing TAPIR and OCC abort rates.

TAPIR protocol is complicated. To the credit of the authors, they provide a TLA+ model and model checking of TAPIR to support its correctness.

Finally, let's consider this question: what happens when the TAPIR client fails? Since client is the coordinator of the 2PC transaction, this is similar to coordinator failure in 2PC, and leads to blocking and a complicated recovery procedure which necessitates going to 3PC. In other words, TAPIR doesn't solve the 2PC blocking problem of transactions. A recovery procedure at the client level is needed if the client/coordinator blocks or dies. (RIFL, as we discuss below, considers almost the dual problem of IR/TAPIR and results in a cleaner recovery for the client level.)

RIFL comparison

RIFL also appeared on SOSP 15 as the TAPIR paper. RIFL takes a different approach to the problem. RIFL advocates building transactions over a linearizability layer at the replication, the advantage of which is cleaner recovery at the transaction layer. RIFL eliminated the complex recovery procedures of Sinfonia transactions, when the initiator (transaction manager) dies.


The paper uses Retwis Twitter clone and YCSB benchmark for evaluation. Evaluation is done on Google App Engine. Note that peak throughput is lower than both Cassandra and Redis, but those are eventual consistency systems, while TAPIR-KV uses a distributed transaction protocol that ensures strict serializability. Unfortunately the consistency violations in the eventual consistency KVs are not provided. That would be interesting to see and compare. They may turn out to be not so bad, as PBS paper and Facebook consistency papers point out.

Related links

Link to the paper

SOSP conference presentation where Irene does a great job of presenting overview of the ideas/techniques involved

The code for TAPIR is available here 

Irene's blog post on lessons learned from TAPIR

Sunday, March 27, 2016

Paper review: Measuring and Understanding Consistency at Facebook

I have reviewed many Facebook papers in this blog before (see the links at the bottom for full list). Facebook papers are simple (in the good sense of the word) and interesting due to the huge scale of operation at Facebook. They are testaments of high scalability in practice. This one here is no different.

This paper investigates the consistency of the Facebook TAO system. The TAO system is a replicated storage system for Facebook's social graph of billion vertices. In theory, TAO provides only eventual consistency. The interesting result in this paper is this: In practice, Facebook's TAO system is highly consistent, with only 5 inconsistency violations out of a million read requests!

Facebook's TAO system

TAO is basically 2-level memcached architecture backed by a database, as first discussed in NSDI'13 paper. TAO architecture is given in Figure 1. As a prerequisite to TAO consistency discussion in the next section, it will be enough to review how TAO serves read and write requests below.

Read requests from web servers are served by the corresponding leaf cache in the local region (the cache-hit ratios are very high), and if that fails it progresses down the stack to be served by the root cache, and if that also fails, by the local database.

Write requests are complicated. The update needs to be reflected at the master database so the write is routed all the way to the master database and back, following the path 1, 2, 3, 4, 5, 6, 7, 8 in Figure 1. Each of those caches in that route applies the write when it forwards the database's acknowledgment back towards the client.

When the data is updated at the master database, the old data in other caches (those not in the route to the writing client) need to be updated as well. TAO does not go for the update, but goes for the lazy strategy: invalidation of those old data in the other caches. When a read comes from those other paths, the read will cause a miss, and later populate the caches with the updated value learned from the regional database. This lazy strategy also has the advantage of being simple and avoiding inadvertantly messing things (such as ordering of writes). TAO chooses to delete cached data instead of updating it in cache because deletes are idempotent. The database is the authoritative copy, and caches, are just, well, caches.

So, how does the invalidation of caches in other routes work. This proceeds in an asynchronous fashion. The root caches in the master (6') and originating regions (7') both asynchronously invalidate the other leaf caches in their region. The master database asynchronously replicates the write to the slave regions (5'). When a slave database in a region that did not originate the write receives it, the database asynchronously invalidates its root cache (6'') that in turn asynchronously invalidates all its leaf caches (7'').

This asynchronous invalidation of caches result in a "vulnerability time window" where inconsistent reads can happen. In theory, TAO provides per-object sequential consistency and read-after-write consistency within a cache, and only eventual consistency across caches.

Well, that is in theory, but what about in practice? How can one quantify the consistency of Facebook TAO?

Consistency analysis setup

To answer this question, the authors develop an offline checker-based consistency analysis. A random subset of the Facebook graph is chosen for monitoring. Random sampling is done for 1 vertex out of a million, so this is for ~1000 vertices. Every request for these ~1000 vertices are logged during their 12 day experiment (the trace contains over 2.7 billion requests), and later analyzed by the offline checker they developed for consistency violations.

Here are the consistency properties considered (going from stronger to weaker): linearizability, per-object sequential consistency, read-after write consistency, and eventual consistency. Linearizability means that there exists a total order over all operations in the system (where each operation appears to take effect instantaneously at some point between when the client invokes the operation and it receives the response), and that this order is consistent with the real-time order of operations. Per object-sequential consistency means that there exists a legal, total order over all requests to each object that is consistent with client’s orders. Read-After-Write means that when a write request has committed, all following read requests to that cache always reflect this write or later writes.

The offline checker converts dependencies between reads and writes into a dependency graph and checks for cycles in the dependency graph. Cycle means a linearizability anomaly. The same technique is also used for checking weaker local-consistency models, per-object consistency, and read-after write consistency. They check for total ordering versus real-time ordering and can process these weaker local-consistency models accordingly.

Consistency evaluation results 

Check Table 3 for the consistency evaluation results. It turns out the linearizability inconsistencies are very low: 0.0004%   And this gets even lower if you consider read-after-write within a single region/cluster: 0.00006%.

How come inconsistencies are so rare considering TAO provides only eventual consistency in theory? What gives? This stems from the following feature of Facebook TAO: "writes are rare". (Check Table 2.) In the traces you need both write and reads to an object to see inconsistency. And only 10%-25% of objects has both. Even then, a write is rare, and a read does not immediately follow write. Even when a read immediately follows write there is access locality, the read comes from the same region where the cache is already updated. These all contribute to keep inconsistency rate very low, at 4 in a million.

The paper also considers edge update consistency, which return similar results to vertex consistency. An interesting finding here is that 60% of all observed edge anomalies are related to ‘like’ edges.
The high update and request rate of “likes” explains their high contribution to the number of overall anomalies. The high update rate induces many vulnerability windows during which anomalies could occur and the high request rate increases the likelihood a read will happen during that window. The top 10 types (of edges) together account for ~95% of the anomalies. These most-anomalous edge types have implications for the design of systems with strong consistency.

Online practical consistency monitoring for Facebook TAO

Offline checker is expensive so it cannot be used as online. For online, Facebook uses a heuristic to monitor the consistency of their TAO system in production: phi consistency checking. This is is a snapshot query to check if all caches return the same result for a given object.  This is basically a measure of how closely synchronized the caches are within regions (phi(R)) and globally (phi(G)).

Phi consistency is an incomparable query with linearizability and other local-consistency queries, because it is an instant-query, and does not depend on history like the former. So what good is it? It is still good for catching errors. If for some reason (bad configuration, operator error) caches are not maintained properly (e.g., cache invalidations screw up, etc) these phi queries will catch that problem in real-time and warn.

So, what did we learn? 

Although TAO provides only eventual-consistency in theory, it turns out TAO is highly consistent in practice, with only 5 out of a million read-requests resulting in a linearizability violation.

How generalizable is this finding to other eventual-consistency systems? Of course we need to understand this finding with its limitations. This finding applies for a social network system updated with real-time human actions, so the data does not change very fast. And when it does, usually there is no immediate read request from other regions due to access locality, so we don't see much inconsistency. Another limitation to the study is that the way they sample ~1000 vertices out of a billion may not capture the power law model of the social graph. What about celebraties that have many followers, their posts are bound to see more traffic and prone to more inconsistency problems, no? I think the paper should have investigated the vertices with respect to their popularity tiers: high, medium, low.

Finally another limitation is this. This work considers local consistency model. Local consistency means whenever each individual object provides C, the entire system also provides C. This property is satisfied by linearizability, per-object sequential consistency, read-after write consistency, and eventual consistency, but is not satisfied by causally-consistent and transactional consistency models. The paper has this to say on the topic: "Transactional isolation is inherently a non-local property and so we cannot measure it accurately using only a sample of the full graph. This unfortunately means we cannot quantify the benefits of consistency models that include transactions, e.g., serializability and snapshot isolation, or the benefit of even read-only transactions on other consistency models. For instance, while our results for causal consistency bound the benefit of the COPS system, they do not bound the benefit of the COPS-GT system that also includes read-only transactions."

Related links

Conference presentation video of this paper

Probabilistically bounded staleness

Facebook's software architecture 

Facebook's Mystery Machine: End-to-end Performance Analysis of Large-scale Internet Services 

Holistic Configuration Management at Facebook

Scaling Memcache at Facebook

Finding a Needle in Haystack: Facebook's Photo Storage

One Trillion Edges, Graph Processing at Facebook-Scale

Two-phase commit and beyond

In this post, we model and explore the two-phase commit protocol using TLA+. The two-phase commit protocol is practical and is used in man...