Ok, at this point, we have to take a step back and realize that we cannot avoid all coordination between the two layers. The distributed storage layer needs to be consistent with the order of updates at the transaction layer, because it serves reads and you like it to serve the most recent version, not an old version. OK, to fix this, what if we used a multiversion store and stage data for replication without coordination and consistency, and on top of this we commit transactions with 2PC? But who decides the version numbers? If it is the transaction layer, where does it learn the old version number: from the storage layer. And, this cyclic dependency will create a problem when the storage/replication layer is inconsistent and as a result inconsistent version number maybe learned and proposed back. A way to fix this is to use timestamps for version numbers. But that would require precise clock synchronization so that the timestamps also match the transaction ordering. I guess this would be possible to do with atomic clocks, or maybe with PTP.
Actually, it turned out the paper also resorted to a multiversion store, but did not need very precise clock synchronization, and went with NTP synchronization. In fact, they could have gone without any clock synchronization. Clock synchronization is only used for improving the performance, not for satisfying safety. The paper lays out a principled cross-layer codesign of a linearizable transaction protocol on top of the unordered/inconsistent replication. And the design is complicated as we will see below. This shows that if we had very precise and dependable clock synchronization to rely on, we can simplify many problems in distributed systems, including this one.
Of course a co-design, or a cross-layer design, has an associated drawback: now these two layers are tangled, and you cannot swap storage layer protocols and transaction layer protocols and expect any combination to work without problems. When using a consistent replication layer (such as Paxos replication), you don't have to worry about swapping transaction layer protocols; use OCC, or 2PC, etc. But, when using an inconsistent replication (IR) layer, a special purpose transaction protocol that works on top of IR is needed. Any OCC or 2PC protocol will not work because building over IR imposes restrictions on the transaction protocol. The paper proposes TAPIR (Transactional Application Protocol for Inconsistent Replication) as the transaction layer protocol to accompany IR.
The results are nice as they show significant improvement. They demonstrate TAPIR in a transactional key-value store, TAPIR-KV, which supports linearizable transactions over a partitioned set of keys. The experiments found that TAPIR-KV had: (1) 50% lower commit latency and (2) more than 3x better throughput compared to systems using conventional transaction protocols, including an implementation of Spanner’s transaction protocol, and (3) comparable performance to MongoDB and Redis, widely-used eventual consistency systems.
What is not nice is the complexity this consistency relaxation at the replication layer imposes on the overall system design. IR is complex, and TAPIR involves even more complex ideas/techniques. It looks like this is inevitable complexity to compensate for the inconsistent/unordered replication at the storage layer.
Inconsistent Replication (IR) protocol for relaxing the consistency of replicationIn the IR protocol, instead of ordering of operations, replicas agree on the operation results. What does this mean? The unordered operation set provides the following 3 guarantees.
P1. [Fault tolerance] At any time, every operation in the operation set is in the record of at least one replica in any quorum of f+1 non-failed replicas.
P2. [Visibility] For any two operations in the operation set, at least one is visible to the other.
P3. [Consensus results] At any time, the result returned by a successful consensus operation is in the record of at least one replica in any quorum. The only exception is if the consensus result has been explicitly modified by the application (i.e., transaction) layer protocol through Merge, after which the outcome of Merge will be recorded instead.
What is this Merge business, you ask.
"Some replicas may miss operations or need to reconcile their state if the consensus result chosen by the application (i.e., transaction) protocol does not match their result. To ensure that IR replicas eventually converge, they periodically synchronize. Similar to eventual consistency, IR relies on the application (i.e., transaction) protocol to reconcile inconsistent replicas. On synchronization, a single IR node first upcalls into the application protocol with Merge, which takes records from inconsistent replicas and merges them into a master record of successful operations and consensus results. Then, IR upcalls into the application (i.e., transaction) protocol with Sync at each replica. Sync takes the master record and reconciles application (i.e., transaction) protocol state to make the replica consistent with the chosen consensus results."
Requirements from IR clientsIR imposes some hard requirements on the transaction layer. These limit the transaction layer and complicate its design.
1. Invariant checks must be performed pairwise.
This is the most restrictive requirement. "IR's limitation is that it can only support system guarantees that depend on conflicts between pairs of operations. For example, IR can be used to replicate a lock server but not a sales app that only sells thing in stock. The lock server's mutual exclusion guarantee is a pair-wise invariant, but calculating the store's stock requires a full history of operations that only strong consistency can achieve."
2. Application (i.e., transaction) protocols must be able to change consensus operation results.
3. Application (i.e., transaction) protocols should not expect operations to execute in the same order.
I don't provide explanation for the last two, because they are complicated. I am not sure I understand the explanation provided in the paper. This boils down to the following: the transaction protocol needs to sort out the mess of inconsistent ordering at the replication/storage layer, by essentially resorting to retry (with reordering) or abort (due to conflict) transactions and getting it right in the next round. This retry with reordering resembles the Fast Paxos approach. Due to inconsistent ordering at the replicas, this is the price to pay.
TAPIRTAPIR is designed to work on top of IR's weak guarantees to provide linearizable (strict-serializability) distributed transactions: (1) TAPIR does not have any leaders or centralized coordination, (2) TAPIR Reads always go to the closest replica, and (3) TAPIR Commit takes a single round-trip to the participants in the common case.
TAPIR uses clients as 2PC coordinators. A client Begins a transaction, then executes Reads and Writes during the transaction's execution period. During this period, the client can Abort the transaction. Once it finishes execution, the client Commits the transaction, after which it can no longer abort the transaction. The 2PC protocol will run to completion, committing or aborting the transaction based entirely on the decision of the participants.
TAPIR uses optimistic transaction ordering via NTP timestamps and OCC in order to concentrate all ordering decisions into a single set of validation checks at the proposed transaction timestamp. Under load, its abort rate can be as bad as OCC. The evaluation section shows a graph comparing TAPIR and OCC abort rates.
TAPIR protocol is complicated. To the credit of the authors, they provide a TLA+ model and model checking of TAPIR to support its correctness.
Finally, let's consider this question: what happens when the TAPIR client fails? Since client is the coordinator of the 2PC transaction, this is similar to coordinator failure in 2PC, and leads to blocking and a complicated recovery procedure which necessitates going to 3PC. In other words, TAPIR doesn't solve the 2PC blocking problem of transactions. A recovery procedure at the client level is needed if the client/coordinator blocks or dies. (RIFL, as we discuss below, considers almost the dual problem of IR/TAPIR and results in a cleaner recovery for the client level.)
RIFL comparisonRIFL also appeared on SOSP 15 as the TAPIR paper. RIFL takes a different approach to the problem. RIFL advocates building transactions over a linearizability layer at the replication, the advantage of which is cleaner recovery at the transaction layer. RIFL eliminated the complex recovery procedures of Sinfonia transactions, when the initiator (transaction manager) dies.
EvaluationThe paper uses Retwis Twitter clone and YCSB benchmark for evaluation. Evaluation is done on Google App Engine. Note that peak throughput is lower than both Cassandra and Redis, but those are eventual consistency systems, while TAPIR-KV uses a distributed transaction protocol that ensures strict serializability. Unfortunately the consistency violations in the eventual consistency KVs are not provided. That would be interesting to see and compare. They may turn out to be not so bad, as PBS paper and Facebook consistency papers point out.
Related linksLink to the paper
SOSP conference presentation where Irene does a great job of presenting overview of the ideas/techniques involved
The code for TAPIR is available here
Irene's blog post on lessons learned from TAPIR