Consolidating concurrency control and consensus for commits under conflicts

This paper is by Shuai Mu, Lamont Nelson, Wyatt Lloyd, and Jinyang Li and it appeared in OSDI 2016.

This paper presents Janus, another two-in-one shampoo & conditioner solution. Why deal with providing a solution for replication and another solution layered on top for transaction ordering when you can do both together in less time with a unified solution? 

TAPIR had investigated this question first, and Janus does a followup. Instead of using the general transaction model considered in TAPIR, Janus uses one shot transactions, where each transaction contains multiple pieces and each piece has one read/write operation on a single key. 

This reduced power transaction model allows Janus to implement an abort-free transaction ordering capability. A novel insight in Janus is that it notices it is possible to consolidate replication ordering and transaction ordering using the same setup. To order the transactions and replication alike, Janus takes a dependency graph tracking approach, like MDCC and EPaxos. It orders replication and transaction using dependency graph information in the same deterministic way: first perform a topological sort, and the use transaction ids to break ties in partial equivalence classes with cyclic dependencies. 

Since Janus can process these reduced power transactions in an abort-free manner, it can commit transactions in 1 round trip time (RTT) when there is no conflict and at most in 2 RTTs otherwise. Avoiding aborts means that Janus provides high throughput and good latency benefits especially in wide area network setups. 


The Algorithm

Clients send their requests to the nearest coordinator, which becomes responsible for ordering and coordinating the transaction. Janus has pre-accept and commit phases in the fast path (1 RTT commits!) and needs accept phase in case of conflicts (resulting in 2 RTT commits). 

In the pre-accept phase, the coordinator replicates the transaction to *all* participating servers. Each of the participating servers updates its dependency graphs by adding the new transaction and sends the dependency lists for each piece. If the dependency lists for each piece are identical from all the servers,  the fast quorum is achieved, and the coordinator skips the accept phase and sends commit message with the aggregated dependency list. Each participating server updates its local graph accordingly and uses the topological sort on the cycle free graph to decide the execution order. 

(Notice the *all* above. The fast quorum in Janus is all nodes, not just 3/4th of the nodes, and this causes drawbacks as we discuss later.)

In case of a conflict, the coordinator receives different dependency lists and starts the accept phase by aggregating the dependency list and sending it to the replicas with a ballot number, 0 in normal case. If the coordinator receives Accept-OK from the majority, it continues to the commit phase. The accept phase is there to reliably establish/store the aggregated set of dependencies (learned in the fast path attempt) on a majority of nodes in the system. 

Discussion point 1: major drawbacks of Janus

Requiring a reply from all nodes as part of the fast quorum in the pre-accept phase is very limiting. A straggler/slow replica can drag down the performance of the entire system. Tail latency is a big problem. Just receiving acks from a quorum help reduce tail latency. And in contrast requiring a reply from each node ensures that the tail latency will dominate performance. Unfortunately the evaluation of the paper does not consider this.

Another problematic issue with requiring a response from all nodes as part of fast quorum means, a crashed node will default the system to always go to 2 RTT latency for transaction commit. 

The fault-tolerance discussion in the paper is insufficient. The paper discusses only coordinator failure, but does not discuss the participating server/replica failure. I think the resolution for this is reconfiguration and adding a new server. But, recovery in fast Paxos family of protocols are notoriously complicated. Good luck trying to recover and reconfigure in the dependency graph hell without stopping any processing. 

Finally, Janus has to send dependencies over the network to all participating servers. As these dependencies grow, the message sizes bloat. Even when the network bandwidth is ample, large message sizes cause significant delays/bottlenecks due to serialization/deserialization at the CPU.


Discussion point 2: comparison with other ideas

Table 1 provides a nice comparison of different approaches to transaction processing. 2PL+MultiPaxos above is Spanner, and OCC+MultiPaxos is HStore. 

The SLOG work (from VLDB 2019) decentralized the centralized transaction ordering approach in Calvin. SLOG has some nice ideas about adjusting the ownership of keys and improving the locality and latency of commits. As for multihome transactions, SLOG relies on deterministic processing and a locking based solution to avoid two phase commit. Once all parties agree to the plan, processing occurs (mostly) independently on each node, with the system relying on the plan's determinism in order to avoid replica divergence. 

There is also Ocean Vista's visibility tracking approach, which I discuss in point 3 below.

At this point, we have seen easily a dozen of these low-latency WAN transaction approaches. It would be nice to create a graph of all these systems and relate them to with respect to the approaches they share, and how they influenced each other. 


Discussion point 3: Ocean Vista comparison

Ocean Vista is another WAN transaction processing paper we recently discussed. 

Ocean Vista (OV) makes everything so simple. First the transaction T is replicated to all the parties across the datacenters. The parties then check to see that the watermark rises above T to ensure that all transactions including and preceding T has been replicated successfully to all the parties. Then, the execution of T can be done asynchronously at each party.

Since OV is simple, it can handle replica and data center failures easily, in contrast to Janus. In OV, 1 RTT is enough (modulo the additional latency for gossiping) even under conflicts. And OV supports a slightly more general version of transactions in Janus, but still more limited than the transactions in TAPIR.

The catch in OV is that since there is only one visibility watermark for the system, a transaction needs to wait for any earlier transaction even if there is no conflict. Janus ensures that you only wait for dependent transactions, which provides lower latency for independent transactions, but this comes at the cost of tracking every dependency and complicating the protocol and recovery.

Finally, OV uses a crotch: multiversioning. Janus doesn't need that.


Discussion point 4: Atlas comparison

Another protocol we recently discussed is Atlas. Atlas provides efficient fast Paxos over WAN, without aborts, just by reordering respecting the dependency graphs. Atlas is just consensus, and Janus is transaction and consensus. But of course, Janus transactions are limited in scope to one-shot transactions, so they may not be actually that far apart. 

Compared to Janus, I think Atlas made the fault-handling much clearer. It is still complicated of course because of the dependencies involved in fast quorums, but at least Atlas made the process explicit and laid it out clearly. 


Evaluation 

The code for Janus is open source at https://github.com/NYU-NEWS/janus

The evaluation results show that Janus provides low latency and good throughput even with many conflicts.


The evaluation results also show that Janus has small throughput overhead when there are no conflicts. As we mentioned earlier, the paper should have discussed how to address different fault scenarios and provide evaluation under straggler node and crashed node scenarios as well. 

Comments

Popular posts from this blog

Foundational distributed systems papers

Your attitude determines your success

Progress beats perfect

Cores that don't count

Silent data corruptions at scale

Learning about distributed systems: where to start?

Read papers, Not too much, Mostly foundational ones

Sundial: Fault-tolerant Clock Synchronization for Datacenters

Linearizability

Using Lightweight Formal Methods to Validate a Key-Value Storage Node in Amazon S3 (SOSP21)