Unanimous 2PC: Fault-tolerant Distributed Transactions Can be Fast and Simple

This paper (PaPoC'24) is a cute paper. It isn't practical or very novel, but I think it is a good thought-provoking paper. It did bring together the work/ideas around transaction commit for me. It also has TLA+ specs in the appendix, which could be helpful to you in your modeling adventures.

I don't like the paper's introduction and motivation sections, so I will explain these my way.


The problem with 2PC

2PC is a commit protocol. A coordinator (transaction manager, TM) consistently decides, based on the participants' (resource managers, RMs) feedback, whether to commit or abort. If one RM sees/applies a commit, all RMs should eventually apply the commit. If one RM sees/applies an abort, all RMs should eventually apply the abort. The figure below shows 2PC in action. (If you are looking for a deeper dive into 2PC, read this.)

There are three different transactions going on here. All transactions access two RMs X and Y.

T1, the blue transaction, is coordinated by C1, and it ends up committing. C1 starts by sending prepare messages to X and Y, and upon receiving their prepared-acks, it sends them a commit message. Happy path!

T2, the first red transaction, is coordinated by C2, and it ends up aborting. C2 sends prepare messages to X and Y, and since both are locked for T1, it receives back nacks, so C2 rules for an abort.

Finally, T3, the second red transaction, also coordinated by C2, illustrates the hang-up scenario with 2PC. C2 sends prepare messages to X and Y. Y says yes. The message to X is delayed, and there are two branching worlds, W1 and W2. In W1, the lock message from C2 arrives before X fails, and thus C2 can commit before failing itself. In W2, the lock message is delayed until after X fails, so C2 does not commit before failing. Since X and C2 have both failed, W1 is indistinguishable from W2, and hence the transaction's outcome is unrecoverable: Y remains blocked, holding its locks.

So, while 2PC is fast and simple, it is not fault-tolerant.
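
The core decision rule of 2PC is simple enough to sketch in a few lines of Python. This is a toy illustration of the rule described above, not code from any real system: the coordinator commits only if every RM prepares, and a single nack forces an abort.

```python
def decide_2pc(votes):
    """Toy 2PC coordinator decision rule.

    votes: dict mapping RM name -> True (prepared-ack) or False (nack).
    Commit requires a prepared-ack from every participant.
    """
    return "commit" if all(votes.values()) else "abort"

# T1: both X and Y prepare -> commit
assert decide_2pc({"X": True, "Y": True}) == "commit"
# T2: X and Y are still locked for T1, so both nack -> abort
assert decide_2pc({"X": False, "Y": False}) == "abort"
```

The hang-up in T3 is precisely what this rule cannot express: when the coordinator and a participant fail together, the survivors cannot tell which branch `decide_2pc` took.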


2PC amended

The popular approach in industry is to run 2PC on top of Paxos/Raft groups. This is the mule of distributed systems, used by practically all distributed sharded databases out there, including Spanner, CRDB, MongoDB, Yugabyte, etc.

Since a Paxos/Raft group effectively acts as an infallible virtual node, 2PC gets to run on virtual nodes that are always available, and the scenario above is avoided. This does mean more messages/replication, but since these happen in parallel, and since we buy fault-tolerance without hiccups in performance, this is a great tradeoff. There are variations and optimizations possible on this main idea, as shown in Figure 6, but I won't get into them.


U2PC protocol as an alternative

The paper is not very appreciative of the 2PC+Paxos approach. It criticizes that it needs 2f+1 replicas to tolerate f node failures. It says U2PC instead replicates each shard f+1 times to frugally sustain up to f crashes. That is, instead of shards with 3 replicas, U2PC uses shards with only 2 replicas to "tolerate" 1 node failure.

What the paper fails to acknowledge is that the 2f+1 is there to give you a quorum, which gives you reduced tail latency, resistance to stragglers, and fault-tolerance without performance being affected by a single fault. We will revisit this point later when discussing the pitfall of U2PC.

The above figure showcases the U2PC protocol. Here we replicate each RM to tolerate one failure: instead of X we have X1 and X2, and instead of Y we have Y1 and Y2. C1 sends the prepare message to each copy, hears back an ack from everyone, and rules to commit the transaction. The commit message to Y1 gets delayed.

Meanwhile, C2 sends a prepare for another transaction. Y2 acks, since it is unlocked. Y1 is locked, so it sends a nack. C2 now has to abort this transaction at Y2 as well before it reports the abort to the client.

So there are two key ideas. First, we ensure each shard in 2PC is fault-tolerant by replicating it f + 1 times and waiting for *unanimous* LOCK responses from these replicas before committing a transaction. Second, if the LOCK responses are not unanimous, U2PC waits for *unanimous* PRE-ABORT-UNLOCK replies before aborting the transaction.
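
These two ideas can be sketched as a toy decision rule (an illustration of the unanimity requirement, not the paper's code):

```python
def decide_u2pc(lock_acks):
    """Toy U2PC coordinator decision rule.

    lock_acks: dict mapping shard name -> list of per-replica LOCK
    responses (True = ack, False = nack), one entry per replica
    (f+1 replicas per shard).
    Commit requires unanimous acks across ALL replicas of ALL shards.
    """
    if all(all(replies) for replies in lock_acks.values()):
        return "commit"
    # Non-unanimous: must collect unanimous PRE-ABORT-UNLOCK replies
    # before the abort can be reported to the client.
    return "pre-abort-unlock"

# Both replicas of both shards lock -> commit
assert decide_u2pc({"X": [True, True], "Y": [True, True]}) == "commit"
# Y1 is still locked by another transaction, so it nacks:
assert decide_u2pc({"X": [True, True], "Y": [False, True]}) == "pre-abort-unlock"
```

Note that a single nack anywhere blocks the commit, which is exactly why a single slow or dead replica hurts U2PC, as discussed next.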


U2PC's pitfall 

U2PC is not exactly fault-tolerant, at least not in the masking sense. In U2PC, when an RM replica crashes, the system must reconfigure to remove and replace that replica before it is able to continue processing transactions. That is, you lose availability after a single node crash until the system is reconfigured. We will cover reconfiguration in the next section.

This makes U2PC an impractical protocol to deploy in the cloud, where node/VM failures do occur and need to be accounted for. Moreover, U2PC is problematic even in the absence of node failures, due to stragglers: transaction tail latency would be unpredictable because of a single straggler node. Finally, cloud deployments also have planned failures for maintenance and software updates, which would need to be accounted for.

Maybe to soften the blow, the paper notes that the FaRM protocol also has this problem. Well, then FaRM also has a big problem.

A couple of days ago Aleksey complained about this same problem: "So this is the (maybe obvious and not novel) lesson I got here. Mencius is an example of a protocol that is fault-tolerant only on paper. I sometimes also call this an “algorithmic” fault-tolerance: it is safe, and it restores liveness after the reconfiguration/election, but in practice, its performance is so degraded that it almost does not matter."


Reconfiguration/Recovery in U2PC

Ok, we said that U2PC needs to reconfigure/recover even after a single node failure. Here is how that reconfiguration protocol works.

  • The reconfiguration coordinator (CR) gets a lease from the configuration manager (CM) and invalidates old leases. Note that the CM is a Paxos-backed fault-tolerant entity.
  • CR stops Config1 by halting at least one replica per shard and reading its log. This recovers any ongoing transactions.
  • CR loads the recovered state onto Config2.
  • CR verifies its lease is still valid and tells CM that Config2 is ready.
  • CR starts Config2.
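
The flow above can be made concrete with a hypothetical sketch. The `CM` and `Config` classes and their methods are invented stand-ins for the paper's components, stubbed just enough to show the control flow:

```python
class CM:
    """Paxos-backed configuration manager, stubbed with an epoch counter."""
    def __init__(self):
        self.epoch = 0
        self.ready = None
    def acquire_lease(self):          # step 1: also invalidates old leases
        self.epoch += 1
        return self.epoch
    def lease_valid(self, lease):
        return lease == self.epoch
    def mark_ready(self, config):     # step 4
        self.ready = config

class Config:
    """One configuration: a set of RM replicas per shard (stub)."""
    def __init__(self):
        self.state = None
        self.started = False
    def halt_and_read_log(self):      # step 2: stop >=1 replica per shard
        return {"in_flight_txns": []}
    def load(self, state):            # step 3
        self.state = state
    def start(self):                  # step 5
        self.started = True

def reconfigure(cm, config1, config2):
    lease = cm.acquire_lease()
    state = config1.halt_and_read_log()
    config2.load(state)
    if not cm.lease_valid(lease):     # a newer CR took over; back off
        return False
    cm.mark_ready(config2)
    config2.start()
    return True
```

A competing CR that acquires a newer lease invalidates this one, so the validity check before `mark_ready` ensures at most one reconfiguration wins.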

The leases here are just for progress, a best-effort way of ensuring there is one CR active at a given time. Since the CM is Paxos-backed, safety is guaranteed regardless. How does CR recover in-flight transactions in Config2 then? It is as follows:

  • Commits transactions that any live replica has APPLY-UNLOCK'd (meaning commit done)
  • Aborts transactions that any live replica has PRE-ABORT-UNLOCK'd (meaning abort done)
  • Aborts transactions not locked by all live replicas in all participating shards
  • Commits transactions locked by all live replicas in all participating shards
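
The four rules above can be sketched as one deterministic function (the state names are invented string tags for illustration; each live replica reports its state for the transaction in question):

```python
def recover(live_replica_states):
    """Toy sketch of U2PC's deterministic recovery rule.

    live_replica_states: per-replica states for one transaction, across
    all live replicas in all participating shards (at least one live
    replica per shard under the f=1 failure hypothesis). Each state is
    one of: "apply-unlock", "pre-abort-unlock", "locked", "unlocked".
    """
    if any(s == "apply-unlock" for s in live_replica_states):
        return "commit"   # some replica already applied the commit
    if any(s == "pre-abort-unlock" for s in live_replica_states):
        return "abort"    # some replica already started the abort
    if all(s == "locked" for s in live_replica_states):
        return "commit"   # unanimously locked: coordinator may have committed
    return "abort"        # not locked everywhere: commit was impossible

assert recover(["apply-unlock", "locked"]) == "commit"
assert recover(["locked", "locked"]) == "commit"
assert recover(["locked", "unlocked"]) == "abort"
```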

The key thing to remember here is that recovery is decidable and deterministic; the above covers all cases. Consider the problematic case for 2PC: the coordinator and one replica of an RM die. Then, just by looking at the live replicas (at least one of which is guaranteed to exist under the f=1 failure hypothesis), the transaction is recoverable by the above rules.

OK, what about a coordinator failure alone? Coordinators are soft-state, so upon a client retry with the same transaction id, another coordinator can carry the transaction through by looking at the states of the RM replicas, because recovery is deterministic and decidable.


Is this salvageable?

Let's ask the important question. Is there a way to address the tail latency and the need for reconfiguration after even a single RM replica failure in U2PC?

What if we used 2f+1 replication, where any 2 out of 3 replicas of an RM replying is enough? That is, can U2PC work as is with 3 RM replicas instead of 2, treating 2 acks out of 3 as a go for the transaction? This would take care of the tail latency and the need to stop for reconfiguration upon a single replica's unavailability.

Frustratingly enough, the paper doesn't raise this question, maybe because the answer turns out to be negative. But this is such a good learning moment that it is important to answer it.

This doesn't work. Consider 3 replicas of an RM: we find that replica A is locked for T1, B is locked for T2, and C is unavailable. This state is unrecoverable. If we decide to abort both T1 and T2, we may be wrong: T1 (or similarly T2) may have seen a majority with C included (before C became unavailable) and ruled a commit to the client. If instead we decide to wait, we may be stuck forever, because C may have crashed rather than gone temporarily unavailable. This is why we need the Fast Paxos approach (with Paxos as the slow path) to coordinate this.

That is what the Meerkat/Tapir work does. 2PC+FastPaxos requires 3 out of 3 nodes to reply, because for 3 nodes the fast quorum is 3. (For 5 nodes, the fast quorum is 4.) When the fast quorum fails, a coordinator takes over and does regular Paxos to resolve the transaction decision.
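
As a sanity check on those numbers: with classic majority quorums, one common formulation of Fast Paxos's requirement (any two fast quorums must intersect with any classic quorum) gives a fast quorum of size ⌈3n/4⌉ for n acceptors, which matches the sizes quoted above:

```python
import math

def fast_quorum(n):
    """Fast-quorum size for n acceptors, assuming classic majority quorums:
    any two fast quorums and any classic quorum must have a common node."""
    return math.ceil(3 * n / 4)

assert fast_quorum(3) == 3  # all 3 of 3, i.e., unanimous
assert fast_quorum(5) == 4
```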

Well, in the end, U2PC just boils down to flexible quorums applied to Fast Paxos. That is the big reveal: the paper comes clean just before the conclusion section and says that U2PC is a special instance of this idea: "U2PC uses the FastPaxos equivalent of FlexiblePaxos, with unanimous replication and singleton quorums to reduce the shard size to f + 1. Additionally, U2PC uses FastPaxos to PRE-ABORT-UNLOCK transactions if the response is not unanimous rather falling back to a Paxos-based slow-path."


There is something to the "wasteful" Paxos/Raft groups approach. It is batteries-included: fault-tolerance is built in. It doesn't make you mess with reconfiguration as in chain replication. It is also not as complicated as the Fast-Paxos-first approach, which is problematic mainly due to the complexity of its recovery, though improvements are being made there as well. I have yet to see Fast Paxos implemented/deployed in an industrial setting. You use the Paxos/Raft group approach, and you cut down on your headaches about fault-tolerance. This has proved to be a good tradeoff for deployments.
