Paxos derived

Lamport's fault-intolerant state machine replication algorithm

In 1978, Lamport published his classic "Time, Clocks, and the Ordering of Events in a Distributed System". As an application of logical clocks, he presented a distributed replicated state machine algorithm (and then he instantiated that algorithm to solve mutual exclusion as an example). Lamport complains that no one seemed to be aware of the distributed replicated state machine algorithm introduced in the paper:
"This is my most often cited paper. Many computer scientists claim to have read it. But I have rarely encountered anyone who was aware that the paper said anything about state machines. People seem to think that it is about either the causality relation on events in a distributed system, or the distributed mutual exclusion problem. People have insisted that there is nothing about state machines in the paper. I’ve even had to go back and reread it to convince myself that I really did remember what I had written."
I had talked about this distributed replicated state machine algorithm earlier. This algorithm is decentralized to a defect. It is not even tolerant to a single node failure. It assumes failure-free nodes.

The idea of the algorithm is as follows: In order to ensure that processes do not have different views of the order of updates, logical clocks is used to impose a total ordering on the updates. Each process keeps as part of its state the following: copy of the state, logical clock, queue of "modify requests" (with their logical time stamps), list of "known-times", one for every other process. Each process executes an update request on its copy of the state in increasing order of timestamps. For safety, all "known times" from other processes should be later than the time of the request.

The algorithm works as follows:
  1. Push your request in your own queue (timestamped with your logical clock)
  2. Broadcast your request to every node timestamp included.
  3. Wait for replies from all other nodes.
  4. If your request is now at the head of your queue and the known-times for other processes is ahead of its request timestamp (known-times is updated as processes send replies to the update request), enter critical section (where update to the state is done).
  5. Upon exiting the critical section, remove your request from the queue and send a release message to every process.

A fault-intolerant version of Paxos

I recently realized that the algorithm above (from the 1978 paper) constitutes a fault-intolerant instance of Paxos!

This occurred to me after thinking about it in the context of flexible quorums result. The flexible quorums idea (2016) states that we can weaken Paxos’ "all quorums should intersect" assertion to instead "only quorums from different phases should intersect". That is, majority quorums are not necessary for Paxos, provided that phase-1 quorums (Q1) intersect with phase-2 quorums (Q2).

This result allows trading off Q1 and Q2 sizes to improve performance (to the detriment of fault-tolerance)  Assuming failures and resulting leader changes are rare, phase-2 (where the leader tells the acceptors to decide values) is run more often than phase-1 (where a new leader is elected). Thus it is possible to improve performance of Paxos by reducing the size of Q2 at the expense of making the infrequently used Q1 larger. For example in a system of 10 acceptors, we can safely allow any set of only 3 acceptors to participate in Phase2, provided that we require 8 acceptors to participate for Phase1.  Note that the majority quorums (Q1=Q2=6) would be able to mask upto 5 node failures (f=5), whereas the Q1=8 configuration can only with stand upto 2 node failures (f=2) as it needs 8 nodes to be able to perform phase-1 if needed.

So, if you take Q1=N and Q2=1, the Paxos algorithm simplifies to the Lamport's distributed state machine replication algorithm above. Note that Q1=N implies the algorithm cannot tolerate any node failures, i.e., f=0. On the other hand, with this setup, you can combine phase 2 and phase 3 because you are writing to only one node, yourself. So phase 3 is non-existent in that algorithm.

The road from f=0 to Paxos

Ok, let's approach our claim from the other side as well. How do we take that f=0 protocol and strengthen it so that it doesn't block (lose progress) with one node failure?

This is how Phase 3 comes in to play as we add fault-tolerance. In order to tolerate one node crash  (in a fault-masking manner), you need Q2 to be 2. Then things suddenly get complicated, because you are not just writing to yourself, you will also need to write to another node in a guaranteed manner to persist the state. But, another leader may be stealing your turn before you can write to your other Q2 node your decision at Phase 2, so it is not safe to commit the update request! Therefore, Phase 2 clearing, which is phase 3, is needed to make this check, and it helps you replicate your state so it is preserved to the face of one node failure.

This is a point of objection, though. In Lamport's f=0 algorithm, logical clocks (LC) are used for reservation; every node respects LC, and puts requests into its queue ordered by LC. If one node needs to get its update done, it eventually will because the system is making progress. On the other hand, in Paxos, using the ballot numbers, for whose implementation LC could be used, a leader steals the previous leader's turn instead of patiently waiting the previous round to be complete. So what gives?

Well... In Lamport's f=0 algorithm, you could afford to be nice and patiently wait for each node to finish its turn, because f=0, and you are guaranteed to reach what you wait for. But when f>0 and a node can fail, you can't afford to wait for it to finish its turn (otherwise you would have to wait for an eternity in an asynchronous system model), and that is why Paxos is happy to change leaderships, and dueling leaders can arise (even to the point of violating progress).

In sum, something "fundamental" changes when you want to go fault-tolerant and tolerate node failure in an asynchronous system. When you combine faults and full-asynchrony, you get the FLP impossibility result. That means you lose progress! That is why Paxos does not guarantee making progress under a full asynchronous model with a crash failure. However, it preserves safety thanks to its balloting and anchoring system, and will provide progress as soon as the partial synchrony kicks in and weak-complete & eventually-weak-accurate failure detectors are implementable (i.e., when we are out of the realm of the FLP result). So, yes, there is a phase transition going from no faults to faults in asynchronous system.

I thank my PhD students, Ailidani Ailijiang and Aleksey Charapko, for discussion on this idea.

MAD questions

Was this actually how Leslie Lamport come up with the Paxos protocol? Does the 1978 fault-intolerant distributed state machine replication form a basis to evolve a fault-tolerant version?

I am not aware of any paper that makes this connection. Was this connection noticed and mentioned before?


Popular posts from this blog

Learning about distributed systems: where to start?

Hints for Distributed Systems Design

Foundational distributed systems papers

Metastable failures in the wild

Scalable OLTP in the Cloud: What’s the BIG DEAL?

SIGMOD panel: Future of Database System Architectures

The end of a myth: Distributed transactions can scale

There is plenty of room at the bottom

Distributed Transactions at Scale in Amazon DynamoDB

Dude, where's my Emacs?