Paxos taught

In my previous posts I have alluded a couple of times to how I teach Paxos. Here I will explain how I go about teaching Paxos.

These are the slides I use in class. (In my slides I reuse a lot of material from the lecture slides of Jeff Chase (Duke). He shared his slides with me under the Creative Commons license, where I get to remix/change the contents with proper credit to him.)

Paxos is a protocol for solving distributed consensus. The problem is easy to state. Consensus requires the following three properties. The first two are safety properties, the last one is a liveness property.
  • Agreement: No two process can commit different decisions.
  • Validity (Non-triviality): If all initial values are same, nodes must commit that value.
  • Termination: Nodes commit eventually.

I first start with a discussion of impossibility results in consensus. A major impossibility result is the coordinated attack (aka two generals) paradox. The result states that there is no deterministic algorithm for reaching consensus in a model where an arbitrary number of messages can be lost undetectably (to the sender). The result is strong, it applies to both asynchronous and synchronous models. Teaching this result is fun. Before I finally explain the proof of impossibility, I challenge the students to come up with a solution, and we shoot each "solution" down together.

Even after assuming no message loss, another impossibility result haunts consensus. Fisher-Lynch-Paterson impossibility result says that there is no deterministic algorithm for reaching consensus under the asynchronous system model in the presence of just 1 crash failure. Unlike the coordinated attack result FLP applies only for asynchronous systems, partially-synchronous, or synchronous systems are safe from this impossibility result. However, it is important to note that even for the most synchronous cluster of computers a heavy load of requests can break all the timeliness assumptions and turn the system into an asynchronous one in effect.

Next I discuss another related impossibility result, the CAP theorem, which had a lot of impact in the context of large scale web services development. CAP theorem says that your system can only have two of the three properties: Consistency, Availability, Partition-Tolerance. CAP theorem has been highly influential in large-scale web services as it succinctly captured the tradeoffs that follow from the above impossibility results in that domain. The reason I teach CAP theorem is because it provides a map for the distributed system protocols I teach later, including Paxos and Dynamo. Paxos provides CP because the system is always consistent, even in a partition, but a reachable replica may deny service without agreement of the others. (Actually Paxos provides availability even in a partition if the server the request hits is part of a majority quorum. The requests that hit a minority island are the ones that get rejected.) Amazon Dynamo, in contrast, provides AP because the system is always available if any replica is up and reachable, even in a partition, but might not be consistent, even without a partition. Dynamo provides eventual consistency approximation instead of a consistency guarantee. (Footnote: In Daniel Abadi's alternative proposal to CAP theorem, Paxos is PCEC, and Dynamo is PAEL.)

Then I start discussing Paxos. First I motivate Paxos by pointing out how important Paxos is, and how Paxos got widely adopted in datacenters for the tasks that require consistency. The importance of Paxos is that it satisfies the safety properties even under asynchrony and arbitrary message loss. This does not conflict with the impossibility results discussed above. Those impossibility results said that you can't achieve consensus (both safety and liveness at the same time), they did not say you have to sacrifice safety under message-losses or asynchrony conditions. So Paxos preserves safety under all conditions and achieves liveness when conditions improve outside the impossibility realm (less message losses, some timing assumptions start to hold). Another reason Paxos is important is because it is simple and comes with a formal proof.

I caution my students that although Paxos is a simple protocol, it may take a lot of time and effort to really internalize and grok the protocol. After all this protocol has been mostly elusive to distributed systems people for about 20 years. Leslie Lamport first formulated Paxos in 1980s, and submitted his first paper on it on 1990. The paper got rejected :-), and finally appeared in print in 1998. He gave some talks about the protocol, casting it as a historical part-time parliment protocol from the Greek Paxos islands (hence the name of the protocol). Nobody got it, except a few (Butler Lampson, Nancy Lynch). The protocol stayed underground mostly until the 2000s. Here is Lamport's discussion about the interesting history behind the Paxos protocol.


This post has already gotten long, so I will not go into an explanation of how/why Paxos works. Maybe I will have a "Paxos Taught 2" post for this later. You can see slides 10-40 for an explanation of Paxos. While teaching, I bring 5-6 students to the board to perform live reenactments the Paxos consensus algorithm under several fault scenarios. So the Paxos classes are generally the most fun ones in my distributed systems course.

After describing and reiterating Paxos, which takes 2-3 classes at least, I show how Paxos is used in real world. I discuss the Paxos Made Live paper which discusses the Google Chubby system. That work provides some optimizations (which do not violate safety and Paxos's guarantees) to improve the efficiency and performance of the system.

Finally, I give another application of Paxos in the transaction commit problem. 2PC is blocking if the leader dies, supposedly the 3PC protocol takes care of the blocking, but it has many special cases to consider, and is unproven. Leslie Lamport and Jim Gray, two giants of distributed systems, have proposed using Paxos to solve this transaction commit problem. The obvious solution is to use Paxos to record the transaction manager(TM)'s decision in 2PC, so that if the TM fails, the new TM learns this decision unambigiously from that Paxos box. However, Lamport and Gray suggest using Paxos to record each resource manager(RM)'s decision, so the TM (original or new TM) can learn each RM's vote unambiguously from their corresponding Paxos boxes. Although this approach requires more messages, it's advantage is that it shaves off another message latency from the commit compared to the obvious solution. At the end, this paxos-commit requires 5 message delays, which is very reasonable given that 2PC requires 4 message delays.

Comments

thattommyhall said…
This comment has been removed by the author.
Anonymous said…
Thank you for the great article.

There is a broken URL to description of transaction commit problem from MS research at the end of the article.
Unknown said…
Hello,

Great article! Thank you very much!

In your slides, you mentioned on page 32:

A round anchors if a majority of agents hear the command (2a) and obey.

However, the example on page 30 shows that even if the minority (1 out of 4) of the acceptors receive accepted value for the previous round, the value is still anchored.

Is the example in-complete or am I misunderstanding some point?

Thanks,
Tao
Anonymous said…
Hi Prof. Murat,

I go to a crappy engineering college in India, where we have even crappier lecturer for the distributed systems class.

Your blog has been of IMMENSE help in understanding these difficult concepts/systems.

Thanks you for sharing your teaching skills with the rest of the world. And please the good work.

Regards.
-Ramesh
Murat said…
Thanks for the kind words Ramesh.
I try to make myself useful.
Unknown said…
This comment has been removed by the author.

Popular posts from this blog

Hints for Distributed Systems Design

Learning about distributed systems: where to start?

Making database systems usable

Looming Liability Machines (LLMs)

Foundational distributed systems papers

Advice to the young

Linearizability: A Correctness Condition for Concurrent Objects

Scalable OLTP in the Cloud: What’s the BIG DEAL?

Understanding the Performance Implications of Storage-Disaggregated Databases

Designing Data Intensive Applications (DDIA) Book