Thursday, September 26, 2019

Do leases buy us anything?

Consider the atomic storage problem, which is a simpler problem than consensus. CAP theorem says that when there is a partition, you will need to sacrifice strong-consistency or high-availability.

Using leases, it is possible to sacrifice availability for sometime (until lease expires), and reconfigure the storage system (often by shrinking the quorum) to keep providing availability. Consistency is preserved throughout, and availability is sacrificed only during lease expiration time. This is a good tradeoff. (I am going to bring up the question of whether this may help circumvent CAP at the end of the post.)

But is this power unique to leases? Is there a way to explain this in an alternate way using only failure detectors instead of leases? This is the question I want to explore here.

Many distributed systems implement leases with just countdown timers and without NTP timestamps. This is because in the short-term the rate of clocks at processes don't drift too much.

So maybe, we can simulate a lease expiration by a node suspecting itself as failed. If the other nodes have knowledge of the timeout on the failure detector of this expired node, they can wait that time out, and start reconfiguration of the storage system after that. While a failure detector requires only unilateral decision, this explanation requires that other nodes know about the bound on the failure detector at the expired node. Let's see if we can do without that requirement.

For reconfiguration, two options are possible. One is decentralized reconfiguration implemented over the nodes themselves, and the other is by using a reconfiguration box (often implemented by Paxos) as the arbiter.

An example of the former is the dynamic atomic storage without consensus (2011) setup. There, majority is needed to pass reconfiguration. But there, even for a consistent read, majority is needed. So we don't really need a failure-detector at the partitioned node to drop itself of the system. It won't be able to serve reads or writes by itself anyways.

An example of the latter is chain replication. Consider that the tail node is partitioned. The original chain cannot complete serving writes anymore, because the tail is partitioned. The tail can still serve reads though for the clients that contact it directly for some time.

Here the reconfiguration box can reconfigure the chain to remove the partitioned tail. To explain this with failure detector terminology, let's say the reconfiguration box has its failure detector itself. The failure detector suspects the tail, and passes a reconfiguration with a higher epoch and takes the tail off. But before reconfiguring the chain to remove the original tail, the reconfiguration box should make sure that the tail stops serving reads to client. How can this be accomplished? The reconfiguration message will not reach the partitioned old tail. So the tail should know about the reconfiguration box's failure detector timeout duration. Without this knowledge, without tying the reconfiguration server's failure detector about the tail to the tail's failure detector about itself, we wouldn't know when it is safe to switch from the old configuration and start the new configuration. (The alternative is that the tail checks with the reconfiguration box for each operation, so it confirms its status in the configuration. Even with this one, due to asymmetric message delay, the reconfiguration box may need to wait some duration before reconfiguring.)

Leases buy us time and optionality

Using leases, a node does not have to check its status in a configuration for each operation. Provided that the lease holds, the nodes status in the configuration is unchanged. Using leases on the acceptors, a Paxos leader can serve reads locally, without checking with a quorum. And using leases, a tail in chain replication can serve reads without checking if it is still the tail. This translates to efficiency because checking your status in the configuration is not done for each operation, but rather batched and done once for each lease renewal.

In terms of trading off availability to get availability, it seems like leases provides more information than a unilateral failure detector and can buy you consistency in the presence of a partitioned node. This comes with the loss of some availability because reconfiguration for quorum shrinking needs to wait the lease time to expire.

Leases also provide an advantage for reconfiguration in the presence of partitions. Leases sacrifice availability to restore availability (while preserving safety) for the storage system. This functionality requires more than the unilateral decision taken by failure detectors, but rather a bilateral information on the expiration of the timeouts.

MAD questions

1. Did we circumvent the CAP result?
CAP defines availability as "Each request for read or write eventually receives a response." Even with leases and reconfiguration, there will be a request that does not receive a response. In my example, the old tail will not respond to read request after the expiration of the lease. But since at that point the old tail is not part of the system anymore, why does that count against the availability of the system? But the formulation of CAP is too strict and defines the system as the initial set of nodes in the system. That formulation prohibits any reconfiguration of the system even when there are no partitions.

I think we need more refined versions of CAP. It has a very rough granularity formulation.

Tuesday, September 24, 2019

Some of my peculiarities

Daniel Lemire recently wrote about his kindergarten experience, how he couldn't tie his shoes and couldn't memorize numbers for counting. I had similar experiences. I was a smart kid (not just my mom's opinion :-), but I couldn't memorize how to count to 5 until first grade (which I started at 5 years old --which is another story). My mom was a primary school teacher, and she was worried that she couldn't teach me how to count. She invented some rhyming words for numbers to help me memorize them, but apparently that didn't work because I would say the rhyming words instead of the numbers.

Even up to third grade, I would occasionally put on my shoes the wrong foot. I couldn't learn how to tie my shoes properly till middle/high school. In middle/high school I started doing a single loop tie. I learned how to do double loop tie only after university. On the other hand, I had decent coordination. I played soccer and basketball fine.

I had a very hard time learning days of the week and months in a year (both in Turkish and later in English). It took me much much later than other kids to learn these. I am still not good with that stuff and mix the months occasionally. For some holidays, I still don't know which month they fall in. And overall I don't have a good appraisal of time.

I am clumsy and awkward in a peculiar way. I activated the fire alarm in my rental apartment at my sabbatical by mistake.

I can also be dense in stuff I don't like to learn, such as stock market, investing, etc. Once I label a topic as bad/boring/useless/tedious, it is as if I make it a point not to understand anything about it.

I procrastinate on some things for no good reason. In my twenties, I was very bad with paying bills, and got many late fee charges. This was a stupid form of procrastination. I would be stressed about these, but not do anything about them. (This wasn't depression. Luckily I have been able to avoid depression so far.) In my thirties, I learned how to be organized thanks to Emacs org-mode.

Being in academia, I was in the presence of people with similar peculiarities, and didn't stick out much. One of my lab mates at MIT kept procrastinating paying the utility bills in his apartment and his services disconnected couple of times. Then his mom started paying them instead. This guy was one of the smartest kids I knew, but he just couldn't pay his bills on time. Another person I know, kept his  car wash subscription in a city he moved out of for 6 months, because he wouldn't make a simple phone call---even after I started pleading him to make the call. (I also hate making phone calls.)

Ok, I am on a roll, let's keep going.

I can't touch peaches or apricots. Touching them makes me feel freezing cold. The mere mention of them used to make every hair on my arms stand up. I got better at tolerating their mention and appearance. Similar with touching chalk.

I have sensitive inner-ears. I get dizzy in an elevator and in a plane during takeoff. I cannot read anything in a car or bus.

When multiple conversations are going on in a dinner party, I cannot concentrate and understand what people say to me, because I can't tune out other conversations. When there is a TV in a room, I can't function, even thought the TV is on mute. Long flights are my nightmare: every seat there is a TV playing, and I go into a crazed/tired state after a couple hours.

I cannot function well in an open-office environment. I get depleted when I work in an open-office for a long duration. I love love love my private office.

I am not a very logical learner and a clear/organized thinker. I have a messy brain, and a messy learning/understanding style. I have much better intuition than rot logic learning power. I can form relationships/analogies with different things quickly and effortlessly.

I am good at catching typos. It is as if typos on a paper jump from the page to my attention. And in general I have a good visual memory.

I think I am good at reading emotions, and I am a pretty (probably too) sensitive person for emotions of others.

I think my peculiarities are just peculiarities. Like the stereotypical absentminded/clumsy professor. But, maybe, I am somewhere on the spectrum. I am not too bothered by this. But this occasionally leads to problems. My wife thinks I am feigning ignorance about some of our house routines (like whereabouts of things) and some of my behavior at social gatherings, even though I am genuinely confused and lost. Some of these, unfortunately, I cannot fix, even though I try hard. Probably some people find me peculiar, clumsy, or child-like, though I do a very good job of hiding this from people I interact infrequently.

MAD questions

1. What are your peculiarities?
I expect that some of you share some of my peculiarities. And some has other quirks. I like people with quirks, as long as they are harmless and not infringing to others.

2. Here is another good thing going for me, but probably this is a trick and not a peculiarity
I can stop my hiccups by just noticing that I have a hiccup and thinking it is silly to have a hickup. This works reliably and within one second for me. I picked this trick up 5-6 years ago after I saw a mention of this on a subreddit. Ever since that I didn't have a hiccup session with more than 2 hiccups (at which point I notice the hiccup and use the trick). Let's not jinx this, this is one good thing going for me.

Sunday, September 22, 2019

Teaching Paxi

Paxos family of protocols (which I refer to as Paxi) is immensely useful for building distributed systems due to their excellent fault-tolerance properties. Many cloud computing services and distributed databases employ Paxi for state machine replication (SMR). Paxi preserve the safety of consensus problem (no two nodes commit different values for the same slot) even to the face of a fully asynchronous execution, crash faults, message losses, and network partitions. Paxi satisfy liveness of consensus problem (some value is eventually committed for the slot) when the system moves outside the realm of the coordinated attack and FLP impossibility results.

Paxi are perennially misunderstood and their sophistication underrated. While there has been a lot of work on Paxi, we have been able to explore only a fraction of the algorithm design space. A striking evidence of this arrived in 2016, where we had a flexible quorum breakthrough after 30 years, which no one had anticipated.

There is a need to unpack Paxi and explain them better for both students and practitioners alike. Paxi provide a great opportunity for teaching the principles of distributed systems and we should seize on this opportunity.

Problems with teaching Paxi

Most coverage of Paxos in courses is dry and superficial: the Paxos protocol is described and the students memorize the protocol. While the Paxos protocol looks simple, it has a lot depth and subtleties. It is not possible to appreciate these and truly understand distributed consensus by just memorizing the Paxos protocol. To understand Paxos, you should not only understand how it works, but also why it works, and what cornercases it prevents, and how else it could be realized.

Raft has been proposed as a simple explanation of consensus and Paxos. While many developers love the operationalized explanation style of Raft and the implementation that accompany it, tying the explanation to a constrained implementation is unnecessarily restrictive. The generality of Paxos family of protocols are lost, and the context and principles of distributed consensus is not communicated satisfactorily.

We need better explanations of not just Paxi but the context and derivation of these protocols, explaining why each action is needed and why this is a hard problem. However, explaining the protocol solely in a declarative way using derivation is also hard to follow for students, and some intuition should be provided as well. The students should also be provided with ample opportunity to get a lot of hands-on exercise and experience with the protocols, their implementation, and their integration into practical applications/systems.

Every year my favorite part of the distributed systems class is when I get to teach Paxos for two weeks. Every year, I am able to do a better job of it by gradual/gradient-descent improvement. But these days, I am planning an overhaul of how I teach Paxos, and put some real effort behind this to realize my ideal setup for teaching Paxi. Here is my explanation of this setup in terms of the content of the module and supporting tools for it.

Course module on Paxi

The Paxi course module will provide context and teach the principles and derivation of the protocols.

To teach about the context, the module will cover questions such as: What makes the distributed consensus problem so hard?  How has our understanding of distributed consensus change after "Attacking Generals" and FLP impossibility results? What are the cornercases that haunt correctness and liveness? Covering these will help the students appreciate the depth and ingenuity of Paxi.

To teach about derivation and principles of the protocol, the module will employ a stepwise refinement from the high-level consensus specification to an intermediate round-based abstraction (for which Heidi's generalized consensus framework is a good candidate), and then to the Paxos algorithm.  The module will explore both leader-based (as in Paxi) and non-leader-based (as in Ben-Or and Texel) refinements of this round-base intermediate specification, and will discuss the advantages and disadvantages of each approach.

The module will also relate distributed consensus to decentralized protocols in the blockchain domain. By showing a Texel to Avalanche transition, the module will tie consensus (where Texel shows an alternative solution to leader based consensus as in Paxos) to blockchains (where Avalanche shows how to scale and operationalize Texel to large-scale decentralized environments).

Supporting tools

To allow the students to experiment with the protocols at the design level, we will provide TLA+/Pluscal modeling of Paxos and variants. With these, the students will be able to model-check Paxi protocols and experiment with modifications to see which safety and progress properties are satisfied under different environments.

To enable the students to experiment at the implementation level, we will use our Paxi framework implemented in Go (available as opensource at Our Go Paxi framework provides a leveled playground for protocol evaluation and comparison. The protocols are implemented using common building blocks for networking, message handling, quorums, etc., and the developer needs to only fill in two modules for describing the distributed coordination protocol. Paxi includes benchmarking support to evaluate the protocols in terms of their performance, availability, scalability, and a linearizability checker to check the protocol for consistency. We have a dozen Paxos variants already implemented in Paxi. We will invite students to implement more, especially Byzantine versions of Paxos protocols, and consider tie-ins to permissioned blockchain protocols. In order to link the high-level design to the implementation, we will provide a mapping from TLA+ model to the Paxi implementation of distributed consensus.

In order to showcase the integration of Paxi protocols to distributed systems applications, we will use the globally distributed database FleetDB ( as the hands-on application. We will first extend FleetDB to be instantiable with different Paxi protocols as plugins, and make the Paxi protocols exchangeable based on workload and environment. FleetDB can lead the way and help distributed databases for integrating Paxi protocols in their replication operation. Currently only a handful databases (including Spanner and CockroachDB) use Paxos as part of their replication. Although Paxos provides excellent fault-tolerance properties and prevent any loss of consistency, the vanilla Paxos protocol is not a good fit for WAN deployments, and performs poorly under certain topologies and workloads.

Couple other tools are worth mentioning. One is the DSLabs tool from UW for model checking distributed systems projects implementations. Another is the DistAlgo tool from SUNY Stonybrook.

MAD questions

1. What is your favorite teaching technique for Paxi?
While teaching Paxos, I bring 5 students to the board to perform live reenactments the Paxos consensus algorithm (each one simulating a Paxos node) under several fault scenarios. This part is the most fun one and most beneficial in my distributed systems course. I do this twice in two different classes. Through this exercises the students get to learn how the protocol works, and see how it deals with the cornercases.

Another favorite moment for me is to watch the students think they understand the protocol, and then forget about it and get confused again. This is inevitable, and a big part of how learning works. Learning needs self evaluation and self correction. Without doing work yourself, you can't truly learn anything. Easy come, easy go. I also like watching the students learn the application of the single-instance Paxos in the MultiPaxos protocol, and see them learn about some Paxos variants. The students then realize that what they learned was only the tip of the iceberg.

2. Are there other algorithms/systems you suggest can serve as capstone projects in a distributed systems class?

Thursday, September 19, 2019

Avalanche is a descendent of Texel

When I first skimmed the Texel paper (which was released on August 28 2019), I could see parallels between Texel and Avalanche, and noted those in the third paragraph of my review. But I had missed the footnote in the Texel paper which said that Texel was originally written in 2010. I thought Texel was new, and was speculating that it may be a more deterministic version of Avalanche, that is applied to crash tolerant distributed consensus. After writing two more blog posts on modeling Texel in TLA+ and understanding it better, I now think Texel formed a basis that Avalanche descended from.

Texel provided an asynchronous and truly-leaderless solution to consensus. Instead of appointing a leader to bring nodes to consensus (as in Paxos), Texel shows how each node can make its own mind and still achieve consensus in an asynchronous system. By adopting a leaderless solution to asynchronous consensus, Texel avoids the disadvantages of solutions that appoint a leader for achieving consensus. In a leader-based solution, what if the leader fails? In order to avoid getting stuck forever, the nodes should use a failure detector to suspect if the leader is unavailable. Failure detectors are a liability in large-scale systems with a lot of turn-over. Another big drawback with having a leader is that for large-scale systems, the leader becomes a performance bottleneck.

Avalanche operationalized Texel's leaderless approach for large-scale decentralized consensus. It extended Texel's leaderless consensus approach in terms of scalability and quick finality (by using sampling and metastability), and applied the resultant decentralized algorithm in the blockchain domain.

But Texel did not consider Byzantine faults

Avalanche considers Byzantine faults which Texel did not consider. The question is, what can a Byzantine node do in blockchains? Answer: it can try to perform double-spending. That translates to the node proposing two different transactions with the same UTXO for itself (the transactions need to be signed by the private-key of the initiator).

The safety (i.e., agreement) property of the Texel approach says that no node in the system can decide different things for the same value (transaction). This, translated to Avalanche terms, means that no two correct nodes will decide two different transactions with the same UTXO. And this rules out double-spending for a Byzantine initiator. Even when other Byzantine nodes in the system may try to conspire with the Byzantine initiator and push some correct nodes to adopt different supporting values, with the threshold for supporting value adoption higher than the number of possible Byzantine nodes, Texel approach and its respective adaptation in Avalanche can avoid this problem.

Avalanche also does a very clever judo move on Texel's liveness problem and turns it into a feature. In my reviews for Texel, I mentioned that liveness (termination of consensus) is a problem for Texel. In the blockchain domain, Avalanche adopts a similar approach to supporting value selection, and runs into liveness problem when two different values are competing to be decided on for the same consensus instance. In the blockchain domain, this corresponds to a Byzantine node pushing two different transaction with the same UTXO. And in this case the liveness violation is a feature not a bug. Since the correct clients follow the protocol as prescribed (and avoid double-spending), they are guaranteed both safety and liveness. In contrast, the protocols do not guarantee liveness (but still guarantees safety) for double-spending transactions submitted by Byzantine clients, which conflict with one another. As the Avalanche paper says "such decisions may stall in the network, but have no safety impact on virtuous transactions."

What about the Sybil nodes problem? Avalanche deals with Sybil problem  using a PoS solution. It can even adopt a PoW solution as well, because dealing with Sybil nodes is an orthogonal problem to solving consensus.

Scaling Texel

In Texel for adopting a supporting value, you need to read it from more than F nodes. In the decentralized consensus setting Avalanche considers, N and F can be huge, thousands of nodes. So for finding a supporting value, a node in Avalanche samples a bunch of nodes, which is much smaller than F nodes. But the random sampling of nodes still enables tolerating the F faulty nodes. Since F is a fraction of N, it cannot have too much effect in the sampling based selection of a supporting value for a node in Avalanche.

How does Avalanche deal with the cancellation of experimentation problem in Texel? Again sampling and the use of metastability concept helps with this. Having a large scale system becomes an advantage here because the likelihood/risk of reading from inconsistent cuts of each other from overlapping experiments and getting affected by this diminishes. This way Avalanche avoids the agreement violation problem due to inconsistent snapshot read (if concurrent and overlapping experiments are not canceled).

Avalanche also applies the metastability concept to make the consensus finalization decision faster, and without the need to contacting N-F nodes.

Closing the loop

I will assign Texel as part of a TLA+ modeling project in my distributed systems class. My distributed systems class topics are as follows:
  1. Introduction, 2 phase-commit
  2. Reasoning about distributed programs, safety/progress
  3. Consensus, Paxos
  4. Failure detectors, Faults and fault-tolerance
  5. Time: logical clocks, State: distributed snapshots
  6. Datacenter computing, Cloud computing
  7. NoSQL databases, CAP theorem, Distributed databases
  8. Big data, Big data analytics
  9. Decentralized ledgers and blockchains
The course starts with consensus and ends with blockchains. Showing a Texel to Avalanche transition is a good way to tie consensus (where Texel shows an alternative solution to leader based consensus as in Paxos) to blockchain (where Avalanche shows how to scale and operationalize Texel to large-scale decentralized environments).

MAD questions

1. Could it be that Avalanche and Texel are unrelated?
No. Absolutely not!

Tobler's first law of geography says "everything is related to everything else, but near things are more related than distant things."

"Everything is related", so Avalanche and Texel are related :-)

"Near things are more related than distant things". Since both Texel an Avalanche have strong Cornel links, they are even more related.

Banter aside, I noticed that Texel in turn has several parallels to Ben-Or. Nothing comes out of void. So you can also make an argument that Avalanche is a descendent of Ben-Or as well. But, as the law said "everything is related", so I am still in the right. Here are the similarities I see between Texel and Ben-Or.
  1. Texel does not use rounds, but consistent-cuts. Ben-Or uses rounds, but the rounds in Ben-Or are leaderless rounds: they are non-client-restricted rounds, in contrast to client-restricted rounds in leader-based solutions.
  2. Similar to a node experimenting in Texel to find a supporting value by talking to F+1 nodes, in Ben-Or a node goes through the first phase of a round to identify a supporting value by talking to F+1 nodes.
  3. Similar to the node finalizing a decision by finding it at N-F nodes in Texel, in Ben-Or a node finalizes its decision by finding it at N-F nodes.
As one difference, upon failure to get a decision, Ben-Or makes some nodes to change their vote before the next experimenting phase. This helps jolt/tilt the system toward a decision, so that eventually probabilistically the system converges to consensus.

Tuesday, September 17, 2019

Modeling a read-write version of Texel: an asynchronous consensus algorithm without rounds

In the previous post I gave a model of atomic Texel, where a node can atomically read all other nodes' decision and update its own decision. Here is a refined version of that, where a node can atomically read the state of *one* other node and update its decision. This refined model shows why it is important for the nodes to read from consistent cuts, and how when multiple nodes are experimenting they can violate this requirement, and Agreement property is violated as a result.

The model

This builds and extends over the previous model. N stands for number of nodes, and F denotes the number of nodes that can crash. We use f to keep track of actual number of nodes that crash. In addition to the *decision* array that tracks the decision of each node, we now have an *exp* array that denotes the experimentation status of each node. Initially each node is in the experimenting state.

Each node starts with t=FALSE (the decision is not finalized), pollSet= Procs \{self} (the node can poll all nodes except self), and tally=<<0,0>> (the number of votes from the polled nodes for "a" and "b" is initally 0 and 0).

Each node has three actions that it can choose from and execute as long as the node has not finalized its decision or crashed.

The first action starts in Line 21. This action is enabled if the node is in experimenting state. It picks (and removes) a node k from its pollSet. If k's decision is "a", it increases the tally for "a" and if k's decision is "b", it increases the tally for "b". After this, if any of these tallies is a supporting decision, i.e., is greater than F (which means it is a majority of N-F nodes), then the node adopts it as its own decision.

Line 32 starts the second action. If f, the actual number of crashes is still less than F, the allowed number of classes, then a process can crash, by setting its decision to crash permanently.

Line 36 starts the third action. If a node finds that its current decision is shared by at least N-F processes, then that decision is "anchored", and the node can finalize its decision by setting its t=TRUE. If no such anchor is in place, and the node is not in experimenting state, the node switches to experimenting state (resetting pollSet and tally). By experimenting again, the node can potentially change its decision to another supporting decision, which may lead to progress and finalization of consensus.

Safety violation

When I model-check this protocol for N=4, F=1, this model violates the Agreement property. Two nodes can finalize their decisions with different values, because they experiment concurrently, and one of them reads from an inconsistent cut. In the trace below node 2 builds its supporting decision on an inconsistent snapshot involving 1, which changes its state after being read by 2.

Here are the steps to the violation of Agreement. Initially the decision array of nodes is "a","b","b","a".
  1. Node 1 reads from node 2 the value "b".
  2. Node 2 reads from node 1 the value "a". (Note that two nodes are concurrently experimenting and reading state from each other which will get inconsistent soon.)
  3. Node 1 reads from node 3 the value "b", and since tally for "b" is more than F=1, node 1 changes its decision to "b", and concludes its experimentation.
  4. Node 1 finalizes its decision of "b", because it sees an anchored quorum (cardinality >= N-F) for "b".
  5. Node 2 reads node 4 the value "a", and since tally for "a" is more than F=1 (including the now invalid vote from Node 1), node 2 changes its decision to "a", and concludes its experimentation.
  6. Node 3 reads from node 2 the value "a".
  7. Node 3 reads node 4 the value "a", and since tally for "a" is more than F=1, node 3 changes its decision to "a", and concludes its experimentation.
  8. Node 2 finalizes its decision of "a", because it sees an anchored quorum (cardinality >= N-F) for "a". This decision violates Agreement, because Node 1 has finalized its decision to "b", and we have conflicting decisions.

To fix the safety violation, we should disallow concurrent experimentation when it may lead to reading from inconsistent snapshots. This is possible by making the reads preemptive/destructive. (If, instead of using preemptive reads, we try constraining the nodes to read from only non-experimenting nodes, deadlock would happen.) In the above trace, when node 2 reads from node 1, this should have halted node 1's already ongoing experiment. This is easy to achieve by extending/modifying the model above, and when I fixed this problem, I found that safety is always satisfied for N>=3*F+1. (I don't provide that model, because I am considering assigning modeling Texel and Ben-Or as a TLA+ project in my distributed systems class.)

Liveness violation

Liveness is also an interesting story. Even with F=0 and starting state of <<"a","b","b","b">>, we can have a liveness violation. With F=0, reading from one node is enough to change your vote. So the value of "a" may be circulated in the system, since it can keep getting adopted by another minority of processes. The system may not be able to anchor the majority value as the consensus value, and as a result cannot finalize a decision. Side note: When you appoint a leader for consensus (as in Paxos) this vote looping does not become an issue, because the leader will break the symmetry by dictating the value it picks (or a suitable value) to the other nodes for adoption.

In that same setup (with <a,b,b,b>), if I make it F=1, liveness is satisfied, because no node will copy a, as it will need to see another node with a before passing threshold. So, in this case, increasing F did help for liveness. This suggests that maybe we should introduce another free parameter to serve as threshold for value adoption, and not tie that strictly to F the potential number of faults.

By restricting the problem to binary (with only two values) consensus and with proper selection of the threshold for adoption, Texel may solve the problem of having a minority value circulating in the system forever and breaking progress. But even then, we have another progress violation problem. When we introduce experiment cancellation to satisfy the Agreement property, nodes that keep interfering and canceling each others' experiments will violate progress.

This is another place where having a leader for consensus provides advantage. The leader by definition reads from a consistent state, as for that round the other nodes are passive. When you have each node polling for itself, coordinate these distributed transactions to read from clean consistent states becomes very difficult (maybe requires consensus itself).

MAD questions

1. What are the advantages and disadvantages of appointing a leader for solving consensus?
Picking up from the thread of previous discussion on comparing Texel and Paxos, here are pros and cons of appointing a leader node for solving consensus.

There may be symmetry, and no clear winner, when there are multiple initial values present in the system. Using a leader breaks the symmetry, because nodes go with whatever the leader proposes as the vote to decide on. So using a leader, you can solve more than just binary consensus. Even with binary consensus, as we have seen in Texel, liveness can still be jeopardized due to experiment cancellation. And in Ben-Or, liveness is facilitated by jolting the system by using random changes of some values, so that the system will eventually probabilistically converge to a consensus. On the other hand, using a leader boosts liveness in the presence of multiple initial values. (Errr... when things go right. See below.)

On the other hand, trusting a leader to finish a round introduces a problem. What if the leader is dead? (2PC blocking problem!) In order to avoid getting stuck forever, nodes should use a failure detector. Then, upon suspicion of the leader's death, any node can start a new round to lead the rest of the nodes. But what if the suspicion is wrong? The FLP impossibility result strikes again! Fortunately, there is a way to circumvent the impossibility result by postponing liveness and still preserving safety. For example, Paxos preserves safety even with multiple leaders concurrently trying to lead rounds.

Another drawback with having a leader is, if N is large, the leader is a performance bottleneck in the deterministic and instant consensus protocols, like Paxos.

Sunday, September 15, 2019

Modeling an atomic version of Texel: an asynchronous consensus algorithm without rounds

I had written about my preliminary impressions about Texel, an asynchronous consensus algorithm without rounds.

Over the last couple of nights, I have modeled a shared memory version of the algorithm in TLA+, so that I can understand the algorithm better and learn more about the approach. I started with a very rough atomic version of the algorithm, where a node can atomically read the state of all other nodes and update its own state. This is not practical, but it is good for highlighting the essence of the Texel algorithm. In this post, I will talk about this atomic Texel model.

After I got this model down, it was easy for me to refine the atomicity. In the refined model, a process can atomically read from one other process and update its state. That refined model is just one step removed from the message passing Texel algorithm presented in the paper, and demonstrates the tricky issues that arise when multiple nodes are concurrently trying to update their states. In that read-write atomicity model, we see the need for reading states from a consistent-cut, and why some concurrent experiments should be aborted to satisfy that condition. But that read-write atomicity Texel specification is the topic of my next post. Today we just focus on the atomic Texel model.

The atomic Texel model

N stands for number of nodes, and F denotes the number of nodes that can crash. At model checking time, TLA+ toolkit asks you to enter values for N and F. I tried for N=4, and F=0 and F=1. I also tried for N=7, and F=0,1,2. My Texel model satisfies both agreement and progress always for N>=3F+1. Progress is always satisfied, because Choose is deterministic. A node will choose "a" when both "a" and "b" meets SupDecision criteria, which is a supporting value that a node can adopt based on its querying of other nodes. A supporting value is one that is shared by at least F+1 nodes. Note that, for N>=3F+1, it is always possible to find a supporting value for binary consensus, even when up to F nodes fail.

I use f to keep track of actual number of nodes that crash. The model ensures that f=<F. The variable decision tracks of the decision of each node. I hardwire it for N=4 in my run. When I try for N=7, I change this initial assignment.

In my specification, each node has three actions.

Line 23 gives the first action. A node reads the state of other nodes to find a supporting value and adopts it as its own decision value. But the decision is not finalized until, the node sets its finality flag t for the decision to TRUE.

Line 24 starts the second action. If f, the actual number of crashes is still less than F, the allowed number of crashes, then the node can crash by setting its decision to crash permanently.

Line 28 starts the third action. If a node finds that its current decision is shared by at least N-F processes, then that decision is "anchored", and the node can finalize its decision by setting its t=TRUE.

Here are the Agreement and Progress properties I check. Agreement says that if two nodes j and k finalized their decisions, then their decisions cannot differ. Progress says that eventually any non-crashed node will finalize its decision.

For N>=3F+1, both Agreement and Progress are satisfied. Since the atomicity is too rough (a node can read the states of all other nodes atomically), Agreement holds without installing extra mechanisms for reading states from a consistent-cut and aborting other nodes' concurrent experiments, because each experimentation is done atomically and hence in an interleaving manner. Progress holds because the CHOOSE in SupDecision is deterministic, and helps the nodes to converge to one the binary consensus values.

This is a very simple model, but it helped me to come up with the refined read-write atomicity Texel specification quickly, and in that refined model it becomes easy to see what could go wrong when we don't have additional mechanisms in place to enable nodes read from concurrent states.

MAD questions

1. How does Texel compare with Paxos and Ben-Or and how does failure-detectors fit in this picture?

In Paxos, there are rounds, and the rounds are client-restricted. This means that a higher round preempts the lower rounds. A leader leads the rest of the nodes through a round, and this means that the nodes are held hostage by a leader which may be dead. Hence, failure-detectors need to be utilized so that the nodes do not wait forever for a dead leader in an asynchronous model. However, if the failure detectors at nodes are trigger happy, the nodes will suspect whoever is the leader currently for no reason, and will start their own rounds, which preempts the leader's round. This leads to the dueling leaders problem, and violation of liveness even when we have a bound on F, (i.e., F< N/2).

In Texel, there is no need for a failure detector if we have a bound on F (i.e., F<N/3). This is because Texel is a decentralized consensus algorithm, and the nodes do not need to rely/wait on a leader to lead a round; instead all nodes do their own polling and deciding. But as we will discuss in the read-write Texel model (wait for the next post), if the nodes are very snoopy and keep interfering with each others’ experiments, then liveness is still violated. This is where having a leader to lead a round (as in Paxos) provides advantage: the leader by definition reads from a consistent state, as for that round the other nodes are passive.

What if we had non-client restricted rounds as in Fast-Paxos? That is an opportunistic-leader-based solution, and progress is not guaranteed if multiple opportunistic-leaders clash. Then we need to default to Paxos.... which is subject to the failure-detectors result as above for progress! Back to square one.

In the Ben-Or algorithm, there is no need for failure detectors if we have a bound on F (i.e., F<N/2), because that is also a decentralized algorithm. Ben-Or has rounds but the rounds are not client-restricted and do not preempt each other. Also it seems like a node does not interfere/cancel other nodes' progress in querying/experimenting. Ben-Or does not have the disadvantages of Paxos or Texel. So what gives? Ben-Or is a probabilistic algorithm. By using randomization, the system eventually and probabilistically converges to a consensus decision.

(While writing the read-write model of Texel algorithm, I found several parallels between Ben-Or and Texel. Those will also be interesting to investigate more closely.)

Friday, September 13, 2019

Paper review. Gray Failure: The Achilles' Heel of Cloud-Scale Systems

This paper (by Peng Huang, Chuanxiong Guo, Lidong Zhou, Jacob R. Lorch, Yingnong Dang, Murali Chintalapati, and Randolph Yao) occurred in HotOS 2017. The paper is an easy read at 5 pages, and considers the fault-tolerance problem of cloud scale systems.

Although cloud provides redundancy for masking and replacing failed components, this becomes useful only if those failures can be detected. But some failures that are partial and suble failures remain undetected and these "gray failures" lead to major availability breakdowns and performance anomalies in cloud environments. Examples of gray failures are performance degradation, random packet loss, flaky I/O, memory thrashing/leaking, capacity pressure, and non-fatal exceptions.

The paper identifies a key feature of gray failure as differential observability. Consider the setup in Figure 2. Within a system, an observer  gathers information about whether the system is failing or not. Based on the observations, a reactor takes actions to recover the system. The observer and reactor are considered part of the system. A system is defined to experience gray failure when at least one app makes the observation that system is unhealthy, but observer observes that system is healthy.

This setup may seem appealing, but I take issue with it. Before buying into this setup did we consider a simpler and more general explanation? How about this one? Gray failures occur due to a gap/omission in the specification of the system. If we had formally defined the "healthy" behaviors we want from the system, then we would be able to notice that our detectors/observers are not able to detect the "unhealthy" behaviors sufficiently. And we would look into strengthening the observers with more checks or by adding some end-to-end observers to detect the unhealthy behaviors. In other words, if we had cared about those unhealthy behaviors, we should have specified them for detection, and developed observers for them.

The paper states that the realization about differential observability as the cause of gray failures implies that, "to best deal with them, we should focus on bridging the gap between different components' perceptions of what constitutes failure."
Even for cases where the underlying problem is simply that the observer is doing a poor job of detecting failures (so the gray failures are extrinsic and could be avoided by fixing the observer), such distributed observation can also be helpful. Where to conduct such aggregation and inference is an interesting design question to explore. If it is done too close to the core of a system, it may limit what can be observed. If it is near the apps, the built-in system fault-tolerance mechanisms that try to mask faults may cause differential observability to be exposed too late. We envision an independent plane that is outside the boundaries of the core system but nevertheless connected to the observer or reactor.
But this is a tall order. And this doesn't narrow the problem domain, or contribute to the solution. This is the classic instrumentation and logging dilemma. What should we log to be able to properly debug a distributed system? The answer depends on the system and what you care about the system, what you define as healthy behavior for the system.

I think for dealing with gray failures, one guiding principle should be this: don't ever mask a fault silently. If you mask a fault, do complain about it (raise exceptions and log it somewhere). If you mask failures without pointing out problems, you are in for an unexpected breakdown sooner or later. (This is also the case in human relationships.) Communication is the key. Without complaining and giving feedback, backpressure, the system may be subject to a boiling frog problem, as the baselines gradually slip and degrade.

Cloud anomalies with gray failures

In Section 2, the paper gives examples of cloud anomalies with gray failures, but I have problems with some of these examples.
The Azure IaaS service provides VMs to its customers using highly complex subsystems including compute, storage, and network. In particular, VMs run in compute clusters but their virtual disks lie in storage clusters accessed over the network. Even though these subsystems are designed to be fault-tolerant, parts of them occasionally fail. So, occasionally, a storage or network issue makes a VM unable to access its virtual disk, and thus causes the VM to crash. If no failure detector detects the underlying problem with the storage or network, the compute-cluster failure detector may incorrectly attribute the failure to the compute stack in the VM. For this reason, such gray failure is challenging to diagnose and respond to. Indeed, we have encountered cases where teams responsible for different subsystems blame each other for the incidents since no one has clear evidence of the true cause.
Isn't this a classic example of under-specification of healthy behavior. The VM should have specified its rely/expectation from the storage for its correct execution and accordingly included a detector for when that expectation is violated?

I also have a problem with "the high redundancy hurts" example.
As a consequence, we sometimes see cases where increasing redundancy actually lowers availability. For example, consider the following common workload pattern: to process a request, a frontend server must fan out requests to many back-end servers and wait for almost all of them to respond. If there are n core switches, the probability that a certain core switch is traversed by a request is $1-\frac{n-1}{n}*m$, where m is the fan-out factor. This probability rapidly approaches 100% as m becomes large, meaning each such request has a high probability of involving every core switch. Thus a gray failure at any core switch will delay nearly every front-end request. Consequently, increasing redundancy can counter-intuitively hurt availability because the more core switches there are, the more likely at least one of them will experience a gray failure. This is a classic case where considering gray failure forces us to re-evaluate the common wisdom of how to build highly available systems.

I think this is not a redundancy problem, this is the "you hit cornercases faster in big deployments" problem. This setup is in fact an abuse of distribution as it is the exact opposite of providing redundancy. This setup provides a conjunctive distribution as it makes success depend on success of each component, rather than a disjunctive distribution to make success depend on success of some components.

MAD questions

1. Is this related to robust-yet-fragile concept?
This notion of masked latent faults later causing big disruptions reminds be of the robust-yet-fragile systems concept. Robust-yet-fragile is about highly optimized tolerance. If you optimize your tolerance only for crash failures but not partial/gray failures, you will be very disappointment when you are faced with this unanticipated fault type.
A good example here is the glass. Glass (think of automobile glasses or gorilla glass) is actually very tough/robust material. You can throw pebbles, and even bigger rocks at it, and it won't break or scratch, well, up to a point that is. The glass is very robust to the anticipated faults (stressor) up to a point. But, exceed that point, and then the glass is in shambles.  That shows an unanticipated stressor (a black swan event in Taleb's jargon) for the glass: a ninja stone. The ninja stone is basically a piece of ceramic that you take from the spark plug, and is denser than glass. So if you gently throw this very little piece of ceramic to your car window, it breaks in shambles.  
This is called a robust-yet-fragile structure, and this is actually why we had the Titanic disaster. Titanic, the ship, had very robust panels, but again upto a point. When Titanic exceeded that point a little bit (with the iceberg hitting it), the panels broke into shambles, very much like the glass meeting ninja stone. Modern ships after Titanic, went for resilient, instead of robust (yet fragile) panels. The resilient panels bend easier, but they don't break as miserably. They still hold together to the face of an extreme stressor. Think of plastic; it is less robust but more resilient than glass. 
The robust-yet-fragile effect is also known as highly optimized tolerance. If you optimize tolerance for one anticipated stressor, you become very vulnerable to another unanticipated fault. (Much like the closed Australian ecosystem.)

2. Is fuzzy logic applicable here?
It seems like instead of binary detectors which output healthy or failed, it is better to have detectors that give probabilities and confidence to their decisions. So I thought the phi accrual detectors should be a relevant work to consider for detecting gray failures. I don't know if there are any other fuzzy detectors work for identifying latent failures.

Update: Ryan Huang, one of the authors of the work, left a comment with insightful response to my questions. In the response, he includes links to followup work as well.

Wednesday, September 11, 2019

Paper review. Asynchronous consensus without rounds

This paper by Robbert van Renesse appeared on Arxiv two weeks ago. (Update: Huh, I missed this earlier, but the paper has a footnote that says it was written in 2010.) The paper looks very interesting. I only got to skim the paper, but I will give this a careful read later.

All published crash and Byzantine fault tolerant asynchronous consensus protocols use rounds. (Yes, indeed... Paxos, Viewstamped Replication, even Nakamoto consensus, and Avalanche protocol all use rounds.) Rounds are such an inherent part of consensus algorithms that it is tempting to conclude that solving fault tolerant consensus requires ordered rounds. This paper shows that such a conclusion would be wrong by showing an asynchronous consensus protocol that does not use rounds.

The protocol is named after Texel, an island of Netherlands. Presumably this is because 1) Robbert is Dutch, and 2) he wants to name an alternative island to Paxos island in a sea farther away from the Ionian sea. Texel provides binary consensus (can only decide between 0 and 1 as input votes) without rounds. Nodes query/poll other nodes to make up their minds. If 2/3rd vote is same, the value is anchored. Texel is reminiscent of Avalanche in that it is a decentralized binary consensus algorithm that works by nodes polling other nodes. However, in contrast to Avalanche which uses rounds and is probabilistic, Texel does not use rounds and it is deterministic. Instead of rounds, nodes use querying of consistent cuts and change their vote. The consistent cuts are identified by using vector clocks in the Texel algorithm.

Texel is not a very efficient algorithm, because each node makes up its own mind in a decentralized manner. In contrast, by having leaders coordinate other nodes for which proposal to vote on for a given round, Paxos achieved efficiency and economy.

What is more, in Texel, processes are not allowed to respond to queries if they are experimenting themselves. How is this supposed to terminate/decide? In this respect, Texel again resembles the Avalanche algorithm. If two conflicting proposals exist, consensus is not guaranteed to terminate for Avalanche. The same thing holds for Texel as well.

Texel requires 3f+1 processes, where f is the number of processes that can crash. Byzantine fault-tolerant Texel requires 5f+1 nodes. (I need to check how Texel deals with vector clock integrity in the presence of Byzantine nodes. Maybe it is done through signing entries by corresponding processes, because vector clock entries are only maxed, not updated by processes other than the originators.)


Ok, so what do we have here? Texel is not very efficient. It may not terminate. It uses more processes to tolerate f faulty processes. This all makes me think, rounds are a  great abstraction for distributed systems. Ain't nothing wrong with rounds. They are implementable via ballotnumbers as in PAxos and you are done. The paper also doesn't claim that there is something  wrong with rounds, and neither does it claim that solving consensus without rounds brings any advantages.

On the other hand, this is still a super exciting paper, because Texel proves that distributed fault-tolerant consensus is possible without rounds. Texel breaks new ground! It may be possible to have more useful instances of no-round consensus algorithms in the future. The Texel protocol is derived using stepwise refinement. (Robbert had also used stepwise refinement in his work on chain replication. It is a technique that keeps on giving.) Starting from a high-level specificition of Consensus, an intermediate level specification  called ProtoConsensus with no-rounds is shown to refine the Consensus specification, and Texel is shown to refine ProtoConsensus. It may be possible to search for alternative implementations refining ProtoConsensus.

I am happy that new territory is being explored for decentralized consensus for the last couple years. It is exciting times for distributed systems and algorithms.

MAD questions

1. How could we extend this to multi-decree consensus?
Texel is a single-instance consensus algorithm? Is it possible to extend Texel to multiple-instance consensus in an efficient way and implement state machine replication using it? Is it possible to do linearizable reads from that multi-instance algorithm? Given that there is no leader, and commit time of an update operation is murky, this will be tricky.

2. Is it possible to use HLC instead of VC for finding consistent cuts?
I suppose it may be possible to use HLC for the non-byzantine version of Texel and benefit from loosely synchronized clocks, but I don't know if there would be a big practical gain.

3. How do we implement this protocol in TLA+?
In PlusCal implementing Texel won't be very hard. Modeling Texel in PlusCal may provide value because it will let you test different invariants and temporal properties and exploring variations on the protocol. If I can get the scaffold for this in place, I may even assign this as course project this year.

Monday, September 9, 2019

Paper review. A generalized solution to distributed consensus

This paper (by Heidi Howard and Richard Mortier) proposes a framework for consensus that uses immutable state to simplify the exposition. They show that both Paxos and Fast Paxos are certain instantiations of this general consensus framework. Finally, they outline some new instances of the consensus framework which provide nice benefits in certain setups.

I should note and caution you that this paper considers single instance/decree consensus rather than multiple instances/slots back-to-back consensus. So if you are interested in the latter, there are gaps to fill before you can implement these algorithms to solve multi-decree consensus and maintain RSMs. Moreover while immutability via the write-once registers is great for exposition, extra work needs to be done for achieving implementation efficiency of this abstraction.

The consensus problem

An algorithm solves consensus if it satisfies these requirements:
  • Non-triviality. All output values must have been the input value of a client.
  • Agreement. All clients that output a value must output the same value.
  • Progress. All clients must eventually output a value if the system is reliable and synchronous for a sufficient period.
If we have only one server, the solution is straightforward. The server has a single persistent write-once register, R0, to store the decided value. Clients send requests to the server with their input value. If R0 is unwritten, the value received is written to R0 and is returned to the client. If R0 is already written, then the value in R0 is read and returned to the client. The client then outputs the returned value. This algorithm achieves consensus but requires the server to be available for clients to terminate. To overcome this limitation requires deployment of more than one server, so we now consider how to generalise to multiple servers.
And holy moly! By considering more than one node, we open Pandora's box! It is very difficult to provide distributed consensus in a fault-tolerant manner to the face of bouts of asynchrony, node failures, and message losses.

The reason Paxos is so popular is because Paxos and its variants have excellent fault-tolerant properties that cover all possible corner cases. Paxos preserves consistency even when the system is asynchronous, all timing assumptions are wrong, all failure detectors are wrong, processes crash in an undetected ways, and messages can be lost arbitrarily, etc. The beauty of Paxos is that safety (agreement) is preserved under any condition, and only the progress condition depends on the availability of weakest failure detector (which eventually stabilizes to ensure that the processes do not suspect a single up process as failed). In other words, whenever the setup is out of the impossibility realm of FLP and attacking general results, progress is guaranteed. And even when the setup has not moved beyond that point, Paxos can still provide progress probabilistically.

Generalized consensus framework

Each node (i.e., server) has an infinite series of write-once, persistent registers, {R0, R1, ...}. Clients read and write registers on servers and, at any time, each register is in one of the three states:
  • unwritten, the starting state for all registers
  • contains a value,
  • contains nil.
A register set, i, is the set comprised of the register Ri from each server. Each register set i can be pre-configured (hardwired) with a set of quorums.

Figure 3 describes the generalized consensus framework by giving four rules governing how clients interact with registers.

I like rule 1: quorum agreement. According to this rule,  the clients/proposers need not be competing, they can in fact be collaborating and strengthening each other if they write the same values in the same register set across nodes. This is different than Lamport's formulation. Lamport also gives similar rules for an abstract Vote algorithm as an intermediate step to deriving the Paxos algorithm. But by divorcing the register-sets abstraction from ballots, this framework achieves more generality.

A register-set corresponds roughly to a round or view in the Paxos algorithm. The framework still allows client-restricted register-sets, where a register-set is allocated for use of only a certain client where only one proposal is voted on. These client-restricted register sets can be implemented using ballot-numbers. But it is also possible to have non client-restricted register-sets for which there is no need for clients to use different ballot-numbers. Multiple clients can then write concurrently to the same register-set, and it may even possible to consider them as cooperating if they write the same value.

From its local decision table, each client can track whether decisions have been reached or could be reached by previous quorums. At any given time, each quorum is in one of four decision states:
  • Any: Any value could be decided by this quorum.
  • Maybe v: If this quorum reaches a decision, then value v will be decided.
  • Decided v: The value v has been decided by this quorum; a final state.
  • None: This quorum will not decide a value; a final state.

Figure 5 describes how clients can use decision tables to implement the four rules for correctness. If a client reads a (non-nil) value v from the register r on server s, it learns that:
  • If r is client restricted then all quorums in r must decide v if they reach a decision (Rule 3)
  • If any quorum of register sets 0 to r − 1 reaches a decision then value v is decided (Rule 4)

Instantiating to Paxos

The paper shows that the framework can be instantiated to produce the single-decree Paxos algorithm. All the register-sets are client-restricted and all quorums in a register-set are assigned to be majority quorums. Note that this Paxos instance Figure 8 satisfies the four rules in the general consensus framework introduced, so it is an instance of that framework. On the other hand, the algorithm is still not straightforward even after understanding the general framework. There is still a leap to be made from the framework to the rules to the single-decree Paxos algorithm. Why two phases? What should be accomplished in Phase-1? How does Phase-2 relate and build on Phase-1? Similar issues were also present  going from the Vote algorithm to Paxos in Lamport's exposition of Paxos. As I have been telling students, Paxos looks deceptively simple, but there is a lot of depth and levels to it, and you should consider other ways to make this work (and mostly fail) to appreciate why the algorithm is that way.

Again, remember that this is only single instance/decree Paxos, and the paper does not provide a MultiPaxos instantiation. For this single decree Paxos the clients may be competing. Two clients may be sending their proposals to different replicas in the register-sets assigned to them via their unique ballot-numbers. If you consider the client colocated with one of the replicas  then this setup maps better to the popular descriptions of Paxos with a leader that goes through Phase-1 and Phase-2.

A new consensus algorithm

In Section 6, the paper gives example of a new consensus algorithm that has some advantages for certain environments.
Co-located consensus. Consider a configuration which uses a quorum containing all servers for the first k register sets and majority quorums afterwards, as shown in Figure 12b. All register sets are client restricted. Participants in a system may be deciding a value between themselves, and so a server and client are co-located on each participant. A client can therefore either achieve consensus in one round trip to all servers (if all are available) or two round trips to any majority (in case a server has failed).
Ok, what does this mean?

All the k clients are colocated with the k replicas/servers. If only one client/replica has something to propose that client/replica can write its value in the register set allocated to it at all other nodes in an uncontested manner and succeed in one round.

If multiple client/replicas has a value to propose, there is still chance that the highest id client/replica x succeed in achieving consensus in one round if all replicas are available. When x writes to its client-restricted register set, it writes nil to its register for all previous rounds and thus invalidates consensus attempts of all the replica/clients with smaller id register sets making their decision=NONE. This satisfies the 4 rules in Figure 3 as well as the client decision table rules in Figure 5.

Another way to understand this, in a manner closer to a message-passing implementation, is to consider the flexible quorums result applied to Paxos protocol. Here all the replicas are competing with ballot-numbers (all register sets are client-restricted), and phase 1 quorum is just one node, which is the replica/proposer itself. So the phase 1 is skipped since it does not require consent from another node. In order for phase 2 to intersect phase 1, the phase 2 quorum is the set of all replicas. And in this set up, if multiple replicas have something to propose, the replica with the highest id can achieve consensus in one phase (that is phase 2) if all replicas are available.

If a replica is crushed or unresponsive, the algorithm still offers fault-tolerance, because after their first try with flexible quorum, the replicas switch to majority phase-1 & phase-2 quorums and achieve consensus if a majority of nodes are available.

MAD questions

1. Is this just a different presentation of the same old stuff or is there more here?

Yes, there is more here than just re-packaging consensus with immutable write-once registers.

As I wrote above, I really liked how this enables the nodes to collaborate in non client-restricted register sets. When the register set is non client-restricted, the clients do not compete with ballot-numbers and multiple clients writing the same value can in fact strengthen that value and improve consensus.

I also liked that it is possible to pre-assign different quorums to different rounds. Many different combinations can be possible this way. Also, as I mentioned, the algorithms in the last section are interesting.

These being said, even if you didn't get anything new here and labeled this as just a different packaging/presentation of the Paxos consensus idea, I would argue this is still worth reading. Because by learning about the same topic from a different perspective, you will strengthen your understanding of the topic.

Saturday, September 7, 2019

Linearizable Quorum Reads in Paxos

While there has been a lot of work on Paxos protocols, there has not been any study that considers the read operation in Paxos protocols thoroughly. Read operations do not mutate state, and in many applications the read operations outnumber the update operations.

Traditionally, there have been three main ways to perform reads.
  1. Treat the read as a regular command, let the leader clear it with a quorum, and return the response
  2. Use a lease on the leader (to prevent another leader emerging and committing an update), and read from the leader
  3. Read from one of the replicas
The first two approaches provide linearizable reads, but the last method is non-linearizable. For example, in ZooKeeper or Raft if you read from a replica, it returns stale values, because the leader commits first---after hearing from a quorum of replicas--- and the replicas follow behind. (Linearizability means that the distributed system emulates a single register where each client can read or write from that register. Each operation appears to have occurred instantaneously between the time when it is invoked and the time when it produces a response.)

While the first two approaches provide linearizable reads, they both involve the leader for read operations. However, the leader is already overwhelmed with write operations: for each write it is doing disproportionately large work. Involving the leader with the reads magnifies the bottleneck at the leader.

Paxos Quorum Reads

To solve this problem and provide linearizable reads in Paxos without involving the leader, we (Aleksey Charapko, Ailidani Ailijiang, and Murat Demirbas) have introduced Paxos Quorum Reads (PQR) in a recent work. PQR can work in an asynchronous setup without requiring leases or involving the leader.

A client multicasts the quorum-read request to a majority quorum of replicas and waits for their replies. Each reply message contains three fields, the highest accepted slot number $s$, the highest applied slot number $\underline{s}$, and the value $v$ of the object requested. (There are four possible states of a slot: empty, accepted $v$, committed $\hat{v}$, and applied $\underline{v}$.)

We say that a read is clean if a quorum of replicas return the same slot number for $s$ and $\underline{s}$ for which the requested data item is last updated. In this case, a quorum read is successful and the learned value can be returned immediately. Intuitively, a clean read means it not possible for the leader to have a higher committed value $\hat{s}$ than any of the replicas.

Note that it is possible for some replicas to return $s=\underline{s}=x$ and some replicas to return $s=\underline{s}=x+k$. This is still a clean read, and the client uses $s=x+k$, the higher value, as the clean read value. This read is clean, because it is impossible for the leader to have $\hat{s}$ greater than $s=x+k$. If that was the case, the read quorum would have returned at least one node that has $s$ greater than $\underline{s}$, violating the clean read.

Accordingly, a read is dirty if at least one node has seen a higher slot number in accepted state $s>\underline{s}$. The value learned from dirty read is unsafe to return as the same value is not guaranteed to be seen from subsequent reads. Therefore, a second phase of read is required to confirm that such a slot is finalized/cleaned. For the second phase, the rinse phase, the client retries any replica to see if $s$ is now applied/executed. If so, the read is completed as the current value in slot $s$. It is possible to perform this second phase as a callback from a replica to the client as well.

There are several optimizations possible over this basic scheme, and apply this to many Paxos flavors. We are currently investigating those optimizations.

To recap, the important thing about PQR is that it helps balance the load between Paxos leader and replicas. Relieving the leader from serving reads allows it to serve more writes. Reads use underutilized replicas and are performed by clients. This way PQR improves throughput, especially in write-heavy workloads. The figure shows that with a 75% writes workload, PQR  is able to provide better latency and higher maximum throughput.

You can read more about the PQR method in our HotStorage paper.

Thursday, September 5, 2019

On becoming a researcher

I remember the first semester I joined THE Ohio State University as a graduate student, Spring 1998. I had taken Anish Arora's class on distributed systems, and I fell in love with distributed algorithms. The emergence of order out of chaotic (distributed/concurrent) execution was intellectually challenging and was very exciting to me. The way Anish taught was also very engaging, and I got drawn into trying the algorithms he mentioned in the class. He taught about the Dijkstra-Safra algorithm which performed termination detection in a ring with 3 rotations of the token. I thought it should be possible to improve on the 3 rounds required for completion, if we kept track of more information at the token. I mentioned this to Anish after the class, and he told me to give it a try. I started working on it. I was also taking English as A Second Language writing class that semester. So I wrote a technical report on this improved version of the Termination Detection algorithm. (We ended up not publishing the paper because we thought there wasn't much practical interest in the algorithm. However, the technical report got cited several times, and I still get occasional emails about it.)

Twenty years later, I still remember the kick I got out of working on that improved algorithm. I would work in the library: failing, succeeding, then realizing that this didn't work either, and fixing it. It was the first time I experienced the highs and lows of research. Little did I know then, that I would be repeating these cycles many many many times in the remaining of my career. When you are researching on a novel idea, you start pretty much afresh every time. You think you get it, then you lose it, then you get it again, and lose it again. The idea you think is so novel looks straightforward/trivial and worse broken next week.

As you become a seasoned researcher, you still experience these things, but you also get wiser. Since you know about these cycles, you will feel less panicked when you think all is lost. But the joy you get when you discover something new doesn't get old. I still experience this with the same intensity after 20 years. And I saw this joy in the eyes of researchers that are 40 years in the field. They still get starry-eyed and excited when they find a new research question or insight. The joy of figuring things out never gets old.

As far as research is concerned the titles are meaningless. What you do matters more than what you call yourself. If you are doing research, you are a researcher. You don't need any other title. You need curiosity, tenacity, and hard working.

How do you get started? You start by reading things you don't understand. You read more and more papers on a topic, underlining things, and figuring out them by thinking and obsessing about them. When you start to understand, you start to be able to criticize and analyze these work better. Then you start identifying gaps, and you begin to address them. Then you find one particular tasty problem, you wrestle with it for days and weeks. You will go through the ups and downs many times. And that is how you do research.

Matt Might's illustrated guide to PhD explains this last point nicely.

And here is some more advice from me to beginning PhD students.

Tuesday, September 3, 2019

Reading list for our distributed systems class

Here is our reading list. It follows our distributed systems course schedule provided in the previous post. I tried to choose the papers that are foundational, comprehensive, and readable by a first year graduate student. In some cases, I omitted very long or hard to follow papers ---even though they may be foundational papers--- and instead included some follow up papers that summarized the concept better.

I assign these papers as review material for the corresponding topic. The students then choose one from the batch and review that one critically. But I hope that the students can read all of them if possible.

These papers should be all available with a Google Scholar search. Some of these papers appear in open access venues. If that is not the case, often the authors make their papers available freely at their websites, as they have the right to do. "Authors retain the right to use the accepted author manuscript for personal use, internal institutional use and for permitted scholarly posting provided that these are not for purposes of commercial use or systematic distribution".

Consensus and Paxos

  • Paxos Made Moderately Complex. Robbert Van Renesse and Deniz Altinbuken, ACM Computing Surveys, 2015.
  • Paxos made live - An engineering perspective. Tushar D. Chandra, Robert Griesemer, Joshua Redstone. ACM PODC, Pages: 398 - 407, 2007 
  • ZooKeeper: Wait-free coordination for internet-scale systems P. Hunt, M. Konar, F. P. Junqueira, and B. Reed  USENIX ATC 2010. 
  • The Chubby Lock Service for Loosely-Coupled Distributed Systems. Mike Burrows, OSDI 2006.
  • In Search of an Understandable Consensus Algorithm. Diego Ongaro, John Ousterhout, USENIX ATC, 2014.
  • WPaxos: Wide Area Network Flexible Consensus. Ailidani Ailijiang, Aleksey Charapko, Murat Demirbas, Tevfik Kosar, IEEE TPDS, 2019.
  • Dissecting the Performance of Strongly-Consistent Replication Protocols. Ailidani Ailijiang, Aleksey Charapko, Murat Demirbas, Sigmod 2019.
  • Viewstamped replication revisited. Barbara Liskov and James Cowling. MIT-CSAIL-TR-2012-021, 2012.
  • Chain Replication for Supporting High Throughput and Availability. Robbert van Renesse and Fred B. Schneider, OSDI 2004. 
  • FAWN: A Fast Array of Wimpy Nodes David G. Andersen and Jason Franklin and Michael Kaminsky and Amar Phanishayee and Lawrence Tan and Vijay Vasudevan. SOSP 2009.
  • CORFU: A shared log design for Flash clusters. Mahesh Balakrishnan, Dahlia Malkhi, Vijayan Prabhakaran, Ted Wobber, Michael Wei, John D. Davis, NSDI'2012.

Failure detectors and fault-tolerance

  • Unreliable Failure Detectors for Reliable Distributed Systems, Tushar Deepak Chandra and Sam Toueg, Journal of the ACM, 1996.
  • Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems. Ding Yuan, Yu Luo, Xin Zhuang, Guilherme Renna Rodrigues, Xu Zhao, Yongle Zhang, Pranay U. Jain, and Michael Stumm, OSDI 2014.
  • Why Does the Cloud Stop Computing? Lessons from Hundreds of Service Outages,  Haryadi S. Gunawi, Mingzhe Hao, and Riza O. Sumintom Agung Laksono, Anang D. Satria, Jeffry Adityatama, and Kurnia J. Eliazar, SOCC 2016.
  • Does The Cloud Need Stabilizing? Murat Demirbas, Aleksey Charapko, Ailidani Ailijiang, 2018.
  • TaxDC: A Taxonomy of nondeterministic concurrency bugs in datacenter distributed systems, Tanakorn Leesatapornwongsa, Jeffrey F. Lukman, Shan Lu, Haryadi S. Gunawi, ASPLOS 2016.

Time and snapshots

  • Time, Clocks, and the Ordering of Events in a Distributed System. Leslie Lamport, Commn. of the ACM,  1978.
  • Logical Physical Clocks and Consistent Snapshots in Globally Distributed Databases. Sandeep Kulkarni, Murat Demirbas, Deepak Madeppa, Bharadwaj Avva, and Marcelo Leone, 2014.
  • Distributed Snapshots: Determining Global States of a Distributed System. K. Mani Chandy Leslie Lamport, ACM Transactions on Computer Systems, 1985.

Cloud computing

  • Tail at scale. Jeff Dean, Luiz Andre Barroso, Commn of the ACM, 2013. 
  • Lessons from Giant-Scale Services. Eric A. Brewer, IEEE Internet Computing, 2001.
  • Above the Clouds: A Berkeley View of Cloud Computing. Michael Armbrust, Armando Fox, Rean Griffith, Anthony D. Joseph, Randy H. Katz, Andrew Konwinski, Gunho Lee, David A. Patterson, Ariel Rabkin, Ion Stoica and Matei Zaharia.  EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2009-28
February 10, 2009.
  • Serverless computing: One step forward, two steps back, UC Berkeley, CIDR 2019.
  • Cloud Programming Simplified: A Berkeley View on Serverless Computing, 2019.
  • On designing and deploying Internet scale services, James Hamilton, LISA 2007.

NoSQL and distributed databases

  • Life beyond Distributed Transactions: an Apostate’s Opinion. Pat Helland, CIDR 2007. 
  • Optimistic Replication. Yasushi Saito and Marc Shapiro, ACM Computing Surveys, 2005.
  • CAP Twelve Years Later: How the "Rules" Have Changed. Eric Brewer, IEEE Computer, 2012
  • PNUTS: Yahoo!'s Hosted Data Serving Platform. Brian F. Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam Silberstein, Philip Bohannon, Hans-Arno Jacobsen, Nick Puz, Daniel Weaver and Ramana Yerneni, VLDB 2008.
  • Dynamo: Amazon’s Highly Available Key-Value Store. Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall and Werner Vogels, ACM SIGOPS 2007.
  • Bigtable: A Distributed Storage System for Structured Data. Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber, ACM Transactions on Computer Systems, 2008.
  • Spanner: Google’s Globally-Distributed Database James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman,Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh,Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura,David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak,Christopher Taylor, Ruth Wang, Dale Woodford, ACM Trans on Computer Systems, 2013.

Big data processing

  • MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat, Commn of the ACM, 2008.
  • Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica. NSDI 2012. April 2012.
  • TUX2: Distributed Graph Computation for Machine Learning, Wencong Xiao,  Jilong Xue, Youshan Miao, Zhen Li, Cheng Chen and Ming Wu, Wei Li, Lidong Zhou, NSDI 2017.
  • Proteus: agile ML elasticity through tiered reliability in dynamic resource markets. Aaron Harlap, Alexey Tumanov, Andrew Chung, Gregory R. Ganger, Phillip B. Gibbons. EuroSys, 2017.
  • TensorFlow: A system for large-scale machine learning, Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. OSDI 2016.

Decentralized ledgers

  • Practical Byzantine Fault Tolerance, Miguel Castro and Barbara Liskov, OSDI'99.
  • Bitcoin: A Peer-to-Peer Electronic Cash System, Satoshi Nakamoto, 2008.
  • Scalable and Probabilistic Leaderless BFT Consensus through Metastability, Team Rocket, Maofan Yin, Kevin Sekniqi, Robbert van Renesse, and Emin Gun Sirer, 2019.
  • Untangling Blockchain: A Data Processing View of Blockchain Systems. Tien Tuan Anh Dinh, Rui Liu, Meihui Zhang, Gang Chen, Beng Chin Ooi, IEEE Transactions on Knowledge and Data Engineering, 2017.
  • Bridging Paxos and Blockchain Consensus. A. Charapko, A. Ailijiang, M. Demirbas,  IEEE Blockchain, 2018.
  • Blockchains from a distributed computing perspective, Maurice Herlihy, Commn of the ACM, 2017.

Sunday, September 1, 2019

Distributed systems course revamped

I revamped my distributed systems course. I cut out algorithms that do not have much practical use ---even though they are elegant--- such as dining philosophers, termination detection, and self-stabilizing token ring algorithms. I increased coverage of modern distributed systems (distributed databases, big data analysis systems, and cloud computing services).

The new lean and mean course schedule

The first 6 weeks of the course covers the fundamentals, and the second part covers the technologies that build on these foundations.
  • Introduction, 2 phase-commit
  • Reasoning about distributed programs, safety/progress
  • Consensus, Paxos
  • Failure detectors, Faults and fault-tolerance
  • Time: logical clocks, State: distributed snapshots

  • Datacenter computing, Cloud computing
  • NoSQL databases, CAP theorem, Distributed databases
  • Big data, Big data analytics
  • Decentralized ledgers and blockchains

I believe it is important to teach reasoning about distributed programs. I don't expect the students to prove invariants about distributed programs, but I want them to understand and internalize the concept: Each action of the program should preserve the invariant (starting from a state satisfying the invariant). Then the induction does its job, and regardless of which order the actions are executed at any of the processes, the invariant still holds for the distributed system.

Also, in the first part, I take a lot of time to make sure I teach Paxos right. I give TLA+ specifications for Paxos. I get students to role-play Paxos under several scenarios so they can understand how Paxos covers all possible corner cases.

For the second part, I believe it is important for the students to understand that real systems involve tradeoffs. In order to help students develop critiquing skills, I assign several papers to read on a topic, and ask students to write a critical review of one of the papers as the assignment. The students initially find critical reading hard, but then they get to appreciate the experience.

Teaching rigorously can sometimes be frustrating, because many students are just looking for an easy class and getting done with the material. Many students are turned off, when I start talking about invariants, consistency guarantees, using TLA+, critiquing tradeoffs in distributed systems. But I try not to compromise the quality and rigor in the course, because I know that at least half the students are motivated, and they deserve to get a proper introduction to the field.
Hello Professor, 
Hope you're doing well. I was a Computer Science Masters student who graduated last year in Feb. During my masters, you were our Distributed Systems professor, and I wanted to thank you for the way in which you'd taught the course. Reading various papers, and thinking about them is a great way of understanding and reasoning about them. Although, I didn't fully appreciate this method while actually studying the course. 
However, three months ago, just out of curiosity, I started reading the Dynamo DB paper, this time with only one goal, to take as much time as required, to read it purposefully, to internalize it properly. And, it took almost 15 days to understand almost 95% of the paper. I spent all my commute reading it. By the time I finished reading it, I was not only able to understand how the various smaller components fit together, I was able to appreciate the beauty of those complex systems. That I feel, what your goal was! And since then, I've read 5 other papers like Paxos, Aurora, BigTable, Zookeeper, Spanner and I will continue to read more. 
I've now realized what I want to work towards in the long term, and would like to thank you for your teaching, which have been a crucial in shaping my thoughts. :) 
Thank you Professor! It's been a privilege!
Best Regards,


I don't follow a textbook because there isn't a good textbook on modern distributed systems. Some of the textbooks are too theoretical. Most of them are about regurgitating the existing algorithms and subsystems. That is no way to learn a material. When you read a description of an algorithm from a book, you would say "yeah that should work", but even if that was an incorrect algorithm, and you would still say, "yeah that should work". By just reading a description of something and knowing its name, you don't learn that thing. You should get a sense of why it works, how does it cover corner cases, why is it designed in this particular way out of a dozen alternatives.

I refer the students to two free textbooks if they want some reference material:
+ Maarten van Steen Andrew S. Tanenbaum. Distributed Systems
+ Paolo Sivilotti, Introduction to Distributed Systems, 2005 (This is useful for learning about reasoning about distributed algorithms.)

I try to keep an open mind about textbooks, so if you have a textbook to recommend, let me know.


I give a TLA+ project each semester. Since I teach the course with an emphasis on reasoning about the correctness of distributed algorithms, TLA+ is a good fit and complement for my course. Integrating TLA+ to the class gives students a way to get a hands-on experience in algorithms design and dealing with the intrinsic complexities of distributed systems: concurrent execution, asymmetry of information, concurrency bugs, and a series of untimely failures. TLA+ has a lot to offer for practice and implementation of distributed systems as well. At Amazon and Microsoft, engineers use TLA+ for modeling production systems that see a lot of updates and new features. TLA+ helps the engineers find several critical bugs introduced by updates/features in the design stage, which if not found would have resulted in large amount of engineering effort later on.

Ideally I would also like to give a programming project using our Paxi framework. Paxi provides most of the elements that any Paxos implementation or replication protocol needs, including network communication, state machine of a key-value store, client API, and multiple types of quorum systems. The developer only needs to fill in two modules for describing the distributed coordination protocol (many Paxos variants are already implemented using Paxi). In order to provide a leveled playground for protocol evaluation and comparison, Paxi also includes benchmarking support to evaluate the protocols in terms of their performance, availability, scalability, and consistency.

Ailidani has implemented Paxi in Go programming language and made it available as opensource at (It has been starred more than 300 times on GitHub.)

On the other hand, I don't want to burden students further by asking them to learn Go and hack on Paxi, as they have a lot of other things going at this course and 2-3 other courses they are taking in the same semester. Moreover the class has more than 150 students, and supporting extra projects at a class of this size require more resources than we are provided with. As a middle ground, I will assign students who want to go the extra distance optional/ungraded projects in Paxi. Some examples we are thinking of are: Mencius, RingPaxos, ScatterNet, Reconfiguration strategies, Vector-clock based multi-master replication, PBFT, and some decentralized ledgers protocols. If you are interested in hacking on this, you are welcome to try on your own, and contact us if you have questions.

Here are the couple first lectures of my course. At the end of the semester, I will consider making all the lectures available on GitHub.

In the next post, I will provide the reading list for my distributed systems class. It is long, and it is better to give it separately.

Two-phase commit and beyond

In this post, we model and explore the two-phase commit protocol using TLA+. The two-phase commit protocol is practical and is used in man...