Showing posts from September, 2019

Do leases buy us anything?

Consider the atomic storage problem, which is a simpler problem than consensus. The CAP theorem says that when there is a partition, you need to sacrifice either strong consistency or high availability.

Using leases, it is possible to sacrifice availability for some time (until the lease expires) and then reconfigure the storage system (often by shrinking the quorum) so that it keeps providing availability. Consistency is preserved throughout, and availability is sacrificed only during the lease expiration period. This is a good tradeoff. (I will return to the question of whether this may help circumvent CAP at the end of the post.)
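To make the tradeoff concrete, here is a minimal sketch (the class and method names are hypothetical, and the timings illustrative) of a storage group that waits out a suspected member's lease before shrinking its quorum:

```python
import time

class LeaseBasedGroup:
    """Sketch: wait for a suspected member's lease to expire before
    shrinking the quorum. Consistency is preserved because the old
    member cannot act once its lease is gone; availability is lost
    only while we wait out the lease."""

    def __init__(self, members, lease_seconds):
        self.members = set(members)
        self.lease_seconds = lease_seconds
        self.suspected_at = {}  # member -> local monotonic time of suspicion

    def suspect(self, member):
        # Start the countdown; operations needing this member stay
        # unavailable until its lease has surely expired.
        self.suspected_at[member] = time.monotonic()

    def try_reconfigure(self, member):
        started = self.suspected_at.get(member)
        if started is None or time.monotonic() - started < self.lease_seconds:
            return False  # lease may still be live: keep waiting
        self.members.discard(member)  # shrink the quorum
        return True
```

The window of unavailability is exactly the lease duration; after that, the smaller quorum serves again with consistency intact.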

But is this power unique to leases? Is there a way to explain this in an alternate way using only failure detectors instead of leases? This is the question I want to explore here.

Many distributed systems implement leases with just countdown timers, without NTP-synchronized timestamps. This works because, over the short term, the clock rates at different processes do not drift much relative to each other.
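As a sketch of that idea (the class and the drift bound `rho` are illustrative assumptions, not any particular system's API), both sides can count down on their own monotonic clocks and pad the lease by the maximum rate drift:

```python
import time

class CountdownLease:
    """Sketch of a lease kept with local monotonic countdowns, not NTP.
    Holder and grantor each count down on their own clock; as long as
    clock *rates* drift by at most rho, both sides stay safe."""

    def __init__(self, duration, rho=0.01):
        self.duration = duration
        self.rho = rho          # assumed bound on relative clock-rate drift
        self.granted_at = None

    def grant(self):
        self.granted_at = time.monotonic()

    def holder_valid(self):
        # Holder assumes the worst: its clock may run slow, so it
        # considers the lease gone a bit *early*.
        return (time.monotonic() - self.granted_at) < self.duration * (1 - self.rho)

    def grantor_expired(self):
        # Grantor assumes its clock may run fast, so it waits a bit
        # *longer* before reclaiming the lease.
        return (time.monotonic() - self.granted_at) > self.duration * (1 + self.rho)
```

The padding in opposite directions guarantees the holder stops using the lease before the grantor hands it to anyone else, with no synchronized wall clocks.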

So maybe, we …

Some of my peculiarities

Daniel Lemire recently wrote about his kindergarten experience, how he couldn't tie his shoes and couldn't memorize numbers for counting. I had similar experiences. I was a smart kid (not just my mom's opinion :-), but I couldn't memorize how to count to 5 until first grade (which I started at 5 years old --which is another story). My mom was a primary school teacher, and she was worried that she couldn't teach me how to count. She invented some rhyming words for numbers to help me memorize them, but apparently that didn't work because I would say the rhyming words instead of the numbers.

Even up to third grade, I would occasionally put my shoes on the wrong feet. I couldn't learn to tie my shoes properly until middle/high school, when I started doing a single-loop tie. I learned the double-loop tie only after university. On the other hand, I had decent coordination; I played soccer and basketball fine.

I had a very hard time learn…

Teaching Paxi

The Paxos family of protocols (which I refer to as Paxi) is immensely useful for building distributed systems due to its excellent fault-tolerance properties. Many cloud computing services and distributed databases employ Paxi for state machine replication (SMR). Paxi preserve the safety property of consensus (no two nodes commit different values for the same slot) even in the face of fully asynchronous execution, crash faults, message losses, and network partitions. Paxi satisfy the liveness property of consensus (some value is eventually committed for the slot) when the system moves outside the realm of the coordinated attack and FLP impossibility results.

Paxi are perennially misunderstood, and their sophistication is underrated. While there has been a lot of work on Paxi, we have explored only a fraction of the algorithm design space. Striking evidence of this arrived in 2016, when the flexible quorums result delivered a breakthrough after 30 years that no one had anticipated.

There is a …

Avalanche is a descendent of Texel

When I first skimmed the Texel paper (which was released on August 28, 2019), I could see parallels between Texel and Avalanche, and noted those in the third paragraph of my review. But I had missed the footnote in the Texel paper which said that Texel was originally written in 2010. I thought Texel was new, and speculated that it might be a more deterministic version of Avalanche applied to crash-tolerant distributed consensus. After writing two more blog posts on modeling Texel in TLA+ and understanding it better, I now think Texel formed a basis from which Avalanche descended.

Texel provides an asynchronous and truly leaderless solution to consensus. Instead of appointing a leader to bring nodes to consensus (as in Paxos), Texel shows how each node can make up its own mind and still achieve consensus in an asynchronous system. By adopting a leaderless solution to asynchronous consensus, Texel avoids the disadvantages of solutions that appoint a leader for achieving consensus.…

Modeling a read-write version of Texel: an asynchronous consensus algorithm without rounds

In the previous post I gave a model of atomic Texel, where a node can atomically read all other nodes' decisions and update its own decision. Here is a refined version, where a node can atomically read the state of *one* other node and update its decision. This refined model shows why it is important for the nodes to read from consistent cuts, and how, when multiple nodes are experimenting, they can violate this requirement, with the Agreement property violated as a result.

The model. This builds on and extends the previous model. N stands for the number of nodes, and F denotes the number of nodes that can crash. We use f to keep track of the actual number of nodes that crash. In addition to the *decision* array that tracks the decision of each node, we now have an *exp* array that denotes the experimentation status of each node. Initially each node is in the experimenting state.

Each node starts with t=FALSE (the decision is not finalized), pollSet= Procs \{self} (the node can poll a…
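For readers without TLA+ at hand, the model's bookkeeping described above can be mirrored in plain Python (a sketch of the state variables only, not the algorithm; names follow the post):

```python
# Sketch of the model's per-node state (plain Python, not TLA+).
N, F = 4, 1                       # N nodes; up to F of them may crash
f = 0                             # actual number of crashes so far
decision = [None] * N             # decision[i]: node i's current decision
exp = [True] * N                  # exp[i]: node i is still experimenting
t = [False] * N                   # t[i]: node i's decision is finalized
# pollSet[i]: the nodes that node i may still poll (everyone but itself)
pollSet = [set(range(N)) - {i} for i in range(N)]
```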

Modeling an atomic version of Texel: an asynchronous consensus algorithm without rounds

I had written about my preliminary impressions about Texel, an asynchronous consensus algorithm without rounds.

Over the last couple of nights, I have modeled a shared memory version of the algorithm in TLA+, so that I can understand the algorithm better and learn more about the approach. I started with a very rough atomic version of the algorithm, where a node can atomically read the state of all other nodes and update its own state. This is not practical, but it is good for highlighting the essence of the Texel algorithm. In this post, I will talk about this atomic Texel model.
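As a rough illustration of such an atomic step (the thresholds below, deciding when N-F nodes agree and adopting a value supported by F+1 nodes, are my simplification for illustration, not necessarily the paper's exact rule):

```python
from collections import Counter

def atomic_texel_step(i, decisions, n, f):
    """One atomic step of node i in a *simplified* Texel-like model:
    read every node's current decision atomically, decide if n - f
    nodes already agree, otherwise adopt a value supported by at
    least f + 1 nodes. Thresholds are illustrative assumptions.
    Returns (new_decision, decided)."""
    counts = Counter(d for d in decisions if d is not None)
    if not counts:
        return decisions[i], False
    value, support = counts.most_common(1)[0]
    if support >= n - f:
        return value, True    # decide: enough nodes agree
    if support >= f + 1:
        return value, False   # adopt the supported value, keep experimenting
    return decisions[i], False  # no supporting value yet
```

Because the read of all decisions and the update happen as one atomic action, a node never acts on an inconsistent cut; relaxing exactly this atomicity is what makes the read-write refinement tricky.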

After I got this model down, it was easy for me to refine the atomicity. In the refined model, a process can atomically read from one other process and update its state. That refined model is just one step removed from the message passing Texel algorithm presented in the paper, and demonstrates the tricky issues that arise when multiple nodes are concurrently trying to update their states. In that read-write …

Paper review. Gray Failure: The Achilles' Heel of Cloud-Scale Systems

This paper (by Peng Huang, Chuanxiong Guo, Lidong Zhou, Jacob R. Lorch, Yingnong Dang, Murali Chintalapati, and Randolph Yao) appeared in HotOS 2017. The paper is an easy read at 5 pages, and considers the fault-tolerance problem of cloud-scale systems.

Although the cloud provides redundancy for masking and replacing failed components, this is useful only if those failures can be detected. But some partial and subtle failures remain undetected, and these "gray failures" lead to major availability breakdowns and performance anomalies in cloud environments. Examples of gray failures include performance degradation, random packet loss, flaky I/O, memory thrashing/leaking, capacity pressure, and non-fatal exceptions.

The paper identifies a key feature of gray failure as differential observability. Consider the setup in Figure 2. Within a system, an observer gathers information about whether the system is failing or not. Based on the observations, a reactor takes ac…
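The core of the definition can be pinned down in a one-liner (a hypothetical helper, just to make the term concrete):

```python
def differential_observability(app_sees_failure, system_observer_sees_failure):
    """A gray failure exhibits differential observability: the app
    (an external observer) perceives the system as failing, while
    the system's own internal observer still perceives it as healthy,
    so the reactor never kicks in."""
    return app_sees_failure and not system_observer_sees_failure
```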

Paper review. Asynchronous consensus without rounds

This paper by Robbert van Renesse appeared on arXiv two weeks ago. (Update: Huh, I missed this earlier, but the paper has a footnote that says it was written in 2010.) The paper looks very interesting. I have only skimmed it so far, but I will give it a careful read later.

All published crash and Byzantine fault-tolerant asynchronous consensus protocols use rounds. (Yes, indeed... Paxos, Viewstamped Replication, even Nakamoto consensus, and the Avalanche protocol all use rounds.) Rounds are such an inherent part of consensus algorithms that it is tempting to conclude that solving fault-tolerant consensus requires ordered rounds. This paper shows that such a conclusion would be wrong by presenting an asynchronous consensus protocol that does not use rounds.

The protocol is named after Texel, an island of the Netherlands. Presumably this is because 1) Robbert is Dutch, and 2) he wants to name an alternative island to the Paxos island, in a sea farther away from the Ionian Sea. Texel provides binary co…

Paper review. A generalized solution to distributed consensus

This paper (by Heidi Howard and Richard Mortier) proposes a framework for consensus that uses immutable state to simplify the exposition. They show that both Paxos and Fast Paxos are certain instantiations of this general consensus framework. Finally, they outline some new instances of the consensus framework which provide nice benefits in certain setups.

I should note and caution you that this paper considers single-instance/decree consensus rather than multiple back-to-back instances/slots of consensus. So if you are interested in the latter, there are gaps to fill before you can implement these algorithms to solve multi-decree consensus and maintain RSMs. Moreover, while immutability via write-once registers is great for exposition, extra work needs to be done to implement this abstraction efficiently.

The consensus problem. An algorithm solves consensus if it satisfies these requirements:
Non-triviality. All output values must have been the input value of a client.…

Linearizable Quorum Reads in Paxos

While there has been a lot of work on Paxos protocols, no study has thoroughly considered the read operation in these protocols. Read operations do not mutate state, and in many applications read operations outnumber update operations.

Traditionally, there have been three main ways to perform reads.
1. Treat the read as a regular command, let the leader clear it with a quorum, and return the response.
2. Use a lease on the leader (to prevent another leader emerging and committing an update), and read from the leader.
3. Read from one of the replicas.

The first two approaches provide linearizable reads, but the last method is not linearizable. For example, in ZooKeeper or Raft, if you read from a replica, it may return stale values, because the leader commits first ---after hearing from a quorum of replicas--- and the replicas follow behind. (Linearizability means that the distributed system emulates a single register where each client can read or write from that register. Each…
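To illustrate why the replica read path is not linearizable (hypothetical classes, not any real system's API), consider a leader that commits before its replicas catch up:

```python
import time

class LeasedLeader:
    """Sketch contrasting two of the read paths described above:
    a lease-protected leader read vs. a (possibly stale) replica read."""

    def __init__(self, replicas, lease_seconds):
        self.replicas = replicas  # follower logs (plain lists), which may lag
        self.log = []             # leader's committed log
        self.lease_until = time.monotonic() + lease_seconds

    def write(self, value):
        # The leader commits first; replicas catch up later.
        self.log.append(value)

    def replicate_once(self):
        # Ship one missing entry to each lagging replica.
        for r in self.replicas:
            if len(r) < len(self.log):
                r.append(self.log[len(r)])

    def leader_lease_read(self):
        # Linearizable: a live lease rules out a competing leader
        # having committed anything we haven't seen.
        if time.monotonic() < self.lease_until:
            return list(self.log)
        raise RuntimeError("lease expired; fall back to a quorum read")

    def replica_read(self, i):
        # NOT linearizable: may return a stale prefix of the log.
        return list(self.replicas[i])
```

Right after a write, the leader's lease read sees the new value while a replica read still returns the old prefix, which is exactly the staleness described above.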

On becoming a researcher

I remember the first semester I joined THE Ohio State University as a graduate student, Spring 1998. I had taken Anish Arora's class on distributed systems, and I fell in love with distributed algorithms. The emergence of order out of chaotic (distributed/concurrent) execution was intellectually challenging and very exciting to me. The way Anish taught was also very engaging, and I got drawn into trying the algorithms he mentioned in class. He taught the Dijkstra-Safra algorithm, which performs termination detection in a ring with 3 rotations of the token. I thought it should be possible to improve on the 3 rounds required for completion if we kept track of more information at the token. I mentioned this to Anish after class, and he told me to give it a try. I started working on it. I was also taking an English as a Second Language writing class that semester. So I wrote a technical report on this improved version of the termination detection algorithm. (We ended u…

Reading list for our distributed systems class

Here is our reading list. It follows our distributed systems course schedule provided in the previous post. I tried to choose papers that are foundational, comprehensive, and readable by a first-year graduate student. In some cases, I omitted very long or hard-to-follow papers ---even though they may be foundational--- and instead included follow-up papers that summarize the concept better.

I assign these papers as review material for the corresponding topic. The students then choose one from the batch and review that one critically. But I hope that the students can read all of them if possible.

These papers should be all available with a Google Scholar search. Some of these papers appear in open access venues. If that is not the case, often the authors make their papers available freely at their websites, as they have the right to do. "Authors retain the right to use the accepted author manuscript for personal use, internal institutional use and for permitted sc…

Distributed systems course revamped

I revamped my distributed systems course. I cut out algorithms that do not have much practical use ---even though they are elegant--- such as dining philosophers, termination detection, and self-stabilizing token ring algorithms. I increased coverage of modern distributed systems (distributed databases, big data analysis systems, and cloud computing services).

The new lean and mean course schedule. The first 6 weeks of the course cover the fundamentals, and the second part covers the technologies that build on these foundations.

- Introduction, 2-phase commit
- Reasoning about distributed programs, safety/progress
- Consensus, Paxos
- Failure detectors, Faults and fault-tolerance
- Time: logical clocks, State: distributed snapshots

- Datacenter computing, Cloud computing
- NoSQL databases, CAP theorem, Distributed databases
- Big data, Big data analytics
- Decentralized ledgers and blockchains
I believe it is important to teach reasoning about distributed programs. I don't expect the students to prove invar…
