Posts

Showing posts from September, 2019

Do leases buy us anything?

Consider the atomic storage problem , which is a simpler problem than consensus. CAP theorem says that when there is a partition, you will need to sacrifice strong-consistency or high-availability. Using leases, it is possible to sacrifice availability for sometime (until lease expires), and reconfigure the storage system (often by shrinking the quorum) to keep providing availability. Consistency is preserved throughout, and availability is sacrificed only during lease expiration time. This is a good tradeoff. (I am going to bring up the question of whether this may help circumvent CAP at the end of the post.) But is this power unique to leases? Is there a way to explain this in an alternate way using only failure detectors instead of leases? This is the question I want to explore here. Many distributed systems implement leases with just countdown timers and without NTP timestamps. This is because in the short-term the rate of clocks at processes don't drift too much. So ma

Some of my peculiarities

Daniel Lemire recently wrote about his kindergarten experience , how he couldn't tie his shoes and couldn't memorize numbers for counting. I had similar experiences. I was a smart kid (not just my mom's opinion :-), but I couldn't memorize how to count to 5 until first grade (which I started at 5 years old --which is another story). My mom was a primary school teacher, and she was worried that she couldn't teach me how to count. She invented some rhyming words for numbers to help me memorize them, but apparently that didn't work because I would say the rhyming words instead of the numbers. Even up to third grade, I would occasionally put on my shoes the wrong foot. I couldn't learn how to tie my shoes properly till middle/high school. In middle/high school I started doing a single loop tie. I learned how to do double loop tie only after university. On the other hand, I had decent coordination. I played soccer and basketball fine. I had a very hard time l

Teaching Paxi

Paxos family of protocols (which I refer to as Paxi) is immensely useful for building distributed systems due to their excellent fault-tolerance properties. Many cloud computing services and distributed databases employ Paxi for state machine replication (SMR). Paxi preserve the safety of consensus problem (no two nodes commit different values for the same slot) even to the face of a fully asynchronous execution, crash faults, message losses, and network partitions. Paxi satisfy liveness of consensus problem (some value is eventually committed for the slot) when the system moves outside the realm of the coordinated attack and FLP impossibility results. Paxi are perennially misunderstood and their sophistication underrated. While there has been a lot of work on Paxi, we have been able to explore only a fraction of the algorithm design space. A striking evidence of this arrived in 2016, where we had a flexible quorum breakthrough after 30 years, which no one had anticipated. There

Avalanche is a descendent of Texel

When I first skimmed the Texel paper (which was released on August 28 2019) , I could see parallels between Texel and Avalanche , and noted those in the third paragraph of my review . But I had missed the footnote in the Texel paper which said that Texel was originally written in 2010. I thought Texel was new, and was speculating that it may be a more deterministic version of Avalanche, that is applied to crash tolerant distributed consensus. After writing two more blog posts on modeling Texel in TLA+ and understanding it better, I now think Texel formed a basis that Avalanche descended from. Texel provided an asynchronous and truly-leaderless  solution to consensus. Instead of appointing a leader to bring nodes to consensus (as in Paxos), Texel shows how each node can make its own mind and still achieve consensus in an asynchronous system. By adopting a leaderless solution to asynchronous consensus, Texel avoids the disadvantages of solutions that appoint a leader for achieving co

Modeling a read-write version of Texel: an asynchronous consensus algorithm without rounds

Image
In the previous post I gave a model of atomic Texel , where a node can atomically read all other nodes' decision and update its own decision. Here is a refined version of that, where a node can atomically read the state of *one* other node and update its decision. This refined model shows why it is important for the nodes to read from consistent cuts, and how when multiple nodes are experimenting they can violate this requirement, and Agreement property is violated as a result. The model This builds and extends over the previous model. N stands for number of nodes, and F denotes the number of nodes that can crash. We use f to keep track of actual number of nodes that crash. In addition to the *decision* array that tracks the decision of each node, we now have an *exp* array that denotes the experimentation status of each node. Initially each node is in the experimenting state. Each node starts with t=FALSE (the decision is not finalized), pollSet= Procs \{self} (the node can

Modeling an atomic version of Texel: an asynchronous consensus algorithm without rounds

Image
I had written about my preliminary impressions about Texel , an asynchronous consensus algorithm without rounds. Over the last couple of nights, I have modeled a shared memory version of the algorithm in TLA+, so that I can understand the algorithm better and learn more about the approach. I started with a very rough atomic version of the algorithm, where a node can atomically read the state of all other nodes and update its own state. This is not practical, but it is good for highlighting the essence of the Texel algorithm. In this post, I will talk about this atomic Texel model. After I got this model down, it was easy for me to refine the atomicity. In the refined model, a process can atomically read from one other process and update its state. That refined model is just one step removed from the message passing Texel algorithm presented in the paper , and demonstrates the tricky issues that arise when multiple nodes are concurrently trying to update their states. In that read-w

Paper review. Gray Failure: The Achilles' Heel of Cloud-Scale Systems

Image
This paper (by Peng Huang, Chuanxiong Guo, Lidong Zhou, Jacob R. Lorch, Yingnong Dang, Murali Chintalapati, and Randolph Yao) occurred in HotOS 2017. The paper is an easy read at 5 pages, and considers the fault-tolerance problem of cloud scale systems. Although cloud provides redundancy for masking and replacing failed components, this becomes useful only if those failures can be detected. But some failures that are partial and suble failures remain undetected and these "gray failures" lead to major availability breakdowns and performance anomalies in cloud environments. Examples of gray failures are performance degradation, random packet loss, flaky I/O, memory thrashing/leaking, capacity pressure, and non-fatal exceptions. The paper identifies a key feature of gray failure as differential observability . Consider the setup in Figure 2. Within a system, an observer  gathers information about whether the system is failing or not. Based on the observations, a reactor t

Paper review. Asynchronous consensus without rounds

Image
This paper by Robbert van Renesse appeared on Arxiv  two weeks ago. (Update: Huh, I missed this earlier, but the paper has a footnote that says it was written in 2010.) The paper looks very interesting. I only got to skim the paper, but I will give this a careful read later. All published crash and Byzantine fault tolerant asynchronous consensus protocols use rounds. (Yes, indeed... Paxos , Viewstamped Replication, even Nakamoto consensus , and Avalanche protocol all use rounds.) Rounds are such an inherent part of consensus algorithms that it is tempting to conclude that solving fault tolerant consensus requires ordered rounds. This paper shows that such a conclusion would be wrong by showing an asynchronous consensus protocol that does not use rounds. The protocol is named after Texel, an island of Netherlands . Presumably this is because 1) Robbert is Dutch, and 2) he wants to name an alternative island to Paxos island in a sea farther away from the Ionian sea. Texel provides

Paper review. A generalized solution to distributed consensus

Image
This paper (by Heidi Howard and Richard Mortier) proposes a framework for consensus that uses immutable state to simplify the exposition. They show that both Paxos and Fast Paxos are certain instantiations of this general consensus framework. Finally, they outline some new instances of the consensus framework which provide nice benefits in certain setups. I should note and caution you that this paper considers single instance/decree consensus rather than multiple instances/slots back-to-back consensus. So if you are interested in the latter, there are gaps to fill before you can implement these algorithms to solve multi-decree consensus and maintain RSMs. Moreover while immutability via the write-once registers is great for exposition, extra work needs to be done for achieving implementation efficiency of this abstraction. The consensus problem An algorithm solves consensus if it satisfies these requirements: Non-triviality. All output values must have been the input value of

Linearizable Quorum Reads in Paxos

Image
While there has been a lot of work on Paxos protocols , there has not been any study that considers the read operation in Paxos protocols thoroughly. Read operations do not mutate state, and in many applications the read operations outnumber the update operations. Traditionally, there have been three main ways to perform reads. Treat the read as a regular command, let the leader clear it with a quorum, and return the response Use a lease on the leader (to prevent another leader emerging and committing an update), and read from the leader Read from one of the replicas The first two approaches provide linearizable reads, but the last method is non-linearizable. For example, in ZooKeeper or Raft if you read from a replica, it returns stale values, because the leader commits first---after hearing from a quorum of replicas--- and the replicas follow behind. (Linearizability means that the distributed system emulates a single register where each client can read or write from that reg

On becoming a researcher

I remember the first semester I joined THE Ohio State University as a graduate student, Spring 1998. I had taken Anish Arora's class on distributed systems, and I fell in love with distributed algorithms. The emergence of order out of chaotic (distributed/concurrent) execution was intellectually challenging and was very exciting to me. The way Anish taught was also very engaging, and I got drawn into trying the algorithms he mentioned in the class. He taught about the Dijkstra-Safra algorithm which performed termination detection in a ring with 3 rotations of the token. I thought it should be possible to improve on the 3 rounds required for completion, if we kept track of more information at the token. I mentioned this to Anish after the class, and he told me to give it a try. I started working on it. I was also taking English as A Second Language writing class that semester. So I wrote a technical report on this improved version of the Termination Detection algorithm. (We ended up

Reading list for our distributed systems class

Here is our reading list. It follows our distributed systems course schedule provided in the previous post . I tried to choose the papers that are foundational, comprehensive, and readable by a first year graduate student. In some cases, I omitted very long or hard to follow papers ---even though they may be foundational papers--- and instead included some follow up papers that summarized the concept better. I assign these papers as review material for the corresponding topic. The students then choose one from the batch and review that one critically. But I hope that the students can read all of them if possible. These papers should be all available with a Google Scholar search. Some of these papers appear in open access venues. If that is not the case, often the authors make their papers available freely at their websites, as they have the right to do. "Authors retain the right to use the accepted author manuscript for personal use, internal institutional use and for permitte

Distributed systems course revamped

I revamped my distributed systems course. I cut out algorithms that do not have much practical use ---even though they are elegant--- such as dining philosophers, termination detection, and self-stabilizing token ring algorithms. I increased coverage of modern distributed systems (distributed databases, big data analysis systems, and cloud computing services). The new lean and mean course schedule The first 6 weeks of the course covers the fundamentals, and the second part covers the technologies that build on these foundations. Introduction, 2 phase-commit Reasoning about distributed programs, safety/progress Consensus, Paxos Failure detectors, Faults and fault-tolerance Time: logical clocks, State: distributed snapshots Datacenter computing, Cloud computing NoSQL databases, CAP theorem, Distributed databases Big data, Big data analytics Decentralized ledgers and blockchains I believe it is important to teach reasoning about distributed programs. I don't expect

Popular posts from this blog

Hints for Distributed Systems Design

Learning about distributed systems: where to start?

Making database systems usable

Looming Liability Machines (LLMs)

Foundational distributed systems papers

Advice to the young

Linearizability: A Correctness Condition for Concurrent Objects

Scalable OLTP in the Cloud: What’s the BIG DEAL?

Understanding the Performance Implications of Storage-Disaggregated Databases

Designing Data Intensive Applications (DDIA) Book