Posts

Showing posts with the label stabilization

Checking statistical properties of protocols using TLA+

Image
In my previous post, I mentioned how excited I was about Jack Vanlightly and Markus Kuppe's talk at TLA+ Conference. They showed a new mode (called "generate") added to the TLA+ checker to enable  obtaining statistical properties from the models. Here is how this works in a nutshell. Your TLA+ model/protocol produces a tree of runs, and, in the generate mode the TLA+ checker chooses these runs in a uniform manner. You also specify some state constraints, and when these are satisfied the TLA+ checker executes/evaluates your state constraint formula: you add the statistics logging as the side effect of this formula, and record information about the current run to a CSV file. This way, you can collect statistical hyper properties (abort rates, completion rounds, etc.) about the runs, in addition to checking it for correctness invariants. Later you can use Excel or Rscript to visualize the statistics collected and analyze the model behavior. Checking statistics is very usef...

Mergeable Replicated Data Types

Image
This paper is by Gowtham Kaki, Swarn Priya, KC Sivaramakrishnan, and Suresh Jagannathan , and it appeared in OOPSLA 19. We all know and love CRDTs . Using CRDTs, any replica can accept updates without remote synchronization. The updates may reach to replicas and applied in each in different order, but provided that the updates commute each replica converges to the same state eventually. CRDTs provide out-of-the-box eventual consistency to replicated data types with commutative operations, examples include set data type with member-addition/removal and counter data type with addition/subtraction. Unfortunately CRDTs are limited to commutative operations. If we also wanted to have multiplication for the counter data structure, since that won't commute with addition and subtraction, we would be in trouble. MRDTs extend CRDT idea to allow non-commutative operations for certain data structures. The idea in MRDTs is to use a 3-way merge formula. If you have an extra bit of informat...

Paper review. Gray Failure: The Achilles' Heel of Cloud-Scale Systems

Image
This paper (by Peng Huang, Chuanxiong Guo, Lidong Zhou, Jacob R. Lorch, Yingnong Dang, Murali Chintalapati, and Randolph Yao) occurred in HotOS 2017. The paper is an easy read at 5 pages, and considers the fault-tolerance problem of cloud scale systems. Although cloud provides redundancy for masking and replacing failed components, this becomes useful only if those failures can be detected. But some failures that are partial and suble failures remain undetected and these "gray failures" lead to major availability breakdowns and performance anomalies in cloud environments. Examples of gray failures are performance degradation, random packet loss, flaky I/O, memory thrashing/leaking, capacity pressure, and non-fatal exceptions. The paper identifies a key feature of gray failure as differential observability . Consider the setup in Figure 2. Within a system, an observer  gathers information about whether the system is failing or not. Based on the observations, a reactor t...

Everything is broken

Last Wednesday, I attended one of the monthly meetings of the "Everything is Broken" meet up at Seattle. It turns out I selected a great meeting to attend, because both speakers, Charity Majors and Tammy Butow , were excellent. Here are some select quotes without context. Observability-driven development - Charity Majors Chaos engineering is testing code in production. "What if I told you: you could test both in and before production." Deploying code is not a binary switch; deploying code is a process of increasing your confidence in your code. "Microservices are hard!" as a caption for a figure comparing the LAMP stack 2005 versus the complexity of the Parse stack 2015. We are all distributed systems engineers and unknowns outnumber the knowns! Distributed systems have an infinite number of almost-impossible failures! Without observability you don't have chaos engineering, you have a chaos. Monitoring systems have not changed signi...

Mind Your State for Your State of Mind

Image
This article by Pat Helland appeared at ACM Queue on July 2018. Pat is a self-confessed apostate . He is also a database philosopher; look at the title of his recent publications : Standing on distributed shoulders of giants, Life beyond distributed transactions, Immutability changes everything, Heisenberg was on the "write" track, consistently eventual, etc. This "mind your state for your state of mind" article looks at the history of interactions of applications and storage/databases, and charts their co-evolution as they move into the distributed and scalable world. The evolution of state, storage, and computing Storage has evolved from disks directly attached to your computer to shared appliances such as SANs (storage area networks) leading the way to storage clusters of commodity servers contained in a network. Computing evolved from a single-process on a single server, to multiple processes communicating on a single server, to RPCs (remote procedure c...

About self-stabilization and blockchain consensus

This is a half-baked exploratory piece about how some self-stabilization ideas may relate with blockchain consensus in open/permissionless environments. I will write about some similarities and some major differences between the self-stabilizing distributed algorithms work and blockchain consensus deployments. Self-stabilization Let's start with a brief review self-stabilization . Stabilization is a type of fault-tolerance that handles faults in a principled unified manner instead of on a case-by-case basis. Instead of trying to figure out how much faults can disrupt the system's operation, stabilization assumes arbitrary state corruption, which covers all possible worst-case collusions of faults and program actions. Stabilization then advocates designing recovery actions that takes the program back to invariant states starting from any arbitrary state. This makes stabilization suitable for dealing with unanticipated faults. The most distinct property of self-stabilizat...

Popular posts from this blog

Hints for Distributed Systems Design

My Time at MIT

Scalable OLTP in the Cloud: What’s the BIG DEAL?

Foundational distributed systems papers

Advice to the young

Learning about distributed systems: where to start?

Distributed Transactions at Scale in Amazon DynamoDB

Making database systems usable

Looming Liability Machines (LLMs)

Analyzing Metastable Failures in Distributed Systems