Posts

Beyond the Code: TLA+ and the Art of Abstraction

Image
I have been teaching a TLA+ miniseries inside AWS. I just finished the 10th week, with a one hour seminar each week. I wanted to pen down my learnings from this experience. The art of abstraction Let's start with abstraction. Abstraction is a powerful tool for avoiding distraction. The etimology of the word abstract comes from Latin for cut and draw away. With abstraction, you slice out the protocol from a complex system, omit unnecessary details, and simplify a complex system into a useful model. For example, if you are interested in the consistency model of your distributed system, you can abstract away the mechanics of communication in the system when that is an unnecessary distraction. In his 2019 talk, Leslie Lamport said : Abstraction, abstraction, abstraction! That's how you win a Turing Award. Inside Amazon abstraction is also very valuable. We see skilled engineers look at large systems, slice out a protocol/subsystem, cut to the essence of it, and dive deep to make th...

Metastable failures in the wild

Image
This paper appeared in OSDI'22. There is a great summary of the paper by Aleksey (one of the authors and my former PhD student, go Aleksey!). There is also a great conference presentation video from Lexiang. Below I will provide a brief overview of the paper followed by my discussion points. This topic is very interesting and important, so I hope you have fun learning about this. Metastability concept and categories Metastable failure is defined as permanent overload with low throughput even after the fault-trigger is removed. It is an emergent behavior of a system, and it naturally arises from the optimizations for the common case that lead to sustained work amplification. In this paper, the authors are able to capture/abstract the system behavior of interest in terms of two parameters, the load and capacity. If the load is above capacity, you have work piling up, right? Or if the capacity drops under the sustained load level, the same effect, right? Both of these create  a tem...

Review: Performance Modeling and Design of Computer Systems: Queueing Theory in Action

Image
Performance Modeling and Design of Computer Systems: Queueing Theory in Action , by Dr. Mor Harchol-Balter. Cross-posted with https://emptysqua.re/blog/review-queue-theory-book/ We are A. Jesse Jiryu Davis , Andrew Helwer , and Murat Demirbas , three enthusiasts of distributed systems and formal methods. We’re looking for rigorous ways to model the performance of distributed systems, and we hoped this book would point the way. We followed the author’s undergrad syllabus , although we were disappointed at its focus on single-threaded task queues. Most of our work is with multi-threaded servers, aka Processor-Sharing, so we added Chapter 10 section 2 (Aloha protocol) and Chapter 22 (only Jesse read it), skipped Chapter 29, and added Chapter 30. We met roughly every two weeks on Zoom, from December 2022 to August 2023. Between meetings we spent a few hours per week reading; we did many of the homework exercises at the beginning and none of them towards the end. Our Review...

A snapshot isolated database modeling in TLA+

Image
While idly browsing tlaplus/Examples repository, I noticed TLA+ spec for a snapshot isolated Key-Value Store written by my friend Andrew Helwer . This is a centralized database. Each transaction makes a copy of the store, and OCC merges their copy back to store upon commit. I immediately wanted to write a PlusCal version of it and contribute back. My motivations were: Let's see how quickly I can write the PlusCal version. This way I can get practice converting TLA+ specs to PlusCal. (TLC automatically converts PlusCal to TLA+ spec, but the other way requires some judgement call.) The PlusCal version would make this spec easier to understand for modeling newbies. I can also add the client-centric isolation checking to this to provide a more concrete example of that immensely useful module. It would be cool to see the Serializability counterexample the model would throw. I was curious if the model checker will throw the classical write-skew counterexample. This would be a cool self...

Model Checking Guided Testing for Distributed Systems

Image
This paper appeared in Eurosys 2023. It shows how to generate test-cases from a TLA+ model of a distributed protocol and apply it to the Java implementation to check for bugs in the implementation. They applied the technique to Raft, XRaft, and Zab protocols, and presented the bugs they find. Background on the problem Bridging the gap between model (i.e., abstract protocol) to the implementation (i.e., concrete code) is an important problem. TLA+ gives validates the correctness of your abstract/model protocol, but does not say anything about whether your implementation of the protocol is correct.  This is still very useful as Lamport likes to argue: if you didn't get the abstract design right, it doesn't matter if your implementation is true to the abstract, you will still have a flawed implementation. While having a TLA+ model checked abstract does not guarantee that your implementation of the abstract is correct, it helps you avoid the big/intricate protocol level problems a...

Take Out the TraChe: Maximizing (Tra)nsactional Ca(che) Hit Rate

Image
This paper appeared in OSDI 2023 in July, and is authored by Audrey Cheng, David Chu, Terrance Li, Jason Chan, Natacha Crooks, Joseph M. Hellerstein, and Ion Stoica, UC Berkeley; Xiangyao Yu, University of Wisconsin--Madison. The paper seems to break the acronym rule. It defines one acronym, TraChe, in the title, but uses another one, Detox, for the rest of the paper. I guess I agree, referring to your system as Trache is not very appealing.   I kid, because I love the authors. Looks like I cover almost any paper Audrey Cheng writes these days. Speaking of Audrey, she had written a brief summary of this paper here: https://audreyccheng.com/blog/transaction-caching . Hmm, she didn't use the word "TraChe" in her post even once. I wonder which one of the advisors on the paper coined the typo TraChe, and pushed it to the title. Ok, enough with the trache talk. The Problem           You have been doing caching wrong for your transactional workloads! That ...

Popular posts from this blog

Hints for Distributed Systems Design

My Time at MIT

Scalable OLTP in the Cloud: What’s the BIG DEAL?

Foundational distributed systems papers

Advice to the young

Learning about distributed systems: where to start?

Distributed Transactions at Scale in Amazon DynamoDB

Making database systems usable

Looming Liability Machines (LLMs)

Analyzing Metastable Failures in Distributed Systems