Showing posts from September, 2023

Beyond the Code: TLA+ and the Art of Abstraction

I have been teaching a TLA+ miniseries inside AWS. I just finished the 10th week, with a one hour seminar each week. I wanted to pen down my learnings from this experience. The art of abstraction Let's start with abstraction. Abstraction is a powerful tool for avoiding distraction. The etimology of the word abstract comes from Latin for cut and draw away. With abstraction, you slice out the protocol from a complex system, omit unnecessary details, and simplify a complex system into a useful model. For example, if you are interested in the consistency model of your distributed system, you can abstract away the mechanics of communication in the system when that is an unnecessary distraction. In his 2019 talk, Leslie Lamport said : Abstraction, abstraction, abstraction! That's how you win a Turing Award. Inside Amazon abstraction is also very valuable. We see skilled engineers look at large systems, slice out a protocol/subsystem, cut to the essence of it, and dive deep to make th

Metastable failures in the wild

This paper appeared in OSDI'22. There is a great summary of the paper by Aleksey (one of the authors and my former PhD student, go Aleksey!). There is also a great conference presentation video from Lexiang. Below I will provide a brief overview of the paper followed by my discussion points. This topic is very interesting and important, so I hope you have fun learning about this. Metastability concept and categories Metastable failure is defined as permanent overload with low throughput even after the fault-trigger is removed. It is an emergent behavior of a system, and it naturally arises from the optimizations for the common case that lead to sustained work amplification. In this paper, the authors are able to capture/abstract the system behavior of interest in terms of two parameters, the load and capacity. If the load is above capacity, you have work piling up, right? Or if the capacity drops under the sustained load level, the same effect, right? Both of these create  a tem

Review: Performance Modeling and Design of Computer Systems: Queueing Theory in Action

Performance Modeling and Design of Computer Systems: Queueing Theory in Action , by Dr. Mor Harchol-Balter. Cross-posted with We are A. Jesse Jiryu Davis , Andrew Helwer , and Murat Demirbas , three enthusiasts of distributed systems and formal methods. We’re looking for rigorous ways to model the performance of distributed systems, and we hoped this book would point the way. We followed the author’s undergrad syllabus , although we were disappointed at its focus on single-threaded task queues. Most of our work is with multi-threaded servers, aka Processor-Sharing, so we added Chapter 10 section 2 (Aloha protocol) and Chapter 22 (only Jesse read it), skipped Chapter 29, and added Chapter 30. We met roughly every two weeks on Zoom, from December 2022 to August 2023. Between meetings we spent a few hours per week reading; we did many of the homework exercises at the beginning and none of them towards the end. Our Review

A snapshot isolated database modeling in TLA+

While idly browsing tlaplus/Examples repository, I noticed TLA+ spec for a snapshot isolated Key-Value Store written by my friend Andrew Helwer . This is a centralized database. Each transaction makes a copy of the store, and OCC merges their copy back to store upon commit. I immediately wanted to write a PlusCal version of it and contribute back. My motivations were: Let's see how quickly I can write the PlusCal version. This way I can get practice converting TLA+ specs to PlusCal. (TLC automatically converts PlusCal to TLA+ spec, but the other way requires some judgement call.) The PlusCal version would make this spec easier to understand for modeling newbies. I can also add the client-centric isolation checking to this to provide a more concrete example of that immensely useful module. It would be cool to see the Serializability counterexample the model would throw. I was curious if the model checker will throw the classical write-skew counterexample. This would be a cool self

Popular posts from this blog

Learning about distributed systems: where to start?

Hints for Distributed Systems Design

Foundational distributed systems papers

Metastable failures in the wild

The demise of coding is greatly exaggerated

Scalable OLTP in the Cloud: What’s the BIG DEAL?

The end of a myth: Distributed transactions can scale

SIGMOD panel: Future of Database System Architectures

Why I blog

There is plenty of room at the bottom