Posts

Showing posts from August, 2023

Model Checking Guided Testing for Distributed Systems

Image
This paper appeared in Eurosys 2023. It shows how to generate test-cases from a TLA+ model of a distributed protocol and apply it to the Java implementation to check for bugs in the implementation. They applied the technique to Raft, XRaft, and Zab protocols, and presented the bugs they find. Background on the problem Bridging the gap between model (i.e., abstract protocol) to the implementation (i.e., concrete code) is an important problem. TLA+ gives validates the correctness of your abstract/model protocol, but does not say anything about whether your implementation of the protocol is correct.  This is still very useful as Lamport likes to argue: if you didn't get the abstract design right, it doesn't matter if your implementation is true to the abstract, you will still have a flawed implementation. While having a TLA+ model checked abstract does not guarantee that your implementation of the abstract is correct, it helps you avoid the big/intricate protocol level problems a

Take Out the TraChe: Maximizing (Tra)nsactional Ca(che) Hit Rate

Image
This paper appeared in OSDI 2023 in July, and is authored by Audrey Cheng, David Chu, Terrance Li, Jason Chan, Natacha Crooks, Joseph M. Hellerstein, and Ion Stoica, UC Berkeley; Xiangyao Yu, University of Wisconsin--Madison. The paper seems to break the acronym rule. It defines one acronym, TraChe, in the title, but uses another one, Detox, for the rest of the paper. I guess I agree, referring to your system as Trache is not very appealing.   I kid, because I love the authors. Looks like I cover almost any paper Audrey Cheng writes these days. Speaking of Audrey, she had written a brief summary of this paper here: https://audreyccheng.com/blog/transaction-caching . Hmm, she didn't use the word "TraChe" in her post even once. I wonder which one of the advisors on the paper coined the typo TraChe, and pushed it to the title. Ok, enough with the trache talk. The Problem           You have been doing caching wrong for your transactional workloads! That is the main message

Distributed Transactions at Scale in Amazon DynamoDB

Image
This paper appeared in July at USENIX ATC 2023. If you haven't read about the architecture and operation of DynamoDB, please first read my summary of the DynamoDB ATC 2022 paper . The big omission in that paper was discussion about transactions. This paper amends that. It is great to see DynamoDB, and AWS in general, is publishing/sharing more widely than before. Overview A killer feature of DynamoDB is predictability at any scale. Do read Marc Brooker's post to fully appreciate this feature. Aligned with this predictability tenet, when adding transactions to DynamoDB, the first and primary constraint was to preserve the predictable high performance of single-key reads/writes at any scale. The second big constraint was to implement transactions using update in-place operation without multi-version concurrency control. The reason for this was they didn't want to mock with the storage layer which did not support multi-versioning. Satisfying both of the above constraints may s

Going Beyond an Incident Report with TLA+

Image
This is a short article at usenix-login about the use of TLA+ to explain root cause for an Azure incident. It seems like this is part of a longer arxiv paper on the use of TLA+ for exploring/explaining data consistency in Cosmos DB. Markus Kuppe is an author here. Markus is heading the TLA+ model checker, TLC, development for more than a decade. If you've used the TLA+ model checker, you used Markus's work. If you are an intermediate/advanced user, it is very likely that you interacted with Markus, and got help from him at some point. The incident The Microsoft Azure incident is a 28-day outage. It looks like the incident got undetected/unreported for 26 days, because it was a partial outage. " A majority of requests did not fail -- rather, a specific type of request was disproportionately affected, such that global error rates did not reveal the outage despite a specific group of users being impacted. " For each request serviced by the work dispatcher, it would (1)

Popular posts from this blog

Learning about distributed systems: where to start?

Hints for Distributed Systems Design

Foundational distributed systems papers

Metastable failures in the wild

The demise of coding is greatly exaggerated

Scalable OLTP in the Cloud: What’s the BIG DEAL?

The end of a myth: Distributed transactions can scale

SIGMOD panel: Future of Database System Architectures

Why I blog

There is plenty of room at the bottom