Metadata

Posts

Showing posts from August, 2023

Model Checking Guided Testing for Distributed Systems

- August 28, 2023

This paper appeared in Eurosys 2023. It shows how to generate test-cases from a TLA+ model of a distributed protocol and apply it to the Java implementation to check for bugs in the implementation. They applied the technique to Raft, XRaft, and Zab protocols, and presented the bugs they find. Background on the problem Bridging the gap between model (i.e., abstract protocol) to the implementation (i.e., concrete code) is an important problem. TLA+ gives validates the correctness of your abstract/model protocol, but does not say anything about whether your implementation of the protocol is correct. This is still very useful as Lamport likes to argue: if you didn't get the abstract design right, it doesn't matter if your implementation is true to the abstract, you will still have a flawed implementation. While having a TLA+ model checked abstract does not guarantee that your implementation of the abstract is correct, it helps you avoid the big/intricate protocol level problems a

Take Out the TraChe: Maximizing (Tra)nsactional Ca(che) Hit Rate

- August 24, 2023

This paper appeared in OSDI 2023 in July, and is authored by Audrey Cheng, David Chu, Terrance Li, Jason Chan, Natacha Crooks, Joseph M. Hellerstein, and Ion Stoica, UC Berkeley; Xiangyao Yu, University of Wisconsin--Madison. The paper seems to break the acronym rule. It defines one acronym, TraChe, in the title, but uses another one, Detox, for the rest of the paper. I guess I agree, referring to your system as Trache is not very appealing. I kid, because I love the authors. Looks like I cover almost any paper Audrey Cheng writes these days. Speaking of Audrey, she had written a brief summary of this paper here: https://audreyccheng.com/blog/transaction-caching . Hmm, she didn't use the word "TraChe" in her post even once. I wonder which one of the advisors on the paper coined the typo TraChe, and pushed it to the title. Ok, enough with the trache talk. The Problem You have been doing caching wrong for your transactional workloads! That is the main message

Distributed Transactions at Scale in Amazon DynamoDB

- August 17, 2023

This paper appeared in July at USENIX ATC 2023. If you haven't read about the architecture and operation of DynamoDB, please first read my summary of the DynamoDB ATC 2022 paper . The big omission in that paper was discussion about transactions. This paper amends that. It is great to see DynamoDB, and AWS in general, is publishing/sharing more widely than before. Overview A killer feature of DynamoDB is predictability at any scale. Do read Marc Brooker's post to fully appreciate this feature. Aligned with this predictability tenet, when adding transactions to DynamoDB, the first and primary constraint was to preserve the predictable high performance of single-key reads/writes at any scale. The second big constraint was to implement transactions using update in-place operation without multi-version concurrency control. The reason for this was they didn't want to mock with the storage layer which did not support multi-versioning. Satisfying both of the above constraints may s

Going Beyond an Incident Report with TLA+

- August 02, 2023

This is a short article at usenix-login about the use of TLA+ to explain root cause for an Azure incident. It seems like this is part of a longer arxiv paper on the use of TLA+ for exploring/explaining data consistency in Cosmos DB. Markus Kuppe is an author here. Markus is heading the TLA+ model checker, TLC, development for more than a decade. If you've used the TLA+ model checker, you used Markus's work. If you are an intermediate/advanced user, it is very likely that you interacted with Markus, and got help from him at some point. The incident The Microsoft Azure incident is a 28-day outage. It looks like the incident got undetected/unreported for 26 days, because it was a partial outage. " A majority of requests did not fail -- rather, a specific type of request was disproportionately affected, such that global error rates did not reveal the outage despite a specific group of users being impacted. " For each request serviced by the work dispatcher, it would (1)