Posts

Showing posts matching the search for Tla+ distributed systems

Modeling a Replicated Storage System in TLA+, Project 1

Image
Why a TLA+ project? The first project assignment in my distributed systems class this semester was modeling a replicated storage system in TLA+. Assigning a TLA+ project makes me a rarity among distributed systems professors. A common project would be a MapReduce programming assignment or a project to implement a simple distributed service (such as a key-value store) in Java. I think that a MapReduce deployment project does not teach much about distributed systems, because MapReduce is a very crude abstraction and hides all things distributed under the hood. Using MapReduce for the distributed systems class project would be like handing people a mechanics certification upon successful completion of a driving test. Implementing a simple distributed service, on the other hand, would teach students that indeed programming and debugging distributed systems is very hard. However, I would suspect that much of the hardship in that project would be due to accidental complexities of the i...

Debugging designs with TLA+

Image
This post talks about why you should model your systems and exhaustively test these models/designs with the TLA+ framework. In the first part, I will discuss why modeling your designs is important and beneficial, and in the second part I will explain why TLA+ is a very suitable framework for modeling, especially for distributed and concurrent systems. Modeling is important If you have worked on a large software system, you know that they are prone to corner cases , failed assumptions , race conditions , and cascading faults . There are many corner cases because there are many parameters, and these do interfere in unanticipated ways with each other. The corner cases violate your seemingly reasonable implicit assumptions about the system components and environment, e.g.,"1-hop is faster than 2-hops", "0-hop is faster than 1-hop", and "processes work with the same rate". There are abundant race conditions because today (with the rise of SOA, cloud, and ...

Learning about distributed systems: where to start?

This is definitely not a "learn distributed systems in 21 days" post. I recommend a principled, from the foundations-up, studying of distributed systems, which will take a good three months in the first pass, and many more months to build competence after that. If you are practical and coding oriented you may not like my advice much. You may object saying, "Shouldn't I learn distributed systems with coding and hands on? Why can I not get started by deploying a Hadoop cluster, or studying the Raft code." I think that is the wrong way to go about learning distributed systems, because seeing similar code and programming language constructs will make you think this is familiar territory, and will give you a false sense of security. But, nothing can be further from the truth. Distributed systems need radically different software than centralized systems do.  --A. Tannenbaum This quotation is literally the first sentence in my distributed systems syllabus. Inst...

Feedback from my distributed systems class

I am done with my distributed systems class this semester . At the end of the semester, I polled my students for feedback about the class. I was especially interested in hearing about their feedback on using TLA+ in the course and course projects, and using research paper reviewing for homework assignments. Of course, I didn't mention these when I solicited their feedback. I asked them to write a brief review for what they have learned in my distributed systems class this semester. I received overwhelmingly positive feedback. Below I share some of the comments about TLA+ and the paper review assignments. (I removed the flattering feedback about my teaching, while I enjoyed reading them.  I put a lot of effort in my teaching and I am happy when the students recognize how much I care.  That keeps me motivated. ) Feedback about using TLA+ I assigned two TLA+ projects. The first project was modeling Voldemort key-value store with client-side routing , and the second project ...

Distributed systems course revamped

I revamped my distributed systems course. I cut out algorithms that do not have much practical use ---even though they are elegant--- such as dining philosophers, termination detection, and self-stabilizing token ring algorithms. I increased coverage of modern distributed systems (distributed databases, big data analysis systems, and cloud computing services). The new lean and mean course schedule The first 6 weeks of the course covers the fundamentals, and the second part covers the technologies that build on these foundations. Introduction, 2 phase-commit Reasoning about distributed programs, safety/progress Consensus, Paxos Failure detectors, Faults and fault-tolerance Time: logical clocks, State: distributed snapshots Datacenter computing, Cloud computing NoSQL databases, CAP theorem, Distributed databases Big data, Big data analytics Decentralized ledgers and blockchains I believe it is important to teach reasoning about distributed programs. I don't expect ...

Using TLA+ for teaching distributed systems

I am teaching CSE 4/586 Distributed Systems class  again this Fall (Fall 2014). This is the course I have most fun teaching. (I would like to think my students also feel that way :-) I teach the course with emphasis on reasoning about the correctness of distributed algorithms. Here are the topics I cover in sequence: Introduction, Syntax and semantics for distributed programs, predicate calculus Safety and progress properties Proof of program properties Time: logical clocks, State: distributed snapshots Mutual exclusion, Dining philosophers Consensus, Paxos Fault-tolerance, replication, rollback recovery, self-stabilization Programming support for distributed systems Data center computing and cloud computing  CAP theorem and NOSQL systems Distributed/WAN storage systems I put emphasis on reasoning about distributed algorithms because concurrency is very tricky; it truly humbles human brain. More than 3 actions in a distributed program and your intuitions will f...

Foundational distributed systems papers

I talked about the importance of reading foundational papers last week. To followup, here is my compilation of foundational papers in the distributed systems area. (I focused on the core distributed systems area, and did not cover networking, security, distributed ledgers, verification work etc. I even left out distributed transactions, I hope to cover them at a later date.)  I classified the papers by subject, and listed them in chronological order. I also listed expository papers and blog posts at the end of each section. Time and State in Distributed Systems Time, Clocks, and the Ordering of Events in a Distributed System. Leslie Lamport, Commn. of the ACM,  1978. Distributed Snapshots: Determining Global States of a Distributed System. K. Mani Chandy Leslie Lamport, ACM Transactions on Computer Systems, 1985. Virtual Time and Global States of Distributed Systems.  Mattern, F. 1988. Practical uses of synchronized clocks in distributed systems. B. Liskov, 1991. Exp...

TLA+ conference 2024

Image
The TLA+ foundation got established last year in April as an independent, non-profit organization under the umbrella of the Linux Foundation. This April 15th, we colocated our annual TLA+ conference with the Linux OSSNA event to reach out to developers attending this event. Being under Linux Foundation, we didn't have to pay for the venue, but we paid $3K for professional recording of the sessions, so we can share the talks with the world at large. The videos will be available within a couple weeks.  Marc Brooker: 15 years of TLA+ Marc Brooker gave the keynote this Monday in TLA+ conference 2024. His keynote was a good mix of inspiration for formal methods and challenging formal methods to do better. Brooker is a VP/distinguished engineer at AWS. He has been on-call for 15 years, and were involved with many popular AWS services, including Aurora, Lambda, EC2, EBS, API gateway, iot, bedrock. He defines himself as mostly a practitioner, building large scale distributed systems. Br...

[Paper review] TaxDC: A Taxonomy of nondeterministic concurrency bugs in datacenter distributed systems

Image
This paper appeared in ASPLOS 2016 and the authors are Leesatapornwongsa, Lukman, Lu, and Gunawi. The paper provides a comprehensive study on real-world distributed concurrency bugs and is a good complement to the paper I reviewed before: "Why does cloud stop computing" . While the previous paper looked at all possible bugs that lead to service outages, this paper focuses only on distributed concurrency (DC) bugs. This type of bug happens to be my favorite kind of bug. I am fascinated with "failures", and with DC bugs doubly so. DC bugs are execution timing/ordering/scheduling related, they occur nondeterministically, and are extremely difficult to detect, diagnose, and fix in production systems. I love seeing those long traces of improbable DC bugs surface when I model check distributed algorithms in TLA+ . The experience keeps me humble. The paper presents analysis of 104 DC bugs from Cassandra, HBase, Hadoop MapReduce, and ZooKeeper. These bugs are categorize...

Beyond the Code: TLA+ and the Art of Abstraction

Image
I have been teaching a TLA+ miniseries inside AWS. I just finished the 10th week, with a one hour seminar each week. I wanted to pen down my learnings from this experience. The art of abstraction Let's start with abstraction. Abstraction is a powerful tool for avoiding distraction. The etimology of the word abstract comes from Latin for cut and draw away. With abstraction, you slice out the protocol from a complex system, omit unnecessary details, and simplify a complex system into a useful model. For example, if you are interested in the consistency model of your distributed system, you can abstract away the mechanics of communication in the system when that is an unnecessary distraction. In his 2019 talk, Leslie Lamport said : Abstraction, abstraction, abstraction! That's how you win a Turing Award. Inside Amazon abstraction is also very valuable. We see skilled engineers look at large systems, slice out a protocol/subsystem, cut to the essence of it, and dive deep to make th...

Popular posts from this blog

Hints for Distributed Systems Design

My Time at MIT

Scalable OLTP in the Cloud: What’s the BIG DEAL?

Foundational distributed systems papers

Advice to the young

Learning about distributed systems: where to start?

Distributed Transactions at Scale in Amazon DynamoDB

Making database systems usable

Looming Liability Machines (LLMs)

Analyzing Metastable Failures in Distributed Systems