Distributed systems course revamped

I revamped my distributed systems course. I cut out algorithms that do not have much practical use ---even though they are elegant--- such as dining philosophers, termination detection, and self-stabilizing token ring algorithms. I increased coverage of modern distributed systems (distributed databases, big data analysis systems, and cloud computing services).

The new lean and mean course schedule

The first 6 weeks of the course covers the fundamentals, and the second part covers the technologies that build on these foundations.
  • Introduction, 2 phase-commit
  • Reasoning about distributed programs, safety/progress
  • Consensus, Paxos
  • Failure detectors, Faults and fault-tolerance
  • Time: logical clocks, State: distributed snapshots

  • Datacenter computing, Cloud computing
  • NoSQL databases, CAP theorem, Distributed databases
  • Big data, Big data analytics
  • Decentralized ledgers and blockchains

I believe it is important to teach reasoning about distributed programs. I don't expect the students to prove invariants about distributed programs, but I want them to understand and internalize the concept: Each action of the program should preserve the invariant (starting from a state satisfying the invariant). Then the induction does its job, and regardless of which order the actions are executed at any of the processes, the invariant still holds for the distributed system.

Also, in the first part, I take a lot of time to make sure I teach Paxos right. I give TLA+ specifications for Paxos. I get students to role-play Paxos under several scenarios so they can understand how Paxos covers all possible corner cases.

For the second part, I believe it is important for the students to understand that real systems involve tradeoffs. In order to help students develop critiquing skills, I assign several papers to read on a topic, and ask students to write a critical review of one of the papers as the assignment. The students initially find critical reading hard, but then they get to appreciate the experience.

Teaching rigorously can sometimes be frustrating, because many students are just looking for an easy class and getting done with the material. Many students are turned off, when I start talking about invariants, consistency guarantees, using TLA+, critiquing tradeoffs in distributed systems. But I try not to compromise the quality and rigor in the course, because I know that at least half the students are motivated, and they deserve to get a proper introduction to the field.
Hello Professor, 
Hope you're doing well. I was a Computer Science Masters student who graduated last year in Feb. During my masters, you were our Distributed Systems professor, and I wanted to thank you for the way in which you'd taught the course. Reading various papers, and thinking about them is a great way of understanding and reasoning about them. Although, I didn't fully appreciate this method while actually studying the course. 
However, three months ago, just out of curiosity, I started reading the Dynamo DB paper, this time with only one goal, to take as much time as required, to read it purposefully, to internalize it properly. And, it took almost 15 days to understand almost 95% of the paper. I spent all my commute reading it. By the time I finished reading it, I was not only able to understand how the various smaller components fit together, I was able to appreciate the beauty of those complex systems. That I feel, what your goal was! And since then, I've read 5 other papers like Paxos, Aurora, BigTable, Zookeeper, Spanner and I will continue to read more. 
I've now realized what I want to work towards in the long term, and would like to thank you for your teaching, which have been a crucial in shaping my thoughts. :) 
Thank you Professor! It's been a privilege!
Best Regards,

Textbook

I don't follow a textbook because there isn't a good textbook on modern distributed systems. Some of the textbooks are too theoretical. Most of them are about regurgitating the existing algorithms and subsystems. That is no way to learn a material. When you read a description of an algorithm from a book, you would say "yeah that should work", but even if that was an incorrect algorithm, and you would still say, "yeah that should work". By just reading a description of something and knowing its name, you don't learn that thing. You should get a sense of why it works, how does it cover corner cases, why is it designed in this particular way out of a dozen alternatives.

I refer the students to two free textbooks if they want some reference material:
+ Maarten van Steen Andrew S. Tanenbaum. Distributed Systems
+ Paolo Sivilotti, Introduction to Distributed Systems, 2005 (This is useful for learning about reasoning about distributed algorithms.)

I try to keep an open mind about textbooks, so if you have a textbook to recommend, let me know.

Project

I give a TLA+ project each semester. Since I teach the course with an emphasis on reasoning about the correctness of distributed algorithms, TLA+ is a good fit and complement for my course. Integrating TLA+ to the class gives students a way to get a hands-on experience in algorithms design and dealing with the intrinsic complexities of distributed systems: concurrent execution, asymmetry of information, concurrency bugs, and a series of untimely failures. TLA+ has a lot to offer for practice and implementation of distributed systems as well. At Amazon and Microsoft, engineers use TLA+ for modeling production systems that see a lot of updates and new features. TLA+ helps the engineers find several critical bugs introduced by updates/features in the design stage, which if not found would have resulted in large amount of engineering effort later on.

Ideally I would also like to give a programming project using our Paxi framework. Paxi provides most of the elements that any Paxos implementation or replication protocol needs, including network communication, state machine of a key-value store, client API, and multiple types of quorum systems. The developer only needs to fill in two modules for describing the distributed coordination protocol (many Paxos variants are already implemented using Paxi). In order to provide a leveled playground for protocol evaluation and comparison, Paxi also includes benchmarking support to evaluate the protocols in terms of their performance, availability, scalability, and consistency.

Ailidani has implemented Paxi in Go programming language and made it available as opensource at https://github.com/ailidani/paxi. (It has been starred more than 300 times on GitHub.)

On the other hand, I don't want to burden students further by asking them to learn Go and hack on Paxi, as they have a lot of other things going at this course and 2-3 other courses they are taking in the same semester. Moreover the class has more than 150 students, and supporting extra projects at a class of this size require more resources than we are provided with. As a middle ground, I will assign students who want to go the extra distance optional/ungraded projects in Paxi. Some examples we are thinking of are: Mencius, RingPaxos, ScatterNet, Reconfiguration strategies, Vector-clock based multi-master replication, PBFT, and some decentralized ledgers protocols. If you are interested in hacking on this, you are welcome to try on your own, and contact us if you have questions.

Here are the couple first lectures of my course. At the end of the semester, I will consider making all the lectures available on GitHub.

In the next post, I will provide the reading list for my distributed systems class. It is long, and it is better to give it separately.

Comments

Popular posts from this blog

Foundational distributed systems papers

Your attitude determines your success

Progress beats perfect

Cores that don't count

Silent data corruptions at scale

Learning about distributed systems: where to start?

Read papers, Not too much, Mostly foundational ones

Sundial: Fault-tolerant Clock Synchronization for Datacenters

Linearizability

Using Lightweight Formal Methods to Validate a Key-Value Storage Node in Amazon S3 (SOSP21)