Learning about distributed systems: where to start?

This is definitely not a "learn distributed systems in 21 days" post. I recommend a principled, from the foundations-up, studying of distributed systems, which will take a good three months in the first pass, and many more months to build competence after that.

If you are practical and coding oriented you may not like my advice much. You may object saying, "Shouldn't I learn distributed systems with coding and hands on? Why can I not get started by deploying a Hadoop cluster, or studying the Raft code." I think that is the wrong way to go about learning distributed systems, because seeing similar code and programming language constructs will make you think this is familiar territory, and will give you a false sense of security. But, nothing can be further from the truth.
Distributed systems need radically different software than centralized systems do. 
--A. Tannenbaum
This quotation is literally the first sentence in my distributed systems syllabus. Instead of trying to relate distributed systems constructs to centralized constructs, you should treat distributed systems as a radical novelty. You should first dive into the inherent difficulties (reasoning about concurrency and fault-tolerance) in distributed system, rather than dipping your toe in the accidental difficulties (implementation of a framework).

For a principled foundations-up studying of distributed systems, this is what I recommend.

Predicate logic, reasoning about safety and progress

To appreciate the challenges of reasoning with concurrency, it is important to start with a quick crash course on predicate logic, and reasoning about safety properties (next, stable, invariant), and liveness properties (transient, ensures, leads-to, variant functions).

Paolo Sivilotti, who was a student of one of the creators of the UNITY pseudocode and reasoning framework, has a nice pedagogical approach to teach these concepts in his book, which is available as a free download. I use this book for the first month of my classes. (Thank you Paul!)

This may look like grunt work to you, as it does to many of my students. But let me tell you, as I tell them, this is what you need to learn/internalize first so you can start to do dist-sys kungfu. (To motivate students for this legwork, I show students scenes from Karate Kid.)

TLA+ to play with algorithms

After this background on reasoning about safety and progress, you can start using TLA+/Pluscal framework where you can write distributed algorithms and the model checker can tell you what a fool you are.

You can find information about how TLA+ can help you when learning about your distributed systems in many posts in my blog. 

Hillel's TLA+ introduction website is great for getting you started. After that you can read through many books and videos Lamport has made available for free.

Impossibility results

I then recommend studying the impossibility results in distributed systems. There is nothing better than seeing how you can't do even the most basic coordination tasks under common failure employments to drive the point home that
  1. distributed systems are radically different than centralized systems, and
  2. fault-tolerance needs to be treated as a first-class citizen in distributed systems. 
There are two big impossibility results, coordinated attack and FLP impossibility results.

The coordinating attack result says that if the communication channels can drop messages you cannot solve distributed consensus using a deterministic protocol in finite rounds. OK, let's assume reliable, or eventually for a sufficient period  reliable channels. Pow! In your face. FLP shows that even with reliable channels, under an asynchronous model, you cannot solve distributed consensus using a deterministic protocol in finite rounds, in the presence of a single crash failure. CAP theorem considers the coordinating attack model for the atomic storage problem, an easier problem than distributed consensus, and shows that with arbitrarily unreliable channels, you cannot solve the atomic storage problem either.

To provide a smooth introduction to the impossibility results, I use the two-phase commit as a working example, and show via TLA+ modeling how these impossibility results play out in the context of the simple two-phase commit protocol.

After you learn these impossibility results, you can then start to learn about ways to circumvent (not to beat) these impossibility results. And that takes us to the consensus and fault-tolerance discussion.

I am unable to point to a good textbook for a good coverage of the impossibility results and distributed consensus and atomic storage protocols. Let me know if you know a textbook that provides a good coverage of these topics. A supplementary free pdf book is Maarten van Steen Andrew S. Tanenbaum. Distributed Systems. Freely available from https://www.distributed-systems.net/index.php/books/ds3/

From now on, you should try to read the original research papers and competent blog posts that explain them. Some sources I can recommend are:

Consensus and fault-tolerance

It is good to follow up the impossibility results with the Paxos protocol and variants to show how they skillfully circumvent these results.

It takes considerable time to learn Paxos well. But one day you finally understand Paxos and feel things fall into place, and you will celebrate your victory. But you should know that this is a premature celebration. You will need to be confused and re-learn the algorithm several times before you properly internalize it. (I think I have more than 100 posts related to Paxos one way or another, including a post about Paxos jokes.)

You can learn about failure detectors and fault-tolerance in a hands-on manner during learning about Paxos and distributed consensus. They can go hand in hand, so your study of failure detectors and fault-tolerance can build on some concrete ground.

Along with these, you can also study about many Paxos variants, Replicated State Machines (RSM), and chain replication type advanced atomic storage protocols.

Managing time and state in distributed systems 

We delayed studying about time and state, but it is now time to look at this, otherwise we will be in a regretful state.

The "There is no now" article provides a coverage of difficulties and practical implications of dealing with time and state in distributed systems.

Logical clocks, vector clocks, hybrid logical clocks are fun to learn. (Paul's book includes a good explanation of logical and vector clocks and snapshots.) They form the basis on which CRDTs, version vectors in NoSQL databases, and snapshot reads and commits in distributed SQL databases build upon.

Now what?

Now start reading about and hacking on cloud computing frameworks, NoSQL databases, and stream processing platforms. Martin Kleppmann's Designing Data Intensive Applications book is helpful for some of these topics.

Other than that, you are again on your own for gathering information from good blog posts and relevant research papers. Going to the original research paper, and learning from first principles are invaluable. This post is about where to start. I hope I can write another post to give a list of important papers for each topic I touched on this post.

These being said, I think we are over due for a book on modern distributed systems, that distills all recent developments and presents them in one place. Distributed systems concepts are tricky, and a bit help from an expert teacher can save a lot of frustration and many hours for each concept/topic.

Comments

mfricke1947 said…
Thank you for your blog posts on distributed systems (I'm working my way through some of them).

I'm hesitant to push my own barrow, but https://softoption.us might help with the Predicate Logic. It does not have the meta-theorems, but, in compensation, it does include some modal logic and lambda calculus.
Would you recommend any open-source project which one can study somewhat easily ?So, for example, Flink's Chandy-Lamport implementation could help. Once we read the paper it is very tricky to write code(simple) directly even to check
using TLA+. Don't you think so?

Popular posts from this blog

Hints for Distributed Systems Design

Making database systems usable

Looming Liability Machines (LLMs)

Advice to the young

Foundational distributed systems papers

Linearizability: A Correctness Condition for Concurrent Objects

Understanding the Performance Implications of Storage-Disaggregated Databases

Designing Data Intensive Applications (DDIA) Book

Use of Time in Distributed Databases (part 2): Use of logical clocks in databases