Why you should use modeling [with TLA+/PlusCal]

I recently gave a two day seminar on "debugging your designs with TLA+/PlusCal" at Dell. So I wanted to write some of the motivation for modeling and debugging your models while this is still fresh in my mind.

You need modeling

No, not that kind of modeling! Actually the naming clash is not accidental after all: fashion designers need models to test/showcase their designs.

You need modeling because:

  • Failing to plan is planning to fail 
  • Everything is a distributed system
  • The corner cases ... they are so many
  • Do it for the development process
  • Being smart does not scale

Failing to plan is planning to fail

This is from the paper, "Use of formal methods at Amazon Web Services, 2014".
"Before launching any complex service, we need to reach extremely high confidence that the core of the system is correct. We have found that the standard verification techniques in industry (deep design reviews, code reviews, static code analysis, stress testing, fault-injection testing, etc.) are necessary but not sufficient.
Human fallibility means that some of the more subtle, dangerous bugs turn out to be errors in design; the code faithfully implements the intended design, but the design fails to correctly handle a particular ‘rare’ scenario. We have found that testing the code is inadequate as a method to find subtle errors in design."

Modeling shows you how sloppy your "design" is. You think you got the design right, but for a complex service worth its salt you almost always get it wrong (more on this below). You won't find what you got wrong unless you model your design and validate it. And you want to find that out early on, without the sunken investment of correctly implementing your flawed design. Otherwise, even after Jepsen shows you that you screwed up, you are already too much invested into this flawed design, and you try to patch it, and you end up with a slow and bloated system.

Everything is a distributed system

There's just no getting around it: You're building a distributed system.

In this process, you are very likely to make an assumption that will bite you back, such as one hop is faster than two hops, zero hops is faster than one hop, and the network is reliable. What you assumed was an atomic block of execution will be violated because another process will execute concurrently and change the system state in a way you didn't anticipate. And don't even get me started on faults, they are in a league of their own, they will collude with your program actions to screw you up.

The corner cases, they are so many

In the 2004, "Consensus on Transaction Commit" paper, Lamport and Gray mentioned that they could not find a correct three-phase commit protocol in database textbooks/papers because each one fails to account for a corner case.
"Three-Phase Commit protocols ... have been proposed, and a few have been implemented [3, 4, 19]. They have usually attempted to “fix” the Two-Phase Commit protocol by choosing another TM if the first TM fails. However, we know of none that provides a complete algorithm proven to satisfy a clearly stated correctness condition. For example, the discussion of non-blocking commit in the classic text of Bernstein, Hadzilacos, and Goodman [3] fails to explain what a process should do if it receives messages from two different processes, both claiming to be the current TM. Guaranteeing that this situation cannot arise is a problem that is as difficult as implementing a transaction commit protocol."

Do it for the development process

Modeling is good for achieving clarity of thinking and communication. Lamport used TLA+ without a model checker from 1990s to 2010. Even without the model checker, he still found value in modeling. It made him nail down the specifications and communicate them with others precisely. When you write things down precisely, it enables your brain to move on and do more with it. Clarity begets more clarity. Focus begets more focus.

Once you abstract away the clutter, come up with a precise model in Pluscal, and validate it with exhaustive model-checking, you can focus on the essence of the problem, and see alternative ways to implement it. And through this development process where you refine/implement the design, the PlusCal model will help a lot for communicating the design with other engineers, and check which implementations would work for each subsystem.

Being smart does not scale; exhaustive model checking comes to the rescue

After you get the design down in pseudocode (but not in TLA+ or PlusCal), couldn't that work for invariant-based design? Can't you just check each action in your pseudocode to see if it preserves your safety/invariant conditions, and be done with this? There is no need to model with TLA+/PlusCal and model-check, right?

Sigh. Did you read the above carefully? Everything is a distributed system, and there are many corner cases. A sloppy pseudocode is not going to cut it. And don't trust your deduction abilities for proving that each action preserves the safety conditions you identify. That works for simple toy examples, but for complicated examples you need to do a lot of extra mental inferencing/linking of concepts/being creative which is very error-prone.

Consider the hygienic philosophers example I discussed earlier. Your invariant will talk about being in critical section, but the actions talk about ... forks ... per edges ... over a dynamic priority graph. So doing that mental mapping would be very hard. Instead TLA+/PlusCal model checker gets you covered with exhaustive checking on the breadth first traversal of all possible permutations of action scheduling and show you if there is any possible execution (including the fault actions you model) that can violate your invariants.

This is why I so happily adopted TLA+/PlusCal for my distributed systems class.  Even for sophisticated algorithms, I can refer the students to the TLA+/PlusCal model to practice and play with the algorithm, so they can internalize what is going on.


This already got long. So in a later post, I will write more about the modeling/abstracting process, the mathematical/invariant-based thinking, and about some shortcomings of modeling.

Drop me a message if you are interested in having me over for a training/talk!

MAD questions

This section is here because of my New Year's commitment.

1) In addition to protocol/system modeling, workflow modeling, business logic modeling, TLA+/PlusCal has also been used for data modeling. Are there any other uses? If you have interesting use cases, please let me know as I am curious. 

2) Actually, I am aware of another use case for PlusCal model-checking, but it seems to be mostly for toy examples so far. You define actors/operators that can act at any order, and you challenge the model checker and claim that concurrent operation of those actors/operators cannot ever satisfy a condition that you like to happen. And the model checker, being the jerk it is, responds with a trace showing that it is possible, and you adopt this as the solution. The die hard puzzle is an example of this. This approach is useful for scheduling, maybe even cluster job scheduling under concerns and even anticipating some statical failures and still hitting the deadline. But I am not aware of any real-world use of this. Is this used in practice?

3) Is there a bad time for modeling? When should you not model?
Sometimes you may need to go bottom up to figure out the domain/problems first. After you have an idea of the domain, then you can start to model and go top down. I think it would not make sense to be opinionated and making modeling calls, before you are informed about the domain and issues. I think modeling is just thinking made more rigorous, and you should get the ball rolling on the thinking/understanding part a bit first before attempting to model.


Popular posts from this blog

Graviton2 and Graviton3

Foundational distributed systems papers

Learning a technical subject

Learning about distributed systems: where to start?

Strict-serializability, but at what cost, for what purpose?

CockroachDB: The Resilient Geo-Distributed SQL Database

Amazon Aurora: Design Considerations + On Avoiding Distributed Consensus for I/Os, Commits, and Membership Changes

Warp: Lightweight Multi-Key Transactions for Key-Value Stores

Anna: A Key-Value Store For Any Scale

Your attitude determines your success