TLA+ conference 2024

- April 19, 2024

The TLA+ foundation got established last year in April as an independent, non-profit organization under the umbrella of the Linux Foundation. This April 15th, we colocated our annual TLA+ conference with the Linux OSSNA event to reach out to developers attending this event. Being under Linux Foundation, we didn't have to pay for the venue, but we paid $3K for professional recording of the sessions, so we can share the talks with the world at large. The videos will be available within a couple weeks.

Marc Brooker: 15 years of TLA+

Marc Brooker gave the keynote this Monday in TLA+ conference 2024. His keynote was a good mix of inspiration for formal methods and challenging formal methods to do better. Brooker is a VP/distinguished engineer at AWS. He has been on-call for 15 years, and were involved with many popular AWS services, including Aurora, Lambda, EC2, EBS, API gateway, iot, bedrock. He defines himself as mostly a practitioner, building large scale distributed systems. Brooker's blog is a must-subscribe. His most recent blog post is a brief summary of this keynote talk, but do read on for my notes from his talk.

Brooker started building circuits on breadboard in high-school, and started noticing the patterns in circuits, and got fascinated by realizing that with math we can quantify the output behavior of systems at design time, understand safety margins, and understand system interactions. He found this so exciting, that after studying ECE, he started PhD on building distributed radar system. He was again delighted to see that using math they could predict the output of the system they are designing and building.

Then we got into computer systems, and found that math was not there in the practice. A prevalent attitude in developing and operating computer systems was "it's just vibes, man!", and this was super disappointing to him.

To inject some rigor in to designs, he thought of modeling via distributed state machines, and asked his smarter colleagues, who recommended Alloy, Promela+Spin, and TLA+. He bought Leslie Lamport's book, and read all examples, and coded them, which completely changed his approach to thinking about these systems. They have explained the process and their experiences with TLA+ modeling in their CACM 2015 article "how amazon web services uses formal methods". The paper is 6 pages long, and Broker says one of the most important part of the paper is Table 1. (I had gotten to see those specs when I was at AWS, and many more specs followed these since 2015.)

He was so transformed by these experiences that he is convinced that formal methods is one of the best engineering best practices. He says: If you’re a software engineer, especially one working on large-scale systems, distributed systems, or critical low-level system, and are not using formal methods as part of your approach, you’re probably wasting time and money. Because, ultimately, engineering is an exercise in optimizing for time and money.

"It would be well if engineering were less generally thought of, and even defined, as the art of constructing. In a certain important sense it is rather the art of not constructing; or, to define it rudely but not inaptly, it is the art of doing that well with one dollar, which any bungler can do with two after a fashion." Arthur Wellington

TLA+, rather than slowing things, accelerate engineering. Finding bugs earlier allows us to move faster. Formal methods also help us to optimize designs and protocols without risk, and crisply communicate exact reasoning.

In sum, Brooker believes formal methods are just good engineering practice! But this comes completely disconnected from many people's perception as we failed to communicate the value of this. How do we communicate this to them? Brooker says that's why he is excited about this field.

The present

Since the paper's publication TLA+ had been used for database services (Aurora, DynamoDB), compute services (Lambda), storage (EBS control plane physalia) and many other places

What caught Brooker by surprise was that: TLA+ role as an extremely precise format for protocol documentation was perhaps the most useful.

In the present, AWS has a broad automated reasoning practice:

cedar (new lang for expressive fast safe and analyzable authorization designed to be analyzable from the ground up)
message chains for distributed system verification
zelkova: a billion SMT queries a day (authorization at S3)
using lightweight formal methods to validate kv store node in S3 (applied wide range of formal methods)

Personal

Brooker says he continues to be excited about TLA+, and he uses P more and more often, and is learning Lean as well.

Future

Prediction 1: systems will continue to become more complex. (And soon their behavior will exceed human ability to reason about them.)

Prediction 2: systems will continue to become more critical. We saw this in our lives, and this snuck up on us in the last 30 years.

Prediction 3: cost, efficiency, sustainability, and productivity will become more important.

Prediction 4: AI will be a big deal.

Discussion

A quantitative understanding of system behavior is also sorely needed:

failure probabilities
latency and throughput distributions
system performance under different workloads
automatic optimization of system designs

It would be great to have faster exploration of the design space in an automated way. For example, there has been decades of debate on 2f+1 (Paxos) vs f+1 (chain) replication designs. Still there is no clear winner, because things are failure model and workload dependent. But given workload properties, matching design to workload should be trivial. So can we use automated tools to get an answer to this? A similar example is OCC vs locking concurrency control.

How do we reach working engineers (and working eng executives)?

Should tools get narrower and more specific? Is packaging and ui holding us back? Are we talking to decision makers about moving faster and with less risk?

Code vs model gap: people have been dismissive about this as this didn't turn out to be as bad as it sounds, but do we get any improvement on this. (Markus Kuppe's talk in the afternoon reported great progress on this problem.)

Brooker said model checking vs proof gap is not personally/practically much important to him, and that he remains to be extremely excited about TLA+.

Here are Brooker's slides from the keynote.

The rest of the day

We had four more industry talks in the morning. This was a big change/improvement over the previous years. We see accelarated adoption of TLA+ in the industry. Thousands seeds are sprouting in different companies, and that is really nice to see. We had great talks.

TLA+ @ LinkedIn: Ambry and Venice, Ankur Agrawal & Zac Policzer, LinkedIn
Practical Insights from Datadog’s Use of TLA+ and Simulations, Arun Parthiban, Datadog
How we designed and model-checked MongoDB reconfiguration protocol, Siyuan Zhou, MongoDB (slides to come, but I blogged about the models here)
Reverse-Engineering with TLA+, Calvin Loncaric, Oracle

The companies (industrial use) have all the same basic story. They work on a problem/feature, which seem simple first, and gets captured initially by a couple paragraphs of description. But due to concurrency and nondeterminism, the cornercases keep popping up as they delve (not a chatbot I just like the word) in to the problem, and the document becomes a 40 page document many weeks later. When they are frustrated, out of desperation they pick up TLA+ modeling, which gives them assurance, and also simplifies the design. The good news is that you don't need advanced language features to be productive with TLA+ modeling. The modeling is the important thing, and you get very good return on investment on it. They are very happy about using TLA+, but as the implementation marches on they move on to other things, to pick up TLA+ another fateful Spring when the cycle repeats again for the next big feature/protocol/design they need to deliver. As I keep joking to Markus, the TLA+ stickers should say []<> rather than <>[], as the common use pattern seems to be always eventually using TLA+.

In the afternoon we had a session on advancements on TLA+ core concept and another on TLA+ ecosystem.

Specification-Driven Development With TLA+, Guo Hua, Scupt & East China Normal University
Validating System Executions with the TLA+ Tools, Markus A. Kuppe, Microsoft
Promises and Challenges in Bridging TLA+ Designs with Implementations, Finn Hackett, University of British Columbia
You Deserve Unicode TLA+ and Other Nice Things Too, Andrew Helwer, Disjunctive Consulting LLC
Formal Methods in the Enterprise, David McNeil
Tackling State Space Explosion in TLA+ Visualizations, Daniel Stachnik Hasso, Plattner Institute University of Potsdam
Towards Better Interactive Formal Specifications, William Schultz Northeastern University

Do check on the interactive TLA-web tool. I am excited about this, and many people in the audience were delighted watching the demo of this tool.

Overall we had a really fun day. We topped this off with the TLA+ dinner, which was really friendly and colloquial. We walked to the waterfront, below the Pike Place, and dined at the crabpot sea food restaurant. We had an informal dinner, for a formal methods conference.

What is next for TLA+? We should get our flywheel going. We should get more companies using it, which in turn would be creating demand for support/features, and the budding TLA+ ecosystem should be delivering them with funding. TLA+ ecosystem will stay strong and thrive only if we get more users with needs.