Monday, October 8, 2018

Everything is broken

Last Wednesday, I attended one of the monthly meetings of the "Everything is Broken" meetup in Seattle. It turns out I selected a great meeting to attend, because both speakers, Charity Majors and Tammy Butow, were excellent.

Here are some select quotes without context.

Observability-driven development - Charity Majors


Chaos engineering is testing code in production. "What if I told you: you could test both in and before production."

Deploying code is not a binary switch; deploying code is a process of increasing your confidence in your code.

"Microservices are hard!" as a caption for a figure comparing the LAMP stack 2005 versus the complexity of the Parse stack 2015.

We are all distributed systems engineers, and the unknowns outnumber the knowns!
Distributed systems have an infinite number of almost-impossible failures!

Without observability you don't have chaos engineering, you just have chaos.

Monitoring systems have not changed significantly in the 20 years since Nagios. Complexity is exploding everywhere, but our tools are designed for a predictable world.

Observability for software engineers: can you understand what is happening inside your systems, just by asking questions from the outside? Can you debug your code and its behavior using its output?

For the LAMP stack, monitoring was sufficient for identifying problems.
For microservices, it is unclear what we are supposed to monitor for. We need observability!
The hard part is not debugging your code, but finding which part to debug!

Facebook's Scuba was ugly, but it helped us slice and dice and improve our debugging! It improved things a lot. I understand Scuba was hacked together to deal with MySQL problems.

You don't know what you don't know, so dashboards are of very limited utility. Dashboards are only for anticipated cases: every dashboard is an artifact of past failures. There are too many dashboards, and they are too slow.

Aggregates are the kiss of death; important details get lost.

Black swans are the norm; you must care about 99.9%, epsilons, corner cases.

Watch things run in production in the normal case; get used to observing your systems when they aren't on fire.

Building Resilient Systems Using Chaos Engineering - Tammy Butow

Chaos engineering is "thoughtful planned experiments designed to show weak points in the system".

Top 5 popular ways to use chaos engineering now: Kubernetes, Kafka, AWS ECS, Cassandra, Elasticsearch.

Fullstack chaos engineering: inject faults at the API, app, cache, database, OS, host, network, and power layers.

We are exploring a new direction and collaborating with the UI engineers on ways to hide the impact of faults.

Prerequisites for chaos engineering:
1. monitoring & observability
2. on-call & incident management
3. know the cost of your downtime per hour (British Airways' 1-day outage cost $150 million)

How to choose a chaos experiment?
+ identify top 5 critical systems
+ choose 1 system
+ whiteboard the system
+ select attack: resource/state/network
+ determine scope

How to run your own gameday: http://gremlin.com/gameday

Outage post-mortems: https://github.com/danluu/post-mortems

First chaos engineering conference this year: http://twitter.com/chaosconf

Some notes about the venue: Snap Inc

There were fancy appetizers, very fancy. They had a kitchen there on the fifth floor (and on every floor?). Do they provide free lunch to Snap employees?

On the 5th floor, where the meeting took place, we had a great view of Puget Sound. The Snap building is just behind Pike Place Market. There were about 80-100 people. I think the 30+ folks outnumbered the 40+ folks, but not severely. There was a good showing from female engineers. There was ambient music in the beginning, from 6-6:30pm, but it was loud.

By the way, I have never used Snapchat... I am old. But I don't have a Facebook account, so maybe I am not that old.

MAD questions

1. Do you need to test in production? 
The act of sabotaging parts of your system/availability may sound crazy to some people. But it puts a very firm commitment in place: you should be ready for these faults, as they will happen one of these Thursdays. It establishes a discipline of testing, gets you to write the instrumentation for observability, and toughens you up. It puts you into a useful paranoid mindset: the enemy is always at the gates and never sleeps, so I should be ready to face attacks. (Hmm, here is an army analogy: should you train with live ammunition? That is still controversial because of the lives on the line.)

Why not wait until faults occur in production by themselves? They will happen anyway. But when you do chaos testing, you control the inputs/failures, so you already know the root cause. And this gives you a much better opportunity to observe the percolation effects.

2. Analogies for chaos engineering
I have heard vaccination used as an analogy. It is a tactful analogy (much better than the live-ammunition analogy). Nobody can argue against the usefulness of vaccinations.

Other things chaos testing evokes are bloodletting and antifragility. I had read somewhere that athletes in ancient Greece would induce diarrhea on purpose a couple of weeks before competitions, so that their bodies could recover and be much stronger at competition time. I guess the reasoning goes as "too much monotony is a bad thing", and it is beneficial to stress/shake the system to avoid a local maximum. That reminds me of this YouTube video I show in my distributed systems class on the topic of resilience.

3. Debugging designs with TLA+
Even after you have a verified design, the implementation can still introduce errors, so using chaos engineering tools is valuable and important even then.

Chaos testing helps even for "verified" systems, because their unverified parts can still harbor bugs:
Folks encouraged us to try testing verified file systems; we were skeptical we would find anything, but to our surprise, when we tested MIT’s FSCQ file system, we found it did not persist data on fdatasync()! Apparently they had a bug in the un-verified portion of their code (Haskell-C bindings), which was caught by Crashmonkey! This shows that even verified file systems have un-verified components which are often complex, and which will have bugs.

4. Chaos tag
It turns out I have several posts mentioning chaos engineering, so I am creating a chaos tag for use in future posts.

Sunday, October 7, 2018

Debugging designs with TLA+

This post talks about why you should model your systems and exhaustively test these models/designs with the TLA+ framework. In the first part, I will discuss why modeling your designs is important and beneficial, and in the second part I will explain why TLA+ is a very suitable framework for modeling, especially for distributed and concurrent systems.

Modeling is important

If you have worked on a large software system, you know that such systems are prone to corner cases, failed assumptions, race conditions, and cascading faults.

There are many corner cases because there are many parameters, and these interfere with each other in unanticipated ways. The corner cases violate your seemingly reasonable implicit assumptions about the system components and environment, e.g., "1-hop is faster than 2-hops", "0-hop is faster than 1-hop", and "processes work at the same rate". There are abundant race conditions because today (with the rise of SOA, cloud, and microservices) all systems are distributed systems. Code that is supposedly an "atomic block of execution" fails due to other processes executing concurrently. Finally, faults happen, and their effects are almost always underestimated pre-deployment. Faults take your system to unanticipated states, and from there, with recovery actions interleaving with normal system actions, the system may be thrown into even more unanticipated states.

In large software systems, which are inevitably distributed systems, there are many unknown-unknowns and an infinite number of highly-improbable ways things can go wrong. Human reasoning cannot scale to handle all these possibilities. To alleviate these problems, the industry developed tools for better observability, and even testing in production, to improve availability. These tools are very important and indispensable. But by the time you figure out some inherent problems with your design, it may be too hard and expensive to fix things. What you thought would be the last 10% of the project ends up taking 90% of your time in production and operations.

If you model your designs first and exhaustively test and debug these models for correctness against corner cases, failed assumptions, concurrency, and failures, you can catch errors at the design time and fix them before they develop into problems and become costly to fix.

  • Modeling first does not extend your development time; on the contrary, it saves you time by reducing futile development attempts. Embarking on development with a flawed design almost always ensures that the implementation is flawed. While having a precise and correct model at hand does not guarantee that your implementation of the model is correct, it helps you avoid the big/intricate problems and also provides a good reference for testing your implementation against.
  • Constructing a precise model of your system gives you clarity of thinking and supports your development immensely. By modeling, you discover the inherent complexities of the problem; this helps you focus your attention on them and ignore accidental/byproduct complexities.
  • The model also helps you communicate precisely with your team and others, as you avoid the ambiguity of natural language and the hand-waving and generalizations involved.
  • Finally, with the model at hand, you also have a chance to introduce design decisions gradually and to see alternative ways to implement the design.


TLA+ is great for modeling

TLA+ is a formal language for describing and reasoning about distributed and concurrent systems. It was developed by Dr. Leslie Lamport, winner of the 2013 Turing Award. Lamport is a very important figure in distributed systems due to his logical clocks and Paxos work, among many other contributions. For the last decade, he has been very involved in improving the TLA+ framework to help make distributed systems more manageable.

TLA+ uses basic math to model and reason about algorithms: predicate logic, set theory, and temporal logic are used for specifying systems. Best of all, the framework integrates a model checker that exhaustively tests your models in the face of corner cases, failed assumptions, concurrency, and failures. The model checker tries all possible executions of your model and tells you for which executions your invariants and system guarantees break.

Invariant-based reasoning
The TLA+ framework promotes invariant-based reasoning to prevent the problems that arise from operational reasoning. In operational reasoning, you start with a "happy path", and then you try to figure out "what can go wrong?" and how to prevent it. Of course, you always fall short in that enumeration of problem scenarios and overlook corner cases, race conditions, and cascading failures. In contrast, invariant-based reasoning focuses on "what needs to go right?" and how to ensure these properties as invariants of your system at all times. Invariant-based reasoning takes a principled state-based, rather than operation/execution-based, view of your system.

To attain invariant-based reasoning, we specify safety and liveness properties for our models. Safety properties specify "what the system is allowed to do". For example, at all times, all committed data is present and correct. Liveness properties specify "what the system should eventually do". For example, whenever the system receives a request, it must eventually respond to that request. In other words, safety properties are concerned with "nothing bad happens", and liveness properties with "something good eventually happens".
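
To make this concrete, here is a rough sketch of how such properties might be written in TLA+. This is my own illustration, not taken from any particular system; the variables committed, store, received, and responded, and the constant set Requests, are hypothetical.

  \* Safety: at all times, every committed record is present in the store.
  \* (Checked as an invariant, i.e., []Durability must hold in every reachable state.)
  Durability == \A r \in committed : r \in store

  \* Liveness: every request that is received is eventually responded to.
  EventualResponse == \A req \in Requests : (req \in received) ~> (req \in responded)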

Modeling with TLA+
The TLA+ framework supports you in building a model and figuring out its invariant properties in two major ways. Firstly, the math-based formal language helps you achieve precision while still working with high-level declarative statements. Secondly, the integrated model checker exhaustively debugs your model in the face of concurrency and failures, and produces counterexamples for which your candidate invariants fail. (After years of working with TLA+, I am still surprised by the counterexamples the model checker spits out for my models: it is very easy to overlook some scenarios, but the model checker sets you straight.) You address these problems by improving your model, or sometimes by relaxing your candidate invariants, and after many iterations you converge to an exhaustively debugged model that guarantees the invariants.

Building a TLA+ model is beneficial even for systems that are already implemented and running. Through building the model, you get to know your system better, figure out some latent failure modes, and correct them before they occur in production.

Finally, maintaining a TLA+ model of your system provides important benefits for continuous development. While software systems need to be extended with new features frequently, these extensions may interfere with the system in unanticipated ways and lead to downtime. With the TLA+ model at hand, you can first add these features to your model and catch/debug the problems at the design level using the model checker. This way you resolve potential issues before they even become problems.

TLA+ is practical
Since using TLA+ actually saves time when building large software systems, TLA+ modeling has been adopted as a practice by many software companies.

I am on sabbatical at Cosmos DB, Microsoft's globally distributed cloud-native database. The team has been using TLA+ to model its replication and global distribution protocols and to exhaustively test the designs for correctness against failures. We have recently published the customer-facing part of the model, which precisely defines the 5 consistency levels offered by Cosmos DB.

Amazon has also used TLA+ modeling for some of their AWS offerings and has written a nice experience report on this. There are also reports of using TLA+ for modeling hardware systems.

For the last 4 years, I have been incorporating TLA+ in my distributed systems classes. TLA+ enables students to learn about concurrency and invariant-based reasoning, and it provides them with hands-on experience with distributed protocols. I also use TLA+ extensively in my research on new distributed algorithms.

In my experience, it is possible to pick up TLA+ in a couple of weeks. This is firstly because TLA+ adopts a very simple state-machine approach to modeling systems. A system consists of: (1) a set of variables, which define the state of the system, and (2) a finite set of actions, which transition the system from one state to another.
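
For instance, here is a toy sketch of such a state machine, written directly in TLA+ (my own illustration, not tied to any real system):

  ---- MODULE Counter ----
  EXTENDS Naturals

  VARIABLE x                    \* the entire state of the system is this one variable

  Init == x = 0                 \* the initial state

  Incr  == x < 3 /\ x' = x + 1  \* one action: increment while below the bound
  Reset == x > 2 /\ x' = 0      \* another action: reset once x exceeds 2

  Next == Incr \/ Reset         \* at each step, the system takes one of these actions

  Spec == Init /\ [][Next]_x

  BoundInv == x <= 3            \* a candidate invariant for the model checker to verify
  ====

The model checker explores every reachable state of this little machine and reports whether BoundInv can ever be violated.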

Furthermore, PlusCal provides syntactic sugar for TLA+, which has a tendency to grow long (due to its low-level, state-transition-centric syntax) and look cryptic to some people. PlusCal is a pseudocode-like language for writing algorithms at a higher level of abstraction, and it is translated to an underlying TLA+ specification for model checking. To give you some idea about PlusCal, here is an example of PlusCal code for a database replica process. While this is straightforward code, you can see the nondeterministic choice construct "either/or" in action. The model checker will exhaustively test all possible combinations of these "either/or" actions and check whether a certain sequence would break one of your safety or liveness specifications.
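
A minimal sketch along those lines (my own, with hypothetical variable names and a crash branch added for illustration; not the original example) might look like this:

  ---- MODULE ReplicaSketch ----
  EXTENDS Sequences

  (* --algorithm ReplicaSketch
  variables db = <<>>,          \* the replica's local log
            msgs = {"a", "b"};  \* updates in flight from the primary

  process Replica = 1
  begin
  R: while TRUE do
       either
         \* receive an update from the primary and append it to the log
         with m \in msgs do
           db := Append(db, m);
           msgs := msgs \ {m};
         end with;
       or
         \* crash and lose the volatile log
         db := <<>>;
       end either;
     end while;
  end process;
  end algorithm; *)
  ====

At the label R, the replica either applies a pending update or crashes and loses its log; the model checker explores every enabled branch at every step.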

To learn more

There is a very active TLA+ forum at Google Groups. Leslie Lamport chimes in on several threads.

My blog includes many examples of TLA+/PlusCal modeling of distributed algorithms/systems.

LearnTLA provides a user-friendly introduction to TLA+/PlusCal.

Lamport's site includes TLA+/PlusCal resources (videos/books/examples) and links to download the toolkit.