Everything is broken

Last Wednesday, I attended one of the monthly meetings of the "Everything is Broken" meet up at Seattle. It turns out I selected a great meeting to attend, because both speakers, Charity Majors and Tammy Butow, were excellent.

Here are some select quotes without context.

Observability-driven development - Charity Majors


Chaos engineering is testing code in production. "What if I told you: you could test both in and before production."

Deploying code is not a binary switch; deploying code is a process of increasing your confidence in your code.

"Microservices are hard!" as a caption for a figure comparing the LAMP stack 2005 versus the complexity of the Parse stack 2015.

We are all distributed systems engineers and unknowns outnumber the knowns!
Distributed systems have an infinite number of almost-impossible failures!

Without observability you don't have chaos engineering, you have a chaos.

Monitoring systems have not changed significantly in 20 years, from Nagios. Complexity is exploding everywhere, but our tools are designed for a predictable world.

Observability for software engineers: can you understand what is happening inside your systems, just by asking questions from the outside? Can you debug your code and its behavior using its output?

For the LAMP stack monitoring was sufficient for identifying the problems.
For microservices, it is unclear what we are supposed to monitor for. We need observability!
The hard part is not debugging your code, but to find which part to debug!

Facebook's  Scuba was ugly, but it helped us slice and dice and improve our debugging! It improved things a lot. I understand Scuba was hacked to deal with MySQL problems.

You don't know what you don't know, so dashboards are very limited utility. Dashboards are only for anticipated cases: every dashboard is an artifact of past failures. There are too many dashboards, and they are too slow.

Aggregates are the kiss of death; important details get lost.

Black swans are the norm; you must care about 99.9%, epsilons, corner cases.

Watch things run in production in the normal case; get used to observing your systems when they aren't on fire.

Building Resilient Systems Using Chaos Engineering - Tammy Butow

Chaos engineering is "thoughtful planned experiments designed to show weak points in the system".

Top 5 popular ways to use chaos engineering now: kubernetes, kafka, aws ecs, cassandra, elasticsearch.

Fullstack chaos engineering: inject faults at api, app, cache, database, os, host, network, power

We are exploring a new direction and collaborating with the UI engineers on ways to hide impact of faults.

prerequisites for chaos engineering:
1. monitoring & observability
2. on-call & incident management
3. know the cost of your downtime per hour (British Airlines's 1 day outage costed $150 millon)

How to choose a chaos experiment?
+ identify top 5 critical systems
+ choose 1 system
+ whiteboard the system
+ select attack: resource/state/network
+ determine scope

How to run your own gameday: http://gremlin.com/gameday

Outage post-mortems: https://github.com/danluu/post-mortems

First chaos engineering conference this year: http://twitter.com/chaosconf

Some notes about the venue: Snap Inc

There were fancy appetizers, very fancy. They had a kitchen there at the fifth floor (and every floor?). Do they provide free lunch to snap employees?

At the 5th floor, where the meeting took place, we had a great view of Puget Sound bay. The Snap building is just behind the Pike Market Place. There were about 80-100 people. I think the 30+ folks outnumbered 40+ folks, but not severely. Good show up from female engineers. There was ambient music in the beginning from 6-6:30pm, but it was loud.

By the way, I never used snapchat... I am old. But I don't have a Facebook account, so maybe I am not that old.

MAD questions

1. Do you need to test in production? 
The act of sabotaging parts of your system/availability may sound crazy to some people. But it puts forth a very firm commitment in place. You should be ready for these faults, as they will happen in one of these Thursdays. It establishes a discipline that you would test, gets you prepared with writing the instrumentation for observability, and toughens you up. It puts you into a useful paranoid mindset: the enemy is always at bay and never sleeps, I should be ready to face attacks. (Hmm, here is an army analogy: should you train with live ammunition? It is still controversial because of the lives on the line.)

Why not wait till faults occur in production by themselves, they will happen anyways. But when you do chaos testing, you have control in the inputs/failures, so you already know the root cause. And this can be give you much better opportunity to observe the percolation effects.

2. Analogies for chaos engineering
I have heard vaccination used as an analogy. It is a tactful analogy (much better than the live firing analogy). Nobody can argue against usefulness of vaccinations.

Other things chaos testing evokes could be blood letting and antifragility. I had read somewhere that the athletes in ancient Greek would induce a diarrhea on purpose a couple weeks before competitions, so that their body can recover and get much stronger at the time of the competition. I guess the reasoning goes as "too much of a monotone is a bad thing" and it is beneficial to stress/shake the system to avoid a local maxima. That reminds me of this YouTube video I show in my distributed systems class on the topic of resilience. 

3. Debugging designs with TLA+
Even after you have a verified design, the implementation can still introduce errors, so using chaos engineering tools is valuable and important even then.

It helps even for "verified" systems for its nonverified parts:
Folks encouraged us to try testing verified file systems; we were skeptical we would find anything, but to our surprise, when we tested MIT’s FSCQ file system, we found it did not persist data on fdatasync()! Apparently they had a bug in the un-verified portion of their code (Haskell-C bindings), which was caught by Crashmonkey! This shows that even verified file systems have un-verified components which are often complex, and which will have bugs.

4. Chaos tag
Turns out I have several posts mentioning chaos engineering, so I am creating a chaos tag to be available for use for future posts.

Comments

A non-analogy for chaos engineering would be understanding that distributed systems have lots of properties typically identified with a Cynefin complex domain:
- causation can only be determined in hindsight
- the same experiment will likely result in different outcomes when repeated (the system changed in the meantime)
- heuristic for living interacting within a Cynefin complex system is probe-sense-respond
- every action changes constraints of the system which changes the available actions
- it is the land of continuous feedback loops and constantly interacting with the environment

Popular posts from this blog

Hints for Distributed Systems Design

Learning about distributed systems: where to start?

Making database systems usable

Looming Liability Machines (LLMs)

Advice to the young

Foundational distributed systems papers

Distributed Transactions at Scale in Amazon DynamoDB

Linearizability: A Correctness Condition for Concurrent Objects

Understanding the Performance Implications of Storage-Disaggregated Databases

Designing Data Intensive Applications (DDIA) Book