Going Beyond an Incident Report with TLA+

This is a short article in USENIX ;login: about using TLA+ to explain the root cause of an Azure incident. It seems to be part of a longer arXiv paper on using TLA+ for exploring/explaining data consistency in Cosmos DB.

Markus Kuppe is an author here. Markus has been leading the development of the TLA+ model checker, TLC, for more than a decade. If you've used the TLA+ model checker, you've used Markus's work. If you are an intermediate/advanced user, it is very likely that you have interacted with Markus and gotten help from him at some point.

The incident

The Microsoft Azure incident was a 28-day outage. It looks like the incident went undetected/unreported for 26 days because it was a partial outage. "A majority of requests did not fail -- rather, a specific type of request was disproportionately affected, such that global error rates did not reveal the outage despite a specific group of users being impacted."


For each request serviced by the work dispatcher, it would
  • (1) write some metadata to Cosmos DB, and then it would
  • (2) enqueue a request to do work onto the Azure Service Bus along with the ID of the metadata it just wrote.
At the receiving end of the service bus, some worker service would
  • (3) dequeue the work request including the ID of relevant metadata
  • (4) look up that metadata in Cosmos DB and fail to find it.
Item (4) is the only anomalous step: the intended scenario is that the worker successfully reads the metadata it needs and gets to work.
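
To keep the rest of the discussion concrete, here is a first, deliberately naive TLA+ sketch of the flow above. This is my own illustration with made-up names (e.g., "taskKey"), not the authors' model; it reduces Cosmos DB to an atomic set of written IDs and the Service Bus to a FIFO queue with one outstanding request at a time.

  ------------------------- MODULE NaiveIncidentFlow -------------------------
  (* My own over-simplified sketch of the dispatcher/worker flow, not the    *)
  (* authors' model: Cosmos DB is an atomic set of written IDs, the Service  *)
  (* Bus a FIFO queue.                                                        *)
  EXTENDS Sequences

  VARIABLES metadata,   \* IDs written to Cosmos DB
            queue,      \* the Service Bus
            lookupOK    \* becomes FALSE if the worker's lookup ever fails

  vars == <<metadata, queue, lookupOK>>

  Init ==
    /\ metadata = {}
    /\ queue    = <<>>
    /\ lookupOK = TRUE

  \* Steps (1) and (2): write the metadata, then enqueue the work item with its ID.
  Dispatch ==
    /\ queue = <<>>
    /\ metadata' = metadata \cup {"taskKey"}
    /\ queue'    = Append(queue, "taskKey")
    /\ UNCHANGED lookupOK

  \* Steps (3) and (4): dequeue the ID and look it up in the store.
  Work ==
    /\ queue # <<>>
    /\ lookupOK' = (Head(queue) \in metadata)
    /\ queue'    = Tail(queue)
    /\ UNCHANGED metadata

  Next == Dispatch \/ Work

  Spec == Init /\ [][Next]_vars

  \* Holds here, because the write and the read target the same atomic set.
  LookupSucceeds == lookupOK
  =============================================================================

TLC verifies LookupSucceeds for this sketch: with the store collapsed into one atomic set, the worker's read can never miss the dispatcher's write. That is exactly why such an over-simplified model cannot reproduce the incident, and why the modeling discussion below is about how much detail Cosmos DB needs.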

While this does scream a "data consistency problem", for some reason the ops team treated this differently: "The original incident report identified the problem scenario above and included measurements indicating that a reduction in latency between (1) and (3) was the main contributing factor leading to the outage. The resolution, therefore, was to revert a set of performance improvements that reduced that latency."

Modeling of the problem

Markus and coauthors already had a TLA+ model of Cosmos DB. (Markus commented below to give credit to the other author, Finn Hackett, for the modeling.) Back in 2018, I was involved in building a simple TLA+ model of Cosmos DB, and Markus and Finn's recent work provides a much improved and more sophisticated model.

I like reading Markus's models, because every time they give me an opportunity to learn new techniques, tricks, and best practices for modeling with TLA+. I like modeling with PlusCal, and Markus models at the TLA+ level, but that is quite all right. Here is their model of the incident, including a README that explains the breakdown of the files in the directory. Below is a short explanation of the modeling.

We modeled one dispatcher, one worker, a message queue, and an instance of Cosmos DB. ... We don’t need to model what the work is, how the expected network communication happens, or any particular details of where the dispatcher gets its instructions. If the worker can gather the necessary metadata, then the issue has not occurred, and further events are not of interest. ... Without further evidence of [the message queue] doing anything controversial, we chose to model it as simply as we could.

Cosmos DB, on the other hand, was the service whose behavior was directly involved in the problem outputs. That is not to say that it had a bug -- it was in fact behaving as designed. *This issue is an example of a cross-system interaction, and the underlying issue could be categorized similarly to a grey failure.* The fact that it was behaving in a way that was not fully explained meant that we needed to model its behavior in more detail, because an over-simplified representation of its expected behavior would simply not exhibit the issue. On the other hand, an overly detailed model would just complicate the incident model without increasing its usefulness.

The culprit and the fix

It turns out that the missing session token is the culprit!

Of interest to our model is “session consistency”, which is a recommended default setting. This setting operates via tokens, which are opaque identifiers that indicate a “session”. This was the mode used in the incident scenario. Clients may share tokens and connect to Cosmos DB while identifying themselves using a shared token. *If a client connects without a token, it will be issued an arbitrary one.* Session consistency has two possible behaviors: if two clients identify themselves with the same token, then they will see each other’s written data as if operating under strong consistency. If two clients do not have the same token, then no guarantees are offered and they will see each other’s behavior as if operating under eventual consistency. In the incident in question, this detail was fundamental to the bug observed: due to incorrect token handling, the system saw eventual consistency guarantees rather than the expected strong consistency-like guarantees.
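
To make the quoted rule concrete, here is a tiny sketch of one way to render it in TLA+, reading a session token as a position in the global write log. This is my own illustration, not the authors' Cosmos DB specification (their model is linked above); the operator name ReadPoints is made up.

  --------------------------- MODULE SessionReadRule ---------------------------
  (* My own rendering of the quoted rule, not the authors' specification: a   *)
  (* session token is read as a position in the global write log.             *)
  EXTENDS Naturals, Sequences

  \* The log prefixes a server may answer a read from, given the reader's
  \* token t: at least the first t writes must be visible.
  ReadPoints(log, t) == t..Len(log)

  \* With a fresh token, the read is pinned to everything the session has seen,
  \* which behaves like strong consistency between the two clients; with no
  \* token (t = 0), any prefix is allowed, even the empty one (eventual).
  ASSUME LET log == <<"w1", "w2">>
         IN /\ ReadPoints(log, Len(log)) = {2}
            /\ ReadPoints(log, 0)        = {0, 1, 2}
  ===============================================================================

The sketch after the next quote plugs this read rule into the dispatcher/worker flow.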

The counter-example [from TLA+ model checking] makes clear a logical problem with how the worker and dispatcher behave. Speaking intuitively, because Cosmos DB is operating in session consistency mode, if they do not share a session token then the clients are opting out of any consistency between reads and writes. It is entirely possible that the worker accesses a Cosmos DB server that has not yet received the value written by the dispatcher, and without a session token that server will just respond with the information it has on hand: "taskKey" does not refer to a value. With an up to date session token, the server would see evidence encoded into the token itself that its state is stale, and it would have to wait before responding.
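
Here is a sketch that combines the read rule above with the dispatcher/worker flow, so TLC can exhibit the scenario mechanically. Again, this is my own illustration, not the authors' model: the names and the hypothetical ForwardToken constant are made up. With ForwardToken = FALSE (no session token in the Service Bus message), TLC reports a violation of LookupSucceeds via exactly the stale read described in the quote. With ForwardToken = TRUE, i.e. the dispatcher's session token travels along with the message, which is the kind of fix the quoted explanation points at, the invariant holds.

  --------------------------- MODULE IncidentSketch ---------------------------
  (* My own sketch of the incident scenario, built from the quoted            *)
  (* explanation, not the authors' model: Cosmos DB is a global write log,    *)
  (* and a session token is a log position.                                   *)
  EXTENDS Naturals, Sequences

  CONSTANT ForwardToken   \* TRUE = the dispatcher's token rides in the message
  ASSUME ForwardToken \in BOOLEAN

  VARIABLES log,     \* Cosmos DB, abstracted to the sequence of written keys
            queue,   \* Service Bus messages: a key plus a (possibly zero) token
            found    \* outcome of the worker's lookup, "pending" before it runs

  vars == <<log, queue, found>>

  Init ==
    /\ log   = <<>>
    /\ queue = <<>>
    /\ found = "pending"

  \* Steps (1) and (2): the write returns the new log position as its session
  \* token; the token is put into the message only if ForwardToken is TRUE.
  Dispatch ==
    /\ log = <<>>
    /\ log'   = Append(log, "taskKey")
    /\ queue' = Append(queue, [key   |-> "taskKey",
                               token |-> IF ForwardToken THEN Len(log) + 1 ELSE 0])
    /\ UNCHANGED found

  \* Steps (3) and (4): a read holding token t is served from some log prefix
  \* of length >= t; with t = 0 the server may answer from any prefix, even <<>>.
  Work ==
    /\ queue # <<>>
    /\ LET msg == Head(queue) IN
         \E n \in msg.token..Len(log) :
           /\ found' = IF (\E i \in 1..n : log[i] = msg.key) THEN "yes" ELSE "no"
           /\ queue' = Tail(queue)
           /\ UNCHANGED log

  Terminating ==   \* allow stuttering once the workflow is over
    /\ found # "pending"
    /\ UNCHANGED vars

  Next == Dispatch \/ Work \/ Terminating

  Spec == Init /\ [][Next]_vars

  \* Violated when ForwardToken = FALSE: the worker may fail to find the metadata.
  LookupSucceeds == found # "no"
  ==============================================================================

A TLC configuration for this sketch just points at Spec, assigns ForwardToken, and lists LookupSucceeds as the invariant to check; flipping the constant moves between the broken and the fixed protocol.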

Looking at this explanation, it also becomes clear why the issue could not be found using conventional testing, nor could it easily be reproduced at the implementation level. *The issue was of similar complexity to a grey failure, being rare and load-dependent, meaning that any implementation-level reproduction would have to reliably recreate those complex and difficult to control factors.* That is why incident reports are usually prose and figures, because without an alternative strategy like modeling, it is simply impractical to do anything else. With the addition of high-level modeling however, creating a demo of this kind of rare issue becomes possible.


Using TLA+ as a tool to support incident investigation and correction of error (COE)

Following the observation in the last paragraph, Markus and coauthors make the following recommendation: integrate TLA+ modeling into the COE process whenever appropriate.

We propose that modeling techniques, which are normally used during system design to predict and avoid incidents, are just as valuable after the fact. Modeling allows site reliability engineers to go beyond a prose-and-figures incident report and produce precise, interactive and analyzable models of incidents.

Prose can be ambiguous. To make things more precise, once the facts are established, engineers should also produce a machine-analyzable model representing the systems involved and the configuration that led to the incident scenario.

Tools do exist that help developers reason about unusual behaviors their code might exhibit, but they have to stochastically sample a subset of possible implementation-level events rather than produce a comprehensive summary. As a result, a light-weight machine-analyzable representation of all of a system’s allowable behaviors is an important asset when reasoning about production incidents.

Benefits of the modeling process

I agree with the recommendation. TLA+ models could be a useful tool to employ in COEs. This is yet another benefit to add to my old "Why you should use modeling [with TLA+/PlusCal]" post, if I get to update it again.

I took a look at the larger arXiv paper. This paragraph caught my attention.

Our modeling process was based on iterative discussion with author 2, a principal engineer working on the Cosmos DB implementation. We followed existing user-facing documentation, asked for feedback, learned more about the realities of Cosmos DB’s design, and incorporated that new knowledge into our specification. *We repeated this feedback loop until we found no more corrections, when our model began to predict counter-intuitive but possible behaviors of the real system.*

As a 25+ year distributed systems veteran, I know exactly what they mean in this last sentence. It reminds me of the Douglas Adams quote about the Infinite Improbability Drive. LOL!
We have normality. I repeat, we have normality. Anything you still can't cope with is therefore your own problem.  --Douglas Adams


The abstract expands on this as follows:
By focusing a formal specification effort on precisely defining the expected user-facing behaviors of the Azure Cosmos DB service at Microsoft, we were able to write a formal specification of the database that was significantly smaller and conceptually simpler than any other specification of Cosmos DB, while representing a wider range of valid user-observable behaviors than existing more detailed specifications. Many of the additional behaviors we documented were previously poorly understood outside of the Cosmos DB development team, even informally, leading to data consistency errors in Microsoft products that depend on it. Using this model, we were able to raise two key issues in Cosmos DB’s public-facing documentation, which have since been addressed. We were also able to offer a fundamental solution to a previous high-impact outage within another Azure service that depends on Cosmos DB.


The full arXiv paper is definitely worth a read. I intend to do that, but there are two tall stacks of papers looking back at me on my desk. Just knowing what the paper is about does not give you even 20% of the benefit of reading the paper fully, grappling with it, and understanding it to a satisfactory degree. And I am a *slow* deep reader.
