Monday, October 8, 2018

Everything is broken

Last Wednesday, I attended one of the monthly meetings of the "Everything is Broken" meetup in Seattle. It turns out I picked a great meeting to attend, because both speakers, Charity Majors and Tammy Butow, were excellent.

Here are some select quotes without context.

Observability-driven development - Charity Majors


Chaos engineering is testing code in production. "What if I told you: you could test both in and before production."

Deploying code is not a binary switch; deploying code is a process of increasing your confidence in your code.

"Microservices are hard!" as a caption for a figure comparing the LAMP stack 2005 versus the complexity of the Parse stack 2015.

We are all distributed systems engineers and unknowns outnumber the knowns!
Distributed systems have an infinite number of almost-impossible failures!

Without observability you don't have chaos engineering, you just have chaos.

Monitoring systems have not changed significantly in the 20 years since Nagios. Complexity is exploding everywhere, but our tools are designed for a predictable world.

Observability for software engineers: can you understand what is happening inside your systems, just by asking questions from the outside? Can you debug your code and its behavior using its output?

For the LAMP stack, monitoring was sufficient for identifying problems.
For microservices, it is unclear what we are supposed to monitor for. We need observability!
The hard part is not debugging your code, but finding which part to debug!

Facebook's Scuba was ugly, but it helped us slice and dice and improved our debugging a lot. I understand Scuba was hacked together to deal with MySQL problems.

You don't know what you don't know, so dashboards are of very limited utility. Dashboards cover only anticipated cases: every dashboard is an artifact of past failures. There are too many dashboards, and they are too slow.

Aggregates are the kiss of death; important details get lost.

Black swans are the norm; you must care about 99.9%, epsilons, corner cases.

Watch things run in production in the normal case; get used to observing your systems when they aren't on fire.

Building Resilient Systems Using Chaos Engineering - Tammy Butow

Chaos engineering is "thoughtful planned experiments designed to show weak points in the system".

Top 5 popular ways to use chaos engineering now: Kubernetes, Kafka, AWS ECS, Cassandra, Elasticsearch.

Fullstack chaos engineering: inject faults at the API, app, cache, database, OS, host, network, and power levels.

We are exploring a new direction and collaborating with the UI engineers on ways to hide the impact of faults.

Prerequisites for chaos engineering:
1. monitoring & observability
2. on-call & incident management
3. know the cost of your downtime per hour (British Airways' 1-day outage cost $150 million)

How to choose a chaos experiment?
+ identify top 5 critical systems
+ choose 1 system
+ whiteboard the system
+ select attack: resource/state/network
+ determine scope

How to run your own gameday: http://gremlin.com/gameday

Outage post-mortems: https://github.com/danluu/post-mortems

First chaos engineering conference this year: http://twitter.com/chaosconf

Some notes about the venue: Snap Inc

There were fancy appetizers, very fancy. They had a kitchen there on the fifth floor (and on every floor?). Do they provide free lunch to Snap employees?

On the 5th floor, where the meeting took place, we had a great view of Puget Sound. The Snap building is just behind Pike Place Market. There were about 80-100 people. I think the 30+ folks outnumbered the 40+ folks, but not severely. Good showing from female engineers. There was ambient music at the beginning, from 6 to 6:30pm, but it was loud.

By the way, I have never used Snapchat... I am old. But I don't have a Facebook account, so maybe I am not that old.

MAD questions

1. Do you need to test in production?
The act of sabotaging parts of your system/availability may sound crazy to some people. But it puts a very firm commitment in place: you should be ready for these faults, as they will happen on one of these Thursdays. It establishes a discipline of testing, gets you to write the instrumentation needed for observability, and toughens you up. It puts you into a useful paranoid mindset: the enemy is always at the gates and never sleeps; I should be ready to face attacks. (Hmm, here is an army analogy: should you train with live ammunition? It is still controversial because of the lives on the line.)

Why not wait until faults occur in production by themselves? They will happen anyway. But when you do chaos testing, you control the inputs/failures, so you already know the root cause. And this gives you a much better opportunity to observe the percolation effects.

2. Analogies for chaos engineering
I have heard vaccination used as an analogy. It is a tactful analogy (much better than the live-fire analogy). Nobody can argue against the usefulness of vaccination.

Other things chaos testing evokes could be bloodletting and antifragility. I had read somewhere that athletes in ancient Greece would induce diarrhea on purpose a couple of weeks before competitions, so that their bodies could recover and come back much stronger by the time of the competition. I guess the reasoning goes as "too much monotony is a bad thing", and it is beneficial to stress/shake the system to avoid a local maximum. That reminds me of this YouTube video I show in my distributed systems class on the topic of resilience.

3. Debugging designs with TLA+
Even after you have a verified design, the implementation can still introduce errors, so using chaos engineering tools is valuable and important even then.

Chaos testing helps even "verified" systems, for their unverified parts:
Folks encouraged us to try testing verified file systems; we were skeptical we would find anything, but to our surprise, when we tested MIT's FSCQ file system, we found it did not persist data on fdatasync()! Apparently they had a bug in the un-verified portion of their code (Haskell-C bindings), which was caught by CrashMonkey! This shows that even verified file systems have un-verified components which are often complex, and which will have bugs.

4. Chaos tag
Turns out I have several posts mentioning chaos engineering, so I am creating a chaos tag for future posts.

Sunday, October 7, 2018

Debugging designs with TLA+

This post talks about why you should model your systems and exhaustively test these models/designs with the TLA+ framework. In the first part, I will discuss why modeling your designs is important and beneficial, and in the second part I will explain why TLA+ is a very suitable framework for modeling, especially for distributed and concurrent systems.

Modeling is important

If you have worked on a large software system, you know that it is prone to corner cases, failed assumptions, race conditions, and cascading faults.

There are many corner cases because there are many parameters, and they interfere with each other in unanticipated ways. The corner cases violate your seemingly reasonable implicit assumptions about the system components and environment, e.g., "1-hop is faster than 2-hops", "0-hop is faster than 1-hop", and "processes work at the same rate". There are abundant race conditions because today (with the rise of SOA, cloud, and microservices) all systems are distributed systems. Code that is supposedly an "atomic block of execution" fails due to other processes executing concurrently. Finally, faults happen and their effects are almost always underestimated pre-deployment. Faults take your system to unanticipated states, and from there on, as recovery actions interleave with normal system actions, the system may be thrown into even more unanticipated states.

In large software systems, which are inevitably distributed systems, there are many unknown-unknowns and an infinite number of highly-improbable ways things can go wrong. The human brain cannot scale to handle all these possibilities. To alleviate these problems, the industry developed tools for better observability and even testing in production to improve availability. These tools are very important and indispensable. But by the time you figure out some inherent problems with your design, it may be too hard and expensive to fix things. What you thought would be the last 10% of the project ends up taking 90% of your time in production and operations.

If you model your designs first and exhaustively test and debug these models for correctness against corner cases, failed assumptions, concurrency, and failures, you can catch errors at design time and fix them before they develop into problems and become costly to fix.

  • Modeling first does not extend your development time; on the contrary, it saves you time by reducing futile development attempts. Embarking on development with a flawed design almost always ensures that the implementation is flawed. While having a precise and correct model at hand does not guarantee that your implementation of the model is correct, it helps you avoid the big/intricate problems and also provides a good reference for testing your implementation against.
  • Constructing a precise model of your system gives you clarity of thinking and supports your development immensely. By modeling you discover the inherent complexities of the problem; that helps you focus your attention and ignore accidental/byproduct complexities.
  • The model also helps you communicate precisely with your team and others, as you avoid the ambiguity of natural language and the hand-waving and generalizations involved.
  • Finally, with the model at hand, you also have a chance to gradually introduce design decisions and see alternative ways to implement the design.


TLA+ is great for modeling

TLA+ is a formal language for describing and reasoning about distributed and concurrent systems. It was developed by Leslie Lamport, winner of the 2013 Turing Award. Lamport is a very important figure in distributed systems thanks to his work on logical clocks and Paxos, among many other contributions. For the last decade, he has been heavily involved in improving the TLA+ framework to help make distributed systems more manageable.

TLA+ uses basic math to model and reason about algorithms: predicate logic, set theory, and temporal logic are used for specifying systems. Best of all, the framework integrates a model checker that exhaustively tests your models in the face of corner cases, failed assumptions, concurrency, and failures. The model checker tries all executions possible for your model and tells you for which executions your invariants and system guarantees break.

Invariant-based reasoning
The TLA+ framework promotes invariant-based reasoning to prevent the problems that arise from operational reasoning. In operational reasoning, you start with a "happy path", and then you try to figure out "what can go wrong?" and how to prevent it. Of course, you always fall short in that enumeration of problem scenarios and overlook corner cases, race conditions, and cascading failures. In contrast, invariant-based reasoning focuses on "what needs to go right?" and how to ensure these properties hold as invariants of your system at all times. Invariant-based reasoning takes a principled, state-based (rather than operation/execution-based) view of your system.

To attain invariant-based reasoning, we specify safety and liveness properties for our models. Safety properties specify "what the system is allowed to do". For example, at all times, all committed data is present and correct. Liveness properties specify "what the system should eventually do". For example, whenever the system receives a request, it must eventually respond to that request. In other words, safety properties are concerned with "nothing bad happens", and liveness properties with "something good eventually happens".
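To make the two kinds of properties concrete, here is a small, hedged TLA+ fragment with illustrative names of my own (not taken from any particular spec): assume a toy system with variables store (the records held by the database), committed (the set of committed records), and requests/responses (the sets of request ids received and answered), with RequestIds a constant set of possible ids.

\* Safety: nothing bad happens -- at all times, every committed record is present in the store.
Durability == \A r \in committed : r \in store

\* Liveness: something good eventually happens -- every received request is eventually answered.
Responsiveness == \A id \in RequestIds : (id \in requests) ~> (id \in responses)

Durability would be listed as an invariant for the model checker to verify in every reachable state, while Responsiveness is a temporal property (using the leads-to operator ~>) that the model checker verifies under the model's fairness assumptions.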

Modeling with TLA+
The TLA+ framework supports you in building a model and figuring out its invariant properties in two major ways. Firstly, the math-based formal language helps you achieve precision while still working with high-level declarative statements. Secondly, the integrated model checker exhaustively debugs your model in the face of concurrency and failures, and produces counterexamples for which your candidate invariants fail. (After years of working with TLA+, I am still surprised by the counterexamples the model checker spits out for my models: it is very easy to overlook some scenarios, but the model checker sets you straight.) You address these problems by improving your model or sometimes by relaxing your candidate invariants, and after many iterations you converge to an exhaustively debugged model that guarantees the invariants.

Building a TLA+ model is beneficial even for systems that are already implemented and running. Through building the model, you learn about your system better, and figure out some latent failure modes and correct them before they occur in production.

Finally, maintaining a TLA+ model of your system provides important benefits for continuous development. While software systems need to be extended with new features frequently, these extensions may interfere with the system in unanticipated ways and lead to downtime. With the TLA+ model at hand, you can first add these features to your model, and catch/debug the problems at the design level using the model checker. This way you resolve potential issues before they even become problems.

TLA+ is practical
Since using TLA+ actually saves time when building large software systems, TLA+ modeling has been adopted as a practice by many software companies.

I am on sabbatical at Cosmos DB, Microsoft's globally distributed cloud-native database. The team has been using TLA+ to model the replication and global distribution protocols and to exhaustively test the designs for correctness against failures. We have recently published the customer-facing part of the model, which precisely defines the 5 consistency levels offered by Cosmos DB.

Amazon has also used TLA+ modeling for some of its AWS offerings and has written a nice experience report on this. There are also reports of using TLA+ for modeling hardware systems.

For the last 4 years, I have been incorporating TLA+ in my distributed systems classes. TLA+ enables students to learn about concurrency and invariant-based reasoning, and it provides them hands-on experience with distributed protocols. I also use TLA+ extensively in my research on new distributed algorithms.

In my experience, it is possible to pick up TLA+ in a couple of weeks. This is firstly because TLA+ adopts a very simple state-machine approach to modeling systems. A system consists of: (1) a set of variables that define the state of the system, and (2) a finite set of assignments/actions that transition the system from one state to another.

Furthermore, PlusCal provides syntactic sugar for TLA+, which has a tendency to grow long (due to its low-level state-transition centric syntax) and look cryptic to some people. PlusCal is a pseudocode-like language for writing algorithms at a higher level of abstraction, and it is translated to the underlying TLA+ specification for model checking. To give you some idea of PlusCal, below is a sketch of PlusCal code for a database replica process. While this is straightforward code, you can see the nondeterministic choice construct "either ... or" in action. The model checker will exhaustively test all possible combinations of these "either ... or" actions and check whether a certain sequence breaks one of your safety or liveness properties.
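Here is a minimal sketch in that spirit (my own illustrative reconstruction, not the exact code from the original post): a write region commits and replicates writes over a FIFO channel, and the replica either applies the next write or acknowledges what it has applied.

---- MODULE replica ----
EXTENDS Integers, Sequences
CONSTANT MAXN

(* --algorithm replica
variables
  chan = <<>>,      \* FIFO channel of replicated writes (LSNs) from the write region
  applied = 0,      \* highest LSN applied at the replica
  acked = 0;        \* highest LSN the replica has acknowledged

process WriteRegion = 1
variable committed = 0;
begin
W: while committed < MAXN do
     committed := committed + 1;          \* commit the next write
     chan := Append(chan, committed);     \* replicate it over the FIFO channel
   end while;
end process;

process Replica = 2
begin
R: while applied < MAXN do
     either
       \* apply the next replicated write from the channel
       await chan # <<>>;
       applied := Head(chan);
       chan := Tail(chan);
     or
       \* acknowledge the highest write applied so far
       await acked < applied;
       acked := applied;
     end either;
   end while;
end process;
end algorithm; *)
====

The PlusCal translator turns this into a TLA+ next-state relation, and the TLC model checker then explores every interleaving of the commit, apply, and acknowledge steps, checking any invariants you supply (for example, that applied never exceeds committed).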

To learn more

There is a very active TLA+ forum on Google Groups. Leslie Lamport chimes in on several threads.

My blog includes many examples of TLA+/PlusCal modeling of distributed algorithms/systems.

LearnTLA provides a user-friendly introduction to TLA+/PlusCal.

Lamport's site includes TLA+/PlusCal resources (videos/books/examples) and links to download the toolkit.

Wednesday, September 26, 2018

The last mile problem in trust

Blockchains are supposed to solve the trust problem. But blockchains attack only the easy part of the trust problem, and avoid the hard part. The easy part is to store the transactions in a tamper-resistant database. The hard part is to attest to physical world actions and state.

The blockchain is a database technology, and it does not attempt to attest to physical-world actions/state. It solves the problem of tamper-proofing the state after it is added to the database. It doesn't attempt to validate/test/certify whether the state is correct as it is added to the database. If humans create the state, there is inherently a trust problem: Was the lettuce bad before it was loaded onto the trucks, or are the truck conditions to blame? Did the farmer or the trucker lie?

If sensors create the state, this is still a very hard problem, but not because the sensors may have been tampered with ---that is a relatively easy problem to solve in hardware. The problem is hard because of the corner cases involved; how do you even start to pretend that the sensors have complete coverage (or good/fair sampling) and that the detection/verdict is accurate? It is really a very complex and messy problem. As far as complete coverage of food supply chains is concerned, you need DNA sequencing and metagenomics.

This is a classic last mile problem. Last mile problems are always the hardest to solve because of the massive fan-out, both in terms of scale and in terms of corner cases to handle. Last mile problems have haunted many domains, most notoriously telecommunications and transportation.

Walmart, Lettuce, and Blockchains

A couple of days ago there was a lot of hype about Walmart starting to use blockchain in its supply chain, to pinpoint where the lettuce came from in an E. coli contamination event.

Ok, let's get to the bottom of this. "Walmart, Lettuce, Blockchain." It felt very weird to type this in Google search, but I did it anyways... for science.

See, I knew there was a lot of hype: "The giant retailer will begin requiring lettuce and spinach suppliers to contribute to a blockchain database that can rapidly pinpoint contamination."

De-hyped, this just says Walmart wants the farmers to record transactions in a database. And actually the article makes sense if you replace blockchain with database: "Walmart says it now has a better system for pinpointing which batches of leafy green vegetables might be contaminated. After a two-year pilot project, the retailer announced on Monday that it would be using a database [the article says: "a blockchain, the type of database technology behind Bitcoin"] to keep track of every bag of spinach and head of lettuce."

I blame IBM's over-excitement about blockchains for the hype in the article. Supply chains are a very complex topic, and this use of a database to record information doesn't come close to scratching the surface of it. There are many automation and logistics problems that remain to be solved. And the dreaded last mile problem, of course.

MAD questions

1. What is the nature of trust?

What or who do you trust?

Trusting a deterministic machine with few inputs/environmental-parameters is reasonable. Especially if you verified and validated it, and tested it extensively.

But what would make you trust humans? Humans are complex nondeterministic beings, and the input and environment surrounding humans are also very complex.

Reid Hoffman defines trust as consistency through time. But this assumes the conditions don't change. If conditions change, that is, if the inputs/environmental conditions change, the other side can change its actions.

The answer to the trust puzzle has got to do with "consequences", right?

It is easier to trust in a situation where you have little to lose but the other side has a lot at stake. And ironically, this makes it hard for the other side to trust you, since you have little at stake and she is risking a lot. For mutual trust and better collaboration, all parties should have skin in the game.

So what is at stake? This can be reputation, if reputation is a currency valued by the individual and his environment. What is at stake can be jail time, if one breaks laws and gets caught. This assumes one doesn't enjoy jail; under certain conditions, people commit crimes to get into jail to get fed, to have reliable healthcare, and even to avoid feeling lonely.

I think trust is not complicated, rather the calculation, alignment, and managing of consequences/incentives is complicated. And this again harkens back to the last mile problem in trust.

I believe the parties involved are going to push the limits of what they can get away with as long as the deterrents do not outweigh the incentives.

I don't know if there is a technology solution here.

At a recent A16Z podcast, one speaker was rightfully complaining that we have a lot of trust issues and fights among complementary businesses rather than substitute/alternative businesses. For example, even though iPhone apps and the iPhone platform are complementary businesses, there is a lot of fighting there. Or consider the Yelp versus Google fight. Or the fights Facebook, the platform, picks with the applications it enables. The speaker was implying that with the right incentivization and cuts from cryptocurrencies like Ethereum gas, the parties will actually synergize and grow together rather than fight.

This sounds nice and simple, but I don't think I buy it. The fights are due to the greedy nature of humans and companies. To repeat what I said above: I believe the parties involved are going to push the limits of what they can get away with as long as the deterrents do not outweigh the incentives. Even if cryptocurrencies and Ethereum gas are used between platforms and the applications they enable, next we will see fights over what share of the payment is fair, etc. I don't know if technology can fix that. Maybe this is supposed to be a dynamic equilibrium, with constant push-backs and small battles erupting from the parties involved.

2. What is the verdict?

I don't hate/despise blockchains, as I have seen some colleagues do. That is a radical and unreasonable position. There are many smart people working in this domain; they cannot all be completely wrong.

I am still ambivalent about blockchains. I believe there is still a big contribution potential coming from blockchains and smart contracts. But the hype makes things harder to see.

Saturday, September 8, 2018

Book review. Ignorance: How It Drives Science

I picked this up from my local library, because the title was interesting. I wrote about this earlier.
Once you get a B.S., you think "you know everything". Once you get an M.S., you realize "you know nothing". Once you get a Ph.D., you realize that "yes, you know nothing, but that is not a problem, because nobody knows anything!"
This turned out to be a nice read. The author, Stuart Firestein, has a very interesting background. He was working at a theater, started a biology undergraduate degree at 30, and got his PhD at 40.

Here are some tidbits from the book.

Leibniz. page 38
The 17th-century German philosopher and mathematician Gottfried Leibniz, one of the inventors of calculus, had a lifelong project to construct a "basic alphabet of human thoughts" that would allow one to take combinations of simple thoughts and form any complex idea, just as a limited number of words can be combined endlessly to form any sentence -- including sentences never before heard or spoken. Thus, with a few primary simple thoughts and the rules of combination one could generate computationally (although in Leibniz's day it would have been mechanically) all the possible human thoughts. It was Leibniz's idea that this procedure would allow one to determine immediately if a thought were true or valuable or interesting in much the same way these judgments can be made about a sentence or an equation -- is it properly formed, does it make sense, is it interesting? He was famously quoted as saying that any dispute could be settled by calculating -- "Let us calculate!" he was apparently known to blurt out in the middle of a bar brawl. It was this obsession that led Leibniz to develop the branch of mathematics known today as combinatorics. This in turn sprang from the original insight that all truths can be deduced from a smaller number of primary or primitive statements, which could be made no simpler, and that mathematical operations (multiplication was the one Leibniz proposed but also prime factorization) could derive all subsequent thoughts. In many ways this was the beginning of modern logic; indeed, some consider his /On the Art of Combinations/ the major step leading from Aristotle to modern logic, although Leibniz himself never made such claims.

Godel. page 41
What Godel showed, using a strange new correspondence between mathematics and logic that he invented, was that a system can contain true statements that cannot be proved using the rules of that system. This means that something that could be shown to be true using the system could not in fact be proved to be so. Since proofs are the foundation of mathematics, it is quite curious when obviously true statements cannot be proved.

Godel. page 42
Was this the end of the messianic program to establish the primacy of mathematics and of logical thinking? As it turns out, quite the contrary. Godel's small, by comparison, but revolutionary output is so astonishing because of the technical and philosophical research opportunities it has created. Previously unconsidered ideas about recursiveness, paradox, algorithms, and even consciousness owe their foundations to Godel's ideas about incompleteness. What at first seems like a negative --eternal incompleteness-- turns out to be fruitful beyond imagining. Perhaps much of computer science, an area one might think was most dependent on empirical statements of unimpeachable logic, could not have progressed without the seminal ideas of Godel. Indeed, unknowability and incompleteness are the best things that ever happened to science.

Hilbert. page 48
In fact, one of the most predictable things about predictions is how often they're wrong. Nonetheless, they are a measure, even if somewhat imprecise, of our ignorance. They are a catalog of what we think the important ignorance is, and perhaps also a judgment of what we think is the most solvable ignorance.

May ignorance lead your research. page 55
Ignorance is not just an excuse for poor planning. We must think about how ignorance works, and we have to be explicit about how to make it work to our advantage. While for many experienced scientists this is intuitive, it is not so obvious to the layperson, and it often seems not so apparent to young scientists starting out their career and worrying about grant support and tenure.

Grants. page 59
How do scientists ponder these big questions about ignorance? How do they get from these and other interesting and important issues to an actual scientific research program? Well, at the most pedestrian, but nonetheless critical level, there are grant proposals. Every scientist spends a significant percentage of his or her time writing grants. Many complain about this, but I actually think it's a good idea. These documents are, after all, a detailed statement of what the scientist hopes to know, but doesn't, as well as  a rudimentary plan for finding it out.

Models. page 70
This strategy of using smaller questions to ask larger ones, is, if not particular to science, one of its foundations. In scientific parlance this is called using a "model system". As Marvin Minsky, one of the fathers of artificial intelligence, points out, "In science one can learn the most by studying the least". Think how much more we know about viruses and how they work than about elephants and how they work. The brain, for example, is a very complicated piece of biological machinery. Figuring out how it works is understandably one of humankind's great quests. But, unlike a real machine, a man-made, designed machine, we have no schematic. We have to discover, uncover, the inner workings by dissection-- we have to take it apart. Not just physically but also functionally. That's a tall order since there are some 80 billion nerve cells that make up the human brain, and they make about 100 trillion connections with each other. ... So instead of a human brain, neuroscientists study rat and mouse brains, fly brains because they  can do some very fancy genetics on them, or even the nervous system of the nematode worm, which has exactly 302 neurons.

MAD questions

1. What do you feel ignorant about in your line of work?

Friday, August 31, 2018

TLA+ specification of the bounded staleness and strong consistency guarantees

In my previous post, I presented a TLA+ model of a distributed data store that provides the consistent prefix property. In this post, I extend that model slightly to provide bounded staleness and strong consistency. In fact, the strong consistency specification is obtained when we take the Delta on bounded staleness to be 1.

The TLA+ (well, PlusCal to be more accurate) code for this model is available at https://www.dropbox.com/s/qvmhhgjf9iycaca/boundedstrongP.tla?dl=0

The system model

As in the previous post, we assume there is a write region (with ID=1) and read regions (with IDs 2 through NumRegions). The write region performs the write and copies it to the read regions. There are FIFO communication channels between the write and read regions.
WriteRegion == 1
ReadRegions == 2..NumRegions
chan = [n \in 1..NumRegions |-> <<>>]; 

We use D to denote the Delta for bounded staleness consistency. Bounded staleness ensures that read results are not too stale. That is, a read operation is guaranteed to see at least all writes committed more than D updates before the read started. The read may potentially see some more recently written values.

Strong consistency ensures that a read operation returns the value that was last written for a given object. This is achieved by using D=1.

The write region actions


The write region has 3 actions to perform: Commit the write in the write region, Multicast the write to the read regions, and Receive/process an ack message from a read region.

These actions can be performed in any arbitrary order inside the process loop, and the model checker will methodically explore all possible combinations of these actions interleaved with the read region actions to expose any flaws in the protocol.

The first action selection is to commit a write in the write region. (We don't get into how that is implemented within the region; maybe the commit is done by replicating the update to a write quorum in the region.) As a result, the CommittedLSN is incremented. Note that, in contrast to the consistent prefix model, in the bounded staleness model the write is throttled so it does not advance more than D ahead of any of the read regions.

The second action selection is to multicast a committed write to the read regions through the FIFO channels; this is asynchronous replication. The messages may be sent whenever this action is chosen, and the send is not blocked waiting for acknowledgements from the read regions.

This action is exactly the same as in the prefix consistency model. The SentLSN variable denotes the last CommittedLSN write that is forwarded to the read regions and is used for ensuring ordered replication.

The final action selection is to receive an Ack message on the channel from a read region. The Progress variable keeps track of the Ack messages, and the CompletedLSN is updated to reflect the highest write that has been acknowledged by all the read regions. Harking back to action 1, notice that the write region lazily disseminates this CompletedLSN information to the read regions by piggybacking it on the commit-write messages. In this model the read regions do not utilize this CompletedLSN information, but as I start to explore in the MAD questions, it can be useful.
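To make the throttling concrete, here is a rough PlusCal fragment for the write region process, in the spirit of the model (names follow the post, but this is my own sketch rather than the exact code in the linked spec; it assumes the chan, ReadRegions, D, and MAXN declarations shown above):

process WriteRegion = 1
variables CommittedLSN = 0, SentLSN = 0, CompletedLSN = 0,
          Progress = [r \in ReadRegions |-> 0];
begin
W: while SentLSN < MAXN do
     either
       \* 1. Commit a write, throttled to stay within D of every read region's acked progress
       await CommittedLSN < MAXN /\ \A r \in ReadRegions : CommittedLSN < Progress[r] + D;
       CommittedLSN := CommittedLSN + 1;
     or
       \* 2. Multicast the pending writes in LSN order, piggybacking CompletedLSN
       await SentLSN < CommittedLSN;
       chan := [n \in DOMAIN chan |->
                  IF n \in ReadRegions
                  THEN chan[n] \o [i \in 1..(CommittedLSN - SentLSN) |->
                                     <<SentLSN + i, CompletedLSN>>]
                  ELSE chan[n]];
       SentLSN := CommittedLSN;
     or
       \* 3. Receive an Ack <<region, lsn>>, then recompute CompletedLSN as the minimum progress
       await chan[1] # <<>>;
       Progress[Head(chan[1])[1]] := Head(chan[1])[2];
       chan[1] := Tail(chan[1]);
       CompletedLSN := CHOOSE m \in {Progress[r] : r \in ReadRegions} :
                          \A r \in ReadRegions : Progress[r] >= m;
     end either;
   end while;
end process;

The await in the commit branch is what implements the bounded staleness throttle: a new write is allowed only if it would leave the write region at most D ahead of every read region's acknowledged progress.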

The read region actions


The read regions only react to the replicated messages from the write region. The first action selection is to receive a message pending in the channel from the write region. The second action selection is to send back an Ack message for any replication message that has not been acknowledged yet. The actions are almost the same as in the consistent prefix model, except for the line updating the CompletedLSN at the read region.
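Correspondingly, a rough sketch of the read region process (again my own reconstruction, not the exact code) could look like the following; the only addition over the consistent prefix version is recording the piggybacked CompletedLSN:

process ReadRegion \in ReadRegions
variables CommittedLSN = 0, AckedLSN = 0, CompletedLSN = 0;
begin
R: while CommittedLSN < MAXN do
     either
       \* Receive the next replicated write <<lsn, completed>> in FIFO order
       await chan[self] # <<>>;
       CommittedLSN := Head(chan[self])[1];
       CompletedLSN := Head(chan[self])[2];   \* the extra line compared to the consistent prefix model
       chan[self] := Tail(chan[self]);
     or
       \* Acknowledge the highest write received so far
       await AckedLSN < CommittedLSN;
       chan[1] := Append(chan[1], <<self, CommittedLSN>>);
       AckedLSN := CommittedLSN;
     end either;
   end while;
end process;

Since both processes declare CommittedLSN and CompletedLSN locally, the translator renames one copy with an underscore, which is why the invariants below refer to CommittedLSN_[1] and CompletedLSN_[1] for the write region.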

Invariant checking and testing

The consistent prefix invariant still holds for this model, since we refined that model to obtain this one.
CP == [][\A i \in ReadRegions: 
               CommittedLSN'[i] = CommittedLSN[i] 
            \/ CommittedLSN'[i] = CommittedLSN[i] + 1]_vars

The BoundedC invariant checks that the read regions always stay within the staleness bound of the most recent CommittedLSN. (Since I used the CommittedLSN variable for both read and write regions, the TLA+ translator assigned CommittedLSN_ "with underscore" to the write region's version to distinguish it from that of the read regions.)
BoundedC  == \A i \in ReadRegions : 
                      CommittedLSN[i]=< CommittedLSN_[1] 
                   /\ CommittedLSN[i]>= CommittedLSN_[1] -D

The SyncStep invariant checks the relationship between the CompletedLSN at the write region and the copies maintained at the read regions.
SyncStep  == \A i \in ReadRegions : 
                      CompletedLSN[i] =< CompletedLSN_[1]
                   \/ CompletedLSN[i] > CompletedLSN_[1] -D

I first wrote this predicate with "CompletedLSN[i] > CompletedLSN_[1] -1", but the model checker was quick to tell me I was wrong. This is bounded by D and not by 1, as receive operations at the read regions can be asynchronous within the D staleness bound. In the counterexample, the write region received Acks for two of its commits back to back, so the CompletedLSN at the write region was 2 versions ahead of those at the read regions.

MAD questions

1. Did we explore the design space thoroughly within this simple model?

No, it turns out there are still surprisingly interesting and useful tricks we can pull within this simple model.

As I mentioned in the review of the "Many Faces of Consistency" paper, there are the "operational definitions of consistency" exposed to the client and the "state-based definitions of consistency" used by the developers, and there is a gap between the two where you can play interesting tricks and expose the operational consistency the client cares about in an efficient way.

In our model we approached things from the state consistency perspective and made sure everything works safely. We can still add a lot of efficiency and functionality by slightly changing how we expose things to the client from an operational consistency perspective. Azure Cosmos DB performs many interesting tricks under the hood to implement many consistency models in concert. More on this later...

Wednesday, August 29, 2018

TLA+ specification of the consistent prefix guarantee

We covered consistency properties in recent posts. Now, to keep things concrete and to give you a taste of TLA+, I present a model of a distributed data store that provides the consistent prefix property.

In my next post, I will extend this model slightly to present bounded and strong consistency properties. And, in the post after that, I will add a client model to show how this data store can be extended to provide session consistency. The TLA+ (well, PlusCal to be more accurate) code for this model is available here. 

The system model

We assume there is a write region (with ID=1) and read regions (with IDs 2 through NumRegions). The write region performs the write and copies it to the read regions. There are FIFO communication channels between the write and read regions.
WriteRegion == 1
ReadRegions == 2..NumRegions
chan = [n \in 1..NumRegions |-> <<>>]; 

The write region actions


The write region has 3 actions to perform:

  • Commit the write in the write region
  • Multicast the write to the read regions
  • Receive/process an ack message from a read region

These three actions can be performed in any arbitrary order inside the process loop, and the model checker will ruthlessly try all possible combinations of these actions interleaved with the read region actions to check if there is a violation of the invariant properties entered.

The first action selection is to commit a write in the write region. (We don't get into how that is implemented within the region; maybe the commit is done by replicating the update to a write quorum in the region.) As a result, the CommittedLSN is incremented. CommittedLSN keeps track of the write operations. We are not modeling the content of a write; we model a write only by its CommittedLSN and try to get the writes replicated to the read regions in the same order.

The second action selection is to multicast a committed write to the read regions. To send the writes in order through the FIFO channels, the write region maintains a SentLSN variable to denote the last CommittedLSN write that has been forwarded to the read regions. If there are CommittedLSN writes that are yet to be forwarded, these are multicast in LSN order without any gaps. However, note that this is asynchronous replication. The messages may be sent whenever this action is chosen, and this send action is not blocked waiting for acknowledgements from the read regions.

The final action selection is to receive an Ack message from a read region. This model does not use the Ack messages for updating anything, but they can be used for tracking the progress of the read regions. This will become important for the bounded staleness and strong consistency properties.

The loop is bounded by a constant MAXN so that the model checking space is limited. MAXN=5 is an acceptable number. If you have worked with a model checker, you know that checking 5 repeats of the operation without breaking the invariant is a very good indication that it works. Yes, this is not an induction proof, but since the model checker tries everything that could possibly go wrong, it would have found a counterexample (e.g., a race condition that violates an invariant) even with 3 repeats or so.

The read region actions


The read regions only react to the replicated messages from the write region. So they have only two actions.

The first action selection is to receive a message pending in the channel from the write region. The CommittedLSN variable is then updated to match the LSN in the message. Here we don't check for gaps, but since the channels are FIFO and the messages are sent by the write region in increasing LSN order without gaps, the consistent prefix property holds ---as we later confirm when we model-check the invariant.

The second action selection is to send back an Ack message for any replication message that has not been acknowledged yet. Since this action is also chosen asynchronously, the acknowledgement can be cumulative: it can skip over LSNs and acknowledge the highest LSN seen, and that is OK.
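Putting the pieces together, here is a rough, self-contained PlusCal sketch of the whole model (my own reconstruction in the spirit of the linked specification, not the exact code; names follow the post):

------------------------- MODULE prefixconsistency -------------------------
EXTENDS Integers, Sequences
CONSTANTS NumRegions, MAXN

(* --algorithm prefix
variables
  chan = [n \in 1..NumRegions |-> <<>>];  \* FIFO channels: chan[1] carries Acks, the others carry writes

define
  WriteRegion == 1
  ReadRegions == 2..NumRegions
end define;

process WriteR = 1
variables CommittedLSN = 0, SentLSN = 0;
begin
W: while SentLSN < MAXN do
     either
       \* commit the next write locally
       await CommittedLSN < MAXN;
       CommittedLSN := CommittedLSN + 1;
     or
       \* multicast the committed-but-unsent writes, in LSN order, to all read regions
       await SentLSN < CommittedLSN;
       chan := [n \in DOMAIN chan |->
                  IF n \in ReadRegions
                  THEN chan[n] \o [i \in 1..(CommittedLSN - SentLSN) |-> SentLSN + i]
                  ELSE chan[n]];
       SentLSN := CommittedLSN;
     or
       \* receive (and ignore) an Ack from a read region
       await chan[1] # <<>>;
       chan[1] := Tail(chan[1]);
     end either;
   end while;
end process;

process ReadR \in ReadRegions
variables CommittedLSN = 0, AckedLSN = 0;
begin
R: while CommittedLSN < MAXN do
     either
       \* apply the next replicated write in FIFO order
       await chan[self] # <<>>;
       CommittedLSN := Head(chan[self]);
       chan[self] := Tail(chan[self]);
     or
       \* acknowledge the highest write applied so far
       await AckedLSN < CommittedLSN;
       chan[1] := Append(chan[1], CommittedLSN);
       AckedLSN := CommittedLSN;
     end either;
   end while;
end process;
end algorithm; *)
=============================================================================

With, say, NumRegions=3 and MAXN=5, TLC explores all interleavings of these actions, and the CP property given below can be checked against the resulting behaviors.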

Invariant checking and testing

Our invariant is short and sweet. It checks that at any read region, the CommittedLSN --if updated-- is always incremented by 1 over its previous value. That is, there are no gaps in the commit sequence.
CP == [][\A i \in ReadRegions: 
               CommittedLSN'[i] = CommittedLSN[i] 
            \/ CommittedLSN'[i] = CommittedLSN[i] + 1]_vars

An alternative would be to insert a PrefixConsistency boolean variable in the specification, which gets set to FALSE at the read region if the replica receives an out-of-order commit request. But that makes the model ugly; it is not nice to entangle the model and property to be checked.

Another alternative would be to write a client and check for operational consistency among the client reads. But that is cumbersome, because you need to introduce a client and a read operation at the read regions. Furthermore, that would also cause the model checker to take longer to complete.

What are some other properties we can test here?

With the consistent prefix guarantee, the write region can commit asynchronously and freely without waiting for the read regions to catch up. The read regions can catch up at their own pace.

Let's write some conditions, "fake invariants", to test that this is the case.
SyncStep  == \A i \in ReadRegions  : 
                   CommittedLSN[i]> CommittedLSN_[1] -3
SyncStep2 == \A i,j \in ReadRegions: 
                   CommittedLSN[i]# CommittedLSN[j]  -3

The model checker is quick to find counterexamples for these conditions. For SyncStep, the error trace provides a short counterexample where the write region raced ahead to commit 3 updates without broadcasting any updates to the read regions. (Since I used the CommittedLSN variable for both read and write regions, the TLA+ translator assigned CommittedLSN_ "with underscore" to the write region's version to distinguish it from that of the read regions.)


For SyncStep2, the error trace is longer because it needs some setup. Here the write region performed 3 writes and broadcast them to the read regions, but only read region 2 performed the updates and got to CommittedLSN=3, while read region 3 had not performed any receive action on its channel.


Things I tried to speed up the TLA+ model checking

I believe it is important to share the mistakes committed as well as the end result, so others can learn better.

My first model was large. My improvements involved trimming and simplifying that model.
Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away. --Antoine de Saint-Exupery.
To keep model checking feasible I used MAXN. But I found that the model checker would still be crunching scenarios after ten minutes. I knew that this is a simple protocol and it shouldn't take that long to model check. Then I noticed that the problem was bounding the run on the wrong thing: I was bounding the CompletedLSN variable with MAXN, but CommittedLSN was still able to race unbounded above MAXN. After I fixed my mistake, the model checker took only a couple of seconds for MAXN=5.

What is CompletedLSN? In an earlier version of the model, the write region used CompletedLSN to keep track of the progress of the read regions, but I realized this variable was unnecessary for checking the consistent prefix property, so I took it out entirely.

I also performed other simplifications to reduce the model checking state space. Instead of sending messages to the regions one by one, I modified the write region to queue the messages at once via the multicast operation.


MAD questions

1. How do you relax this for eventual consistency?
For eventual consistency, we don't need to maintain CommittedLSN or a sequencing of writes. The write region can maintain a set of writes/updates and multicast any write/update from that set at random. The read region just receives an update and applies it. There is also no need for acknowledgements from the read regions. The repeated multicasts will eventually ensure that every write/update is replicated.
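A rough fragment of what the relaxed write region could look like (again my own sketch under the same chan/ReadRegions/MAXN assumptions, not code from a post):

process WriteRegion = 1
variables pending = {};   \* the set of writes/updates; no ordering is maintained
begin
W: while TRUE do
     either
       \* accept a new write into the set
       with w \in 1..MAXN do
         pending := pending \union {w};
       end with;
     or
       \* gossip an arbitrary pending write to an arbitrary read region
       await pending # {};
       with w \in pending, r \in ReadRegions do
         chan[r] := Append(chan[r], w);
       end with;
     end either;
   end while;
end process;

(For model checking you would bound the resends, or model chan[r] as a set instead of a sequence, since unbounded repeats of the same write blow up the state space.)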

2. How would you extend this to the stronger consistency guarantees?
In my next post, I will extend this model to specify bounded staleness and strong consistency properties. As part of this extension, the write region would need to keep track of the updates performed at the read regions and slow down to make sure it is not going too far ahead of the read regions.

Since session consistency is specific to a client, we will write a client model and let the client share/copy LSN numbers and get served using that LSN number. The "LSN" will be our session state.

3. Was this too simple an example to be worth modeling?
As I wrote above, I had several blunders before I could simplify this model to what I presented. And I learned a lot from developing this simple model. So, this was definitely worth modeling and working on.

I had written about the benefits of modeling before here.

Friday, August 24, 2018

Mind Your State for Your State of Mind

This article by Pat Helland appeared in ACM Queue in July 2018. Pat is a self-confessed apostate. He is also a database philosopher; look at the titles of his recent publications: "Standing on distributed shoulders of giants", "Life beyond distributed transactions", "Immutability changes everything", "Heisenberg was on the 'write' track", "Consistently eventual", etc.

This "mind your state for your state of mind" article looks at the history of interactions of applications and storage/databases, and charts their co-evolution as they move into the distributed and scalable world.

The evolution of state, storage, and computing

Storage has evolved from disks directly attached to your computer to shared appliances such as SANs (storage area networks) leading the way to storage clusters of commodity servers contained in a network.

Computing evolved from a single process on a single server, to multiple processes communicating on a single server, to RPCs (remote procedure calls) across a tiny cluster of servers. In the 2000s, the concept of SOA (service-oriented architecture) emerged to provide trust isolation so that a distrusted outsider cannot modify the data. As the industry started running services at huge scale, it learned that breaking a service into smaller microservices provides advantages through better modularity/decoupling and through stateless, restartable operation. Today microservices have emerged as the leading technique to support scalable applications.

Databases also evolved tremendously. Before database transactions, there were complexities in updating data even in a single computer, especially if failures happened. Database transactions dramatically simplified the life of application developers. But as solutions scaled beyond a single database, life got more challenging. First, we tried to make multiple databases look like one database. Then, we started hooking multiple applications together using SOA; each service had its own discrete database with its own transactions but used messaging to coordinate across boundaries. Key-value stores offered more scale but less declarative functionality for processing the application's data.  Multirecord transactions were lost as scale was gained. Finally, when we started using microservices, a microservice instance did not have its own data but reached directly to a distributed store shared across many separate services. This scaled better—if you got the implementation right.

Minding durable and session state in microservices

Durable state is stuff that gets remembered across requests and persists across failures. Durable state is not usually kept in microservices. Instead, it is kept in back-end databases and key-value stores. In the next section we look at some of these distributed stores and their example use cases.

Session state is the stuff that gets remembered across requests in a session but not across failures. Session state exists within the endpoints associated with the session, and is hard to maintain when the session is smeared across service instances: the next message to the service pool may land at a different service instance.

Without session state, you can't easily create transactions crossing requests. Typically, microservice environments support a transaction within a single request but not across multiple requests. Furthermore, if a microservice accesses a scalable key-value store as it processes a single request, the scalable key-value store will usually support only atomic updates to a single key. Programmers are on their own when changing values tied to multiple keys. I will discuss the implications of this in the MAD questions section at the end.

Different stores for different uses

As we mentioned as part of the evolution of databases, to cope with scalable environments, data had to be sharded into key-value form. Most of these scalable key-value stores ensured linearizable, strongly consistent updates to their single keys. Unfortunately, these linearizable stores would occasionally cause delays seen by users. This led to the construction of nonlinearizable stores, with the big advantage that they have excellent response times for reads and writes. In exchange, they sometimes give a reader an old value.


Different applications demand different behaviors from durable state. Do you want it *right* or do you want it *right now*? Applications usually want the latter and are tolerant of stale versions. We review some example application patterns below.


Workflow over key-value. This pattern demonstrates how applications perform workflow when the durable state is too large to fit in a single database. The workflow implemented by careful replacement will be a mess if you can't read the last value written. Hence, this usage pattern will stall and not be stale. This is the "must be right" even if it's not "right now" case.

Transactional blobs-by-ref. This application runs using transactions and a relational database, and it stores big blobs such as documents, photos, and PDFs in a separate data store. To modify a blob, you always create a new blob to replace the old one. Storing immutable blobs in a nonlinearizable database does not have any problems with returning a stale version: since there's only one immutable version, there are no stale versions. Storing immutable data in a nonlinearizable store enjoys the best of both worlds: it's both right and right now.

E-Commerce shopping cart. In e-commerce, each shopping cart is for a separate customer. There's no need or desire for cross-cart consistency. Customers are very unhappy if their access to a shopping cart stalls. Shopping carts should be right now even if they're not right.

E-Commerce product catalog. Product catalogs for large e-commerce sites are processed offline and stuffed into large scalable caches. This is another example of the business needing an answer right now more than it needs the answer to be right.

Search. In search, it is OK to get stale answers, but the latency for the response must be short. There's no notion of linearizable reads nor of read-your-writes.


Each application pattern shows different characteristics and tradeoffs. As a developer, you should first consider your application's requirements carefully.
  • Is it OK to stall on reads?
  • Is it OK to stall on writes?
  • Is it OK to return stale versions?
You can't have everything!

If you don't carefully mind your state, it will bite back, and degrade your state of mind.

The Cosmos DB take

I am doing my sabbatical at Microsoft Cosmos DB. So I try to put things in context based on what I see/work on here. This is how Cosmos DB fits in this picture and provides answers to these challenges.

We had talked about this in a previous post. "Clients should be able to choose their desired consistency. The system cannot possibly predict or determine the consistency that is required by a given application or client."

Most real-world application scenarios do not fall nicely into the two extreme choices of consistency models that are typically offered by databases, e.g., strong and eventual consistency. Cosmos DB offers five well-defined and practical consistency models to accommodate real-life application needs. Of these 5 well-defined consistency models, customers have been overwhelmingly selecting the relaxed ones (e.g., Session, Bounded Staleness, and Consistent Prefix). Cosmos DB brings a tunable set of well-defined consistency models that users can set at the click of a button ---there is no need to deal with messy datacenters, databases, and quorum configurations. Users can later override the consistency level they selected on each individual request if they want to. Cosmos DB is also the first commercial database system to have exposed Probabilistic Bounded Staleness (PBS) as a metric for customers to determine "how eventual is eventual consistency".


I will have a series of posts on global distribution/replication in Cosmos DB soon. Below is just an appetizer :-)

Even for a deployment over multiple regions/continents, a write operation at the session, consistent prefix, and eventual consistency levels is acknowledged by the write region without blocking for replies from other regions. A bounded staleness write also often gets a quick reply from the write region without waiting for replication to other regions: if the staleness bound has not been reached, the region can give the write a green light locally. Finally, in all these cases, including strong consistency, with the multimaster general availability rollout, the write region is always the closest region to the client performing the write.

Read latencies are guaranteed to be fast, backed by 99.99% SLAs. Reads are always answered within the region (the nearest region) contacted by the client. While serving reads from the nearest region, the consistency level selected (a global property!) is guaranteed by the clever use of logical sequence numbers to check that the data provided satisfies the selected consistency level. Within the region, a read is answered by one replica (without waiting for another replica or region) for the consistent prefix and eventual consistency levels. Reads may occasionally wait for a second read replica (potentially waiting for fresh data to arrive from another region) for session consistency. Finally, reads are answered by a read quorum of 2 out of 4 for bounded staleness and strong consistency.

In other words, while the tradeoffs between predictable read latency, predictable write latency, and read your writes (session consistency) inherently exists, Cosmos DB handles them lazily/efficiently but surely by involving client-library (or gateway) collaboration. As we mentioned in a previous post, it is possible to achieve more efficiency with lazy state synchronization behind the curtain of operation consistency exposed to the client.

MAD questions


1. What is the next trend in the co-evolution of computing and storage?

The pattern of software-based innovation has always been to virtualize a physical thing (say typewriters, libraries, publishing press, accountants), and then improve on it every year thanks to the availability of exponentially more computing/querying power. The cloud took this to another level with its virtually infinite computing and storage resources provided on demand.

Software is eating the world. By 2025, we will likely have a virtual personal assistant, virtual nanny, virtual personalized teacher, and a virtual personalized doctor accessible through our smartphones.

If you think about it, the trend of providing X as a service derives and benefits from the trend of virtualizing X. Another term for virtualization is making it software-defined. We have seen software-defined storage, software-defined networks, software-defined radios, etc. (A couple of years ago I was joking about when we would see software-defined software; now I joke about software-defined software-defined software.)

This trend also applies to and shapes the cloud computing architecture.

  • First virtual machines (VMs) came and virtualized and shared the hardware so multiple VMs can colocate on the same machine. This allowed consolidation of machines, prevented the server sprawl problem, and reduced costs as well as improving manageability. 
  • Then containers came and virtualized and shared the operating system, and avoided the overheads of VMs. They provided faster startup times for application servers. 
  • "Serverless" took the virtualization a step ahead. They virtualize and share the runtime, and now the unit of deployment is a function. Applications are now defined as a set of functions (i.e., lambda handlers) with access to a common data store. 

A new trend is developing toward utilizing serverless computing even for long-running analytics jobs. Here are some examples.

The gravity of virtualization pulls for disaggregation of services as well. The article talked about the trend of disaggregating services, e.g., computing from storage. I think this trend will continue, because it is fueled by the economies of scale that cloud computing leverages.

Finally, I had written a distributed systems perspective prediction of new trends for research earlier here.


2. But, you didn't talk about decentralization and blockchains?

The cloud, and modern datacenter computing for that matter, provides trust isolation. The premise of blockchains and decentralized systems is to provide trustless computation; they take trustlessness as an axiom. While it is clear that trust isolation is a feature organizations/people care about (due to security and fraud protection), it is not clear whether trustless computing is a feature organizations/people care about.

Finally, as I have written several times before, logical centralization (i.e., the cloud model) has a lot of advantages over decentralization in terms of efficiency, scalability, ease of coordination, and even fault-tolerance (via ease of coordination and availability of replication). Centralization benefits from the powerful hand of economies of scale. Decentralization is up against a very steep climb.

Here is High Scalability blog's take on it as well.


3. How do we start to address the distributed coordination challenge of microservices?

Although transactional solutions do not work well with microservices architectures, we often need to provide some transactional guarantees to operations that span multiple (sometimes dozens of) microservices. In particular, if one of the microservices in the request is not successful, we need to revert the state of the microservices that have already changed their states. As we discussed, it is hard to maintain session state with microservices. These corrective actions are typically written at the coordinator layer of the application in an ad hoc manner and are not enforced by a specialized protocol.

This is a real problem, compounded by the tendency of microservice instances to appear/disappear on demand.

One thing that helps is to use the distributed sagas pattern to instill some discipline in undoing the side effects of a failed operation that involves many microservices.
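
As a rough illustration (this is not any particular saga framework, and the booking services in the usage comment are stand-ins), a saga coordinator pairs each step of a multi-microservice operation with a compensating action and runs the compensations in reverse order when a later step fails:

# Minimal saga coordinator sketch: each step carries a compensating action
# that undoes its side effects if a later step fails.
class SagaFailed(Exception):
    pass

def run_saga(steps):
    """steps: list of (action, compensation) pairs of zero-argument callables."""
    completed = []
    try:
        for action, compensation in steps:
            action()
            completed.append(compensation)
    except Exception as err:
        # Undo already-completed steps in reverse order (best effort; in a real
        # system compensations must be idempotent and retriable).
        for compensation in reversed(completed):
            compensation()
        raise SagaFailed("saga rolled back") from err

# Hypothetical booking flow spanning three microservices:
# run_saga([(reserve_flight, cancel_flight),
#           (charge_card,    refund_card),
#           (book_hotel,     cancel_hotel)])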

I had proposed that self-stabilization has a role to play here. Here is my report on this; section 4.1 is the most relevant part to this problem.

Last Friday I attended @cmeik's end-of-internship talk at Microsoft Research. It was on building a prototype middleware that helps with the fault-tolerance of cross-microservice operations.

Tuesday, August 21, 2018

Logical index organization in Cosmos DB

This post zooms into the logical indexing subsystem mentioned in my previous post on "Schema-Agnostic Indexing with Azure Cosmos DB". 

With the advent of big data, we face a big data-integration problem. It is very hard to enforce a schema (structure/type system) on data, and irregularities and entropy are a fact of life. You will be better off if you accept this as a given, rather than pretend that you are very organized, that you can foresee all required fields in your application/database, and that every branch of your organization will be disciplined enough to use the same format to collect/store data.

A patch employed by relational databases is to add sparse new columns to accommodate possibilities and to store a superset of the schemas. However, after you invoke an alter-table on a big dataset, you realize this doesn't scale well, and you start searching for schema-agnostic solutions.

Achieving schema agnosticism

As we discussed in the previous post, JSON provides a solution for easier schema management. JSON's type system is simple and lightweight (in contrast to XML) and self-documenting. JSON supports a strict subset of the type systems of Javascript and many modern programming languages. Today it is the lingua franca of the Internet and is natively supported in most languages.

Using JSON helps NoSQL datastores operate without a schema for data-ingestion purposes and for possible application changes in the future. But this doesn't automatically make them fully schema-agnostic: for querying the database, those solutions still require a schema. The user is typically asked to provide indexing fields, and the queries are performed on those indexes.

The Cosmos DB approach to achieving full schema agnosticism is to automatically index everything upon data ingest and to allow users to query for anything without having to deal with schema or index management.

The question then becomes: what is a good indexing structure to solve the fully schema-agnostic querying problem?

Be the tree

Relational databases have been doing indexing for half a century, but indexing there is highly optimized for relational-schema databases and has limitations. Often a B-tree index per column is employed. While this achieves very fast read and query performance, it becomes inadequate for high-volume writes on big data. Newly inserted data would need to be indexed for each column in the schema using B-trees, causing write-amplification problems. A newly inserted column or a change in schema would lead to updating all the leaves.

Instead of creating an index tree for each column, Cosmos DB employs one index for the whole Cosmos DB container (e.g., a table, a collection, or a graph). This one-index-tree-to-rule-them-all grows as new documents get added to the container. Since the schema variance is often not very wild, the number of shared paths over intermediate schema nodes remains small compared to the number of leaf nodes (instance values). This way the index tree achieves efficiency for updates upon new data/schema insertion and enables searching (range or point queries) for any arbitrary schema or value in the container.

I try to explain how this works in the rest of the post. First, I'd like to clarify that we restrict ourselves to the logical organization of the indexing and don't go down the stack to discuss the physical organization of the index structures. At the logical layer, we don't have to think about the various B-tree implementations in the physical layer: we will just treat the index organization as a sorted-map structure (e.g., as in the sorted map in Java). At the physical organization layer, to gain further efficiency optimizations, Cosmos DB employs the Bw-tree data structure to implement this logical index on flash/SSDs. There are many other efficient implementations of B-trees for different storage devices and scenarios, based on write-ahead-log and log-structured-merge-tree ideas.

I would like to thank Shireesh Thota at Cosmos DB for giving me a crash course on the logical indexing topic. Without his clear explanations, I would be grappling with these concepts for a long long time.


Logical indexing

In our previous post, we discussed how the tree representation of the JSON documents allows the database engine to treat the structure of the document as well as the instance values homogeneously.

We also introduced the index tree that is constructed out of the union of all of the trees representing the individual documents within the container. Each node of the index tree is an index entry containing the label and position values, called the term, and the ids of the documents containing the term, called the postings.
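
Here is a small sketch of that construction (illustrative only; it ignores the encoding details discussed below): each document's JSON tree is flattened into paths, and the index tree keeps, for every node, the set of ids of the documents that contain it, which is exactly the term-to-postings association.

from collections import defaultdict

def paths(node, prefix=()):
    """Yield every (label/position path, leaf value) pair of a JSON-like document."""
    if isinstance(node, dict):
        for label, child in node.items():
            yield from paths(child, prefix + (label,))
    elif isinstance(node, list):
        for position, child in enumerate(node):
            yield from paths(child, prefix + (position,))
    else:
        yield prefix, node

def build_index(docs):
    """Union the document trees: map each node (term) to its postings (doc ids)."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for path, value in paths(doc):
            for depth in range(1, len(path) + 1):   # intermediate schema nodes...
                index[path[:depth]].add(doc_id)
            index[path + (value,)].add(doc_id)      # ...and the leaf instance value
    return index

docs = {1: {"location": [{"country": "France", "city": "Paris"}]},
        2: {"location": [{"country": "Germany"}], "headquarter": "Belgium"}}
index = build_index(docs)
print(index[("location", 0, "country")])            # {1, 2}: a shared schema path
print(index[("location", 0, "country", "France")])  # {1}: an instance value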

The logical index organization

For cost-effective persistence and lookup, the index tree needs to be converted into a storage-efficient representation. At the logical indexing layer, Cosmos DB maps the paths in the index tree to key-value tuples. The value consists of the postings list of the encoded document (or document-fragment) ids. The key consists of the term, representing the encoded path information of the node/path in the index tree, concatenated with a postings entry selector (PES) that helps partition the postings horizontally.

This way, the terms are mapped to the corresponding document ids (i.e., postings) containing them. The resulting sorted map enables the query processing to identify the documents that match the query predicates very quickly. On this sorted map, byte comparison is employed to enable range queries. There is also a reverse path representation to enable efficient point queries. As we'll see below, the logical indexing has a direct impact on what kinds of queries the database can support.
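
Here is a toy sketch of that sorted-map view (plain readable strings stand in for the real hashed/byte-encoded terms): forward-path keys sort byte-wise, so a range predicate becomes a contiguous scan, while a reverse-path key gives a direct point lookup.

from bisect import bisect_left

# term -> postings; the keys are simplified stand-ins for encoded terms.
index = {
    "location/0/country/Austria": {3},
    "location/0/country/France":  {1},
    "location/0/country/Germany": {2},
    "France/country/0":           {1},   # reverse-path term for point lookups
}
keys = sorted(index)

def range_scan(lo, hi):
    """Postings of all terms with lo <= term < hi (a contiguous byte-order scan)."""
    matched = set()
    for key in keys[bisect_left(keys, lo):bisect_left(keys, hi)]:
        matched |= index[key]
    return matched

# WHERE r.location[0].country < "Germany"
print(range_scan("location/0/country/", "location/0/country/Germany"))  # {1, 3}
# WHERE r.location[0].country = "France", via the reverse-path term
print(index["France/country/0"])  # {1}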

We discuss how Cosmos DB represents the terms and the postings efficiently in the next two sections.

Representing terms efficiently


Cosmos DB uses a combination of a partial forward path representation to enable range-query support, and a partial reverse path representation to enable equality/hash support.

The terms for forward paths are byte-encoded to enable range queries such as SELECT * FROM root r WHERE r.Country < "Germany". Yes, you read that right: you can compare at the string level, because strings are byte-encoded to allow that.

The terms for the reverse paths are hash encoded for efficient point querying such as SELECT * FROM root r WHERE r.location[0].country = "France".

Finally, the path representations also allow wildcard queries such as SELECT c FROM c JOIN w IN c.location WHERE w = "France". This is achieved by always bunching the forward and reverse paths into 3 segments, such as location/0/city and 0/city/"Paris", rather than using the full path $/location/0/city/"Paris". This is like the n-gram idea that search engines use. It also reduces the storage cost of the index.
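
Here is a sketch of that 3-segment bunching (again with readable strings instead of the actual encodings): a full path is broken into overlapping triples for the forward representation, and into leaf-first triples for the reverse representation, which is what lets the JOIN/wildcard query above match "Paris" under any array index.

def three_segment_terms(path_segments):
    """Break a full path like $/location/0/city/"Paris" into 3-segment terms."""
    forward = ["/".join(map(str, path_segments[i:i + 3]))
               for i in range(len(path_segments) - 2)]
    leaf_first = list(reversed(path_segments))
    reverse = ["/".join(map(str, leaf_first[i:i + 3]))
               for i in range(len(leaf_first) - 2)]
    return forward, reverse

forward, reverse = three_segment_terms(["location", 0, "city", '"Paris"'])
print(forward)  # ['location/0/city', '0/city/"Paris"']
print(reverse)  # ['"Paris"/city/0', 'city/0/location']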

Partial forward path encoding scheme. To enable efficient range and spatial querying, the partial forward path encoding is done differently for numeric and non-numeric labels. For non-numeric values, each of the 3 path segments is encoded based on all of its characters. The least significant byte of the resulting hash is assigned to the first and second segments. For the last segment, lexicographical order is preserved by storing the full string or a smaller prefix, based on the precision specified for the path.
For a numeric segment appearing as the first or second segment, a special hash function is applied to optimize for non-leaf numeric values. This hash function exploits the fact that most non-leaf numeric values (e.g., enumerations, array indices, etc.) are frequently concentrated between 0-100 and rarely contain negative or large values. A numeric segment occurring in the third position is treated specially: the most significant n bytes (where n is the numeric precision specified for the path) of the 8-byte hash are used, to preserve order.

Partial reverse path encoding scheme. To enable point querying, the term is generated in the reverse order, with the leaf (which gets a higher number of bits in the term) placed first. This scheme also serves wildcard queries, like finding any node that contains the value "Paris", since the leaf node is the first segment.

Representing posting lists efficiently

The postings list captures the document ids of all the documents that contain the given term. The postings list is also bitmap-compressed for efficient querying/retrieval. In order to represent a postings list dynamically (i.e., without a fixed-size/static scheme or pre-reserved space), compactly, and in a manner amenable to fast set operations (e.g., to test for document presence during query processing), Cosmos DB uses the two techniques below.

Partitioning a postings list. Each insertion of a new document into a container is assigned a monotonically increasing document id. The postings list for a given term consists of a variable-length list of postings entries partitioned by a postings entry selector (PES). A PES is a variable-length (up to 7 bytes) offset into the postings entries. The number of PES bytes is a function of the number of documents in the container. The number of postings entries --for a given PES size-- is a function of the document frequency for the document-id range that falls within the PES range. Document ids within 0-16K will use the first postings entry, document ids from 16K-4M will use the next 256 postings entries, and so on. For instance, a container with 2M documents will not use more than 1 byte of PES and will only ever use up to 128 postings entries within a postings list.
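
Here is my back-of-the-envelope reading of that arithmetic (a sketch, not the actual implementation): with 14 bits of document id per postings entry, the entry index and the within-entry offset fall out of a divide and a remainder, which reproduces the 0-16K and 16K-4M ranges as well as the "2M documents fit within 128 entries" figure.

DOCS_PER_ENTRY = 1 << 14          # 16K document ids per postings entry (14 bits)

def locate(doc_id):
    """Which postings entry holds doc_id, and its 14-bit offset within that entry."""
    return doc_id // DOCS_PER_ENTRY, doc_id % DOCS_PER_ENTRY

print(locate(12_000))                     # (0, 12000): ids 0-16K land in the first entry
print(locate(3_000_000))                  # (183, 1728): ids 16K-4M spread over the next entries
print(2_000_000 // DOCS_PER_ENTRY + 1)    # 123: a 2M-document container stays within 128 entries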

Dynamic encoding of postings entries. Within a single partition (pointed to by a PES), each document needs only 14 bits, which can be captured with a short word. However, Cosmos DB optimizes this further. Depending on the distribution, postings words within a postings entry are encoded dynamically using a set of encoding schemes, including (but not restricted to) various bitmap encoding schemes inspired primarily by WAH (Word-Aligned Hybrid). The core idea is to preserve the best encoding for dense distributions (like WAH) but to also work efficiently for sparse distributions (unlike WAH).
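
As a toy illustration of that dense-versus-sparse idea (this is neither WAH nor Cosmos DB's actual encoder), a postings entry can simply pick whichever of a raw bitmap or a sorted offset list is smaller for its distribution of 14-bit offsets:

def encode_entry(offsets, bits=14):
    """Pick the cheaper representation for the 14-bit offsets in one postings entry."""
    bitmap_bytes = (1 << bits) // 8                 # 2 KB fixed-size bitmap
    list_bytes = 2 * len(offsets)                   # 2 bytes per stored offset
    if list_bytes < bitmap_bytes:
        return ("sparse-list", sorted(offsets))
    bitmap = bytearray(bitmap_bytes)
    for off in offsets:
        bitmap[off // 8] |= 1 << (off % 8)
    return ("dense-bitmap", bytes(bitmap))

print(encode_entry({5, 900, 16000})[0])             # sparse-list
print(encode_entry(set(range(0, 16000, 2)))[0])     # dense-bitmap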

Customizing the index

The default indexing policy automatically indexes all properties of all documents. Developers can choose to exclude or include specific documents in the index at the time of inserting or replacing them in the container. Developers can also choose to include or exclude certain paths (including wildcard patterns) from being indexed across documents.

Cosmos DB also supports configuring the consistency of indexing on a container.

Consistent indexing is the default policy. Here the queries on a given container follow the same consistency level as specified for the point reads (i.e., strong, bounded staleness, session, or eventual). The index is updated synchronously as part of the document update (i.e., insert, replace, update, or delete of a document in the container). Consistent indexing supports consistent queries at the cost of a possible reduction in write throughput. This reduction is a function of the unique paths that need to be indexed and the consistency level. The consistent indexing mode is designed for "write quickly, query immediately" workloads.

To allow maximum document-ingestion throughput, a container can instead be configured with lazy indexing, meaning queries are eventually consistent. The index is updated asynchronously when a given replica of a container's partition is quiescent. For "ingest now, query later" workloads requiring unhindered document ingestion, the lazy indexing mode is more suitable.
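
For concreteness, here is roughly what such a policy looks like, written as a Python dict (the field names follow the documented indexing-policy format as I understand it; the excluded path is a made-up example):

# Index everything by default, keep a bulky payload path out of the index,
# and trade query freshness for ingestion throughput with lazy indexing.
indexing_policy = {
    "indexingMode": "lazy",                         # "consistent" is the default
    "automatic": True,
    "includedPaths": [{"path": "/*"}],
    "excludedPaths": [{"path": "/rawPayload/*"}],   # hypothetical excluded path
}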

MAD questions

1. Is this too specialized information?
I am a distributed systems/algorithms person. Logical indexing is a specialized database topic. Does understanding this help me become a better distributed systems researcher?

I would argue yes.  First of all, developing expertise in multiple branches, being a Pi-shaped academician, provides advantages. Aside from that, learning new things stretches your brain and makes it easier to learn other things.

2. How is filtering done within a document?
Cosmos DB also represents documents as binary encodings for efficient storage and querying. When a query returns documents that match the query predicates, instead of filtering records inside the document, Cosmos DB uses the binary-encoding features and performs byte-compares to skim within the document and jump/skip over irrelevant parts quickly. A lot of deduplication is also employed in these encodings. In the coming weeks, I may delve into the physical organization of the index and documents, but I need to track down another expert to help me with that.

For topics that are far outside my expertise, it is very helpful to get a first introduction from an expert. Learning from Shireesh was very fun. An expert makes even the most complicated topics look easy and understandable. This is an interesting paradigm shift you will experience at some point, if you haven't already: when you don't understand a topic, often the problem is that it is not presented very competently. The corollary to this epiphany is that if you are unable to explain something simply and in an accessible way, you haven't mastered it yet.

Friday, August 17, 2018

Replicated Data Consistency Explained Through Baseball

I had mentioned this report in my first overview post about my sabbatical at Cosmos DB. This is a 2011 technical report by Doug Terry, who was working at Microsoft Research at the time. The paper explains the consistency models through publishing of baseball scores via multiple channels. (Don't worry, you don't have to know baseball rules or scoring to follow the example; the example works as well for publishing of, say, basketball scores.)

Many consistency models have been proposed for distributed and replicated systems. The reason for exploring different consistency models is that there are fundamental tradeoffs between consistency, performance, and availability.

While seeing the latest written value for an object is desirable and reasonable to provide within a datacenter, offering strong consistency for georeplicated services across continents results in lower performance and reduced availability for reads, writes, or both.

On the opposite end of the spectrum, with eventual consistency, the data returned by a read operation is the value of the object at some point in the past, yet the performance and availability dimensions are greatly improved.

The paper argues that eventual consistency is insufficient for most of the use cases, but strong consistency is not needed either; most use-cases benefit from some intermediate consistency guarantees.

This concurs with Cosmos DB production workloads. As I mentioned in my first overview post, about 73% of Cosmos DB tenants use session consistency and 20% prefer bounded staleness. This relegates strong consistency and eventual consistency to the fringes. After I introduce the model and the consistency guarantees the paper considers, I will talk about how these map to the consistency levels used in Cosmos DB.

Read consistency guarantees

The paper considers 6 consistency guarantees. These are all examples of the read-write centric operation consistency we discussed in the "Many Faces of Consistency" paper.

The paper considers a simple abstract model for the data store. In this model, clients perform write operations at a master node in the data store. Writes are serialized and eventually performed/replicated, in the same order as at the master, at all the servers (a.k.a. replicas). Clients perform reads at the servers/replicas. Reads return the values of data objects that were previously written, though not necessarily the latest values. This is because the entire set of writes at the master may not yet be reflected, in order and in its entirety, at the servers.

The 6 consistency guarantees below are defined by which set of previous writes are visible to a read operation.

Strong consistency ensures that a read operation returns the value that was last written for a given object. In other words, a read observes the effects of all previously completed writes: if write operations can modify or extend portions of a data object, such as appending data to a log, then the read returns the result of applying all writes to that object. (Note that in order to achieve strong consistency in the presence of crashes, the write operation at the master should go in lockstep with a quorum of replicas to permanently record writes, requiring synchronous replication!)

Eventual consistency is the weakest of the guarantees, so it allows the greatest set of possible return values. Such a read can return results from a replica that has received an arbitrary subset of the writes to the data object being read.

By requesting a consistent prefix, a reader is guaranteed to observe an ordered sequence of writes starting with the first write to a data object. In other words, the reader sees a version of the data store that existed at the master at some time in the past.

Bounded staleness ensures that read results are not too stale. That is, the read operation is guaranteed to see at least all writes that completed d time units (or d updates) before the read started. The read may potentially see some more recently written values.

Monotonic Reads (sometimes also called a session guarantee) is a property that applies to a sequence of read operations performed by a given client. It states that if the client issues a read operation and then later issues another read to the same object(s), the second read will return the same value(s) or the results of later writes.

Read My Writes is also a session property specific to a client. It guarantees that the effects of all writes that were performed by the client are visible to the client's subsequent reads. If a client writes a new value for a data object and then reads this object, the read will return the value that was last written by the client (or some other value that was later written by a different client).

None of these last four read guarantees is stronger than the others, so applications may want to combine several of these guarantees. For example, a client could request monotonic reads and read my writes so that it observes a data store that is consistent with its own actions.


The table shows the performance and availability typically associated with each consistency guarantee. Strong consistency is desirable from a consistency viewpoint but offers the worst performance and availability since it generally requires reading from a majority of replicas. Eventual consistency, on the other hand, allows clients to read from any replica, but offers the weakest consistency. The table illustrates the tradeoffs involved as each guarantee offers a unique combination of consistency, performance, and availability.

Cosmos DB consistency levels


Cosmos DB allows developers to choose among 5 well-defined consistency models along the consistency spectrum. While the definitions of strong, eventual, and consistent prefix are the same as the ones discussed in the report, Cosmos DB strengthens the definitions for bounded staleness and session consistency, making them more useful for the clients.

More specifically, Cosmos DB's bounded staleness is strengthened to offer total global order except within the "staleness window". In addition, monotonic read guarantees exist within a region both inside and outside the "staleness window".

Cosmos DB's session consistency (again scoped to a client session) is strengthened to include consistent prefix, monotonic writes, read my writes, and write-follows-reads in addition to the monotonic read property. As such, it is ideal for all scenarios where a device or user session is involved. (You can check the clickable consistency map by Kyle Kingsbury to read about the definitions of monotonic writes and write-follows-reads, which impose order on the writes.)

It is possible to sign up for a free trial (with no credit card or commitment) to test these guarantees on Cosmos DB. Once you create a Cosmos DB resource, you can also see an animation of the consistency guarantees after selecting the default consistency option from the tab. I am pasting screenshots of the strong consistency and bounded staleness animations below.


While the animations are nice for intuitively understanding the consistency levels, inside Cosmos DB the TLA+ specification language is used for specifying these models precisely and model checking them together with the global distribution protocols considered. In the coming weeks, as I promised, I will try to sanitize and publish, from a client-side perspective, the TLA+ specifications for these consistency levels.

To put their money where their mouth is, Cosmos DB offers comprehensive 99.99% SLAs which guarantee throughput, consistency, availability, and latency for Cosmos DB database accounts scoped to a single Azure region configured with any of the five consistency levels, or database accounts spanning multiple regions, configured with any of the four relaxed consistency levels. Furthermore, independent of the choice of a consistency level, Cosmos DB offers a 99.999% SLA for read and write availability for database accounts spanning two or more regions. I will dedicate a separate blog post on SLAs in the coming weeks, as I start learning more about them.

Baseball analogy

This toy example assumes that the score of the game is recorded in the key-value store in two objects, one for the number of runs scored by the "visitors" and one for the "home" team's runs. When a team scores a run, a read operation is performed on its current score, the returned value is incremented by one, and the new value is written back to the key-value store.

This sequence of writes comes from a hypothetical baseball game with the inning-by-inning line score; the game is currently in the middle of the seventh inning, and the home team is winning 2-5.

Different read guarantees may result in clients reading different scores for this game in progress. The table below lists the complete set of scores that could be returned by reading the visitors and home scores with each of the 6 consistency guarantees. The visitors' score is listed first, and different possible return values are separated by commas.

A strong consistency read can only return one result, the current score. On the other hand, an eventual consistency read can return one of 18 possible scores, many of which were never the actual score. The consistent prefix property limits the results to scores that actually existed at some time. The results that can be returned by a bounded staleness read depend on the desired bound.
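
To double-check those counts, here is a small sketch (the global write order below is one hypothetical order consistent with the 2-5 score; the report's inning-by-inning order may differ). Eventual consistency lets each object independently return any value it ever held, while consistent prefix only exposes states that actually existed at the master.

# One hypothetical global write order ending at visitors 2, home 5.
writes = [("home", 1), ("visitors", 1), ("home", 2), ("home", 3),
          ("visitors", 2), ("home", 4), ("home", 5)]

def consistent_prefix_scores(writes):
    """Scores (visitors, home) that existed after some prefix of the write sequence."""
    state, scores = {"visitors": 0, "home": 0}, {(0, 0)}
    for key, value in writes:
        state[key] = value
        scores.add((state["visitors"], state["home"]))
    return scores

def eventual_scores(writes):
    """Each object may return its initial 0 or any value ever written to it."""
    visitors = {0} | {v for k, v in writes if k == "visitors"}
    home = {0} | {v for k, v in writes if k == "home"}
    return {(v, h) for v in visitors for h in home}

print(len(eventual_scores(writes)))              # 18 possible scores
print(sorted(consistent_prefix_scores(writes)))  # only scores that actually existed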

The paper considers 6 hypothetical participants querying the baseball database for scores: the scorekeeper, umpire, radio reporter, sportswriter, statistician, and the stat-watcher. The table lists the consistencies that these participants use. Of course, each participant would be okay with strong consistency, but by relaxing the consistency requested for her reads, she will likely observe better performance and availability. Additionally, the storage system may be able to better balance the read workload across servers, since it has more flexibility in selecting servers to answer weak-consistency read requests.


The toy example is meant to illustrate that the desired consistency depends as much on who is reading the data as on the type of data. All 6 of the presented consistency guarantees are useful, because each guarantee appears at least once in the table of participant needs. That is, different clients may want different consistencies even when accessing the same data.

Discussion

The report (here is the talk by Doug Terry on the report, if you would like a more immersive/extensive discussion of the topic) concludes as follows:
Clients should be able to choose their desired consistency. The system cannot possibly predict or determine the consistency that is required by a given application or client. The preferred consistency often depends on how the data is being used. Moreover, knowledge of who writes data or when data was last written can sometimes allow clients to perform a relaxed consistency read, and obtain the associated benefits, while reading up-to-date data. This could be of practical significance since the inherent trade-offs between consistency, performance, and availability are tangible and may become more pronounced with the proliferation of georeplicated services. This suggests that cloud storage systems should at least consider offering a larger choice of read consistencies.

As I discussed above, Cosmos DB fits the bill. It provides 5 consistency level choices in an all-in-one package. It is easy to configure the consistency level on the fly, and the effects take place quickly. Cosmos DB also gives you the flexibility to override the default consistency level you configured for your account on a specific read request. It turns out only about 2% of Cosmos DB tenants override consistency levels on a per-request basis.

MAD questions

1. What are the effects of granularity of writes on consistency?
The paper said: "We assume that the score of the game is recorded in a key-value store in two objects, one for the number of runs scored by the visitors and one for the home team's runs."

Why keep two separate objects, "home" and "visitors", though? Instead, what if we used just one object called "score" that consists of a tuple <visitors, home>? Then the reads would be more consistent by design; even an eventual consistency read would not return a score that was not an actual score at some point in the game.

Sometimes you don't get to design the data storage/access schemes, but when you get to decide this, you can act smartly and improve consistency. This reminds me of techniques used in self-stabilizing systems for compacting the state space to forbid bad states by construction.
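
Continuing the sketch from the baseball example above (with the same hypothetical write order), modeling the score as a single tuple-valued object means even an eventual read returns only states that were real scores at some point:

# Same game, but every write stores the whole (visitors, home) tuple.
tuple_writes = [(0, 1), (1, 1), (1, 2), (1, 3), (2, 3), (2, 4), (2, 5)]

# An eventual read returns the initial value or any value ever written,
# every one of which was an actual score.
eventual = {(0, 0)} | set(tuple_writes)
print(len(eventual))   # 8 possible results, all of them real scores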

2. What are the theoretical limits on the tradeoffs among consistency levels?
As we have seen, each proposed consistency model occupies some point in the complex space of tradeoffs. The CAP theorem shows a coarse tradeoff between consistency and availability. The PACELC model tries to capture further tradeoffs between consistency, availability, and latency.

More progress has been made in exploring the tradeoff space from a theoretical perspective since then. The causal consistency result showed that natural causal consistency, a strengthening of causal consistency that respects the real-time ordering of operations, provides a tight bound on the consistency semantics that can be enforced without compromising availability and convergence. There has been a plethora of papers recently on improvements to causally consistent georeplicated datastores, and I hope to summarize the prominent ones in the coming weeks.

3. Is baseball scoring also nonlinear?
Baseball is still foreign to me, even though I have been in the US for 20 years now. While I try to be open-minded and eager to try new things, I can be stubborn about not learning some things. Here is another example where I was unreasonably close-minded. Actually, though I had seen this paper earlier, I didn't read it then, because I thought I would have to learn the baseball rules to understand it. Funny? No! I wonder what other opportunities I miss because of being peculiarly close-minded about certain things.

Well, to win a battle against my obstinate and peculiar ignorance of baseball, I just watched this 5-minute explanation of baseball rules. It turns out it is not that complicated. Things never turn out as scary/complicated/bad as I make them out to be in my mind.

I had written earlier about how bowling scoring is nonlinear and about the consequences of that. Does baseball scoring also have a nonlinear return? It looks like you can get a lot of runs scored in a single inning. So maybe that counts as nonlinearity, since it can change the game score quickly, in any inning.