Posts

Showing posts with the label Cosmos DB

Going Beyond an Incident Report with TLA+

This is a short article in USENIX ;login: about the use of TLA+ to explain the root cause of an Azure incident. It seems to be part of a longer arXiv paper on the use of TLA+ for exploring/explaining data consistency in Cosmos DB. Markus Kuppe is an author here. Markus has been heading the development of TLC, the TLA+ model checker, for more than a decade. If you've used the TLA+ model checker, you have used Markus's work. If you are an intermediate/advanced user, it is very likely that you interacted with Markus and got help from him at some point.

The incident

The Microsoft Azure incident was a 28-day outage. It looks like the incident went undetected/unreported for 26 days, because it was a partial outage. "A majority of requests did not fail -- rather, a specific type of request was disproportionately affected, such that global error rates did not reveal the outage despite a specific group of users being impacted." For each request serviced by the work dispatcher, it would (1)...

Azure Cosmos DB: Microsoft's Cloud-Born Globally Distributed Database

It has been almost 9 months since I started my sabbatical work with the Microsoft Azure Cosmos DB team. I knew what I signed up for then, and I knew it was overwhelming. It is hard not to get overwhelmed. Cosmos DB provides a global, highly-available, low-latency, all-in-one database/storage/querying/analytics service to heavyweight, demanding businesses. Cosmos DB is used ubiquitously within Microsoft systems/services, and is also one of the fastest-growing services used by Azure developers externally. It manages 100s of petabytes of indexed data, serves 100s of trillions of requests every day from thousands of customers worldwide, and enables customers to build highly-responsive, mission-critical applications. But I underestimated how much there is to learn and how long it would take to develop a good sense of the big picture. By "developing a good sense of the big picture", I mean learning/internalizing the territory myself, and, when looking at the terrain, bei...

Paper review. Sharding the Shards: Managing Datastore Locality at Scale with Akkio

This paper by Facebook, which appeared in OSDI'18, describes Akkio, Facebook's data locality management service. Akkio has been in production use at Facebook since 2014. It manages over 100PB of data and processes over 10 million data accesses per second.

Why do we need to manage locality?

Replicating all data to all datacenters is difficult to justify economically (due to the extra storage and WAN networking costs) when acceptable durability and request-serving latency can be achieved with 3 replicas. It looks like Facebook had been doing full replication (at least for the ViewState and AccessState applications discussed in the evaluation) to all 6 datacenters back in the day, but as operations and the number of datacenters grew, this became untenable. So, let's find suitable home-bases for data, instead of fully replicating it to all datacenters. But the problem is that access locality is not static. What was a good location/configuration for the data ceases to be suitable...
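To make the home-base idea concrete, here is a minimal TLA+ sketch of migration-on-locality-shift. This is my illustration, not Akkio's actual design: all the names (Data, Datacenters, home, lastAcc) are made up, and Akkio's placement service decides based on windowed access counters rather than a single latest access.

CONSTANTS Data, Datacenters
VARIABLES home,    \* home[d]: the datacenter currently hosting d's replica set
          lastAcc  \* lastAcc[d]: the datacenter that issued d's latest access

Init == /\ home \in [Data -> Datacenters]
        /\ lastAcc = home

Access(d, dc) ==  \* a client in datacenter dc accesses datum d
    /\ lastAcc' = [lastAcc EXCEPT ![d] = dc]
    /\ UNCHANGED home

Migrate(d) ==  \* locality shifted away from the home base: move the data
    /\ lastAcc[d] # home[d]
    /\ home' = [home EXCEPT ![d] = lastAcc[d]]
    /\ UNCHANGED lastAcc

Next == \E d \in Data:
          \/ \E dc \in Datacenters: Access(d, dc)
          \/ Migrate(d)

The point of the sketch is the shape of the state machine: placement is per-datum state that chases the access pattern, instead of a static replicate-everywhere configuration.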

SDPaxos: Building efficient semi-decentralized geo-replicated state machines

In the last decade, the Paxos protocol family grew with the addition of new categories:

Rotating leader: Mencius
Leaderless: EPaxos, Fast Paxos
Paxos federations: Spanner, Vertical Paxos
Dynamic key-leader: WPaxos

This paper, which appeared in SoCC'18, proposes SDPaxos, which prescribes separating the control plane (single leader) from the replication plane (multiple leaders). SD in SDPaxos stands for "semi-decentralized". The motivation for this stems from the following observation. The single-leader Paxos approach has a centralized leader and runs into performance bottlenecks. On the other hand, the leaderless (or opportunistic multileader) approach is fully decentralized but suffers from the conflicting-command problem. Taking a hybrid approach to capture the best of both worlds, SDPaxos makes the command-leaders decentralized (the closest replica can lead the command), but the ordering-leader (i.e., the sequencer) is still centralized/unique in t...
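Here is a minimal TLA+ sketch of that separation. It is my illustration under heavy simplifying assumptions, not the paper's spec: quorums, failures, and message traffic are abstracted away, and the names (Commands, replicated, order, nextSeq) are made up. A command's body can be replicated by any command-leader, the unique sequencer assigns its global slot, and the command commits once both have happened, in either order.

EXTENDS Naturals, TLC

CONSTANT Commands
VARIABLES replicated,  \* commands whose body reached a replication quorum
          order,       \* partial function: command -> its global sequence number
          nextSeq      \* next slot the sequencer will hand out

Init == /\ replicated = {}
        /\ order = [c \in {} |-> 0]  \* the empty function: nothing ordered yet
        /\ nextSeq = 1

Replicate(c) ==  \* any replica can act as command-leader for c
    /\ c \notin replicated
    /\ replicated' = replicated \cup {c}
    /\ UNCHANGED <<order, nextSeq>>

Assign(c) ==  \* only the sequencer extends the global order
    /\ c \notin DOMAIN order
    /\ order' = order @@ (c :> nextSeq)
    /\ nextSeq' = nextSeq + 1
    /\ UNCHANGED replicated

Committed(c) == c \in replicated /\ c \in DOMAIN order

Next == \E c \in Commands: Replicate(c) \/ Assign(c)

Because Replicate and Assign are independent actions, replication can proceed at the nearest replica without waiting for the sequencer, yet the total order stays unique because only Assign increments nextSeq.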

TLA+ specification of the bounded staleness and strong consistency guarantees

In my previous post, I presented a TLA+ model of a distributed data store that provides the consistent prefix property. In this post, I extend that model slightly to build bounded staleness and strong consistency. In fact, the strong consistency specification is achieved when we take the Delta on bounded staleness as 1. The TLA+ (well, PlusCal to be more accurate) code for this model is available at https://www.dropbox.com/s/qvmhhgjf9iycaca/boundedstrongP.tla?dl=0

The system model

As in the previous post, we assume there is a write region (with ID=1) and read regions (with IDs 2 through NumRegions). The write region performs the write and copies it to the read regions. There are FIFO communication channels between the write and read regions.

WriteRegion == 1
ReadRegions == 2..NumRegions
chan = [n \in 1..NumRegions |-> <<>>];

We use D to denote the Delta for the bounded staleness consistency. Bounded staleness ensures that read results are not too stale. ...
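As a sketch of the property itself, assuming the writes applied at region r are kept as a sequence db[r] (an assumed name; the spec at the Dropbox link may phrase this differently, and Len/SubSeq come from the Sequences module), bounded staleness says every read region holds an in-order prefix of the write region's log and lags it by at most D writes:

BoundedStaleness ==
    \A r \in ReadRegions:
        /\ Len(db[r]) <= Len(db[WriteRegion])              \* reads never run ahead
        /\ Len(db[WriteRegion]) - Len(db[r]) <= D          \* and lag by at most D writes
        /\ db[r] = SubSeq(db[WriteRegion], 1, Len(db[r]))  \* same writes, same order

Per the post, taking the Delta D as 1 yields the strong consistency specification.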

TLA+ specification of the consistent prefix guarantee

We covered consistency properties in the recent posts. Now, to keep it more concrete and to give you a taste of TLA+, I present a model of a distributed data store that provides the consistent prefix property. In my next post, I will extend this model slightly to present the bounded staleness and strong consistency properties. And, in the post after that, I will add a client model to show how this data store can be extended to provide session consistency. The TLA+ (well, PlusCal to be more accurate) code for this model is available here.

The system model

We assume there is a write region (with ID=1) and read regions (with IDs 2 through NumRegions). The write region performs the write and copies it to the read regions. There are FIFO communication channels between the write and read regions.

WriteRegion == 1
ReadRegions == 2..NumRegions
chan = [n \in 1..NumRegions |-> <<>>];

The write region actions

The write region has 3 actions to perform: Comm...
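As a sketch of the guarantee this model provides, again assuming the writes applied at region r are kept as a sequence db[r] (an assumed name, not necessarily the model's own), consistent prefix means a read region never observes writes out of order or with gaps: its log is always a prefix of the write region's log. TLC can check this as an invariant:

ConsistentPrefix ==
    \A r \in ReadRegions:
        /\ Len(db[r]) <= Len(db[WriteRegion])
        /\ db[r] = SubSeq(db[WriteRegion], 1, Len(db[r]))

The FIFO channels are what make this invariant hold: each read region applies the write region's updates in the order they were sent, so its log can only ever grow by extending that prefix.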
