Azure Cosmos DB

I started my sabbatical work with the Microsoft Azure Cosmos DB team recently. I have been in talks and collaboration with the Cosmos DB people, and specifically with Dharma Shukla, for over 3 years. I have been very impressed with what they were doing and decided that this would be the best place to spend my sabbatical year.

The travel and settling down took time. I will write about those later. I will also write about my impressions of the greater Seattle area as I discover more about it. This was a big change for me after having stayed in Buffalo for 13 years. I love the scenery: everywhere I look I see a gorgeous lake or hill/mountain scene. And, oh my God, there are blackberries everywhere! It looks like the Himalayan blackberries is the invasive species here, but I can't complain. As I go on an evening stroll with my family, we snack on blackberries growing along the sidewalk. It seems like we are the only ones doing so ---people in US are not much used to eating fruits off from trees/bushes.

Ok, coming back to Cosmos DB... My first impressions--from an insider-perspective this time-- about Cosmos DB is also very positive and overwhelming. It is hard not to get overwhelmed. Cosmos DB provides a global highly-available low-latency all-in-one database/storage/querying/analytics service to heavyweight demanding businesses. Cosmos DB is used ubiquitously within Microsoft systems/services, and is also one of the fastest-growing services used by Azure developers externally. It manages 100s of petabytes of indexed data, and serves 100s of trillions of requests every day from thousands of customers worldwide, and enables customers to build highly-responsive mission-critical applications.

I find that there are a lot of things to learn before I can start contributing in a meaningful and significant way to the Cosmos DB team. So I will use this blog to facilitate, speed up, and capture my learning. The process of writing helps me detach, see the big picture, and internalize stuff better. Moreover my blog also serves as my augmented memory and I refer back to it for many things.

Here is my first attempt at an overview post. As I get to know Cosmos DB better, I hope to give you other more-in-depth overview posts.

What is Cosmos DB?

Cosmos DB is Azure's cloud-native database service.

(The term "cloud-native" is a loaded key term, and the team doesn't use it lightly. I will try to unpack some of it here, and I will revisit this in my later posts.)

It is a database that offers frictionless global distribution across any number of Azure regions ---50+ of them! It enables you to elastically scale throughput and storage worldwide on-demand quickly, and you pay only for what you provision. It guarantees single-digit-millisecond latencies at the 99th percentile, supports multiple consistency models, and is backed by comprehensive service level agreements (SLAs).

I am most impressed with its all-in-one capability. Cosmos DB seamlessly supports many APIs, data formats, consistency levels, and needs across many regions. This alleviates data integration pains which is a major problem for all businesses. The all-in-one capability also eliminates the developer effort wasted into keeping multiple systems with different-yet-aligned goals in sync with each other. I had written earlier about the Lambda versus Kappa architectures, and how the pendulum is all the way to Kappa. Cosmos DB all-in-one gives you the Kappa benefits.

This all-in-one capability backed with global-scale distribution enables new computing models as well. The datacenter-as-a-computer paper from 2009 had talked about the vision of warehouse scale machines. By providing a frictionless globe-scale replicated database, CosmosDB opens the way to thinking about the globe-as-a-computer. One of the usecases I heard from some Cosmos DB customers amazed me. Some customers allocate a spare region (say Australia) where they have no read/write clients as an analytics region. This spare region still gets consistent data replication and stays very up-to-date and is employed for running analytics jobs without jeopardizing the access latencies of real read-write clients. Talk about disaggregated computation and storage! This is disaggregated storage, computing, analytics, and serverless across the globe. Under this model, the globe becomes your playground.

This disaggregated yet all-in-one computing model also manifests itself in customer acquisition and settling in Cosmos DB. Customers often come for the query serving level, which provides high throughput and low-latency via SSDs. Then they get interested and invest into the lower-throughput but higher/cheaper storage options to store terrabytes and petabytes of data. They then diversify and enrich their portfolio further with analytics, event-driven lambda, and real-time streaming capabilities provided in Cosmos DB.

There is a lot to discuss, but in this post I will only make a brief introduction to the issues/concepts, hoping to write more about them later. My interests are of course at the bottom of the stack at the core layer, so I will likely dedicate most of my coming posts to the core layer.

Core layer

The core layer provides capabilities that the other layers build upon. These include global distribution, horizontally and independently scalable storage and throughput, guaranteed single-digit millisecond latency, tunable consistency levels, and comprehensive SLAs.

Resource governance is an important and pervasive component of the core layer. Request units (allocating CPU, memory, throughput) is the currency to provision the resources. Provisioning a desired level of throughput through dynamically changing access patterns and across a heterogeneous set of database operations presents many challenges. To meet the stringent SLA guarantees for throughput, latency, consistency, and availability, Cosmos DB automatically employs partition splitting and relocation. This is challenging to achieve as Cosmos DB also handles fine-grained multi-tenancy with 100s of tenants sharing a single machine and 1000s of tenants sharing a single cluster each with diverse workloads and isolated from the rest. Adding even more to the challenge, Cosmos DB supports scaling database throughput and storage independently, automatically, and swiftly to address the customer's dynamically changing requirements/needs.

To provide another important functionality, global distribution, Cosmos DB enables you to configure the regions for "read", "write", or "read/write" regions. Using Azure Cosmos DB's multi-homing APIs, the app always knows where the nearest region is (even as you add and remove regions to/from your Cosmos DB database) and sends the requests to the nearest datacenter. All reads are served from a quorum local to the closest region to provide low latency access to data anywhere in the world.

Cosmos DB allows developers to choose among five well-defined consistency models along the consistency spectrum. (Yay, consistency levels!) You can configure the default consistency level on your Cosmos DB account (and later override the consistency on a specific read request).  About 73% of Azure Cosmos DB tenants use session consistency and 20% prefer bounded staleness. Only 2% of Azure Cosmos DB tenants override consistency levels on a per request basis. In Cosmos DB, reads served at session, consistent prefix, and eventual consistency are twice as cheap as reads with strong or bounded staleness consistency.

This lovely technical report explains the consistency models through publishing of baseball scores via multiple channels. I will write a summary of this paper in the coming days. The paper concludes: "Even simple databases may have diverse users with different consistency needs. Clients should be able to choose their desired consistency. The system cannot possibly predict or determine the consistency that is required by a given application or client. The preferred consistency often depends on how the data is being used. Moreover, knowledge of who writes data or when data was last written can sometimes allow clients to perform a relaxed consistency read, and obtain the associated benefits, while reading up-to-date data."

Data layer

Cosmos DB supports and projects multiple data models (documents, graphs, key-value, table, etc.) over a minimalist type system and core data model: the atom-record-sequence (ARS) model.

A Cosmos DB resource container is a schema-agnostic container of arbitrary user-generated JSON items and JavaScript based stored procedures, triggers, and user-defined-functions (UDFs). Container and item resources are further projected as reified resource types for a specific type of API interface. For example, while using document-oriented APIs, container and item resources are projected as collection and document resources respectively. Similarly, for graph-oriented API access, the underlying container and item resources are projected as graph, node/edge resources respectively.

The overall resource model of an application using Cosmos DB is a hierarchical overlay of the resources rooted under the database account, and can be navigated using hyperlinks.

API layer

Cosmos DB supports three main classes of developers: (1) those familiar with relational databases and prefer SQL language, (2) those familiar with dynamically typed modern programming languages (like JavaScript) and want a dynamically typed efficiently queryable database, and (3) those who are already familiar with popular NoSQL databases and want to transfer their application to Azure cloud without a rewrite.

In order to meet developers wherever they are, Cosmos DB supports SQL, MongoDB, Cassandra, Gremlin, Table APIs with SDKs available in multiple languages.

The ambitious future

There is a palpable buzz in the air in Cosmos DB offices due to the imminent multimaster general availability rollout (which I will also write about later). The team members keep it to themselves working intensely most of the time, but would also have frequent meetings and occasional bursty standup discussions. This is my first deployment in a big team/company, so I am trying to take this in as well. (Probably a post on that is coming up as well.)

It looks like the Cosmos DB team caught a good momentum. The team wants to make Cosmos DB the prominent cloud database and even the go-to all-in-one cloud middleware. Better analytic support and better OLAP/OLTP integration is in the works to support more demanding more powerful next generation applications.

Cosmos DB already has great traction in enterprise systems. I think it will be getting more love from independent developers as well, since it provides serverless computing and all-in-one system with many APIs. It is possible to try it for free at To keep up to date with the latest news and announcements, you can follow @AzureCosmosDB and #CosmosDB on Twitter.

My work at Cosmos DB

In the short term, as I learn more about Cosmos DB, I will write more posts like this one. I will also try to learn and write about the customer usecases, workloads, and operations issue without revealing details. I think learning about the real world usecases and problems will be one of the most important benefits I will be able to get from my sabbatical.

In the medium term, I will work on TLA+/PlusCal translation of consistency levels provided by Cosmos DB and share them here. Cosmos DB uses TLA+/PlusCal to specify and reason about the protocols. This helps prevent concurrency bugs, race conditions, and helps with the development efforts. TLA+ modeling has been instrumental in Cosmos DB's design which integrated global distribution, consistency guarantees, and high-availability from the ground-up. (Here is an interview where Leslie Lamport shares his thoughts on the foundations of Azure Cosmos DB and his influence in the design of Azure Cosmos DB.) This is very dear to my heart, as I have been employing TLA+ in my distributed systems classes for the past 5 years.

Finally, as I get a better mastery of Cosmos DB internals, I like to contribute to protocols on multimaster multirecord transaction support. I also like to learn more about and contribute to Cosmos DB's automatic failover support during one or more regional outages. Of course, these protocols will all be modeled and verified with TLA+.

MAD questions

1. What would you do with a frictionless cloud middleware? Which new applications can this enable?
Here is something that comes to my mind. Companies are already uploading IOT sensor data from cars to Azure CosmosDB continuously. Next step would be to build more ambitious applications that make sense of correlated readings and use these to coordinate actions. Applications of this could be in traffic shaping/management of self-driving car squadrons. These applications will likely combine OLAP (including powerful machine-learning processes) and OLTP (including real-time streaming and actions).

2. What are some patterns in globe-as-a-computer paradigm?
Is this a truly new paradigm or is it just a degree of convenience? Is it possible to argue that this is only incremental and not transformational? But, then, a great degree of incremental capability translates to transformational change.

3. What are other things you like to learn about at a global-scale database?
Let me know in the comments.


Anonymous said…
As a long-time Microsoftee and reader of your blog both, I think this is very cool!

If you're interested in hearing about other distributed data tech in Microsoft, please ping me (alias is zivc with the obvious suffix) -- I couldn't find your name in the GAL.
Unknown said…
We have been using Cosmos DB for almost one year now and I am always interested to know more about the internals. My current interest is in better understanding how Cosmos offers the available consistency levels ... i.e. how things work under the covers ... as this always seems to better enable me to internalize the tradeoffs when selecting one consistency level versus another. The documentation at present isn't terrible, but tends to be very high level ... I want more meat. :)

Popular posts from this blog

Learning about distributed systems: where to start?

Hints for Distributed Systems Design

Foundational distributed systems papers

Metastable failures in the wild

Scalable OLTP in the Cloud: What’s the BIG DEAL?

SIGMOD panel: Future of Database System Architectures

The end of a myth: Distributed transactions can scale

There is plenty of room at the bottom

Distributed Transactions at Scale in Amazon DynamoDB

Dude, where's my Emacs?