Is Scalable OLTP in the Cloud a Solved Problem? (CIDR 2023)

This paper appeared in CIDR 2023 recently. It is a thought provoking big picture paper, coming from a great  team.

The paper draws attention to the divide between conventional wisdom on building scalable OLTP databases (shared-nothing architecture) and how they are built and deployed on the cloud (shared storage architecture). There are shared nothing systems like CockroachDB, Yugabyte, and Spanner, but the overwhelming trend/volume on cloud OLTP is shared storage, and even with single writer, like AWS Aurora, Azure SQL HyperScale, PolarDB, and AlloyDB.

The paper doesn't analyze this divide in detail, and jumps off to discuss shared storage with multiple writer as a more scalable way to build cloud OLTP databases. It would have been nice to go deep on this divide. I think the big benefit from shared storage is the disaggregation we get between compute and storage, which allows us to scale/optimize compute and storage resources independently based on demand. Shared storage also allows scaling reads, which where most of the workload skews to.  

As the paper promotes multi-writer shared storage, it glorifies the promises of coherent caching for improving throughput and scalability. The paper does mention the complexity and coordination cost of coherent caching but fails to talk about other downsides of caching with respect to metastable failures and modal behavior. The paper doesn't have a good answer for solving the complexity and coordination challenges for caching, and I am not sold on this part.

The second half of the paper is an overview of ScaleStore, a multi-writer shared-storage database the authors presented last year at Sigmod. ScaleStore provides sequential/prefix consistency and lacks transaction support, as it doesn't implement isolation yet. It will be a challenge to implement that in a scalable way, but the paper cites a couple directions for implementing coherent OLTP cache support more efficiently.

Now that you have gotten the TL;DR, read on if you are interested in diving deeper.

Partitioned writer (shared nothing) design

In this design, the database is split into partitions, each of which can be updated by only one RW-node. For splitting the database, modern systems use consistent hashing or fine-grained range partitioning. This design provides elasticity to scale, and can provide good performance for mostly local workload characteristics. A big drawback of this design is that it couples storage and compute together. When one part becomes the bottleneck, you need to split and get a new partition, even though the other part is under-utilized. 

Update: Cloud DBMS blur the lines here. It was pointed out that: "Spanner has a distributed disaggregated storage system underneath and can redistribute load without moving data (unlike CockroachDB or Yugabyte)." With this, it is possible to put Spanner on the spectrum of shared-multi-writer described below.

Single-writer design

In this design a single read-write node (RW-node) processes update transactions, while multiple read-only nodes (RO-nodes) handle read-only transactions. Read throughput can scale well by spawning more RO-nodes, but writes are limited by the capacity of the single RW-node. 

Do note that, while the figure shows shared storage as a single unit, it is actually replicated over many nodes (in the case of Aurora over 6 storage nodes) across availability zones (AZ).

This shared-storage single-writer design offers significant benefits:

  • cloud DBMSs with a replicated shared-storage can spawn RO-nodes fast and react to an increase in read-only transactions, so you get fast failover and elasticity for read-dominated workloads
  • the separation of compute and storage allows flexibility in handling growing data sets
  • the complexity of implementation and operation is low due to the simple single-writer design

The downside is the write bottleneck on the single writer. All writes need to be routed the single writer and needs to be applied there.

Shared-writer (multiple-writer) design

The shared-writer shared storage design allows multiple RW-nodes to modify data items. As such, they avoid the single writer bottleneck, and offer the  flexibility to react to changes in workload distribution dynamically.

In contrast to the previous two designs (partitioned-writer and single-writer shared storage), there are no clear swim-lanes between multiple writers. This increases the complexity of concurrency control for keeping the data consistent across writers. One way of doing this is to require that all writers always write and read from the shared storage to get the ground truth. This adds latency to the data access. Another way of doing this would be to maintain a  coherent cache. This reduces the latency of always going to the storage (especially for frequently accessed and infrequently updated data), but increases the complexity (also latency?) because now the writers need to coordinate to implement a cache coherence protocol.

This design does not fully solve the skewed writes problem. Each update must be exclusively handled by a single node to keep the data consistent. The cache coherence protocol may lead to the write privilege bouncing between compute nodes since all compute nodes can modify all data items. Solving this is challenging as it requires complex synchronization among writers.



As I mentioned earlier, the second half of the paper is an overview of ScaleStore, a multi-writer shared-storage database the authors presented last year at Sigmod. ScaleStore uses a directory-style invalidation-based consistency protocol which provides sequential consistency at page-granularity. Whenever a node plans to modify a page, other nodes must invalidate the page. Instead of broadcasting the modification intent to all nodes, ScaleStore uses directories to track nodes that currently cache the page and only notify them.

Every storage node is a directory for some of the pages. Storage nodes keep track of those pages that are stored on them. For instance, Node 3 is the directory of P1 and P4. The directories are mainly responsible for managing the metadata for the invalidation-based cache coherence protocol: this metadata is depicted in the tables on the storage nodes and reflects the state of the compute caches. For instance, P5 is held in exclusive-mode by Node 2 whereas P1 is cached on all compute nodes in shared-mode.

Eviction becomes an interesting problem here as well. When every node only uses a local eviction strategy egoistically (remember the directory service is at the storage side not at the compute side), latencies will increase. Hot items will be cached at each node, and as a result some items won't be cached in any node. This would force the nodes to read items off the storage rather than from the cache of another node, and increase latency. (It would be beneficial to analyze how much of an increase this is in a cloud setup, but this is not discussed in the paper.) To address the problem, one approach may be for the compute to cache directory service in a best effort basis as a hint, and use a local deterministic rule for eviction (which let's it deduce global actions to some extent as well).

Currently, ScaleStore is merely a distributed storage engine that provides page-level sequential consistency but without transaction support. There are some plans to support ACID transactions on top of an elastic caching-based DBMS. The strategy to implement isolation wrt cache coherence is still undecided. The paper mentions the following.

Mohan and Narang (1992) describe LP-Locking, a technique that avoids this piggybacked message altogether by granting local lock requests as long as the page is cached on the compute node. This approach is a good fit for ScaleStore’s modus operandi as pages are cached as long as there is no conflict (i.e., invalidation).

The previous schemes rely on pessimistic concurrency control (CC). Another direction is to evaluate modern optimistic CC schemes. Especially for ScaleStore, such schemes are interesting as they avoid all shared memory writes for read operations. That is, no pages must be invalidated and thus no network messages must be sent at all. AWS Aurora uses an optimistic scheme where conflicts are detected during log replay on storage servers.


ScaleStore looks like an update-in-place system. That would cause problems for scalability, especially since reading from earlier versions would not be supported.

Would it be possible to use snapshot isolation (SI) rather than serializability for this database? When using SI,  you can cross-layer design concurrency control and coherent caching together. Only writes invalidate cache, and before writing at one node, CC can invalidate all cache in other nodes.

It looks like this is planned for localized deployment. For cross AZ or cross region, implementing a cache coherency protocol would introduce additional challenges.

It would be interesting to discuss more about the fault-tolerance aspects of the designs.


Popular posts from this blog

Foundational distributed systems papers

Learning about distributed systems: where to start?

Strict-serializability, but at what cost, for what purpose?

CockroachDB: The Resilient Geo-Distributed SQL Database

Amazon Aurora: Design Considerations + On Avoiding Distributed Consensus for I/Os, Commits, and Membership Changes

Anna: A Key-Value Store For Any Scale

Warp: Lightweight Multi-Key Transactions for Key-Value Stores

The Seattle Report on Database Research (2022)

Learning a technical subject

Checking statistical properties of protocols using TLA+