CockroachDB Serverless: Sub-second Scaling from Zero with Multi-region Cluster Virtualization

This paper describes the architecture behind CockroachDB Serverless. At first glance, the design can feel like cheating. Rather than introducing a new physically disaggregated architecture with log and page stores, CRDB retrofits its existing implementation through logical disaggregation: it splits the binary into separate SQL and KV processes and calls it serverless. But dismissing this as fake disaggregation would miss the point. I came to appreciate this design choice as I read the paper (a well-written paper!). This logical disaggregation (the paper calls it cluster virtualization) provides a pragmatic evolution of the shared-nothing model. CRDB pushes the SQL–KV boundary (as in systems like TiDB and FoundationDB) to its logical extreme to provide the basis for a multi-tenant storage layer. From there, they solve the sub-second cold-start and admission control problems with good engineering rather than an architectural overhaul.


System Overview

If you split the stack at the page level, the compute node becomes heavy. It must own buffer caches, lock tables, transaction state, and recovery logic. Booting a new node may then take ~30+ seconds to hydrate caches and initialize these managers. CRDB avoids this by drawing the boundary higher and placing all heavy state in the shared KV layer.

  • The KV Node (Storage): This is a single massive multi-tenant shared process. It owns caching, transaction coordination, and Raft replication.
  • The SQL Node (Compute): These are lightweight stateless processes per tenant. They are responsible only for parsing queries, planning execution, and acting as the gateway to storage.

Because the SQL node is effectively stateless, the system can maintain a pool of pre-warmed, generic SQL processes on each VM; when a client connects, one of these processes is instantly assigned to that tenant and starts serving traffic in under 650 ms.
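To make the warm-pool idea concrete, here is a minimal Go sketch of claiming a pre-warmed, tenant-agnostic process on first connection. The pool, process type, and timeout are illustrative assumptions, not CockroachDB's actual code; the point is that assignment is cheap bookkeeping, with no caches or managers to hydrate.

```go
// warmpool_sketch.go: a toy model of assigning a pre-warmed SQL process to a tenant.
package main

import (
	"errors"
	"fmt"
	"time"
)

// sqlProc stands in for a generic, already-booted SQL process.
type sqlProc struct {
	id       int
	tenantID string // empty until the process is claimed by a tenant
}

// warmPool holds idle, tenant-agnostic processes ready to be claimed.
type warmPool struct {
	idle chan *sqlProc
}

func newWarmPool(size int) *warmPool {
	p := &warmPool{idle: make(chan *sqlProc, size)}
	for i := 0; i < size; i++ {
		p.idle <- &sqlProc{id: i} // pre-warmed: booted before any tenant connects
	}
	return p
}

// assign claims an idle process for a tenant. Only cheap bookkeeping happens
// here, which is why assignment can finish well under a second.
func (p *warmPool) assign(tenantID string, timeout time.Duration) (*sqlProc, error) {
	select {
	case proc := <-p.idle:
		proc.tenantID = tenantID // stamp the process with its tenant identity
		return proc, nil
	case <-time.After(timeout):
		return nil, errors.New("no warm SQL process available")
	}
}

func main() {
	pool := newWarmPool(4)
	proc, err := pool.assign("tenant-42", 650*time.Millisecond)
	if err != nil {
		panic(err)
	}
	fmt.Printf("tenant %s served by pre-warmed SQL process %d\n", proc.tenantID, proc.id)
}
```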

The shared KV storage relies on Log-Structured Merge (LSM) trees (specifically Pebble), where data is just a sorted stream of keys. Implementing multi-tenancy is as simple as prepending a Tenant ID to the key (e.g., /TenantA/Row1). LSMs help here because the underlying storage engine doesn't care; it just sees sorted bytes. B-tree-based systems make this kind of fine-grained multi-tenancy hard because they tie data structures to files and pages and do not multiplex tenants naturally.
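As a toy illustration of the prefixing scheme (the slash-delimited string keys are my simplification; CockroachDB's real key encoding is binary), one sorted keyspace naturally separates tenants into contiguous spans:

```go
// Tenant-prefixed keys: one sorted keyspace serving many tenants.
package main

import (
	"fmt"
	"sort"
)

func tenantKey(tenantID, key string) string {
	return "/" + tenantID + "/" + key
}

func main() {
	// A plain sorted slice stands in for the LSM's sorted run of keys.
	keys := []string{
		tenantKey("TenantA", "Row1"),
		tenantKey("TenantB", "Row1"),
		tenantKey("TenantA", "Row2"),
	}
	sort.Strings(keys)

	// Because keys sort by tenant prefix first, each tenant's data is a
	// contiguous span; a per-tenant scan is just a prefix range scan.
	fmt.Println(keys) // [/TenantA/Row1 /TenantA/Row2 /TenantB/Row1]
}
```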

The security model is hybrid. Compute is strongly isolated, with separate processes per tenant, while storage uses soft isolation in a shared KV layer. The paper claims data leakage is unlikely because ranges are already treated as independent atomic units. In practice, this isolation depends on software checks such as key prefixes and TLS, not hardware boundaries like VMs or enclaves. As a result, a KV-layer bug has a larger blast radius than in a fully isolated design.
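To make the "software checks" point concrete, here is a hedged sketch of the kind of prefix check a shared KV node could apply to every request before touching data. The function and its string-based keys are hypothetical, not the actual CRDB request path; the takeaway is that the tenant boundary is enforced by code like this rather than by a VM or enclave.

```go
// A toy tenant-boundary check on a request's key span.
package main

import (
	"fmt"
	"strings"
)

// requestAllowed returns true only if every key the request touches lies
// under the authenticated tenant's prefix.
func requestAllowed(authTenant, startKey, endKey string) bool {
	prefix := "/" + authTenant + "/"
	return strings.HasPrefix(startKey, prefix) && strings.HasPrefix(endKey, prefix)
}

func main() {
	fmt.Println(requestAllowed("TenantA", "/TenantA/Row1", "/TenantA/Row9")) // true
	fmt.Println(requestAllowed("TenantA", "/TenantA/Row1", "/TenantB/Row1")) // false: span crosses tenants
}
```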


Trade-offs

Every query incurs a network hop between the SQL and KV layers, even within the same VM, introducing an unavoidable RPC overhead. For OLTP workloads, this impact is minimal, and benchmarks show performance on par with dedicated clusters for typical transactional operations. For OLAP workloads, however, the cost is significant, often resulting in a 2.3x increase in CPU usage.

Caching involves trade-offs as well. Placing the cache in the shared KV layer is much more expensive (in dollar terms as well) than local compute caching, as recent research on distributed caches shows. In a serverless environment, however, this inefficiency buys startup agility: new SQL nodes boot with nothing to hydrate, because the warm cache already lives in the shared KV layer.

It is worth noting the economics here. This multi-tenant model is a win for small customers who need low costs and elasticity. Large customers with predictable heavy workloads will still prefer dedicated hardware to avoid noisy neighbors entirely, and to get the most performance out of the deployment.


Noisy Neighbors & Admission Control

One of the biggest challenges in shared storage is the "Noisy Neighbor" problem. A single physical KV node can host replicas for thousands of ranges, participating in thousands of Raft groups simultaneously. To manage resource contention, the system implements a sophisticated Admission Control mechanism:

  • The system uses priority queues (heaps) based on recent usage to ensure fairness. Short tasks naturally float to the top, while long-running scans yield and wait.
  • It estimates node capacity 1,000 times a second for CPU and every 15 seconds for disk bandwidth, adjusting limits in real time.
  • It enforces per-tenant quotas using a distributed token bucket system that can "trickle" grants to smooth out traffic spikes (a toy sketch of the idea follows below).
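To illustrate the token-bucket idea, here is a single-node Go sketch. The rates, the time-based trickled refill, and the reject-when-empty behavior are my simplifications; the paper's mechanism is a distributed protocol that coordinates grants across nodes and queues work rather than rejecting it.

```go
// A toy per-tenant token bucket: grants trickle in over time instead of
// arriving as one big burst, which smooths out traffic spikes.
package main

import (
	"fmt"
	"time"
)

type tokenBucket struct {
	tokens     float64   // currently available request units
	capacity   float64   // maximum burst
	refillRate float64   // units added per second (the tenant's quota)
	lastRefill time.Time
}

func newTokenBucket(capacity, refillRate float64) *tokenBucket {
	return &tokenBucket{tokens: capacity, capacity: capacity, refillRate: refillRate, lastRefill: time.Now()}
}

// refill trickles in tokens based on elapsed time rather than granting the
// whole quota up front.
func (b *tokenBucket) refill(now time.Time) {
	elapsed := now.Sub(b.lastRefill).Seconds()
	b.tokens += elapsed * b.refillRate
	if b.tokens > b.capacity {
		b.tokens = b.capacity
	}
	b.lastRefill = now
}

// admit charges cost request units if available; a real system would queue
// the request behind the bucket, here we simply reject for brevity.
func (b *tokenBucket) admit(cost float64, now time.Time) bool {
	b.refill(now)
	if b.tokens >= cost {
		b.tokens -= cost
		return true
	}
	return false
}

func main() {
	bucket := newTokenBucket(100, 50) // 100-unit burst, 50 units/sec quota
	now := time.Now()
	fmt.Println(bucket.admit(80, now))                    // true: within the burst
	fmt.Println(bucket.admit(80, now))                    // false: quota exhausted for now
	fmt.Println(bucket.admit(80, now.Add(2*time.Second))) // true: tokens have trickled back
}
```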
