Enabling the Next Generation of Multi-Region Applications with CockroachDB (SIGMOD 2022)

This paper describes how CockroachDB (CRDB) simplifies cross-region deployments by providing the user with high-level declarative SQL statements for expressing data access locality and availability goals. The paper also describes a new transaction management protocol that provides local, strongly consistent reads from any database replica (see the Global Transactions section at the end, just before the Evaluation). If you need a refresher on CRDB basics, the earlier SIGMOD 2020 paper is a better place to start.

 

Motivating example

Consider a ride-sharing company called movr. Fig. 1a shows two tables from movr's database schema. Fig. 1b shows some of the challenges of converting them to multi-region by hand. The schema must be modified to add a partitioning column, since no natural partitioning column exists, and the application logic and DML must also be modified to use this new column. Partitioning is not viable at all for the promo_codes table, which has no locality of access; instead, users want to perform low-latency reads of the promo_codes table from all regions while still guaranteeing strong consistency. Fig. 1c shows how CRDB addresses these problems with the table localities abstraction. (Well, the protocol behind these low-latency global reads is described much later, in the Global Transactions section.)
[Figure 1: (a) two tables from movr's schema, (b) challenges of a manual multi-region conversion, (c) the same schema using CRDB's multi-region SQL abstractions]

SQL for multi-region abstractions

To create a multi-region database, users need to choose a PRIMARY region and optionally specify additional regions. All regions in CRDB can host leaseholder (i.e., primary) replicas. A PRIMARY region merely serves as the default region for data placement when an alternative region has not been specified. For example:
CREATE DATABASE movr PRIMARY REGION "us-east1" REGIONS "us-west1", "europe-west1";
ALTER DATABASE movr ADD REGION "australia-southeast1";
ALTER DATABASE movr DROP REGION "us-west1";


Survivability goals are set as follows:
ALTER DATABASE movr SURVIVE REGION FAILURE;
ALTER DATABASE movr SURVIVE ZONE FAILURE;


Every table in a multi-region database has a table locality setting, which is REGIONAL BY TABLE, REGIONAL BY ROW, or GLOBAL.
CREATE TABLE west_coast_users ( ... ) LOCALITY REGIONAL BY TABLE IN "us-west1";
CREATE TABLE users ( ... ) LOCALITY REGIONAL BY ROW;
ALTER TABLE promo_codes SET LOCALITY GLOBAL;
 
 
GLOBAL tables are optimized for low-latency reads from all regions, at the expense of slower writes. GLOBAL tables are useful for read-mostly data that cannot be partitioned by locality; a common use-case is rarely updated reference data that needs to be read from all regions.

REGIONAL BY TABLE locality represents the case where all rows in the table are primarily accessed from the same home region, and therefore there is no need for partitioning across regions (data may still be split across nodes within the same region).

In tables with REGIONAL BY ROW locality, individual rows are optimized for access from different regions. This setting divides a table and all of its indexes into partitions, with each partition optimized for access from the region specified in a hidden ENUM column of type crdb_internal_region.
  • Adding or dropping a region is equivalent to adding or removing a value of the crdb_internal_region ENUM. Dropping a region additionally requires validating that no row in any REGIONAL BY ROW table still has its region column set to that region.
  • Altering a table to REGIONAL BY TABLE or GLOBAL is simply a metadata operation followed by a zone configuration change. Altering a table to REGIONAL BY ROW additionally requires prefixing each index with the hidden region column. This is implemented by building a new index with the new column set and, once it is backfilled, swapping it in for the old one.
The home region of a table or partition is the region where all leaseholders of its Ranges are placed. Databases that are configured to survive only zone failures are set up to have 3 voting replicas for every Range, all in the home region. To support low-latency reads from other regions, one non-voting replica is placed in each non-home region.

Databases can only be configured to survive a region failure if they contain at least 3 regions. With REGION survivability, CRDB uses 5 voting replicas per Range, 2 of them in the home region and the rest spread across the other regions, so that losing any single region still leaves a quorum of voters.
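To make the placement rules concrete, here is a minimal Go sketch (hypothetical names, not CRDB's actual placement code) that assigns voting and non-voting replicas for a Range based on the database's survivability goal, its home region, and its region list. The exact way the remaining voters are spread across non-home regions is an assumption.

package main

import "fmt"

type SurvivalGoal int

const (
	SurviveZoneFailure SurvivalGoal = iota
	SurviveRegionFailure
)

// placement records how many voting replicas each region hosts and which
// regions get a non-voting replica for serving local follower reads.
type placement struct {
	Voters    map[string]int
	NonVoters []string
}

// planReplicas models the placement rules described above:
//   - ZONE survivability: 3 voters, all in the home region; every other
//     region gets a non-voting replica for local follower reads.
//   - REGION survivability: 5 voters, 2 in the home region and the rest
//     spread over the other regions; regions without a voter get a non-voter.
func planReplicas(goal SurvivalGoal, home string, regions []string) placement {
	p := placement{Voters: map[string]int{}}
	switch goal {
	case SurviveZoneFailure:
		p.Voters[home] = 3
	case SurviveRegionFailure:
		if len(regions) < 3 {
			panic("REGION survivability requires at least 3 regions")
		}
		p.Voters[home] = 2
		remaining := 3
		for remaining > 0 {
			for _, r := range regions {
				if r == home || remaining == 0 {
					continue
				}
				p.Voters[r]++
				remaining--
			}
		}
	}
	for _, r := range regions {
		if p.Voters[r] == 0 {
			p.NonVoters = append(p.NonVoters, r)
		}
	}
	return p
}

func main() {
	regions := []string{"us-east1", "us-west1", "europe-west1", "australia-southeast1"}
	fmt.Printf("%+v\n", planReplicas(SurviveZoneFailure, "us-east1", regions))
	fmt.Printf("%+v\n", planReplicas(SurviveRegionFailure, "us-east1", regions))
}

With ZONE survivability, every non-home region ends up with a non-voting replica; with REGION survivability and four regions, every non-home region ends up hosting a voter, so no non-voters are needed in this example.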

Low-Latency Stale Reads

The leaseholder of a Range is the only replica allowed to serve up-to-date reads and writes.  Follower replicas can serve read-only queries (follower-reads) on a sufficiently old MVCC snapshot. Follower reads are used for explicitly requested stale reads, for long-running transactions, and for global transactions.

CRDB is a serializable timestamp-based MVCC system, so a read with timestamp T needs to reflect all overlapping writes with timestamps less than T. A follower can perform a read at a given timestamp T iff:
  1. No future writes can invalidate the read retroactively. The follower needs a guarantee that no new writes will be committed with MVCC timestamps less than T.
  2. It has all the data necessary to serve the read. The follower needs to have applied the prefix of the Raft log that can contain writes at MVCC timestamps less than T.
The two conditions are ensured through the closed timestamps mechanism. A closed timestamp is a promise made by the leaseholder that it will not accept new writes at or below that MVCC timestamp (thus closing it). When a follower applies a command carrying a closed timestamp T, it knows that there will not be further commands writing at or below T. By default, leaseholders close timestamps that are 3 seconds old. Long-running transactions complicate this scheme and require the followers to also check for write intents on the keys being read. (I wish the paper explained explicitly how long-running transactions are handled. They are mentioned in a couple of places without discussing how they are handled, how they interact with other transactions/mechanisms, and how they affect performance.)
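Here is a minimal Go sketch (hypothetical names, not CRDB code) of the follower-read check implied by the two conditions above: the read timestamp must be at or below the closed timestamp the follower has learned, and the follower must have applied the Raft log prefix up to the entry that carried that closed timestamp.

package main

import (
	"fmt"
	"time"
)

// followerState is a toy model of what a follower replica needs to track in
// order to serve follower reads: the latest closed timestamp it has learned
// from the leaseholder, and how much of the Raft log it has applied.
type followerState struct {
	closedTimestamp time.Time // leaseholder's promise: no new writes at or below this timestamp
	closedAtLogIdx  uint64    // Raft log index of the command that carried that closed timestamp
	appliedLogIdx   uint64    // highest Raft log index this follower has applied
}

// canServeFollowerRead checks the two conditions from the text:
//  1. no future write can invalidate the read (readTS is at or below the
//     closed timestamp), and
//  2. the follower has all the data needed (it has applied the log prefix
//     that can contain writes at or below readTS).
func (f *followerState) canServeFollowerRead(readTS time.Time) bool {
	if readTS.After(f.closedTimestamp) {
		return false // the leaseholder may still accept writes below readTS
	}
	return f.appliedLogIdx >= f.closedAtLogIdx
}

func main() {
	f := &followerState{
		closedTimestamp: time.Now().Add(-3 * time.Second), // by default ~3 seconds behind present
		closedAtLogIdx:  100,
		appliedLogIdx:   100,
	}
	fmt.Println(f.canServeFollowerRead(time.Now().Add(-5 * time.Second))) // true: sufficiently stale read
	fmt.Println(f.canServeFollowerRead(time.Now()))                       // false: must go to the leaseholder
}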

Global Transactions

GLOBAL tables enable strongly consistent low-latency reads that can be served locally by any replica, at the expense of slower writes. Transactions on GLOBAL tables write into the future by scheduling their writes to take effect at a future timestamp. This scheduled time is chosen such that by the time it becomes present time, the transaction has likely released its locks, replication has propagated the updated data, and present hybrid logical clock (HLC) time is already closed on all replicas. This allows any replica, not just the leaseholder, to serve present-time reads locally using a regular transaction timestamp in the absence of conflicting writes. All of this works because present time is closed on GLOBAL tables. The price is that transactions writing to GLOBAL tables have increased commit latency: they are commit-timestamped in the future by roughly 1 RTT for Raft consensus among the voting replicas, plus 0.5 RTT to propagate the commit resolution to all replicas (including the non-voting ones), plus max_clock_offset=250ms, which comes to around 500-600ms.
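To spell out that arithmetic, here is a tiny Go sketch (assumed RTT values; hypothetical function name) of how far into the future a GLOBAL-table write gets commit-timestamped, which is also roughly how long its coordinator has to commit wait.

package main

import (
	"fmt"
	"time"
)

// futureTimestampLead estimates how far into the future a GLOBAL-table write
// is commit-timestamped, following the breakdown above: one RTT for Raft
// consensus among the voting replicas, half an RTT to propagate the commit
// resolution to all replicas (including non-voting ones), plus the maximum
// tolerated clock offset.
func futureTimestampLead(raftRTT, maxClockOffset time.Duration) time.Duration {
	return raftRTT + raftRTT/2 + maxClockOffset
}

func main() {
	// Assumed inter-region RTTs; max_clock_offset is 250ms as in the paper.
	for _, rtt := range []time.Duration{150 * time.Millisecond, 250 * time.Millisecond} {
		fmt.Printf("RTT=%v -> commit timestamp lead ~%v\n", rtt, futureTimestampLead(rtt, 250*time.Millisecond))
	}
	// Prints leads of ~475ms and ~625ms, bracketing the 500-600ms figure quoted above.
}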
 
The nice thing about this approach is that there is no need to wait for the write to commit in every region. If one region is slow to learn that the write committed (and so still sees its locks), only reads in that region suffer from the slowness until the intents are resolved there. Reads in the other regions are not delayed, and can complete soon after the future-timestamped write is resolved in their own region. To minimize cross-region communication and coordination, this design leverages time delays rather than communication to resolve conflicts between readers and any concurrent writers.

Below we follow this global transaction execution in detail, after a recap of how uncertainty intervals work in CRDB.

Background: Uncertainty Intervals

CRDB guarantees serializability for transactions and linearizability for reads and writes at the level of a single key. (This is weaker than strict serializability of transactions, for which there is a known counterexample in CRDB.) Linearizability ensures that operations appear to take place in some total order consistent with their real-time order. That is, a read operation r, invoked after a write w (on the same key) completes, observes the value written by w or a newer one. To achieve this, CRDB relies on loose clock synchronization and the concept of an uncertainty interval: a time window following a read's timestamp within which the reading transaction cannot make real-time ordering guarantees. The duration of uncertainty intervals is configured as the maximum tolerated clock skew between any two nodes, max_clock_offset. When reading from a leaseholder, a reader that encounters a provisional or committed write to the same key within its uncertainty interval is forced to ratchet up its timestamp and perform an uncertainty refresh, checking whether the values previously read by the transaction remain unchanged at the newer timestamp. If the values have changed, the reader must restart; in either case, the upper bound of the uncertainty interval does not change.
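The following toy Go sketch (hypothetical names, not CRDB code) captures the uncertainty check just described: a value timestamped above the read but within max_clock_offset forces a timestamp bump and uncertainty refresh, while a value above the interval can be safely ignored.

package main

import (
	"fmt"
	"time"
)

const maxClockOffset = 250 * time.Millisecond

type readOutcome string

const (
	observe        readOutcome = "observe"        // value is visible to the read as usual
	ignore         readOutcome = "ignore"         // value is safely above the uncertainty interval
	uncertainRetry readOutcome = "uncertainRetry" // bump the read timestamp and refresh
)

// checkUncertainty decides what a reader at readTS must do when it encounters
// a value written at valueTS on the same key. The upper bound of the
// uncertainty interval (readTS + maxClockOffset) is fixed when the transaction
// starts and does not move even if the read timestamp is later bumped.
func checkUncertainty(readTS, valueTS time.Time) readOutcome {
	switch {
	case !valueTS.After(readTS):
		return observe // ordinary MVCC visibility: the write is at or below the read timestamp
	case valueTS.After(readTS.Add(maxClockOffset)):
		return ignore // given bounded clock skew, the write cannot precede the read in real time
	default:
		return uncertainRetry // the write may precede the read in real time, so it must be observed
	}
}

func main() {
	now := time.Now()
	fmt.Println(checkUncertainty(now, now.Add(-time.Second)))         // observe
	fmt.Println(checkUncertainty(now, now.Add(100*time.Millisecond))) // uncertainRetry
	fmt.Println(checkUncertainty(now, now.Add(time.Second)))          // ignore
}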

Uncertainty intervals ensure that the relative order of timestamps used by conflicting transactions that touch the same keys respects real-time order. Leaseholders use these timestamps to enforce serializability by blocking reads on the completion of writes to the same key with a timestamp equal to or lower than the read. Leaseholders also advance the timestamp of writes above the timestamp of any previously served reads or refreshes on the same key, preventing writes from invalidating a read’s result after it completes.

Future-time transactions

Normal transactions in CRDB start with a timestamp assigned from the transaction coordinator's HLC. As the transaction proceeds this timestamp may be ratcheted up, but never exceeds present time by more than max_clock_offset.

In contrast, future-time writes are initially invisible to present-time readers. To preserve linearizability (and read-your-writes), the coordinator delays completion of a write operation (i.e., its acknowledgement to the client) until its HLC advances beyond the transaction's commit timestamp. At that point, no other clock in the system lags the commit timestamp by more than the maximum tolerated clock skew, hence every new read is guaranteed to observe the write through the uncertainty interval mechanism described above.
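A minimal Go sketch (hypothetical names) of this write-side commit wait: the coordinator sits on the acknowledgement until its local clock, standing in here for its HLC, passes the commit timestamp.

package main

import (
	"fmt"
	"time"
)

// commitWait blocks until the coordinator's local clock (standing in for its
// HLC here) has advanced past the transaction's commit timestamp. Only then
// is the commit acknowledged to the client, which guarantees that no clock in
// the system lags the commit timestamp by more than the maximum tolerated
// clock skew, so every subsequent read is forced to observe the write via its
// uncertainty interval.
func commitWait(commitTS time.Time, now func() time.Time) {
	if d := commitTS.Sub(now()); d > 0 {
		time.Sleep(d)
	}
}

func main() {
	commitTS := time.Now().Add(500 * time.Millisecond) // future-time commit timestamp
	start := time.Now()
	commitWait(commitTS, time.Now)
	fmt.Printf("acknowledged commit after %v\n", time.Since(start))
}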

Because present-time is closed on GLOBAL tables, in the absence of conflicting writes, present-time reads can be served immediately by any replica. In cases of contention on the same key, however, the uncertainty interval rules apply: a read operation r encountering a write w (to the same key) with a higher timestamp but within r's uncertainty interval must observe the written value by bumping its read timestamp to w's timestamp and performing an uncertainty refresh. However, unlike with present-time writes, the existence of a future-time write w does not guarantee that all other clocks in the system are within max_clock_offset from w's timestamp. As a result, if the system were to allow r to observe the value written by w and immediately complete, a subsequent read r' performed on a node with a slower clock may fail to observe w in violation of linearizability (in any total order of operations, w appears before r since r returns its written value, and r appears before r' because of real-time order, but r' fails to observe w). The solution is similar to that of writes: if w was written with a future timestamp, r must not only perform an uncertainty refresh using w's timestamp, but must also commit wait before completing, until the local HLC of r's transaction coordinator advances beyond w's commit timestamp. This ensures that when r completes, all observed values are within the uncertainty interval of every node in the system, and any newly invoked read is also guaranteed to observe them.
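Combining the two previous rules, here is a toy Go sketch (hypothetical names, not CRDB code) of a present-time read that runs into a future-time write: it bumps its timestamp to observe the value (uncertainty refresh) and then commit waits before returning to the client.

package main

import (
	"fmt"
	"time"
)

const maxClockOffset = 250 * time.Millisecond

// serveRead models a present-time read at readTS that encounters a
// conflicting write at writeTS on the same key, and returns the timestamp the
// read is eventually served at.
func serveRead(readTS, writeTS time.Time) time.Time {
	resultTS := readTS
	// The write is above the read timestamp but within the reader's
	// uncertainty interval: the read must observe it, so bump the read
	// timestamp to the write's timestamp and refresh.
	if writeTS.After(readTS) && !writeTS.After(readTS.Add(maxClockOffset)) {
		resultTS = writeTS
	}
	// If the (possibly bumped) timestamp is still in the future of the local
	// clock, commit wait before returning to the client (in CRDB this wait
	// happens at the transaction coordinator), so that the observed value is
	// inside every node's uncertainty interval by the time the read completes.
	if d := time.Until(resultTS); d > 0 {
		time.Sleep(d)
	}
	return resultTS
}

func main() {
	readTS := time.Now()
	writeTS := readTS.Add(200 * time.Millisecond) // a future-time write within the uncertainty interval
	start := time.Now()
	servedAt := serveRead(readTS, writeTS)
	fmt.Printf("read served at %v after waiting %v\n", servedAt.Format(time.RFC3339Nano), time.Since(start))
}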

[Figure 2: flow of a global transaction (writing client, top) and present-time follower reads (reading client, bottom)]

Figure 2 depicts the typical flow of a global transaction and its interaction with present-time follower reads. A writing client (on the top) communicates with a transaction coordinator (gateway) which, in turn, communicates with the leaseholder of the relevant Range (1). The write is assigned a future MVCC timestamp and is replicated to all replicas. The commit is only acknowledged to the client after commit wait, i.e., when the write's timestamp has become current w.r.t. the coordinator's local clock (2). A reading client (on the bottom) performs two reads, in two different transactions, both served by a nearby follower replica. The first read runs quickly and doesn't see the recently-written value because its timestamp is below the write timestamp (3). The timestamp of the second read is also lower than the write's, but this time the write falls within the reader's uncertainty window and forces the reader to observe the value. The read bumps its timestamp (which now becomes a future timestamp), performs an uncertainty restart, and then commit-waits until the timestamp becomes current at its coordinator (4).

When the max_clock_offset assumption is violated, the timestamp of a previously committed write could fall outside the uncertainty interval of a read transaction coordinated by a node with a lagging clock, allowing the global read to return stale data. On the other hand, since isolation does not rely on clock synchronization, CRDB's transaction serializability guarantees are not affected by clock skew.

Evaluation summary

GLOBAL tables are suitable for read-mostly workloads requiring fast reads from anywhere, while REGIONAL tables are best when a workload has high locality of access or stale reads are acceptable. The evaluation quantifies this tradeoff by comparing the two table localities head to head. The deployment uses max_clock_offset=250ms and runs a CRDB cluster with nodes located in 5 regions; each region hosts 3 CRDB nodes and 10 clients. The round-trip times between regions are summarized in Table 1. The workload is YCSB-A with a 1:1 ratio of reads to writes: each client performs single-key reads and writes with keys drawn from a Zipf distribution. All five regions are added to the database, and us-east is designated as PRIMARY, so it holds the leaseholders for the REGIONAL BY TABLE and GLOBAL tables. Each table is populated with 100k keys, and each client sends 50k queries to a co-located CRDB node.

[Table 1: round-trip times between regions, and the evaluation results]
