Taurus MM: A Cloud-Native Shared-Storage Multi-Master Database

This VLDB'23 paper presents Taurus MM, Huawei's cloud-native, multi-master OLTP database built to scale write throughput across clusters of 2 to 16 masters. It extends the single-master TaurusDB design (which we reviewed yesterday) into a multi-master system while retaining its shared-storage architecture with separate compute and storage layers. Each master maintains its own write-ahead log (WAL) and executes transactions independently; there are no distributed transactions. All masters share the same Log Stores and Page Stores, and data is coordinated through new algorithms that reduce network traffic and preserve strong consistency.

The system uses pessimistic concurrency control to avoid frequent aborts on contended workloads. Consistency is maintained through two complementary mechanisms: a new clock design that makes causal ordering efficient, and a new hybrid locking protocol that cuts coordination cost.


Vector-Scalar (VS) Clocks

A core contribution is the Vector-Scalar (VS) clock, a new type of logical clock that combines the compactness of Lamport clocks with the causal completeness of vector clocks.

Ordinary Lamport clocks are compact, but they capture causality in only one direction: if event a happened before event b, then a's timestamp is smaller, but a smaller timestamp does not imply a happened-before relationship. Vector clocks capture causality fully, but scale poorly. An 8-node vector clock adds 64 bytes to every message or log record, which turns into a prohibitive cost when millions of short lock and log messages per second are exchanged in a cluster. Taurus MM solves this by letting the local component of each node's VS clock behave like a Lamport clock, while keeping the rest of the vector to track other masters' progress. This hybrid makes the local counter advance faster (it reflects causally related global events, not just local ones) yet still yields vector-like ordering when needed.
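
Here is a minimal Python sketch of those update rules as I understand them from the paper's description; the class name and method signatures are my own, not Taurus MM's implementation.

```python
class VSClock:
    """Sketch of a Vector-Scalar clock: a vector clock whose local entry
    also behaves like a Lamport clock (it advances past any causally
    related scalar timestamp it receives)."""

    def __init__(self, node_id: int, num_nodes: int):
        self.node_id = node_id
        self.clock = [0] * num_nodes   # clock[node_id] is the scalar component

    def tick(self) -> int:
        """Advance the local (Lamport-like) entry for a local event."""
        self.clock[self.node_id] += 1
        return self.clock[self.node_id]

    def on_receive_scalar(self, ts: int) -> int:
        """Merge a scalar timestamp: the local entry jumps past it, so the
        local counter reflects causally related remote events too."""
        self.clock[self.node_id] = max(self.clock[self.node_id], ts) + 1
        return self.clock[self.node_id]

    def on_receive_vector(self, ts: list[int]) -> list[int]:
        """Merge a full vector timestamp: element-wise max, then tick."""
        self.clock = [max(a, b) for a, b in zip(self.clock, ts)]
        self.clock[self.node_id] += 1
        return list(self.clock)

    def scalar(self) -> int:
        return self.clock[self.node_id]

    def vector(self) -> list[int]:
        return list(self.clock)
```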

VS clocks can stamp messages either with a scalar or a vector timestamp depending on context. Scalar timestamps are used when causality is already known, such as for operations serialized by locks or updates to the same page. Vector timestamps are used when causality is uncertain, such as across log flush buffers or when creating global snapshots.
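
Continuing the sketch above (and reusing the illustrative VSClock class), the choice between the two stamp flavors might look like this; the message layouts are assumptions for illustration only.

```python
clk = VSClock(node_id=0, num_nodes=8)

# Lock messages and same-page updates are already ordered (by the lock or by
# the page), so a scalar stamp suffices: 8 bytes instead of 64 on the wire.
lock_msg = {"type": "row_lock_grant", "ts": clk.tick()}      # scalar timestamp

# A log flush buffer mixes updates to unrelated pages from several transactions,
# so cross-master causality is uncertain and it carries the full vector.
clk.tick()
lfb_header = {"type": "lfb", "ts": clk.vector()}             # vector timestamp
```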

I really like the VS clock algorithm, and how it keeps most timestamps compact while still preserving ordering semantics. It's conceptually related to Hybrid Logical Clocks (HLC) in that it keeps per-node clock values close and comparable, but VS clocks are purely logical, driven by Lamport-style counters instead of synchronized physical time. The approach enables rapid creation of globally consistent snapshots and reduces timestamp size and bandwidth consumption by up to 60%.

I enjoyed the paper's pedagogical style in Section 5, as it walks the reader through deciding whether each operation needs scalar or vector timestamps. This makes it clear how we can enhance efficiency by applying the right level of causality tracking to each operation.


Hybrid Page-Row Locking

The second key contribution is a hybrid page-row locking protocol. Taurus MM maintains a Global Lock Manager (GLM) that manages page-level locks (S and X) across all masters. Each master also runs a Local Lock Manager (LLM) that handles row-level locks independently once it holds the covering page lock.

The GLM grants page locks, returning both the latest page version number and any row-lock info. Once a master holds a page lock, it can grant compatible row locks locally without contacting the GLM. When the master releases a page, it sends back the updated row-lock state so other masters can reconstruct the current state lazily.

Row-lock changes don't need to be propagated immediately; they are piggybacked on the page-lock release flow, which dramatically reduces lock traffic. The GLM only intervenes when another master requests a conflicting page lock.
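
A hedged sketch of the master-side flow, under my own assumptions about the GLM/LLM interfaces (only the division of labor follows the paper):

```python
class LocalLockManager:
    """Per-master LLM sketch: row locks are handled locally once the covering
    page lock is held; the GLM is contacted only at page granularity."""

    def __init__(self, glm):
        self.glm = glm                  # stub for the Global Lock Manager
        self.page_locks = {}            # page_id -> (mode, {row_id: mode})

    def lock_row(self, page_id, row_id, mode):
        if page_id not in self.page_locks:
            # One GLM round trip per page; the reply carries the latest page
            # version and the row-lock state left behind by other masters.
            version, row_locks = self.glm.acquire_page(page_id, mode)
            self.page_locks[page_id] = (mode, dict(row_locks))
        _, row_locks = self.page_locks[page_id]
        row_locks[row_id] = mode        # granted locally, no GLM traffic

    def release_page(self, page_id):
        # Row-lock changes are piggybacked on the page-lock release, so the
        # GLM and other masters reconstruct the current row-lock state lazily.
        mode, row_locks = self.page_locks.pop(page_id)
        self.glm.release_page(page_id, row_locks=row_locks)
```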

This separation of global page locks and local row locks resembles our 2014 Panopticon work, where we combined global visibility and local autonomy to limit coordination overhead.


Physical and Logical Consistency

Taurus MM distinguishes between physical and logical consistency. Physical consistency ensures structural correctness of pages. The master groups log records into log flush buffers (LFBs) so that each group ends at a physically consistent point (e.g., a B-tree split updates parent and children atomically within LFB bounds). Read replicas apply logs only up to group boundaries, avoiding partial structural states without distributed locks.
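
A short sketch of what boundary-based application means on the replica side; the record and LFB fields are illustrative, not the actual format.

```python
def apply_logs(replica_state, lfbs):
    """Apply log records only up to complete LFB boundaries, so a replica
    never exposes a half-applied structural change (e.g., a B-tree split)."""
    for lfb in lfbs:
        if not lfb.complete:            # stop at the first incomplete group
            break
        for record in lfb.records:      # the whole group is applied as a unit
            replica_state.apply(record)
```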

Logical consistency ensures isolation-level correctness for user transactions (Repeatable Read isolation). Row locks are held until commit, while readers can use consistent snapshots without blocking writers.


Ordering and Replication

Each master periodically advertises the location of its latest log records to all others in a lightweight, peer-to-peer fashion. This mechanism is new in Taurus MM. In single-master TaurusDB, the metadata service (via Metadata PLogs) tracked which log segments were active, but not the current write offsets within them (the master itself notified read replicas of the latest log positions). In Taurus MM, with multiple masters generating logs concurrently, each master broadcasts its current log positions to the others, avoiding a centralized metadata bottleneck.
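
A sketch of what such a peer-to-peer advertisement loop might look like; the message layout, interval, and send interface are assumptions, only the idea (each master broadcasts its own latest log positions, with no central metadata bottleneck) comes from the paper.

```python
import time

def advertise_log_positions(self_id, peers, get_positions, interval_s=0.01):
    """Periodically broadcast this master's latest log write positions."""
    while True:
        # positions: {plog_id: latest_write_offset} for this master's active PLogs
        msg = {"master": self_id, "positions": get_positions()}
        for peer in peers:
            peer.send(msg)              # lightweight, best-effort broadcast
        time.sleep(interval_s)
```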

To preserve global order, each master groups its recent log records (updates from multiple transactions and pages) into a log flush buffer (LFB) before sending it to the Log and Page Stores. Because each LFB may contain updates to many pages, and different LFBs may touch unrelated pages, it is unclear which buffer depends on which; the system therefore uses vector timestamps to capture causal relationships between LFBs produced on different masters. Each master stamps an LFB with its current vector clock and also includes the timestamp of the previous LFB, allowing receivers to detect gaps or missing buffers. When an LFB reaches a Page Store, though, this global ordering is no longer needed. The Page Store processes each page independently, and all updates to a page are already serialized by that page's lock and carry their own scalar timestamps (LSNs). The Page Store simply replays each page's log records in increasing LSN order, ignoring the vector timestamp on the LFB. In short, vector timestamps ensure causal ordering across masters before the LFB reaches storage, and scalar timestamps ensure correct within-page ordering after.
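
The two ordering layers could be sketched like this; the field names and the Page Store interface are mine, for illustration.

```python
def build_lfb(clk, prev_lfb_ts, records):
    """Stamp an LFB with the master's vector clock plus the previous LFB's
    timestamp, so receivers can detect gaps or missing buffers."""
    clk.tick()
    return {
        "ts": clk.vector(),         # causal ordering across masters
        "prev_ts": prev_lfb_ts,     # gap detection
        "records": records,         # each record carries a page_id and scalar LSN
    }

def page_store_apply(page_logs, lfb):
    """The Page Store ignores the LFB's vector timestamp: per-page updates are
    already serialized by the page lock, so LSN order per page is enough."""
    for rec in lfb["records"]:
        page_logs.setdefault(rec["page_id"], []).append(rec)
    for recs in page_logs.values():
        recs.sort(key=lambda r: r["lsn"])
```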

For strict transaction consistency, a background thread exchanges full vector (VS) timestamps among masters to ensure that every transaction sees all updates committed before it began. A master waits until its local clock surpasses this merged (element-wise max) timestamp before serving the read, in order to guarantee a globally up-to-date view. If VS clocks were driven by physical rather than purely logical clocks, these wait times could shrink further.
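
A minimal sketch of the read-side wait, assuming the merged timestamp is supplied by the background exchange described above; the polling loop is an illustrative simplification.

```python
import time

def wait_for_consistent_read(clk, merged_ts):
    """Block a strictly consistent read until every component of the local
    VS clock has caught up with the merged (element-wise max) timestamp,
    i.e., until all updates committed before the read began are visible."""
    while any(local < remote for local, remote in zip(clk.vector(), merged_ts)):
        time.sleep(0.0001)   # in practice: wait for incoming logs to advance the clock
```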


Evaluation and Takeaways

Experiments on up to eight masters show good scaling on partitioned workloads and performance advantages over both Aurora Multi-Master (shared-storage, optimistic CC) and CockroachDB (shared-nothing, distributed commit).

The paper compares Taurus MM with CockroachDB using TPC-C–like OLTP workloads. CockroachDB follows a shared-nothing design, with each node managing its own storage and coordinating writes through per-key Raft consensus. Since Taurus MM uses four dedicated nodes for its shared storage layer, while CockroachDB combines compute and storage on the same nodes, the authors matched configurations by comparing 2 and 8 Taurus masters with 6- and 12-node CockroachDB clusters, respectively. For CockroachDB, they used its built-in TPC-C–like benchmark; for Taurus MM, the Percona TPC-C variant with zero think/keying time. Results for 1000 and 5000 warehouses show Taurus MM delivering 60% to 320% higher throughput and lower average and 95th-percentile latencies. The authors also report scaling efficiency, showing both systems scaling similarly on smaller datasets (1000 warehouses), but CockroachDB scaling slightly more efficiently on larger datasets with fewer conflicts. They attribute this to CockroachDB’s distributed-commit overhead, which dominates at smaller scales but diminishes once transactions touch only a subset of nodes, whereas Taurus MM maintains consistent performance by avoiding distributed commits altogether.

Taurus MM shows that multi-master can work in the cloud if coordination is carefully scoped. The VS clock is a general and reusable idea, as it provides a middle ground between Lamport and vector clocks. I think VS clocks are useful for other distributed systems that need lightweight causal ordering across different tasks/components.

But is the additional complexity worth it? Few workloads truly demand concurrent writes across multiple primaries, and Amazon Aurora famously abandoned its own multi-master mode. Still, from a systems-design perspective, Taurus MM contributes a nice architectural lesson.
