Taurus Database: How to be Fast, Available, and Frugal in the Cloud
This SIGMOD’20 paper presents TaurusDB, Huawei's disaggregated MySQL-based cloud database. TaurusDB refines the disaggregated architecture pioneered by Aurora and Socrates, and provides a simpler and cleaner separation of compute and storage.
In my writeup on Aurora, I discussed how the "log is the database" approach reduces network load: the compute primary ships only log records, and the storage nodes apply them to reconstruct pages. Aurora, however, conflates durability and availability somewhat, using quorum-based replication across six replicas for both logs and pages.
In my review of Socrates, I explained how Socrates (the architecture behind Azure SQL DB Hyperscale) separates durability and availability by splitting the system into four layers: compute, log, page, and storage. Durability (logs) ensures data is not lost after a crash; availability (pages/storage) ensures data can still be served while some replicas or nodes are down. Socrates stores pages separately from logs to improve performance, but the extra layering introduces significant architectural overhead.
Taurus takes this further and uses different replication and consistency schemes for logs and pages, exploiting their distinct access patterns. Logs are append-only and used for durability. Log records are independent, so they can be written to any available Log Store nodes. As long as three healthy Log Stores exist, writes can proceed without quorum coordination. Pages, however, depend on previous versions. A Page Store must reconstruct the latest version by applying logs to old pages. To leverage this asymmetry, Taurus uses synchronous, reconfigurable replication for Log Stores to ensure durability, and asynchronous replication for Page Stores to improve scalability, latency, and availability.
But hey, why do we disaggregate in the first place?
Traditional databases were designed for local disks and dedicated servers. In the cloud, this model wastes resources as shown in Figure 1. Each MySQL replica keeps its own full copy of the data, while the underlying virtual disks already store three replicas for reliability. Three database instances mean nine copies in total, and every transactional update is executed three times. This setup is clearly redundant, costly, and inefficient.
Disaggregation fixes this and also brings true elasticity! Compute and storage are separated because they behave differently: compute is expensive and variable, while storage is cheaper and grows slowly. Compute can be stateless and scaled quickly, while storage must remain durable. Separating them allows faster scaling, shared I/O at the storage layer, better resource utilization, and the ability to scale compute down to zero and restart it quickly when needed.
Architecture overview
Taurus has two physical layers, compute and storage, and four logical components: Log Stores, Page Stores, the Storage Abstraction Layer (SAL), and the database front end. Keeping only two layers minimizes cross-network hops.
The database front end (a modified MySQL) handles queries, transactions, and log generation. The master handles writes; read replicas serve reads.
Log Store stores (well duh!) write-ahead logs as fixed-size, append-only objects called PLogs. Each PLog is synchronously replicated across three Log Store nodes. Taurus favors reconfiguration-based replication: if a replica fails or lags, a new PLog is created on healthy nodes and writes continue there. Metadata PLogs track which PLogs are active.
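Here is a minimal Python sketch of reconfiguration-based replication as I understand it. The Log Store client API (create_plog, append, healthy) is invented for illustration and is not the paper's interface.

```python
import uuid

class PLogWriter:
    """Appends log records to the current PLog, replicated on three Log Stores (sketch)."""

    def __init__(self, log_stores, replication=3):
        self.log_stores = log_stores            # pool of Log Store clients (hypothetical API)
        self.replication = replication
        self.current_plog = self._new_plog()

    def _new_plog(self):
        # Pick any N healthy Log Stores; PLogs are independent append-only
        # objects, so no quorum or leader election is needed.
        replicas = [s for s in self.log_stores if s.healthy()][:self.replication]
        plog_id = uuid.uuid4().hex
        for r in replicas:
            r.create_plog(plog_id)
        # A metadata PLog would record the newly active PLog here.
        return plog_id, replicas

    def append(self, log_record):
        plog_id, replicas = self.current_plog
        try:
            # Synchronous replication: durable only when all three replicas ack.
            for r in replicas:
                r.append(plog_id, log_record)
            return plog_id
        except TimeoutError:
            # Reconfigure instead of waiting on the slow or failed node:
            # switch to a fresh PLog on healthy Log Stores and retry.
            self.current_plog = self._new_plog()
            return self.append(log_record)
```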
Page Store materializes/manages 10 GB slices of page data. Each page version is identified by an LSN, and the Page Store can reconstruct any version on demand. Pages are written append-only, which is 2–5x faster than random writes and gentler on flash. Each slice maintains a lock-free Log Directory mapping (page, version) to the log offset holding that version's redo record. Consolidation of logs into pages happens in memory. Taurus originally prioritized the longest chains of unapplied logs, but switched to applying the oldest unapplied writes first to prevent metadata buildup. A local buffer pool accelerates log application. For cache eviction, Taurus finds that LFU (least frequently used) performs better than LRU (least recently used), because it keeps frequently updated pages in cache longer, reducing I/O and improving consolidation throughput.
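To make the slice mechanics concrete, here is a rough sketch with invented structures: an append-only log buffer plus a Log Directory used to reconstruct page versions. apply_redo is a stand-in for real redo logic, records are assumed to be bytes, and a real slice would keep enough history to serve older versions after consolidation.

```python
def apply_redo(page_image, log_record):
    # Placeholder for applying a redo record to a page image.
    return page_image + log_record

class Slice:
    """10 GB slice (sketch): append-only logs plus a directory for page reconstruction."""

    def __init__(self):
        self.log_buffer = []         # append-only log records
        self.log_directory = {}      # page_id -> [(lsn, offset into log_buffer)]
        self.base_pages = {}         # consolidated page images

    def receive_log(self, page_id, lsn, record):
        offset = len(self.log_buffer)
        self.log_buffer.append(record)
        self.log_directory.setdefault(page_id, []).append((lsn, offset))

    def read_page(self, page_id, version_lsn):
        # Reconstruct the requested version by replaying logs up to version_lsn.
        page = self.base_pages.get(page_id, b"")
        for lsn, offset in self.log_directory.get(page_id, []):
            if lsn <= version_lsn:
                page = apply_redo(page, self.log_buffer[offset])
        return page

    def consolidate(self, page_id):
        # Fold pending logs into the base page in memory; prioritizing the
        # oldest unapplied writes keeps directory metadata from piling up.
        self.base_pages[page_id] = self.read_page(page_id, float("inf"))
        self.log_directory[page_id] = []
```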
Storage Abstraction Layer (SAL) hides the storage complexity from MySQL by serving as an intermediary. It coordinates between Log Stores and Page Stores, manages slice placement, and tracks the Cluster Visible LSN, the latest globally consistent point. SAL advances CV-LSN only when the logs are durable in Log Stores and at least one Page Store has acknowledged them. SAL also batches writes per slice to reduce small I/Os.
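My reading of the CV-LSN rule, as a small sketch with invented field names: the visible point advances only up to what is durable on the Log Stores and acknowledged by at least one Page Store replica per slice.

```python
class SAL:
    """Tracks durability acks and advances the Cluster Visible LSN (sketch)."""

    def __init__(self):
        self.log_durable_lsn = 0      # highest LSN acked by the Log Stores
        self.slice_acked_lsn = {}     # slice_id -> highest LSN acked by any Page Store replica
        self.cv_lsn = 0               # Cluster Visible LSN

    def on_log_store_ack(self, lsn):
        self.log_durable_lsn = max(self.log_durable_lsn, lsn)
        self._advance()

    def on_page_store_ack(self, slice_id, lsn):
        prev = self.slice_acked_lsn.get(slice_id, 0)
        self.slice_acked_lsn[slice_id] = max(prev, lsn)
        self._advance()

    def _advance(self):
        # CV-LSN never moves past what is both durable on the Log Stores and
        # acknowledged by at least one Page Store replica of every slice.
        if not self.slice_acked_lsn:
            return
        candidate = min(self.log_durable_lsn, min(self.slice_acked_lsn.values()))
        self.cv_lsn = max(self.cv_lsn, candidate)
```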
Write path and replication
Did you notice the lack of Log Store to Page Store communication in Figure 2 and Figure 3? The paper does not address this directly, but yes, there is no direct Log Store-to-Page Store communication. The SAL in the master mediates instead. SAL first writes logs to the Log Stores for durability. Once they are acknowledged, SAL forwards the same logs to the Page Stores serving the affected slices. This ensures that Page Stores only see durable logs and lets SAL track exactly what each replica has received. SAL monitors each Page Store replica's per-slice persistent LSN, and resends missing logs from the Log Stores if it detects a gap or regression.
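A sketch of this mediation, with helper names that are mine rather than the paper's: logs go to the Log Stores first, then SAL fans the durable logs out to each slice's Page Store replicas and replays from the Log Stores when a replica's persistent LSN falls behind.

```python
def write_logs(sal, log_stores, slice_batches):
    # slice_batches: {slice_id: [(lsn, record), ...]}, already batched per slice by SAL.
    for batch in slice_batches.values():
        log_stores.append_durably(batch)              # 1. make the logs durable first
    for slice_id, batch in slice_batches.items():
        sal.note_sent(slice_id, max(lsn for lsn, _ in batch))
        for replica in sal.page_replicas(slice_id):
            replica.send_logs(batch)                  # 2. forward only durable logs

def repair_lagging_replica(sal, log_stores, slice_id, replica):
    # SAL notices a replica whose persistent LSN has stalled or regressed
    # and replays the missing range from the Log Stores.
    have = replica.persistent_lsn()
    want = sal.last_sent_lsn(slice_id)
    if have < want:
        replica.send_logs(log_stores.read_range(slice_id, have + 1, want))
```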
I think routing everything through SAL adds coupling and complexity. A chain-replication design, where Log Stores streamed logs directly to Page Stores, would simplify the system: SAL wouldn't need to track every Page Store replica's persistent LSN, and log truncation could be driven by the Log Stores once all replicas confirmed receipt, instead of being tracked by SAL.
Read path
Database front ends read data at page granularity. A dirty page in the buffer pool cannot be evicted until its logs have been written to at least one Page Store replica. This ensures that the latest version is always recoverable.
As mentioned above, SAL maintains the last LSN sent per slice. Reads are routed to the lowest-latency Page Store replica. If one is unavailable or behind, SAL retries with others.
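A small sketch of the routing rule, using hypothetical replica methods: prefer the lowest-latency Page Store replica, skip replicas that have not persisted the needed version, and fall back on errors.

```python
def read_page(sal, slice_id, page_id, version_lsn):
    # Prefer the lowest-latency replica, but only use replicas that have
    # already persisted the version we need; fall back to the next one on error.
    replicas = sorted(sal.page_replicas(slice_id), key=lambda r: r.latency())
    for replica in replicas:
        if replica.persistent_lsn() < version_lsn:
            continue                                  # replica is behind
        try:
            return replica.read_page(page_id, version_lsn)
        except (TimeoutError, ConnectionError):
            continue                                  # replica unavailable
    raise RuntimeError("no Page Store replica can serve this version yet")
```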
Read replicas don't stream WAL directly from the master. Instead, the master publishes which PLog holds new updates. Replicas fetch logs from the Log Stores, apply them locally, and track their visible LSN. They don't advance past the Page Stores' persisted LSNs, keeping reads consistent. This design keeps replica lag below 20 ms even under high load and prevents the master from becoming a bandwidth bottleneck.
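Sketched below, again with invented names, is how a replica might advance its visible LSN: pull logs for the newly published PLogs from the Log Stores, apply them locally, and cap visibility at what the Page Stores have persisted.

```python
def replica_catch_up(replica, log_stores, master_state):
    # Pull logs for the PLogs the master has published since our last sync
    # and apply them to the local buffer pool.
    for plog_id in master_state.new_plogs_since(replica.last_plog):
        for lsn, record in log_stores.read_plog(plog_id):
            replica.apply_to_buffer_pool(lsn, record)
        replica.last_plog = plog_id
    # Cap visibility so a read never needs a page version that the
    # Page Stores cannot yet reconstruct.
    replica.visible_lsn = min(replica.applied_lsn(),
                              master_state.min_page_store_persistent_lsn())
```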
Recovery model
If a Log Store fails temporarily, writes to its PLogs pause. For long failures, the cluster re-replicates its data to healthy nodes.
Page Store recovery is more involved. After a short outage, a Page Store gossips with its peers to catch up. For longer failures, the system creates a new replica by copying data from a surviving one. If some recent logs never reached a surviving replica, SAL detects the gap in persistent LSNs and replays the missing records from the Log Stores. Gossip runs periodically but can be triggered early when lag is detected.
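A tiny sketch of the gossip catch-up idea (the methods are hypothetical): after a short outage, a slice replica asks its peers for the log records it missed.

```python
def gossip_catch_up(replica, peers):
    # Catch up from peer replicas of the same slice, identified by our own
    # persistent LSN; SAL-driven replay from the Log Stores covers the rest.
    have = replica.persistent_lsn()
    for peer in peers:
        if peer.persistent_lsn() > have:
            for lsn, record in peer.logs_since(have):
                replica.apply_log_record(lsn, record)
            have = replica.persistent_lsn()
```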
If the primary fails, SAL first ensures that every log record persisted in the Log Stores reaches the Page Stores; this is the redo phase (similar to ARIES). Then the database front end performs undo for in-flight transactions.
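The failover sequence, sketched with hypothetical helpers: a redo pass driven by SAL to close the gap between Log Stores and Page Stores, followed by an undo pass in the front end for uncommitted transactions.

```python
def recover_after_primary_failure(sal, log_stores, frontend):
    # Redo phase, driven by SAL: make sure every log record that is durable
    # on the Log Stores has reached each slice's Page Store replicas.
    for slice_id in sal.slices():
        durable = sal.durable_lsn(slice_id)
        for replica in sal.page_replicas(slice_id):
            have = replica.persistent_lsn()
            if have < durable:
                replica.send_logs(log_stores.read_range(slice_id, have + 1, durable))
    # Undo phase, driven by the database front end (ARIES-style): roll back
    # transactions that were in flight when the primary failed.
    for txn in frontend.in_flight_transactions():
        frontend.rollback(txn)
```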