PolarDB-SCC: A Cloud-Native Database Ensuring Low Latency for Strongly Consistent Reads

This paper from Alibaba group appeared in VLDB'23. It talks about how to perform low latency strongly-consistent reads from secondaries in PolarDB database deployments.

PolarDB adopts the canonical primary secondary architecture of relational databases. The primary is a read-write (RW) node, and the secondaries are read-only (RO) nodes. Having RO nodes help for executing queries, and scaling out in terms of querying performance.

This is essentially the AWS Aurora architecture. Durability is satisfied through shared storage, so we can ignore that and orthogonally focus on the optimizations for improving RO node performance.

The way to improve the RO node performance is by shipping the redo log (essentially WAL) to these RO nodes so they can  keep their buffers ready, and serve reads from the buffer quickly, rather than having to reach out to shared storage.

PolarDB architecture follows the same ideas. On top of this, they are interested in being able to serve strong-consistency reads from RO nodes. The paper says this is needed especially in Alibaba's e-commerce applications, and also for scenarios where the database is used to support interaction among microservices.  

Next, I am going to give some background on strong-consistency reads. Then I am going to talk about different ways you can do strong-consistency reads from the secondaries, and talk about the PolarDB's solution as a special case of this. In my write up I use the terms primary/secondary rather than RW/RO nodes.

Strong-consistency reads

Strong consistency is also known as linearizability. If we simplify things, and restrict the operations to just writes and reads, linearizability boils down to this: "the read should return the last written value" statement.

But what is last written value? Even between the operation-request and operation-response gap, there is interesting behavior to take into account.

Indeed, it is possible to take advantage of this gap between the read-request sent by the client and read-response observed by the client. One way to achieve linearizable reads is by delaying the read operation on the databases side a long time read, and return the response late. This way, if there was an update that was completed/acknowledged just before the read-request, we would have a good guarantee that on whatever node we are doing the read that update has been seen/propagated and included. Other later updates can also be included in that delayed read, but that is OK as far as the linearizability definition goes. The Probabilistic Bounded Staleness (PBS) paper even quantified and analyzed the wait times required to give pretty good guarantees on linearizability.

Of course, abusing this delayed read-execution approach is bad, and becomes counterproductive. Customers want fast responses, not slow reads. What is the soonest we can serialize/serve the read, yet provide guaranteed strong-consistency of the reads?

How do you perform linearizable reads from secondaries?

Let's take a step back and talk about what we know about doing linearizable reads. In our previous work, we had introduced Paxos Quorum Reads to perform linearizable reads only using the secondaries for Paxos/Raft replicated datastores. The idea here is to offload the learner role in Paxos (the phase1 quorum read operation) to the client that wants to get a strongly-consistent read. For this, the client contacts a Phase1 quorum of secondaries, and spares the primary which is busy serving updates/mutations.

PQR performs linearizable reads in two steps: barrier and rinse.
  • We determine the barrier point as the last value that may have been write-acked (and exposed to outside).
  • We then delay serving the read until the barrier point is reached in this secondary to guarantee that we are not serving a stale read.

The good news is that we don't see the need for rinsing often, especially for reading in key-value type data. For the read-keys we are interested, often there is no pending update, and so there is no wait required to server the reads. Check out this post for details.

Well, PQR is still not that efficient, because we are contacting a phase1 quorum of secondaries. Can we do the serializable read from one node?

If we can figure out the barrier point by just talking to a single node, yes, yes we can.

One option is to rely on the client to supply the barrier point, and delay the read on the secondary until the secondary is caught up to that barrier point. But this may not be good practice in many systems, because this is asking too much from the client. The client needs to keep track of this causality token, and pass it for the read without fumbling things. Also technically this won't be a strong-consistency read, if there are multiple clients. One client would not have information about the acks other clients receive, so rather than a strong-consistency read, this would be more of a session-consistency read.

Ok, without punting this to the client, how can we figure out the barrier point, i.e., the most recent "potentially" acknowledged update operation?

One way to realize this idea would be if you could have a front-end service that maintains this information (the last update point per key) say using a hash-table. Then the secondary can query this front-end service to learn what to barrier on, and serve the read only when that secondary is current up to that barrier point. This is pretty much the PolarDB idea. Except, they keep that hash table, which they call the modification tracking table (MTT), at the primary, and let the secondary read that table via uni-directional RDMA to bypass the primary CPU, so as not to burden the primary.  

Another way to realize this idea would be to use synchronized physical clocks. Consider Spanner, which uses TrueTime. To perform a linearizable read, the client can submit the read operation with TrueTime(now). This would be the barrier point we need in the secondary. The secondary can wait on the uncertainty if needed, and respond to the read when it is current with respect to the primary updates including this point. (An example of Hint#5: leverage time for performance.)

PolarDB architecture

Note the hierarchical modification tracking table (MTT) in the figure. This is the mechanism for serving the linearizable read from a single secondary. As we discussed above, it is a way of realizing the barrier and rinse method.

In PolarDB, the MTT is maintained at the primary to track all modifications in the system. The secondary checks the MTT (it reaches to the primary via unidirectional RDMA) to see what point to barrier on with respect to the read keys. This becomes the serialization point of the read. If the secondary already is caught up to this point, it serves the read, otherwise waits until it gets sufficiently current to serve the read.

MTT being hierarchical is just an optimization. MTT tracks the primary nodes latest modification points at three levels: global database's latest modification point, table's latest modification point, and the page level latest modification points. The secondary will first check (via unidirectional RDMA) the global level's watermark, then the requested table's and page's watermarks. Once one level is satisfied, it will directly process the request and will not check the next one. It only needs to wait for the log application when the last level (page level) is unsatisfied. Once a request is satisfied at the global level, there is no need to check the timestamp during following data accesses for this transaction because the whole database is up-to-date for the current request. If a request is only satisfied on one requested table/page, it has to check the timestamp when accessing the different tables/pages.

Log pushing

The MTT and unidirectional RDMA reading of the is the big contribution. The paper also talks about RDMA-based log shipment. But evaluations show that this did not make a big difference.

PolarDB-SCC utilizes the one-sided RDMA for the log shipment to reduce network overhead and to save CPU cycles. Each secondary has a log buffer for the Redo log (as we mentioned in the introduction). The primary node remotely fills these in. The secondary node's log buffer size is the same as the primary's. The primary's log data will always be remotely written to the same offset in the secondary node's log buffer using this circular buffer mechanism.


The paper includes a nice evaluation section. They compare different way of doing reads
  • Default: primary handles all queries in addition to the updates,
  • Stale-read: secondary runs queries immediately serving potentially stale values,
  • Read-wait: secondary checks primary's global timestamp and delays read until it is caught up to that point
  • SCC: secondary uses MTT table and the whole shebang mentioned in the paper
Great success! SCC scheme is almost as fast and as high throughput as the Stale-read scheme but serves linearizable reads.

Check out Jesse's blog for another review of the paper here.


Popular posts from this blog

Learning about distributed systems: where to start?

Hints for Distributed Systems Design

Foundational distributed systems papers

Metastable failures in the wild

Scalable OLTP in the Cloud: What’s the BIG DEAL?

The end of a myth: Distributed transactions can scale

Always Measure One Level Deeper

Dude, where's my Emacs?

There is plenty of room at the bottom

Know Yourself