Asymmetric Linearizable Local Reads
People want data fast. They also want it consistent. Those two wants pull in opposite directions. This VLDB'25 paper takes another crack at the conundrum. Rather than assuming a symmetric network environment where all replicas face similar latencies, the paper emphasizes that in practice some replicas are close to the leader while others are stranded halfway across the globe. By embracing this asymmetry, the authors propose two new algorithms: Pairwise-Leader (PL) and Pairwise-All (PA). Both cut read latency compared to prior approaches. PL can even achieve 50x latency improvements in some cases.
Aleksey and I did our usual thing. We recorded our first blind read of the paper. You can watch it here (link to come soon), if you like seeing two people puzzle through a paper in real time. I also annotated a copy while reading, which you can access here.
We liked the ideas, even though the protocols themselves didn't particularly thrill us. I especially liked finding another good example of the use of synchronized time in distributed database systems. Another study to add to my survey.
Background
At the heart of the problem is linearizability (see my explanation for a primer), a strong consistency condition that ensures operations on a distributed object appear to occur atomically in a single total order consistent with real time. If one operation finishes before another begins, then all replicas must reflect this order. This strong model spares developers from reasoning about concurrency anomalies: if you read after a write, you are guaranteed to observe it. That means there are no stale reads.
While linearizability makes life simple for the developers, it makes it harder for the system, which has to make sure every replica behaves like one machine. This is typically enforced by state machine replication (SMR) protocols such as Paxos or Raft, which order all operations through a leader. This works fine, except for reads. Reads dominate workloads in practice, and forcing every read to consult the leader or coordinate with a quorum introduces unnecessary WAN round trips. To improve efficiency, many systems have proposed linearizable local read algorithms, which allow followers to serve reads locally under certain conditions. However, this is tricky ground. Local reads introduce the risk of staleness, because that replica/follower has not yet applied the latest writes.
And here's the rub: WANs exacerbate the problem. Some replicas are close. Others are hopelessly far. The close ones see fresh data. The far ones lag. Prior work noticed this and tried various tricks to smooth out the unfairness. This paper doesn’t smooth it out. It embraces it as we will see in the coming sections.
The key problem addressed by the paper is: How can linearizable local reads be achieved in a system where replicas are asymmetric in their ability to keep up with the updates?
The Blocking Problem
Over the years, many protocols have been proposed for enabling linearizable local reads. The paper reviews these under three broad categories.
Invalidation-based algorithms (e.g., Megastore, PQL, Hermes): Replicas mark themselves invalid when they receive prepares and regain validity only after commits. But this can lead to unbounded blocking: if the leader issues prepares faster than it gathers acknowledgments in a write-heavy workload, the replicas stay perpetually invalid (see the toy sketch after these three categories).
Eager Stamping algorithms (e.g., CHT): Reads are stamped with the latest prepare index, then block until the corresponding commits arrive. This avoids perpetual invalidation thanks to instance-based tracking of prepares, but it still results in blocking proportional to twice the leader's eccentricity.
Delayed Stamping algorithms (e.g., CockroachDB Global Tables): These use synchronized clocks to assign visibility windows. Here, blocking time depends on the clock skew bound Δ, which in theory is bounded in the worst case by the relative network diameter. That theoretical worst-case bound does not apply if GPS-based clock synchronization (e.g., Google's TrueTime or AWS Timesync) is available, so the paper, in order to make its case, assumes GPS-based synchronization is not available (even though it is actually pretty widely available; see https://muratbuffalo.blogspot.com/2024/12/utilizing-highly-synchronized-clocks-in.html).
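To make the unbounded-blocking issue concrete, here is a toy Python sketch (my own hypothetical rendering, not code from any of these systems) of an invalidation-based follower. It marks itself invalid on each prepare and valid again only when the matching commit arrives, so a leader that pipelines prepares faster than commits can keep local reads blocked indefinitely:

```python
class InvalidationFollower:
    """Toy model of an invalidation-based local-read follower (hypothetical)."""

    def __init__(self):
        self.outstanding_prepares = set()  # prepared indices not yet committed

    def on_prepare(self, index):
        # Receiving a prepare invalidates the replica for local reads.
        self.outstanding_prepares.add(index)

    def on_commit(self, index):
        # The matching commit restores validity for this index.
        self.outstanding_prepares.discard(index)

    def can_serve_local_read(self):
        # Valid only when no prepare is outstanding. Under a write-heavy
        # leader this set may never drain, hence unbounded blocking.
        return not self.outstanding_prepares


f = InvalidationFollower()
f.on_prepare(1)
f.on_prepare(2)          # prepares keep arriving...
f.on_commit(1)
print(f.can_serve_local_read())   # False: still invalid because of index 2
```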
Ok, I owe you an explanation of the delayed stamping approach. But first let me set this up using the paper's unifying stop/go events framework. I like this framing: when you name something, you own it. In this model, each index i on each replica has two events:
- A stop event at which the replica stops assigning reads to indices less than i.
- A go event at which the replica can safely serve reads at i.
This abstraction ultimately guarantees linearizability by ensuring that, across all replicas, all go events for i occur at or after all stop events for i. The framework is clean and clarifies why blocking occurs.
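Here is the safety condition of the stop/go framework as I read it, written as a small Python check (names and units are mine):

```python
def stop_go_safe(stop_times, go_times):
    """Check the stop/go safety condition: for every index i, every replica's
    go event for i must happen at or after every replica's stop event for i.
    stop_times[r][i] and go_times[r][i] are event times (hypothetical units)."""
    indices = {i for events in stop_times.values() for i in events}
    for i in indices:
        latest_stop = max(events[i] for events in stop_times.values())
        earliest_go = min(events[i] for events in go_times.values())
        if earliest_go < latest_stop:
            return False   # some replica would serve i before another stopped
    return True


# Example with two replicas and one index: safe, because both go events (12, 15)
# occur after both stop events (9, 10).
print(stop_go_safe({"r1": {1: 9}, "r2": {1: 10}},
                   {"r1": {1: 12}, "r2": {1: 15}}))   # True
```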
Now let's walk through Fig. 2 and explain the two approaches using the stop/go framework.
Eager Stamping (Fig. 2a). A follower stamps a read with the highest index i it has seen a prepare for (the stop event). It can only apply the read after it has received all commits up to i (the go event). The leader is special here: since it is always up to date, its stop and go events collapse into one.
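A minimal sketch of the eager-stamping read path at a follower (my own simplification, hypothetical names):

```python
import threading


class EagerStampingFollower:
    """Toy eager-stamping follower: stamp a read with the highest prepared
    index (stop event), then block until commits have caught up (go event)."""

    def __init__(self, state):
        self.state = state            # the replicated key-value state
        self.highest_prepared = 0
        self.highest_committed = 0
        self.cv = threading.Condition()

    def on_prepare(self, index):
        with self.cv:
            self.highest_prepared = max(self.highest_prepared, index)

    def on_commit(self, index):
        with self.cv:
            self.highest_committed = max(self.highest_committed, index)
            self.cv.notify_all()

    def read(self, key):
        with self.cv:
            stamp = self.highest_prepared                              # stop event
            self.cv.wait_for(lambda: self.highest_committed >= stamp)  # go event
            return self.state.get(key)


f = EagerStampingFollower({"x": 1})
print(f.read("x"))   # no outstanding prepares, so the read returns immediately
```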
Delayed Stamping (Fig. 2b). This approach uses synchronized clocks to decouple stop and go events from message arrival. When the leader accepts an update, it assigns it a future visibility time t+α, where the visibility delay α is derived from estimated commit time of the update. Followers stop stamping reads with indices less than i once that visibility time has passed on their clocks. They then apply the read after an additional Δ (the clock skew/uncertainty) has elapsed. Unlike Eager Stamping, the leader does not have special stop and go events; it also follows this visibility rule. I had talked about this earlier when explaining CockroachDB's global transactions and the Aurora Limitless and DSQL future timestamping.
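And a matching sketch of delayed stamping, with the visibility delay α and the clock-skew bound Δ as explicit parameters (the values here are made up):

```python
class DelayedStampingReplica:
    """Toy delayed-stamping replica: stop/go are driven by clock readings
    against per-index visibility times, not by message arrivals."""

    def __init__(self, skew_bound):
        self.skew_bound = skew_bound   # Δ: assumed clock-skew bound
        self.visibility = {}           # index -> visibility time (t + alpha)

    def on_update(self, index, leader_time, alpha):
        # The leader assigns each update a future visibility time t + alpha,
        # where alpha is derived from the update's estimated commit time.
        self.visibility[index] = leader_time + alpha

    def stamp(self, now):
        # Stop event: once an index's visibility time has passed locally,
        # reads may no longer be stamped below it.
        return max((i for i, t in self.visibility.items() if now >= t), default=0)

    def go_time(self, index):
        # Go event: safe to serve the read Δ after the visibility time,
        # covering the clock uncertainty across replicas.
        return self.visibility[index] + self.skew_bound


r = DelayedStampingReplica(skew_bound=0.005)       # Δ = 5ms (hypothetical)
r.on_update(1, leader_time=100.0, alpha=0.050)     # visible at t = 100.050
print(r.stamp(now=100.060), r.go_time(1))          # stamps index 1; go at 100.055
```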
Table 2 summarizes the worst-case blocking time at followers for the three categories of algorithms mentioned above and for the two new algorithms introduced in this work (which we explain next).
I would be remiss (as is the paper) if I did not emphasize a common flaw shared across all these algorithms: the leader requires acknowledgments from all nodes (rather than just a quorum) before it can commit a write! If you want a local linearizable read from a single node, the write protocol is forced into a write-all model to ensure that the one-node read quorum intersects with the write quorum. This design hurts both availability and tail latency for the algorithms in Table 2. In contrast, in our Paxos Quorum Reads (PQR 2019) work we avoided the write-all model: PQR used only LSN tracking and quorum acknowledgments, and needed no clocks. I discuss PQR at the end of this post.
Pairwise-Leader (PL)
The central idea of PL is that blocking time should depend on how far a replica is from the leader: replicas close to the leader get near-zero latency, while distant ones may fare worse than before. To get there, the paper first introduces a stepping-stone algorithm (Figure 3), which it then refines into the full PL algorithm in Figure 4. The stepping-stone's trick is to deliberately delay prepare messages so that acknowledgments from all replicas reach the leader at the same time, which reduces blocking time for followers closer to the leader. Specifically, this ensures that for any replica, the gap between its prepare and commit messages is just its round-trip distance to the leader. That alone already improves over older algorithms that tied blocking to the full network diameter.
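The delay computation in the stepping-stone is simple. Here is a toy version (made-up RTTs) of how a leader could stagger its prepare sends so that all acknowledgments land back at the same time:

```python
def prepare_hold_times(rtt):
    """Given round-trip times from the leader to each follower (hypothetical
    values, in seconds), return how long to hold each follower's prepare so
    that all acknowledgments reach the leader simultaneously. Each follower's
    prepare-to-commit gap then equals its own RTT to the leader."""
    slowest = max(rtt.values())
    return {follower: slowest - r for follower, r in rtt.items()}


# Example: followers at 10ms, 40ms, and 120ms RTT from the leader.
print(prepare_hold_times({"near": 0.010, "mid": 0.040, "far": 0.120}))
# -> roughly {'near': 0.11, 'mid': 0.08, 'far': 0.0}
```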
PL then builds on this stepping-stone by further decoupling stop and go events, borrowing the spirit of Delayed Stamping but applying it pairwise. Instead of relying on synchronized clocks, PL introduces a new event scheduling primitive that ensures a replica's stop event happens just before a visibility time, and its go event just after. Figure 4 illustrates this: each replica's worst-case blocking time becomes exactly twice the relative message delay between the leader and itself (see Table 1 for notation). In other words, nearby replicas get fast, almost instant reads, while the cost for distant ones reflects only their distance from the leader.
PL introduces a new pairwise event scheduling/synchronization primitive: instead of requiring global clock synchronization, the leader coordinates stop and go events directly with each follower. This scheduling ensures stop/go events happen at predictable real-time offsets relative to the leader's visibility time, while exploiting known lower bounds on delays to followers to maintain correctness. Unfortunately, the drawback is that the delays to followers need to be reliable/predictable for correctness to hold. I discuss this problem at the end of the post.
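To see the resulting asymmetry concretely, here is a toy comparison of the worst-case blocking bounds from Table 2, using made-up one-way delays: eager stamping blocks every follower for twice the leader's eccentricity, while PL blocks each follower for only twice its own delay to the leader.

```python
def worst_case_blocking(one_way_delay_from_leader):
    """Toy comparison of worst-case follower blocking (per the Table 2 bounds).
    one_way_delay_from_leader: hypothetical one-way delays, in seconds."""
    eccentricity = max(one_way_delay_from_leader.values())
    eager = {f: 2 * eccentricity for f in one_way_delay_from_leader}   # Eager Stamping
    pl = {f: 2 * d for f, d in one_way_delay_from_leader.items()}      # Pairwise-Leader
    return eager, pl


eager, pl = worst_case_blocking({"near": 0.002, "mid": 0.030, "far": 0.100})
print(eager)   # every follower: 200ms
print(pl)      # near: 4ms, mid: 60ms, far: 200ms -- a 50x win for the nearby
               # follower, the kind of gain behind the paper's headline number
```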
Pairwise-All (PA)
PL optimizes aggressively for leader-adjacent replicas, but it penalizes distant ones. PA extends the pairwise trick to all replicas using all-to-all communication as shown in Figure 6.
PA's stepping-stone in Figure 6 works like PL's but shifts the synchronization target: instead of aligning acknowledgments at the leader, it delays prepare messages so that they all arrive at every replica at the same time. Each process also sends acknowledgments to all others, not just the leader. The effect is that every replica can commit once its own eccentricity time has passed since receiving a prepare. As a result, the worst-case read blocking time for each replica is its relative eccentricity.
To further reduce blocking, PA applies the same decoupling (delayed stamping) idea as PL but with all replicas aligned at the visibility time (as shown in Figure 7). The leader schedules stop events near this visibility point, and each process waits for stopped events from others before issuing its own go. Since a go event is the maximum of all stopped events, correctness holds regardless of scheduling accuracy. This ensures that every replica's worst-case blocking time is bounded by its eccentricity rather than its leader-distance. In practice, that means nearby replicas don't get PL's extreme gains, but distant ones aren't punished.
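What I like about the PA version is that the go event is just a wait on having heard a stop from everyone, so a mistimed schedule only costs latency, never correctness. A toy rendering of that rule (my paraphrase, hypothetical names):

```python
def can_go(index, stops_received, replicas):
    """Toy version of PA's go rule: a replica may serve reads at `index` only
    after receiving a stopped notification for `index` from every replica,
    itself included. The go event is effectively the maximum of all stop
    events, so scheduling inaccuracy cannot violate linearizability."""
    return replicas <= stops_received.get(index, set())


replicas = {"A", "B", "C"}
stops_received = {7: {"A", "B"}}
print(can_go(7, stops_received, replicas))   # False: still waiting on C's stop
stops_received[7].add("C")                   # C's stop arrives late...
print(can_go(7, stops_received, replicas))   # True: ...the go just happens later
```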
That's it, that's the story. Asymmetry is real. PL exploits it ruthlessly. PA makes it fair. If you're close to the leader, PL is your friend. If you're far, PA keeps you from suffering.
Discussion
Both PL and PA assume stable latencies and non-faulty processes. The paper sketches how to tolerate failures, but variance and reconfiguration remain open issues. The funny thing is that, to justify the assumption of predictable network latencies, the authors cite Aleksey's paper: "Cloudy Forecast: How Predictable is Communication Latency in the Cloud?" Ironically, that paper shows the opposite: the variance can be massive, up to 10x the median in a WAN setup, and 3000x in a same-AZ setup! So either they didn't read it carefully, or they cited it for sport. Cloud networks aren't that tame and predictable for tenants.
The treatment of clock skew Δ is also odd. The paper insists Δ must be proportional to the network diameter, but that's a theoretical result, and I don't know how well it applies even to NTP-based synchronization. Moreover, in practice, GPS clocks exist, and AWS Timesync provides 50 microsecond clock uncertainty. Why not use these? The paper explicitly disallows GPS clocks to make the results from PL and PA look more favorable. A comparison against synchronized clocks would have been valuable. With clocks, blocking could be under a millisecond (as we designed in AWS DSQL) with just delayed timestamping, which would not only be much simpler but would also significantly beat anything from PL and PA.
Our PQR work (2019) also tackled linearizable non-leader reads. You can also frame it as local reads, though PQR used multiple nodes. The key idea in PQR is to involve the client: The client contacts a quorum of nodes, usually gets a linearizable read in one shot, and in the rare case of an ongoing update, waits briefly and completes with a callback. PQR required no synchronized clocks and worked in a fully asynchronous model using only LSNs. It fits naturally in this space.
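For contrast, here is roughly what a PQR-style client read looks like (a from-memory simplification of our 2019 design, not the exact protocol):

```python
def pqr_read(replies):
    """Sketch of a Paxos Quorum Read. `replies` come from a read quorum of
    replicas; each carries the highest LSN it has accepted, the highest LSN it
    has applied, and its value at that applied LSN. No clocks involved."""
    highest_accepted = max(r["accepted_lsn"] for r in replies)
    freshest = max(replies, key=lambda r: r["applied_lsn"])
    if freshest["applied_lsn"] >= highest_accepted:
        return freshest["value"]     # common case: linearizable read in one shot
    # Rare case: a write at highest_accepted is still in flight. The client
    # briefly waits and completes via a callback once some replica applies it.
    return ("retry", highest_accepted)


# Common case: no update in flight, so the freshest reply wins immediately.
print(pqr_read([{"accepted_lsn": 5, "applied_lsn": 5, "value": "v5"},
                {"accepted_lsn": 5, "applied_lsn": 4, "value": "v4"}]))   # v5
```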