Linearizable Quorum Reads in Paxos

While there has been a lot of work on Paxos protocols, there has not been any study that considers the read operation in Paxos protocols thoroughly. Read operations do not mutate state, and in many applications the read operations outnumber the update operations.

Traditionally, there have been three main ways to perform reads.
  1. Treat the read as a regular command, let the leader clear it with a quorum, and return the response
  2. Use a lease on the leader (to prevent another leader emerging and committing an update), and read from the leader
  3. Read from one of the replicas
The first two approaches provide linearizable reads, but the last method is non-linearizable. For example, in ZooKeeper or Raft if you read from a replica, it returns stale values, because the leader commits first---after hearing from a quorum of replicas--- and the replicas follow behind. (Linearizability means that the distributed system emulates a single register where each client can read or write from that register. Each operation appears to have occurred instantaneously between the time when it is invoked and the time when it produces a response.)

While the first two approaches provide linearizable reads, they both involve the leader for read operations. However, the leader is already overwhelmed with write operations: for each write it is doing disproportionately large work. Involving the leader with the reads magnifies the bottleneck at the leader.

Paxos Quorum Reads

To solve this problem and provide linearizable reads in Paxos without involving the leader, we (Aleksey Charapko, Ailidani Ailijiang, and Murat Demirbas) have introduced Paxos Quorum Reads (PQR) in a recent work. PQR can work in an asynchronous setup without requiring leases or involving the leader.

A client multicasts the quorum-read request to a majority quorum of replicas and waits for their replies. Each reply message contains three fields, the highest accepted slot number $s$, the highest applied slot number $\underline{s}$, and the value $v$ of the object requested. (There are four possible states of a slot: empty, accepted $v$, committed $\hat{v}$, and applied $\underline{v}$.)

We say that a read is clean if a quorum of replicas return the same slot number for $s$ and $\underline{s}$ for which the requested data item is last updated. In this case, a quorum read is successful and the learned value can be returned immediately. Intuitively, a clean read means it not possible for the leader to have a higher committed value $\hat{s}$ than any of the replicas.

Note that it is possible for some replicas to return $s=\underline{s}=x$ and some replicas to return $s=\underline{s}=x+k$. This is still a clean read, and the client uses $s=x+k$, the higher value, as the clean read value. This read is clean, because it is impossible for the leader to have $\hat{s}$ greater than $s=x+k$. If that was the case, the read quorum would have returned at least one node that has $s$ greater than $\underline{s}$, violating the clean read.

Accordingly, a read is dirty if at least one node has seen a higher slot number in accepted state $s>\underline{s}$. The value learned from dirty read is unsafe to return as the same value is not guaranteed to be seen from subsequent reads. Therefore, a second phase of read is required to confirm that such a slot is finalized/cleaned. For the second phase, the rinse phase, the client retries any replica to see if $s$ is now applied/executed. If so, the read is completed as the current value in slot $s$. It is possible to perform this second phase as a callback from a replica to the client as well.

There are several optimizations possible over this basic scheme, and apply this to many Paxos flavors. We are currently investigating those optimizations.

To recap, the important thing about PQR is that it helps balance the load between Paxos leader and replicas. Relieving the leader from serving reads allows it to serve more writes. Reads use underutilized replicas and are performed by clients. This way PQR improves throughput, especially in write-heavy workloads. The figure shows that with a 75% writes workload, PQR  is able to provide better latency and higher maximum throughput.

You can read more about the PQR method in our HotStorage paper.


arthur said…
hi murat, I've read the paper and got 2 questions, hope can get your kind reply:

1. in table 1 of the paper, #messages handled by the client under Paxos Quorum Read is 2(1-r) + 2r(n/2+1), I know 2r(n/2+1) stands for the #read messages, but if 2(1-r) stands for the write, shouldn't it be 2 just as the other two scenarios? When client doing write, everything it need to do is just send and wait for the response.

2. in the rinse phase, does the client know the result of the ongoing write by periodically querying one of the followers ? I can't figure out other ways to let the client know that result. Do sth periodically does not sound like a good way.
Murat said…
Hi Arthur,
1. r is the probability of read operation, between 0 and 1. If r is 0, all operations are write operations (i.e., the probability of write operations become 1-r), and PQR client load would be 2. But if r is 1, there won't be any write operations, and the load of the client is only determined by the second part, 2r(n/2+1)
2. Picking one, probing and waiting works, unless that follower crashes. (That could be handled by re-trying after a long timeout.) As we mention in the paper rinse can be also proactively done by one of the followers: A follower noticing gap between accept and committed/applied, can do a callback to the client when the bartering condition is met.
arthur said…
Hi Murat, got another question, in this scenario:
1> client A issue a write request for a key to the leader.
2> client B issue a read request for the same key to the followers.
3> the followers check theirs local state and gave responses to B.
4> leader received the write request from A, did the multi-paxos process successfully and return a success msg to A. Here we can say that A finished writing with new value.
5> B received the responses and saw a clean read. But in fact the value has already been changed.

In total:
A issue write --> B issue read --> A finish writing --> B read an old value.

Isn't this a linearizability violation?

Murat said…
Hi Arthur.
No that is not a linearizability violation, because B issues the read before A finishes writing.
If B issued the read after A's write is acknowledged, yet B still read the old value, then it would be a linearizability violation.

Popular posts from this blog

Learning about distributed systems: where to start?

Hints for Distributed Systems Design

Foundational distributed systems papers

Metastable failures in the wild

The demise of coding is greatly exaggerated

Scalable OLTP in the Cloud: What’s the BIG DEAL?

The end of a myth: Distributed transactions can scale

SIGMOD panel: Future of Database System Architectures

Why I blog

There is plenty of room at the bottom