Saturday, September 7, 2019

Linearizable Quorum Reads in Paxos

While there has been a lot of work on Paxos protocols, there has not been any study that considers the read operation in Paxos protocols thoroughly. Read operations do not mutate state, and in many applications the read operations outnumber the update operations.

Traditionally, there have been three main ways to perform reads.
  1. Treat the read as a regular command, let the leader clear it with a quorum, and return the response
  2. Use a lease on the leader (to prevent another leader emerging and committing an update), and read from the leader
  3. Read from one of the replicas
The first two approaches provide linearizable reads, but the last method is non-linearizable. For example, in ZooKeeper or Raft if you read from a replica, it returns stale values, because the leader commits first---after hearing from a quorum of replicas--- and the replicas follow behind. (Linearizability means that the distributed system emulates a single register where each client can read or write from that register. Each operation appears to have occurred instantaneously between the time when it is invoked and the time when it produces a response.)

While the first two approaches provide linearizable reads, they both involve the leader for read operations. However, the leader is already overwhelmed with write operations: for each write it is doing disproportionately large work. Involving the leader with the reads magnifies the bottleneck at the leader.

Paxos Quorum Reads

To solve this problem and provide linearizable reads in Paxos without involving the leader, we (Aleksey Charapko, Ailidani Ailijiang, and Murat Demirbas) have introduced Paxos Quorum Reads (PQR) in a recent work. PQR can work in an asynchronous setup without requiring leases or involving the leader.

A client multicasts the quorum-read request to a majority quorum of replicas and waits for their replies. Each reply message contains three fields, the highest accepted slot number $s$, the highest applied slot number $\underline{s}$, and the value $v$ of the object requested. (There are four possible states of a slot: empty, accepted $v$, committed $\hat{v}$, and applied $\underline{v}$.)

We say that a read is clean if a quorum of replicas return the same slot number for $s$ and $\underline{s}$ for which the requested data item is last updated. In this case, a quorum read is successful and the learned value can be returned immediately. Intuitively, a clean read means it not possible for the leader to have a higher committed value $\hat{s}$ than any of the replicas.

Note that it is possible for some replicas to return $s=\underline{s}=x$ and some replicas to return $s=\underline{s}=x+k$. This is still a clean read, and the client uses $s=x+k$, the higher value, as the clean read value. This read is clean, because it is impossible for the leader to have $\hat{s}$ greater than $s=x+k$. If that was the case, the read quorum would have returned at least one node that has $s$ greater than $\underline{s}$, violating the clean read.

Accordingly, a read is dirty if at least one node has seen a higher slot number in accepted state $s>\underline{s}$. The value learned from dirty read is unsafe to return as the same value is not guaranteed to be seen from subsequent reads. Therefore, a second phase of read is required to confirm that such a slot is finalized/cleaned. For the second phase, the rinse phase, the client retries any replica to see if $s$ is now applied/executed. If so, the read is completed as the current value in slot $s$. It is possible to perform this second phase as a callback from a replica to the client as well.

There are several optimizations possible over this basic scheme, and apply this to many Paxos flavors. We are currently investigating those optimizations.

To recap, the important thing about PQR is that it helps balance the load between Paxos leader and replicas. Relieving the leader from serving reads allows it to serve more writes. Reads use underutilized replicas and are performed by clients. This way PQR improves throughput, especially in write-heavy workloads. The figure shows that with a 75% writes workload, PQR  is able to provide better latency and higher maximum throughput.

You can read more about the PQR method in our HotStorage paper.

No comments:

Two-phase commit and beyond

In this post, we model and explore the two-phase commit protocol using TLA+. The two-phase commit protocol is practical and is used in man...