## Tuesday, January 29, 2019

### Paper review: Probabilistically Bounded Staleness for Practical Partial Quorums

There is a fundamental trade-off between operation latency and data consistency in distributed database replication. The PBS paper (VLDB'12) examines this trade-off for partial quorum replicated data stores.

## Quorum systems

We can categorize quorum systems into strict versus partial quorums. Strict quorum systems ensure strong consistency by ensuring that read & write replica sets overlap: $R + W > N$. Here N is the total number of replicas in the quorum, R is the number of replicas that need to reply to a read query, and W is the number of replicas that need to reply to a write query.

Employing partial quorums can lower latency by requiring fewer replicas to respond, but R and W need not overlap: $R+W \leq N$. Such partial quorums offer eventual consistency.

Here is a visual representation of an expanding quorum system. The coordinator forwards a write requests to all N replicas, and wait for W acknowledgements for responding back to the client for completion of the write. The quorum system is called expanding because the third replica will also get the write request soon even though the coordinator waits for only W=2 acknowledgements to confirm the write as completed. Similarly the coordinator also forwards the read request to N nodes, and responds back to the client with the highest versioned read when responses from R replicas are received.

Many quorum-replicated data stores, such as Apache Cassandra, Basho Riak, and Project Voldemort offer a choice between strict quorums with strong consistency and partial quorums with eventual consistency. Cassandra often defaults to a partial/non-strict quorum system with N=3, R=W=1, for maximum performance. While Riak defaults to a strict quorum system with N=3 and R=W=2,  users suggest using  R=W=1, N=2 for low-value data. Finally, for applications requiring very low latency and high availability, LinkedIn deploys Voldemort with N=3 and R=W=1. (Unlike Dynamo style systems, Voldemort sends read requests to R of N replicas--not N of N--; this decreases load per replica and network traffic at the expense of read latency and potential availability.)

In Cosmos DB, inside a region, we offer quorums with N=4, W=3, and allow the user to choose R=1 for session-consistency, consistent-prefix, and eventual-consistency reads, and R=2 for strong-consistency and bounded-staleness consistency reads. Cosmos DB also provides these 5 well-defined consistency levels for global replication and across region reads.

Quorum staleness
How consistent is eventual? For the average case, can we offer staleness bounds with respect to version history and time? The Probabilistically Bounded Staleness (PBS) paper investigates this problem quantify the probability of staleness for partial quorums across versions via k-staleness and time via t-visibility metrics.

Let's start with some basic math to quantify staleness. What is staleness probability? It is the probability that the read quorum does not contain the last written version. We can obtain this by dividing the number of quorums of size $R$ composed of nodes that were not written to in the write quorum by the number of all possible read quorums:

$p_s = \frac{{{N-W} \choose R}}{{N \choose R}}$

For N=3, R=W=1, this probability comes to $p_s$=2/3, that is 0.666. But this is staleness with respect to the latest written version and with an immediate read after the write. We can generalize this staleness formula in both dimensions, with respect to k-versions (in lieu of latest version), and with respect to t unit time delayed read (in lieu of immediate, t=0, read).

#### k-staleness

A system obeys *k-staleness* consistency with probability $1-p_{sk}$ if at least one value in any read quorum has been committed within k versions of /the latest committed version when the read begins/.

Given the probability of a single quorum non-intersection p, the probability of non-intersection with one of the last $k$ independent quorums is $p^k$. (Note that this calculation ignores the effects of expanding quorums and constitutes a lower bound.)

$p_{sk} = \left( \frac{{{N-W} \choose R}}{{N \choose R}} \right)^k$

#### t-visibility

A system obeys t-visibility with probability $1-p_{st}$ if any read quorum started at least t units of time after a write commits returns at least one value that is at least as recent as that write.

Let $P_w$ ($W_r$,t) denote the cumulative density function describing the number of replicas W_r that have received version v exactly t time after v commits. For expanding quorums, W replicas have the value with certainty, and we can model t-visibility by summing the conditional probabilities of each possible $W_r$:

$p_{st} = \frac{{{N-W} \choose R}}{{N \choose R}} + \sum\limits_{c \in (W,N]} (\frac{{{N-c} \choose R}}{{N \choose R}} * [P_w(c+1,t)-P_w(c,t)])$

#### <k,t>-staleness

A system obeys <k,t>-staleness consistency with $1-p_{skt}$ if  at least one value in any read quorum will be within k versions of the latest committed version when the read begins, provided the read begins t units of time after the previous k versions commit.

$p_{skt} = \left( \frac{{{N-W} \choose R}}{{N \choose R}} + \sum\limits_{c \in (W,N]} (\frac{{{N-c} \choose R}}{{N \choose R}} * [P_w(c+1,t)-P_w(c,t)]) \right)^k$

Note that k-staleness equals  <k,0>-staleness consistency, and t-visibility equals <1,t>-staleness consistency.

## Monte Carlo modeling of t-staleness

Since t-staleness formula depends on $P_w$ the cumulative density function describing the expanding write quorums (i.e., anti-entropy), it is easier to analyze t-staleness using Monte Carlo simulations. So we first model the quorum systems using the *WARS* latency distributions in the operations, and then quantify the t-staleness.

The read coordinator will return stale data if the first R responses received reached their replicas before the replicas received the latest version (delayed by *W*). Note that for a strict quorum, where R+W>N, returning stale data is impossible, because R will intersect a replica that has seen the latest write. For the partial quorum systems, the probability of the staleness depends on the latency distributions on *W*, *A*, *R*, and also indirectly on *S*.

Let wt denotes the commit time (i.e., when the coordinator received W acks). A single replica's response is stale if r' + wt + t < w', for w' drawn from *W* and r' drawn from *R* latency distributions. Of course writes expand to additional replicas during *A* and *R*, and that helps.

We can see from this formulation that longer *W* tails and faster reads increase the chance of staleness due to reordering. Dually, for improved consistency, it helps to:

• reduce variance for *W* write-req from coord to replicas
• increase *A* write-reply from replicas to coord
• increase *R* read-request from coord to replicas
• reduce variance for *S* read-respond from replicas to coord

(The effect of *S* is indirect and is very small. If S is very high variance, then stale reads may get returned faster than fresh reads. So by reducing the variance on *S*, you increase the chance of reordering of a fresher read to get returned faster.)

#### Monte Carlo simulations

Calculating t-visibility for a given value of t is straightforward using Monte Carlo simulations.

1. Draw N samples from *W*, *A*, *R*, and *S* at time t,
2. Compute wt as the Wth smallest value of {*W[i] + A[i]*, i \in [0, N )}
3. Check if the first R samples of *R*, ordered by *R[i] + S[i]* all satisfy $wt+R[i]+t \leq W[i]$

The paper uses exponential latency distribution for some Monte Carlo simulations, because exponential distributions are simple. An exponential distribution describes the time between events in a Poisson point process, i.e., a process in which events occur continuously and independently at a constant average rate. The cumulative distribution function (CDF) is given as $F(x;\lambda) = 1- e^{-\lambda*x}$,  for $x \geq 0$, which leads to the  Mean = $\frac{1}{\lambda}$, and Variance= $\frac{1}{\lambda^2}$.

The PBS webpage provides an interactive demonstration of Monte Carlo simulations using the *WARS* model with exponential distributions. The demo gives you a better understanding of the effects of *WARS* distribution and t on consistency.

## Write-latency distribution effects

In order to illustrate the effects of *W*, write-latency distribution, the paper fixes *A=R=S* with $\lambda$=1, and sweeps a range of *W* distributions by changing its $\lambda$.

As expected, we find that high write variance *W* increases staleness:

• When the variance of *W* is 0.0625ms ($\lambda$= 4, mean .25ms, one-fourth the mean of *A=R=S*), we observe a 94% chance of consistency immediately after the write and 99.9% chance after 1ms
• When the variance of *W* is 100ms ($\lambda$ = .1, mean 10ms, ten times the mean of *A=R=S*), we observe a 41% chance of consistency immediately after write and a 99.9% chance of consistency only after 65ms

As the variance and mean of W increases, so does the inconsistency. Looking at distributions with fixed means and variable variances (uniform, normal), the paper finds that the mean of *W* is less important than its variance if *W* is strictly greater than *A=R=S*.

## Using production traces

Instead of just providing simulations with exponential distributions, the paper also maps these simulations to production deployments, by first fitting  production latency distributions (obtained from LinkedIn and Yammer deployments) to distributions. It looks like Pareto distributions fit the latency distributions better for most cases.

LNKD-SSD versus LNKD-DISK comparisons provide a good validation for the PBS finding that reducing *W* variance contributes most for the consistency. The figure shows that SSDs improve consistency immensely due to their reduced write variance. Immediately after write commit, LNKD-SSD had a 97.4% probability of consistent reads, reaching over a 99.999% probability of consistent reads after 5ms.

Another thing to notice is that R=2 & W=1 (the blue square plot) blows the socks off of R=1 & W=2 (the green circle plot).  Why? Aren't these suppose to be symmetrical? My explanation is this. By increasing W by 1, you incur only a very little latency (assuming the variance on *W* is not large) and in return, you get to hit two replicas with a read, which exponentially decreases the probability of both replicas missing the latest version.

Why is this not more widely adapted in partial quorum systems? W=1 makes you vulnerable to a data loss but only if both the replica and the coordinator crashes at the same time and the coordinator did not have chance to forward the write to other replicas even though it had acknowledged the write (not very plausible). Even with W>1, reading from more replicas improves consistency quickly, so it is a low hanging fruit to reap when the performance requirements don't forbid it.

Figure 7 shows how varying N affects t-visibility while maintaining R=W=1. As expected, the probability of consistency immediately after write commit decreases as N increases. But you can see that SSDs totally rock! Even with increased N, they keep a very high consistency probability thanks to the very low variance on *W* write latency across replicas.

Finally, Table 4 compares t-visibility required for a 99.9% probability of consistent reads to the 99.9th %ile read and write latencies. The table shows that lowering values of R and W can greatly improve operation latency and  t-visibility can be low even when we require a high probability of consistent reads. Again note how much an improvement W=1 & R=2 provides over W=2 & R=1! That makes a big difference.

1. Is it possible to change the read API to capture sweetpoints in tradeoff between consistency and read/write latency??

There is a knee for large $\lambda$ (i.e.,  small mean & variance). Waiting till the knee gives you the most bang-for-the-buck for consistency, and waiting after the knee helps less.

What if we fix time waited on a read, instead of R, the number of replicas to read from? This will prevent the coordinator from accepting the response from an early stale read-reply as sufficient. The coordinator can wait the period the SLAs (or cut that short if another read reply is received), and this will avoid falling for an early stale read reply.

2. By working from first principles, can we find an unexplored part of the state space to improve consistency??

We saw that for improved consistency, it helps to

• reduce variance for *W* write-req from coord to replicas
• increase *A* write-reply from replicas to coord
• increase *R* read-request from coord to replicas

What are some alternative ways to satisfy these conditions?

If we take this to logical extremes, it is better to keep the replicas close to each other (in the same cluster in LAN, or in nearby regions in WAN), and away from the client. This setup reduces *W* variance, and increases *A* and *R* durations.

I wonder if we can find other unexplored points in the space.

3. Why don't we use PBS analysis in WAN to help cloud clients decide on which regions to deploy??

I had mentioned that Azure Cosmos DB supports clients to configure the default consistency level on their database account (and later override the consistency on a specific read request). For all four relaxed consistency levels (bounded, session, consistent-prefix, and eventual), among other metrics, Cosmos DB also tracks and reports the probabilistic bounded staleness (PBS) metric, which I think is unique among available solutions.

I think this use of PBS can be extended to guide customers decide on which regions to deploy. In the Azure cloud, customers can deploy among 50+ regions, and the selection of the regions have implications for latency and consistency tradeoffs if relaxed consistency levels are chosen. Moreover,  Cosmos DB does not restrict the client to a single write region and allows multiple write-regions and resolves conflicts by way of an Arbiter and anti-entropy mechanism. So PBS metrics can also be used to get the clients get the most out of this by choosing optimal deployment regions for the access patterns. I will be looking at this in the coming weeks.