Availability in Globally Distributed Storage Systems

- November 12, 2010

This paper by Google Research provides a report on the availability behavior of large cloud storage systems by studying up to 7K nodes at Google over a year. The paper does not propose any new protocols, but provides sufficiently accurate models of system behavior/performance based on the study. These models are important since they enable us to correctly design and optimize these multi layered systems for data availability.

The work is divided into two parts. The first is the analysis of the component availability (disks, machines, racks), and the second is the analysis of the data availability, as inferred from the component availability results and design decisions of the distributed storage system.

Preliminaries

A storage node is defined as unavailable when it fails to respond positively to periodic health checking pings sent by the monitoring system. The paper does not investigate about network errors specifically; those are also swept under unavailability with software & hardware faults at the node level. Even in Section 3, when the causes of node availability are analyzed in detail, network errors are not mentioned.

Node unavailability does not directly imply data unavailability thanks to identical replication or alternatively erasure encoding. In both approaches, data is divided into a set of stripes, each of which comprises of fixed size chunks. Data in a stripe can be reconstructed from some subsets of the chunks. For replication, R = n refers to n identical chunks in a stripe, so the data may be recovered from any one chunk. For Reed-Solomon erasure encoding, RS(n,m) denotes n distinct data blocks and m error correcting blocks in each stripe. In this case a stripe may be reconstructed from any n chunks.

Node failures and Correlated failures

The majority of node unavailability is due to planned reboots (such as kernel version upgrades), which is followed closely by node restarts (OS restart on the storage node). Unplanned machine reboots (e.g., kernel crashes) don't occur frequently but have the longest average duration to recover since extra checks and corrections are used to put the machines to a safe state.

The paper then focuses on measuring the frequency and severity of correlated failures. (Root causes of these failures --power outages, rolling reboots/upgrades-- are not investigated in the paper.) A failure burst is defined with respect to a window size as a maximal sequence of node failures, each one occurring within a time window 2 minutes of the next. 2 minutes is derived from a knee analysis of the graph plotting the "effect of window size on the fraction of individual failures that get clustered into bursts".

Next, the paper focuses on detecting rack-related node failures. To this end, rack-affinity score is defined as the "probability that a burst of the same size affecting randomly chosen nodes in that cell will have a smaller burst score". For a random burst, the expected rack affinity score is 0.5, for a rack-correlated burst close to 1, and a rack-anti-correlated burst close to 0 (they have not observed any rack-anti-correlated bursts). The authors found that larger failure bursts have higher rack affinity: All the failure bursts of more than 20 nodes have rack affinity greater than 0.7, and those of more than 40 nodes have affinity at least 0.9. Figure 8 shows frequency of failure bursts sorted by racks and nodes affected, and we can clearly see a large fraction of failures are bursty and rack-correlated. Later at Figure 10, it is shown that large failure bursts are the biggest contributor to unavailability as well.

Coping with failures

When a node failure causes the unavailability of a chunk within a stripe, the system initiates a recovery operation for that chunk from the other available chunks remaining in the stripe. The rate at which missing chunks may be recovered is limited by the bandwidth of individual disks, nodes, and racks, and there is a tradeoff between doing recovery and serving new client requests. Of course, given the frequency of bursty rack-correlated failures, it is no surprise that a rack-aware stripe placement policy (that ensures that no two chunks in a stripe are placed on nodes in the same rack) increases the stripe MTTF (i.e., data availability) by a factor of 3 typically.

Markov model of data availability and its findings

Using the data collected and analyzed above for node failures and rack-correlated bursty failures, the paper formulates a Markov model for data availability and develops analytical models to reason about past and future availability in the storage clusters, including the effects of different choices of replication, data placement and system parameters. A very simple example for identical replication and 2 chunks per stripe (i.e., R=2) is given in Figure 12.

The paper shows that the model is able to capture well the effect of failure bursts on the MTTF. Next the paper investigates how changes in the parameters of the system will affect data availability. Using the model the paper finds that reducing recovery times is effective when correlated failures are few. For RS(6,3) with no correlated failures, a 10% reduction in recovery time results in a 19% reduction in unavailability. However, when correlated failures are taken into account, even a 90% reduction in recovery time results in only a 6% reduction in unavailability. Correlated failures (compared to independent/random failures) are also shown to reduce the MTTF by at least two orders of magnitude. Correlation also reduces the benefit of increasing data redundancy. The gain in availability achieved by increasing the replication number, for example, grows much more slowly when we have correlated failures.

Two other important conclusions from the model, that should guide design decisions in distributed storage systems, are as follows. The model shows that component availability improvements below the node (server) layer of the storage stack do not significantly improve data availability. Assuming R=3 is used, a 10% reduction in the disk failure rate increases stripe availability by less than 1.5%. On the other hand, cutting node failure rates by 10% can increase data availability by 18%. The model also shows that replicating data across multiple cells (data centers) greatly improves availability because it protects against correlated failures. For example, R=2x2 (i.e., replicating twice in two cells) with 1 day recovery time between cells has two orders of magnitude longer MTTF than R=4.

Conclusions

I expect this paper to be very influential for distributed storage systems research. Using logs obtained from a large-scale production environment, the paper provided evidence that correlation among node failures dwarfs all other contributions to unavailability. (That said, I am disappointed that the network-related failures are not investigated separately.) The provided analytical model of data availability is very promising; and should be used by the distributed storage systems researchers to guide and evaluate any protocols they propose.