Friday, January 28, 2011

Lessons from giant-scale services

This is a 2001 paper by Eric Brewer summarizing the lessons he learned from designing and developing giant-scale internet services (see Inktomi) between 1995-2001. This paper does not mention the CAP theorem at all! This is very surpising given that CAP theorem is what Brewer is famous to most people in the internet services domain. (CAP theorem is very famous and has been very influential for the design of Internet services and cloud computing systems. CAP theorem was first mentioned in publication in 1998 paper, and then presented in PODC'00 keynote by Brewer. See these two posts for a detailed treatment of the CAP theorem.)

Instead this paper is all about DQ principle as a design guideline for internet services. The paper mentions Harvest and Yield which may be seen as finer granularity versions of Consistency and Availability respectively, and may connect back to the CAP theorem. I discuss that connection at the end.

DQ principle
In the web services model, clients make read queries to the servers and servers return data to reply those queries. The DQ principle is simple:
data_per_query * queries_per_second == constant

The intuition behind this principle is that the system's overall capacity tends to have a particular physical bottleneck, such as total I/O bandwidth of all disks in the system, which is tied to data movement. The DQ value is the total amount of data that has to be moved per second on average, and it is thus bounded by the underlying physical limitation. At the high utilization level typical of giant-scale systems, the DQ value approaches this limitation.

DQ is hopefully not network-bound or CPU-bound (e.g., not bound by the load manager's performance). If so, you should fix the problem by throwing more resources at it, so you can provide a large-scale web service. Web services are generally data intensive so DQ model is a good fit for most web-services.

DQ is configuration specific. DQ increases with node additions (that is provided that network bandwidth limit is not reached), and DQ decreases with node failures. So failures may force your system to become over capacity. (You may also consider a sudden increase in query traffic over system capacity also as a failure model. This reduces to the same problem when looking at the reverse perspective.)

Yield and Harvest
Yield is the probability of completing a request, and harvest measures the completeness of the answer to the query. More formally:
yield= queries_completed / queries_offered
harvest = data_available / complete_data

Yield provides a better metric of availability than simply uptime. Being down for one second at peak and off-peak times generates the same uptime, but vastly different yields because there might be an order-of-magnitude difference in load between the peak second and the minimum-load second. Harvest provides a metric for consistency, of how much of the database is reflected in the response.

Due to faults or saturation DQ will be reached sooner or later, and the system will have to make a choice between reducing yield (i.e., stop answering requests) and reducing harvest (i.e., giving answers based on incomplete data). The key insight here is that we can influence whether faults impact yield, harvest, or both. Here is how yield and harvest relate to DQ: When DQ is reached, we can either limit Q (capacity) to maintain D, or we can reduce D and increase Q. We can focus on harvest through admission control (AC), which reduces Q, or on yield through dynamic database reduction, which reduces D, or we can use a combination of the two.

Replication vs. Partitioning
Replicated systems tend to map faults to reduced capacity (and to reduced yield at high utilizations), while partitioned systems tend to map faults to reduced harvest, as parts of the database temporarily disappear, but the capacity in queries per second remains the same. Consider a two-node cluster: The replicated version has traditionally been viewed as "better" because it maintains 100 percent harvest under a fault, whereas the partitioned version drops to 50 percent harvest. Dual analysis shows that the replicated version drops to 50 percent yield, while the partitioned version remains at 100 percent yield. Furthermore, both versions have the same initial DQ value and lose 50 percent of it under one fault: Replicas maintain D and reduce Q (and thus yield), while partitions keep Q constant and reduce D (and thus harvest).

Although you need more copies of the data with replication, the real cost is in the DQ bottleneck rather than storage space. Moreover, replication on disk is cheap, and only accessing the replicated data requires DQ points. According to these principles, you should always use replicas above some specified throughput. With no DQ difference, it sense to replicate the data once the partitions are a convenient size. You will enjoy more control over harvest and support for disaster recovery, and it is easier to grow systems via replication than by repartitioning onto more nodes.

We can vary the replication according to the data's importance, and generally control which data is lost in the presence of a fault. For example, for the cost of some extra disk space we can replicate key data in a partitioned system. Alternatively, we can exploit randomization to make our lost harvest a random subset of the data, (as well as to avoid hot spots in partitions). Many of the load-balancing switches can use a (pseudo-random) hash function to partition the data, for example. This makes the average and worst-case losses the same because we lose a random subset of "average" value. The Inktomi search engine uses partial replication; e-mail systems use full replication; and clustered Web caches use no replication. All three use randomization.

Graceful degradation
Graceful degradation is simply the explicit process for managing the effect of saturation on availability; that is, we explicitly decide how saturation should affect uptime, harvest, and quality of service. The paper gives the following graceful degradation examples taken from real systems:

Cost-based AC. Inktomi can perform AC based on the estimated query cost (measured in DQ). This reduces the average data required per query, and thus increases Q. Note that the AC policy affects both D and Q. Denying one expensive query could thus enable several inexpensive ones, giving us a net gain in harvest and yield. (AC could also be done probabilistically — perhaps in the style of lottery scheduling, so that retrying hard queries eventually works.)

Priority- or value-based AC. Datek handles stock trade requests differently from other queries and guarantees that they will be executed within 60 seconds, or the user pays no commission. The idea is to reduce the required DQ by dropping low-value queries, independently of their DQ cost.

Reduced data freshness. Under saturation, a financial site can make stock quotes expire less frequently. This reduces freshness but also reduces the work per query, and thus increases yield at the expense of harvest. (The cached queries don't reflect the current database and thus have lower harvest.)

Concluding remarks
Based on its usefulness in design of large-scale systems, I wonder why we have not been hearing about DQ principle as much as the CAP theorem. Maybe the principle is famous under a different term?

As I mentioned in the beginning, harvest and yield may be seen as finer granularity versions of consistency and availability respectively. So, does this harvest-yield tradeoff relate to the CAP theorem's consistency-availability tradeoff? I am not sure. The harvest-yield tradeoff mentioned in this paper stems from a capacity limitation, whereas the consistency-availability tradeoff in the CAP theorem stemmed from the partition (lack of communication) problem. I think this harvest-yield tradeoff may be more related to the consistency-lowlatency tradeoff mentioned in the PACELC model. If the system insists on providing full harvest, its yield will suffer as it will be able to complete less of the queries by unit time, and hence its latency will increase.


Anonymous said...

I wonder about the fundamental nature of DQ when speculative execution (like preemptive response), content delivery networks, and other things are applied. In many cases response time is the metric of choice.

Murat said...

Re: Anonymous.

So, basically, you are saying we are in the age of abundance: 1) yield is easy to carry to 100 % (by employing CDNs and over-provisioning), and 2) harvest can be faked to be 100 % (by using stale data and replication).

(Note that the harvest yield tradeoff is still there, because we are *faking* the harvest.)

What is missing in the DQ equation is the response time. I guess the model in this paper is to see response time as a bounded time, and look at yield modulo this response time.

Two-phase commit and beyond

In this post, we model and explore the two-phase commit protocol using TLA+. The two-phase commit protocol is practical and is used in man...