Saturday, November 27, 2010

Dynamo: Amazon's highly available key-value store

This paper, which appeared in SOSP'07, describes Dynamo, the underlying storage technology for several core services in Amazon's e-commerce platform. Dynamo is a NoSQL system and provides a single key-value store. The commonly accepted (yet still disputed) wisdom is that RDBMS are overkill for simple key-value stores and are unsuitable for large-scale multi datacenter systems.

The goal in Dynamo is providing reliability at large scale. As the paper says in the introduction "The reliability and scalability of a system is dependent on how its application state is managed". This was a key lesson from the Ousterhout'90 paper: "The role of distributed state".

Dynamo is an optimistic replication system. According to the optimistic survey taxonomy, Dynamo is a multi-master (multiple coordinators can update the data) system that employs state transfer and asynchronous propagation of updates. Hence, Dynamo allows conflicting updates in to the system. Dynamo uses vector clocks to detect conflicts, and employs application-level/semantic conflict resolution. There is no bound on replica divergence. Dynamo adopts a best-effort eventual consistency approach and try to resolve divergent replicas problem when possible.

CAP theorem
The Dynamo system emphasizes availability to the extent of sacrificing consistency. The abstract reads "Dynamo sacrifices consistency under certain failure scenarios". Actually, later it becomes clear that Dynamo sacrifices consistency even in the absence of failures: Dynamo may become inconsistent in the presence of multiple concurrent write requests since the replicas may diverge due to multiple coordinators. Using Abadi's model, Dynamo is PAEL. When there are failures (Partitioning), Dynamo chooses Availability over consistency, and, Else (even in the absence of failures) Dynamo chooses Low-latency over consistency.

Dynamo is inclined to sacrificing consistency because the application semantics can tolerate inconsistency. Dynamo employs post hoc conflict resolution to fix the inconsistencies.

Design principles
The Dynamo paper mentions the Google File System (GFS) SOSP'03 work in just two sentences, but I think they are related works and they warrant comparison. Dynamo makes very different design choices than GFS. While GFS used a centralized-master approach (which was maintained in a highly available fashion using Chubby/Paxos), Dynamo uses a pure peer-to-peer (P2P) approach for serving requests.

The Dynamo system fully-embraces the symmetry principle in its design. Quoting from the paper: "Symmetry: Every node in Dynamo should have the same set of responsibilities as its peers; there should be no distinguished node or nodes that take special roles or extra set of responsibilities. In our experience, symmetry simplifies the process of system provisioning and maintenance."

I think this over-emphasis on symmetry is unwarranted. Why should the system force every node try to do everything for themselves rather than employing specialization? Amazon has a lot of expertise in services and service level agreements. So I am surprised as why they did not think of simplifying the design and implementation of Dynamo by employing specialized nodes to provide a distributed directory lookup service. The VL2 paper shows that a distributed directory service can be implemented in a very efficient and highly available manner in data centers.

The symmetry principle, and the resulting P2P fully decentralized storage nodes system, looks contrived in the paper. I wish the paper included some justification for this design choice. I cannot find very good reasons for it. Obviously this choice leads to a high availability system, but high availability is also achievable with a master coordinator approach using Chubby/Paxos solution in practice. And using a master coordinator approach would have significantly simplified several design problems in the paper, as I discuss in the next section.

The interesting thing is that the paper lists heterogeneity as another key design principle right after the symmetry principle: "Heterogeneity: The system needs to be able to exploit heterogeneity in the infrastructure it runs on. e.g. the work distribution must be proportional to the capabilities of the individual servers. This is essential in adding new nodes with higher capacity without having to upgrade all hosts at once." Doesn't this hint that specialization can also be warranted to exploit heterogeneity? There is no need to strictly adhere to the symmetry principle.

Details of Dynamo
The paper gives details on the partitioning, replication, versioning, membership, and failure handling components of Dynamo.

Partitioning algorithm employs a typical P2P key distribution algorithm (a consistent hashing approach as in Chord). Virtual nodes (tokens) are used for making the key distribution load balancing more fine-grained and more uniform. Dynamo takes pains to make sure that every storage node is utilized almost equally to any other storage node, but enforcing that precise load balancing may lead to wasted traffic in node join and leave.

Replication is performed so that each data item is replicated at N hosts. The coordinator asks for N nodes, but for the sake of availability if quorum number responds, the operation succeeds. N < R+W so we can have intersecting read and write quorums. Tuning of R and W can give write-optimized read-optimized system. In Dynamo, a popular configuration is N=3, R=2, W=2. (Note that we could still employ this type of quorums with a master approach, in GFS N=3, R=1, W=3. Dynamo can provide only a per service tuning of R and W, but with a master-based solution, you can have per key tuning of R and W.)

Data versioning is performed by using vector clocks. Dynamo treats the result of each modification as a new and immutable version of the data.

The biggest difference between Dynamo and traditional p2p systems is that it employs a one-shot lookup. In Dynamo, all nodes maintain information about all key-coordinator mappings. Dynamo does not employ a directory service lookup, instead insists --due to symmetry principle-- that all nodes maintain all the directory information. The request is first routed to a random node (for load balancing purposes), and since that node has information about which node is primarily responsible (a.k.a. coordinator) for the key (typically the first among the top N nodes in the preference list of the key), it routes the request to the corresponding coordinator. If by chance the node that received the request in the first place is also in the list, it can also coordinate the request. Read and write operations involve the first N healthy nodes in the preference list, skipping over those that are down or inaccessible.

The P2P fully decentralized design introduces a lot of complexity
The P2P fully-decentralized system introduces a lot of complexity especially in the failure handling cases. For handling transient failures, hinted handoff is complicated. For handling permanent failures, replica synchronization is more complicated. Membership and failure detection also gets very tricky in a purely decentralized setup: gossip protocol, seed nodes for external discovery, failure detection, adding/removing storage nodes are presented in a hand-wavy manner in the paper. Using a master coordinator would have made all these operations very straightforward.

Using a single master approach would also obviate the need to deal with conflicts. Since updates would be serialized by the single-master, conflicts would be prevented trivially. However, due to faults, an old state may have been exposed, and in the presence of partitions we still have to choose between consistency and availability.

Distributed directory service for lookup
An easier and less radical change in Dynamo would be to adopt a distributed directory service layer for lookup as in the VL2 paper. I think that would make a cleaner design, and solve the self-confessed limitation of scalability with the current design. To quote from the paper: "Finally, Dynamo adopts a full membership model where each node is aware of the data hosted by its peers. To do this, each node actively gossips the full routing table with other nodes in the system. This model works well for a system that contains couple of hundreds of nodes. However, scaling such a design to run with tens of thousands of nodes is not trivial because the overhead in maintaining the routing table increases with the system size. This limitation might be overcome by introducing hierarchical extensions to Dynamo. Also, note that this problem is actively addressed by O(1) DHT systems (e.g., [14])."

The thing that has most impressed me about Dynamo is that the replication is so loosely coupled that they would choose the replicas at different data centers for disaster tolerance purposes. This is made possible by the NoSQL approach and PAEL design choice in Dynamo. The paper also gives a lot of evaluation results. On average 10 ms to complete a write. 99.9% of writes completed in 300ms.

As I mentioned above, I think Dynamo would have been much simpler and efficient by making different design choices. But, what do I know?

Wednesday, November 24, 2010

Crowdsourced sensing and collaboration using Twitter

First, a brief prelude about my research interests. I started my PhD on distributed algorithms and self-stabilizing systems in 1998. But then in 2000, after my advisor received a DARPA grant on networked embedded sensor technologies, I have started working on wireless sensor networks. After 10 years of working on wireless sensor networks, I am now in the process of switching topics. I am doing a lot of reading on cloud computing. This is a topic I enjoy, and I think I can contribute here due to my background in distributed systems and theory.

I am also doing a lot of reading on smartphones, as they provide a good alternative/complement to wireless sensor network systems. The main appeal of smartphones is that they have solved the market penetration problem, which the wireless sensor networks have been perpetually struggling with. The killer applications for smartphones are communication and offering a lightweight ubiquitous PC replacement. Smartphones are incidentally better sensors than the wireless sensor network motes. They have much higher computation and storage power. They have ubiquitous connection through GSM, wifi, bluetooth. They have surprising amount of sensors: camera, microphone, GPS, compass, proximity, ambient light, ambient noise, 3D accelerometer, touchscreen, temperature. They are mobile, and cared by the human user, who can help for certain sensing/recognition tasks. They are inexpensive thanks to mass production.

That is why I am interested in smartphones and specifically on crowdsourcing sensing/collaboration to smartphones using Twitter as a middleware. Here is a presentation that provides a brief overview of my work in this topic.

I believe this is a promising direction, and I think the real breakthroughs are going to arrive when we can get smartphone crowdsourcing and cloud computing to collaborate seamlessly.

Wednesday, November 17, 2010

VL2: A Scalable and Flexible Data Center Network

This paper is by MS Research and appeared in Sigcomm 2009. The paper investigates data center networking, the same problem as the Portland paper (which also appeared in Sigcomm 2009!). Naturally there are similarities in the approaches recommended by the two papers.

The motivation for this paper is from a slightly different angle than the Portland paper. This paper puts more emphasis on the network capacity problem in the data centers. The paper argues that the network is the bottleneck of the computation, since the switches at the higher levels (i.e., aggregation and core switches) are oversubscribed heavily. Servers typically have 1:1 over-subscription to other servers in the same rack --that is, they can communicate at the full rate of their interfaces (e.g., 1 Gbps). However, up-links from top of the rack (ToR) switches are typically 1:20 oversubscribed (i.e., 1 Gbps of up-link for 20 servers), and paths through the highest layer of the tree can be 1:240 oversubscribed.

The paper focuses on the question of providing uniform high capacity for any server to server communication such that traffic flow should be limited only by the available capacity on the network-interface cards of the sending and receiving servers. However, that is only half of the story. In addition to removing the network bottleneck from computation, the data centers must achieve high utilization in order to be profitable. The key to high utilization is the property of agility--the capacity to assign any server to any service. Without agility, each service must pre-allocate enough servers to meet difficult to predict demand spikes, or risk failure at the brink of success. With agility, the data center operator can meet the fluctuating demands of individual services from a large shared server pool, resulting in higher server utilization and lower costs. In order to achieve agility, assigning servers to a service should be independent of network topology.

Scale-out Clos topology and Valiant load balancing
To answer the uniform high capacity problem, the paper proposes a Clos topology as in Figure 5. This topology provides a tree that is "fatter" than the fat tree used in the Portland paper.
However, providing a very fat tree alone is not sufficient to solve the network bottleneck problems; we also need to ensure that traffic is allocated across these multiple available paths appropriately so that one path does not choke down with lots of traffic and reduce the effective rate of communication for the servers communicating over that path. The question is how to achieve such an allocation? Are there any patterns in the data center traffic that will enable us to partition the multiple paths in an optimal way among the servers?

To answer these questions the paper provides extensive experiments and analysis of data center traffic. The analysis found that 90% of flows are small-size (mice), but still 95% of bytes are in big-size flows (elephants). The analysis also show a lot of volatility (unpredictability) in traffic. The variability in datacenter traffic is not amenable to concise summarization and hence engineering routes for just a few traffic matrices is unlikely to work well for the traffic encountered in practice. Armed with these analysis results, the paper proposes to use valiant load balancing (vlb) to randomize end-to-end communication paths to cope with volatility and achieve load balancing. In this scheme, the ToR switch randomly chooses an intermediate switch (among many available options) on a per flow basis.

The paper also provides extensive experiments on network equipment failure characteristics---this was sorely missing in Google's globally available storage paper. The analysis of network failures found that most failures are small in size (e.g., 50% of network device failures involve < 4 devices and 95% of network device failures involve < 20 devices) and large correlated failures are rare (e.g., the largest correlated failure involved 217 switches). Still, downtimes can be significant, and with no obvious way to eliminate all failures from the top of the hierarchy, this paper's approach is to broaden (fatten) the topmost levels of the network so that the impact of failures is muted and performance degrades gracefully.

Virtual layer 2 networking
To answer the agility problem, the paper proposes VL2, which stands for Virtual Layer 2 as far as I can make out --the acronym is not defined in the paper. The key idea of VL2 is separating names from locators. VL2 assigns servers IP addresses that act as names alone, with no topological significance. When a server sends a packet, the shim-layer (a layer 2.5 if you will) on the server invokes a directory system to learn the actual location of the destination and then tunnels the original packet there. The shim-layer also helps eliminate the scalability problems created by ARP in layer-2 networks.
The separation of location-specific IP addresses (LAs) and application-specific IP addresses (AAs) was chosen for two reasons. First, this makes it possible to use low-cost switches, which often have small routing tables that can hold only LA route intervals, without concern for the huge number of AAs. Second, this reduces overhead in the network control plane by preventing it from seeing the churn in host state (change in AAs), tasking it instead to the more scalable directory system (which we discuss below). As such, the network infrastructure operates using LAs; all switches and interfaces are assigned LAs, and switches run an IP-based (layer-3) link-state routing protocol that disseminates only these LAs (end-host information AAs are not disseminated). This allows switches to obtain the complete switch-level topology, as well as forward packets encapsulated with LAs along shortest paths using standard proven network protocols such as OSPF. On the other hand, applications use AAs, which remain unaltered no matter how servers' locations change due to virtual-machine migration or re-provisioning.

To route traffic between servers, which use AA addresses, on an underlying network that knows routes for LA addresses, the VL2 agent at each server traps packets from the host and encapsulates the packet with the LA address of the ToR of the destination as shown in Figure 6. Once the packet arrives at the LA (the destination ToR), the switch decapsulates the packet and delivers it to the destination AA carried in the inner header.

The crux of offering layer-2 semantics via VL2 is having servers believe they share a single large IP subnet (i.e., the entire AA space), while eliminating the ARP and DHCP scaling bottlenecks that plague large Ethernets. From the cloud service programmer's point of view VL2 *efficiently* provides the abstraction that all the servers assigned to the programmer are plugged in to the same LAN--where any IP address can be connected to any port of an Ethernet switch due to flat addressing.

Directory lookup
The VL2 directory system stores the mapping of AAs to LAs. VL2 uses end-system based address resolution to scale to large server pools, without introducing complexity to the network control plane. When an application sends a packet to an AA for the first time, the networking stack on the host generates a broadcast ARP request for the destination AA. The VL2 agent running on the host intercepts this ARP request and converts it to a unicast query to the VL2 directory system. The directory system answers the query with the LA of the ToR to which packets should be tunneled. The VL2 agent caches this mapping from AA to LA addresses, similar to a host's ARP cache, such that subsequent communication need not entail a directory lookup.
The directory system design consists of a modest number (50-100 servers for 100K servers) of read-optimized, replicated directory servers that cache AA-to-LA mappings to handle queries from VL2 agents, and a small number (5-10 servers) of write-optimized, asynchronous replicated state machine (RSM) servers that offer a strongly consistent, reliable store of AA-to-LA mappings. In other words, the directory lookups are handled using read optimized servers, which are just caches of the write optimized Paxos-running RSMs that hold persistent state.

To achieve high availability and low latency, an agent sends a lookup to k randomly-chosen directory servers, and uses the first answer it receives back. The network provisioning system sends directory updates to a randomly-chosen directory server, which then forwards the update to a RSM server. VL2 does not require the use of a fabric manager proposed in Portland, instead the directory system serves the IP-to-LA mapping. While the fabric manager is a single machine and centralized solution, the directory service provides a decentralized solution.

The evaluation results shows that VL2 provides an effective substrate for a scalable data center network; VL2 achieves (1) 94% optimal network capacity, (2) a TCP fairness index of 0.995, (3) graceful degradation under failures with fast reconvergence, and (4) 50K lookups/sec under 10ms for fast address resolution.

The Clos network topology pays off, as the goodput is more than 10x of what the network in current data centers can achieve with the same investment. Another striking result is that comparing the cost of a VL2 network for 35K servers with a traditional data center network shows that a VL2 network with no over-subscription can be built for the same cost as the traditional network that has 1:240 over-subscription.

This is a very well written, definitive paper on data center networking. The paper illustrates the challenges in data center networking via extensive analysis, and offers a simple design that can be realized today with available networking technologies, re-utilizing time-tested/proven network protocols, and avoiding changes to switch control and data plane capabilities. This is a must-read for anyone interested in the data center networking topic.

Monday, November 15, 2010

PortLand: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric

Last week we covered the Portland paper by UC San Diego, which appeared at Sigcomm'09. This paper is well written and I really appreciate its clarity and simplicity.

The motivation for the work is the need to scale at the data center networks (DCNs). Data centers today may have around 100K machines, with 32 virtual machines at each, which yield a total of 3 million hosts. Assuming one switch is required for every 25 physical hosts and accounting for interior nodes, the topology would consist of 8,000 switches. Current network protocols impose significant management overhead at this scale let alone supporting seamless VM migration across machines. For example, broadcasting ARPs and updates to every swithch is out of the question for such a network.

The desiderata for the DCN is given as:
R1. Any VM may migrate to any physical machine. Migrating VMs should not have to change their IP addresses as doing so will break pre-existing TCP connections and application-level state.
R2. An administrator should not need to configure any switch before deployment.
R3. Any end host should be able to efficiently communicate with any other end host in the data center along any of the available physical communication paths.
R4. There should be no forwarding loops.
R5. Failures will be common at scale, so failure detection and recovery should be rapid and efficient.

To achieve these desiderata the paper proposes Portland, a scalable fault-tolerant layer-2 routing and forwarding protocol for DCNs. The principal insight behind Portland is the observation that DCNs have a known baseline topology imposed by racks and rows of racks, and need not support an arbitrary topology. Using this observation, Portland enables switches to discover their positions in topology, and assigns internal hierarchical pseudo mac (pmac) addresses to all end hosts, which enable efficient, provably loop-free forwarding, and easy VM migration.

Portland provides layer 2 network forwarding based on flat MAC addresses. A layer 2 fabric imposes less administrative overhead compared to layer 3 forwarding. A layer 3 protocol would require configuring each switch with its subnet information and synchronizing DHCP servers to distribute IP addresses based on the host's subnet. A layer 3 protocol would also render transparent VM migration difficult because VMs must switch their IP addresses if they migrate to a host on a different subnet. While VLANs allow a middle ground between layer 2 and layer 3, VLANs lacks flexibility (as bandwidth resources to be explicitly assigned to each VLAN at each participating switch) and scalability (as each switch must maintain state for all hosts in each VLAN that they participate in).

Fabric manager:
PortLand employs a logically centralized fabric manager that maintains soft-state about network configuration information. The fabric manager is a user process running on a dedicated machine responsible for assisting with ARP resolution, fault tolerance, and multicast. This fabric manager idea is very similar to the open flow networking idea we discussed in an earlier paper. Using soft-state for network configuration and maintaining this at the fabric manager relieves the manual administration requirements.

Pseudo MAC address:
The basis for efficient forwarding and routing as well as VM migration in Portland's design is the hierarchical Pseudo MAC (PMAC) addresses used to identify each end host. The PMAC encodes the location of an end host in the topology. For example, all end points in the same pod will have the same prefix in their assigned PMAC. PMAC address has the form: pod.position.port.vmid. The end hosts remain unmodified, believing that they maintain their actual MAC addresses.

Location discovery protocol:
Portland employs a location discovery protocol so that switches learn whether they are edge, aggregate, or core switches and configure themselves. Protocol works by first the edge switches discovering that they are on the edge (they don't get location discovery packets from majority of ports [which are terminals]). After that aggregation switches discover they are aggregation switches, because they talk to at least one edge switch. Finally, core switches discover they are core switches as they do not talk to any edge switches. This location discovery protocol assumes that the protocol (and protocol timers) is started at the same time in the data center. But it is easy to relax that requirement by writing a self-stabilizing location discovery protocol.

VM migration:
VM keeps the same IP after migration, but the PMAC address needs to be changed. To this end, the VM from the new location sends a message to the fabric manager, and the IP mapping to PMAC is changed at fabric manager. The fabric manager also propogates this change to the switch that VM has left. (The new switch already knows this updated mapping as the packet was sent by the VM via the new switch.)

Portland achieves loop free routing using a simple rule: once the packet starts to go downwards the hierarchy (core to aggregate to edge) it cannot go back upwards. Portland achieves fault-tolerant routing by utilizing the fabric manager as a centralized control for the network; this is very much like what open-flow does.

Friday, November 12, 2010

Availability in Globally Distributed Storage Systems

This paper by Google Research provides a report on the availability behavior of large cloud storage systems by studying up to 7K nodes at Google over a year. The paper does not propose any new protocols, but provides sufficiently accurate models of system behavior/performance based on the study. These models are important since they enable us to correctly design and optimize these multi layered systems for data availability.

The work is divided into two parts. The first is the analysis of the component availability (disks, machines, racks), and the second is the analysis of the data availability, as inferred from the component availability results and design decisions of the distributed storage system.

A storage node is defined as unavailable when it fails to respond positively to periodic health checking pings sent by the monitoring system. The paper does not investigate about network errors specifically; those are also swept under unavailability with software & hardware faults at the node level. Even in Section 3, when the causes of node availability are analyzed in detail, network errors are not mentioned.

Node unavailability does not directly imply data unavailability thanks to identical replication or alternatively erasure encoding. In both approaches, data is divided into a set of stripes, each of which comprises of fixed size chunks. Data in a stripe can be reconstructed from some subsets of the chunks. For replication, R = n refers to n identical chunks in a stripe, so the data may be recovered from any one chunk. For Reed-Solomon erasure encoding, RS(n,m) denotes n distinct data blocks and m error correcting blocks in each stripe. In this case a stripe may be reconstructed from any n chunks.

Node failures and Correlated failures
The majority of node unavailability is due to planned reboots (such as kernel version upgrades), which is followed closely by node restarts (OS restart on the storage node). Unplanned machine reboots (e.g., kernel crashes) don't occur frequently but have the longest average duration to recover since extra checks and corrections are used to put the machines to a safe state.
The paper then focuses on measuring the frequency and severity of correlated failures. (Root causes of these failures --power outages, rolling reboots/upgrades-- are not investigated in the paper.) A failure burst is defined with respect to a window size as a maximal sequence of node failures, each one occurring within a time window 2 minutes of the next. 2 minutes is derived from a knee analysis of the graph plotting the "effect of window size on the fraction of individual failures that get clustered into bursts".

Next, the paper focuses on detecting rack-related node failures. To this end, rack-affinity score is defined as the "probability that a burst of the same size affecting randomly chosen nodes in that cell will have a smaller burst score". For a random burst, the expected rack affinity score is 0.5, for a rack-correlated burst close to 1, and a rack-anti-correlated burst close to 0 (they have not observed any rack-anti-correlated bursts). The authors found that larger failure bursts have higher rack affinity: All the failure bursts of more than 20 nodes have rack affinity greater than 0.7, and those of more than 40 nodes have affinity at least 0.9. Figure 8 shows frequency of failure bursts sorted by racks and nodes affected, and we can clearly see a large fraction of failures are bursty and rack-correlated. Later at Figure 10, it is shown that large failure bursts are the biggest contributor to unavailability as well.
Coping with failures
When a node failure causes the unavailability of a chunk within a stripe, the system initiates a recovery operation for that chunk from the other available chunks remaining in the stripe. The rate at which missing chunks may be recovered is limited by the bandwidth of individual disks, nodes, and racks, and there is a tradeoff between doing recovery and serving new client requests. Of course, given the frequency of bursty rack-correlated failures, it is no surprise that a rack-aware stripe placement policy (that ensures that no two chunks in a stripe are placed on nodes in the same rack) increases the stripe MTTF (i.e., data availability) by a factor of 3 typically.

Markov model of data availability and its findings
Using the data collected and analyzed above for node failures and rack-correlated bursty failures, the paper formulates a Markov model for data availability and develops analytical models to reason about past and future availability in the storage clusters, including the effects of different choices of replication, data placement and system parameters. A very simple example for identical replication and 2 chunks per stripe (i.e., R=2) is given in Figure 12.
The paper shows that the model is able to capture well the effect of failure bursts on the MTTF. Next the paper investigates how changes in the parameters of the system will affect data availability. Using the model the paper finds that reducing recovery times is effective when correlated failures are few. For RS(6,3) with no correlated failures, a 10% reduction in recovery time results in a 19% reduction in unavailability. However, when correlated failures are taken into account, even a 90% reduction in recovery time results in only a 6% reduction in unavailability. Correlated failures (compared to independent/random failures) are also shown to reduce the MTTF by at least two orders of magnitude. Correlation also reduces the benefit of increasing data redundancy. The gain in availability achieved by increasing the replication number, for example, grows much more slowly when we have correlated failures.

Two other important conclusions from the model, that should guide design decisions in distributed storage systems, are as follows. The model shows that component availability improvements below the node (server) layer of the storage stack do not significantly improve data availability. Assuming R=3 is used, a 10% reduction in the disk failure rate increases stripe availability by less than 1.5%. On the other hand, cutting node failure rates by 10% can increase data availability by 18%. The model also shows that replicating data across multiple cells (data centers) greatly improves availability because it protects against correlated failures. For example, R=2x2 (i.e., replicating twice in two cells) with 1 day recovery time between cells has two orders of magnitude longer MTTF than R=4.

I expect this paper to be very influential for distributed storage systems research. Using logs obtained from a large-scale production environment, the paper provided evidence that correlation among node failures dwarfs all other contributions to unavailability. (That said, I am disappointed that the network-related failures are not investigated separately.) The provided analytical model of data availability is very promising; and should be used by the distributed storage systems researchers to guide and evaluate any protocols they propose.

Sunday, November 7, 2010

My iPhone 4 has a transparent screen

I just took a picture of my palm, and set it as the wallpaper to give the transparency effect. The trick worked well on some unsuspecting friends.

Optimistic Replication

This 2005 paper by Saito and Shapiro provides a comprehensive survey of the optimistic replication area and is a must-read for distributed services designers. Below, I am providing some snippets from the paper to give a brief overview.

Data replication improves both availability and performance for distributed services. Availability is improved by allowing access to the data even when some of the replicas are unavailable. Performance improvements concern reduced latency (by enabling local access from replica instead of remote access) and increased throughput (by letting multiple computers serve the data).

Pessimistic techniques block access to a replica unless it is provably up to date. Such pessimistic techniques perform well in local-area networks, in which latencies are small and failures uncommon, but is unsuitable for wide-area networks, because the Internet remains slow and unreliable. Moreover, pessimistic algorithms scale poorly, because its throughput and availability suffer as the number of sites increases.

Compared to traditional "pessimistic" techniques, optimistic replication promises higher availability and performance, but lets replicas temporarily diverge and lets users see inconsistent data. In optimistic techniques,
updates are propagated in the background, and occasional conflicts are fixed after they happen. Optimistic algorithms should be able to scale to a large number of replicas, because they require little synchronization among sites.

These benefits, however, come at a cost. CAP theorem states that any distributed system faces a trade-off between availability and consistency. Where a pessimistic algorithm waits, an optimistic one speculates. Thus, optimistic replication faces the challenges of diverging replicas and conflicts between concurrent operations. So it is applicable only for applications that can tolerate occasional conflicts and inconsistent data. Fortunately, in many real-world systems, especially file systems, conflicts are known to be rather rare. Some example applications of optimistic replication are DNS, USENET, PDAs, CVS, Amazon Dynamo.

When describing algorithms, it is useful to distinguish sites that can update an object --called master sites -- from those that store read-only replicas.

To update an object, a user submits an operation at some site. An operation submitted by the user of a replica is tentatively applied to the local replica to let the user continue working based on that update. It is also logged, i.e., remembered in order to be propagated to other sites later. These systems often deploy epidemic propagation to let all sites receive operations, even when they cannot communicate with each other directly. Because of background propagation, operations are not always received in the same order at all sites. Each site must reconstruct an appropriate ordering that produces an equivalent result across sites. In addition, with no a priori site coordination, multiple users may update the same object at the same time, so there is a need for detecting operations that are in conflict and resolve them. Commitment refers to an algorithm to converge the state of replicas by letting sites agree on the set of operations and their final ordering and conflict-resolution results.

Optimistic Replication: Design Choices
We classify optimistic replication systems along the following axes: (The rest of the paper investigates these design choices in detail)
  • Number of writers: single-master vs. multi-master: Single-master systems designate one replica as the master. All updates originate at the master and then are propagated to other replicas, or slaves. They are simple but have limited availability, especially when the system is write-intensive. Multi-master systems let updates be submitted at multiple replicas independently and exchange them in the background. They are more available but significantly more complex. In particular, operation scheduling and conflict management are issues unique to these systems. According to Gray et al. [1996], a naive multi-master system would encounter concurrent updates at the rate of O(M^2), assuming that each master submits operations at a constant rate. (M is the number of masters.)
  • Definition of operations: state transfer vs. operation transfer. State-transfer systems limit an operation either to read or to overwrite an entire object. State transfer is simple, because maintaining consistency only involves sending the newest replica contents to other replicas. Operation-transfer systems must maintain either a history of operations and have replicas agree on the set of applied operations and their order. On the other hand, they can be more efficient, especially when objects are large and operations are high level. Operation transfer also allow for more flexible conflict resolution.
  • Scheduling: syntactic vs. semantic. The goal of scheduling is to order operations in a way expected by users, and to make replicas produce equivalent states. Scheduling policies can be classified into syntactic and semantic policies. Syntactic methods are simple but may cause unnecessary conflicts. The most popular example is ordering operations by timestamp. Semantic scheduling, on the other hand, exploits semantic properties such as commutativity of idempotency of operations. Semantic scheduling increases scheduling flexibility and reduces conflict, but at the cost of application dependence and complexity.
  • Handling conflicts. The best approach is to prevent conflicts from happening altogether. Pessimistic algorithms prevent conflicts by blocking or aborting operation as necessary. Single-master systems avoid conflicts by accepting updates only at one site (but allow reads to happen anywhere). These approaches, however, come at the cost of lower availability. Conflicts can also be reduced, for example, by quickening propagation or by dividing objects into smaller independent units.
  • Propagation strategies and topologies. Local operations must be transmitted and re-executed at remote sites. In general, the quicker the propagation, the less the degree of replica inconsistency and the rate of conflict, but more the complexity and overhead, especially when the application receives many updates relative to read requests.
  • Consistency guarantees. In any optimistic replication system, the states of replicas may diverge somewhat. A consistency guarantee specifies whether a client application may observe divergence, and how much.


Keep it simple. Traditional, pessimistic replication, with many off-the-shelf solutions, is perfectly adequate in small-scale, fully connected, reliable networking environments. Where pessimistic techniques are the cause of poor performance or lack of availability, or do not scale well, try single-master replication: it is simple, conflict-free, and scales well in practice. State transfer using Thomas's write rule (last writer wins) works well for many applications. Advanced techniques such as version vectors and operation transfer should be used only when you need flexibility and semantically rich conflict resolution.

Propagate operations quickly to avoid conflicts. While connected, propagate often and keep replicas in close synchronization. This will minimize divergence when disconnection does occur.

Exploit commutativity. Commutativity should be the default; design your system so that non-commutative operations are the uncommon case. For instance, whenever possible, partition data into small, independent objects. Within an object, use monotonic data structures such as an append-only log, a monotonically increasing counter, or a union-only set.

Wednesday, November 3, 2010

A presentation tip (brought to you by a weary audience)

A common and serious flaw I see in student presentations is that the presenter rushes through the first 5-10 slides. This is the worst thing to do in a presentation, these introduction slides to the problem and solution are the most important ones. If the presenter loses the audience in these beginning slides, this wastes the entire hour for the audience. In comparison, losing the audience in the middle of the presentation is less critical, less time is wasted, and the audience can even use this time to think about alternative solutions to the problem (which was presented clearly in the introduction).

You would think this is common sense, but most of the students still commit this mistake. I guess the reason is since the presenter had read and understood the relatively easy introduction part of the paper, he thinks the audience can also understand and follow it with ease. And armed with the powerpoint slides, the presenter can finish the first 10 slides in less than 10 minutes in bullet time. I intercept the presenters with questions if I see this happening, but I am getting really tired of playing the stupid/clueless professor. The worse thing is most of the audience do not follow the presentation, but nobody intercepts. I think in any case it is better to intercept to avoid wasting time for everyone.

The role of distributed state

This is a 1990 technical report by John Ousterhout, who has been a very influential and pioneer figure in systems research. I had written a summary of his log-structured file system (LFS) paper earlier in this blog. It seems like his ideas on LFS and RAMCloud are becoming important in cloud computing these days. Ousterhout is also the creator of TCL/TK, so he has a lot to say in programming languages for web applications domain as well. The miscellaneous articles on his non-academic homepage include a lot of wisdom from his experience in the software development industry, and are also worth reading carefully.

This paper titled "Role of distributed state" describes the advantages and disadvantages of distributed state and suggests that the act of building a distributed system consists of making tradeoffs among various alternatives for managing the distributed state. Illustrating this point, the two case studies included in the paper, NFS and Sprite file systems, end up with a corresponding set of good and bad properties, and the strengths and weaknesses of the two systems are almost opposites.

The advantages of distributed state are performance (through local caching of the state), coherency (through using the state for collaboration and coordination), and reliability (through replication of the state). Unfortunately, these advantages are only potential advantages; in practice they are difficult to achieve due to the four problems introduced by distributed state: consistency, crash sensitivity, time and space overheads, and complexity.

The first problem is consistency: if the same piece of information is stored at several places and one of the copies changes, what happens to the other copies? There are three approaches to cope with this, detect stale data on use (lazy), prevent stale data (pessimistic --this is costly), and tolerate inconsistency (optimistic --this can introduce complexity). The second problem is crash sensitivity: unless you have full state replication, failure of one machine/state renders the rest of the distributed state useless as well (since that state has not been replicated). The third problem is time overhead, which is incurred mainly for maintaining consistency in preventing stale data. (Let's ignore the space overheads, we have space). The last problem is complexity, as getting a distributed protocol right is hard.

Case Study 1: NFS file system
The NFS design was optimized for simplicity and robustness, with performance a secondary goal. Simplicity and robustness were achieved by using a stateless protocol with idempotent operations. In NFS, servers do not store any information about their clients; they do not keep track of which files are currently in use or which clients are using which files. Distributed state is kept almost exclusively on the clients.

The advantage of this design choice is that NFS handles faults easily. As NFS is "stateless" (at the servers, that is) to begin with, no state is corrupted, so no special code is needed for crash recovery.

The disadvantage of this design is that the performance suffers greatly. Whenever a client issues a write request, the server must guarantee that all modified data is safely on disk before the write returns. And this kills performance causing low write-throughput. Even worse for the performance, NFS uses a "write-through-on-close" policy to reduce windows of inconsistency. Whenever a file is closed on a client machine, the client immediately transmits modified data for the file back to the server. Unfortunately, the statelessness of NFS requires that new data be returned immediately to the server to reduce consistency problems, and the only way to return data to the server is with the write operation, which forces data to disk. Still, consistency problems can also arise due to client caches of file at different places. Thus, the clients use polls to see if they should invalidate cache (which in turn reduces the performance). Even then, a window of inconsistency exists for NFS due to client timings of events.

Case Study 2: Sprite file system
Sprite's design is optimized for performance. Sprite chooses to be "stateful":
  1. Servers keep information in their main memories about which workstations are reading or writing which files. This requires clients to notify servers whenever files are opened or closed, but allows the servers to enforce consistency as described below.
  2. Servers retain modified file blocks in their main memories, and do not write that information back to disk until it has aged for thirty seconds.
  3. Clients also retain modified file blocks in their main memories; they do not pass new information back to servers until it has aged for thirty seconds or until the information is needed by some other client. If a client has dirty blocks for a file, the server's state information reflects this.

The advantage of this design choice is performance: Since the servers keep track of which clients are using which files, clients need not return modified file data to servers immediately. Another major advantage of statefulness is consistency: If a file is ever open simultaneously on several clients and at least one of them is writing the file, then the server notifies each of the clients and insists that they not cache the file; all read and write operations must be passed through to the server, where they are applied to a single copy of the file in the server's cache. Thus Sprite provides perfect file consistency: each read operation is guaranteed to return the most recently written data for that file, regardless of where and when the file is read and written.

The disadvantage of the stateful design choice is that the system is more complex; implementing Sprite was difficult due to subtle race conditions. This complexity also spawns more difficult crash recovery problems; server crash leaves clients in an inconsistent state with server, unable to move forward.

To handle the recovery problems, the designers of the Sprite system discovered that distributed state was not just the cause of the recovery problem, but also the solution. By adding slightly to the state kept by clients about their files, they made it possible for servers to recreate all their usage information after rebooting. A new operation was added to the Sprite protocol: reopen. When a server reboots, each client reopens all of its files by passing the server a copy of its state information (including information such as which files are open for reading or writing, which are locked, etc.). This allows the server to reconstruct its state so that it can continue to guarantee consistent access to files, and so that locks are not broken when servers reboot. A side-effect of this reopen fix took its toll on the performance though, the reopen approach leads to a recovery storm and which paralyze the server.

Ousterhout takes a strong stance in the conclusion. I think he provides critical and very good advice, so I quote it verbatim here.
Based on my experience with the NFS and Sprite file systems, I do not believe that the stateless model can meet the needs of high-performance workstations of the future. A stateless approach will limit the performance of the system to the performance of disks; unfortunately, disk performance is not improving at anywhere near the rate of processor performance. Non-volatile memory offers some hope for performance improvement, but I think the best solution is a change to more stateful protocols.

On the other hand, distributed state almost always introduces complexity and fragility, so system designers should attempt to reduce distributed state as much as possible. The less state, the better. In Sprite, I suspect that we may have been a little too eager to embrace state, and that a careful redesign of the system could reduce the amount of state we have to maintain.

Finally, the best approach to dealing with failures is to merge recovery with normal operation so that there is nothing special to do during recovery. NFS achieves this quite nicely through its combination of statelessness and idempotency. Recovery happens so infrequently that is is very difficult to debug special-case recovery code: it is hard to invoke the code under test conditions, and the code is hardly ever executed under real-life conditions. This means that there is a good chance that the code will not work when it is needed. On the other hand, if the recovery code and regular-case code are the same, the recovery code will be exercised constantly during everyday operation of the system so it is likely to work correctly when needed.

By the way, this approach of "dealing with failures by merging recovery with normal operation" is what the self-stabilization prescribes. So this last paragraph is a major motivation for self-stabilizing systems. I intend to write about self-stabilization in a future post.

Two-phase commit and beyond

In this post, we model and explore the two-phase commit protocol using TLA+. The two-phase commit protocol is practical and is used in man...