Thursday, December 23, 2010

Case for RAMClouds: Scalable High-Performance Storage Entirely in DRAM

I wrote about Ousterhout's "The Role of Distributed State" work before. This review is for his recent work on RAMClouds, which appeared in SIGOPS Operating Systems Review.

This paper makes a case for keeping all the data in the RAM over distributed nodes in a datacenter. "A RAMCloud is not a cache like memcached and data is not stored on an I/O device, as with flash memory: DRAM is the permanent home for data." Obviously, storing everything in RAM would yield a very high-throughput (the paper mentions 100-1000x) and very low-latency (again the paper mentions 100-1000x) system compared to disk-based systems. However, the primary reason the authors are excited about RAMCloud is the following: "RAMCloud will simplify the development of large-scale Web applications by eliminating many of the scalability issues that sap developer productivity today." The motivation for RAMCloud is to provide a general-purpose storage system that scales far beyond existing systems, so for achieving scalability application developers do not have to resort to ad hoc techniques, such as Dynamo, PNUTS, Bigtable, that give up some of the benefits of traditional RDBMSes.

This quest for the holy grail is itself a point of contention. One of the critiques of the RAMCloud proposal is Jeff Darcy in these two posts. Jeff's main point is "You cannot shoehorn everything in one system and in RDBMS. Applications need many different kinds of storage."

I will now go into more technical details about feasibility of RAMClouds.

Latency and bandwidth trends
The paper gives these striking information on disk trends: "Disk capacity has increased more than 10000-fold over the last 25 years and seems likely to continue increasing in the future. Unfortunately, though, the access rate to information on disk has improved much more slowly: the transfer rate for large blocks has improved only 50-fold, and seek time and rotational latency have only improved by a factor of two."

Let's look at the absolute numbers for latencies. Modern hard drives have latencies under 10 milliseconds; in contrast RAM latency is in the 10 nanosecond range. So, RAM is 100,000 times faster. However, it is possible to use many disks in parallel to overlap these latencies and reduce these latencies. Caching also reduces the average latencies from disk significantly. Another way to defend the disk-based systems is to consider the question of whether very-low-latency really matters for cloud applications. The argument in the paper is: if you build a low-latency system, the applications will come. "With sufficiently low-latency none of the specialized approaches for scalability are needed. RAMClouds offer the hope of a new one-size-fits-all where performance is independent of data placement and a rich variety of queries becomes efficient." I was hoping to hear a more convincing argument for low-latency applications here instead.

Let's look at the numbers for bandwidth. Disks actually have pretty good throughputs. The bandwidth of disk is about 100 MB/s. The bandwidth of RAM is 5 GB/s, about 50 times faster, but in practice there are many challenges that prohibit seeing a full benefit from this RAM bandwidth. Network switches above the rack layer can easily become a bandwidth bottleneck in datacenters. (The paper acknowledges these challenges see a couple paragraphs below.) Also, special purpose filesystems should be developed for the RAM-disk bandwidth to reach this raw 5 GB/s number. I recently had an interesting conversation with my colleague, Tevfik, on this issue. Tevfik said that for a master's project he and his student studied how much improvement RAM-disks can provide compared to reading from disk. They moved data from RAM-disk to another half of the RAM (that is not part of the RAM-disk), and compared that with moving data from disk to RAM. Their finding was surprising, there wasn't much noticable improvement from using RAM-disks. There are probably two factors that contributed most to that result. The first one is, the operating systems are already very smart and perform a lot of buffering optimizations while reading from disk, so reading from disk is not that bad (except for pathological cases of constantly small random access reading workloads). Remember also that the disk has pretty good bandwidth. The second reason is that RAM-disk introduces latencies: since the program now uses a filesystem to access to your RAM, the access is made via a system call and not from the kernel directly. Although, there are some systems like tmpFS that reduces this overhead, those filesystems also restrict what you can do with your RAM-disk, how you can mount your RAM-disk to get a hierarchical storage management (HSM).

Cost and size trends
The latency and bandwidth trends mentioned in the paper favored RAMs more than disks. However, there are other striking trends that the paper ignores to mention. For example, the trends for the storage cost. In the last 30 years, the initial $193,000 per gigabyte cost of storing at the disk dropped to only 7 cents; this represents a cost decrease of nearly three-million percent. While the disk has $0.07 cost/gb, the RAM has $60 cost/gb according to the paper (even while excluding electric usage).

In order to argue against disk-based approaches that use generous caching to improve the performance, the paper mentions that caching is not effective for long-tail access patterns such as in Facebook. The solution the paper offers for this long-tail access problem is to keep all the data in the RAMCloud. However, this proposal ignores the size trends in data completely. Those large social network services have several 100 TBs of just text and log data, and every year the size of the data is expected to increase several folds.

Again, to make a case for the RAMClouds the paper cites a figure from another paper, and mentions that the dividing lines in Figure 2 are all shifting upwards with time, which will increase the coverage of RAMClouds in the future. The upwards shift is because "the boundary moves upwards as the cost/bit of DRAM improves; it moves to the right as the cost/query/sec. of flash improves. For all three storage technologies cost/bit is improving much more rapidly than cost/query/sec, so all of the boundaries are moving upwards." But this analysis ignores the trend for more data-storing/consuming needs of applications. With time, the storage needs of the applications will also grow rapidly, which may nullify the benefits from the aforementioned upwards shift in the RAMCloud boundary.

Challenges that need to be overcome for implementing RAMCloud:
The paper mentions that there are several research challenges for implementing RAMCloud efficiently and effectively. I will summarize two of them below.

Low-latency RPC is needed
Network switching at several layers adds delays. Network delays should be reduced by tuning TCP. Since there isn't much locality to data center traffic, the bisection bandwidth may need to be increased in the upper levels of datacenter networks as well. A second problem is the OS level delays. A general purpose OS introduces high overheads for interrupt processing, network protocol stacks, and context switching. "In RAMCloud servers it may make sense to use a special-purpose software architecture where one core is dedicated to polling the network interface(s) in order to eliminate interrupts and context switches."

Durability & availability is needed
Since RAM does not offer permanent storage, data will be lost if the node crashes. To provide durability and availability, the paper suggests to use two other RAM replicas for the same data. (This would triple the cost of RAMClouds.) "After a crash, one backup server for each shard reads its (smaller) log in parallel; each backup server acts as a temporary primary for its shard until a full copy of the lost server's DRAM can be reconstructed elsewhere in the RAMCloud. With this approach it should be possible to resume operation (using the backup servers) within a few seconds of a crash."

RAMCloud's applicability today
The following is the strongest claim in the paper for the applicability of RAMClouds: "It is probably not yet practical to use RAMClouds for large-scale storage of media such as videos, photos, and songs (and these objects make better use of disks because of their large size). However, RAMClouds are practical for almost all other online data today, and future improvements in DRAM technology will probably make RAMClouds attractive for media within a few years."

Based on my review above, however, I find this claim a bit over-optimistic. I think cost trends and size trends have not been taken into account appropriately for the analysis in the paper. Also, there are several research challenges to be addressed before we can reap the benefits of the latency and bandwidth trends. So I contend that RAMCloud is not cost-effective now, and it may not be cost-effective for sometime soon. I agree with this quotable statement from Jeff Darcy's post. "Nine out of ten people who think they have a truly RAM-cloud-appropriate access pattern should be spending their money not on extra RAM but on smarter programmers."

Post-confession: Actually, I didn't mean to take a negative stand against RAMCloud when I started writing this review. I guess the reason this review went this way may be due to my computer scientist debugging instincts that led me to try and poke holes against "the case for RAMCloud". So, let me end by pointing to a saner perspective. Todd Hoff asks this question of "Are Cloud Based Memory Architectures The Next Big Thing?" in his blog and answers it a lot better than I can. More links on the RAMCloud paper are listed in this post. And here is James Hamilton's summary of the RAMCloud talk.

Thursday, December 16, 2010

Finding a Needle in Haystack: Facebook's Photo Storage

This paper appeared in OSDI'10. The title "Finding a needle in Haystack" is a bit over-dramatization :-) Finding a needle in Haystack becomes straightforward if you can memorize the location of each needle in the haystack. And that is exactly what Facebook Haystack system does.

Haystack is an object store for sharing photos on Facebook where data is written once, read often, never modified, and rarely deleted. Haystack storage system was designed because traditional filesystems perform poorly under the Facebook workload. While using network attached storage (NAS) appliance mounted over NFS, several disk operations were necessary to read a single photo: one (or typically more) to translate the filename to an inode number, another to read the inode from disk, and a final one to read the file itself. While insignificant on a small scale, multiplied over billions of photos and petabytes of data, accessing metadata becomes the throughput bottleneck.

Haystack aims to achieve the following 4 goals. High throughput and low latency is achieved by ideas stemming from Log-structured Filesystem work (keeping all metadata in main memory, and minimizing the disk accesses per read and write). Fault-tolerance is achieved by replicating each photo in geographically distinct locations. Cost-effectiveness is achieved by custom designing the filesystem for the photo storing application and trimming all the fat, which results in each usable terabyte costing ~28% less in the new system. Finally, simplicity is achieved by restricting the design to well-known straightforward techniques.

The simplicity of the presentation and design of the system is refreshing. In fact the design is so simple that one may argue there is not much novel ideas here. Yet, the paper is still very interesting because it is from Facebook, a giant. Facebook currently stores over 260 billion images, which translates to over 20 petabytes of data. Users upload one billion new photos (~60 terabytes) each week and Facebook serves over one million images per second at peak.

Previous NFS-based design
Figure 2 gives an overview of the photo access process of the previous NFS-based design used at Facebook. The user's browser first sends an HTTP request to the web server. For each photo the web server constructs a URL directing the browser to a location from which to download the data. If the content delivery network (CDN) has the image cached then the CDN responds immediately with the photo. Otherwise the CDN contacts Facebook's photo store server using the URL. The photo store server extracts from the URL the volume and full path to the file, reads the data over NFS, and returns the result to the CDN.

The problem with this previous design is that at least 3 disk operations were needed to fetch an image: one to read directory metadata into memory, one to load the inode into memory, and one to read the file contents. Actually before optimizing the directory sizes, the directory blockmaps were too large to be cached by the NAS device and 10 disk operations might be needed to fetch the image.

Unfortunately, caching at the CDN or caching filehandles returned by NAS devices did not provide much relief to Facebook due to the long tail problem. While caches can effectively serve the hottest photos --profile pictures and photos that have been recently uploaded--, a social networking site like Facebook also generates a large number of requests for less popular (often older) content. Requests from the long tail account for a significant amount of the traffic, and these requests all miss the cache and need to access the photo store servers.

Haystack design
To resolve the disk access bottleneck in the NFS-based approach, the Haystack system keeps all the metadata in main memory. In order to fit the metadata in main memory, Haystack dramatically reduces the memory used for filesystem metadata. Since storing a single photo per file results in more filesystem metadata than could be reasonably cached, Haystack takes a straightforward approach: it stores millions of photos in a single file and therefore maintains very large files. (See the second paragraph below for how this works.)

Figure 3 shows the architecture of the Haystack system. It consists of 3 core components: the Haystack Store, Haystack Directory, and Haystack Cache. The Store's capacity is organized into physical volumes. Physical volumes are further grouped into logical volumes. When Haystack stores a photo on a logical volume, the photo is written to all corresponding physical volumes for redundancy. The Directory maintains the logical to physical mapping along with other application metadata, such as the logical volume where each photo resides and the logical volumes with free space. The Cache functions as an internal CDN, another level of caching to back up the CDN.

When the browser requests a photo, the webserver uses the Directory to construct a URL for the photo, which includes the physical as well as logical volume information. Each Store machine manages multiple physical volumes. Each volume holds millions of photos. A physical volume is simply a very large file (100 GB) saved as "/hay/haystack-logical volumeID". Haystack is a log-structured append-only object store. A Store machine can access a photo quickly using only the id of the corresponding logical volume and the file offset at which the photo resides. This knowledge is the keystone of the Haystack design: retrieving the filename, offset, and size for a particular photo without needing disk operations. A Store machine keeps open file descriptors for each physical volume that it manages and also an in-memory mapping of photo ids to the filesystem metadata (i.e., file, offset and size in bytes) critical for retrieving that photo. Each photo stored in the file is called a needle.

A Store machine is a 2U node with 12 1TB SATA drives in RAID-6 (as a single volume). A single 10TB XFS filesystem (an extent based file system) runs across these on each Store machine. XFS has two main advantages for Haystack. First, the blockmaps for several contiguous large files can be small enough to be stored in main memory. Second, XFS provides efficient file preallocation, mitigating fragmentation and reining in how large block maps can grow.

My remaining questions
Although the paper is written clearly, there are still a couple of points I couldn't understand.

My first question is why was Google Filesystem (GFS) not suitable for implementing the Photo Store Server? That is what are the advantages of Haystack over GFS for this application? The paper just mentions that serving photo requests in the long tail represents a problem for which Hadoop is not well suited. But, what does this argument (which I don't understand) have to do with the unsuitability of GFS. I guess the answer is because Facebook needed to keep all metadata in the memory, and there was no immediate implementation of this provided by GFS master chunkserver. I guess if the GFS master chunkserver is modified to hold the metadata in the memory, then Facebook could as well use GFS for solving their problem. As another disadvantage of GFS, the paper cites that GFS is optimized for a workload consisting mostly of append operations (which is actually what Haystack does as well) and large sequential reads (I am not really sure why GFS cannot handle small random reads easily as well). So I don't see why GFS couldn't be adopted for use here.

My second question is about the details of replica maintenance at the Store. In the paper, replica maintenance is only outlined briefly as follows. A photo is replicated to each of the physical volumes mapped to the assigned logical volume. Haystack Directory provides this mapping. It seems like the replication logic is CP according the CAP theorem/model, and sacrifices availability when one of the nodes to host a physical volume to be used is unavailable. But, that is only basic operation, and there are several other side cases, such as failure of one volume, compaction of one volume, deletion of some photos. It is unclear how the replicas are maintained to the face of these, and how the mappings at the Haystack Directory component are updated to account for these. That should require coordination between the Store and the Directory.

Wednesday, December 15, 2010

Boxwood: Abstractions as the foundation for storage infrastructure

This paper is by Microsoft Research, and appeared in OSDI'04. This review will mostly be a stream of conciousness, because I have not yet understood all of the paper and cannot put it in context as much as I would like to.

While reading in to the Boxwood paper, I started to notice how similar this is getting to the GFS problem and GFS approach. Boxwood appeared at OSDI'04, and GFS appeared at SOSP'03. Boxwood refers to GFS but does not compare or contrast itself with GFS. Maybe the reason is in 2004 the Boxwood authors could not see the similarities. This could be because, as I mentioned in my GFS review, the GFS paper did not talk about the Paxos replication of the master chunk-manager in the 2003 paper; that came a couple years later in the Chubby and Paxos-made-live papers. When citing GFS, the Boxwood authors only state that GFS "will be layered over the facilities of Boxwood". But, that is impractical as it would be duplicating a lot of the services; GFS also has chunk manager, failure detector, lock service, replication, Paxos etc.

Let me give a brief overview of Boxwood design, so that I can compare Boxwood with GFS further. Boxwood is for LANs. It assumes Gigabit Ethernet, and uses synchronous replication of data on two discs (I guess the master and shadow chunk managers).

The Boxwood design section in the paper starts with a description of the Paxos service. Then, it continues with the failure detector service, which is not really very interesting since it is a well-studied and mature topic. Then it describes the distributed lock service. Curiously this service is implemented as a master shadow replication, and Paxos is employed only to keep the id of the master. This is curious because, when implementing the same service, the GFS-Chubby approach was to replicate the master via Paxos to four other replicas, which yields a much simpler design and a robust (masking fault-tolerance to two node failures) system.

Then comes the RLDev (replicated logical device) component. The paper writes: "The list of RLDevs, the segments belonging to them, the identity of machines that host the primary and the secondary segments, and the disks are all part of the global state maintained in Paxos." This state amounts to pretty much what the master chunk-manager should hold. Then the question is again, why not maintain this state simply by replicating the master chunk-manager via Paxos as GFS did.

Next comes the chunk manager. As Figure 3 shows, the chunk manager is replicated with a shadow node. "The chunk manager pair relies on a shared RLDev and RPCs to keep the mapping information consistent." Again, why not do this simply via Paxos replication of the chunk manager as GFS did. What is (if any) the advantage to this approach?

Next comes the transaction and logging service but it is extremely scarce in details. The authors implemented Boxwood, as well as BoxFS, a filesystem using Boxwood B-trees, exported using the NFS v2 protocol. Both are implemented as user level processes. Evaluation results for BoxFS are given, but I am not sure how to interpret those results as BoxFS makes simplifying assumptions (such as 30 second data cache flush) over NFS to achieve acceptable performance.

So, wrapping up, I guess there may be several things I don't understand in this paper. The presentation is unfortunately not clear and simple for me. Instead of telling us what its most significant contribution/lesson is, the paper tells us about everything in the system. The introduction emphasizes one thing (distributed data structure abstractions provided), yet the internal sections of the paper emphasize another (Paxos and lock service, and the chunk manager on top of them). Maybe I should use this paper as a discussion paper alongside GFS in my seminar in Spring'11; as a group we can have a better understanding and comparison of both systems.

Sunday, December 12, 2010

Globecom, WSN forum, Urban-scale sensing talk by Ed Knightly (Rice U)

Last week, I attended Globecom'10.

Ed Knightly from Rice talked about urban-scale sensing under 3 parts: vehicular sensing, health sensing, and smart grid.

Ed spent most of his talk on the vehicular sensing part. A recent US deparment of transportation vehicle safety commission project asked this question: vehicles have dozens of sensors already, what if this information was shared, what can we achieve? Some low hanging fruits are: traffic signal warning, curve speed warning, left turn assistant, stop sign movement assistant, lane change warning, collision warning, and finally, internet access applications.

The candidate technology that is proposed for making this networking feasible is a wireless technology, of course. But not the wifi technology which is probably many people's first guess; It is visible light communication (VLC) technology. A VLC transmitter is a LED, which can as well be the LED headlight and taillight in most of the recent models. The only thing needed is to modulate the signals at high frequencies at these LED sources in the visible light spectrum. (VLC is invisible to the naked eye, because it is just modulated too frequently to register at our brains.) VLC receiver is a photodiode, which can again be installed next to the lights in the front, back, and the sides.

More good things going for VLC (aka free space optical communication) are as follows. VLC is green. You get good bandwidth with little energy expenditure. VLC is directional, good for vehicular applications. Sun interference is not a problem, unless there is a direct sunset interference, ambient sun or shadow environment does not interfere with VLC. 400Mbits/sec is the current physical level record for VLC. Ed noted that Boing is trying to push this VLC technology in planes, as it is untintrusive to other equipments on the plane.

The second topic Ed talked about is the health applications of sensing.
Ed said there is an app for a lot of health issues, but these apps are not making much use of sensors. (The exception is nike+ipod, which uses a pedometer.) If we can combine those apps with sensing and close the loop, we can diagnose a lot of problems early on. These regular monitorings can enable preemptive medicine and cut down ER costs.

Ed gave the example of the bluebox. Cardiovascular diseases are responsible for 40% of deaths. Bluebox is a $10 device, which is basically a crappy ekg. It is noisy, but using it regularly catches cardiovascular problems a lot before they become dangerous. So, crappy it is, its regular use saves lives. Can we make this sensor available on your smartphone? Another example is the laser breathalyzer for diabetes which helps determine if an asthma attack is imminent. Using this sensor, asthma attacks can be detected much before they occur.

Ed's final topic was smart grid. Smart electrical-grid consists of network-controlled power terminals, as well as thousands of solar panels attached to the grid. There is recent interest in this domain due to green energy sources. Recently, like the Google-meter several sensing solutions are emerging to provide fine-grain (device-level, minute-granularity) accounting of power usage at living spaces. The goal of the smart grid is to be robust, self-healing against attacks and natural disasters.

Globecom, wireless networking forum, talk on smartphones by Roy Want

Last week, I attended Globecom'10 .

Roy Want (Intel) gave a talk on smartphones in Globecom. He started by showing the market trends for cellphones smartphones and laptops. Cellphones and smartphones grow so quickly that they dwarf the laptop market (which is growing with a healthy 20%). Roy, then, asked the following question: "Will one day will we be computing on the smartphones?" He said that in order for that to happen we need to overcome the poor UI experience of smartphones.

As part of these introductory slides, Roy showed a picture of the Intel atom processor for smartphones. It is smaller than rice grain yet is the brain of smartphone and x86 compatible, so these chips can bring After the presentation, over dinner, I asked Roy about why not put a dozen of these atom processors in one smartphone, given that they take virtually no space. Turns out this is currently not very feasible, because these processors are pretty battery-hungry, even though they are very tiny.

Roy's talk then focused on three things: context aware operation, resource sharing, and processor acceleration.

For context aware operation, the goal is to prioritize which sensors get focus.
iPhone made sensing mainstream, there are a dozen sensors in the iphone: accelerometer, light, touch screen, gps, mic, proximity, magnetometer, etc. But they cannot be always on. Roy mentioned the need for a triage processor to prioritize the sensors, processors, radio on the smartphone.

For resource sharing, the goal is to make maximum use of the computation device rich environment. Can the smartphone seamlessly share a nearby display, network, storage, or other peripherals. He gave the example of editing a movie at home. He calls this dynamic composable computing. At this point, he demoed a VNC running over the smartphone. The smartphone was running Linux, and from the laptop he sshed to the smartphone and started a remote X-window session which is hosted at the smartphone but displayed at the laptop.

For the processor acceleration, Roy asked the question of whether it is possible to share the processor by migrating an application from one device to the other. He said that generally the cleanest way of doing this is to migrate the entire OS, virtual machine. And for this he mentioned Satya's work at CMU on incremental VM synchronization. By doing VM synchronization in an incremental manner only for the parts that got changed, it is possible to improve the performance and reduce the latency of this synchronization to work in real-time. The example he gave was that your phone shares your home computer does something, and your work computer also gets synchronized with your work computer so you can continue at your work computer when you arrive there. (Turns out this is now common-place. Xen provides this. Amazon makes use of this for replicating your machine with them and checkpoint your service.)

My colleague Chunming Qiao asked Roy a question on whether he is aware of the single system VM migration made possible by Kerrighed. While traditional VM approaches are fitting multiple VMs in one physical machine, Kerrighed tries to run one VM over multiple physical machines. This is beneficial for providing more CPU, RAM, storage resources to your VM. Your VM can grow on the fly even beyond the limitations of the physical host. This will obviously be beneficial for your smartphone, as it can now use the CPU, RAM, storage resources from the other computers.

This talk was very interesting to me because we are trying to build a 1000 smartphone (cloud-enabled) testbed at UB. More on this later, once we make more progress on this.

Onix: A Distributed Control Platform for Large-scale Production Networks

The Onix work (OSDI'10) builds on Nox. Essentially, Onix takes Nox and distributes over multiple servers.

Let me start with a brief refresher on Nox. (Or you can read my previous post on Nox) The main idea in Nox (and openflow) was to facilitate innovation by separating the control plane from the forwarding (data) plane. (In the current networking architecture, control and data planes are both implemented in the same place, the routers.) Nox introduced "software defined networking" (SDN): Nox uses a centralized controller to make the decisions (i.e., control plane); The routers implement only the data plane, and just follow directions from the controller while forwarding data. A drawback with Nox was that since it uses a single controller, it is prone to a single point of failure. Although the Nox work pointed out how this single controller can be distributed, it didn't pursue it further.

Enter Onix. Onix investigates how to distribute that single controller. In Onix the controller consists of a network information base (nib), and two other components: switch import/export and distribution import/export. All active network elements are stored in nib in key value pairs. Nib is a decentralized component, distributed over several Onix nodes. Once you change your local nib instance on one Onix node, those modifications are propogated to other nibs. The switch import/export component talks to switches to configure them according the instructions from the Onix node (it sort of acts as an interpreter). The distribution import/export component makes multiple nibs consistent with each other in an asynchronous manner.

Developers only see nib, exclusive locks on nib can only be attained on local instances. All the nib operations are asynchronous. No distributed locking mechanism is provided in the API. The limitation of this API is that it relies on application-specific logic to detect and provide conflict resolution of the network state.

For scalability, Onix let's you have hierarchies. Onix node C can coordinate Onix node A and B; In this case C's nib has two elements A and B. It is possible that A and B may be sharing some switches they are responsible with (those switches appears in nibs of both A and B).

As for reliability, Onix can detect link failures, then using the user written code, Onix fixes the link failures accordingly. Onix provides two data stores: a transactional data store (for durability of the local storage) and a one-hop DHT simple consistent hashing (for holding volatile network state in a fast manner).

The paper discusses 4 applications being built with Onix: a network management application, distributed virtual switch application, multi-tenant virtualized data centers (VPN implementation), and scale-out carrier-grade IP router. The paper also includes evaluation experiments.

The Onix paper ends on a cautionary note: "What we should make clear, however, is that Onix does not, by itself, solve all the problems of network management. The designers of management applications still have to understand the scalability implications of their design. Onix provides general tools for managing state, but it does not magically make problems of scale and consistency disappear. We are still learning how to build control logic on the Onix API, but in the examples we have encountered so far management applications are far easier to build with Onix than without it."

There are a lot of smart people pushing for creating a very programmable network. Hellerstein's work on declarative network programming is a nice complement/addition to Nox, Onix. I certainly appreciate that Nox and openflow makes evaluation and development of new protocols possible in the network. That is certainly very useful for network researchers. Even at the enterprise level, some programmability over the network would be welcomed, I guess. (Right now, the only control we have over networks is via the BGP level rules, and there is no control at the LAN level.) But, the problem is, it is very hard to find/hire people with expertise to program the enterprise network in the very fine grain programming environment these groups are providing. Maybe it is better to provide a network that just works as plug and play (like in the VL2 approach), rather than trying to provide complete control over every aspect of the network. It is unclear which approach is better.

Saturday, November 27, 2010

Dynamo: Amazon's highly available key-value store

This paper, which appeared in SOSP'07, describes Dynamo, the underlying storage technology for several core services in Amazon's e-commerce platform. Dynamo is a NoSQL system and provides a single key-value store. The commonly accepted (yet still disputed) wisdom is that RDBMS are overkill for simple key-value stores and are unsuitable for large-scale multi datacenter systems.

The goal in Dynamo is providing reliability at large scale. As the paper says in the introduction "The reliability and scalability of a system is dependent on how its application state is managed". This was a key lesson from the Ousterhout'90 paper: "The role of distributed state".

Dynamo is an optimistic replication system. According to the optimistic survey taxonomy, Dynamo is a multi-master (multiple coordinators can update the data) system that employs state transfer and asynchronous propagation of updates. Hence, Dynamo allows conflicting updates in to the system. Dynamo uses vector clocks to detect conflicts, and employs application-level/semantic conflict resolution. There is no bound on replica divergence. Dynamo adopts a best-effort eventual consistency approach and try to resolve divergent replicas problem when possible.

CAP theorem
The Dynamo system emphasizes availability to the extent of sacrificing consistency. The abstract reads "Dynamo sacrifices consistency under certain failure scenarios". Actually, later it becomes clear that Dynamo sacrifices consistency even in the absence of failures: Dynamo may become inconsistent in the presence of multiple concurrent write requests since the replicas may diverge due to multiple coordinators. Using Abadi's model, Dynamo is PAEL. When there are failures (Partitioning), Dynamo chooses Availability over consistency, and, Else (even in the absence of failures) Dynamo chooses Low-latency over consistency.

Dynamo is inclined to sacrificing consistency because the application semantics can tolerate inconsistency. Dynamo employs post hoc conflict resolution to fix the inconsistencies.

Design principles
The Dynamo paper mentions the Google File System (GFS) SOSP'03 work in just two sentences, but I think they are related works and they warrant comparison. Dynamo makes very different design choices than GFS. While GFS used a centralized-master approach (which was maintained in a highly available fashion using Chubby/Paxos), Dynamo uses a pure peer-to-peer (P2P) approach for serving requests.

The Dynamo system fully-embraces the symmetry principle in its design. Quoting from the paper: "Symmetry: Every node in Dynamo should have the same set of responsibilities as its peers; there should be no distinguished node or nodes that take special roles or extra set of responsibilities. In our experience, symmetry simplifies the process of system provisioning and maintenance."

I think this over-emphasis on symmetry is unwarranted. Why should the system force every node try to do everything for themselves rather than employing specialization? Amazon has a lot of expertise in services and service level agreements. So I am surprised as why they did not think of simplifying the design and implementation of Dynamo by employing specialized nodes to provide a distributed directory lookup service. The VL2 paper shows that a distributed directory service can be implemented in a very efficient and highly available manner in data centers.

The symmetry principle, and the resulting P2P fully decentralized storage nodes system, looks contrived in the paper. I wish the paper included some justification for this design choice. I cannot find very good reasons for it. Obviously this choice leads to a high availability system, but high availability is also achievable with a master coordinator approach using Chubby/Paxos solution in practice. And using a master coordinator approach would have significantly simplified several design problems in the paper, as I discuss in the next section.

The interesting thing is that the paper lists heterogeneity as another key design principle right after the symmetry principle: "Heterogeneity: The system needs to be able to exploit heterogeneity in the infrastructure it runs on. e.g. the work distribution must be proportional to the capabilities of the individual servers. This is essential in adding new nodes with higher capacity without having to upgrade all hosts at once." Doesn't this hint that specialization can also be warranted to exploit heterogeneity? There is no need to strictly adhere to the symmetry principle.

Details of Dynamo
The paper gives details on the partitioning, replication, versioning, membership, and failure handling components of Dynamo.

Partitioning algorithm employs a typical P2P key distribution algorithm (a consistent hashing approach as in Chord). Virtual nodes (tokens) are used for making the key distribution load balancing more fine-grained and more uniform. Dynamo takes pains to make sure that every storage node is utilized almost equally to any other storage node, but enforcing that precise load balancing may lead to wasted traffic in node join and leave.

Replication is performed so that each data item is replicated at N hosts. The coordinator asks for N nodes, but for the sake of availability if quorum number responds, the operation succeeds. N < R+W so we can have intersecting read and write quorums. Tuning of R and W can give write-optimized read-optimized system. In Dynamo, a popular configuration is N=3, R=2, W=2. (Note that we could still employ this type of quorums with a master approach, in GFS N=3, R=1, W=3. Dynamo can provide only a per service tuning of R and W, but with a master-based solution, you can have per key tuning of R and W.)

Data versioning is performed by using vector clocks. Dynamo treats the result of each modification as a new and immutable version of the data.

The biggest difference between Dynamo and traditional p2p systems is that it employs a one-shot lookup. In Dynamo, all nodes maintain information about all key-coordinator mappings. Dynamo does not employ a directory service lookup, instead insists --due to symmetry principle-- that all nodes maintain all the directory information. The request is first routed to a random node (for load balancing purposes), and since that node has information about which node is primarily responsible (a.k.a. coordinator) for the key (typically the first among the top N nodes in the preference list of the key), it routes the request to the corresponding coordinator. If by chance the node that received the request in the first place is also in the list, it can also coordinate the request. Read and write operations involve the first N healthy nodes in the preference list, skipping over those that are down or inaccessible.

The P2P fully decentralized design introduces a lot of complexity
The P2P fully-decentralized system introduces a lot of complexity especially in the failure handling cases. For handling transient failures, hinted handoff is complicated. For handling permanent failures, replica synchronization is more complicated. Membership and failure detection also gets very tricky in a purely decentralized setup: gossip protocol, seed nodes for external discovery, failure detection, adding/removing storage nodes are presented in a hand-wavy manner in the paper. Using a master coordinator would have made all these operations very straightforward.

Using a single master approach would also obviate the need to deal with conflicts. Since updates would be serialized by the single-master, conflicts would be prevented trivially. However, due to faults, an old state may have been exposed, and in the presence of partitions we still have to choose between consistency and availability.

Distributed directory service for lookup
An easier and less radical change in Dynamo would be to adopt a distributed directory service layer for lookup as in the VL2 paper. I think that would make a cleaner design, and solve the self-confessed limitation of scalability with the current design. To quote from the paper: "Finally, Dynamo adopts a full membership model where each node is aware of the data hosted by its peers. To do this, each node actively gossips the full routing table with other nodes in the system. This model works well for a system that contains couple of hundreds of nodes. However, scaling such a design to run with tens of thousands of nodes is not trivial because the overhead in maintaining the routing table increases with the system size. This limitation might be overcome by introducing hierarchical extensions to Dynamo. Also, note that this problem is actively addressed by O(1) DHT systems (e.g., [14])."

The thing that has most impressed me about Dynamo is that the replication is so loosely coupled that they would choose the replicas at different data centers for disaster tolerance purposes. This is made possible by the NoSQL approach and PAEL design choice in Dynamo. The paper also gives a lot of evaluation results. On average 10 ms to complete a write. 99.9% of writes completed in 300ms.

As I mentioned above, I think Dynamo would have been much simpler and efficient by making different design choices. But, what do I know?

Wednesday, November 24, 2010

Crowdsourced sensing and collaboration using Twitter

First, a brief prelude about my research interests. I started my PhD on distributed algorithms and self-stabilizing systems in 1998. But then in 2000, after my advisor received a DARPA grant on networked embedded sensor technologies, I have started working on wireless sensor networks. After 10 years of working on wireless sensor networks, I am now in the process of switching topics. I am doing a lot of reading on cloud computing. This is a topic I enjoy, and I think I can contribute here due to my background in distributed systems and theory.

I am also doing a lot of reading on smartphones, as they provide a good alternative/complement to wireless sensor network systems. The main appeal of smartphones is that they have solved the market penetration problem, which the wireless sensor networks have been perpetually struggling with. The killer applications for smartphones are communication and offering a lightweight ubiquitous PC replacement. Smartphones are incidentally better sensors than the wireless sensor network motes. They have much higher computation and storage power. They have ubiquitous connection through GSM, wifi, bluetooth. They have surprising amount of sensors: camera, microphone, GPS, compass, proximity, ambient light, ambient noise, 3D accelerometer, touchscreen, temperature. They are mobile, and cared by the human user, who can help for certain sensing/recognition tasks. They are inexpensive thanks to mass production.

That is why I am interested in smartphones and specifically on crowdsourcing sensing/collaboration to smartphones using Twitter as a middleware. Here is a presentation that provides a brief overview of my work in this topic.

I believe this is a promising direction, and I think the real breakthroughs are going to arrive when we can get smartphone crowdsourcing and cloud computing to collaborate seamlessly.

Wednesday, November 17, 2010

VL2: A Scalable and Flexible Data Center Network

This paper is by MS Research and appeared in Sigcomm 2009. The paper investigates data center networking, the same problem as the Portland paper (which also appeared in Sigcomm 2009!). Naturally there are similarities in the approaches recommended by the two papers.

The motivation for this paper is from a slightly different angle than the Portland paper. This paper puts more emphasis on the network capacity problem in the data centers. The paper argues that the network is the bottleneck of the computation, since the switches at the higher levels (i.e., aggregation and core switches) are oversubscribed heavily. Servers typically have 1:1 over-subscription to other servers in the same rack --that is, they can communicate at the full rate of their interfaces (e.g., 1 Gbps). However, up-links from top of the rack (ToR) switches are typically 1:20 oversubscribed (i.e., 1 Gbps of up-link for 20 servers), and paths through the highest layer of the tree can be 1:240 oversubscribed.

The paper focuses on the question of providing uniform high capacity for any server to server communication such that traffic flow should be limited only by the available capacity on the network-interface cards of the sending and receiving servers. However, that is only half of the story. In addition to removing the network bottleneck from computation, the data centers must achieve high utilization in order to be profitable. The key to high utilization is the property of agility--the capacity to assign any server to any service. Without agility, each service must pre-allocate enough servers to meet difficult to predict demand spikes, or risk failure at the brink of success. With agility, the data center operator can meet the fluctuating demands of individual services from a large shared server pool, resulting in higher server utilization and lower costs. In order to achieve agility, assigning servers to a service should be independent of network topology.

Scale-out Clos topology and Valiant load balancing
To answer the uniform high capacity problem, the paper proposes a Clos topology as in Figure 5. This topology provides a tree that is "fatter" than the fat tree used in the Portland paper.
However, providing a very fat tree alone is not sufficient to solve the network bottleneck problems; we also need to ensure that traffic is allocated across these multiple available paths appropriately so that one path does not choke down with lots of traffic and reduce the effective rate of communication for the servers communicating over that path. The question is how to achieve such an allocation? Are there any patterns in the data center traffic that will enable us to partition the multiple paths in an optimal way among the servers?

To answer these questions the paper provides extensive experiments and analysis of data center traffic. The analysis found that 90% of flows are small-size (mice), but still 95% of bytes are in big-size flows (elephants). The analysis also show a lot of volatility (unpredictability) in traffic. The variability in datacenter traffic is not amenable to concise summarization and hence engineering routes for just a few traffic matrices is unlikely to work well for the traffic encountered in practice. Armed with these analysis results, the paper proposes to use valiant load balancing (vlb) to randomize end-to-end communication paths to cope with volatility and achieve load balancing. In this scheme, the ToR switch randomly chooses an intermediate switch (among many available options) on a per flow basis.

The paper also provides extensive experiments on network equipment failure characteristics---this was sorely missing in Google's globally available storage paper. The analysis of network failures found that most failures are small in size (e.g., 50% of network device failures involve < 4 devices and 95% of network device failures involve < 20 devices) and large correlated failures are rare (e.g., the largest correlated failure involved 217 switches). Still, downtimes can be significant, and with no obvious way to eliminate all failures from the top of the hierarchy, this paper's approach is to broaden (fatten) the topmost levels of the network so that the impact of failures is muted and performance degrades gracefully.

Virtual layer 2 networking
To answer the agility problem, the paper proposes VL2, which stands for Virtual Layer 2 as far as I can make out --the acronym is not defined in the paper. The key idea of VL2 is separating names from locators. VL2 assigns servers IP addresses that act as names alone, with no topological significance. When a server sends a packet, the shim-layer (a layer 2.5 if you will) on the server invokes a directory system to learn the actual location of the destination and then tunnels the original packet there. The shim-layer also helps eliminate the scalability problems created by ARP in layer-2 networks.
The separation of location-specific IP addresses (LAs) and application-specific IP addresses (AAs) was chosen for two reasons. First, this makes it possible to use low-cost switches, which often have small routing tables that can hold only LA route intervals, without concern for the huge number of AAs. Second, this reduces overhead in the network control plane by preventing it from seeing the churn in host state (change in AAs), tasking it instead to the more scalable directory system (which we discuss below). As such, the network infrastructure operates using LAs; all switches and interfaces are assigned LAs, and switches run an IP-based (layer-3) link-state routing protocol that disseminates only these LAs (end-host information AAs are not disseminated). This allows switches to obtain the complete switch-level topology, as well as forward packets encapsulated with LAs along shortest paths using standard proven network protocols such as OSPF. On the other hand, applications use AAs, which remain unaltered no matter how servers' locations change due to virtual-machine migration or re-provisioning.

To route traffic between servers, which use AA addresses, on an underlying network that knows routes for LA addresses, the VL2 agent at each server traps packets from the host and encapsulates the packet with the LA address of the ToR of the destination as shown in Figure 6. Once the packet arrives at the LA (the destination ToR), the switch decapsulates the packet and delivers it to the destination AA carried in the inner header.

The crux of offering layer-2 semantics via VL2 is having servers believe they share a single large IP subnet (i.e., the entire AA space), while eliminating the ARP and DHCP scaling bottlenecks that plague large Ethernets. From the cloud service programmer's point of view VL2 *efficiently* provides the abstraction that all the servers assigned to the programmer are plugged in to the same LAN--where any IP address can be connected to any port of an Ethernet switch due to flat addressing.

Directory lookup
The VL2 directory system stores the mapping of AAs to LAs. VL2 uses end-system based address resolution to scale to large server pools, without introducing complexity to the network control plane. When an application sends a packet to an AA for the first time, the networking stack on the host generates a broadcast ARP request for the destination AA. The VL2 agent running on the host intercepts this ARP request and converts it to a unicast query to the VL2 directory system. The directory system answers the query with the LA of the ToR to which packets should be tunneled. The VL2 agent caches this mapping from AA to LA addresses, similar to a host's ARP cache, such that subsequent communication need not entail a directory lookup.
The directory system design consists of a modest number (50-100 servers for 100K servers) of read-optimized, replicated directory servers that cache AA-to-LA mappings to handle queries from VL2 agents, and a small number (5-10 servers) of write-optimized, asynchronous replicated state machine (RSM) servers that offer a strongly consistent, reliable store of AA-to-LA mappings. In other words, the directory lookups are handled using read optimized servers, which are just caches of the write optimized Paxos-running RSMs that hold persistent state.

To achieve high availability and low latency, an agent sends a lookup to k randomly-chosen directory servers, and uses the first answer it receives back. The network provisioning system sends directory updates to a randomly-chosen directory server, which then forwards the update to a RSM server. VL2 does not require the use of a fabric manager proposed in Portland, instead the directory system serves the IP-to-LA mapping. While the fabric manager is a single machine and centralized solution, the directory service provides a decentralized solution.

The evaluation results shows that VL2 provides an effective substrate for a scalable data center network; VL2 achieves (1) 94% optimal network capacity, (2) a TCP fairness index of 0.995, (3) graceful degradation under failures with fast reconvergence, and (4) 50K lookups/sec under 10ms for fast address resolution.

The Clos network topology pays off, as the goodput is more than 10x of what the network in current data centers can achieve with the same investment. Another striking result is that comparing the cost of a VL2 network for 35K servers with a traditional data center network shows that a VL2 network with no over-subscription can be built for the same cost as the traditional network that has 1:240 over-subscription.

This is a very well written, definitive paper on data center networking. The paper illustrates the challenges in data center networking via extensive analysis, and offers a simple design that can be realized today with available networking technologies, re-utilizing time-tested/proven network protocols, and avoiding changes to switch control and data plane capabilities. This is a must-read for anyone interested in the data center networking topic.

Monday, November 15, 2010

PortLand: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric

Last week we covered the Portland paper by UC San Diego, which appeared at Sigcomm'09. This paper is well written and I really appreciate its clarity and simplicity.

The motivation for the work is the need to scale at the data center networks (DCNs). Data centers today may have around 100K machines, with 32 virtual machines at each, which yield a total of 3 million hosts. Assuming one switch is required for every 25 physical hosts and accounting for interior nodes, the topology would consist of 8,000 switches. Current network protocols impose significant management overhead at this scale let alone supporting seamless VM migration across machines. For example, broadcasting ARPs and updates to every swithch is out of the question for such a network.

The desiderata for the DCN is given as:
R1. Any VM may migrate to any physical machine. Migrating VMs should not have to change their IP addresses as doing so will break pre-existing TCP connections and application-level state.
R2. An administrator should not need to configure any switch before deployment.
R3. Any end host should be able to efficiently communicate with any other end host in the data center along any of the available physical communication paths.
R4. There should be no forwarding loops.
R5. Failures will be common at scale, so failure detection and recovery should be rapid and efficient.

To achieve these desiderata the paper proposes Portland, a scalable fault-tolerant layer-2 routing and forwarding protocol for DCNs. The principal insight behind Portland is the observation that DCNs have a known baseline topology imposed by racks and rows of racks, and need not support an arbitrary topology. Using this observation, Portland enables switches to discover their positions in topology, and assigns internal hierarchical pseudo mac (pmac) addresses to all end hosts, which enable efficient, provably loop-free forwarding, and easy VM migration.

Portland provides layer 2 network forwarding based on flat MAC addresses. A layer 2 fabric imposes less administrative overhead compared to layer 3 forwarding. A layer 3 protocol would require configuring each switch with its subnet information and synchronizing DHCP servers to distribute IP addresses based on the host's subnet. A layer 3 protocol would also render transparent VM migration difficult because VMs must switch their IP addresses if they migrate to a host on a different subnet. While VLANs allow a middle ground between layer 2 and layer 3, VLANs lacks flexibility (as bandwidth resources to be explicitly assigned to each VLAN at each participating switch) and scalability (as each switch must maintain state for all hosts in each VLAN that they participate in).

Fabric manager:
PortLand employs a logically centralized fabric manager that maintains soft-state about network configuration information. The fabric manager is a user process running on a dedicated machine responsible for assisting with ARP resolution, fault tolerance, and multicast. This fabric manager idea is very similar to the open flow networking idea we discussed in an earlier paper. Using soft-state for network configuration and maintaining this at the fabric manager relieves the manual administration requirements.

Pseudo MAC address:
The basis for efficient forwarding and routing as well as VM migration in Portland's design is the hierarchical Pseudo MAC (PMAC) addresses used to identify each end host. The PMAC encodes the location of an end host in the topology. For example, all end points in the same pod will have the same prefix in their assigned PMAC. PMAC address has the form: pod.position.port.vmid. The end hosts remain unmodified, believing that they maintain their actual MAC addresses.

Location discovery protocol:
Portland employs a location discovery protocol so that switches learn whether they are edge, aggregate, or core switches and configure themselves. Protocol works by first the edge switches discovering that they are on the edge (they don't get location discovery packets from majority of ports [which are terminals]). After that aggregation switches discover they are aggregation switches, because they talk to at least one edge switch. Finally, core switches discover they are core switches as they do not talk to any edge switches. This location discovery protocol assumes that the protocol (and protocol timers) is started at the same time in the data center. But it is easy to relax that requirement by writing a self-stabilizing location discovery protocol.

VM migration:
VM keeps the same IP after migration, but the PMAC address needs to be changed. To this end, the VM from the new location sends a message to the fabric manager, and the IP mapping to PMAC is changed at fabric manager. The fabric manager also propogates this change to the switch that VM has left. (The new switch already knows this updated mapping as the packet was sent by the VM via the new switch.)

Portland achieves loop free routing using a simple rule: once the packet starts to go downwards the hierarchy (core to aggregate to edge) it cannot go back upwards. Portland achieves fault-tolerant routing by utilizing the fabric manager as a centralized control for the network; this is very much like what open-flow does.

Friday, November 12, 2010

Availability in Globally Distributed Storage Systems

This paper by Google Research provides a report on the availability behavior of large cloud storage systems by studying up to 7K nodes at Google over a year. The paper does not propose any new protocols, but provides sufficiently accurate models of system behavior/performance based on the study. These models are important since they enable us to correctly design and optimize these multi layered systems for data availability.

The work is divided into two parts. The first is the analysis of the component availability (disks, machines, racks), and the second is the analysis of the data availability, as inferred from the component availability results and design decisions of the distributed storage system.

A storage node is defined as unavailable when it fails to respond positively to periodic health checking pings sent by the monitoring system. The paper does not investigate about network errors specifically; those are also swept under unavailability with software & hardware faults at the node level. Even in Section 3, when the causes of node availability are analyzed in detail, network errors are not mentioned.

Node unavailability does not directly imply data unavailability thanks to identical replication or alternatively erasure encoding. In both approaches, data is divided into a set of stripes, each of which comprises of fixed size chunks. Data in a stripe can be reconstructed from some subsets of the chunks. For replication, R = n refers to n identical chunks in a stripe, so the data may be recovered from any one chunk. For Reed-Solomon erasure encoding, RS(n,m) denotes n distinct data blocks and m error correcting blocks in each stripe. In this case a stripe may be reconstructed from any n chunks.

Node failures and Correlated failures
The majority of node unavailability is due to planned reboots (such as kernel version upgrades), which is followed closely by node restarts (OS restart on the storage node). Unplanned machine reboots (e.g., kernel crashes) don't occur frequently but have the longest average duration to recover since extra checks and corrections are used to put the machines to a safe state.
The paper then focuses on measuring the frequency and severity of correlated failures. (Root causes of these failures --power outages, rolling reboots/upgrades-- are not investigated in the paper.) A failure burst is defined with respect to a window size as a maximal sequence of node failures, each one occurring within a time window 2 minutes of the next. 2 minutes is derived from a knee analysis of the graph plotting the "effect of window size on the fraction of individual failures that get clustered into bursts".

Next, the paper focuses on detecting rack-related node failures. To this end, rack-affinity score is defined as the "probability that a burst of the same size affecting randomly chosen nodes in that cell will have a smaller burst score". For a random burst, the expected rack affinity score is 0.5, for a rack-correlated burst close to 1, and a rack-anti-correlated burst close to 0 (they have not observed any rack-anti-correlated bursts). The authors found that larger failure bursts have higher rack affinity: All the failure bursts of more than 20 nodes have rack affinity greater than 0.7, and those of more than 40 nodes have affinity at least 0.9. Figure 8 shows frequency of failure bursts sorted by racks and nodes affected, and we can clearly see a large fraction of failures are bursty and rack-correlated. Later at Figure 10, it is shown that large failure bursts are the biggest contributor to unavailability as well.
Coping with failures
When a node failure causes the unavailability of a chunk within a stripe, the system initiates a recovery operation for that chunk from the other available chunks remaining in the stripe. The rate at which missing chunks may be recovered is limited by the bandwidth of individual disks, nodes, and racks, and there is a tradeoff between doing recovery and serving new client requests. Of course, given the frequency of bursty rack-correlated failures, it is no surprise that a rack-aware stripe placement policy (that ensures that no two chunks in a stripe are placed on nodes in the same rack) increases the stripe MTTF (i.e., data availability) by a factor of 3 typically.

Markov model of data availability and its findings
Using the data collected and analyzed above for node failures and rack-correlated bursty failures, the paper formulates a Markov model for data availability and develops analytical models to reason about past and future availability in the storage clusters, including the effects of different choices of replication, data placement and system parameters. A very simple example for identical replication and 2 chunks per stripe (i.e., R=2) is given in Figure 12.
The paper shows that the model is able to capture well the effect of failure bursts on the MTTF. Next the paper investigates how changes in the parameters of the system will affect data availability. Using the model the paper finds that reducing recovery times is effective when correlated failures are few. For RS(6,3) with no correlated failures, a 10% reduction in recovery time results in a 19% reduction in unavailability. However, when correlated failures are taken into account, even a 90% reduction in recovery time results in only a 6% reduction in unavailability. Correlated failures (compared to independent/random failures) are also shown to reduce the MTTF by at least two orders of magnitude. Correlation also reduces the benefit of increasing data redundancy. The gain in availability achieved by increasing the replication number, for example, grows much more slowly when we have correlated failures.

Two other important conclusions from the model, that should guide design decisions in distributed storage systems, are as follows. The model shows that component availability improvements below the node (server) layer of the storage stack do not significantly improve data availability. Assuming R=3 is used, a 10% reduction in the disk failure rate increases stripe availability by less than 1.5%. On the other hand, cutting node failure rates by 10% can increase data availability by 18%. The model also shows that replicating data across multiple cells (data centers) greatly improves availability because it protects against correlated failures. For example, R=2x2 (i.e., replicating twice in two cells) with 1 day recovery time between cells has two orders of magnitude longer MTTF than R=4.

I expect this paper to be very influential for distributed storage systems research. Using logs obtained from a large-scale production environment, the paper provided evidence that correlation among node failures dwarfs all other contributions to unavailability. (That said, I am disappointed that the network-related failures are not investigated separately.) The provided analytical model of data availability is very promising; and should be used by the distributed storage systems researchers to guide and evaluate any protocols they propose.

Sunday, November 7, 2010

My iPhone 4 has a transparent screen

I just took a picture of my palm, and set it as the wallpaper to give the transparency effect. The trick worked well on some unsuspecting friends.

Optimistic Replication

This 2005 paper by Saito and Shapiro provides a comprehensive survey of the optimistic replication area and is a must-read for distributed services designers. Below, I am providing some snippets from the paper to give a brief overview.

Data replication improves both availability and performance for distributed services. Availability is improved by allowing access to the data even when some of the replicas are unavailable. Performance improvements concern reduced latency (by enabling local access from replica instead of remote access) and increased throughput (by letting multiple computers serve the data).

Pessimistic techniques block access to a replica unless it is provably up to date. Such pessimistic techniques perform well in local-area networks, in which latencies are small and failures uncommon, but is unsuitable for wide-area networks, because the Internet remains slow and unreliable. Moreover, pessimistic algorithms scale poorly, because its throughput and availability suffer as the number of sites increases.

Compared to traditional "pessimistic" techniques, optimistic replication promises higher availability and performance, but lets replicas temporarily diverge and lets users see inconsistent data. In optimistic techniques,
updates are propagated in the background, and occasional conflicts are fixed after they happen. Optimistic algorithms should be able to scale to a large number of replicas, because they require little synchronization among sites.

These benefits, however, come at a cost. CAP theorem states that any distributed system faces a trade-off between availability and consistency. Where a pessimistic algorithm waits, an optimistic one speculates. Thus, optimistic replication faces the challenges of diverging replicas and conflicts between concurrent operations. So it is applicable only for applications that can tolerate occasional conflicts and inconsistent data. Fortunately, in many real-world systems, especially file systems, conflicts are known to be rather rare. Some example applications of optimistic replication are DNS, USENET, PDAs, CVS, Amazon Dynamo.

When describing algorithms, it is useful to distinguish sites that can update an object --called master sites -- from those that store read-only replicas.

To update an object, a user submits an operation at some site. An operation submitted by the user of a replica is tentatively applied to the local replica to let the user continue working based on that update. It is also logged, i.e., remembered in order to be propagated to other sites later. These systems often deploy epidemic propagation to let all sites receive operations, even when they cannot communicate with each other directly. Because of background propagation, operations are not always received in the same order at all sites. Each site must reconstruct an appropriate ordering that produces an equivalent result across sites. In addition, with no a priori site coordination, multiple users may update the same object at the same time, so there is a need for detecting operations that are in conflict and resolve them. Commitment refers to an algorithm to converge the state of replicas by letting sites agree on the set of operations and their final ordering and conflict-resolution results.

Optimistic Replication: Design Choices
We classify optimistic replication systems along the following axes: (The rest of the paper investigates these design choices in detail)
  • Number of writers: single-master vs. multi-master: Single-master systems designate one replica as the master. All updates originate at the master and then are propagated to other replicas, or slaves. They are simple but have limited availability, especially when the system is write-intensive. Multi-master systems let updates be submitted at multiple replicas independently and exchange them in the background. They are more available but significantly more complex. In particular, operation scheduling and conflict management are issues unique to these systems. According to Gray et al. [1996], a naive multi-master system would encounter concurrent updates at the rate of O(M^2), assuming that each master submits operations at a constant rate. (M is the number of masters.)
  • Definition of operations: state transfer vs. operation transfer. State-transfer systems limit an operation either to read or to overwrite an entire object. State transfer is simple, because maintaining consistency only involves sending the newest replica contents to other replicas. Operation-transfer systems must maintain either a history of operations and have replicas agree on the set of applied operations and their order. On the other hand, they can be more efficient, especially when objects are large and operations are high level. Operation transfer also allow for more flexible conflict resolution.
  • Scheduling: syntactic vs. semantic. The goal of scheduling is to order operations in a way expected by users, and to make replicas produce equivalent states. Scheduling policies can be classified into syntactic and semantic policies. Syntactic methods are simple but may cause unnecessary conflicts. The most popular example is ordering operations by timestamp. Semantic scheduling, on the other hand, exploits semantic properties such as commutativity of idempotency of operations. Semantic scheduling increases scheduling flexibility and reduces conflict, but at the cost of application dependence and complexity.
  • Handling conflicts. The best approach is to prevent conflicts from happening altogether. Pessimistic algorithms prevent conflicts by blocking or aborting operation as necessary. Single-master systems avoid conflicts by accepting updates only at one site (but allow reads to happen anywhere). These approaches, however, come at the cost of lower availability. Conflicts can also be reduced, for example, by quickening propagation or by dividing objects into smaller independent units.
  • Propagation strategies and topologies. Local operations must be transmitted and re-executed at remote sites. In general, the quicker the propagation, the less the degree of replica inconsistency and the rate of conflict, but more the complexity and overhead, especially when the application receives many updates relative to read requests.
  • Consistency guarantees. In any optimistic replication system, the states of replicas may diverge somewhat. A consistency guarantee specifies whether a client application may observe divergence, and how much.


Keep it simple. Traditional, pessimistic replication, with many off-the-shelf solutions, is perfectly adequate in small-scale, fully connected, reliable networking environments. Where pessimistic techniques are the cause of poor performance or lack of availability, or do not scale well, try single-master replication: it is simple, conflict-free, and scales well in practice. State transfer using Thomas's write rule (last writer wins) works well for many applications. Advanced techniques such as version vectors and operation transfer should be used only when you need flexibility and semantically rich conflict resolution.

Propagate operations quickly to avoid conflicts. While connected, propagate often and keep replicas in close synchronization. This will minimize divergence when disconnection does occur.

Exploit commutativity. Commutativity should be the default; design your system so that non-commutative operations are the uncommon case. For instance, whenever possible, partition data into small, independent objects. Within an object, use monotonic data structures such as an append-only log, a monotonically increasing counter, or a union-only set.

Wednesday, November 3, 2010

A presentation tip (brought to you by a weary audience)

A common and serious flaw I see in student presentations is that the presenter rushes through the first 5-10 slides. This is the worst thing to do in a presentation, these introduction slides to the problem and solution are the most important ones. If the presenter loses the audience in these beginning slides, this wastes the entire hour for the audience. In comparison, losing the audience in the middle of the presentation is less critical, less time is wasted, and the audience can even use this time to think about alternative solutions to the problem (which was presented clearly in the introduction).

You would think this is common sense, but most of the students still commit this mistake. I guess the reason is since the presenter had read and understood the relatively easy introduction part of the paper, he thinks the audience can also understand and follow it with ease. And armed with the powerpoint slides, the presenter can finish the first 10 slides in less than 10 minutes in bullet time. I intercept the presenters with questions if I see this happening, but I am getting really tired of playing the stupid/clueless professor. The worse thing is most of the audience do not follow the presentation, but nobody intercepts. I think in any case it is better to intercept to avoid wasting time for everyone.

The role of distributed state

This is a 1990 technical report by John Ousterhout, who has been a very influential and pioneer figure in systems research. I had written a summary of his log-structured file system (LFS) paper earlier in this blog. It seems like his ideas on LFS and RAMCloud are becoming important in cloud computing these days. Ousterhout is also the creator of TCL/TK, so he has a lot to say in programming languages for web applications domain as well. The miscellaneous articles on his non-academic homepage include a lot of wisdom from his experience in the software development industry, and are also worth reading carefully.

This paper titled "Role of distributed state" describes the advantages and disadvantages of distributed state and suggests that the act of building a distributed system consists of making tradeoffs among various alternatives for managing the distributed state. Illustrating this point, the two case studies included in the paper, NFS and Sprite file systems, end up with a corresponding set of good and bad properties, and the strengths and weaknesses of the two systems are almost opposites.

The advantages of distributed state are performance (through local caching of the state), coherency (through using the state for collaboration and coordination), and reliability (through replication of the state). Unfortunately, these advantages are only potential advantages; in practice they are difficult to achieve due to the four problems introduced by distributed state: consistency, crash sensitivity, time and space overheads, and complexity.

The first problem is consistency: if the same piece of information is stored at several places and one of the copies changes, what happens to the other copies? There are three approaches to cope with this, detect stale data on use (lazy), prevent stale data (pessimistic --this is costly), and tolerate inconsistency (optimistic --this can introduce complexity). The second problem is crash sensitivity: unless you have full state replication, failure of one machine/state renders the rest of the distributed state useless as well (since that state has not been replicated). The third problem is time overhead, which is incurred mainly for maintaining consistency in preventing stale data. (Let's ignore the space overheads, we have space). The last problem is complexity, as getting a distributed protocol right is hard.

Case Study 1: NFS file system
The NFS design was optimized for simplicity and robustness, with performance a secondary goal. Simplicity and robustness were achieved by using a stateless protocol with idempotent operations. In NFS, servers do not store any information about their clients; they do not keep track of which files are currently in use or which clients are using which files. Distributed state is kept almost exclusively on the clients.

The advantage of this design choice is that NFS handles faults easily. As NFS is "stateless" (at the servers, that is) to begin with, no state is corrupted, so no special code is needed for crash recovery.

The disadvantage of this design is that the performance suffers greatly. Whenever a client issues a write request, the server must guarantee that all modified data is safely on disk before the write returns. And this kills performance causing low write-throughput. Even worse for the performance, NFS uses a "write-through-on-close" policy to reduce windows of inconsistency. Whenever a file is closed on a client machine, the client immediately transmits modified data for the file back to the server. Unfortunately, the statelessness of NFS requires that new data be returned immediately to the server to reduce consistency problems, and the only way to return data to the server is with the write operation, which forces data to disk. Still, consistency problems can also arise due to client caches of file at different places. Thus, the clients use polls to see if they should invalidate cache (which in turn reduces the performance). Even then, a window of inconsistency exists for NFS due to client timings of events.

Case Study 2: Sprite file system
Sprite's design is optimized for performance. Sprite chooses to be "stateful":
  1. Servers keep information in their main memories about which workstations are reading or writing which files. This requires clients to notify servers whenever files are opened or closed, but allows the servers to enforce consistency as described below.
  2. Servers retain modified file blocks in their main memories, and do not write that information back to disk until it has aged for thirty seconds.
  3. Clients also retain modified file blocks in their main memories; they do not pass new information back to servers until it has aged for thirty seconds or until the information is needed by some other client. If a client has dirty blocks for a file, the server's state information reflects this.

The advantage of this design choice is performance: Since the servers keep track of which clients are using which files, clients need not return modified file data to servers immediately. Another major advantage of statefulness is consistency: If a file is ever open simultaneously on several clients and at least one of them is writing the file, then the server notifies each of the clients and insists that they not cache the file; all read and write operations must be passed through to the server, where they are applied to a single copy of the file in the server's cache. Thus Sprite provides perfect file consistency: each read operation is guaranteed to return the most recently written data for that file, regardless of where and when the file is read and written.

The disadvantage of the stateful design choice is that the system is more complex; implementing Sprite was difficult due to subtle race conditions. This complexity also spawns more difficult crash recovery problems; server crash leaves clients in an inconsistent state with server, unable to move forward.

To handle the recovery problems, the designers of the Sprite system discovered that distributed state was not just the cause of the recovery problem, but also the solution. By adding slightly to the state kept by clients about their files, they made it possible for servers to recreate all their usage information after rebooting. A new operation was added to the Sprite protocol: reopen. When a server reboots, each client reopens all of its files by passing the server a copy of its state information (including information such as which files are open for reading or writing, which are locked, etc.). This allows the server to reconstruct its state so that it can continue to guarantee consistent access to files, and so that locks are not broken when servers reboot. A side-effect of this reopen fix took its toll on the performance though, the reopen approach leads to a recovery storm and which paralyze the server.

Ousterhout takes a strong stance in the conclusion. I think he provides critical and very good advice, so I quote it verbatim here.
Based on my experience with the NFS and Sprite file systems, I do not believe that the stateless model can meet the needs of high-performance workstations of the future. A stateless approach will limit the performance of the system to the performance of disks; unfortunately, disk performance is not improving at anywhere near the rate of processor performance. Non-volatile memory offers some hope for performance improvement, but I think the best solution is a change to more stateful protocols.

On the other hand, distributed state almost always introduces complexity and fragility, so system designers should attempt to reduce distributed state as much as possible. The less state, the better. In Sprite, I suspect that we may have been a little too eager to embrace state, and that a careful redesign of the system could reduce the amount of state we have to maintain.

Finally, the best approach to dealing with failures is to merge recovery with normal operation so that there is nothing special to do during recovery. NFS achieves this quite nicely through its combination of statelessness and idempotency. Recovery happens so infrequently that is is very difficult to debug special-case recovery code: it is hard to invoke the code under test conditions, and the code is hardly ever executed under real-life conditions. This means that there is a good chance that the code will not work when it is needed. On the other hand, if the recovery code and regular-case code are the same, the recovery code will be exercised constantly during everyday operation of the system so it is likely to work correctly when needed.

By the way, this approach of "dealing with failures by merging recovery with normal operation" is what the self-stabilization prescribes. So this last paragraph is a major motivation for self-stabilizing systems. I intend to write about self-stabilization in a future post.

Two-phase commit and beyond

In this post, we model and explore the two-phase commit protocol using TLA+. The two-phase commit protocol is practical and is used in man...