## Thursday, December 23, 2010

### Case for RAMClouds: Scalable High-Performance Storage Entirely in DRAM

I wrote about Ousterhout's "The Role of Distributed State" work before. This review is for his recent work on RAMClouds, which appeared in SIGOPS Operating Systems Review.

This paper makes a case for keeping all the data in the RAM over distributed nodes in a datacenter. "A RAMCloud is not a cache like memcached and data is not stored on an I/O device, as with flash memory: DRAM is the permanent home for data." Obviously, storing everything in RAM would yield a very high-throughput (the paper mentions 100-1000x) and very low-latency (again the paper mentions 100-1000x) system compared to disk-based systems. However, the primary reason the authors are excited about RAMCloud is the following: "RAMCloud will simplify the development of large-scale Web applications by eliminating many of the scalability issues that sap developer productivity today." The motivation for RAMCloud is to provide a general-purpose storage system that scales far beyond existing systems, so for achieving scalability application developers do not have to resort to ad hoc techniques, such as Dynamo, PNUTS, Bigtable, that give up some of the benefits of traditional RDBMSes.

This quest for the holy grail is itself a point of contention. One of the critiques of the RAMCloud proposal is Jeff Darcy in these two posts. Jeff's main point is "You cannot shoehorn everything in one system and in RDBMS. Applications need many different kinds of storage."

I will now go into more technical details about feasibility of RAMClouds.

Latency and bandwidth trends
The paper gives these striking information on disk trends: "Disk capacity has increased more than 10000-fold over the last 25 years and seems likely to continue increasing in the future. Unfortunately, though, the access rate to information on disk has improved much more slowly: the transfer rate for large blocks has improved only 50-fold, and seek time and rotational latency have only improved by a factor of two."

Let's look at the absolute numbers for latencies. Modern hard drives have latencies under 10 milliseconds; in contrast RAM latency is in the 10 nanosecond range. So, RAM is 100,000 times faster. However, it is possible to use many disks in parallel to overlap these latencies and reduce these latencies. Caching also reduces the average latencies from disk significantly. Another way to defend the disk-based systems is to consider the question of whether very-low-latency really matters for cloud applications. The argument in the paper is: if you build a low-latency system, the applications will come. "With sufficiently low-latency none of the specialized approaches for scalability are needed. RAMClouds offer the hope of a new one-size-fits-all where performance is independent of data placement and a rich variety of queries becomes efficient." I was hoping to hear a more convincing argument for low-latency applications here instead.

Cost and size trends
The latency and bandwidth trends mentioned in the paper favored RAMs more than disks. However, there are other striking trends that the paper ignores to mention. For example, the trends for the storage cost. In the last 30 years, the initial $193,000 per gigabyte cost of storing at the disk dropped to only 7 cents; this represents a cost decrease of nearly three-million percent. While the disk has$0.07 cost/gb, the RAM has $60 cost/gb according to the paper (even while excluding electric usage). In order to argue against disk-based approaches that use generous caching to improve the performance, the paper mentions that caching is not effective for long-tail access patterns such as in Facebook. The solution the paper offers for this long-tail access problem is to keep all the data in the RAMCloud. However, this proposal ignores the size trends in data completely. Those large social network services have several 100 TBs of just text and log data, and every year the size of the data is expected to increase several folds. Again, to make a case for the RAMClouds the paper cites a figure from another paper, and mentions that the dividing lines in Figure 2 are all shifting upwards with time, which will increase the coverage of RAMClouds in the future. The upwards shift is because "the boundary moves upwards as the cost/bit of DRAM improves; it moves to the right as the cost/query/sec. of flash improves. For all three storage technologies cost/bit is improving much more rapidly than cost/query/sec, so all of the boundaries are moving upwards." But this analysis ignores the trend for more data-storing/consuming needs of applications. With time, the storage needs of the applications will also grow rapidly, which may nullify the benefits from the aforementioned upwards shift in the RAMCloud boundary. Challenges that need to be overcome for implementing RAMCloud: The paper mentions that there are several research challenges for implementing RAMCloud efficiently and effectively. I will summarize two of them below. Low-latency RPC is needed Network switching at several layers adds delays. Network delays should be reduced by tuning TCP. Since there isn't much locality to data center traffic, the bisection bandwidth may need to be increased in the upper levels of datacenter networks as well. A second problem is the OS level delays. A general purpose OS introduces high overheads for interrupt processing, network protocol stacks, and context switching. "In RAMCloud servers it may make sense to use a special-purpose software architecture where one core is dedicated to polling the network interface(s) in order to eliminate interrupts and context switches." Durability & availability is needed Since RAM does not offer permanent storage, data will be lost if the node crashes. To provide durability and availability, the paper suggests to use two other RAM replicas for the same data. (This would triple the cost of RAMClouds.) "After a crash, one backup server for each shard reads its (smaller) log in parallel; each backup server acts as a temporary primary for its shard until a full copy of the lost server's DRAM can be reconstructed elsewhere in the RAMCloud. With this approach it should be possible to resume operation (using the backup servers) within a few seconds of a crash." RAMCloud's applicability today The following is the strongest claim in the paper for the applicability of RAMClouds: "It is probably not yet practical to use RAMClouds for large-scale storage of media such as videos, photos, and songs (and these objects make better use of disks because of their large size). However, RAMClouds are practical for almost all other online data today, and future improvements in DRAM technology will probably make RAMClouds attractive for media within a few years." Based on my review above, however, I find this claim a bit over-optimistic. I think cost trends and size trends have not been taken into account appropriately for the analysis in the paper. Also, there are several research challenges to be addressed before we can reap the benefits of the latency and bandwidth trends. So I contend that RAMCloud is not cost-effective now, and it may not be cost-effective for sometime soon. I agree with this quotable statement from Jeff Darcy's post. "Nine out of ten people who think they have a truly RAM-cloud-appropriate access pattern should be spending their money not on extra RAM but on smarter programmers." Post-confession: Actually, I didn't mean to take a negative stand against RAMCloud when I started writing this review. I guess the reason this review went this way may be due to my computer scientist debugging instincts that led me to try and poke holes against "the case for RAMCloud". So, let me end by pointing to a saner perspective. Todd Hoff asks this question of "Are Cloud Based Memory Architectures The Next Big Thing?" in his blog and answers it a lot better than I can. More links on the RAMCloud paper are listed in this post. And here is James Hamilton's summary of the RAMCloud talk. ## Thursday, December 16, 2010 ### Finding a Needle in Haystack: Facebook's Photo Storage This paper appeared in OSDI'10. The title "Finding a needle in Haystack" is a bit over-dramatization :-) Finding a needle in Haystack becomes straightforward if you can memorize the location of each needle in the haystack. And that is exactly what Facebook Haystack system does. Haystack is an object store for sharing photos on Facebook where data is written once, read often, never modified, and rarely deleted. Haystack storage system was designed because traditional filesystems perform poorly under the Facebook workload. While using network attached storage (NAS) appliance mounted over NFS, several disk operations were necessary to read a single photo: one (or typically more) to translate the filename to an inode number, another to read the inode from disk, and a final one to read the file itself. While insignificant on a small scale, multiplied over billions of photos and petabytes of data, accessing metadata becomes the throughput bottleneck. Haystack aims to achieve the following 4 goals. High throughput and low latency is achieved by ideas stemming from Log-structured Filesystem work (keeping all metadata in main memory, and minimizing the disk accesses per read and write). Fault-tolerance is achieved by replicating each photo in geographically distinct locations. Cost-effectiveness is achieved by custom designing the filesystem for the photo storing application and trimming all the fat, which results in each usable terabyte costing ~28% less in the new system. Finally, simplicity is achieved by restricting the design to well-known straightforward techniques. The simplicity of the presentation and design of the system is refreshing. In fact the design is so simple that one may argue there is not much novel ideas here. Yet, the paper is still very interesting because it is from Facebook, a giant. Facebook currently stores over 260 billion images, which translates to over 20 petabytes of data. Users upload one billion new photos (~60 terabytes) each week and Facebook serves over one million images per second at peak. Previous NFS-based design Figure 2 gives an overview of the photo access process of the previous NFS-based design used at Facebook. The user's browser first sends an HTTP request to the web server. For each photo the web server constructs a URL directing the browser to a location from which to download the data. If the content delivery network (CDN) has the image cached then the CDN responds immediately with the photo. Otherwise the CDN contacts Facebook's photo store server using the URL. The photo store server extracts from the URL the volume and full path to the file, reads the data over NFS, and returns the result to the CDN. The problem with this previous design is that at least 3 disk operations were needed to fetch an image: one to read directory metadata into memory, one to load the inode into memory, and one to read the file contents. Actually before optimizing the directory sizes, the directory blockmaps were too large to be cached by the NAS device and 10 disk operations might be needed to fetch the image. Unfortunately, caching at the CDN or caching filehandles returned by NAS devices did not provide much relief to Facebook due to the long tail problem. While caches can effectively serve the hottest photos --profile pictures and photos that have been recently uploaded--, a social networking site like Facebook also generates a large number of requests for less popular (often older) content. Requests from the long tail account for a significant amount of the traffic, and these requests all miss the cache and need to access the photo store servers. Haystack design To resolve the disk access bottleneck in the NFS-based approach, the Haystack system keeps all the metadata in main memory. In order to fit the metadata in main memory, Haystack dramatically reduces the memory used for filesystem metadata. Since storing a single photo per file results in more filesystem metadata than could be reasonably cached, Haystack takes a straightforward approach: it stores millions of photos in a single file and therefore maintains very large files. (See the second paragraph below for how this works.) Figure 3 shows the architecture of the Haystack system. It consists of 3 core components: the Haystack Store, Haystack Directory, and Haystack Cache. The Store's capacity is organized into physical volumes. Physical volumes are further grouped into logical volumes. When Haystack stores a photo on a logical volume, the photo is written to all corresponding physical volumes for redundancy. The Directory maintains the logical to physical mapping along with other application metadata, such as the logical volume where each photo resides and the logical volumes with free space. The Cache functions as an internal CDN, another level of caching to back up the CDN. When the browser requests a photo, the webserver uses the Directory to construct a URL for the photo, which includes the physical as well as logical volume information. Each Store machine manages multiple physical volumes. Each volume holds millions of photos. A physical volume is simply a very large file (100 GB) saved as "/hay/haystack-logical volumeID". Haystack is a log-structured append-only object store. A Store machine can access a photo quickly using only the id of the corresponding logical volume and the file offset at which the photo resides. This knowledge is the keystone of the Haystack design: retrieving the filename, offset, and size for a particular photo without needing disk operations. A Store machine keeps open file descriptors for each physical volume that it manages and also an in-memory mapping of photo ids to the filesystem metadata (i.e., file, offset and size in bytes) critical for retrieving that photo. Each photo stored in the file is called a needle. A Store machine is a 2U node with 12 1TB SATA drives in RAID-6 (as a single volume). A single 10TB XFS filesystem (an extent based file system) runs across these on each Store machine. XFS has two main advantages for Haystack. First, the blockmaps for several contiguous large files can be small enough to be stored in main memory. Second, XFS provides efficient file preallocation, mitigating fragmentation and reining in how large block maps can grow. My remaining questions Although the paper is written clearly, there are still a couple of points I couldn't understand. My first question is why was Google Filesystem (GFS) not suitable for implementing the Photo Store Server? That is what are the advantages of Haystack over GFS for this application? The paper just mentions that serving photo requests in the long tail represents a problem for which Hadoop is not well suited. But, what does this argument (which I don't understand) have to do with the unsuitability of GFS. I guess the answer is because Facebook needed to keep all metadata in the memory, and there was no immediate implementation of this provided by GFS master chunkserver. I guess if the GFS master chunkserver is modified to hold the metadata in the memory, then Facebook could as well use GFS for solving their problem. As another disadvantage of GFS, the paper cites that GFS is optimized for a workload consisting mostly of append operations (which is actually what Haystack does as well) and large sequential reads (I am not really sure why GFS cannot handle small random reads easily as well). So I don't see why GFS couldn't be adopted for use here. My second question is about the details of replica maintenance at the Store. In the paper, replica maintenance is only outlined briefly as follows. A photo is replicated to each of the physical volumes mapped to the assigned logical volume. Haystack Directory provides this mapping. It seems like the replication logic is CP according the CAP theorem/model, and sacrifices availability when one of the nodes to host a physical volume to be used is unavailable. But, that is only basic operation, and there are several other side cases, such as failure of one volume, compaction of one volume, deletion of some photos. It is unclear how the replicas are maintained to the face of these, and how the mappings at the Haystack Directory component are updated to account for these. That should require coordination between the Store and the Directory. ## Wednesday, December 15, 2010 ### Boxwood: Abstractions as the foundation for storage infrastructure This paper is by Microsoft Research, and appeared in OSDI'04. This review will mostly be a stream of conciousness, because I have not yet understood all of the paper and cannot put it in context as much as I would like to. While reading in to the Boxwood paper, I started to notice how similar this is getting to the GFS problem and GFS approach. Boxwood appeared at OSDI'04, and GFS appeared at SOSP'03. Boxwood refers to GFS but does not compare or contrast itself with GFS. Maybe the reason is in 2004 the Boxwood authors could not see the similarities. This could be because, as I mentioned in my GFS review, the GFS paper did not talk about the Paxos replication of the master chunk-manager in the 2003 paper; that came a couple years later in the Chubby and Paxos-made-live papers. When citing GFS, the Boxwood authors only state that GFS "will be layered over the facilities of Boxwood". But, that is impractical as it would be duplicating a lot of the services; GFS also has chunk manager, failure detector, lock service, replication, Paxos etc. Let me give a brief overview of Boxwood design, so that I can compare Boxwood with GFS further. Boxwood is for LANs. It assumes Gigabit Ethernet, and uses synchronous replication of data on two discs (I guess the master and shadow chunk managers). The Boxwood design section in the paper starts with a description of the Paxos service. Then, it continues with the failure detector service, which is not really very interesting since it is a well-studied and mature topic. Then it describes the distributed lock service. Curiously this service is implemented as a master shadow replication, and Paxos is employed only to keep the id of the master. This is curious because, when implementing the same service, the GFS-Chubby approach was to replicate the master via Paxos to four other replicas, which yields a much simpler design and a robust (masking fault-tolerance to two node failures) system. Then comes the RLDev (replicated logical device) component. The paper writes: "The list of RLDevs, the segments belonging to them, the identity of machines that host the primary and the secondary segments, and the disks are all part of the global state maintained in Paxos." This state amounts to pretty much what the master chunk-manager should hold. Then the question is again, why not maintain this state simply by replicating the master chunk-manager via Paxos as GFS did. Next comes the chunk manager. As Figure 3 shows, the chunk manager is replicated with a shadow node. "The chunk manager pair relies on a shared RLDev and RPCs to keep the mapping information consistent." Again, why not do this simply via Paxos replication of the chunk manager as GFS did. What is (if any) the advantage to this approach? Next comes the transaction and logging service but it is extremely scarce in details. The authors implemented Boxwood, as well as BoxFS, a filesystem using Boxwood B-trees, exported using the NFS v2 protocol. Both are implemented as user level processes. Evaluation results for BoxFS are given, but I am not sure how to interpret those results as BoxFS makes simplifying assumptions (such as 30 second data cache flush) over NFS to achieve acceptable performance. So, wrapping up, I guess there may be several things I don't understand in this paper. The presentation is unfortunately not clear and simple for me. Instead of telling us what its most significant contribution/lesson is, the paper tells us about everything in the system. The introduction emphasizes one thing (distributed data structure abstractions provided), yet the internal sections of the paper emphasize another (Paxos and lock service, and the chunk manager on top of them). Maybe I should use this paper as a discussion paper alongside GFS in my seminar in Spring'11; as a group we can have a better understanding and comparison of both systems. ## Sunday, December 12, 2010 ### Globecom, WSN forum, Urban-scale sensing talk by Ed Knightly (Rice U) Last week, I attended Globecom'10. Ed Knightly from Rice talked about urban-scale sensing under 3 parts: vehicular sensing, health sensing, and smart grid. Ed spent most of his talk on the vehicular sensing part. A recent US deparment of transportation vehicle safety commission project asked this question: vehicles have dozens of sensors already, what if this information was shared, what can we achieve? Some low hanging fruits are: traffic signal warning, curve speed warning, left turn assistant, stop sign movement assistant, lane change warning, collision warning, and finally, internet access applications. The candidate technology that is proposed for making this networking feasible is a wireless technology, of course. But not the wifi technology which is probably many people's first guess; It is visible light communication (VLC) technology. A VLC transmitter is a LED, which can as well be the LED headlight and taillight in most of the recent models. The only thing needed is to modulate the signals at high frequencies at these LED sources in the visible light spectrum. (VLC is invisible to the naked eye, because it is just modulated too frequently to register at our brains.) VLC receiver is a photodiode, which can again be installed next to the lights in the front, back, and the sides. More good things going for VLC (aka free space optical communication) are as follows. VLC is green. You get good bandwidth with little energy expenditure. VLC is directional, good for vehicular applications. Sun interference is not a problem, unless there is a direct sunset interference, ambient sun or shadow environment does not interfere with VLC. 400Mbits/sec is the current physical level record for VLC. Ed noted that Boing is trying to push this VLC technology in planes, as it is untintrusive to other equipments on the plane. The second topic Ed talked about is the health applications of sensing. Ed said there is an app for a lot of health issues, but these apps are not making much use of sensors. (The exception is nike+ipod, which uses a pedometer.) If we can combine those apps with sensing and close the loop, we can diagnose a lot of problems early on. These regular monitorings can enable preemptive medicine and cut down ER costs. Ed gave the example of the bluebox. Cardiovascular diseases are responsible for 40% of deaths. Bluebox is a$10 device, which is basically a crappy ekg. It is noisy, but using it regularly catches cardiovascular problems a lot before they become dangerous. So, crappy it is, its regular use saves lives. Can we make this sensor available on your smartphone? Another example is the laser breathalyzer for diabetes which helps determine if an asthma attack is imminent. Using this sensor, asthma attacks can be detected much before they occur.

Ed's final topic was smart grid. Smart electrical-grid consists of network-controlled power terminals, as well as thousands of solar panels attached to the grid. There is recent interest in this domain due to green energy sources. Recently, like the Google-meter several sensing solutions are emerging to provide fine-grain (device-level, minute-granularity) accounting of power usage at living spaces. The goal of the smart grid is to be robust, self-healing against attacks and natural disasters.

### Globecom, wireless networking forum, talk on smartphones by Roy Want

Last week, I attended Globecom'10 .

Roy Want (Intel) gave a talk on smartphones in Globecom. He started by showing the market trends for cellphones smartphones and laptops. Cellphones and smartphones grow so quickly that they dwarf the laptop market (which is growing with a healthy 20%). Roy, then, asked the following question: "Will one day will we be computing on the smartphones?" He said that in order for that to happen we need to overcome the poor UI experience of smartphones.

As part of these introductory slides, Roy showed a picture of the Intel atom processor for smartphones. It is smaller than rice grain yet is the brain of smartphone and x86 compatible, so these chips can bring After the presentation, over dinner, I asked Roy about why not put a dozen of these atom processors in one smartphone, given that they take virtually no space. Turns out this is currently not very feasible, because these processors are pretty battery-hungry, even though they are very tiny.

Roy's talk then focused on three things: context aware operation, resource sharing, and processor acceleration.

For context aware operation, the goal is to prioritize which sensors get focus.
iPhone made sensing mainstream, there are a dozen sensors in the iphone: accelerometer, light, touch screen, gps, mic, proximity, magnetometer, etc. But they cannot be always on. Roy mentioned the need for a triage processor to prioritize the sensors, processors, radio on the smartphone.

For resource sharing, the goal is to make maximum use of the computation device rich environment. Can the smartphone seamlessly share a nearby display, network, storage, or other peripherals. He gave the example of editing a movie at home. He calls this dynamic composable computing. At this point, he demoed a VNC running over the smartphone. The smartphone was running Linux, and from the laptop he sshed to the smartphone and started a remote X-window session which is hosted at the smartphone but displayed at the laptop.

For the processor acceleration, Roy asked the question of whether it is possible to share the processor by migrating an application from one device to the other. He said that generally the cleanest way of doing this is to migrate the entire OS, virtual machine. And for this he mentioned Satya's work at CMU on incremental VM synchronization. By doing VM synchronization in an incremental manner only for the parts that got changed, it is possible to improve the performance and reduce the latency of this synchronization to work in real-time. The example he gave was that your phone shares your home computer does something, and your work computer also gets synchronized with your work computer so you can continue at your work computer when you arrive there. (Turns out this is now common-place. Xen provides this. Amazon makes use of this for replicating your machine with them and checkpoint your service.)

My colleague Chunming Qiao asked Roy a question on whether he is aware of the single system VM migration made possible by Kerrighed. While traditional VM approaches are fitting multiple VMs in one physical machine, Kerrighed tries to run one VM over multiple physical machines. This is beneficial for providing more CPU, RAM, storage resources to your VM. Your VM can grow on the fly even beyond the limitations of the physical host. This will obviously be beneficial for your smartphone, as it can now use the CPU, RAM, storage resources from the other computers.

This talk was very interesting to me because we are trying to build a 1000 smartphone (cloud-enabled) testbed at UB. More on this later, once we make more progress on this.

### Onix: A Distributed Control Platform for Large-scale Production Networks

The Onix work (OSDI'10) builds on Nox. Essentially, Onix takes Nox and distributes over multiple servers.

Let me start with a brief refresher on Nox. (Or you can read my previous post on Nox) The main idea in Nox (and openflow) was to facilitate innovation by separating the control plane from the forwarding (data) plane. (In the current networking architecture, control and data planes are both implemented in the same place, the routers.) Nox introduced "software defined networking" (SDN): Nox uses a centralized controller to make the decisions (i.e., control plane); The routers implement only the data plane, and just follow directions from the controller while forwarding data. A drawback with Nox was that since it uses a single controller, it is prone to a single point of failure. Although the Nox work pointed out how this single controller can be distributed, it didn't pursue it further.

Enter Onix. Onix investigates how to distribute that single controller. In Onix the controller consists of a network information base (nib), and two other components: switch import/export and distribution import/export. All active network elements are stored in nib in key value pairs. Nib is a decentralized component, distributed over several Onix nodes. Once you change your local nib instance on one Onix node, those modifications are propogated to other nibs. The switch import/export component talks to switches to configure them according the instructions from the Onix node (it sort of acts as an interpreter). The distribution import/export component makes multiple nibs consistent with each other in an asynchronous manner.

Developers only see nib, exclusive locks on nib can only be attained on local instances. All the nib operations are asynchronous. No distributed locking mechanism is provided in the API. The limitation of this API is that it relies on application-specific logic to detect and provide conflict resolution of the network state.

For scalability, Onix let's you have hierarchies. Onix node C can coordinate Onix node A and B; In this case C's nib has two elements A and B. It is possible that A and B may be sharing some switches they are responsible with (those switches appears in nibs of both A and B).

As for reliability, Onix can detect link failures, then using the user written code, Onix fixes the link failures accordingly. Onix provides two data stores: a transactional data store (for durability of the local storage) and a one-hop DHT simple consistent hashing (for holding volatile network state in a fast manner).

The paper discusses 4 applications being built with Onix: a network management application, distributed virtual switch application, multi-tenant virtualized data centers (VPN implementation), and scale-out carrier-grade IP router. The paper also includes evaluation experiments.

The Onix paper ends on a cautionary note: "What we should make clear, however, is that Onix does not, by itself, solve all the problems of network management. The designers of management applications still have to understand the scalability implications of their design. Onix provides general tools for managing state, but it does not magically make problems of scale and consistency disappear. We are still learning how to build control logic on the Onix API, but in the examples we have encountered so far management applications are far easier to build with Onix than without it."

There are a lot of smart people pushing for creating a very programmable network. Hellerstein's work on declarative network programming is a nice complement/addition to Nox, Onix. I certainly appreciate that Nox and openflow makes evaluation and development of new protocols possible in the network. That is certainly very useful for network researchers. Even at the enterprise level, some programmability over the network would be welcomed, I guess. (Right now, the only control we have over networks is via the BGP level rules, and there is no control at the LAN level.) But, the problem is, it is very hard to find/hire people with expertise to program the enterprise network in the very fine grain programming environment these groups are providing. Maybe it is better to provide a network that just works as plug and play (like in the VL2 approach), rather than trying to provide complete control over every aspect of the network. It is unclear which approach is better.