Thursday, December 23, 2010

Case for RAMClouds: Scalable High-Performance Storage Entirely in DRAM

I wrote about Ousterhout's "The Role of Distributed State" work before. This review is for his recent work on RAMClouds, which appeared in SIGOPS Operating Systems Review.

This paper makes a case for keeping all data in RAM, spread across the nodes of a datacenter. "A RAMCloud is not a cache like memcached and data is not stored on an I/O device, as with flash memory: DRAM is the permanent home for data." Obviously, storing everything in RAM would yield a very high-throughput (the paper mentions 100-1000x) and very low-latency (again the paper mentions 100-1000x) system compared to disk-based systems. However, the primary reason the authors are excited about RAMCloud is the following: "RAMCloud will simplify the development of large-scale Web applications by eliminating many of the scalability issues that sap developer productivity today." The motivation for RAMCloud is to provide a general-purpose storage system that scales far beyond existing systems, so that application developers do not have to resort to ad hoc techniques (such as Dynamo, PNUTS, and Bigtable) that achieve scalability by giving up some of the benefits of traditional RDBMSes.

This quest for the holy grail is itself a point of contention. One of the critics of the RAMCloud proposal is Jeff Darcy, in these two posts. Jeff's main point is "You cannot shoehorn everything in one system and in RDBMS. Applications need many different kinds of storage."


I will now go into more technical details about feasibility of RAMClouds.

Latency and bandwidth trends
The paper gives this striking information on disk trends: "Disk capacity has increased more than 10000-fold over the last 25 years and seems likely to continue increasing in the future. Unfortunately, though, the access rate to information on disk has improved much more slowly: the transfer rate for large blocks has improved only 50-fold, and seek time and rotational latency have only improved by a factor of two."

Let's look at the absolute numbers for latencies. Modern hard drives have latencies under 10 milliseconds; in contrast RAM latency is in the 10 nanosecond range. So, RAM is about 1,000,000 times faster. However, it is possible to use many disks in parallel to overlap these latencies and so reduce their effect. Caching also reduces the average latencies from disk significantly. Another way to defend the disk-based systems is to consider the question of whether very low latency really matters for cloud applications. The argument in the paper is: if you build a low-latency system, the applications will come. "With sufficiently low-latency none of the specialized approaches for scalability are needed. RAMClouds offer the hope of a new one-size-fits-all where performance is independent of data placement and a rich variety of queries becomes efficient." I was hoping to hear a more convincing argument for low-latency applications here instead.
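As a quick sanity check on the latency arithmetic, here is a small sketch that uses only the round figures quoted above (10 ms per disk access, 10 ns per RAM access). The disk count of 100 is my own illustrative assumption, and the model ignores caching and queueing entirely.

```python
# Back-of-the-envelope latency comparison using the round figures from the text.
# Integer nanoseconds keep the arithmetic exact.
DISK_LATENCY_NS = 10_000_000  # ~10 ms per random disk access
RAM_LATENCY_NS = 10           # ~10 ns per RAM access
NS_PER_SECOND = 1_000_000_000

def latency_ratio(slow_ns, fast_ns):
    """How many times faster the fast device is per access."""
    return slow_ns // fast_ns

def accesses_per_second(latency_ns, devices=1):
    """Random accesses per second if `devices` work fully in parallel."""
    return devices * NS_PER_SECOND // latency_ns

ratio = latency_ratio(DISK_LATENCY_NS, RAM_LATENCY_NS)
one_disk = accesses_per_second(DISK_LATENCY_NS)
many_disks = accesses_per_second(DISK_LATENCY_NS, devices=100)  # hypothetical array
ram = accesses_per_second(RAM_LATENCY_NS)

print(f"RAM vs disk latency ratio: {ratio:,}x")
print(f"1 disk: {one_disk:,}/s, 100 disks: {many_disks:,}/s, RAM: {ram:,}/s")
```

Note what parallel disks do and do not buy you in this simple model: the aggregate access rate scales with the number of disks, but each individual request still waits the full 10 ms.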

Let's look at the numbers for bandwidth. Disks actually have pretty good throughput. The bandwidth of disk is about 100 MB/s. The bandwidth of RAM is 5 GB/s, about 50 times faster, but in practice there are many challenges that prevent seeing the full benefit of this RAM bandwidth. Network switches above the rack layer can easily become a bandwidth bottleneck in datacenters. (The paper acknowledges these challenges; see a couple of paragraphs below.) Also, special-purpose filesystems would need to be developed for RAM-disk bandwidth to reach this raw 5 GB/s number.

I recently had an interesting conversation with my colleague, Tevfik, on this issue. Tevfik said that for a master's project he and his student studied how much improvement RAM-disks can provide compared to reading from disk. They moved data from a RAM-disk to another half of the RAM (that is not part of the RAM-disk), and compared that with moving data from disk to RAM. Their finding was surprising: there wasn't much noticeable improvement from using RAM-disks.

There are probably two factors that contributed most to that result. The first is that operating systems are already very smart and perform a lot of buffering optimizations while reading from disk, so reading from disk is not that bad (except for pathological workloads that constantly issue small random reads). Remember also that the disk has pretty good bandwidth. The second is that a RAM-disk introduces latencies of its own: since the program now uses a filesystem to access RAM, each access is made via a system call rather than directly. Although there are some systems like tmpfs that reduce this overhead, those filesystems also restrict what you can do with your RAM-disk, such as how you can mount it to get hierarchical storage management (HSM).
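For readers who want to try this kind of comparison themselves, below is a rough sketch of a sequential-read throughput measurement. It is not the methodology used in the project described above (I do not have those details); the function names and the 8 MB sample size are my own assumptions. The sample file here is written to the default temp directory, so the read will very likely be served from the OS page cache, which is exactly the buffering effect mentioned above.

```python
import os
import tempfile
import time

def write_sample_file(size_bytes):
    """Create a temporary file filled with random bytes and return its path."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(os.urandom(size_bytes))
    return path

def read_throughput_mb_s(path, block_size=1 << 20):
    """Read the file sequentially in blocks and return the observed MB/s."""
    total = 0
    start = time.perf_counter()
    # buffering=0 avoids Python-level buffering; the OS page cache still applies.
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            total += len(chunk)
    elapsed = time.perf_counter() - start
    return total / 1e6 / max(elapsed, 1e-9)

sample = write_sample_file(8 * 1024 * 1024)
try:
    print(f"{read_throughput_mb_s(sample):.0f} MB/s (likely from the page cache)")
finally:
    os.remove(sample)
```

To compare a disk against a RAM-backed filesystem, one would point `read_throughput_mb_s` at the same large file stored once on disk and once on a tmpfs mount (on many Linux systems `/dev/shm` is one), and take care that the disk copy is not already cached.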

Cost and size trends
The latency and bandwidth trends mentioned in the paper favor RAM over disk. However, there are other striking trends that the paper neglects to mention, for example the trend in storage cost. In the last 30 years, the cost of storing a gigabyte on disk dropped from an initial $193,000 to only 7 cents; this is a nearly three-million-fold decrease. While disk costs $0.07/GB, RAM costs $60/GB according to the paper (and that excludes electricity usage).
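The cost figures above are easy to check with a few lines of arithmetic. In the sketch below the per-gigabyte prices come from the text; the 100 TB dataset size is a hypothetical example of mine, and replication and power costs are ignored.

```python
# Cost arithmetic using the per-GB prices quoted in the text.
DISK_COST_THEN = 193_000.0  # $/GB, 30 years ago
DISK_COST_NOW = 0.07        # $/GB
RAM_COST_NOW = 60.0         # $/GB, figure attributed to the paper

fold_decrease = DISK_COST_THEN / DISK_COST_NOW
percent_decrease = (1 - DISK_COST_NOW / DISK_COST_THEN) * 100
ram_vs_disk = RAM_COST_NOW / DISK_COST_NOW

def storage_cost(gigabytes, cost_per_gb):
    """Raw media cost in dollars for the given capacity."""
    return gigabytes * cost_per_gb

DATASET_GB = 100_000  # hypothetical 100 TB dataset
ram_total = storage_cost(DATASET_GB, RAM_COST_NOW)
disk_total = storage_cost(DATASET_GB, DISK_COST_NOW)

print(f"Disk price fell about {fold_decrease:,.0f}-fold ({percent_decrease:.5f}% decrease)")
print(f"RAM is about {ram_vs_disk:,.0f}x more expensive per GB than disk")
print(f"100 TB in RAM: ${ram_total:,.0f}; on disk: ${disk_total:,.0f}")
```

A price can never fall by more than 100 percent, which is why the drop is better stated as a fold change: roughly 2.76 million-fold for disk over this period, with RAM currently more than 800 times the price of disk per gigabyte.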

To argue against disk-based approaches that use generous caching to improve performance, the paper mentions that caching is not effective for long-tail access patterns such as Facebook's. The solution the paper offers for this long-tail access problem is to keep all the data in the RAMCloud. However, this proposal completely ignores the size trends in data. Those large social network services have several hundred TBs of just text and log data, and every year the size of the data is expected to increase severalfold.

Again, to make a case for RAMClouds, the paper cites a figure from another paper, and mentions that the dividing lines in Figure 2 are all shifting upwards with time, which will increase the coverage of RAMClouds in the future. The upwards shift is because "the boundary moves upwards as the cost/bit of DRAM improves; it moves to the right as the cost/query/sec. of flash improves. For all three storage technologies cost/bit is improving much more rapidly than cost/query/sec, so all of the boundaries are moving upwards." But this analysis ignores the trend toward growing data storage and consumption needs of applications. With time, the storage needs of applications will also grow rapidly, which may nullify the benefits of the aforementioned upwards shift in the RAMCloud boundary.
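To make this concern concrete, here is a toy projection of the yearly cost of holding a growing dataset entirely in DRAM. Every rate in it is a made-up assumption for illustration (data doubling yearly, DRAM price per GB falling 30% yearly), not a figure from the paper; only the $60/GB starting price comes from the text.

```python
def ramcloud_cost_over_time(initial_gb, cost_per_gb, data_growth, cost_decline, years):
    """Total DRAM cost per year when data grows by `data_growth`x per year
    and the price per GB falls by the fraction `cost_decline` per year."""
    costs = []
    for year in range(years + 1):
        size_gb = initial_gb * data_growth ** year
        price = cost_per_gb * (1 - cost_decline) ** year
        costs.append(size_gb * price)
    return costs

# Hypothetical scenario: 100 TB today, data doubles yearly, DRAM gets 30% cheaper yearly.
growing = ramcloud_cost_over_time(100_000, 60.0, data_growth=2.0, cost_decline=0.3, years=5)
for year, cost in enumerate(growing):
    print(f"year {year}: ${cost:,.0f}")
```

In this model the total cost changes by a factor of `data_growth * (1 - cost_decline)` each year, so falling DRAM prices only help if data grows more slowly than prices fall. With the assumed numbers the factor is 1.4, and the bill keeps rising.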

Challenges that need to be overcome for implementing RAMCloud:
The paper mentions that there are several research challenges for implementing RAMCloud efficiently and effectively. I will summarize two of them below.

Low-latency RPC is needed
Network switching at several layers adds delays. Network delays should be reduced by tuning TCP. Since there isn't much locality in datacenter traffic, the bisection bandwidth may also need to be increased in the upper levels of datacenter networks. A second problem is OS-level delays. A general-purpose OS introduces high overheads for interrupt processing, network protocol stacks, and context switching. "In RAMCloud servers it may make sense to use a special-purpose software architecture where one core is dedicated to polling the network interface(s) in order to eliminate interrupts and context switches."

Durability & availability is needed
Since RAM does not offer permanent storage, data will be lost if the node crashes. To provide durability and availability, the paper suggests using two other RAM replicas for the same data. (This would triple the cost of RAMClouds.) "After a crash, one backup server for each shard reads its (smaller) log in parallel; each backup server acts as a temporary primary for its shard until a full copy of the lost server's DRAM can be reconstructed elsewhere in the RAMCloud. With this approach it should be possible to resume operation (using the backup servers) within a few seconds of a crash."
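The recovery idea in that quote can be illustrated with a toy, single-process model: a primary keeps the data in memory, every write is also appended to a log on each backup, and after a crash a backup rebuilds the state by replaying its log. This is only my sketch of the general technique; the class names are hypothetical, and it leaves out sharding, networking, and everything else a real RAMCloud implementation would need.

```python
class Backup:
    """Holds an append-only log of the writes seen by the primary."""
    def __init__(self):
        self.log = []

    def append(self, key, value):
        self.log.append((key, value))

    def replay(self):
        """Rebuild the key-value state by applying the log in order."""
        state = {}
        for key, value in self.log:
            state[key] = value  # later entries overwrite earlier ones
        return state

class Primary:
    """Serves reads from memory; forwards every write to all backups first."""
    def __init__(self, backups):
        self.data = {}
        self.backups = backups

    def write(self, key, value):
        for backup in self.backups:
            backup.append(key, value)
        self.data[key] = value

    def read(self, key):
        return self.data[key]

backups = [Backup(), Backup()]
primary = Primary(backups)
primary.write("user:1", "alice")
primary.write("user:1", "alicia")  # overwrite: the log keeps both entries
primary.write("user:2", "bob")

primary = None                     # simulate losing the primary's memory
recovered = backups[0].replay()    # a backup acts as the temporary primary
print(recovered)
```

The log is longer than the live data whenever keys are overwritten, so a real system would also need some form of log cleaning; the sketch ignores that.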

RAMCloud's applicability today
The following is the strongest claim in the paper for the applicability of RAMClouds: "It is probably not yet practical to use RAMClouds for large-scale storage of media such as videos, photos, and songs (and these objects make better use of disks because of their large size). However, RAMClouds are practical for almost all other online data today, and future improvements in DRAM technology will probably make RAMClouds attractive for media within a few years."

Based on my review above, however, I find this claim a bit over-optimistic. I think cost trends and size trends have not been taken into account appropriately in the paper's analysis. Also, there are several research challenges to be addressed before we can reap the benefits of the latency and bandwidth trends. So I contend that RAMCloud is not cost-effective now, and it may not become cost-effective for some time. I agree with this quotable statement from Jeff Darcy's post: "Nine out of ten people who think they have a truly RAM-cloud-appropriate access pattern should be spending their money not on extra RAM but on smarter programmers."

Post-confession: Actually, I didn't mean to take a negative stand against RAMCloud when I started writing this review. I guess this review went this way because of my computer scientist debugging instincts, which led me to try and poke holes in "the case for RAMCloud". So, let me end by pointing to a saner perspective. Todd Hoff asks the question "Are Cloud Based Memory Architectures The Next Big Thing?" in his blog and answers it a lot better than I can. More links on the RAMCloud paper are listed in this post. And here is James Hamilton's summary of the RAMCloud talk.

5 comments:

urssur said...

Yeah , get RAMclouds out there :D.

You don't need to make a case for it dude, it's a friking awesome idea ... why have your app be slow because of cheap hardware ? BAM smash in 20gb of DDR3 and away it goes.

Morgan said...

There's a time mistake in your post:

"Let's look at the absolute numbers for latencies. Modern hard drives have latencies under 13 milliseconds; in contrast RAM latency is in the 5 nanosecond range. So, RAM is 2,000 times faster."

1ms = 1 million ns, so by these numbers RAM is 2,000,000 times faster.

I never remembered RAM having a 5 ns access time - it was always ~200ns, but things could have changed.

Murat said...

Thanks Morgan, fixed the numbers.

SlipperySlope said...

Terracotta makes the case for using a memory cache as the database for java application clusters.

Jim Lu said...

Couchbase is another new star to use memcached as its base plus some database features. If RAM-cloud can be implemented like Couchbase will be much better.