Thursday, June 6, 2013

Antifragility from an engineering perspective

I read Nassim Taleb's "Antifragility: Things That Gain from Disorder" book a while ago, and enjoyed it a lot. I have been thinking about antifragility in engineering systems, and thought it would be good to put what I have come up with so far in writing to be able to contribute to the discussion. There are some nice reviews of the book in various places. My intention here is not to review the book, but to try to look at antifragility from an engineering perspective. Unfortunately, this came out mostly as rambling, but here it is for what it is worth.

Engineered systems: 

Let's start with giving examples from the mechanical world. I will try to give examples for three increasingly-superior levels of reliability: robust-yet-fragile < resilient < antifragile.

Robust-yet-fragile. A good example here is the glass. Glass (think of automobile glasses or gorilla glass) is actually very tough/robust material. You can throw pebbles, and even bigger rocks at it, and it won't break or scratch, well, up to a point that is. The glass is very robust to the anticipated faults (stressor) up to a point. But, exceed that point, and then the glass is in shambles.  That shows an unanticipated stressor (a black swan event in Taleb's jargon) for the glass: a ninja stone. The ninja stone is basically a piece of ceramic that you take from the spark plug, and is denser than glass. So if you gently throw this very little piece of ceramic to your car window, it breaks in shambles. This is a well-known trick to break into cars.

This is called a robust-yet-fragile structure, and this is actually why we had the Titanic disaster. Titanic, the ship, had very robust panels, but again upto a point. When Titanic exceeded that point a little bit (with the iceberg hitting it), the panels broke into shambles, very much like the glass meeting ninja stone. Modern ships after Titanic, went for resilient, instead of robust (yet fragile) panels. The resilient panels bend easier, but they don't break as miserably. They still hold together to the face of an extreme stressor. Think of plastic; it is less robust but more resilient than glass.

The robust-yet-fragile effect is also known as highly optimized tolerance. If you optimize tolerance for one anticipated stressor, you become very vulnerable to another unanticipated fault. (Much like the closed Australian ecosystem.) There is some literature on robust yet fragile systems.

Resilient. Robust systems mask stressors (upto a point), and they optimize for certain anticipated stressors and fail miserably for unanticipated stressors. The resilient system approach is the opposite. It prescribes not to mask stressors; the stressors perturb the system somewhat, but you eventually recover from them. A good slogan for resilience is robust-yet-flexible. These systems stretch somewhat with the stressor, but not fail completely or discretely.

Engineers embrace resilience today. We can see this idea in construction, in bridges, skycrapers, which are built in a flexible way to sway somewhat with wind and earthquakes but not fail completely. In fact, by flexing/stretching a bit with stressors, these systems remain unharmed and last longer, as they tolerate the shocks by rolling with them instead of absorbing the shocks fully.

Today, it is more or less established that systems should aim to expose stressors/faults/problems in some suitable manner, rather than hide them. Even as a system masks problems, it should at least report/log them. Otherwise your system will die slowly and you won't know (such accounts are more common than you would imagine).

Antifragility in mechanical systems. Taleb emphasizes repeatedly that the antifragility idea is not robust and not just resilient. An antifragile system thrives under failures, not just tolerate it. Resilient is better than robust(-yet-fragile), and antifragile is better than resilience in this respect.

Taleb says in the book that one of the few examples of antifragile materials he knows is carbon nanotubes, which gets stronger when faced with a stressor. Here is another antifragile material, a non-newtonian fluid.

There are not many antifragility examples to mechanical systems. In fact, Taleb gives the washing machine versus the cat section in the book to illustrate this point. The cat (living organism) is antifragile. The cat becomes stronger with stressors: It is use it or lose it for the cat, as her muscles would atrophy in the absence of use. So living organisms benefit from stressors. But for the washing machine (mechanical system) it is use it and lose it, as there is wear and tear associated with usage. A mechanical system does not gain much from stressors. (Actually my old car sitting in the garage disagrees. When a car is not used, it starts developing some problems, and you can say that the car benefits to a degree from being used and stressors. This is similar to the state of a vacant house versus an occupied house.)

For engineering of mechanical systems, maybe antifragility is not that applicable. With the wisdom of hundreds of thousands of man-years in engineering (practical systems and technology), why aren't we already seeing examples of antifragility?

Antifragility is not well formalized in the book, especially from the engineering perspective. So, there is some gray area for which systems we can categorize as antifragile. Does antifragility involve effects of randomization, white-noise effects? If that is the case, you can say that engineered systems benefit from some randomness. It was found that Norbert Weiner's mechanical target-tracking calculators/computers performed better in the plane (with noise from vibration) than on the ground (with no noise). There is this thing called stochastic resonance, which has been harvested to achieve some gains in practice. These are sort of white noise asyncronized noise to the systems. On the other hand, it is also well-known that synchronized resonance (an army marching in unison) can bring down bridges. Maybe by harvesting resonance and learning to use it, antifragile mechanical systems can be engineered. Unfortunately I don't know much about this domain, and I am not sure if there are some good progress and results in this domain.

Antifragility in cyber world

Antifragility is actually all about environment actions and reaction by the system. An antifragile system reacts to environment actions to keep a utility function high, and sometimes achieve a very high utility function (this is when stressors help to improve the system).  This is called the barbell strategy in the book: Be safe with the utility function, and sometimes improve it drastically.

Then we can define antifragility as the ability to improve the utility function to the face of "bad" environment actions. But what is bad? One man's food can be another man's poison. Bad can be defined with respect to (comparative to) other systems: When other systems are badly affected by the environment actions, if you are gaining then you are antifragile. (This also implies that for zero-sum environments, you are being antifragile to the expense of the fragility of others.)

Probabilistic algorithms. The CS literature is full of examples of probabilistic algorithms that do much better than any deterministic algorithm. The impossibility results (such as FLP and attacking generals) are circumscribed by probabilistic algorithms. Some algorithms specifically benefit from the randomness: the random the process the better they fare. A nice book has been written by Rajiv Motwani and Prabhakar Raghawan on randomized algorithms.

Randomness is especially great for breaking ties, and used in networking and distributed systems literature for that. For example, in the ethernet and wifi protocols, random backoffs are used so that communication over a shared medium is possible at all. Do these examples count as antifragility?

This also reminds me of an example from the Sync book by Strogatz (which is an excellent book about research, read it if you haven't already). Strogatz has formulated a sync problem for runners on a running track. "Each runner has his own speed, which is analogous to the frequency of an oscillator, and all the runners shout at and are heard by every other runner, which is analogous to the coupling between the oscillators. Depending on the initial conditions and the setup of the coupling, a group of runners may synchronize into a single block all running at the same speed, fall into chaos everybody running on her own, or anything in between." It was found that when the runners (speeds and positions) are not very uniform but rather somewhat random, synchronization was possible and achieved faster. It was also found that there is a sudden phase transition non-sync and sync outcomes. I don't know what (if anything) this implies for antifragility.

BitTorrent example. For an antifragile system, the more you try to stress the system, the stronger it grows. A great example is bittorrent. In bittorrent, the more a file becomes traffic hotspot, the faster it gets to download it. Bittorrent streams gains from hotspots and contention by exploiting the Network effect to provide scaling. If a file is popular for downloads, then its parts are available from more peers, so it will be faster to download it from many peers available.

Fault-tolerant computing angle

Self-stabilization. In the fault-tolerant computing domain, self-stabilization comes to mind immediately as an example of resiliency. Self-stabilization, first proposed by Dijkstra in 1970s, calls for not categorizing faults but instead to design tolerance for anticipated faults. Treat all faults uniformly as a perturbation to the system, and design your system to be tolerant to perturbation. So regardless of faults or combinations of faults, your system is going to recover. This is the self-stabilization view. There has been a lot of work on self-stabilizing computer systems in the literature. The trivial kinds of stabilizing systems are soft-state systems, and restartable systems.

Self-stabilization does not fit the antifragility definition. Since the state corruption abstraction is too abstract and well-defined, the "corruption helps" becomes an oxymoron. Corruption is pre-defined to be the bad thing, so it is hard to play that game. Maybe you have zones of perturbation, and the further you are perturbed away, the faster you recover from it. But, if you have a fast recovery method why not use it for the other regions as well?

Self-adaptive systems, a more recent concept can fit the antifragility idea. But self-adaptive systems are not well defined/formalized, and I am not aware of any big success stories from that line of thinking yet.  And I guess the philosophical difference between self-adaptive versus antifragile systems is that, you can still have a predefined/constant antifragile system that is not self-adaptive. You can have an antifragile system that uses the barbell idea, and does not do bad in any input, but does great in some inputs. That system is not adaptive, but it is still antifragile.

Software rejuvenation. Software rejuvenation can be an example of antifragility in the fault-tolerant computing domain. The software rejuvenation idea is to reset the software occassionally to get rid of memory leaks, unoptimal or incorrect bugs. The antifragility angle is that, if the number of faults increase, you start a software rejuvenation. So the increased number of faults helps for faster recovery/rejuvenation of the software. Specifically the Microsoft bug reporting paper comes to my mind, which is a really interesting piece of work that appeared in SOSP 2009. The idea is that if a bug is more prevalent, it is reported more automatically and is fixed first.

Wednesday, June 5, 2013

RAMCloud reloaded: Log-structured Memory for DRAM-based Storage

I had written a review about "the case for RAMCloud" paper in 2010.  RAMCloud advocates storing all the data in the RAM over distributed nodes in a datacenter. A RAMCloud is not a cache like memcached and data is not stored on an I/O device; DRAM is the permanent home for data.  Obviously, storing everything in RAM could yield a very high-throughput (100-1000x) and very low-latency (100-1000x) system compared to disk-based systems.

In the last 3 years, the RAMCloud group headed by John Ousterhout at Stanford has done significant work on this project, and this is a good time to write another review on RAMCloud. My review mostly uses (shortened) text form their papers, and focuses on some of the significant ideas in their work.

State of the RAMCloud 

Data model. RAMCloud provides a simple key-value data model consisting of uninterpreted data blobs called objects. Objects are grouped into tables that may span one or more servers in the cluster; a subset of a table stored on a single server is called a tablet. Objects must be read or written in their entirety. RAMCloud is optimized for small objects --a few hundred bytes or less-- but supports objects up to 1 MB.

Each master's memory contains a hash table and a collection of objects stored in DRAM. The hash table contains one entry for each object stored on that master; it allows any object to be located quickly, given its table and key.

RAMCloud recovery. To ensure data durability against crashes and power failures, each master must keep backup copies of its objects on the secondary storage of other servers. The backup data is organized as a log for maximum efficiency. Each master has its own log, which is divided into 8 MB pieces called segments. Each segment is replicated on several backups (typically two or three).  A master uses a different set of backups to replicate each segment, so that its segment replicas end up scattered across the entire cluster. E.g., this segment replicated at nodes 11 and 44, and the next segment at nodes 26 and 37.

When a master receives a write request from a client, it adds the new object to its memory and forwards information about that object to the backups. The backups append the new object to segment replicas stored in nonvolatile buffers; they respond to the master as soon as the object has been copied into their buffer, without issuing an I/O to secondary storage. Once the master has received replies from all the backups, it responds to the client. Each backup accumulates data in its buffer until the segment is complete, at which point it writes the segment to secondary storage and reallocates the buffer for another segment.

RAMCloud recovery is then possible by reading from the segments and constructing the hashtables at the masters. The paper describing this recovery process appeared in SOSP 2011.

RAMCloud log cleaner. RAMCloud uses a log cleaner to reclaim free space that accumulates in the logs when objects are deleted or over-written. Each master runs a separate cleaner. The cleaner selects several segments to clean (based on the amount of free space & age of data). The cleaner then scans these segments stored in memory and copies any live objects to new survivor segments. Liveness is determined by checking for a reference to the object in the hash table. The cleaner makes the old segments' memory available for new segments, and notifies the backups for those segments that they can reclaim the storage for the replicas.

In this post, we will look at the paper for the RAMCloud log cleaner, which has been published as a technical report and is under submission at SOSP'13.

The problem with Memory allocation & Garbage collection

Memory allocators fall into two general classes: noncopying allocators and copying allocators.  Non-copying allocators such as malloc cannot move an object once it has been allocated, so they are vulnerable to fragmentation. Non-copying allocators work well for individual applications with a consistent distribution of object sizes, but they can easily waste half of memory when allocation patterns change.  Copying allocators are those that can move objects after they have been created. In principle, garbage collecting (GC) can solve the fragmentation problem by moving live data to coalesce free heap space.

GC comes with a trade-off: at some point all of these collectors (even those that label themselves as "incremental") must walk all live data, relocate it, and update references. This is an expensive operation that scales poorly, so GC delay global collections until a large amount of garbage has accumulated. As a result, they typically require 1.5-5x as much space as is actually used in order to maintain high performance. This erases any space savings gained by defragmenting memory.

Pause times are another concern with copying allocators with GC. At some point all collectors must halt the processes' threads to update references when objects are moved. Despite work on real-time garbage collectors, even state-of-art solutions have maximum pause times of hundreds of microseconds, or even milliseconds --this is 100 to 1,000 times longer than the round-trip time for a RAMCloud RPC.

Log Structured Memory (LSM)

Existing allocators are not able to use memory efficiently, particularly in the face of changing access patterns, so are not applicable for RAMCloud. An ideal memory allocator for a DRAM-based storage system such as RAMCloud should have two properties: 1) It must be able to copy objects in order to eliminate fragmentation. 2) It must not require a global scan of memory: instead, it must be able to perform the copying incrementally, garbage collecting small regions of memory independently with cost proportional to the size of a region. This paper shows how to use a log-structured approach to memory management to achieve fragmentation-free and fast memory allocation.

In RAMCloud it was a natural choice to use a logging approach on disk to backup the data stored in main memory (given also that log-structured file-system (LSF) has been introduced by Ousterhout in 1991, it was inevitable actually :-). However, it was surprising to discover that logging also makes sense as a technique for managing the data in DRAM: Log-structured memory (LSM) takes advantage of /the restricted use of pointers/ in storage systems /to eliminate the global memory scans/ that fundamentally limit existing garbage collectors. The result is an efficient and highly incremental form of copying garbage collector that allows memory to be used efficiently even at utilizations of 80-90%.

RAMCloud uses a single log-based approach for managing both disk and main memory, with small policy differences that optimize the usage of each medium. Combining LSM and LSF, RAMCloud adopts a 2-level approach to cleaning with different policies for cleaning data in memory versus secondary storage. Since log data is immutable once appended, the log cleaner can run concurrently with normal read and write operations. Furthermore, multiple cleaners can run in separate threads. In the rest of this post, we will discuss each of these ideas: the LSM logs, 2-level cleaning, and parallel cleaning.

Log metadata and log cleaning

Each new log segment contains a log digest that describes the entire log. Every segment has a unique identifier, allocated in ascending order within a log (see Fig 3 above). Each object in the log must be self-identifying: it contains the table identifier, key, and version number for the object in addition to its value. Log metadata also contains tombstones that identify deleted objects. When an object is deleted or modified, RAMCloud does not modify the object's existing record in the log. Instead, it appends a tombstone record to the log.

To reclaim free space, the log cleaner should copy live data out of the segments it chooses for cleaning. Unfortunately, the cost of log cleaning at the disk rises rapidly as memory utilization approaches 100%.  E.g., if segments are cleaned when 80% of their data are still live, the cleaner must copy four bytes of live data for every byte it frees. If segments are cleaned at 90% utilization, the cleaner must copy 9 bytes of live data for every byte it frees.

2-Level cleaning

In the original implementation of RAMCloud, disk and memory cleaning were tied together: cleaning in memory was mirrored to the backup copies on disk. This made it impossible to achieve both high memory utilization and high write throughput. The solution to this is to clean in-memory and on-disk logs independently: 2-level cleaning. This way, memory can have higher utilization than disk. The cleaning cost for memory will be high, but DRAM can easily provide the bandwidth required to clean at 90% utilization or higher.  Disk cleaning happens less often. The disk log becomes larger than the in-memory log, so it has lower overall utilization (50%), and this reduces the bandwidth required for cleaning.

The first level of cleaning, called segment compaction, operates only on the in-memory segments on masters. It compacts a single segment at a time, copying its live data into a smaller region of memory and freeing the original storage for new segments. Segment compaction maintains the same logical log in memory and on disk: each segment in memory still has a corresponding segment on disk. However, the segment in memory takes less space because deleted objects and obsolete tombstones have been removed.

The second level of cleaning is called combined cleaning. If the disk log is allowed to grow until it consumes twice as much space as the log in memory, the utilization of segments cleaned on disk will never be greater than 50%, which makes cleaning relatively efficient.

Segments and seglets. With compaction, however, segments in memory can have different sizes. Each RAMCloud master divides its memory into fixed-size 64KB seglets. A segment consists of a collection of seglets, and the number of seglets varies with the size of the segment. Segment compaction cannot reorganize data, since it must preserve the 1-to-1 mapping between segments in memory and those on disk. Combined cleaning is there to enable segment reorganization.

Parallel cleaning

Parallel cleaning in RAMCloud is greatly simplified by the use of a log structure and simple metadata: Since segments are immutable after they are created, the cleaner never needs to worry about objects being modified while the cleaner is copying them. This means that the basic cleaning mechanism is very straightforward: the cleaner copies live data to new segments, atomically updates references in the hash table, and frees the cleaned segments.

There are 3 points of contention between cleaner threads and service threads handling read and write requests: 1) both cleaner and service threads need to add data at the head of the log, 2) the threads may conflict in updates to the hash table, 3) the cleaner must not free segments that are still in use by service threads. We consider these 3 contention points next.

Head of log contention. The most obvious way to perform cleaning is to copy the live data to the head of the log, but this would create contention for the log head between cleaner threads and service threads that are writing new data.  RAMCloud's solution is for the cleaner to write survivor data to different segments than the log head. Each cleaner thread allocates a separate set of segments for its survivor data. Synchronization is required when allocating segments, but once segments are allocated, each cleaner thread can copy data to its own survivor segments without additional synchronization.

Hash table contention. Hash table is used both by service threads and cleaner threads. The cleaner uses the hash table to check whether an object is alive (by seeing if the hash table currently points to that exact object). If the object is alive, the cleaner copies it and updates the hash table to refer to the new location in a survivor segment. To ensure consistency while reducing contention, RAMCloud currently uses fine-grained locks on individual hash table buckets. Although contention for these locks is low (only 0.1% under heavy write and cleaner load) it still means that the cleaner must acquire and release a lock for every object it scans, and read and write requests must also acquire an extra lock. Lockless solution is future work.

Freeing segments in memory. Once a cleaner thread has cleaned a segment, the segment's storage in memory can be freed for reuse. However, it is possible that a service thread had begun using the data in the segment before the cleaner updated the hash table; if so, the cleaner must not free the segment until the service thread has finished using it.  A simple solution is to use a reader-writer lock for each segment, with service threads holding reader locks while using a segment and cleaner threads holding writer locks before freeing a segment. However, this would increase the latency of all read requests by adding another lock in the critical path.

Instead, RAMCloud uses a mechanism based on epochs, which avoids locking in service threads. The only additional overhead for service threads is to read a global epoch variable and store it with the RPC.  When a cleaner thread finishes a cleaning pass, it increments the epoch and then tags each cleaned segment with the new epoch (after this point, no new request will use the cleaned segment).  The cleaner occasionally scans the epochs of active RPCs and frees the segment when all RPCs with epochs less than the segment's epoch have completed. This approach creates additional overhead for segment freeing, but these operations are infrequent and run in a separate thread where they don't impact read and write times.

Evaluation and concluding remarks

In the most stressful workload, a single RAMCloud server can support 230,000-400,000 durable 100-byte writes per second at 90% memory utilization. The two-level approach to cleaning improves performance by 2-8x over a single-level approach at high memory utilization, and reduces disk bandwidth overhead by 2-20x.  Parallel cleaning effectively hides the cost of cleaning: an active cleaner adds only about 3% to the latency of typical client write requests. For detailed evaluation results, you can check the paper.

To recap, this work showed that logging also makes sense as a technique for managing the data in DRAM: Log-structured memory takes advantage of /the restricted use of pointers/ in storage systems /to eliminate the global memory scans/ that fundamentally limit existing garbage collectors. The result is an efficient and highly incremental form of copying garbage collector that allows memory to be used efficiently even at utilizations of 80-90%.

Questions for further thinking along these lines:
"Each object in the log must be self-identifying; when the log is scanned during crash recovery, this information allows RAMCloud to identify the most recent version of an object and reconstruct the hash table." Then, why not backup (copy reconstruct) the hash-table as well?

"Log-structured memory takes advantage of the restricted use of pointers in storage systems to eliminate the global memory scans that fundamentally limit existing garbage collectors." Can you think of similar "restricted use of pointers" approaches to enable log compaction and parallel cleaning?

LSM worked for storage systems. What other applications/use-cases would this be appropriate? What applications/use-cases would this be inappropriate?

Two-phase commit and beyond

In this post, we model and explore the two-phase commit protocol using TLA+. The two-phase commit protocol is practical and is used in man...