Sunday, May 29, 2011

Online migration for geodistributed storage systems, Usenix ATC 2011

This paper investigates the problem of migrating data between data centers. Data needs to be moved from one data center to another based on access patterns; for example, a user may have moved from the East coast to the West coast. The problem is complicated by the large size of the data that needs to be moved, the requirement to perform the migration online without blocking access to any part of the data from anywhere, and the fact that the data can be accessed and modified concurrently from different locations.

To address this problem, the paper proposes an overlay abstraction. The goal of the abstraction is to implement migration as a service, so that the developer does not have to deal with the race conditions that may result from migrating data in ad hoc ways. The analogy for an overlay is a sheet of transparencies. Remember the old days before PowerPoint? Presenters used to print slides on transparencies and do animation by laying one transparency over another. The overlay idea is similar: "Where it is clear, the overlay reveals the contents underneath; where it is written, the overlay overrides those contents." Basically, the idea is to represent data as stacked layers in different places. This enables migration of data in smaller units, and makes it possible to have part of the data in one location and the rest in other locations.
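As a rough illustration of these look-through semantics (my own, not from the paper), Python's collections.ChainMap behaves exactly like a stack of transparencies: reads search the layers from top to bottom, while writes always land in the topmost layer.

```python
from collections import ChainMap

base = {"name": "alice", "city": "Boston"}   # data at the original data center
overlay = {}                                 # a fresh, mostly "transparent" layer on top

view = ChainMap(overlay, base)               # topmost layer is listed first

print(view["city"])        # "Boston" -- the overlay is clear here, so the base shows through
view["city"] = "Seattle"   # writes go only to the topmost layer
print(view["city"])        # "Seattle" -- the overlay now overrides the base
print(base["city"])        # "Boston" -- the layer underneath is untouched
```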


An overlay is implemented much like a doubly linked list. Each overlay has two pointers, one pointing to the overlay below and one pointing to the overlay above. Overlay insertion and deletion are similar to what one would expect from linked-list implementations. The overlay is designed such that every operation is linearized by the overlay structure, even when operations are submitted from different data centers. Moreover, read and write operations can execute concurrently with the overlay structure operations and with each other, at many clients, without blocking.

To write data to an object, the client first finds the highest overlay by following the above pointers starting from the base location. (The base location is learned from the directory service.) The data is written to this highest-level overlay. To read an object, again the highest overlay is found as the first step. If the data to be read is not there, the below pointers are followed until the data is reached.
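Here is a minimal single-process sketch of that traversal (my own illustration; it ignores the directory service, concurrency control, and the cross-data-center messaging the real system needs):

```python
class Overlay:
    """One layer of an object's data, linked to the layers above and below it."""
    def __init__(self, location):
        self.location = location   # e.g., the data center hosting this layer
        self.data = {}             # key -> value written into this layer
        self.above = None
        self.below = None

def highest(base):
    """Follow the 'above' pointers from the base location to the current top overlay."""
    node = base
    while node.above is not None:
        node = node.above
    return node

def insert_overlay(base, location):
    """Stack a new, empty overlay on top of the current highest overlay."""
    top = highest(base)
    new = Overlay(location)
    new.below, top.above = top, new
    return new

def write(base, key, value):
    highest(base).data[key] = value   # writes always go to the topmost overlay

def read(base, key):
    node = highest(base)              # start at the top ...
    while node is not None:
        if key in node.data:
            return node.data[key]     # ... and fall through to the layers below
        node = node.below
    raise KeyError(key)

# Data written before migration lives in the EAST base layer; after stacking a
# WEST overlay, new writes land in WEST while old data remains readable below it.
east = Overlay("EAST")
write(east, "profile", "v1")
insert_overlay(east, "WEST")
write(east, "profile", "v2")
print(read(east, "profile"))          # v2, served from the WEST overlay
```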

The contribution of the paper is an abstraction that nicely separates the policy level from the concurrency-safe execution of the actual migration operations. The paper presents several optimizations and further use cases for the overlay structure (such as exploiting built-in replication for migration, multiway caching, and split overlays).

Friday, May 6, 2011

Refuse to Crash with Re-FUSE

This paper appeared in EuroSys'11. This is a well-written paper: the paper holds your hand and takes you for a walk in the park. At each step of the path, you can easily predict what is coming next. I like this kind of easy-to-read paper, compared to the cryptic or ambiguous papers that make you wander around or guess which paths to take through a jungle of junctions.

The goal of this work is to provide support for restartable user-level filesystems. But before I can tell you more about that, we first need to discuss user-level filesystems. User-level filesystems provide a way to add custom features (such as encryption, deduplication, access to databases, access to Amazon S3, etc.) on top of existing kernel-level filesystems. FUSE is a popular framework that facilitates building user-level filesystems on top of kernel-level filesystems. FUSE is available for Linux, FreeBSD, NetBSD, and Mac OS X, and more than 200 user-level filesystems have already been implemented using it. GlusterFS, HDFS, and ZFS are some of the well-known filesystems with user-level implementations on top of FUSE.

FUSE works by wrapping the virtual filesystem (VFS) layer of UNIX systems from both sides. FUSE has a kernel filesystem module (KFM) below the VFS layer that acts as a pseudo-filesystem and queues application requests arriving through the VFS layer. FUSE also has a libfuse module that exports a simplified filesystem interface between the user-level filesystem and the KFM.
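For concreteness, here is roughly what the user-level side looks like with the fusepy bindings. This toy read-only pass-through filesystem is my own illustration (not code from the paper); libfuse invokes these methods in response to the requests that the kernel module queues up.

```python
import os, sys
from fuse import FUSE, Operations   # pip install fusepy

class Passthrough(Operations):
    """A toy read-only user-level filesystem that mirrors an existing directory."""
    def __init__(self, root):
        self.root = root

    def _real(self, path):
        return os.path.join(self.root, path.lstrip("/"))

    def getattr(self, path, fh=None):
        st = os.lstat(self._real(path))
        return {key: getattr(st, key) for key in
                ("st_mode", "st_nlink", "st_size", "st_uid",
                 "st_gid", "st_atime", "st_mtime", "st_ctime")}

    def readdir(self, path, fh):
        return [".", ".."] + os.listdir(self._real(path))

    def read(self, path, size, offset, fh):
        with open(self._real(path), "rb") as f:
            f.seek(offset)
            return f.read(size)

if __name__ == "__main__":
    # usage: python passthrough.py <source-dir> <mount-point>
    FUSE(Passthrough(sys.argv[1]), sys.argv[2], foreground=True)
```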

Re-FUSE modifies FUSE to enable support for transparent restartability of the user-filesystem. The fault-model considered is transient fail-stop failures of the user-filesystem. Re-FUSE is based on three basic principles: request-tagging, system-call logging, and non-interruptible system calls. After a crash of the user-filesystem, Re-FUSE does not attempt to roll it back to a consistent state, but rather continues forward from the inconsistent state towards a new consistent state. Re-FUSE does so by enabling partially-completed requests to continue executing from where they were stopped at the time of crash.
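The mechanism is easy to picture with a toy sketch of my own (this is not Re-FUSE's code, and it glosses over where the log lives and how descriptors are handled): tag each request, log the result of every system call performed on behalf of that request, and after a restart replay the logged results so the re-executed request deterministically reaches the point where the crash happened.

```python
# Toy illustration of request-tagging plus system-call logging (not Re-FUSE itself).
syscall_log = {}   # (request_tag, seq) -> recorded result; assumed to survive the crash

class RequestContext:
    """Carries the tag assigned to a request when it entered via the kernel module."""
    def __init__(self, tag):
        self.tag = tag
        self.seq = 0   # position within this request's sequence of system calls

    def logged(self, syscall, *args):
        """Run a syscall once; when the request is re-executed, replay the logged result."""
        key = (self.tag, self.seq)
        self.seq += 1
        if key not in syscall_log:
            syscall_log[key] = syscall(*args)   # normal path: actually perform it
        return syscall_log[key]                 # replay path: return the recorded result

def read_block(path, size, offset):
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(size)

def handle_read(ctx, path, size, offset):
    raw = ctx.logged(read_block, path, size, offset)
    # ... user-filesystem-specific processing of raw (decrypt, decompress, etc.) ...
    return raw
```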

Re-FUSE is tested with three popular, representative user-filesystems implemented on top of FUSE. For robustness testing, fault injection (both controlled and random) is used; Re-FUSE enables the user-filesystem to mask the failure and carry on uninterrupted after a crash. Re-FUSE incurs at most around 2-3% overhead in normal operation, and recovers the filesystem within 10-100 ms after a crash.

Sabbatical help

My tenure at University at Buffalo, SUNY has just become official after the President signed off on it. Having completed my 6 years at UB, I am now eligible to spend a sabbatical year. I am planning to spend 6 months of my sabbatical in an industry/research lab working on cloud computing challenges. If you have any suggestions or connections to make this happen, please send me an email.

Why are algorithms not scalable?

Recently, a colleague emailed me the following:
Since you have been reading so much about clouds, CAP, and presumably lots of consensus things, you can answer better the question of algorithm scalability. How scalable are the popular algorithms? Can they do a reasonable job of consensus with 100,000 processes? Is this even a reasonable question? What are the fundamental problems, the algorithms or the lower level communication issues?
These are actually the right kind of questions to ask when probing for deeper CS concepts. Here are my preliminary answers to these questions.

From what I read, consensus with 100K processes is really out of the question. Paxos consensus was deployed on 5 nodes for GFS and similar systems: ZooKeeper, Megastore, etc. As another example, the number of participant nodes in a Sinfonia transaction is also limited to around 5-10.

So what is wrong with the algorithms; why are they unscalable? I guess one obstacle to scalability is the "online" processing requirement, and most algorithms are inherently limited because of that. When you accept offline processing, as in MapReduce tasks, you can afford more scalability.

I think a more fundamental obstacle to scalability is the synchronization requirement in these algorithms. Synchronization is a deal breaker for scalability. The LADIS'08 summary has a very nice discussion related to this (see pages 4-5).

Our traditional complexity measures are time complexity and message complexity. Time complexity is not very relevant to scalability, but message complexity may give some indication of synchronization complexity. A node sending a message to all other nodes is very bad for scalability (which is the case for both Paxos and Sinfonia). Workflows, on the other hand, have very little message complexity. They store data to disk (rather than requiring several processes to synchronize via messages), and any process may later visit this data if it needs it. This decouples data from computation and enables more processing to be added as needed for scalability. Workflows tend to be more data-centric, whereas algorithms tend to be more computation-centric. Here is a brief discussion of the workflow idea.
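To make the contrast concrete, here is a back-of-the-envelope sketch (my own simplification, not a real protocol implementation): a coordination round has to touch every participant before anyone can proceed, whereas a workflow step just persists its output and moves on.

```python
# Back-of-the-envelope contrast between coordination and workflow-style processing.

def coordination_round(n_nodes):
    """One Paxos-style round: the coordinator contacts every node and waits for a quorum."""
    messages = 2 * n_nodes      # a request to each node plus a reply from each node
    return messages             # and the round blocks until a majority has answered

def workflow_step(record, datastore):
    """A workflow task: write the result durably; no peer needs to be contacted right now."""
    datastore.append(record)    # later tasks pull this data whenever they get to run
    return 1                    # one interaction with storage, no peer synchronization

store = []
print(coordination_round(100_000))                            # 200000 messages per agreed value
print(workflow_step({"task": 42, "output": "done"}, store))   # 1
```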

How does one formally define a "synchronization point" in a way that is not specific to a particular system or algorithm? I don't know of a good answer, but I think it is something close to a snapshot. Bloom also identifies synchronization points as culprits, and defines them as the operations that are classified as non-monotonic in the Bloom declarative language.
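This is not Bloom syntax, but the monotonic/non-monotonic distinction is easy to illustrate in plain Python (my own example): a set-union accumulation is monotonic, so acting on partial input never produces a wrong answer, whereas a query involving negation or aggregation is only safe once you know the input is complete, and that is where a synchronization point is needed.

```python
# Monotonic: the answer only grows as new facts arrive; order and timing don't matter,
# so nodes can act on partial results without coordinating.
seen_users = set()
def on_login(user):
    seen_users.add(user)        # always safe to apply immediately

# Non-monotonic: a later fact can invalidate the answer, so it is only safe to emit
# after we know *all* input has been seen -- i.e., after a synchronization point.
def users_who_never_logged_in(all_users, seen):
    return all_users - seen     # correct only when 'seen' is known to be complete
```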

Tuesday, May 3, 2011

On designing and deploying Internet scale services

This 2007 paper presents hard-earned lessons from Jim Hamilton's experience over the last 20 years in high-scale data-centric software systems and Internet-scale services. I liken this paper to "The Elements of Style" for the domain of Internet-scale services. Like "The Elements of Style", this paper is not to be consumed at once; it is to be visited again and again, every so often.

There are three main overarching principles: expect failures, keep things simple, automate everything. We will see reflections of these three principles in several subareas pertaining to Internet-scale services below.

Overall Application Design
Low-cost administration correlates highly with how closely the development, test, and operations teams work together. Some of the operations-friendly basics that have the biggest impact on overall service design are as follows.

/Design for failure/ Armando Fox had argued that the best way to test the failure path is never to shut the service down normally; just hard-fail it. This sounds counter-intuitive, but if the failure paths aren't frequently used, they won't work when needed. The acid test for fault-tolerance is the following: is the operations team willing and able to bring down any server in the service at any time without draining the workload first? (Chaos Monkey, anyone?)

/Use commodity hardware slice/ This is less expensive, scales better for performance and power-efficiency, and provides better failure granularity. For example, storage-light servers will be dual socket, 2- to 4-core systems in the $1,000 to $2,500 range with a boot disk.

Automatic Management and Provisioning
/Provide automatic provisioning and installation./
/Deliver configuration and code as a unit./
/Recover at the service level/ Handle failures and correct errors at the service level, where the full execution context is available, rather than in lower software levels. For example, build redundancy into the service rather than depending upon recovery at the lower software layer.

I would amend the above paragraph by saying "at the lowest possible service level where the execution context is available". Building fault-tolerance from the bottom up is cheaper and more reusable. Doing it only at the service level is more expensive and not reusable. Building fault-tolerance only at the service level also conflicts with another principle the paper cites: "Do not build the same functionality in multiple components."

Dependency Management
As a general rule, dependence on small components or services doesn't save enough to justify the complexity of managing them and should be avoided. Only depend on systems that are single, shared instances when multi-instancing to avoid the dependency isn't an option. When dependencies are inevitable, manage them as follows:
/Expect latency/ Don't let delays in one component or service cause delays in completely unrelated areas. Ensure all interactions have appropriate timeouts to avoid tying up resources for protracted periods.
/Isolate failures/ The architecture of the site must prevent cascading failures. Always "fail fast". When dependent services fail, mark them as down and stop using them, to prevent threads from being tied up waiting on failed components. (A minimal timeout and fail-fast sketch follows this list.)
/Implement inter-service monitoring and alerting/
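Here is a minimal sketch of the "expect latency" and "fail fast" advice above (my own, using only the standard library; the thresholds are made up): every call to the dependency carries a hard timeout, and after repeated failures the dependency is marked down so that callers fail immediately instead of piling up on it.

```python
import time
import urllib.request
import urllib.error

FAILURE_THRESHOLD = 3      # consecutive failures before marking the dependency down
RETRY_AFTER_SECONDS = 30   # how long to leave it marked down before probing again

class Dependency:
    """Wraps calls to a dependent service with a timeout and a fail-fast switch."""
    def __init__(self, url, timeout=2.0):
        self.url = url
        self.timeout = timeout       # never wait longer than this on any single call
        self.failures = 0
        self.down_until = 0.0

    def call(self):
        if time.time() < self.down_until:
            raise RuntimeError("dependency marked down; failing fast")
        try:
            with urllib.request.urlopen(self.url, timeout=self.timeout) as resp:
                self.failures = 0    # any success resets the failure count
                return resp.read()
        except (urllib.error.URLError, TimeoutError):
            self.failures += 1
            if self.failures >= FAILURE_THRESHOLD:
                self.down_until = time.time() + RETRY_AFTER_SECONDS
            raise
```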

Release Cycle and Testing
Take a new service release through standard unit, functional, and production test lab testing, and then go into limited production as the final test phase. Rather than deploying everywhere as quickly as possible, it is better to put one system in production for a few days in a single data center, then extend to two data centers, and eventually deploy globally. Big-bang deployments are very dangerous.
/Ship often and in small increments/
/Use production data to find problems/
/Support version roll-back/

Operations and Capacity Planning
Automate the procedure to move state off the damaged systems. Relying on operations to update SQL tables by hand or to move data using ad hoc techniques is courting disaster. Mistakes get made in the heat of battle. If testing in production is too risky, the script isn't ready or safe for use in an emergency.
/Make the development team responsible./ You built it, you manage it.
/Soft delete only./ Never delete anything. Just mark it deleted.
/Track resource allocation./
/Make one change at a time./
/Make everything configurable./ Even if there is no good reason why a value will need to change in production, make it changeable as long as it is easy to do.

Auditing, Monitoring, and Alerting
/Instrument everything/
/Data is the most valuable asset/
/Expose health information for monitoring/
/Track all fault tolerance mechanisms/ Fault tolerance mechanisms hide failures. Track every time a retry happens, a piece of data is copied from one place to another, or a machine is rebooted or a service restarted. Know when fault tolerance is hiding little failures so they can be tracked down before they become big failures. Once, they had a 2,000-machine service fall slowly to only 400 available machines over a period of a few days without it being noticed initially.
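For instance, here is a bare-bones retry wrapper (a sketch of my own, not from the paper) that emits a counter update and a log line on every retry, so masked failures still show up on a dashboard.

```python
import logging
import time

retry_counter = {"count": 0}   # in practice this would feed a real metrics/monitoring system

def with_retries(operation, attempts=3, delay=0.5):
    """Retry a flaky operation, but make every masked failure visible."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception as err:
            retry_counter["count"] += 1
            logging.warning("retry %d/%d after error: %s", attempt, attempts, err)
            if attempt == attempts:
                raise             # out of attempts; surface the failure
            time.sleep(delay)
```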

Graceful Degradation and Admission Control
/Support a "big red switch."/ The concept of a big red switch is to keep the vital processing progressing while shedding or delaying some noncritical workload in an emergency.
/Control admission./ If the current load cannot be processed on the system, bringing more work load into the system just assures that a larger cross section of the user base is going to get a bad experience.
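A minimal admission-control sketch (my own; the cap is arbitrary): once the number of in-flight requests reaches the limit, new arrivals are rejected immediately rather than queued, so the users already being served continue to get a good experience.

```python
import threading

MAX_IN_FLIGHT = 100   # arbitrary cap for illustration; tune to measured capacity

in_flight = threading.BoundedSemaphore(MAX_IN_FLIGHT)

def handle_request(request, process):
    # acquire(blocking=False) returns False once the cap is reached
    if not in_flight.acquire(blocking=False):
        return "503 Service Unavailable: shedding load"   # reject early and cheaply
    try:
        return process(request)
    finally:
        in_flight.release()
```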