Tuesday, April 19, 2011

Centrifuge: Integrated Lease Management and Partitioning for Cloud Services

For performance reasons, many large-scale sites (LinkedIn, Digg, Facebook, etc.) employ a pool of backend servers that operate purely on in-memory state. The frontend servers that forward requests to these backends must then be implemented very carefully to ensure that each request reaches the correct backend: the selected backend must hold the requested object in memory and also possess the lease to update it. Unfortunately, things change quickly at the backend; servers may fail, new ones may join, backends may start caching different objects due to load-balancing requirements, and leases may change hands. All of this makes the task of programming the frontend very challenging and frustrating.

This work (from NSDI'10) proposes Centrifuge, a datacenter lease manager that addresses this problem. The Centrifuge system has been deployed as part of the Microsoft Live Mesh service, a large-scale commercial cloud service used for file sharing, notifications of activity on shared files, virtual desktops, etc.

There are three parts to Centrifuge. The frontend servers use the lookup library to learn the leases and partitioning from the manager and forward requests to the backends accordingly. The manager is centralized (and Paxos-replicated for fault tolerance) and decides on leases and partitioning for the objects. The backend servers use the owner library to coordinate with the manager about leases and partitioning.

The manager consists of one leader and two standbys. A Paxos group is also used as persistent/consistent state storage and as the leader elector. The leader makes all the decisions on partitioning and leases. If the leader dies, one of the standbys becomes the leader by contacting the Paxos group, and then learns the state of leases and partitioning from the Paxos group before serving as the new leader.

The leader partitions the key space into 64 ranges using consistent hashing and manages the leases over them. The leader performs load balancing by rearranging and reassigning these lease ranges to the backends, accounting for lightly/heavily used ranges and failed backends. In contrast to the traditional model where backends would request leases, in Centrifuge the manager assigns leases to the backends unilaterally, and this simplifies many things, such as enabling leases to be assigned as ranges rather than on a per-object basis.
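To make the range-based lease assignment concrete, here is a minimal Python sketch. The class and method names are my own (this is not Centrifuge's actual API), and real consistent hashing would place backends on the hash ring; the point is just that the manager owns the whole lease table and moves ranges, not individual objects.

```python
import hashlib

NUM_RANGES = 64  # Centrifuge partitions the key space into 64 ranges

def key_to_range(key: str) -> int:
    """Hash a key and map it to one of the 64 lease ranges."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_RANGES

class Manager:
    """Toy lease manager: unilaterally assigns whole ranges to backends."""
    def __init__(self, backends):
        self.backends = list(backends)
        # Round-robin initial assignment of ranges to backends.
        self.lease_table = {r: self.backends[r % len(self.backends)]
                            for r in range(NUM_RANGES)}

    def owner_of(self, key: str) -> str:
        """Which backend currently holds the lease covering this key?"""
        return self.lease_table[key_to_range(key)]

    def reassign(self, rng: int, backend: str):
        """Load balancing: move a whole range, not per-object leases."""
        self.lease_table[rng] = backend
```

Because a lease covers an entire range, reassigning one table entry atomically moves every object in that range to the new backend.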

The lookup library at the frontend maintains a complete copy of the lease table (a constant 200KB, since leases are per range rather than per object). Lookups return "hints", which are checked again by the owner library at the backend. If the lookup table was wrong, the backend informs the corresponding frontend, and this triggers the lookup library to fetch the new table from the manager. Otherwise, the lookup table is renewed by contacting the manager every 30 seconds. Since the manager grants leases to the backends for 60-second periods (and the backends most likely renew them), the 30-second renewal period for the lookup table is reasonable.

The paper provides extensive performance evaluations, both from the MS Live Mesh system and from controlled experiments. The Centrifuge system would come in handy for many cloud deployments.

Tuesday, April 12, 2011

Smoke and Mirrors: Reflecting Files at a Geographically Remote Location Without Loss of Performance

This paper from FAST'09 introduces the Smoke and Mirrors File System (SMFS), which mirrors files at a geographically remote datacenter with negligible impact on performance. It turns out remote mirroring is a major concern for banking systems, which keep off-site mirrors (over dedicated high-speed 10-40 Gbit/s optical links) to survive disasters.

This paper is about disaster tolerance, not everyday fault tolerance. The fault model stipulates that the primary site may get destroyed, and sporadic packet losses of up to 1% may occur simultaneously, yet no data should be lost. (Data is considered lost if the client is acknowledged for an update but the corresponding update/data no longer exists in the system.) The primary site being destroyed may be a bit of an over-dramatization. An equivalent way to state the fault model is that the paper simply rules out post-hoc correction (replay or manual correction). Here is how manual correction would work: if a power outage occurs, the system drops some requests, and the mirror becomes inconsistent, then when the primary is brought up again, we can restore the lost requests from the primary and make the mirror eventually consistent. The paper rules that out, and insists that the system never lose any data.

Existing mirroring techniques
Here are the existing mirroring techniques in use:

Synchronous mirroring only sends acknowledgments to the client after receiving a response from the mirror. Data cannot be lost unless both primary and mirror sites fail. This is the most dependable solution, but performance suffers because of wide-area one-way link latencies of up to 50 ms.

Semi-synchronous mirroring sends acknowledgments to the client after the written data is stored locally at the primary site and an update is sent to the mirror. This scheme does not lose data unless the primary site fails and the sent packets are dropped on the way to the mirror.

Asynchronous mirroring sends acknowledgments to the client immediately after data is written locally. This is the solution that provides the best performance, but it is also the least dependable solution. Data loss can occur even if just the primary site fails.

Proposed network-sync mirroring
Clearly, semi-synchronous mirroring strikes a good balance between reliability and performance. The approach proposed in SMFS is essentially a small improvement on semi-synchronous mirroring. The basic idea is to ensure that once a packet has been sent, the likelihood of it being lost is as low as possible. They do this by sending forward error correction (FEC) data along with the packet and informing the sending application once the FEC has been sent along with the data. (Reed-Solomon codes are one example of FEC.) They call this technique "network-sync mirroring".
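To illustrate the idea behind FEC (not SMFS's actual Maelstrom code, which uses far more sophisticated codes than this), here is a toy sketch using a single XOR parity packet per group: any one lost packet in the group can be reconstructed at the mirror, so the sender can safely acknowledge the client once the data plus parity are on the wire.

```python
def xor_bytes(blocks):
    """XOR equal-length byte blocks together (toy FEC parity)."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

def encode_group(data_blocks):
    """Send k data packets plus one parity packet; tolerates 1 loss."""
    return list(data_blocks) + [xor_bytes(data_blocks)]

def recover(received):
    """Reconstruct the single lost block by XOR-ing the survivors.

    `received` is the k+1 packets of a group, with the lost one
    replaced by None. XOR of the survivors cancels everything
    except the missing block.
    """
    return xor_bytes([b for b in received if b is not None])
```

Real deployments would use Reed-Solomon or similar codes to tolerate burstier loss; the acknowledgment timing (after FEC is sent, not after the mirror replies) is what distinguishes network-sync from both semi-synchronous and synchronous mirroring.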

This idea is simple and straightforward, but this work provides a very good execution of it. SMFS employs the authors' previous work, Maelstrom (NSDI'08), to provide FEC for wide-area-network transmission. SMFS implements a filesystem that preserves the order of operations in the structure of the filesystem itself: a log-structured filesystem. The paper also presents several real-world experiments to evaluate the performance of SMFS as well as its disaster tolerance. Here are two graphs from the evaluation section.

Consistency analysis in Bloom: a CALM and collected approach

This work from CIDR'11 aims to provide theoretical foundations for correctness under eventual consistency, and identifies "order independence" (independence of program execution from temporal nondeterminism) as a sufficient condition for eventual consistency.

The CALM (consistency and logical monotonicity) principle states that monotonic programs guarantee order independence and, hence, eventual consistency. A monotonic program is one where any true statement remains true as new axioms arrive. In contrast, in a non-monotonic program, new input can cause earlier output to be revoked. A monotonic program guarantees eventual consistency, whereas non-monotonicity requires the use of coordination logic.

To simplify the task of identifying monotonic and nonmonotonic program segments, this work proposes using program analysis in a declarative language, Bloom. In Bloom, monotonicity of a program is examined via simple syntactic checks. Selection, join, and projection operators are monotonic, whereas aggregation and negation operators are nonmonotonic and are flagged as "points of order" in the program. Coordination logic needs to be introduced at these points of order to achieve consistency.
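The flavor of that syntactic check can be captured in a few lines. This is my own illustrative sketch, not Bud's actual analysis or operator names: classify each operator in a program as monotonic or not, and flag the non-monotonic ones as points of order where coordination must be inserted.

```python
# Toy "points of order" analysis in the spirit of Bloom's syntactic check.
# Operator names here are illustrative, not actual Bud syntax.
MONOTONIC = {"select", "join", "project"}          # safe under reordering
NON_MONOTONIC = {"aggregate", "negate", "count"}   # need coordination

def points_of_order(program):
    """Return the operators where coordination logic must be inserted."""
    return [op for op in program if op in NON_MONOTONIC]
```

A program built only from monotonic operators needs no coordination at all; each non-monotonic operator returned here is a place where the runtime must wait until its input is sealed.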

Bloom is implemented in Ruby as a domain specific language, called Bud. The paper presents two case studies, a key-value store and a shopping cart implementation, to demonstrate the above concepts. Bloom's alpha release was out a couple days ago. Congratulations to the Bloom team! I am sure they will receive useful feedback as the language gets used for building things.

Our related work on CALM principle
I tried to think of an analogue to the monotonicity property in the context of guarded-command languages. I think the monotonicity property corresponds to the guards being "stable" (closed under program actions) in a guarded-command program. If all guards of a program are stable, then the program is monotonic. For a guard that refers to the state at other processes, normally we would need synchronization and atomicity to evaluate the guard and execute the statement in the same step. But for actions with stable guards, we don't need that; we can evaluate the guard at one step and execute the statement at a later step, with several other actions from other processes executing in between, without needing synchronization.
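Here is a toy Python rendering of that point (my own example, not from the paper): the guard below is stable because no action ever falsifies it, so evaluating the guard and executing the statement can be decoupled across steps without any synchronization.

```python
class Process:
    """Toy guarded-command process with one stable guard."""
    def __init__(self):
        self.token_received = False   # stable: only ever set to True
        self.acted = False

    def deliver_token(self):
        self.token_received = True    # no action ever falsifies this

    def guard(self) -> bool:
        return self.token_received

    def act(self):
        self.acted = True
```

If the guard could be falsified by a concurrent action (e.g., a token that can be revoked), evaluating it at one step and acting at a later step would be unsafe, and we would be back to needing synchronization.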

We had in fact noticed this as a nice property a year ago, and published a paper on this with some generalizations of the stable property: Slow is Fast for Wireless Sensor Networks in the presence of Message Losses

This work on guarded-command languages can provide an imperative alternative to declarative languages for realizing the CALM principle. Declarative programs are hard to master for many developers (count me here) and may be difficult to test and debug. Imperative programs have an advantage in this regard.

Concluding remarks
I think this is a promising direction to pursue. As Barbara Liskov mentioned in her ACM Turing Award Lecture (SOSP'09), "The Power of Abstraction", the next breakthrough for distributed computing will most likely be led by novel programming languages/abstractions.

I guess the next interesting question for this work is: what are the rules of thumb for writing programs with fewer synchronization points?

Exercise questions
Are map, reduce primitives in Bloom monotonic? What does this imply for map-reduce chain programs?

Can you reconstruct any of the analysis for the provided programs?

Sunday, April 10, 2011

Life beyond Distributed Transactions: an Apostate's Opinion

Pat Helland is one of the veterans of the database community. He worked on the Tandem Computers with Jim Gray. His tribute to Jim Gray, which gives a lot of insights into Jim Gray as a researcher, is worth reading again and again.

This 2007 position paper from Pat Helland is about extreme scalability in cloud systems, and is by its nature anti-transactional. Since Pat has been a strong advocate for transactions and global serializability for most of his career, the paper is aptly titled an apostate's opinion.

This paper is very relevant to the NoSQL movement. Pat introduces the "entity and activities" abstractions as building primitives for extreme-scalability cloud systems. He also talks at length about the need to craft a good workflow/business-logic layer on top of these primitives.

Entity and activities abstractions
Entities are collections of named (keyed) data which may be atomically updated within the entity but never atomically updated across entities. An entity lives on a single machine at a time and the application can only manipulate one entity atomically. A consequence of almost-infinite scaling is that this programmatic abstraction must be exposed to the developer of business logic. Each entity has a unique ID, and entities represent disjoint sets of data.

Since you can’t update the data across two entities in the same transaction, you need a mechanism to update the data in different transactions. The connection between the entities is via a message addressed to the other entity.

Activities comprise the collection of state within the entities used to manage messaging relationships with a single partner entity. Activities keep track of messages between entities. This can be used to keep entities eventually-consistent, even when we are limited to do the transaction on a single entity. (Messaging notifies the other entity about this activity, and the other entity may update its state.)

The key-value tuple concept widely employed in the key-value stores of cloud computing systems is a good example of an entity. However, key-value tuples do not specify any explicit "activities". Note that if we can make messages between entities idempotent, then we don't need to keep activities for the entities; the entity+activities concept then reduces to the key-value tuple concept.
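A minimal sketch of the entity+activities idea (my own toy code, not from the paper): each entity is atomically updatable on its own, and its activity state is simply the set of message ids it has already applied, which makes cross-entity messages idempotent under retries.

```python
class Entity:
    """Toy entity: atomic within itself, with per-message activity state."""
    def __init__(self, eid: str):
        self.eid = eid
        self.data = {}
        self.seen = set()   # activity state: message ids already applied

    def apply_message(self, msg_id: str, update: dict) -> bool:
        """Apply an incoming message exactly once (idempotent delivery)."""
        if msg_id in self.seen:
            return False            # duplicate: retry is safely ignored
        self.seen.add(msg_id)
        self.data.update(update)    # atomic only within this one entity
        return True
```

Because the sender may retransmit after a timeout, the `seen` set is what lets the receiving entity tolerate at-least-once delivery; drop it, and duplicated messages would corrupt the eventually consistent state.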

In fact, several developers have independently invented different ad-hoc ways of implementing activities on top of entities. What Pat advocates is explicitly recognizing activities and developing a standard primitive for implementing them, to avoid inconsistency bugs.

An example of an activity is found in Google's Percolator paper, which replaced MapReduce for creating Google's PageRank index. Percolator provides distributed-transaction middleware on top of BigTable. Each row is an entity, since a transaction is atomic with respect to a single row at any time. However, to build a distributed transaction, the system must remember the state of the transaction with respect to the other involved rows, i.e., the "activities". This Percolator metadata is again encoded in separate fields of the row in BigTable. Percolator logs the state, for example the primary and secondary locks, in these fields. (See Figure 5 for the full list.) I guess using coordination services such as ZooKeeper is another way of implicitly implementing activities.

Workflow is for dealing with uncertainty at a distance
In a system which cannot count on atomic distributed transactions, the management of uncertainty must be implemented in the business logic. The uncertainty of the outcome is held in the business semantics rather than in the record lock. This is simply workflow. Think about the style of interactions common across businesses. Contracts between businesses include time commitments, cancellation clauses, reserved resources, and much more. The semantics of uncertainty is wrapped up in the behaviour of the business functionality. While more complicated to implement than simply using atomic distributed transactions, it is how the real world works. Again, this is simply an argument for workflow but it is fine-grained workflow with entities as the participants.

Concluding remarks
Systematic support for implementing the activities concept is still lacking today. It seems this concept needs to be addressed more explicitly and more methodically to improve NoSQL systems.

Workflow is also prescribed as the way to deal with the lack of atomic distributed transactions. Workflow requires the developer to think hard and decide on the business logic for dealing with the decentralized nature of the process: time commitments, cancellation clauses, reserved resources, etc. But is there any support for developing/testing/verifying workflows?

Saturday, April 9, 2011

Tempest and Deuteronomy reviews

I am three weeks behind in writing summaries for the papers we discuss in my seminar. In order to have some chance of catching up, I am skipping the summaries for the Tempest and Deuteronomy papers, and refer to the student summaries for these two.

Tempest: Soft state replication in the service tier (DSN'08)

Deuteronomy: Transaction Support for Cloud Data (CIDR'11)
