Friday, January 28, 2011

Lessons from giant-scale services

This is a 2001 paper by Eric Brewer summarizing the lessons he learned from designing and developing giant-scale internet services (see Inktomi) between 1995 and 2001. This paper does not mention the CAP theorem at all! This is very surprising, given that the CAP theorem is what Brewer is most famous for among people in the internet services domain. (The CAP theorem is very famous and has been very influential in the design of internet services and cloud computing systems. It was first mentioned in publication in a 1998 paper, and then presented by Brewer in his PODC'00 keynote. See these two posts for a detailed treatment of the CAP theorem.)

Instead, this paper is all about the DQ principle as a design guideline for internet services. The paper mentions harvest and yield, which may be seen as finer-granularity versions of consistency and availability respectively, and which may connect back to the CAP theorem. I discuss that connection at the end.

DQ principle
In the web services model, clients make read queries to the servers, and servers return data to answer those queries. The DQ principle is simple:
data_per_query * queries_per_second == constant

The intuition behind this principle is that the system's overall capacity tends to have a particular physical bottleneck, such as total I/O bandwidth of all disks in the system, which is tied to data movement. The DQ value is the total amount of data that has to be moved per second on average, and it is thus bounded by the underlying physical limitation. At the high utilization level typical of giant-scale systems, the DQ value approaches this limitation.

DQ should hopefully not be network-bound or CPU-bound (e.g., not bound by the load manager's performance). If it is, you should fix that by throwing more resources at the problem, so that you can provide a large-scale web service. Web services are generally data intensive, so the DQ model is a good fit for most of them.

DQ is configuration specific. DQ increases with node additions (provided that the network bandwidth limit is not reached), and decreases with node failures. So failures may push your system over capacity. (You may also view a sudden surge in query traffic beyond system capacity as a failure; looked at from the reverse perspective, it reduces to the same problem.)
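To make the principle concrete, here is a minimal Python sketch (all names and numbers are mine, not from the paper): the aggregate DQ budget grows with node additions and shrinks with failures, and for a fixed data-per-query D, the sustainable query rate Q is simply DQ / D.

```python
# Hypothetical illustration of the DQ principle: total data moved per second
# is roughly constant for a given configuration, so D and Q trade off.

def dq_budget(nodes, dq_per_node):
    """Aggregate DQ capacity; grows with node additions, shrinks with failures."""
    return nodes * dq_per_node

def max_queries_per_sec(nodes, dq_per_node, data_per_query):
    # Q = DQ / D
    return dq_budget(nodes, dq_per_node) / data_per_query

# 100 nodes at full strength, vs. the same cluster after losing 10 nodes.
full = max_queries_per_sec(nodes=100, dq_per_node=50.0, data_per_query=2.0)
after_failures = max_queries_per_sec(nodes=90, dq_per_node=50.0, data_per_query=2.0)
```

The per-node DQ value (50.0 here) is a made-up stand-in for whatever the actual physical bottleneck is, e.g., aggregate disk I/O bandwidth.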

Yield and Harvest
Yield is the probability of completing a request, and harvest measures the completeness of the answer to the query. More formally:
yield = queries_completed / queries_offered
harvest = data_available / complete_data

Yield provides a better metric of availability than simply uptime. Being down for one second at peak and off-peak times generates the same uptime, but vastly different yields because there might be an order-of-magnitude difference in load between the peak second and the minimum-load second. Harvest provides a metric for consistency, of how much of the database is reflected in the response.

Due to faults or saturation DQ will be reached sooner or later, and the system will have to make a choice between reducing yield (i.e., stop answering requests) and reducing harvest (i.e., giving answers based on incomplete data). The key insight here is that we can influence whether faults impact yield, harvest, or both. Here is how yield and harvest relate to DQ: When DQ is reached, we can either limit Q (capacity) to maintain D, or we can reduce D and increase Q. We can focus on harvest through admission control (AC), which reduces Q, or on yield through dynamic database reduction, which reduces D, or we can use a combination of the two.
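The two options can be sketched in a few lines of Python (the budget value and function names are mine): admission control caps Q while keeping D, and database reduction shrinks D while keeping Q.

```python
# A minimal sketch: when offered load exceeds the DQ budget, either admit
# fewer queries (full harvest, lower yield) or answer every query using a
# reduced fraction of the data (full yield, lower harvest).

DQ_BUDGET = 1000.0  # hypothetical capacity: data units moved per second

def admission_control(offered_q, d):
    """Keep D fixed and cap Q. Returns (yield, harvest)."""
    served = min(offered_q, DQ_BUDGET / d)
    return served / offered_q, 1.0

def data_reduction(offered_q, d):
    """Keep Q fixed and shrink D. Returns (yield, harvest)."""
    d_served = min(d, DQ_BUDGET / offered_q)
    return 1.0, d_served / d
```

For example, at 200 offered queries per second with 10 data units per query (a 2x overload), admission control yields (0.5, 1.0) while data reduction yields (1.0, 0.5).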

Replication vs. Partitioning
Replicated systems tend to map faults to reduced capacity (and to reduced yield at high utilizations), while partitioned systems tend to map faults to reduced harvest, as parts of the database temporarily disappear, but the capacity in queries per second remains the same. Consider a two-node cluster: The replicated version has traditionally been viewed as "better" because it maintains 100 percent harvest under a fault, whereas the partitioned version drops to 50 percent harvest. Dual analysis shows that the replicated version drops to 50 percent yield, while the partitioned version remains at 100 percent yield. Furthermore, both versions have the same initial DQ value and lose 50 percent of it under one fault: Replicas maintain D and reduce Q (and thus yield), while partitions keep Q constant and reduce D (and thus harvest).
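The two-node example can be encoded directly (this is just my restatement of the paper's numbers, not anything from it):

```python
# Both schemes start with the same DQ value and lose half of it under one
# node failure, but they lose it along different axes.

def after_one_fault(scheme):
    if scheme == "replicated":
        # The surviving replica holds all the data (full harvest) but
        # provides only half the query capacity (half yield).
        return {"yield": 0.5, "harvest": 1.0}
    if scheme == "partitioned":
        # The surviving partition answers every query it can route (full
        # yield) but holds only half the data (half harvest).
        return {"yield": 1.0, "harvest": 0.5}
    raise ValueError(scheme)
```

Either way the product yield * harvest drops from 1.0 to 0.5, reflecting the same 50 percent DQ loss.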

Although you need more copies of the data with replication, the real cost is in the DQ bottleneck rather than storage space. Moreover, replication on disk is cheap, and only accessing the replicated data requires DQ points. According to these principles, you should always use replicas above some specified throughput. With no DQ difference, it sense to replicate the data once the partitions are a convenient size. You will enjoy more control over harvest and support for disaster recovery, and it is easier to grow systems via replication than by repartitioning onto more nodes.

We can vary the replication according to the data's importance, and generally control which data is lost in the presence of a fault. For example, for the cost of some extra disk space we can replicate key data in a partitioned system. Alternatively, we can exploit randomization to make our lost harvest a random subset of the data (as well as to avoid hot spots in partitions). Many load-balancing switches can use a (pseudo-random) hash function to partition the data, for example. This makes the average and worst-case losses the same, because we lose a random subset of "average" value. The Inktomi search engine uses partial replication; e-mail systems use full replication; and clustered Web caches use no replication. All three use randomization.
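A hash-partitioning sketch makes the randomization point concrete (the key format and partition count are made up for illustration): losing one partition loses a pseudo-random subset of the keys, so the expected harvest loss equals the worst-case loss.

```python
# Hash keys to partitions so that a partition failure loses a pseudo-random
# subset of the data (no "hot" partition full of high-value keys).
import hashlib

def partition_of(key, n_partitions):
    # A deterministic pseudo-random placement, stable across processes.
    h = int(hashlib.sha1(key.encode()).hexdigest(), 16)
    return h % n_partitions

def harvest_after_loss(keys, n_partitions, failed_partition):
    surviving = [k for k in keys if partition_of(k, n_partitions) != failed_partition]
    return len(surviving) / len(keys)

keys = [f"doc-{i}" for i in range(1000)]
# With 4 partitions, losing one leaves roughly 75 percent harvest.
h = harvest_after_loss(keys, n_partitions=4, failed_partition=0)
```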

Graceful degradation
Graceful degradation is simply the explicit process for managing the effect of saturation on availability; that is, we explicitly decide how saturation should affect uptime, harvest, and quality of service. The paper gives the following graceful degradation examples taken from real systems:

Cost-based AC. Inktomi can perform AC based on the estimated query cost (measured in DQ). This reduces the average data required per query, and thus increases Q. Note that the AC policy affects both D and Q. Denying one expensive query could thus enable several inexpensive ones, giving us a net gain in harvest and yield. (AC could also be done probabilistically — perhaps in the style of lottery scheduling, so that retrying hard queries eventually works.)
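A toy version of cost-based admission control (the policy of favoring cheap queries is my simplification of what Inktomi actually does): denying one expensive query can free enough DQ budget to serve several inexpensive ones.

```python
# Sketch: under saturation, cap the per-query DQ cost and spend the budget
# on cheap queries first, raising Q (and yield) at the expense of the most
# expensive queries.

def serve(queries, dq_budget, cost_cap):
    """queries: list of estimated DQ costs. Returns the number served."""
    served = 0
    for cost in sorted(queries):            # favor cheap queries under saturation
        if cost <= cost_cap and cost <= dq_budget:
            dq_budget -= cost
            served += 1
    return served

# With a budget of 10, capping the admissible cost at 3 rejects the single
# cost-8 query and serves four cost-2 queries instead.
n = serve([8, 2, 2, 2, 2], dq_budget=10, cost_cap=3)
```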

Priority- or value-based AC. Datek handles stock trade requests differently from other queries and guarantees that they will be executed within 60 seconds, or the user pays no commission. The idea is to reduce the required DQ by dropping low-value queries, independently of their DQ cost.

Reduced data freshness. Under saturation, a financial site can make stock quotes expire less frequently. This reduces freshness but also reduces the work per query, and thus increases yield at the expense of harvest. (The cached queries don't reflect the current database and thus have lower harvest.)
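One way such a freshness knob might look in code (the TTL schedule and threshold are entirely my invention): stretch cache expiry times as utilization climbs, so saturated periods serve more (possibly stale) cached answers.

```python
# Sketch: longer cache TTLs under load trade harvest (freshness) for yield.
import time

BASE_TTL = 1.0   # seconds, under light load

def quote_ttl(utilization):
    """Stretch expiry as the system saturates (utilization in [0, 1])."""
    if utilization < 0.7:
        return BASE_TTL
    return BASE_TTL * (1 + 10 * (utilization - 0.7))

cache = {}

def get_quote(symbol, fetch, utilization, now=None):
    now = time.monotonic() if now is None else now
    hit = cache.get(symbol)
    if hit and now - hit[1] < quote_ttl(utilization):
        return hit[0]              # possibly stale: lower harvest, cheaper query
    value = fetch(symbol)          # full-cost path: fresh data
    cache[symbol] = (value, now)
    return value
```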

Concluding remarks
Given its usefulness in the design of large-scale systems, I wonder why we have not been hearing about the DQ principle as much as the CAP theorem. Maybe the principle is known under a different name?

As I mentioned in the beginning, harvest and yield may be seen as finer-granularity versions of consistency and availability, respectively. So, does this harvest-yield tradeoff relate to the CAP theorem's consistency-availability tradeoff? I am not sure. The harvest-yield tradeoff in this paper stems from a capacity limitation, whereas the consistency-availability tradeoff in the CAP theorem stems from partitions (lack of communication). I think the harvest-yield tradeoff may be more closely related to the consistency-vs-low-latency tradeoff in the PACELC model. If the system insists on providing full harvest, its yield will suffer because it will complete fewer queries per unit time, and hence its latency will increase.

Tuesday, January 18, 2011

Hints for computer system design, ACM-OS'83

My seminar on cloud computing systems has started today with two papers. Here is the summary of the first paper we covered in the seminar.

Designing a computer system is very different from designing an algorithm: the external interface (that is, the requirement) is less precisely defined, more complex, and more subject to change. The system has much more internal structure, and hence many internal interfaces. The measure of success is much less clear. In this 1983 paper, Butler Lampson gives hints for computer system design based on his experience building several systems at the Xerox PARC labs. It is amazing how relevant and fresh these hints remain almost 30 years after their publication. While these hints do not specifically target distributed systems design, several of them are applicable (and widely used) in cloud computing systems.

Most of the text below is verbatim copied from the paper. I omitted about half of the hints, and details/justifications about the hints. The actual paper is 27 pages long, and is well worth a read. A related paper is by Jim Waldo, called "On System Design". Another related paper, targeting cloud computing system design is by James Hamilton, called "On designing and deploying Internet scale services". We will cover that paper at the end of the semester in our seminar.

Hints pertaining to Functionality
"An interface separates an implementation of some abstraction from the clients who use the abstraction." I think most people are familiar with this definition. But I guess the next observation may come as a revelation to many (except those who have studied formal methods and program verification): "The interface between two programs consists of the set of assumptions that each programmer needs to make about the other program in order to demonstrate the correctness of his program." The reason this definition of an interface is lesser known may be that this aspect of interfaces does not show up frequently in traditional APIs. But this type of rely-guarantee interface is very relevant and important for correct system design: a component relies on some assumptions in order to provide its guarantees.

Defining interfaces is the most important part of system design. Usually it is also the most difficult, since the interface design must satisfy three conflicting requirements: an interface should be simple, it should be complete, and it should admit a sufficiently small and fast implementation.

Keep it simple: Do one thing at a time, and do it well. An interface should capture the minimum essentials of an abstraction.

Make it fast, rather than general or powerful, and leave it to the client: The Unix system encourages the building of small programs that take one or more character streams as input, produce one or more streams as output, and do one operation. When this style is imitated properly, each program has a simple interface and does one thing well, leaving the client to combine a set of such programs with its own code and achieve precisely the effect desired.

Handle normal and worst cases separately: The requirements for the two are quite different; The normal case must be fast. The worst case must make some progress. Caches and hints are examples of special treatment for the normal case, but there are many others.

Hints pertaining to Speed
Split resources in a fixed way if in doubt, rather than sharing them. Use static analysis if you can. Cache answers to expensive computations, rather than doing them over. Safety first, optimize later.

Use hints to speed up normal execution: A hint, like a cache entry, is the saved result of some computation. It is different in two ways: it may be wrong, and it is not necessarily reached by an associative lookup. Because a hint may be wrong, there must be a way to check its correctness before taking any unrecoverable action. It is checked against the 'truth', information that must be correct but can be optimized for this purpose rather than for efficient execution. Like a cache entry, the purpose of a hint is to make the system run faster. Usually this means that it must be correct nearly all the time. Hints are very relevant for speculative execution, and you can find several examples of hints and speculative execution in computer systems. One example of a hint is in the Ethernet, in which lack of a carrier signal on the cable is used as a hint that a packet can be sent. If two senders take the hint simultaneously, there is a collision that both can detect; both stop, delay for a randomly chosen interval, and then try again.
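The hint pattern can be sketched in a few lines (the location-service scenario and all names are my own, not Lampson's): take the fast path when the hint checks out, and fall back to the truth, repairing the hint, when it does not.

```python
# Sketch of the hint pattern: a hint may be wrong, so it must be checked
# against the truth before any unrecoverable action.

truth = {"home_server": "node-7"}     # authoritative, but slower to consult
hint = {"home_server": "node-3"}      # cached guess; may be stale or wrong

def probe(server, key, lookups):
    lookups.append(f"probe:{server}")
    return truth[key] == server       # stand-in for asking the server itself

def locate(key, lookups):
    guess = hint.get(key)
    if guess is not None and probe(guess, key, lookups):
        return guess                  # fast path: the hint was right
    actual = truth[key]               # slow path: consult the truth
    hint[key] = actual                # repair the hint for next time
    lookups.append("truth")
    return actual
```

The Ethernet example fits the same shape: "no carrier" is the hint, and collision detection is the correctness check that triggers recovery (stop, back off, retry).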

Shed load to control demand, rather than allowing the system to become overloaded: There are many ways to shed load. An interactive system can refuse new users, or even deny service to existing ones.

Hints pertaining to Fault-tolerance
End-to-end: Error recovery at the application level is absolutely necessary for a reliable system, and any other error detection or recovery is not logically necessary but is strictly for performance. (Caveats: There are two problems with the end-to-end strategy. First, it requires a cheap test for success. Second, it can lead to working systems with severe performance defects that may not appear until the system becomes operational and is placed under heavy load.)

Log updates to record the truth about the state of an object: A log is a very simple data structure that can be reliably written and read, and cheaply forced out onto disk or other stable storage that can survive a crash. Because it is append-only, the amount of writing is minimized. To use the technique, record every update to an object as a log entry consisting of the name of the update procedure (a.k.a. operation) and its arguments. The operation specified by the log entry can be re-executed later, and if the object being updated is in the same state as when the update was first done, it will end up in the same state as after the update was first done. By induction, this means that a sequence of log entries can be re-executed, starting with the same objects, and produce the same objects that were produced in the original execution.
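A minimal sketch of the logging technique (operation names and the dict-based object are my own choices): record (operation, args) for every update, and recover by replaying the log against the initial state.

```python
# Each update is logged as (operation-name, args). Replaying the log against
# the same initial state reproduces the final state, by induction over entries.

OPS = {
    "set":    lambda obj, k, v: obj.__setitem__(k, v),
    "delete": lambda obj, k: obj.pop(k, None),
}

log = []

def apply_update(obj, op, *args):
    log.append((op, args))            # record the update before it takes effect
    OPS[op](obj, *args)

def replay(initial, entries):
    obj = dict(initial)
    for op, args in entries:
        OPS[op](obj, *args)
    return obj

state = {}
apply_update(state, "set", "x", 1)
apply_update(state, "set", "y", 2)
apply_update(state, "delete", "x")
# replay({}, log) now reproduces `state` from the empty initial state.
```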

Make actions atomic or restartable: An atomic action (often called a transaction) is one that either completes or has no effect. The advantages of atomic actions for fault-tolerance are obvious: if a failure occurs during the action it has no effect, so that in recovering from a failure it is not necessary to deal with any of the intermediate states of the action.
Summary of the hints

Final remarks
Butler Lampson (Turing award winner in 1992) is a big proponent of using formal methods in systems design. His ideas there are heavily influenced by Nancy Lynch's work on I/O automata and simulation relations. The basic idea is to model the system at a high level using an I/O automaton and prove safety and progress properties on this model. Then the system is refined gradually by writing more concrete I/O automata and proving that the concrete I/O automata "refine" (a.k.a. implement) the behavior of the abstract I/O automaton using forward and backward simulation techniques. (Refinement means that all the collective behavior of the concrete automata is a behavior of the abstract automaton.) See Lampson's course notes for details.

A related way of looking at system design is by Leslie Lamport. In Lamport's approach, refinement is not proved by simulation relations of I/O automata, rather by mathematical precondition/postcondition reasoning on state machines. Lamport has designed the TLA framework to help in this process. Another highlight of Lamport's approach is its emphasis on invariant-based reasoning/design for taming/overcoming the concurrency problems in distributed systems. Invariant-based design provides a non-operational way to reason about concurrent systems and avoids the complexities and bugs of operational reasoning (a.k.a. handwaving) for concurrent systems. For invariant-based reasoning, it is enough to consider each operation/procedure once (instead of several times in a trace for operational reasoning) and prove that the procedure preserves the invariant. After proving each procedure preserves the invariant, we are guaranteed by induction that regardless of execution sequence of the procedures the invariant continues to hold in the system. So the safety conditions in the invariant are satisfied throughout the execution, and progress conditions can be proved assuming/using this invariant as a starting point and showing a variant function. For details see my Spring'10 distributed systems course notes.

Both approaches are instances of Dijkstra's stepwise refinement idea (Dijkstra won the Turing award in 1972). The problem with both approaches is that, as we move to more concrete models, things get very complicated and unavoidably operational due to the large amount of state, libraries, and context introduced. As a result, these approaches cannot scale without some compiler support. Yet, not all is in vain. The discipline of proving at the design level prevents major design errors. Some implementation-level errors may still be introduced during implementation, but those are not major errors and are easier to fix than design errors. I hope to write a couple of posts about invariant-based design and stepwise refinement later.

Crash-only software, HOTOS'03

Here is the summary for the second paper we covered in the first class of our seminar. The paper has a provocative title, and in fact there is a wikipedia entry on this title: "Crash-only software refers to computer programs that handle failures by simply restarting, without attempting any sophisticated recovery. Correctly written components of crash-only software can microreboot to a known-good state without the help of a user. Since failure-handling and normal startup use the same methods, this can increase the chance that bugs in failure-handling code will be noticed..." (We had previously seen this observation in the last section of Ousterhout's "The role of distributed state" paper.)

Since crashes are unavoidable, software must be at least as well prepared for a crash as it is for a clean shutdown. But then --in the spirit of Occam's Razor-- if software is crash-safe, why support additional, non-crash mechanisms for shutting down? A frequent reason is the desire for higher performance. For example, to avoid slow synchronous disk writes, many UNIX file systems cache metadata updates in memory. As a result, when a UNIX workstation crashes, the file system reaches an inconsistent state that takes a lengthy fsck to repair, an inconvenience that could have been avoided by shutting down cleanly. This captures the design tradeoff that improves steady state performance at the expense of shutdown and recovery performance. But, if the cost of such performance enhancements is dependability, perhaps it's time to reevaluate our design strategy.

The major benefit of a crash-only design is the following: a crash-only system makes it affordable to transform every detected failure into a component-level crash; this leads to a simple fault model, and components only need to know how to recover from one type of failure. If we state invariants about the system's failure behavior and make that behavior predictable, we are effectively coercing reality into a small universe governed by well-understood laws. If we instead allow arbitrary fault and recovery behavior, the resulting set of system states becomes very large and complex.

Requirements of Crash-Only Software
To make components crash-only, the approach requires that all important non-volatile state be kept in dedicated crash-only state stores, leaving applications with just program logic. Specialized state stores (e.g., databases, file system appliances, distributed data structures, non-transactional hashtables, session state stores, etc.) are much better suited to managing state than code written by developers with minimal training in systems programming. Applications become stateless clients of the state stores, which allows them to have simpler and faster recovery routines. (This requirement will unavoidably hurt system performance.)

To make a system of interconnected components crash-only, it must be designed so that components can tolerate the crashes and temporary unavailability of their peers. Thus, the approach prescribes strong modularity with relatively impermeable component boundaries, timeout-based communication and lease-based resource allocation, and self-describing requests that carry a time-to-live and information on whether they are idempotent. Many Internet systems today have some subset of these properties as they are built from many heterogenous components, accessed over standard request-reply protocols such as HTTP, and serve workloads that consist of large numbers of relatively short tasks that frame state updates.
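Here is a sketch of what a self-describing request with retry might look like (the field names `ttl` and `idempotent` are my guess at what "time-to-live and idempotency information" translates to in code): callers retry idempotent requests across peer crashes, but only until the request's TTL expires.

```python
# Sketch: timeout/TTL-based communication between crash-only components.
# A crashed peer may be microrebooting, so idempotent requests are retried.
import time

class PeerUnavailable(Exception):
    pass

def call(peer, request, now=time.monotonic):
    deadline = now() + request["ttl"]
    while True:
        try:
            return peer(request)
        except PeerUnavailable:
            if not request["idempotent"] or now() >= deadline:
                raise               # let the caller compensate or give up
            # otherwise retry: the peer may come back after a microreboot
```

A real implementation would pause between retries (and honor any RetryAfter estimate); the bare loop here just keeps the sketch deterministic.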

What can go wrong here?
The paper mentions the following potential problem: The dynamics of loosely coupled systems can sometimes be surprising. For example, resubmitting requests to a component that is recovering can overload it and make it fail again; so the RetryAfter exceptions provide an estimated time-to-recover. To prevent reboot cycles and other unstable conditions during recovery, it is possible to quiesce the system when a set of components is being crash-rebooted. This can be done at the communication/RPC layer, or for the system as a whole. In our prototype, we use a stall proxy in front of the web tier to keep new requests from entering the system during the recovery process.

Taking things further, I can think of scarier scenarios. A component restart may take some time and hence may trigger restarts or problems at other components that depend on it. As a result, one restart may trigger a system-wide restart storm. The silver lining is that, by designing the system to be crash-only, you will hopefully test for and notice these problems before deployment, since these cases will be exercised often.

Another complaint is that the crash-only approach puts some burden on the developers. It requires that components be designed to be very loosely coupled. It requires the developer to write code that keeps track of lost requests and retries them if they are idempotent; otherwise the developer must perform rollback recovery, apply a compensating operation, or somehow tolerate the inconsistency. While this code is only for component-level recovery (and thankfully not system-wide recovery), it may still grow too complex for the developer to handle. Again, this may be unavoidable; fault-tolerance comes with a cost.

Another concern of mine about the crash-only approach is continuous cyclic corruption of components (this case is different from "the re-corruption of a component during recovery" mentioned two paragraphs above). In this scenario, faults in component A leak and corrupt component B, and after A restarts and corrects itself, faults leak from component B to re-contaminate A. Rinse and repeat, and we have a vicious cycle of ongoing corruption. My advisor Anish Arora had a solution to this problem for the composition of self-stabilizing components. The solution used dependency graphs over the corruption and correction relations among components, and accordingly prescribed blocking wrappers to freeze contamination and break the cycles. The authors of the crash-only approach have a simpler solution, which they provide in their recursive-restartability paper. Their idea is to first attempt recovery of a small subset of components (say, only one component), and if restart proves ineffective, subsequent attempts recover progressively larger subsets. In other words, this technique chases the fault through successive boundaries, and can thus break the corruption cycle. It is not very efficient, but it is simple and works eventually.
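The escalating-restart idea can be sketched as a simple loop (my own rendering of the recursive-restartability strategy; the helper names are made up): restart the smallest subset first and widen the blast radius only if the fault persists.

```python
# Sketch of progressively larger restarts: chase the fault through
# successive component boundaries until health is restored.

def recover(components, is_healthy, restart):
    """components: ordered list; restart(subset) reboots that subset.
    Returns the subset whose restart restored health."""
    for size in range(1, len(components) + 1):
        subset = components[:size]
        restart(subset)
        if is_healthy():
            return subset
    raise RuntimeError("full restart did not restore health")
```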

The paper is realistic about the limitations of the crash-only approach, and does not paint an all rosy picture. Here are some quotes from the paper on this:
  • Building crash-only systems is not easy; the key to widespread adoption of our approach will require employing the right architectural models and having the right tools.
  • We are focusing initially on applications whose workloads can be characterized as relatively short-running tasks that frame state updates. Substantially all Internet services fit this description, in part because the nature of HTTP has forced designers into this mold. We expect there are many applications outside this domain that could not easily be cast this way, and for which deriving a crash-only design would be impractical or infeasible.
  • We expect throughput to suffer in crash-only systems, but this concern is secondary to the high availability and predictability we expect in exchange.
Comparison to self-stabilization
I think the crash-only approach is a special (more blackbox and more scalable) instance of self-stabilizing system design.

The crash-only approach uses the component crash abstraction to abstract away different fault-types and different arrival sequence of these faults. Similarly, the self-stabilization approach uses the arbitrary state corruption abstraction to avoid the need to characterize the effects of different types of faults. Of course the difference between the stabilization approach and crash-only approach is that stabilization requires access to the code and requires figuring out the good states of the system to converge to. In other words, stabilization is a chisel and crash-only is a large hammer. But, for scalability of adding fault-tolerance to large systems, you may be better off using a hammer than a chisel.

State-store approach and keeping application logic stateless also connects back to the self-stabilization approach. Building a stateless system is a trivial way to make a system self-stabilizing. If you don't have any state to corrupt, the system is trivially self-stabilizing. (Similarly, if the system is built to be soft-state, then the corrupted state expires with time and new/correct information is written in the new state, so the system is again easily shown to be self-stabilizing.)
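A soft-state store is easy to sketch (the class and its TTL policy are mine, for illustration): because every entry expires, corrupted state is flushed out on its own and replaced by fresh writes, which is what makes the stabilization argument go through.

```python
# Sketch of a soft-state store: expired (possibly corrupt) entries vanish,
# so stale state cannot persist past one TTL.

class SoftStateStore:
    def __init__(self, ttl):
        self.ttl = ttl
        self.data = {}                      # key -> (value, write_time)

    def put(self, key, value, now):
        self.data[key] = (value, now)

    def get(self, key, now):
        entry = self.data.get(key)
        if entry is None or now - entry[1] >= self.ttl:
            self.data.pop(key, None)        # expired: stale state is gone
            return None
        return entry[0]
```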

Final remarks
The paper includes the following sentence in the conclusions section, which I think is a very good summary/evaluation of the approach: "Writing crash-only components may be harder, but their simple failure behavior can make the assembly of such components into large systems easier."

Yes, the crash-only approach provides some compositionality of fault-tolerance: it uses restarts to transform arbitrary faults into crashes, and each crash-only component is written anticipating that it will interact with other crash-only components, so it can tolerate their crashes. However, things are not always easy in practice, and there are caveats, as we mentioned above. Some faults can leak through the crash-only abstraction and contaminate other components. There can also be hidden or emergent interactions and couplings between components that the developer needs to tune.

Wednesday, January 5, 2011

My Spring'11 Seminar on Cloud Computing

I am offering a seminar on Cloud Computing this semester. Below is the list of papers I plan to discuss in the seminar. I have also put up a course webpage here. If you have some suggestions on other good/recent papers to cover, please let me know in the comments.

(My colleague, Tevfik Kosar, who joined the department this semester, is also offering a seminar on Data Intensive Scientific Discovery. He will be posting his reading list soon; in the meanwhile, here is a link to his 2006 seminar reading list.)

Monday, January 3, 2011

CRDTs: Consistency without concurrency control

This paper appeared in ACM SIGOPS Operating Systems Review in April 2010. One of the authors on this paper is Marc Shapiro. We have previously discussed the "optimistic replication" survey by Shapiro here. (If you haven't read that survey, you should take a moment now to fill this gap. I'll wait.)

This paper is also related to optimistic replication. Recall that there are two approaches to replication due to the CAP theorem. One approach insists on maintaining a strong consistency among replicas and requires consensus for serializing all updates on the replicas. Unfortunately this approach does not scale beyond a small cluster. The alternative, optimistic replication, ensures scalability by giving up consistency guarantees, however in the absence of consistency, application programmers are faced with overwhelming complexity. Yes, for some applications eventual consistency may suffice, but even there "complexity and consensus are hiding under a different guise, namely of conflict detection and resolution".

(In the review below, I re-used sentences--and often whole paragraphs-- from the paper without paraphrasing. Those original sentences expressed the content clearly, and I got lazy. Sorry.)

What this paper shows is that in some (limited) cases, a radical simplification is possible. If concurrent updates to some datum commute, and all of its replicas execute all updates in causal order, then the replicas converge. The paper calls such data types, whose operations commute when they are concurrent, Commutative Replicated Data Types (CRDTs). The CRDT approach ensures that there are no conflicts, and hence no need for consensus-based concurrency control. Replicas of a CRDT eventually converge without any complex concurrency control. As an existence proof of a non-trivial, useful, practical, and efficient CRDT, the paper presents one that implements an ordered set with insert-at-position and delete operations. It is called Treedoc, because sequence elements are identified compactly using a naming tree, and because its first use was concurrent document editing. The paper presents some experimental data based on Wikipedia traces.

Now, on with the details of the Treedoc data structure and application.

The paper considers a collection of sites (i.e., networked computers), each carrying a replica of a shared ordered-set object, and connected by a reliable broadcast protocol (e.g., epidemic communication). This approach supports a peer-to-peer, multi-master execution model: some arbitrary site initiates an update and executes it against its local replica; each other site eventually receives the operation and replays it against its own replica. All sites eventually receive and execute all operations; causally-related operations execute in order, but concurrent operations may execute in different orders at different sites.

Two inserts or deletes that refer to different IDs commute. Furthermore, operations are idempotent, i.e., inserting or deleting the same ID any number of times has the same effect as once. To ensure commutativity of concurrent inserts, we only need to ensure that no two IDs are equal across sites. The ID allocation mechanism is described next.

ID allocation mechanism
Atom identifiers must have the following properties: (i) Two replicas of the same atom (in different replicas of the ordered-set) have the same identifier. (ii) No two atoms have the same identifier. (iii) An atom's identifier remains constant for the entire lifetime of the ordered-set. (iv) There is a total order "<" over identifiers, which defines the ordering of the atoms in the ordered-set. (v) The identifier space is dense. Property (v) means that between any two identifiers P and F with P < F, a fresh identifier N can always be allocated such that P < N < F.

This ID allocation is not difficult at all. The figures below show how it is done.

Treedoc insert and delete
A delete(TID) simply discards the atom associated with TID. The corresponding tree node is retained and marked as a tombstone.

To insert an atom, the initiator site chooses a fresh TID that positions it as desired relative to the other atoms. For instance, to insert an atom R to the right of atom L: If L does not have a right child, the TID of R is the TID of L concatenated with 1 (R becomes the right child of L). Otherwise, if L has a right child Q, then allocate the TID of the leftmost position of the subtree rooted at Q.
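The insert and delete rules above can be sketched as a small single-replica tree (my own simplification, assuming a TID is the path of 0/1 bits from the root, which matches the naming-tree idea but not necessarily the paper's exact encoding):

```python
# Minimal single-replica sketch of the Treedoc tree. A node's TID is its
# root-to-node bit path (0 = left, 1 = right); deletes leave tombstones;
# an in-order walk over live nodes yields the document contents.

class Node:
    def __init__(self, atom):
        self.atom = atom
        self.alive = True     # False once tombstoned
        self.left = None      # child reached by bit 0
        self.right = None     # child reached by bit 1

def insert_right_of(node, atom):
    """Insert atom immediately to the right of node; return the new node."""
    if node.right is None:
        node.right = Node(atom)      # TID = node's path + 1
        return node.right
    cur = node.right                 # otherwise: leftmost position in the
    while cur.left is not None:      # subtree rooted at the right child
        cur = cur.left
    cur.left = Node(atom)            # TID = that path + 0
    return cur.left

def delete(node):
    node.alive = False               # tombstone: tree shape is preserved

def contents(root):
    """In-order traversal of live atoms."""
    if root is None:
        return []
    out = contents(root.left)
    if root.alive:
        out.append(root.atom)
    return out + contents(root.right)

root = Node("a")
b = insert_right_of(root, "b")   # document: a b
insert_right_of(root, "x")       # between a and b: a x b
delete(b)
print(contents(root))            # ['a', 'x']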

Restructuring the tree
Depending on the pattern of inserts and deletes, the tree may become badly unbalanced or riddled with tombstones. To alleviate this problem, a restructuring operation "flatten" transforms a tree into a flat array, eliminating all storage overhead.

However, flattening does not genuinely commute with update operations. This is solved by using an update-wins approach: if a flatten occurs concurrently with an update, the update wins, and the flatten aborts with no effect. A two-phase commit protocol is used for this purpose (or, better, a fault-tolerant variant such as Paxos Commit). The site that initiates the flatten acts as the coordinator and collects the votes of all other sites. Any site that detects an update concurrent to the flatten votes no, otherwise it votes yes. The coordinator aborts the flatten if any site voted no or if some site is crashed.
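The update-wins commit decision itself is simple; here is a hedged sketch of just the coordinator's vote-counting step (my own simplification of the protocol described above, with crash detection abstracted away as a missing reply):

```python
# Coordinator's decision rule for a flatten, update-wins style: every
# site votes; a site that saw an update concurrent with the flatten
# votes "no", and a crashed (unreachable) site counts as a "no".

def decide_flatten(votes):
    """votes: site -> 'yes' | 'no' | None (None = crashed / no reply)."""
    return all(v == "yes" for v in votes.values())

print(decide_flatten({"s1": "yes", "s2": "yes"}))   # True: flatten commits
print(decide_flatten({"s1": "yes", "s2": "no"}))    # False: concurrent update wins
print(decide_flatten({"s1": "yes", "s2": None}))    # False: a site is down
```

This asymmetry is deliberate: aborting a flatten costs nothing (it is only an optimization), whereas aborting an update would lose user work.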

This commit-based solution also has problems. Commitment protocols are problematic in large-scale and dynamic systems; to alleviate this issue, only a small number of core servers participate in the flatten, while the remaining servers (the nebula) copy from the core and catch up. This is yet another patch, and it suffers from other problems (such as converting between flattened and unflattened atom IDs during catch-up). The paper explains how to solve those other problems. It is unclear whether the catching up leads to availability problems. In any case, it is clear that the flatten operation is the less elegant side of the Treedoc CRDT.

Experiments on Wikipedia edit history
The paper presents experimental results from real-world cooperative editing traces from Wikipedia. A number of Wikipedia pages were stored as Treedocs, interpreting differences between successive versions of a page as a series of inserts and deletes. In some experiments the atoms were words; in the ones reported below an atom is a whole paragraph.

Those sudden drops in the graph are due to the flatten operation.

One of the features of Treedoc is that it ensures causal ordering without vector clocks. This is made possible by an application-specific technique: when a replica sees that the node it is asked to insert does not yet have its parent in the tree, it delays applying that insert until the parent arrives. In general, though, we would need vector clocks, or at least logical clocks. The paper has this to say on that: "Duplicate messages are inefficient but cause no harm, since operations are idempotent. Therefore, a precise vector clock is not necessary; approximate variants that are scalable may be sufficient as long as they suppress a good proportion of duplicates."

I think this paper opens a very interesting path in the replication jungle. The CRDT technique does not have broad/general applicability. But for the cases where the technique is applicable, the benefit is huge. I wonder if using the Treedoc structure can achieve scalability benefits for collaborative real-time editors (GoogleDocs, Google Wave, EtherPad, SubEthaEdit [a peer-to-peer collaborative editor]).

I have seen a similar idea to CRDT before, in DISC 2001, "Stabilizing Replicated Search Trees". There the idea was to maintain a 2-3 search tree data structure at each replica, and despite transient state corruption (e.g., despite initially starting at different arbitrary states), the replicas eventually converge to the same correct state. In that work however, synchronous replication was required: all update operations were to be executed in the same order at all sites to keep consistency among the replicas. That is a pretty harsh requirement, and a deal breaker for wide area replication.
