Transactional storage for geo-replicated systems

This paper from SOSP 2011 describes a distributed storage system called Walter. Walter targets web applications that operate across multiple geographic sites, and it aims to balance consistency and latency in such systems by providing strong consistency within a site, and weaker consistency across sites.


Parallel Snapshot Isolation (PSI)

The paper introduces a new isolation property called Parallel Snapshot Isolation (PSI). PSI ensures consistent snapshots and transaction ordering within a site, causal ordering across sites, and prevention of write-write conflicts.

PSI extends snapshot isolation (see Fig 3 for illustration of snapshot isolation, and here for an example of snapshot isolated database modeling) by allowing different sites to have different commit orderings. For example, suppose site A executes transactions T1, T2 and site B executes transactions T3, T4. PSI allows site A to first incorporate just T1, T2 and later T3, T4, while site B first incorporates T3, T4 and later T1, T2. This flexibility is needed for asynchronous replication: site A (or site B) can commit transactions T1, T2 (or T3, T4) without coordinating with the other site and later propagate the updates.

Because of this PSI allows long fork anomaly that snapshot isolation disallows (see Fig 8). The paper argues that long forks are acceptable in web applications when users in a site do not expect their updates to be instantly visible across all sites. If the user wants to know that her updates are visible everywhere, she can wait for her transaction to commit at all sites.

Note that, although PSI allows different commit orderings at different sites, it still preserves the property of snapshot isolation that committed transactions have no write-write conflicts, thereby avoiding the need for conflict resolution. Furthermore, PSI preserves causal ordering: if a transaction T2 reads from T1 then T1 is ordered before T2 at every site. Therefore, PSI offers several benefits for distributed system applications:

  • Atomic multi-object updates
  • Consistent read snapshots
  • Atomic read-modify-write operations
  • Conditional writes


Basic ideas in Walter 

Walter implements PSI using two key techniques. The first is preferred sites: Assigning objects to specific sites for efficient writes. Preferred site does not mean a primary site, it is more flexible. In contrast to primary-copy database systems, here objects can be updated at any site, not just the preferred site.

Using regular replicated objects with preferred sites doesn't always suffice. For example, a friends list can be updated by users in many sites. To handle the conflicting updates arising here, counting sets (csets) are implemented. This is a data type that allows conflict-free updates across sites (see this 2009 paper). A cset is similar to a multiset in that it keeps a count for each element. But, unlike a multiset, the count could be negative. A cset supports an operation add(x) to add element x, which increments the counter of x in the cset; and an operation rem(x) to remove x, which decrements the counter of x. Because increment and decrement commute, add and rem also commute, and so operations never conflict.

This is critical to remember for understanding Walter's operation. There are two types of objects: regular and cset. For regular objects, the available operations are read and write; for cset objects, the available operations are read, add element, and delete element. Note that csets can be updated efficiently anywhere so all sites are considered preferred sites for them, while regular objects can be updated efficiently only at their designated preferred site.

Walter uses multi-version concurrency control within each site, and it can quickly commit transactions that write objects at their preferred sites or that use csets. For other transactions, Walter resorts to two-phase commit (2PC) to check for conflicts. And it is often possible to avoid these 2PC transactions in applications as the application and evaluation sections show.


Versions and vector timestamps

Instead of using a single timestamp, Walter uses version numbers and vector timestamps to track data changes across multiple sites. A version number is like a tag for each change, showing where it happened and in what order. Vector timestamps are like snapshots, showing how many changes from each site are included.

When a transaction starts, it gets a vector timestamp. This helps it know which changes it should see. The system can then check if a particular change should be visible to the transaction. Each site keeps a history of changes for every object it manages. When a transaction starts, it gets a snapshot of the current state across all sites.


Transaction execution

For transactions, Walter uses two types of commit protocols: fast and slow.

The fast commit is used when all objects being written are managed by the local site. That means, each objects should either be a regular object whose preferred site is this site, or be a cset object (for which any site counts as a preferred site due to their conflict free nature). Fast commit is quick because it only needs to check if the objects haven't been changed since the transaction started (checked by using the start vector timestamp and history at the local site) and aren't currently being modified by another transaction (checked by seeing locally if any of them are locked as part of a 2PC slow commit).

Transactions that write a regular object whose preferred site is not local must be committed using the slow commit protocol. Slow commit uses 2PC to ensure consistency across sites and involves asking permission from all relevant sites before making any changes.

After a transaction commits, Walter spreads the changes to other sites in the background. It ensures that enough sites have received the transaction and all its predecessors before marking it as "disaster-safe durable". A transaction commits at a remote site only after all its predecessors have committed there too.

Server failures are managed by using a replicated storage system for the transaction log, allowing a replacement server to pick up where the failed one left off.

For entire site failures, Walter offers two recovery options: wait for the site to come back online (conservative) or remove the failed site and reassign its responsibilities (aggressive). The latter involves a careful process of determining which transactions to keep and which to discard. When a previously failed site recovers, it goes through a synchronization process before being fully reintegrated into the system.

To address scalability issues, Walter allows data centers to be divided into smaller "local sites", each with its own server. This approach distributes the workload to improve system performance.


Applications

The authors created two social networking applications using Walter: WaltSocial and a ported version of ReTwis. Both applications aim to demonstrate Walter's ease of use in developing distributed social networking platforms, while maintaining data integrity and supporting multi-site operations.

WaltSocial is a Facebook-like platform that supports common social networking features. It uses various objects to store user data, including profile information and counting sets (csets) for friend lists, messages, events, and photo albums. Transactions ensure data consistency across multiple operations.

ReTwis, originally a Twitter clone using Redis, was adapted to use Walter. The adaptation to Walter allowed for multi-site updates and replaced Redis's atomic operations ( simple get/put, adding to or removing from a list, and adding or subtracting from an integer) with Walter's transactions and csets. The port process was straightforward, and took a good programmer without Walter experience less than a day to complete.


Evaluation

Transactions that modify objects at their preferred sites commit quickly, with a 99.9-percentile latency of 27ms on EC2. Committed transactions are asynchronously replicated to remote sites within twice the network round-trip latency.

For fast commit on regular objects, Figure 17 shows Walter’s aggregate throughput across sites as the number of sites varies. Read throughput is bounded by the RPC performance and scales linearly with the number of sites, reaching 157 Ktps (thousands of transactions per second) with 4 sites. Write throughput is lower than read throughput due to lock contention within a Walter server. Specifically, when a transaction commits, a thread needs to acquire a highly contended lock to check for transaction conflicts.

Transactions that modify csets outside of their preferred sites also commit quickly without cross-site coordination. WaltSocial uses csets extensively and processes user requests with a 99.9-percentile latency under 50ms.

In contrast, slow commit requires cross-site coordination. Figure 20 shows that for slow commit, the commit latency is determined by the round-trip time between VA and the farthest preferred site of objects in the writeset. This is because slow commit runs a two-phase protocol among the preferred sites of the objects in the writeset.

The overhead for supporting transactions in Walter is reasonable. Fig 23 shows that ReTwis running on Walter has a throughput 25% smaller than running on Redis in a single site, but Walter allows ReTwis to scale to multiple sites.


Discussion

This is a solid work, and it shows: it has been cited 550+ times since 2011. No individual part is novel (vector clocks, csets) but their composition together and application to WAN web applications have been novel. Walter is closer to CALM approach than traditional transactions papers. You have less consensus/coordination, but this still works when focusing on certain applications and Walter focuses on web applications across WANs. It is also important to recognize the definition of PSI (parallel snapshot isolation) to extend the snapshot isolation semantics. Asking the right question for isolation semantics was the big contribution here!

It is important to note that, Walter came before most of the CALM work applications to cloud, including Highly Available Transactions, and coordination avoidance in databases. But it was in the same year as "Consistency analysis in Bloom: a CALM and collected approach" which explored eventual consistency and convergence. CALM (consistency and logical monotonicity) principle states that monotonic programs guarantee order independence, and hence, eventual consistency. A monotonic program is one where any true statement remains to be true as new axioms arrive. In contrast, in a non-monotonic program, new input can cause the earlier output to be revoked. A monotonic program guarantees eventual consistency, whereas non-monotonicity requires the use of coordination logic.

Followup work on csets would be really interesting. Csets are limited but they can still be very useful. They can track quantities, such as items in a shopping cart, inventory levels, or object access counts. The count directly represents a meaningful value to the application. Moreover Csets can manage lists like friends, messages, or photo albums. In this case, the count is hidden from users, typically kept at zero (even when negative) or one (even when multiple) to indicate absence or presence. It would be useful to explore csets more and extend their capabilities and applications. 

Comments

Popular posts from this blog

Hints for Distributed Systems Design

Learning about distributed systems: where to start?

Making database systems usable

Looming Liability Machines (LLMs)

Advice to the young

Foundational distributed systems papers

Distributed Transactions at Scale in Amazon DynamoDB

Linearizability: A Correctness Condition for Concurrent Objects

Understanding the Performance Implications of Storage-Disaggregated Databases

Designing Data Intensive Applications (DDIA) Book