Monday, October 27, 2014

Clock-SI: Snapshot Isolation for Partitioned Data Stores Using Loosely Synchronized Clocks

This paper appeared in SRDS 2013, and is concerned with the snapshot isolation problem for distributed databases/data stores.

What is snapshot isolation (SI)?

(I took these definitions almost verbatim from the paper.)
SI is a multiversion concurrency control scheme with 3 properties:
1) Each transaction reads from a consistent snapshot, taken at the start of the transaction and identified by a snapshot timestamp. A snapshot is consistent if it includes all writes of transactions committed before the snapshot timestamp, and if it does not include any writes of aborted transactions or transactions committed after the snapshot timestamp.
2) Update transactions commit in a total order. Every commit produces a new database snapshot, identified by the commit timestamp.
3) An update transaction aborts if it introduces a write-write conflict with a concurrent committed transaction. Transaction T1 is concurrent with committed update transaction T2, if T1 took its snapshot before T2 committed and T1 tries to commit after T2 committed.

When a transaction starts, its snapshot timestamp is set to the current value of the database version. All its reads are satisfied from the corresponding snapshot. To support snapshots, multiple versions of each data item are kept, each tagged with a version number equal to the commit timestamp of the transaction that creates the version. The transaction reads the version with the largest version number smaller than its snapshot timestamp. If the transaction is read-only, it always commits without further checks. If the transaction has updates, its writes are buffered in a workspace. When the update transaction requests to commit, a certification check verifies that the transaction writeset does not intersect with the writesets of concurrent committed transactions. If the certification succeeds, the database version is incremented, and the transaction commit timestamp is set to this value.

What is the innovation in the Clock-SI paper?

The conventional SI implementations use a centralized timestamp authority for consistent versioning. This is because local clocks on different nodes may differ a lot (NTP synchronization may have 10s of ms of inaccuracies), and is not suitable for consistent versioning.

Clock-SI, instead, proposes a way to use loosely synchronized clocks to assign snapshot and commit timestamps to transactions. Compared to conventional SI, Clock-SI does not have a single point of failure and a potential performance bottleneck. It saves one round-trip message for a ready-only transaction (to obtain the snapshot timestamp), and two round-trip messages for an update transaction (to obtain the snapshot timestamp and the commit timestamp). A transaction's snapshot timestamp is the value of the local clock at the partition where it starts. Similarly, the commit timestamp of a local update transaction is obtained by reading the local clock.

If you read Google's Spanner paper, you know that Google Spanner solves this problem by introducing TrueTime, which uses atomic clocks.

How does Clock-SI work?

Clock-SI essentially response-delays a read in a transaction
1) to account for clock synchronization differences (epsilon) as in Fig1, and
2) to account for the pending commit of an update transaction.

In Fig1, the read arrives at time t′ on P2's clock, before P2’s clock has reached the value t, and thus t′ < t. The snapshot with timestamp t at P2 is therefore not yet available. Another transaction on P2 could commit at time t′′, between t′ and t, and change the value of x. This new value should be included in T1's snapshot.

T2's snapshot is unavailable due to the commit in progress of transaction T1, which is assigned the value of the local clock, say t, as its commit timestamp. T1 updates item x and commits. The commit operation involves a write to stable storage and completes at time t′. Transaction T2 starts between t and t′, and gets assigned a snapshot timestamp t′′, t < t′′ < t′. If T2 issues a read for item x, we cannot return the value written by T1, because we do not yet know if the commit will succeed, but we can also not return the earlier value, because, if T1's commit succeeds, this older value will not be part of a consistent snapshot at t′′.


The paper does not include a performance comparison to Spanner. The NTP synchronized clocks in the evaluation experiments have an NTP offset/epsilon less than 0.1 msec, which is actually more precise than Spanner's atomic clock! I guess this is thanks to the Gigabit Ethernet they use in their LAN deployment.

Discussion: Use of Hybrid Logical Clocks (HLC) for the Clock-SI problem

HLC is a hybrid version of logical clocks and physical clocks, introduced by us recently, to combine the advantages of both clocks, while avoiding their disadvantages. Since HLC captures happened-before relationship and uses this extra information in ordering, it does not need to wait out uncertainty regions of physical clock synchronization. Dually, since HLC is related to physical clocks it allows querying with respect to physical time. We had shown HLC's advantages for the consistent snapshot problem in our work.

Here we find that HLC indeed improves the clock-SI problem of snapshot isolation if it is used instead of physical clocks. HLC avoids the delay in Figure 1. HLC would not incur the delay because it also uses happened-before information as encoded in HLC clocks.

Saturday, October 18, 2014

Facebook's software architecture

I had summarized/discussed a couple papers (Haystack, Memcache caching) about Facebook's architecture before.

Facebook uses simple architecture that gets things done. Papers from Facebook are refreshingly simple, and I like reading these papers.

Two more Facebook papers appeared recently, and I briefly summarize them below.

TAO: Facebook's distributed data store for the social graph (ATC'13)

A single Facebook page may aggregate and filter 100s of items from the social graph. Since Facebook presents each user with customized content (which needs to be filtered with privacy checks) an efficient, highly available, and scalable graph data store is needed to serve this dynamic read-heavy workload.

Before Tao, Facebook's web servers directly accessed MySql to read or write the social graph, aggressively using memcache as a look aside cache (as it was explained in this paper).

The Tao data store implements a graph abstraction directly. This allows Tao to avoid some of the fundamental shortcomings of a look-aside cache architecture. Tao implements an objects and associations model and continues to use MySql for persistent storage, but mediates access to the database and uses its own graph-aware cache.
To handle multi-region scalability, Tao uses replication using the per-record master idea. (This multi-region scalability idea was again presented earlier in the Facebook memcache scaling paper.)

F4: Facebook's warm BLOB storage system (OSDI'14)

Facebook uses Haystack to store all media data, which we discussed earlier here.

Facebook's new architecture splits the media into two categories:
1) hot/recently-added media, which is still stored in Haystack, and
2) warm media (still not cold), which is now stored in F4 storage and not in Haystack.

This paper discusses the motivation for this split and how this works.

Facebook has big data! (This is one of those rare cases where you can say big data and mean it.) Facebook stores over 400 billion photos.

Facebook found that there is a strong correlation between the age of a BLOB (Binary Large OBject) and its temperature. Newly created BLOBs are requested at a far higher rate than older BLOBs; they are hot! For instance, the request rate for week-old BLOBs is an order of magnitude lower than for less-than-a-day old content for eight of nine examined types. Content less than one day old receives more than 100 times the request rate of one-year old content. The request rate drops by an order of magnitude in less then a week, and for most content types, the request rate drops by 100x in less than 60 days. Similarly, there is a strong correlation between age and the deletion rate: older BLOBs see an order of magnitude less deletion rate than the new BLOBs. These older content is called warm, not seeing frequent access like hot content, but they are not completely frozen either.

They also find that warm content is a large percentage of all objects. They separate the last 9 months Facebook data under 3 intervals: 9-6 mo, 6-3 mo, 3-0 months. In the oldest interval, they find that for the data generated in that interval more than 80% of objects are warm for all types. For objects created in the most recent interval more than 89% of objects are warm for all types. That is the warm content is large and it is growing increasingly.

In light of these analysis, Facebook goes with a split design for BLOB storage. They introduce F4 as a warm BLOB storage system because the request rate for its content is lower than that for content in Haystack and thus is not as hot. Warm is also in contrast with cold storage systems that reliably store data but may take days or hours to retrieve it, which is unacceptably long for user-facing requests. The lower request rate of warm BLOBs enables them to provision a lower maximum throughput for F4 than Haystack, and the low delete rate for warm BLOBs enables them to simplify F4 by not needing to physically reclaim space quickly after deletes.

F4 provides a simple, efficient, and fault tolerant warm storage solution that reduces the effective-replication-factor from 3.6 to 2.8 and then to 2.1. F4 uses erasure coding with parity blocks and striping. Instead of maintaining 2 other replicas, it uses erasure coding to reduce this significantly.

The data and index files are the same as Haystack, the journal file is new. The journal file is a write-ahead journal with tombstones appended for tracking BLOBs that have been deleted. F4 keeps dedicated spare backoff nodes to help with BLOB online reconstruction. This is similar to the use of dedicated gutter nodes for tolerating memcached node failures in the Facebook memcache paper.
F4 has been running in production at Facebook for over 19 months. F4 currently stores over 65PB of logical data and saves over 53PB of storage.


1) Why go with a design that has a big binary divide between hot and warm storage? Would it be possible to use a system that handles hot and warm as gradual degrees in the spectrum? I guess the reason for this design is its simplicity. Maybe it is possible to optimize things by treating BLOBs differentially, but this design is simple and gets things done.

2) What are the major differences in F4 from the Haystack architecture? F4 uses erasure coding for replication: Instead of maintaining 2 other replicas, erasure coding reduces replication overhead significantly.  F4 uses write-ahead logging and is aggressively optimized for read-only workload. F4 has less throughput needs. (How is this reflected in its architecture?)

Caching is an orthogonal issue handled at another layer using memcache nodes. I wonder if the caching policies treat content cached from Haystack versus F4 differently.

3) Why is energy-efficiency of F4 not described at all? Can we use grouping tricks to get cold machines/clusters in F4 and improve energy-efficiency further as we discussed here?

4) BLOBs have large variation in size. Can this be utilized in F4 to improve access efficiency? (Maybe treat/store very small BLOBs differently, store them together, don't use erasure coding for them. How about very large BLOBs?)

Facebook monitoring tools (The Facebook Mystery machine)
The Facebook Stack (by Malte Schwarzkopf) 

Wednesday, October 15, 2014

Paper Summary: A self-configurable geo-replicated cloud storage system

This paper is a followup work to the Pileus work, which I had covered here. Pileus aimed to help developers find a suitable consistency/latency combination for their application deployment. In Pileus, the configuration of primary and secondary nodes is assumed to be fixed (some storage nodes are designated as primary nodes, which hold the master data, while others are secondary nodes). The developer uses an SLA to state ranked preferences for latency and consistency of the reads that would make most sense for the application, and using this SLA, Pileus provides dynamic tuning of the performance of the application by deciding which read to forward to which replica/master.

This paper introduces a followup system, called Tuba, where the configuration is not fixed, and can be changed on the fly. Tuba extends Pileus to address the problem of finding an optimal configuration of primary and secondary replicas that maximizes the overall utility and minimizes the cost for the application.

Tuba extends Pileus with a configuration service (CS) delivering the following capabilities:
1. performing a reconfiguration periodically for different tablets, and
2. informing clients of the current configuration for different tablets.

Data is organized into tablets as access units, and each tablet is assigned to primary and secondary nodes. Of course different tablets may be assigned to different primary and secondary nodes. In order for the CS to configure a tablet's replicas to maximize the overall utility, the CS must be aware of the way the tablet is being accessed globally. Therefore, all clients in the system periodically send their observed latency and the hit and miss ratios of their SLAs to the CS.

Once a new configuration is decided, one or more of the following operations are performed as the system changes to the new configuration: (i) changing the primary replica, (ii) adding or removing secondary replicas, and (iii) changing the synchronization periods between primary and secondary replicas.

Configuration service (CS)

The CS is a centralized service. To improve utility, the CS selects a new configuration by first generating all configurations that satisfy a list of defined constraints. To permute configurations, it is free to use one of the following operations:
+ Adjust the Synchronization Period
+ Add/Remove Secondary Replica
+ Change Primary Replica
+ Add Primary Replica

For example, for a missed subSLA with strong consistency, two potential new configurations would be: (i) creating a new replica near the client and making it the solo primary replica, or (ii) adding a new primary replica near the client and making the system run in multi-primary mode. As another example, for a social network application that spans Brazil and India, the CS may decide to swap the primary and secondary replica roles to improve utility. During peak times in India, the secondary replica in South Asia becomes the primary replica. Likewise, during peak times in Brazil, the replica in Brazil becomes primary.

The constraints for configurations may include (i) Geo-replication factor, (ii) Location, (iii) Synchronization period, and (iv) Cost constraints. With the cost constraint, the application developers can indicate how much they are willing to pay (in terms of dollars) to switch to a new configuration. For instance, one possible configuration is to put secondary replicas in all available datacenters. While the gained utility for this configuration would be great, the cost of this configuration is likely unacceptably large.

For each configuration possibility that meets the constraints, the CS computes the expected gained utility and the cost of reconfiguration. The CS considers the following costs for a new configuration:
+ Storage: the cost of storing a tablet in a particular site.
+ Read/Write Operation: the cost of performing read/write operations.
+ Synchronization: the cost of synchronizing a secondary replica with a primary one.

Finally the CS choses the configuration that offers the highest utility-to-cost ratio, and executes the reconfiguration operations required to transition to the new one.

Client execution modes

Clients need to avoid two potential safety violations: (i) performing a read operation with strong consistency on a non-primary replica, or (ii) executing a write operation on a non-primary replica. Based on the freshness of a client's view, the client is either in fast or slow mode. A client is in the fast mode for a given tablet if it knows the locations of primary and secondary replicas, and it is guaranteed that the configuration will not change in the near future. Whenever a client suspects that a configuration may have changed, it enters slow mode until it refreshes its local cache.

Fast Mode. When a client is in fast mode, read and single-primary write operations involve a single round-trip to one selected replica. No additional overhead is imposed on these operations. Multi-primary write operations use a three-phase protocol in fast or slow mode.

Slow Mode. Slow mode has no affect on read operations with relaxed consistency. On the other hand, since read operations with strong consistency must always go to a primary replica, the client needs to validate that the responding replica to a strong-consistency read is still a primary replica. If not, the client retries the read operation. Write operations are more involved when a client is in slow mode. Any client in slow mode that wishes to execute a write operation on a tablet needs to take a non-exclusive lock on the tablet's configuration before issuing the write operation. On the other hand, the CS needs to take an exclusive lock on the configuration if it decides to change the set of primary replicas. This lock procedure is required to ensure the linearizability of write operations.


They implemented Tuba to extend Microsoft Azure Storage, with broad consistency choices (as in Bayou), consistency-based SLAs (as in Pileus), and a novel replication configuration service. Their evaluation compared with a system that is statically configured. An experiment with clients distributed in datacenters around the world shows that reconfiguration every two hours increases the fraction of reads guaranteeing strong consistency from 33% to 54%. This confirms that automatic reconfiguration can yield substantial benefits which are realizable in practice.

SEA is South East Asia and  WEU is West Europe.

Wednesday, October 8, 2014

Consistent snapshot analogies

Last week I taught distributed snapshot in my CSE 586: Distributed Systems class. While I teach snapshot, I invariably find myself longing for analogies to provide some intuition about this concept. The global state captured by a distributed snapshot (say using Lamport/Chandy marker algorithm) does not correspond to the "state of the system at initiation of the snapshot". Furthermore, it also may not correspond to a "state of the system from initiation to current state during this computation". This is because while the snapshot taking is progressing in the system, the underlying system computation is also proceeding and changing the state of the system progressively. (Distributed snapshot is not allowed to stop/freeze underlying system computation as that reduces availability.)

For those curious about the question, "what good is a snapshot then?": The snapshot captures a reachable state from initiation state, and from the snapshot state the current state of the computation is also reachable. In other words, snapshot is a likely state of the computation, even though it may not have occurred in this particular computation. So, for stable predicate detection and distributed system debugging the snapshot is still valuable.

Going back to my predicament, the analogy I resort to is that of 1000 ants trying to take/construct a picture of the elephant as the elephant is moving. (I had heard this example from Paul Sivilotti while I was a graduate student at Ohio State.) Here the ants correspond to the marker algorithm, and the elephant the underlying computation that we want to take a snapshot of. Of course the pictures the ants will construct will be vaguely elephant-like, it will be a picture of the elephant's outer surface as it progresses in the spacetime continuum. (Achievement Unlocked: Today I used spacetime continuum in serious writing.)

Last week I was using this analogy in class, when a better (at least more modern) analogy occurred to me. Panoramic photographs! When you use your smartphone to take a panorama picture, you are in fact taking a distributed snapshot of your surrounding. Your snapshot is not instantaneous, it needs time to complete: you need to rotate 180 to 360 degrees and probably that takes a good 5-10 seconds. If in the meanwhile something moves, that object will not be reflected in its original form/place/state in your panorama picture.

We may attempt to take the analogy further to categorizing the panorama pictures as consistent snapshots and inconsistent snapshots. In an inconsistent snapshot, although the send of a message is not recorded as part of the snapshot, the receive of the message is recorded as part of the snapshot. (You received a message from the future.) So we can say that, your panorama picture is inconsistent if the object moves in the opposite direction of the panorama/snapshot. These are examples of inconsistent snapshots.

And, these are examples of consistent snapshot. (Maybe the last two are debatable as they duplicate some state.)

Finally, this seemingly-consistent inconsistent snapshot (the bearded guy on the leftmost is teleported to reappear as the rightmost person) points to the dangers of ignoring backchannels when taking a snapshot.

Probably it is not worth trying to strain the analogy further, so I will stop.  Here are some more funny iphone panoramic pictures. 

Two-phase commit and beyond

In this post, we model and explore the two-phase commit protocol using TLA+. The two-phase commit protocol is practical and is used in man...