Sunday, November 23, 2014

Google Cloud Messaging (GCM): An evaluation

I had written about our work on building a crowdsourced superplayer for the "Who wants to be a millionaire (WWTBAM)" quiz show earlier. In that work we developed an Android app that enabled the app users in Turkey to participate in WWTBAM in real time as the show was airing on TV. When a question was read by the show host, my PhD students typed the question and the multiple-choice options, which were transmitted via Google Cloud Messaging (GCM) to the app users. App users played the game, and enjoyed competing with other app users, and we got a chance to collect precious data about MCQA dynamics in crowdsourcing. Our app was downloaded 300K+ times, and at the peak of its popularity 20K participants played the game simultaneously.

We used GCM to send the questions to the participants because we wanted to keep the app simple. GCM is the default push messaging solution for the Android platform and is maintained by Google as a free service with no quotas. GCM allows app developers to send push messages to Android devices from the server. As an alternative to GCM, we could have developed our app to maintain open TCP connections with the backend servers, but this would have made our backend design complicated and we would need many machines to be able to serve the need. Using GCM, we were able to send quiz questions as push messages easily, and can get away with using only one backend server to collect the answers from the app users. Another alternative to GCM was to deploy our messaging service, say using MQTT, but since we were developing this app as an experiment/prototype, we wanted to keep things as simple as possible.

After we deployed our app, we noticed that there were no studies or analysis of GCM performance, and we wondered whether GCM was enough to satisfy the real-time requirements in our application. Since we had a lot of users, we instrumented our app to timestamp the reception time of the GCM messages and reply back to the backend server with this information.

Our evaluation was done for both offline and online mode of GCM messaging. In the online mode, we send the message to the online participants that are playing the game while WWTBAM is broadcasting. That is, in the online mode, we know that the client devices are powered on, actively in use and have network connection. In the offline mode, we send the GCM message at a random time, so the server has no knowledge of whether the client devices are powered on, are in use, or have network connection. Table2 shows the number of devices involved in each of our experiments.

We found a giant tail in the GCM delivery latencies. Figure 4 above shows the message arrival time cumulative distribution in the online and offline modes. Note that the x-axis is in logarithmic scale. Table 6 shows message arrival times broken into quartiles. The median and the average message arrival times in the table indicate that the latency in the offline experiment is significantly high compared to the online experiment.

In Figure 5, comparing the GCM delivery times under WiFi versus cellular data connection in the offline mode shows that GCM delivery latencies are lower using the cellular data connection.
To investigate the GCM arrival times in the offline scenario further, we devised a double message experiment, where we send 2 GCM messages initiated at the same time on our server. This time, instead of measuring the message arrival time, we measure the difference between the arrival times of the first message and the arrival time of the second message. We were expecting to see very low time difference between these twin messages, but Figure 9 shows that the giant tail is preserved, and that some of the devices receive the second message hours after receiving the first one. Since we know that, at the time the first of the twin messages has arrived to the device, the device is reachable by Google’s GCM servers, the latency for the second message should be due to scheduling it to be sent independently than the first message.

These results show that the GCM message delivery is unpredictable, namely having a reliable connection to Google's GCM servers on the client device does not by itself guarantee a timely message arrival. It would be worth replicating this experiment where all participants are in the US, to see how that changes the results.

Our experiments raised more questions than it answered. Delving deeper at the documentation, we found some information that may explain some of the problems with GCM in the offline mode, but we need more examples to test these and understand the behavior better. It turns out that both cellular and WiFi modes keep a long lived connection to GCM servers. In the case of deep sleep mode (where user does not turn on the screen for long), the CPU wakes up and sends heartbeat to the servers every 28 minutes on cellular, every 15 minutes on the WiFi. This causes the connection to drop in many cases, because routers and providers kill the inactive TCP connections without sending any info to the client and/or server. Not all users are affected by this problem, it depends on carrier and/or router they use.

Related links

Our GlobeCom14 paper includes more information.

Saturday, November 15, 2014

Paper Summary: Granola, Low overhead distributed transaction coordination

This paper is by Cowling and Liskov, and it appeared at Usenix ATC'12.  Most closely related papers to this paper are the Sinfonia and Calvin papers. So, it may be helpful to also read the summaries of those papers from the links above to familiarize yourself with them.

This paper looks at coordination of 1-round transactions, which are different from general transactions that involve multi-round interaction with the client. 1-round transactions execute at the participant nodes, with no communication with other nodes except for at most a single commit/abort vote.

Figure 1 shows the system architecture. The clients are the transaction managers for these 1-round transactions. They initiate transactions and evaluate the commit/conflict/abort decisions returned for the transactions. The repositories communicate with one another (for at most a single commit/abort vote) to coordinate transactions.

There are 2 types of transactions in Granola: coordinated ones, and uncoordinated ones. Actually, better/less-confusing names for these 2 types would be lock-coordinated transactions and timestamp-coordinated transactions, since the latter type of transactions still involve coordination among the repositories.

To reflect this duality in coordination mode, each repository runs in one of two modes. When there are no coordinated transactions running at a repository, it runs in timestamp mode. When a repository receives a request for a coordinated transaction, it switches to the locking mode.

Uncoordinated transactions

There are two kinds of uncoordinated transactions: single repository transactions, or independent distributed transactions. Granola uses a timestamped-based coordination to provide serializability to both kinds of uncoordinated transactions and avoids locking for them. Examples of independent distributed transactions include read-only transactions, and local-read transactions, such as give every employee a 2% raise.

(In our previous work, the slow-fast paper (2010), we had also made an analysis similar to the independent transactions idea. Our slow-fast analysis inspects the program actions, which are precondition guarded assignment statements, and determined for which actions the atomicity can be relaxed so that they can execute in an uncoordinated manner. Our finding was that if the precondition of a program action is "locally-stable" (i.e., this precondition predicate cannot get falsified by execution of other program actions), then it is safe to execute this program action in an uncoordinated/nonatomic manner.)

Repositories are assumed to have access to loosely synchronized clocks (say NTP synchronized clocks). The use of timestamps in Granola are for ordering/serializability of transactions, and logical clocks would also suffice. Granola needs time synchronization for performance/throughput not for correctness. This separation is always a welcome one. (I can't resist but refer to our hybrid logical clock, HLC, work here. I think HLC would improve exposition of Granola, and would make it tolerant to malformed clients which may increase highTS and deny service to normal clients' requests.)

Figure 5 shows the straightforward single repository execution, and Figure 6 shows the more interesting independent transaction execution. Since these are 1-round transactions, the commit/abort decisions are local and one-shot at repositories. Voting is used to notify other repositories about the local decision (a conflict vote cause a repository to abort the transaction), and also for nominating a proposed timestamp for the transaction. The transaction is assigned the highest timestamp from among the votes.

In the "pure" timestamp mode, all single repository and independent distributed transactions commit successfully, and serializability is guaranteed as they are executed in timestamp order. This provides a substantial reduction in overhead from locking and aborts, which improve the throughput.

While in the timestamp mode, coordinated transactions may also arrive which may lead to conflict decisions to be returned for the single or distributed independent transactions executing on those repositories. Those repositories will switch to coordinated mode to serve the coordinated transactions. Once all coordinated transactions have completed those repositories can again transition to the timestamp mode.

Dually, in locking mode, repositories can still serve single repository and distributed independent transactions provided they do not conflict, i.e., they do not need to update a locked item. (In effect, the timestamp mode is just a shortcut to denote the lack of any coordinated transactions in the system.)

Coordinated transactions

Figure 7 shows coordinated transaction execution. Coordination is needed for a transaction that requires a remote read for abort/commit decision. For example, the transaction to transfer \$50 from Alice's account to Bob's account first needs to remote-read Alice's account to certify that it indeed has more than \$50. (Recall that in contrast for an independent distributed transaction the read, or the guard of the transaction, was local: give every employe 2% raise.) Different from independent distributed transaction execution, the coordinated transaction execution involves a prepare phase to acquire required locks. Locks ensure serializability for coordinated transaction execution. So timestamp order may not be satisfied in commit of transactions, and thus, external consistency may not be provided for coordinated transaction execution.


This paper is written with a focus on describing the system in reasonable detail to potential users of the system. This diverges a bit from the academic style, which would put the focus on the novelties and "intellectual merit" (a la NSF). Maybe this is because of the style/focus of the USENIX ATC conference. The evaluation section is also really nice and detailed. (Granola performs like Sinfonia for coordinated transacations, and improve throughput for uncoordinated transactions.)

It seems that Granola does not offer much advantage over Calvin. The paper compares in the related work with Calvin and states the following. "The Calvin transaction coordination protocol was developed in parallel with Granola, and provides similar functionality. Rather than using a distributed timestamp voting scheme to determine execution order, Calvin delays read/write transactions and runs a global agreement protocol to produce a deterministic locking order."

I think Granola may have an edge over Calvin for low-latency, especially for transactions that involve a couple repositories. The best way to see why is to consider this analogy.
Granola:Calvin :: CSMA:TDMA.

Friday, November 14, 2014

Paper Summary: Calvin, Distributed transactions for database systems

Calvin is a transaction scheduling and replication management layer for distributed storage systems. By first writing transaction requests to a durable, replicated log, and then using a concurrency control mechanism that emulates a deterministic serial execution of the log's transaction requests, Calvin supports strongly consistent replication and fully ACID distributed transactions, and manages to incur lower inter-partition transaction coordination costs than traditional distributed database systems.

Calvin emphasizes modularity. The holy trinity in Calvin is: log, scheduler, executor. When a client submits a transaction request to Calvin, this is immediately appended to a durable log, before any actual execution begins. Calvin's scheduling mechanism then processes this request log, deciding when each transaction should be executed in a way that maintains an invariant slightly stronger than serializable isolation: Transaction execution may be parallelized but must be equivalent to a deterministic serial execution in log-specified order. As a result, the log of transaction requests itself serve as an ultimate "source of truth" about the state of a database, which makes the recovery very easy.

I thought Calvin resembles the Tango approach a lot. (I had discussed Tango here recently.) It is almost as if Calvin is Tango's cousin in databases domain. As such, Calvin has similar strengths and disadvantages like Tango. For the advantages: Calvin provides good throughput, but will not get stars for low-latency. Calvin provides scalable replication, and strongly-consistent replication. (After you have one authoritative log, this is not hard to provide anyways.)

The centralized log is the source of all disadvantages in Calvin as well.  The transactions need to always go through the centralized log; so there are no truly local transactions. Thus Calvin will perform worse for workloads that have local/non-coordinating workload. So, the TPC-C workload Calvin uses for evaluation is actually best workload to show Calvin's relative performance to other systems.

The Log component

Calvin uses Paxos to achieve availability of the log by replicating it consistently. A group of front-end servers collect client requests into batches. Each batch is assigned a globally unique ID and written to an independent, asynchronously replicated block storage service such as Voldemort or Cassandra. Once a batch of transaction requests is durable on enough replicas, its GUID is appended to a Paxos “MetaLog”. Readers can then reassemble the global transaction request log by concatenating batches in the order that their GUIDs appear in the Paxos MetaLog.

Batching trades off throughput with low-latency: you cannot have transaction latency lower than the batching duration (epoch). So an epoch is a guaranteed overhead on latency for every transaction.

The scheduler component

The Scheduler component (which is a centralized component) examines a transaction before it begins executing and decides when it is safe to execute the whole transaction, then hands the transaction request off to the storage backend for execution with no additional oversight. The storage backend therefore does not need to have any knowledge of the concurrency control mechanism or implementation. Once a transaction begins running, the storage backend can be sure that it can process it to completion without worrying about concurrent transactions accessing the same data. However, each storage backend must provide enough information prior to transactional execution in order for the scheduler to make a well-informed decision about when each transaction can safely (from a concurrency control perspective) execute.

For transaction execution, the scheduler still uses locks. Deterministic locking ensures concurrent execution equivalent to the serial transaction order in the log, and also makes deadlock impossible (and the associated nondeterministic transaction aborts).


I don't know why Calvin doesn't adopt Tango style log maintenance: Using chain replication to improve throughput of the centralized log. This might actually help Calvin.

Similarly Calvin should also adopt selective/custom stream replication per replica feature in Tango. That would implement the flexibility/generality of Calvin.

Related links

Saturday, November 1, 2014

Paper Summary: Coordination Avoidance in Database Systems

Serializing transactions is sufficient for correctness, but it is not necessary for all operations of all applications. The downside of serialization is that it kills scalability and is overkill in many cases.

This paper (which will appear in VLDB'15) has the following insight: Given knowledge of application transactions and correctness criteria (i.e., invariants), it is possible to avoid this over-coordination of serializability and execute some transactions without coordination while still preserving those correctness criteria (invariants).

In particular the authors propose the concept of "invariant confluence" to relax the use of serialization for some operations of a coordination-requiring application. By operating on application-level invariants over database states (e.g., integrity constraints), the invariant confluence analysis provides a necessary and sufficient condition for safe, coordination-free execution. When programmers specify application invariants, this analysis allows databases to coordinate only when concurrency may violate those application invariants.

So how do they get the application invariants? "Many production databases today already support invariants in the form of primary key, uniqueness, foreign key, and row-level check constraints. We analyze this and show that many are invariant-confluent, including forms of foreign key constraints unique value generation, and check constraints, while others, like primary key constraints are, in general, not."

They claim that many common integrity constraints found in SQL and standardized benchmarks are invariant confluent, allowing order-of-magnitude performance gains over coordinated execution. To substantiate this claim, they apply invariant confluence analysis to a database prototype and show 25-fold improvement over prior TPC-C New-Order performance on a 200 server cluster. They find that 10 out of 12 of TPC-C's invariants are invariant-confluent, under the workload transaction.

The invariant-confluence model

Invariant-confluence captures a simple (informal) rule: coordination can be avoided if and only if all local commit decisions are globally valid. (In other words, the commit decisions are composable.)

They model transactions to operate over independent logical snapshots of database state. Transaction writes are applied at one or more snapshots initially when the transaction commits and are then integrated into other snapshots asynchronously via a merge operator that incorporates those changes into the snapshot's state. "Merge" is simply the set union of versions, and is used to capture the process of reconciling divergent states.

In effect, this model states that each transaction can modify its replica state without modifying any other concurrently executing transactions' replica state. Replicas therefore provide transactions with partial snapshot views of global state. They define local validity/consistency as a safety property, but global replica consistency is not defined as a safety property. Instead it is defined as a liveness property under the name "convergence".

(Formal definition of invariant-confluent:)
A set of transactions T is I-confluent with respect to invariant I if, for all I-T-reachable states Di, Dj with a common ancestor state, Di union Dj is I-valid.

Applying the invariant-confluence concept

As the definition implies, I-confluence holds for specific combinations of invariants and transactions. Removing a user from the database is I-confluent with respect to the invariant that the user IDs are unique. However, two transactions that remove two different users from the database are not I-confluent with respect to the invariant that there exists at least one user in the database at all times. As another example, uniqueness is not I-confluent for inserts of unique values. However, reads and deletions are both I-confluent under uniqueness invariants: reading and removing items cannot introduce duplicates.

Table 3 summarizes the 12 invariants found in TPC-C benchmark as well as their I-confluence analysis results as determined by Table 2. They classify the invariants into 3 broad categories: materialized view maintenance, foreign key constraint maintenance, and unique ID assignment.

Figure 5 shows the concurrency/throughput improvement made possible by applying the invariant-confluence analysis to the TPC-C workload.

Related work

This paper is an extension of the "CALM" approach that uses monotonicity and convergence concepts to relax the coordination needs of applications.

The "Scalable commutatitivity rule" paper, which was one of the best papers in SOSP'13, is a closely related work. In order to relax serializability and boost concurrency, that work prescribes exploiting the commutativity of operations. Another related work that exploits commutativity to relax serializability is the "Making Geo-Replicated Systems Fast as Possible, Consistent when Necessary" paper. The invariant confluence analysis concept is more general (but probably harder to apply) than the commutativity rule approach, because while commutativity is sufficient for correctness it is not always necessary.

In our previous work, the slow-fast paper (2010), we had also used the concept of "invariant-relaxed serializability" in distributed systems domain, particularly in application to wireless sensor/actor network concurrency control. (Maybe somewhat of a misnomer we called a slow action as one which can be executed in a concurrent/uncoordinated/nonatomic manner, and a fast action as one which needs to be executed in an atomic/coordinated/no-conflicts manner.)

I suspect our slow-fast approach used a less aggressive optimization than invariant-confluence: Slow-fast did not require/inspect program invariants explicitly, it only required access to the program actions (i.e., transactions). The slow-fast approach inferred that the invariant holds when program actions (transactions) execute atomically. (Invariant-confluence may potentially use even a weaker invariant than what slow-fast used.)

Our slow-fast analysis inspects the program actions, which are precondition guarded assignment statements, and determined for which actions the atomicity can be relaxed so that they can execute in an uncoordinated manner. Our finding was that if the precondition of a program action is "locally-stable" (i.e., this precondition predicate cannot get falsified by execution of other program actions), then it is safe to execute this program action in an uncoordinated/nonatomic manner. (This check probably implies "mergeability" of the state.) Our analysis also prescribed ways to break a coordination requiring action into two smaller actions to make it coordination free.

The "coordination avoidance in databases" paper applies the "invariant-relaxed serializability" idea in a more restricted and more useful domain, database transactions, and demonstrates the idea in a very practical way.

Two-phase commit and beyond

In this post, we model and explore the two-phase commit protocol using TLA+. The two-phase commit protocol is practical and is used in man...