Wednesday, October 7, 2015

HPTS trip report (days 0 and 1)

Last week, from Sunday to Tuesday night, I attended the 16th International Workshop on High Performance Transaction Systems (HPTS). HPTS is an unconventional workshop. "Every two years, HPTS brings together a lively and opinionated group of participants to discuss and debate the pressing topics that affect today's systems and their design and implementation, especially where scalability is concerned. The workshop includes position paper presentations, panels, moderated discussions, and significant time for casual interaction. The only publications are slide decks by presenters who choose to post them." HPTS is by invitation only and keeps it under 100 participants. The workshop brings together experts from both industry and academia so they mix and interact. Looking at the program committee, you can see names of entrepreneurs venture capitalists (David Cheriton, Sequoia Capital), large web companies (Google, Facebook, Salesforce, Cloudera), and academics. HPTS is legacy of Jim Gray, and among its regular participants include Mike Stonebraker (Turing Award winner) and C. Mohan (IBM).

HPTS travel and venue

HPTS is always held at the same venue, Asilomar Conference Grounds, Pacific Grove, CA. My flights Sunday morning (JetBlue BUF-JFK, JFK-SFO) were smooth and on time. I even get to do some writing on the JFK-SFO flight. Since Asilomar is not easily accessible from SFO (or San Jose airport for that matter), I had to rent a car. I used the scenic Route 1 for my drive. It was a 3 hour drive. I stopped a couple of times to take pictures. I made it to the Asilomar conference center at 5pm, just enough time to settle before the dinner at 6pm.

The Asilomar conference grounds is just on the edge of the Monterey Bay. It is overlooking the Pacific, and next to white sand dunes (a nature reserve area) and a nice beach. The barracks were ran down and showing their age. There is a dining hall, a separate building where the HPTS participants dined together as a group (breakfast/lunch/dinner). The talks were held in the chapel, another building close to the dining hall. The program was so full that we would go to our rooms only for sleeping, and that too briefly. After dinner, there was social hour the first night, and lightning talks the next 2 nights, all accompanied by beer and wine. And after 9pm, the crowd moved to Director's cottage, a fancy vacation house for chatting till midnight, lubed by whisky, wine, beer (see, the order is different this time). Then breakfast starts at 7:30am, rinse and repeat each day.

HPTS first night

So Sunday evening, at the first dinner, I somehow sat right next to David Cheriton. He is smart and curious. He asked me to explain about my work, but I wasn't able to communicate the motivation for our hybrid clocks work. I suspect this was partly because we don't share the same terminology/background, and partly because I was unprepared to explain the motivation from a database industry perspective. David was interested and persistent and pushed to understand the problem completely, a trait shared by ultra successful people like him. After spending 10 minutes, I was the party to quit, and told David that I hope to clarify these in the lightning talk, which I proceeded to botch up Monday night :-(

There was a meet and greet after the dinner, from 7-9pm, at the chapel. I am actually a shy guy, but I made an effort to meet people. When I saw a crowd of 3 or more people, I joined to listen and participate. I had nice conversations with the Salesforce crew. I asked them a lot of questions and learned a lot.

At 9pm, the group moved to the director's cottage. I was very tired from the JFK-SFO flight and the 3 hour drive, so after hanging around in the director's cottage for 10 minutes, I went to bed. Since I was jetlagged I woke up at 4am, tried to sleep again but woke up at 6 am. I went for a run, for 3.2 miles. It felt good. Sometimes the best way to fight off exhaustion is by exercising.

Monday morning sessions

Pat Helland (Salesforce) opened the workshop. He said that the workshop is a platform for the academicians and practitioners of databases interact and exchanged ideas. He urged participants to make good use of the 30 minute coffee breaks between the sessions. He said that "actually the sessions are there to punctuate the breaks" :-). Phil Bernstein (Microsoft) asked us to refrain from live blogging or tweeting, as some speakers may talk about confidential technology. Speakers who like release their slidedecks after a week to be posted at the HPTS website. Checking back I see 1/3rds of the slides from talks are posted, and almost all the lightning talk slides are posted.

Here are some of the interesting talks from the two sessions (transactions and applications sessions) Monday morning.

The talk "A1 and FARM scalable graph database on top of a transactional memory layer" from Microsoft Research was about a high performance graph database platform enabled by three hadware trends: 1. inexpensive DRAM (currently $8/GB, machines with 128GB, container will hold more than 100TBs), 2. nonvolatile RAM (DRAM + battery + SSD), and 3. fast commodity networks with RDMA becoming available. (Turns out the Microsoft Research group has an SOSP 15 paper describing this system.)

Kyle Kingsburry, a computer safety researcher at Stripe, gave a nice talk on his Jepsen toolkit for blackbox verification of systems by injecting network partitions and testing basic invariants.

"The Quick and the Dead" talk by James Barrese (PayPal) described Paypal technology initiatives for creating a PAAS platform. These were motivated by a need to innovate quickly and to automate everything.

"From Microservices to Teraservices" talk by Adrian Cockcroft (Battery Ventures) described how microservices and containerazation are useful for accelerating innovation by allowing doing daily releases. Adrian defined a microservice as a loosely coupled service oriented architecture with bounded contexts.

Monday afternoon sessions

In the "Data Federations: An Idea Whose Time Has Come Again", Michael Stonebraker (MIT) talked about their bigdawg polystore platform which will be released on github in a couple months.

"From Trash to Treasure" talk by Pat Selinger (Paradata) was about how to clean and integrate dirty customer data with duplicates, missing values, corrupted values, and invalid values.

I really liked Rodrigo Fonseca's (Brown Univ.) talk on causal metadata tracking in distributed systems. 

Eric Grosse (Google Security Team) gave an informative talk titled "Security Lessons from the Front Lines".

Monday evening lighting talks

Monday evening lighting talks slides are available here. The lighting talks are of 5 minute duration. In my talk, I somehow ran out of 3 minutes in the first 2 slides. After I got the "last 2 minute warning" and I rushed through a couple more slides waiving my arms frantically :-) I should have prepared and practiced before the talk. I was upset about how the talk went but several people (including Phil Bernstein at Microsoft Research) showed interest and approached me later to learn more about the hybrid clocks, which made me feel better.

I will write about day 2, newly-minted McArthur Genius Chris Re's talk, and the overall buzz at HPTS in a couple days. I hope more slides will be added to HPTS site by then.

Tuesday, October 6, 2015

Consensus in the wild

The consensus problem has been studied in the theory of distributed systems literature extensively. Consensus is a fundamental problem in distributed systems. It states that n nodes agree on the same decision eventually. Consistency part of the specification says that no two nodes decide differently. Termination states that all nodes eventually decide. And NonTriviality says that the decision cannot be static (you need to decide a value among inputs/proposals to the system, you can't keep deciding 0 discarding the inputs/proposals). This is not a hard problem if you have reliable and bounded-delay channels and processes, but becomes impossible in the absence of either. And with even temporary violation of reliability and timing/synchronicity assumptions, a consensus system can easily spawn multiple corner-cases where consistency or termination is violated. E.g., 2-phase commit is blocking (this violates termination), and 3-phase commit is unproven and has many corner cases involving the old leader waking up in the middle of execution of the new leader (this violates consistency).

Paxos appeared in 1985 and provided a fault-tolerant solution to consensus. Paxos dealt with asynchrony, process crash/recovery, and message loss in a uniform and elegant algorithmic way. When web-scale services and datacenter computing took off in early 2000s, fault-tolerant consensus became a practical concern. Google started to run into corner cases of consensus that introduced downtime. Luckily Google had people who had academic background in distributed systems (like Tushar Chandra) and they knew what to do. Paxos algorithm got adopted at Google in the Chubby lock service, and used in Google File System and for replicating master node in Map Reduce systems. Then Paxos, the algorithm only distributed systems researchers knew about, got popular in the wild. Several other companies adopted Paxos, and several opensource implementations appeared.

(Had we not have a well-enginereed robust algorithm for consensus in the form of Paxos, what would happen? It would probably be a mess with many groups coming up with their own implementation of a consensus protocol which would be buggy in some small but significant manner.)

My student Ailidani and I are working on a survey of consensus systems in the wild. We compare different flavors of the Paxos consensus protocol with their associated advantages and drawbacks. We also survey how consensus protocols got adopted in the industry and for which jobs. Finally, we discuss where Paxos is used in an overkill manner, where a consensus algorithm could be avoided, or could be tucked out of the main/critical pathway (consensus is expensive afterall.).

Paxos flavors

There are three main/popular flavors: classical multi-Paxos, ZooKeeper Zab protocol, and Raft protocol.

The classical multi-Paxos protocol is nicely reviewed and presented in Robbert Van Rennesse's "Paxos Made Moderately Complex" paper.

Zab is used in ZooKeeper, the popular "coordination kernel". ZooKeeper is used by Hadoop (replicating master at HDFS, Map Reduce), and in the industry for keeping/replicating configurations (Netflix, etc.)

Raft provides a flavor of Paxos very similar to Zab. It comes with a focus on understandability and simplicity and has seen several opensource implementations.

Differences between Paxos and Zab

Zab provides consensus by atomic broadcast protocol. Zab implements a primary process as the distinguished leader, which is the only proposer in the system. The log entries flow only from this leader to the acceptors.

The leader election in Paxos can be concurrent with the ongoing consensus requests/operations, and multiple leaders may even get requests proposed and accepted. (Mencius/e-Paxos systematize this and use it for improving throughput.) In contrast, in Zab, a new leader cannot start proposing a new value before it passes a barrier function which ensures that the leader has the longest commit history and every previously proposed value are commited at each acceptor. This way, Zab divides time into three sequential phases.

Another major difference between Zab and Paxos is that Zab protocol also includes client interaction, which introduced an additional order guarantee, per-client FIFO order. All requests from a given client are executed in the order that they were sent by the client. Such guarantee does not hold with Paxos.

Differences between Zab and Raft

There isn't much difference between Zab and Raft. ZooKeeper keeps a filesystem like API and hierarchical znodes, whereas Raft does not specify the state machine. On the whole, if you compare Zab (the protocol underlying ZooKeeper) and Raft there aren't any major differences in each component, but only minor implementation differences.

Abusing Paxos consensus

1) Paxos is meant to be used as fault-tolerant storage of *metadata*, not data. Abusing Paxos for replicated storage of data will kill the performance.

Apache Giraph made this mistake in aggregators. (This was mentioned in the Facebook's recent Giraph paper.) In Giraph, workers would write partial aggregated values to znodes (Zookeeper's data storage) and the master would aggregate these and write the final result back to its znode for the workers to access. This wasn't scalable due to Zookeeper write throughput limitations and caused a big problem for Facebook which needed to support very large sized aggregators.

In the same vein, using Paxos for queueing or messaging service is a bad idea. When the number of messages increase, performance doesn't scale.

What is the right way of approaching this then? Use chain replication! Chain replication uses Paxos for fault-tolerant storage of metadata:"the configuration of replicas in the chain" and lets replication/storage of data occur in the chain, without involving Paxos. This way, Paxos doesn't get triggered with every piece of data entering the system. Rather it gets triggered rarely, only if a replica fails and a new configuration needs to be agreed.

Apache Kafka and Bookkeeper work based on this principle and are the correct ways to address the above two scenarios.

2) Paxos implies serializability but serializability does not imply Paxos. Paxos provides a total order on operations/requests replicated over k replicas and can be an overkill for achieving serializability for two reasons. First Paxos's true goal is fault-tolerant replication and serialization only its side effect. If you just need serializability and don't need fault-tolerant replication of each operation/request, then Paxos slows your performance. Secondly, Paxos gives you total order but serializability does not require a total order. A partial order that is serializable is good enough and gives you more options.

Monday, October 5, 2015

Analysis of Bounds on Hybrid Vector Clocks

This work is in collaboration with Sorrachai Yingchareonthawornchai and Sandeep Kulkarni at the Michigan State University and is currently under submission.

Practice of distributed systems employs loosely synchronized clocks, mostly using NTP. Unfortunately, perfect synchronization is unachievable due to messaging with uncertain latency, clock skew, and failures. These sync errors lead to anomalies. For example, a send event at Branch1 may be assigned a timestamp greater than the corresponding receive event at Branch2, because Branch1's clock is slightly ahead of Branch2's clock. This leads to /inconsistent state snapshots/ because, at time T=12:00, a money transfer is recorded as received at Branch2, whereas it is not recorded as sent at Branch1.

Theory of distributed systems shrugs and doesn't even try. Theory abstracts away from the physical clock and uses logical clocks for ordering events. These are basically counters, as in Lamport's clocks and vector clocks. The causality relationship captured by these logical clocks is defined based on passing of information rather than passing of time. As such, it is not possible to query events in relation to physical time.

Recently, we introduced a third option, hybrid clocks. Hybrid clocks combine the best of logical and physical clocks and avoid their disadvantages. Hybrid clocks are loosely synchronized using NTP, yet they also provide provable comparison conditions as in LC or VC.

Our hybrid clocks come in two flavors: hybrid logical clocks (HLC) and hybrid vector clocks (HVC). HLC satisfy the logical clock comparison condition and find applications in multiversion distributed database systems (such as in CockroachDB) where it enables efficient querying of consistent snapshots for read transactions, and ensures that commits of write transactions do not get delayed despite the uncertainties in NTP clock synchronization. HVC satisfy the vector clock comparison condition: in contrast to HLC that can provide a single consistent snapshot for a given time, HVC provide all consistent snapshots for that given time. HVC find applications in debugging and in causal delivery of messages.

The space requirement of VC is shown to be of size n, the number of nodes in the system, which is prohibitive. HVC reduces the overhead of causality tracking in VC by using the fact that the clocks are reasonably synchronized within epsilon. If j does not hear (directly or transitively) from k within epsilon then hvc.j[k] need not be explicitly maintained. We still infer that hvc.j[k] equals hvc.j[j]-epsilon, thanks to clock sync. So hvc.j only maintains entries for nodes that talked to j within last epsilon and provided a fresh timestamp higher than hvc.j[j]-epsilon. This way HVC can potentially scale the VC benefits to many thousands of processes by still maintaining small HVC at each process.

HVC bounds

But how effective are HVC for reducing the size of VC? To address this question, we developed an analytical model that uses four parameters, epsilon: uncertainty of clock synchronization, delta: minimum message delay, alpha: message sending rate, and n: number of nodes in the system. We use a model with random unicast message transmissions and derive the size of HVC in terms of a delay differential equation.

This differential equation captures the rate of propogation of "redness". Red means the node maintains an entry for a node j. Initially only j is red and all other nodes are green. If a red node communicates a message to a green node which is received within epsilon, that node also becomes red (starts maintaining an entry for j in its hvc). A red node may turn back to being green if it doesn't receive a message that contains fresh information about j in the last epsilon.

Our model and simulations show the HVC size is a sigmoid function with respect to increasing epsilon: it has a slow start but grows exponentially after a critical phase transition. Before the phase transition threshold, HVC maintains couple entries per node, however when a threshold is crossed, a node not only gets entries added to its clock from direct interaction but also indirect transfer from another processes HVC, and this makes the HVC entries blow up. In other words, the redness goes viral after a threshold. We derive this threshold as (1/alpha + delta)* ln((2-√3)*(n-1)), for alpha*delta<1.

Our bounds are tight and we find that the size predicted by our model is almost identical to the results obtained by our simulation results.

Two-phase commit and beyond

In this post, we model and explore the two-phase commit protocol using TLA+. The two-phase commit protocol is practical and is used in man...