Wednesday, October 7, 2015

HPTS trip report (days 0 and 1)

Last week, from Sunday to Tuesday night, I attended the 16th International Workshop on High Performance Transaction Systems (HPTS). HPTS is an unconventional workshop. "Every two years, HPTS brings together a lively and opinionated group of participants to discuss and debate the pressing topics that affect today's systems and their design and implementation, especially where scalability is concerned. The workshop includes position paper presentations, panels, moderated discussions, and significant time for casual interaction. The only publications are slide decks by presenters who choose to post them." HPTS is by invitation only and keeps attendance under 100 participants. The workshop brings together experts from both industry and academia so they can mix and interact. Looking at the program committee, you can see names of entrepreneurs and venture capitalists (David Cheriton, Sequoia Capital), large web companies (Google, Facebook, Salesforce, Cloudera), and academics. HPTS is a legacy of Jim Gray, and its regular participants include Mike Stonebraker (Turing Award winner) and C. Mohan (IBM).

HPTS travel and venue

HPTS is always held at the same venue, the Asilomar Conference Grounds in Pacific Grove, CA. My flights Sunday morning (JetBlue BUF-JFK, JFK-SFO) were smooth and on time. I even got to do some writing on the JFK-SFO flight. Since Asilomar is not easily accessible from SFO (or the San Jose airport, for that matter), I had to rent a car. I took the scenic Route 1 for the 3-hour drive and stopped a couple of times to take pictures. I made it to the Asilomar conference center at 5pm, with just enough time to settle in before dinner at 6pm.

The Asilomar conference grounds sit right on the edge of Monterey Bay, overlooking the Pacific, next to white sand dunes (a nature reserve) and a nice beach. The barracks were run down and showing their age. There is a dining hall, a separate building where the HPTS participants dined together as a group (breakfast/lunch/dinner). The talks were held in the chapel, another building close to the dining hall. The program was so full that we would go to our rooms only for sleeping, and that too briefly. After dinner, there was a social hour the first night, and lightning talks the next 2 nights, all accompanied by beer and wine. And after 9pm, the crowd moved to the Director's cottage, a fancy vacation house, for chatting till midnight, lubed by whisky, wine, beer (see, the order is different this time). Then breakfast starts at 7:30am, rinse and repeat each day.

HPTS first night

So Sunday evening, at the first dinner, I somehow sat right next to David Cheriton. He is smart and curious. He asked me to explain my work, but I wasn't able to communicate the motivation for our hybrid clocks work. I suspect this was partly because we don't share the same terminology/background, and partly because I was unprepared to explain the motivation from a database industry perspective. David was interested and persistent and pushed to understand the problem completely, a trait shared by ultra successful people like him. After spending 10 minutes, I was the one to quit, and told David that I hoped to clarify these points in the lightning talk, which I proceeded to botch up Monday night :-(

There was a meet and greet after the dinner, from 7-9pm, at the chapel. I am actually a shy guy, but I made an effort to meet people. When I saw a crowd of 3 or more people, I joined to listen and participate. I had nice conversations with the Salesforce crew. I asked them a lot of questions and learned a lot.

At 9pm, the group moved to the director's cottage. I was very tired from the JFK-SFO flight and the 3-hour drive, so after hanging around in the director's cottage for 10 minutes, I went to bed. Since I was jetlagged, I woke up at 4am, tried to sleep again, but woke up at 6am. I went for a 3.2-mile run. It felt good. Sometimes the best way to fight off exhaustion is by exercising.

Monday morning sessions

Pat Helland (Salesforce) opened the workshop. He said that the workshop is a platform for the academics and practitioners of databases to interact and exchange ideas. He urged participants to make good use of the 30-minute coffee breaks between the sessions. He said that "actually the sessions are there to punctuate the breaks" :-). Phil Bernstein (Microsoft) asked us to refrain from live blogging or tweeting, as some speakers may talk about confidential technology. Speakers who wish to do so release their slide decks after a week to be posted on the HPTS website. Checking back, I see that about a third of the talk slides are posted, and almost all the lightning talk slides are posted.

Here are some of the interesting talks from the two sessions (transactions and applications sessions) Monday morning.

The talk "A1 and FARM scalable graph database on top of a transactional memory layer" from Microsoft Research was about a high performance graph database platform enabled by three hadware trends: 1. inexpensive DRAM (currently $8/GB, machines with 128GB, container will hold more than 100TBs), 2. nonvolatile RAM (DRAM + battery + SSD), and 3. fast commodity networks with RDMA becoming available. (Turns out the Microsoft Research group has an SOSP 15 paper describing this system.)

Kyle Kingsbury, a computer safety researcher at Stripe, gave a nice talk on his Jepsen toolkit for blackbox verification of systems by injecting network partitions and testing basic invariants.

"The Quick and the Dead" talk by James Barrese (PayPal) described Paypal technology initiatives of migrating to node.js for front end systems, and openstack and docker containerization for creating a paas platform. These were motivated by a need to innovate quickly and to automate everything.

"From Microservices to Teraservices" talk by Adrian Cockcroft (Battery Ventures) described how microservices and containerazation are useful for accelerating innovation by allowing doing daily releases. Adrian defined a microservice as a loosely coupled service oriented architecture with bounded contexts.

Monday afternoon sessions

In the "Data Federations: An Idea Whose Time Has Come Again", Michael Stonebraker (MIT) talked about their bigdawg polystore platform which will be released on github in a couple months.

"From Trash to Treasure" talk by Pat Selinger (Paradata) was about how to clean and integrate dirty customer data with duplicates, missing values, corrupted values, and invalid values.

I really liked Rodrigo Fonseca's (Brown Univ.) talk on causal metadata tracking in distributed systems. 

Eric Grosse (Google Security Team) gave an informative talk titled "Security Lessons from the Front Lines".

Monday evening lightning talks

Monday evening lightning talk slides are available here. The lightning talks are 5 minutes each. In my talk, I somehow burned through 3 minutes on the first 2 slides. After I got the "last 2 minute warning", I rushed through a couple more slides waving my arms frantically :-) I should have prepared and practiced before the talk. I was upset about how the talk went, but several people (including Phil Bernstein at Microsoft Research) showed interest and approached me later to learn more about the hybrid clocks, which made me feel better.

I will write about day 2, newly-minted MacArthur Genius Chris Re's talk, and the overall buzz at HPTS in a couple of days. I hope more slides will be added to the HPTS site by then.

Tuesday, October 6, 2015

Consensus in the wild

The consensus problem has been studied extensively in the theory of distributed systems literature. Consensus is a fundamental problem in distributed systems. It requires n nodes to eventually agree on the same decision. The consistency part of the specification says that no two nodes decide differently. Termination states that all nodes eventually decide. And non-triviality says that the decision cannot be static (you need to decide a value among the inputs/proposals to the system; you can't keep deciding 0 while discarding the inputs/proposals). This is not a hard problem if you have reliable and bounded-delay channels and processes, but it becomes impossible in the absence of either. And with even a temporary violation of the reliability and timing/synchrony assumptions, a consensus system can easily hit corner cases where consistency or termination is violated. E.g., 2-phase commit is blocking (this violates termination), and 3-phase commit is unproven and has many corner cases involving the old leader waking up in the middle of the execution of the new leader (this violates consistency).
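To make the three properties concrete, here is a minimal Python sketch (my own illustration, not any protocol's code) that checks them on a recorded run of some hypothetical protocol:

```python
# Checks the three consensus properties on a recorded run.
def check_consensus(proposals, decisions):
    """proposals: node -> proposed value; decisions: node -> decided value or None."""
    decided = [v for v in decisions.values() if v is not None]
    consistency = len(set(decided)) <= 1                            # no two nodes decide differently
    termination = all(v is not None for v in decisions.values())    # every node decides
    nontriviality = all(v in proposals.values() for v in decided)   # decided value was proposed
    return consistency, termination, nontriviality

# Example run: node "c" never decides, so termination is violated.
print(check_consensus({"a": 1, "b": 2, "c": 2},
                      {"a": 2, "b": 2, "c": None}))   # (True, False, True)
```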

Paxos, introduced by Lamport in 1989, provided a fault-tolerant solution to consensus. Paxos dealt with asynchrony, process crash/recovery, and message loss in a uniform and elegant algorithmic way. When web-scale services and datacenter computing took off in the early 2000s, fault-tolerant consensus became a practical concern. Google started to run into corner cases of consensus that introduced downtime. Luckily Google had people with an academic background in distributed systems (like Tushar Chandra) and they knew what to do. The Paxos algorithm was adopted at Google in the Chubby lock service, and used in the Google File System and for replicating the master node in MapReduce. Then Paxos, the algorithm only distributed systems researchers knew about, got popular in the wild. Several other companies adopted Paxos, and several opensource implementations appeared.

(Had we not had a well-engineered, robust algorithm for consensus in the form of Paxos, what would have happened? It would probably be a mess, with many groups coming up with their own implementations of a consensus protocol, each buggy in some small but significant manner.)

My student Ailidani and I are working on a survey of consensus systems in the wild. We compare different flavors of the Paxos consensus protocol along with their associated advantages and drawbacks. We also survey how consensus protocols got adopted in industry and for which jobs. Finally, we discuss where Paxos is used in an overkill manner, where a consensus algorithm could be avoided, or could be tucked out of the main/critical path (consensus is expensive, after all).

Paxos flavors

There are three main/popular flavors: classical multi-Paxos, ZooKeeper Zab protocol, and Raft protocol.

The classical multi-Paxos protocol is nicely reviewed and presented in Robbert van Renesse's "Paxos Made Moderately Complex" paper.

Zab is used in ZooKeeper, the popular "coordination kernel". ZooKeeper is used by Hadoop (replicating the master in HDFS and MapReduce), and in industry for keeping/replicating configurations (Netflix, etc.)

Raft provides a flavor of Paxos very similar to Zab. It comes with a focus on understandability and simplicity and has seen several opensource implementations.

Differences between Paxos and Zab

Zab provides consensus via an atomic broadcast protocol. Zab implements a primary process as the distinguished leader, which is the only proposer in the system. The log entries flow only from this leader to the acceptors.

Leader election in Paxos can be concurrent with the ongoing consensus requests/operations, and multiple leaders may even get requests proposed and accepted. (Mencius/EPaxos systematize this and use it for improving throughput.) In contrast, in Zab, a new leader cannot start proposing a new value before it passes a barrier function which ensures that the leader has the longest commit history and that every previously proposed value is committed at each acceptor. This way, Zab divides time into three sequential phases.

Another major difference between Zab and Paxos is that the Zab protocol also includes client interaction, which introduces an additional ordering guarantee: per-client FIFO order. All requests from a given client are executed in the order that they were sent by the client. Such a guarantee does not hold with Paxos.
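To make the per-client FIFO guarantee concrete, here is a small Python check over a committed log; the (client, sequence-number) trace format is hypothetical, not Zab's wire format:

```python
# Verify that each client's requests appear in the committed log in the order
# that client issued them (per-client FIFO order).
def fifo_order_respected(committed_log):
    last_seq = {}
    for client, seq in committed_log:
        if seq <= last_seq.get(client, -1):
            return False          # an earlier request of this client was committed later
        last_seq[client] = seq
    return True

print(fifo_order_respected([("c1", 0), ("c2", 0), ("c1", 1)]))  # True
print(fifo_order_respected([("c1", 1), ("c1", 0)]))             # False: FIFO violated
```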

Differences between Zab and Raft

There isn't much difference between Zab and Raft. ZooKeeper keeps a filesystem-like API and hierarchical znodes, whereas Raft does not specify the state machine. On the whole, if you compare Zab (the protocol underlying ZooKeeper) and Raft, there aren't any major differences in each component, only minor implementation differences.

Abusing Paxos consensus

1) Paxos is meant to be used as fault-tolerant storage of *metadata*, not data. Abusing Paxos for replicated storage of data will kill the performance.

Apache Giraph made this mistake in its aggregators. (This was mentioned in Facebook's recent Giraph paper.) In Giraph, workers would write partially aggregated values to znodes (ZooKeeper's data storage) and the master would aggregate these and write the final result back to its znode for the workers to access. This wasn't scalable due to ZooKeeper write throughput limitations and caused a big problem for Facebook, which needed to support very large aggregators.

In the same vein, using Paxos for a queueing or messaging service is a bad idea. When the number of messages increases, performance doesn't scale.

What is the right way of approaching this then? Use chain replication! Chain replication uses Paxos for fault-tolerant storage of metadata, i.e., "the configuration of replicas in the chain", and lets replication/storage of data occur along the chain, without involving Paxos. This way, Paxos doesn't get triggered with every piece of data entering the system. Rather, it gets triggered rarely, only when a replica fails and a new configuration needs to be agreed on.
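Here is a minimal Python sketch of that division of labor (my own illustration, with a mocked config store standing in for the Paxos/ZooKeeper-backed metadata service): consensus only holds the chain configuration, while data writes flow down the chain without touching Paxos.

```python
class ConfigStore:
    """Stand-in for the Paxos-replicated metadata store: holds only the chain config."""
    def __init__(self, chain):
        self.chain = list(chain)          # e.g., ["r1", "r2", "r3"]

    def reconfigure(self, failed_replica):
        # Consensus is invoked only here, when a replica fails and the chain changes.
        self.chain = [r for r in self.chain if r != failed_replica]
        return self.chain

def write(config, storage, key, value):
    # Data path: apply the write at the head and propagate it down the chain.
    for replica in config.chain:
        storage.setdefault(replica, {})[key] = value
    return config.chain[-1]               # the tail acknowledges the write

storage = {}
cfg = ConfigStore(["r1", "r2", "r3"])
write(cfg, storage, "x", 42)              # no consensus round involved
cfg.reconfigure("r2")                     # consensus happens only on failure/reconfiguration
```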

Apache Kafka and Apache BookKeeper work based on this principle and are the correct ways to address the above two scenarios.

2) Paxos implies serializability, but serializability does not imply Paxos. Paxos provides a total order on operations/requests replicated over k replicas and can be overkill for achieving serializability, for two reasons. First, Paxos's true goal is fault-tolerant replication; serializability is only a side effect. If you just need serializability and don't need fault-tolerant replication of each operation/request, then Paxos slows down your performance. Second, Paxos gives you a total order, but serializability does not require a total order. A partial order that is serializable is good enough and gives you more options.

Monday, October 5, 2015

Analysis of Bounds on Hybrid Vector Clocks

This work is in collaboration with Sorrachai Yingchareonthawornchai and Sandeep Kulkarni at Michigan State University and is currently under submission.

The practice of distributed systems employs loosely synchronized clocks, mostly using NTP. Unfortunately, perfect synchronization is unachievable due to messaging with uncertain latency, clock skew, and failures. These sync errors lead to anomalies. For example, a send event at Branch1 may be assigned a timestamp greater than the corresponding receive event at Branch2, because Branch1's clock is slightly ahead of Branch2's clock. This leads to inconsistent state snapshots because, at time T=12:00, a money transfer is recorded as received at Branch2, whereas it is not recorded as sent at Branch1.
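Here is a toy numeric illustration of that anomaly (the numbers are made up, just to show the effect):

```python
# Branch1's clock runs 40 ms ahead, so the send gets a later timestamp than the receive.
branch1_skew, branch2_skew = 0.040, 0.000        # seconds of clock error (hypothetical)
true_send_time, network_delay = 100.000, 0.010   # seconds

send_ts = true_send_time + branch1_skew                       # 100.040, stamped at Branch1
recv_ts = true_send_time + network_delay + branch2_skew       # 100.010, stamped at Branch2
assert recv_ts < send_ts
# A snapshot taken "at time T=100.020" shows the transfer as received at Branch2
# but not yet sent at Branch1: an inconsistent state snapshot.
```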

The theory of distributed systems shrugs and doesn't even try. Theory abstracts away from the physical clock and uses logical clocks for ordering events. These are basically counters, as in Lamport's clocks and vector clocks. The causality relationship captured by these logical clocks is defined based on the passing of information rather than the passing of time. As such, it is not possible to query events in relation to physical time.

Recently, we introduced a third option, hybrid clocks. Hybrid clocks combine the best of logical and physical clocks and avoid their disadvantages. Hybrid clocks are loosely synchronized using NTP, yet they also provide provable comparison conditions as in LC or VC.

Our hybrid clocks come in two flavors: hybrid logical clocks (HLC) and hybrid vector clocks (HVC). HLC satisfy the logical clock comparison condition and find applications in multiversion distributed database systems (such as in CockroachDB) where it enables efficient querying of consistent snapshots for read transactions, and ensures that commits of write transactions do not get delayed despite the uncertainties in NTP clock synchronization. HVC satisfy the vector clock comparison condition: in contrast to HLC that can provide a single consistent snapshot for a given time, HVC provide all consistent snapshots for that given time. HVC find applications in debugging and in causal delivery of messages.

The space requirement of VC is shown to be of size n, the number of nodes in the system, which is prohibitive. HVC reduces the overhead of causality tracking in VC by using the fact that the clocks are reasonably synchronized within epsilon. If j does not hear (directly or transitively) from k within epsilon, then hvc.j[k] need not be explicitly maintained; we can still infer that hvc.j[k] equals hvc.j[j]-epsilon, thanks to clock synchronization. So hvc.j only maintains entries for nodes that talked to j within the last epsilon and provided a fresh timestamp higher than hvc.j[j]-epsilon. This way HVC can potentially scale the VC benefits to many thousands of processes while still maintaining a small HVC at each process.
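The pruning idea looks roughly like the following Python sketch (a simplification for illustration, not our exact algorithm; it assumes NTP-synchronized physical clocks, with time.time() standing in for them):

```python
import time

class HVC:
    """Keep explicit entries only for nodes heard from within the last epsilon."""
    def __init__(self, node_id, epsilon):
        self.id, self.eps = node_id, epsilon
        self.entries = {node_id: time.time()}     # node -> freshest known timestamp

    def tick(self):                               # local or send event
        self.entries[self.id] = max(self.entries[self.id], time.time())
        self._prune()

    def merge(self, received_entries):            # receive event
        self.tick()
        for k, t in received_entries.items():
            if t > self.entries[self.id] - self.eps:            # keep only fresh info
                self.entries[k] = max(self.entries.get(k, 0.0), t)
        self._prune()

    def get(self, k):                             # implicit entry for silent nodes
        return self.entries.get(k, self.entries[self.id] - self.eps)

    def _prune(self):                             # drop entries older than epsilon
        cutoff = self.entries[self.id] - self.eps
        self.entries = {k: t for k, t in self.entries.items() if t > cutoff}
```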

HVC bounds

But how effective are HVC at reducing the size of VC? To address this question, we developed an analytical model that uses four parameters: epsilon, the uncertainty of clock synchronization; delta, the minimum message delay; alpha, the message sending rate; and n, the number of nodes in the system. We use a model with random unicast message transmissions and derive the size of HVC in terms of a delay differential equation.

This differential equation captures the rate of propagation of "redness". Red means the node maintains an entry for a node j. Initially only j is red and all other nodes are green. If a red node communicates a message to a green node and the message is received within epsilon, that node also becomes red (it starts maintaining an entry for j in its hvc). A red node may turn back to green if it doesn't receive a message containing fresh information about j within the last epsilon.
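For intuition, here is a rough event-driven simulation of that redness dynamic in Python (my own simplification for this post, not the paper's delay differential equation model; parameter values are arbitrary):

```python
import random

def simulate_redness(n=100, alpha=10.0, delta=0.001, epsilon=0.05, duration=5.0):
    """Count how many nodes end up red (maintaining an entry for node 0)."""
    last_heard = [float("-inf")] * n           # timestamp of freshest info about node 0
    t = 0.0
    while t < duration:
        t += random.expovariate(alpha * n)     # next random unicast send in the system
        sender, receiver = random.sample(range(n), 2)
        info = t if sender == 0 else last_heard[sender]
        if (t + delta) - info < epsilon:       # info about node 0 still fresh on arrival
            last_heard[receiver] = max(last_heard[receiver], info)
    last_heard[0] = t                          # node 0 always knows about itself
    return sum(1 for x in last_heard if t - x < epsilon)

print(simulate_redness(epsilon=0.05), simulate_redness(epsilon=1.0))
```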

Our model and simulations show that the HVC size is a sigmoid function with respect to increasing epsilon: it has a slow start but grows exponentially after a critical phase transition. Before the phase transition threshold, HVC maintains only a couple of entries per node; however, when the threshold is crossed, a node not only gets entries added to its clock from direct interaction but also via indirect transfer from other processes' HVCs, and this makes the HVC entries blow up. In other words, the redness goes viral after a threshold. We derive this threshold as (1/alpha + delta) * ln((2-√3)*(n-1)), for alpha*delta < 1.

Our bounds are tight, and we find that the size predicted by our model is almost identical to the results obtained from our simulations.

When we plug very aggressive numbers into the above formula (incessant message sending at a 10Gbps rate over a network with 1 microsecond propagation delay), we find that the phase transition occurs at 1.73 secs. NTP provides epsilon around 10ms, much smaller than 1.73 secs. So we conclude that for all practical applications/environments, HVC sizes remain only a couple of entries at each node. Of course the HVC expands to capture causality fully within an epsilon uncertainty slice when a recent spike in communication warrants it. However, these do not go viral/epidemic and blow up the HVCs of other nodes; instead they diminish after the burst/spike in communication.

Tuesday, September 15, 2015

Serving at NSF panels and what it teaches about how to pitch the perfect proposal

NSF is one of the largest funding sources for academic research. It accounts for about one-fourth of federal support to academic institutions for basic research. NSF accepts thousands of proposals from researchers and organizes peer-review panels to decide which ones to fund.

Serving at NSF panels is fun. It is also very useful for understanding the proposal review dynamics. NSF funding rates are around 10% for computer science and engineering research proposals, so understanding the dynamics of the panel is useful when applying to NSF to secure funding.

How do you get invited as a panelist? 

You get invited to serve at an NSF panel by the program director of that panel. (Program directors are researchers generally recruited from the academia to serve at NSF for a couple years to run panels and help make funding decisions.)

If you have been around and have successfully secured NSF funding, you will get panel invitations. They will have your name and contact you. But if you are new, don't just wait. You can email program managers at NSF in your topic area and ask them to consider inviting you as a panelist, because you would benefit from the experience. This doesn't always work, but it helps. I found this from NSF about volunteering as a reviewer at a panel.

Preparation for a panel: Reading, reading, reading, writing reviews

As a panelist, you will be assigned around 8 proposals to review in 3 weeks. Each proposal body consists of 15 pages. So that means a lot of reading in the next couple of weeks. And it can get boring if you just read idly. You should try to read actively, in dialogue with the proposal, and write notes as you read that will help you prepare your review.

Tips on reading papers also apply somewhat to reading proposals, but proposals have their peculiarities. Proposals need to have a nicely motivated vision and clearly defined research problems, but they don't need to have all the solutions worked out. Instead of full-fledged solutions, proposals provide draft solution approaches and insights into the competitive advantage for attacking these research questions.

Pitching a successful proposal is an art. I provide some tips at the end of this post.

NSF panel structure

You travel to NSF headquarters in Washington DC the day before the panel. (NSF pays for your travel and reimburses your stay.) Panels are generally one and a half days. On the first day all the proposals get discussed. (Actually, some proposals that don't receive any "Very Good" ratings can be triaged/skipped; for those, only the 3 reviewer reports are sent back without a panel discussion summary.) The second day (i.e., the half day) is for preparing and reviewing the panel discussion summaries of the proposals discussed on the first day.

In the panel, everybody is smart and knowledgeable in their fields. Of course some are smarter and more knowledgeable, and it is a treat to listen to them discuss proposals. If the panel is in your narrow field of expertise, it is going to be a geek-fest for you. You will geek out and have a lot of fun. If the panel is in a field you are familiar with but not your specialized field of expertise, it is still a lot of fun, as you get to learn new things.

At the end of the first day, proposals are sorted into 4 categories: HC (Highly Competitive), C (Competitive), LC (Low Competitive), NC (Not Competitive). HC proposals get funded. Usually only 1-2 proposals out of around 20+ make it to HC. The C proposals need to get compared and ranked; this generally happens on the morning of the second day. Only the top 1-2 proposals in C would get funding. LC and NC proposals do not need to be sorted.

The panel does not make the final funding decision; it only provides feedback to NSF, which makes the final funding decisions. NSF is fairly transparent and mostly goes with panel recommendations. If a proposal is not funded, the proposers still get detailed proposal reviews with the rationale of the panel review. In contrast, some other funding agencies, such as DARPA and DOE, may not even provide you with reviews/evaluations of your proposal.

NSF panels are tiring. You sit for one and a half days and listen and discuss. And in the evening of the first day, you often have homework (hotelwork?) to read some extra proposals to help out in comparing/ranking proposals. Instead of trying to multitask during the panel (by answering your email, or reading other stuff), it is much better to just participate in the discussion and listen to the discussion of other proposals, even the ones you have not reviewed. After traveling all that distance to NSF headquarters, you should try to savor the panel as much as possible.

Panel discussions, interesting panel dynamics

Panelists can make mistakes and may have biases. Common mistakes include the following:
  • Panelists may play it safe and prefer incremental proposals over novel but risky proposals. Program managers sometimes warn about this bias, and try to promote high-risk/high-reward proposals.
  • Sometimes panelists may read too much into a proposal. They may like to write/complete the proposal in their heads, and give more credit than deserved to the proposal.
  • Panelists may be overly conformist, resulting in groupthink.
  • There can be a good arguer, a charismatic person who dominates the other panelists.

Lessons for pitching the perfect proposal to the panel

Make sure your proposal has a novel research component and intellectual merit. Your proposed project should "advance knowledge and understanding within its own field" and "explore creative, original, or potentially transformative concepts". You need to get at least one panelist excited for you. So stand for something.

Be prepared to justify/support what you stand for. You can fool some panelists some of the time, but you can't fool all of the panelists all of the time. Don't make promises you can't deliver. Show preliminary results from your work. It is actually better to write the proposal after you write an initial interesting (workshop?) paper on the topic.

Target the correct panel and hit the high notes in the CFP. If your proposal falls into the wrong panel, it will get brutally beaten. When in doubt about the scope and field of a CFP, contact the program director to get information. Most academic sub-disciplines/communities will have their biases and pet peeves. You want to target a panel that will get your proposal and is not adversarial to those ideas.

Write clearly and communicate clearly. Remember, the panelists are overworked. They need to review 8 proposals over a short time. It gets boring. So make it easy for the reviewer. Don't make the reviewer do the work. Spell out the contributions and novelty clearly, and put your contributions in the context of the literature on that topic. If you make the reviewer work, you will leave them angry and frustrated.

Don't forget to write about the proposal's broader impact including education and minority outreach. If you omit it, it will bite you back.

All being said, there is still a luck factor. An adversarial or cranky panelist may ruin your proposal's chances, or a panelist that loves your work may make your case and improve your chances. The acceptance rate is under 10%. So good luck!

Disclaimer: These are of course my subjective views/opinions as an academician that participated in NSF peer-review panels. My views/opinions do not bind NSF and may not reflect NSF's views/stance.

paper summary: One Trillion Edges, Graph Processing at Facebook-Scale

This paper was recently presented at VLDB15.
A. Ching, S. Edunov, M. Kabiljo, D. Logothetis, S. Muthukrishnan, "One Trillion Edges: Graph Processing at Facebook-Scale." Proceedings of the VLDB Endowment 8.12 (2015).

This paper is about graph processing. Graphs provide a general, flexible abstraction to model relations between entities, and are in high demand in the field of big data analysis (e.g., social networks, web-page linking, coauthorship relations, etc.).
You think the graphs in Table 1 are big, but Frank's laptop begs to differ. These graphs also fail to impress Facebook. At Facebook, they work with graphs of a trillion edges, 3 orders of magnitude larger than these. How would Frank's laptop fare on this? @franks_laptop may step up to answer that question soon. This paper presents how Facebook deals with these huge graphs of one trillion edges.

Apache Giraph and Facebook

In order to analyze social network data more efficiently, Facebook considered several graph processing platforms, including Hive, GraphLab, and Giraph, in the summer of 2012. They ended up choosing Apache Giraph for several reasons: it is open source, it directly interfaces with Facebook's internal version of HDFS and with Hive, it is written in Java, and its BSP model is simple and easy to reason about.

(The BSP model and Pregel, which Apache Giraph derived from, was covered in an earlier post of mine. You can read that first, if you are unfamiliar with these concepts. I have also summarized some of the Facebook data storage and processing systems before, if you like to read about them.)

However, choosing Apache Giraph was not the end of the story. Facebook was not happy with the state of Apache Giraph, and extended, polished, and optimized it for their production use. (And of course Facebook contributed these changes back to the Apache Giraph project as open source.) This paper explains those extensions.

Significant technical extensions to Giraph

Several of Facebook's extensions were done in order to generalize the platform. Facebook extended the original input model in Giraph, which required a rather rigid and limited layout (all data relative to a vertex, including outgoing edges, had to be read from the same record and was assumed to exist in the same data source), to enable flexible vertex/edge based input. Facebook added parallelization support that enabled adding more workers per machine, and introduced worker-local multithreading to take advantage of additional CPU cores. Finally, Facebook added memory optimizations to serialize the edges of every vertex into a byte array rather than instantiating them as native Java objects.

I was more interested in their extensions to the compute model, which I summarize below.

Sharded aggregators

The aggregator framework in Giraph was implemented over ZooKeeper rather inefficiently. Workers would write partially aggregated values to znodes (ZooKeeper data storage). The master would aggregate all of them and write the final result back to its znode for workers to access. This wasn't scalable due to znode size constraints (maximum 1 megabyte) and ZooKeeper write limitations, and it caused a problem for Facebook, which needed to support very large aggregators (e.g., gigabytes).

(That was in fact a bad use of the ZooKeeper framework; as outlined in this post, there are better ways to do it. Incidentally, my student Ailidani and I are looking at Paxos use in production environments, and we collect anecdotes like this. Email us if you have some examples to share.)

In the sharded aggregator architecture implemented by Facebook (Figure 3), each aggregator is randomly assigned to one of the workers. The assigned worker is in charge of gathering the values of its aggregators from all workers and distributing the final values to the master and the other workers. This balances aggregation across all workers rather than bottlenecking the master, and aggregators are limited only by the total memory available on each worker. Note that this is not fault-tolerant; they lost the crash-tolerance of ZooKeeper.
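Conceptually, the sharding looks like the following Python sketch (Giraph itself is Java; the names here are made up, and hashing stands in for the paper's random assignment of aggregators to owner workers):

```python
def owner_of(aggregator_name, workers):
    # Each aggregator is owned by one worker, so aggregation load is spread
    # across workers instead of funneling through the master.
    return workers[hash(aggregator_name) % len(workers)]

def sharded_aggregate(partials_per_worker, workers, reduce_fn=sum):
    """partials_per_worker: {worker: {aggregator_name: partial_value}}"""
    # Route each partial value to the worker that owns that aggregator.
    inbox = {w: {} for w in workers}
    for _, partials in partials_per_worker.items():
        for name, value in partials.items():
            inbox[owner_of(name, workers)].setdefault(name, []).append(value)
    # Each owner reduces its own aggregators; in Giraph the final values are then
    # distributed to the master and the other workers (distribution omitted here).
    return {name: reduce_fn(values)
            for shard in inbox.values() for name, values in shard.items()}

workers = ["w0", "w1", "w2"]
partials = {"w0": {"sum_error": 1.5}, "w1": {"sum_error": 0.5}, "w2": {"sum_error": 2.0}}
print(sharded_aggregate(partials, workers))   # {'sum_error': 4.0}
```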

Worker and Master Phase Extensions

For the worker side, the methods preSuperstep(), postSuperstep(), preApplication(), and postApplication() were added. As an example, the preSuperstep() method is executed on every worker prior to every superstep, and can be used in a k-means clustering implementation to let every worker compute the final centroid locations just before the input vectors are processed.
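A conceptual Python sketch of how such a worker-phase hook could be used for k-means (Giraph's actual API is Java; the class and parameter names below are my own illustration):

```python
class KMeansWorkerContext:
    def __init__(self, initial_centroids):
        self.centroids = list(initial_centroids)

    def pre_superstep(self, cluster_sums, cluster_counts):
        # Runs on every worker once before each superstep: finalize centroid
        # locations from aggregated per-cluster sums/counts, so each vertex's
        # compute() can simply read self.centroids instead of recomputing them.
        self.centroids = [s / c for s, c in zip(cluster_sums, cluster_counts)]

ctx = KMeansWorkerContext(initial_centroids=[0.0, 10.0])
ctx.pre_superstep(cluster_sums=[4.0, 26.0], cluster_counts=[2, 3])
print(ctx.centroids)    # [2.0, 8.666...]
```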

Similarly, Facebook added master computation to do centralized computation prior to every superstep, which can communicate with the workers via aggregators. This is generally a lightweight task (easily computable without requiring much data analysis) that has a global scope (applies as input to all workers in the next superstep).

Superstep Splitting

When operating on very large scale graphs, a superstep may generate so much data to share with other workers (e.g., in the friends-of-friends score calculation) that the output does not fit in memory. Giraph can use disk, but this slows things down significantly. The superstep splitting technique is for doing the same computation all in-memory for such applications. The idea is that in such a message-heavy superstep, a worker can send a fragment of the messages to their destinations and do a partial computation that updates the state of the vertex value.
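Here is a rough Python sketch of the idea as I read it from the paper (simplified and with made-up helper names, not Giraph's implementation): the sources are processed in fragments, and each fragment's messages are folded into a running partial value so only one fragment's worth of messages is ever in memory.

```python
def split_superstep(sources, make_messages, fold, init, num_fragments=4):
    """sources: vertices to process; fold: (partial_state, message) -> partial_state."""
    state = dict(init)                          # destination vertex -> partial value
    for i in range(num_fragments):
        fragment = sources[i::num_fragments]    # process only a slice of the sources
        for src in fragment:
            for dst, msg in make_messages(src):
                state[dst] = fold(state.get(dst, 0), msg)
        # In Giraph the fragment's messages would be delivered and freed here,
        # before the next fragment is processed.
    return state

# Example: friends-of-friends style scoring where every source sends "1" to its peers.
edges = {0: [1, 2], 1: [2], 2: [0, 1]}
result = split_superstep(list(edges), lambda s: [(d, 1) for d in edges[s]],
                         lambda acc, m: acc + m, init={})
print(result)    # {1: 2, 2: 2, 0: 1}
```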

Operational experience

Facebook uses Apache Giraph for production applications, for a variety of tasks including label propagation, variants of PageRank, and k-means clustering. The paper reports that most of Facebook's production applications run in less than an hour and use less than 200 machines. Due to the short execution duration and small number of machines, the chance of failure is relatively low and, when a failure occurs, it is handled by restarts.

The production application workflow is as follows. The developer first develops and unit tests the application locally, then tests it on a small number of servers with a test dataset (e.g., the Facebook graph for one country). Then the application is run at scale on 200 workers. After these tests, the application is ready for production use.


This is where Facebook shows off. They ran an iteration of unweighted PageRank on the 1.39B Facebook user dataset with over 1 trillion social connections. They were able to execute PageRank in less than 3 minutes per iteration with only 200 machines.


The paper gives the following as concluding remarks:
First, our internal experiments show that graph partitioning can have a significant effect on network bound applications such as PageRank.
Second, we have started to look at making our computations more asynchronous as a possible way to improve convergence speed.
Finally, we are leveraging Giraph as a parallel machine-learning platform.
These are of course related to what we mentioned recently about the trends in distributed systems research in cloud computing. (Part 1, Part 2)


After I wrote this summary, I found that Facebook already has a nice post summarizing their paper.

Saturday, August 15, 2015

New directions for distributed systems research in cloud computing

This post is a continuation of my earlier post on "a distributed systems research agenda for cloud computing". Here are some directions I think are useful for distributed systems research in cloud computing.

Data-driven/data-aware algorithms

Please check the Facebook and Google software architecture diagrams in these two links: Facebook Software Stack, Google Software Stack. You will notice that the architecture is all about data: almost all components are about either data processing or data storage.

This trend may indicate that distributed algorithms need to adapt to the data they operate on to improve performance. So, we may see the adoption of machine learning as input/feedback to the algorithms, and the algorithms becoming data-driven and data-aware. (For example, this could be a good way to attack the tail-latency problem discussed here.)

Similarly, driven by demand from large-scale cloud computing services, we may see power-management, energy-efficiency, and electricity-cost-efficiency become requirements for distributed algorithms. Big players already partition data as hot, warm, and cold, and employ tricks to reduce power. We may see algorithms becoming more aware of this.

Scalable coordination 

Again, refer to the Facebook and Google service stacks linked above. The Facebook stack does not have dedicated coordination services, only monitoring tools. (Of course the data stores employ Paxos to replicate the masters.) The Google stack has some coordination and cluster management tools. These large scale systems already seem to embrace the principle of operating with as little coordination as possible.

Heeding the advice in the first challenge of my previous post, this may suggest that we should look into implicit/diffusing/asynchronous/eventual coordination, such as coordination by writing to datastores and other processes reading off of them. Pat Helland's article suggested entity and activity abstractions, which can be useful primitives to get started on implicit/diffusing coordination.

Another way to scale coordination is to relax consistency. It is easy to scale consistency, and it is easy to scale availability, but not both! Eventual-agreement/convergent-consistency provides a way out of this. There is already a lot of exciting work in this area, and it will receive more attention. Brewer, in his "CAP 12 years later" article, has given nice clues for pursuing these kinds of systems. We may see systems that also associate uncertainty with the consistency of the current state in order to facilitate conflict recovery and eventual consistency.

Extremely resilient systems

Cloud computing transformed the fault-tolerance landscape. Node failures are not a big deal; thanks to the abundance and replication in cloud systems, the nodes are replaceable. Now complex failure modes and failures caused by distributed state corruption have become the more critical problems. Improving the availability of these cloud systems in the face of such unanticipated failure modes is very important; it makes the news if a large cloud service is unavailable for several minutes. In his advice for research on cloud computing, Matt Welsh mentions these two: 1) "Building failure recovery mechanisms that are robust to massive correlated outages." 2) "Handling both large-scale upgrades to computing capacity as well as large-scale outages seamlessly, without having to completely shut down your service and everything it depends on."

Self-stabilization is a great approach to deal with unanticipated faults. I am guessing we will see a surge in research on self-stabilizing algorithms to achieve extreme resiliency in the face of unanticipated faults in cloud computing systems. Recovery oriented computing (ROC) and resettable systems (crash-only software) are special cases of self-stabilization, and we may see extensions of that work for distributed systems. A critical question here will be: "How can we make ROC compose nicely for distributed systems?"

To insure against correlated failures, we may see multiversion programming approaches revisited. This can also be helpful to avoid the spooky/self-organizing synchronization in cloud computing systems.

For scalable fault-tolerance, asynchronous algorithms like self-stabilizing Propagation of Information with Feedback (PIF) algorithms may be adopted in the cloud domain. Furthermore, pheromone/hormone based algorithms that run in the background in a slow mode can be made extremely scalable by exploiting peer-to-peer random-gossip techniques.

New graph-based programming abstractions for the cloud

Good programming abstractions are like good tools; they can boost productivity severalfold. By way of analogy, in wireless sensor networks several interesting programming abstractions were proposed, including treating a neighborhood or area of nodes as the unit of programming instead of a single node, stream-based programming (map/join), and spreadsheet-like high-level programming. These abstractions bring a different perspective which can be very helpful. There has been work on designing programming abstractions for cloud computing systems, especially for dealing with big data and big graphs. I hope we can see new useful abstractions emerge for programming large scale distributed cloud services. Since scalability is very important for cloud systems, we may see hierarchical abstractions and logical tree abstractions. We may also see abstractions that capture the call graph of services or the dataflow through services.

Auditability tools

With very large scale complex distributed systems, observability/auditability becomes very important. We recently presented our proposal on this topic in HotCloud'15. I hope to write a post about this work soon.

Abstract models to kickstart algorithms work

Finally, I hope we will see theoretical abstract modeling to simplify the cloud computing model (goals, challenges, environment) and kickstart more algorithms work in the area. As an analogy, Dijkstra's token ring formulation was really transformative, and started the self-stabilization field of distributed systems. A useful abstraction will hide irrelevant/accidental details and allow work to focus on the inherent, most important parts of the problem, and allow other researchers to adopt the same terminology/model and start building on each other's work and improving it.

Thursday, August 13, 2015

How to go for 10X

I think the 10X term originated from this book. (Correct me if I am wrong. I didn't check this.)

It seems like Larry and Sergey are fans of this concept (and so should you be!). Actually, reading this January 2013 piece, you can sense that the Alphabet transition was already in the works by then.

10X doesn't just mean go fast, get quick results, and get 10X more done in the same time. If you think about it, that is actually a pretty incremental mode of operation, and that is how you incur technical debt. It means it would be just a matter of time for others to do the same thing, and probably much better and more completely. Trading off quality for time is often not a good deal (at least in the academic research domain).

10X means transformative rather than incremental improvement. Peter Thiel explains this well in his book Zero to One, 2014. The main theme in the book is: Don't do incremental business, invent a new transformational product/approach. Technology is 0-1, globalization is 1-n. Most people think the future of the world will be defined by globalization, but the book argues that technology matters more. The book says: Globalization (copying and incrementalism as China has been doing) doesn't scale; it is unsustainable. Another way to put that argument is that technology creates more value than globalization.

Below I propose some strategies for achieving 10X and also approach 10X from the perspective of how it applies to academic research.

Aim big: Don't go for the incremental, pursue the transformative

10X is a mentality, a frame of mind. The idea is that if you go for a moonshot and fail, you land among the stars. If you go for incremental improvements, you may be obsolete by the time you get there, because the world also moved on. The Silicon Valley motto, "fail big, fail fast!", embodies this thinking.

For the research part, Dijkstra captured this thinking well in his advice to a promising researcher, who asked how to select a topic for research: "Do only what only you can do!" Anybody can pick the low hanging fruit.

Use the Pareto principle effectively and you are halfway there

The Pareto principle (also known as the 80–20 rule, the law of the vital few, and the principle of factor sparsity) states that, for many events, roughly 80% of the effects come from 20% of the causes.

On a related point, if you have to eat two frogs, eat the big frog first. Have the courage to confront the big and ugly head-on. That's where the biggest results/benefits/outcomes will come from. There is an entire book on eating the big frog. And this is how the term "eating the frog" originated, if you are curious.

From the academic research perspective, the lesson is: attack the inherent complexity of the problem, not the incidental complexities, which time and improvement in technology will take care of.

Adopt/Invent better tools 

"Give me six hours to chop down a tree and I will spend the first four sharpening the axe." -- Abraham Lincoln

Here is a more modern perspective from an XKCD cartoon.

I mean not just better but transformative tools, of course --remember the first point. Most often, you may need to invent that transformative tool yourself. When you are faced with an inherently worthy problem, don't just go for a point solution; generalize your solution, and ultimately make it into a tool that benefits the future. Generalizing and constructing the tool/system pays off that technical debt and gets you a truly 10X benefit. MapReduce as a tool comes to my mind as a good example of this. By generalizing over the kinds of big data processing tasks/programs written at Google, Jeff Dean was able to see an underlying pattern and write a good tool to solve the problem once and for all.

Scientists spend decades to invent transformative tools (Hadron Collider, Hubble telescope) and then they get breakthrough results. As researchers in computer science, we should try to adopt/cultivate this mentality more.

Be agile and use rapid prototyping

Here is a brief informative video about the rapid prototyping idea. 

The point of prototypes is to fail fast, learn, and move on to the next attack. If you have a plan of attack for a worthy problem, sketch it, model it, pursue it to see if it holds water. As the first/easiest step, write down your idea to explain it.
"If you think without writing, you only think you're thinking." -- Leslie Lamport

"Writing is nature's way of letting you know how sloppy your thinking is." -- Guindon.

The next step for prototyping is mathematical modeling (or writing a specification-level program).

"Math is nature's way of letting you know how sloppy your writing is." -- Leslie Lamport


This advice is in the same vein as using better/transformative tools: collaborate with the best minds on the topic that you can access. Academics are pretty open to collaboration, especially when compared to industry, where there are many challenges to collaboration. If you have an interesting question, and if you demonstrate that you did your homework, you can recruit experts on the topic as collaborators.

Employ meta thinking

Focus on results, but also on processes. If you can't solve a hard problem, spin the problem into a related version and attack that instead. Or maybe go for proving an impossibility result. Wouldn't that be transformative? (Here is a very recent example.)

Practice deliberately, and see what works

We tend to get more conservative as we get older. So deliberately practice being open-minded. Experiment! We are researchers after all. Collect good practices, and form useful habits. This is a process. Good luck.

Disclaimer: I don't claim to be a 10X engineer or researcher.

Tuesday, August 11, 2015

A distributed systems research agenda for cloud computing

Distributed systems (a.k.a. distributed algorithms) is an old field, almost 40 years old. It gave us impossibility proofs on the theory side, and also algorithms like Paxos, logical/vector clocks, 2/3-phase commit, leader election, dining philosophers, graph coloring, and spanning tree construction, which are widely adopted in practice. Cloud computing is a relatively new field in contrast. It provides new opportunities as well as new challenges for the distributed systems/algorithms area. Below I briefly discuss some of these opportunities and challenges.


Cloud computing provides abundance. Nodes are replaceable, even hot swappable. You can dedicate several nodes for running customized support services, such as monitoring, logging, storage/recovery service. These opportunities are likely to have impact on how fault-tolerance is considered in distributed systems/algorithms work.

Programmatic interfaces and service-oriented architecture are also hallmarks of cloud computing domain. Similarly, virtualization/containerization facilitates many things, including moving the computation to the data. However, it is unclear how these technologies can be employed to provide substantial benefits for distributed algorithms which operate at a more abstract plane.


Coordination introduces synchronization, which introduces the potential for cascading shutdowns and halts. Especially in a large-scale system of systems, any coordination introduced may lead to spooky/latent/self-organized synchronization. This point is explained nicely in "Towards A Cloud Computing Research Agenda". Thus the challenge is to avoid coordination as much as possible, and still build useful and "consistent" systems.

Another curse of the extreme scale in cloud computing is the large fan-in/fan-out of components. These challenges are explained nicely in the "Tail at Scale" paper. These can lead to broadcast storms and incast storms, and algorithms/systems should be designed to avoid these problems.

Finally, another major challenge is geographically distributed services. Due to the speed of light, communication over large distances is prone to high latency, and this especially poses consistency versus availability challenges due to the CAP theorem.

The current state of distributed systems research on cloud computing

This is not an exhaustive list. From what I can recollect now, here is a crude categorization of current research in distributed systems on the cloud computing domain.

What is next?

If only I knew... I can only speculate, I have some ideas, and that will have to wait for another post.

In the meanwhile, I can point to these two articles which talked about what would be good/worthy research areas in cloud computing.
How can academics do research on cloud computing?
Academic cloud: pitfalls, opportunities

UPDATE: Part 2 of this post (i.e., new directions) is available here. 

Wednesday, June 3, 2015

Book Review: "Rework", 2010, Jason Fried & David Heinemeier Hansson

This is a book from the founders of 37signals. The book is full of simple commonsense advice for business and work. Yet this advice also comes across as very fresh. How come? Since we have been trying/inventing/pushing complex, complicated rules and approaches for business and work in lieu of simple straightforward approaches, the simple commonsense advice in the book does come across as surprisingly fresh.

Just to give one example, for culture, the book says: "You can't build it, it occurs. Whatever you value and do (actions speak louder than words), it will become the culture of your company/workers over time." Commonsense? Yes. But only after you take a step back and think do you say, "Huh. That is obvious."

Another after-the-fact obvious piece of advice: Try to underdo your competition; just do fewer things, but better. You should stand for something. A strong stand is how you attract superfans. Covering everything means you don't have focus. Provide the main functionality as simply as possible, avoid the bells and whistles, and be unashamedly proud about this.

Maybe the insight behind this advice is that you need to have a transformational idea. And when you have that, the auxiliary stuff does not matter that much, and may even distract from your transformational idea.

This next piece of advice is very interesting and stands out as controversial. The book says: Don't accept outside money for your startup. You will find enough money yourself if you implement the less-is-more principle. If you can't fund it yourself and get it prototyped yourself, it is not worth doing; you are not thinking less is more. When you take outside money, you are bound to make the sponsors/investors happy and lose your flexibility and freedom, for the following reasons: you give up control; cashing out begins to trump building a quality business; spending other people's money is addictive; it's usually a bad deal; you start building what investors want instead of what customers want; and raising money is incredibly distracting.

The book earned my respect when it said writing is very important. The book says: Hire the person who is the better writer. Clear writing is a sign of clear thinking. Great writers know how to communicate, empathize, and know what to omit and what to stress.

On writing style, the book says: When you are writing, write with your own voice. This is also a matter of honesty. Honesty is the smart policy in business, so why come off as dishonest by writing artificially? The important thing is to communicate, and you should just focus on that.

The book has so many other commonsense, simple, yet refreshing and radical pieces of advice. The book is a very good, very easy read. It is written in a conversational style, and you feel like you are having a discussion with Jason and David. And it seems the book got very good reviews from business and thought leaders.

Wednesday, May 6, 2015

How to run effective paper reading groups

Every year, I offer a distributed systems reading group seminar, where we discuss recent interesting research papers. (Here is the list of papers we covered this Spring.)

Reading group seminars are a lot of fun when everything clicks. But they can easily turn into soul-draining boring meetings when a couple of things go wrong. Here are some common bad meeting patterns: (1) the presenter goes on and on with a dry presentation, (2) without a common background, the participants bombard the presenter with a lot of questions just to get the context of the work and a lot of time is wasted just to get started on the paper, (3) the audience drifts away (some fall into their laptop screens, some start to fiddle with their phones), and (4) in the discussion phase an awkward silence sets in and crickets chirp.

I have been trying to tinker with the format of my reading group meetings to avoid those problems and improve the chances that everything clicks together. And over time I have been learning from trial and error what works and what doesn't. Based on what I learned, here is the most recent format I propose.

There are 3 components to a successful reading group meeting: the participants, the presenter, and the discussion process.

1. The participants should come prepared.

Here are the tricks I found to be most effective for ensuring that the participants come to the group meeting having read the paper and primed for fruitful discussion.

Cover only 1 paper per meeting. Trying to cover 2 papers in one meeting does not work. It is too much work to ask the participants to read 2 papers. Also 2 papers per meeting dilutes the focus and intensity.

Utilize the conference presentation video, if applicable. If the paper has a conference presentation video that really helps. Watching that 20 minute pitch about the paper makes it much easier to read the paper than attempting a cold reading of the paper. The OSDI and SOSP conferences make all the paper presentation videos openly available, which have been of great help for our seminar group.

Solicit questions from the participants before the meeting. In order to get the participants to start thinking critically about the paper, I ask them to write their questions in Piazza 1-2 days before the paper presentation. These questions also help the presenter to focus the presentation and get prepared for the discussion.

2. The presenter should come prepared.

Believe it or not, the presenter may come to the meeting with a superficial understanding of the paper. This makes things really awkward. In order to ensure that the presenter comes ready with a deep understanding of the paper, here is what we adopted.

Ask the presenter to submit a 2-3 page review of the paper. Preparing a PowerPoint presentation only creates a superficial understanding, or even a false sense of security that the presenter understands (and can communicate) the paper. In order for the presenter to understand the paper at a deeper conceptual level, I ask the presenter to write a 2-3 page review of the paper.

The presenter prints copies of this review and brings them to the meeting to be read before the presentation. All the participants review the hard copy of the review-report in the first 15 minutes of the meeting. So in the first 15 minutes, it is all silence. The presenter is at the podium, getting adjusted, while we review the report. I got this idea from reading about how Bezos runs meetings.

Writing the report helps the presenter focus on the message and really understand the paper. Reading the report critically prepares the participants to benefit more from the presentation. Their minds start thinking and posing questions about the content, and they listen to the presentation more actively as a result.

Restrict the presenter to a 10-15 slide presentation maximum. The presenter should perform no more than a 10-15 slide presentation. The slides should be mostly for visuals: figures, tables, graphs.

3. The discussion process should be primed.

The discussion phase needs some priming to achieve a fruitful discussion, a critical analysis of the paper, as well as brainstorming for creative extensions to the paper.

Spare enough time for question answering. We have a separate dedicated question-answering phase in addition to the clarification questions that may come during the presentation. After the presentation, the presenter starts answering questions, beginning with those that were submitted 1-2 days in advance on Piazza. The audience members who submitted the questions read them aloud, and the presenter answers. Then we take more questions and comments from the floor.

Elicit hands-on participation by writing mock paper reviews. In order to keep participants engaged deeply and elicit critical thinking, I recently got the idea to perform mock paper reviews in the group meetings. I plan to implement this idea for our upcoming reading group meetings.

After the question-answer phase, I will ask participants to form groups of 3 and mock-review the paper. I will ask each group to commit to either a very-bad/critical review of the paper or a very-good/championing review of the paper. The reason for pushing to these extremes is to force ourselves to come up with strong arguments, and not give wishy-washy reviews. The groups will write the review collaboratively using Google Docs. Within a group, one participant may research the related work and write that part of the review, while another writes about the motivation/application aspect of the paper, and the other about the technical/methods aspects.

Another group can form to write about research questions related to the current paper, in order to come up with interesting (and secondarily actionable) directions for future work. Again, the group needs to be aggressive in its effort and brainstorm to come up with "novel", albeit speculative and somewhat far-fetched, research ideas.

I am a big proponent of making/producing something, however small, as often as possible. (That's why I have this blog.) With this setup we will not only criticize/analyze a work, but also quickly synthesize and produce some byproducts, in the form of reviews and blog posts. Sort of like a mini-hackathon; shall we say, a hack-a-paper session.

Would it be possible to use this format in teaching courses? I guess this format could also be used for graduate-level courses with fewer than 20 students. It is similar to the flipped classroom, but with more focus on critical thinking and writing/producing. With more than 20 students, I don't think the discussion and question-answering would scale. Also, it may be hard to fit all these activities into a 50-minute class; an 80-minute class would work, though.

Related links
How I read a research paper
How to present your work
Tell me about your thought process, not just your results
How to write your research paper
How I write
Good Prose and How to write a lot
Advice to graduate students

Wednesday, April 29, 2015

Arrakis: The OS is the control plane

This paper (authored by Simon Peter, Jialin Li, Irene Zhang, Dan R. K. Ports, Doug Woos, Arvind Krishnamurthy, and Thomas Anderson, University of Washington; Timothy Roscoe, ETH Zürich) was awarded a best paper award in OSDI 2014.

The paper "described and evaluated Arrakis, a new operating system designed to remove the kernel from the I/O data path without compromising process isolation. Unlike a traditional operating system, which mediates all I/O operations to enforce process isolation and resource limits, Arrakis uses device hardware to deliver I/O directly to a customized user-level library. The Arrakis kernel operates in the control plane, configuring the hardware to limit application misbehavior."

The Arrakis paper avoids mentioning containers, but what it proposes has a lot of applicability to container technology. Containers aim to provide the isolation/portability of VMs without incurring their overhead. So containers run an application set directly on the OS and raw metal for better performance, instead of running it on a VM layer. Arrakis provides OS-level technology to improve efficiency toward the same goal.

The Arrakis approach is also closely related to the ExoKernel and MicroKernel approach. Containers, ExoKernel, Xen Unikernel, and the Arrakis project form a spectrum from monolithic to microkernel OS. It seems like Tanenbaum will have the last laugh.

Hardware support

Arrakis exploits the hardware support provided for virtual-machine-level virtualization, and pushes further to implement virtualization at the application (or potentially the container) level. Arrakis is built on Barrelfish, which already supports standalone user-mode device drivers, akin to those found in microkernels. The paper argues that with some modifications the idea can be brought to Linux as well.

This is what Arrakis requires from the hardware:
"Arrakis assumes the network devices provide support for virtualization by presenting themselves as multiple virtual network interface cards (VNICs) and that they can also multiplex/demultiplex packets based on complex filter expressions, directly to queues that can be managed entirely in user space without the need for kernel intervention. Similarly, each storage controller exposes multiple virtual storage interface controllers (VSICs) in our model. Each VSIC provides independent storage command queues (e.g., of SCSI or ATA format) that are multiplexed by the hardware. Associated with each such virtual interface card (VIC) are queues and rate limiters."

"Network cards that support SR-IOV have the key elements of this model: they allow the creation of multiple VNICs that each may have multiple send and receive queues, and support at least rudimentary transmit and receive filters."

"Storage controllers have some parts of the technology needed to provide the interface we describe. For example, RAID adapters have a translation layer that is able to provide virtual disks above physical extents, and SSDs use a flash translation layer for wear-leveling. SCSI host-bus adapters support SR-IOV technology for virtualization and can expose multiple VSICs, and the NVMe standard proposes multiple command queues for scalability."

Tuesday, April 28, 2015

Large-scale cluster management at Google with Borg

This paper is by Abhishek Verma, Luis Pedrosa, Madhukar Korupolu, David Oppenheimer, Eric Tune, and John Wilkes and it appeared recently in EuroSys 2015.

Google's Borg is a cluster manager that admits, schedules, starts, restarts, and monitors all applications that Google runs. Borg runs hundreds of thousands of jobs across a number of clusters, each with tens of thousands of machines.

Borg cells (1000s of machines that belong to a single cluster and are managed as a unit) run a heterogeneous workload with two main parts. The first is long-running services that should never go down and handle quick requests: e.g., Gmail, Google Docs, web search, BigTable. The second is user-submitted batch jobs. Each job consists of multiple tasks that all run the same program (binary).

Each task maps to a set of Linux processes running in a container on a machine. The vast majority of the Borg workload does not run inside virtual machines (VMs) in order to avoid the cost of virtualization. Containers are so hot right now.


Each cell's Borgmaster consists of two processes: the main Borgmaster process and a separate scheduler.

The Borgmaster process handles client RPCs that create, edit, and view jobs, and also communicates with the Borglets to monitor/maintain their state. (The Borglet is a machine-local Borg agent that starts, stops, and restarts tasks on a machine. The Borgmaster polls each Borglet every few seconds to retrieve the machine's current state and send it any outstanding requests.) The Borgmaster process is logically a single process but is actually Paxos-replicated over 5 servers.

When a job is submitted, the Borgmaster records it in Paxos and adds the job's tasks to the pending queue. This is scanned asynchronously by the scheduler, which assigns tasks to machines if there are sufficient available resources that meet the job's constraints. The scheduling algorithm has two parts: feasibility checking, to find machines on which the task could run, and scoring, which picks one of the feasible machines. If the machine selected by the scoring phase doesn't have enough available resources to fit the new task, Borg preempts (kills) lower-priority tasks, from lowest to highest priority, until it does.
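As a concrete illustration of this two-phase pass, here is a minimal Python sketch (my own simplification, not Borg code); the dict-based resource model, the CPU-only capacity check, and the scoring rule are all invented stand-ins:

```python
# Toy sketch of a Borg-style scheduling pass: feasibility check, scoring,
# then preemption of lower-priority tasks until the new task fits.
# Machines and tasks are plain dicts; only CPU is modeled for brevity.

def feasible(machine, task):
    # Enough capacity, counting resources freeable by preempting lower-priority tasks.
    reclaimable = sum(t["cpu"] for t in machine["tasks"] if t["prio"] < task["prio"])
    return machine["free_cpu"] + reclaimable >= task["cpu"]

def score(machine, task):
    return machine["free_cpu"]            # stand-in for Borg's richer scoring criteria

def schedule(task, machines):
    candidates = [m for m in machines if feasible(m, task)]
    if not candidates:
        return None                        # task stays in the pending queue
    best = max(candidates, key=lambda m: score(m, task))
    # Preempt lower-priority tasks, lowest priority first, until the task fits.
    for victim in sorted(best["tasks"], key=lambda t: t["prio"]):
        if best["free_cpu"] >= task["cpu"]:
            break
        if victim["prio"] < task["prio"]:
            best["tasks"].remove(victim)   # victim goes back to the pending queue
            best["free_cpu"] += victim["cpu"]
    best["free_cpu"] -= task["cpu"]
    best["tasks"].append(task)
    return best
```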


"Centralized is not necessarily less scalable than decentralized" is a pet pieve of mine. So, I went all ears when I read this section. The paper said: "We are not sure where the ultimate scalability limit to Borg's centralized architecture will come from; so far, every time we have approached a limit, we've managed to eliminate it."

One early technique they used for scalability of the Borgmaster is to decouple the Borgmaster into a master process and an asynchronous scheduler. A scheduler replica operates on a cached copy of the cell state from the Borgmaster in order to perform a scheduling pass to assign tasks. The master will accept and apply these assignments unless they are inappropriate (e.g., based on out of date state), just like in optimistic concurrency control (OCC). To improve response times, they added separate threads to talk to the Borglets and respond to read-only RPCs.

A single Borgmaster can manage many thousands of machines in a cell, and several cells have arrival rates above 10000 tasks per minute. A busy Borgmaster uses 10–14 CPU cores and up to 50 GiB RAM.

In order to achieve scalability of the scheduler, Borg employs score caching, grouping tasks into equivalence classes, and relaxed randomization (basically sampling over machines). These reduced the time to schedule a cell's entire workload from scratch from 3 days to a few hundred seconds. Normally, an online scheduling pass over the pending queue completes in less than half a second.
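For illustration, here is a rough sketch of how those three optimizations could fit together; the equivalence-class key, the scoring rule, the cache, and the sample size are all invented for this sketch:

```python
# Sketch of scheduler optimizations: cache scores per (equivalence class,
# machine), and score only a random sample of feasible machines
# (relaxed randomization) instead of scanning the whole cell.

import random

score_cache = {}   # (equivalence_class, machine_id) -> cached score

def pick_machine(task, machines, sample_target=50):
    eq_class = (task["cpu"], task["ram"])             # tasks with identical requirements
    scored = []
    for m in random.sample(machines, len(machines)):  # randomized scan order
        if m["free_cpu"] < task["cpu"] or m["free_ram"] < task["ram"]:
            continue
        key = (eq_class, m["id"])
        if key not in score_cache:
            score_cache[key] = m["free_cpu"] + m["free_ram"]   # toy scoring rule
        scored.append((score_cache[key], m))
        if len(scored) >= sample_target:              # stop early: relaxed randomization
            break
    return max(scored, key=lambda s: s[0])[1] if scored else None
```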

Related work

There is the Apache Mesos project, which originated from a UC Berkeley class project. Mesos formed the basis for Twitter's Aurora, a Borg-like scheduler for long running services, and Apple's Jarvis, which is used for running Siri services. Facebook has Tupperware, a Borg-like system for scheduling containers on a cluster.

AWS has ECS (EC2 Container Service) for managing jobs running on clusters. ECS has a state management system that runs Paxos to ensure a consistent and highly available view of the cluster state, similar to the Borgmaster process. Instead of one scheduler, ECS employs distributed schedulers, each interacting with the state management system. Each scheduler is responsible for a separate set of workers in order to avoid too many conflicts in scheduling decisions.

Microsoft has the Autopilot system for automating software provisioning, deployment, and system monitoring. Microsoft also uses the Apollo system for scheduling, which opportunistically tops off workers with short-lived batch jobs to achieve high throughput, at the cost of occasionally causing multi-day queueing delays for lower-priority work.

Kubernetes is under active development by many of the same engineers who built Borg. Kubernetes builds on and improves over Borg. In Borg, a major headache was caused by using one IP address per machine. That meant Borg had to schedule ports as a resource and coordinate port assignments with tasks to resolve port conflicts on the same machine. Thanks to the advent of Linux namespaces, VMs, IPv6, and software-defined networking, Kubernetes can take a more user-friendly approach that eliminates these complications: every pod and service gets its own IP address. Kubernetes is open source.


Borg is all about scheduling computation but does not get into data placement or data-transfer scheduling issues. Data (and data transfer) should also be treated as a first-class citizen in scheduling decisions, since with big data come big costs and big delays. Wouldn't it be nice to have a data-scheduler/manager system collaborating with Borg to help run a more efficient datacenter?

Thursday, April 23, 2015

Paper Summary: On the use of Clocks to Enforce Consistency in the Cloud

This paper is by Manuel Bravo, Nuno Diegues, Jingna Zeng, Paolo Romano, Luis Rodrigues, and appeared in IEEE Data Engineering Bulletin 2015.

The purpose of this paper is to revisit how the logical and physical clock concepts are applied in the context of developing distributed data store systems for the cloud and review the choice of clocks in relation to consistency/performance tradeoffs.

The use of clocks in weak consistency data stores 

Dynamo employs sloppy quorums and hinted handoff, and uses version vectors (a special case of vector clocks) to track causal dependencies within the replication group of each key. A version vector contains one entry for each replica (thus the size of the clock grows linearly with the number of replicas). The purpose of this metadata is to detect conflicting updates and to be used in the conflict reconciliation function.
Here is a link to my Dynamo review.
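As a concrete illustration (my own sketch, not Dynamo code), version-vector comparison for conflict detection looks roughly like this:

```python
# Version vectors: one counter per replica. v1 descends from v2 iff every
# entry of v1 is >= the corresponding entry of v2; if neither descends from
# the other, the versions are concurrent and must be reconciled.

def descends(v1, v2):
    replicas = set(v1) | set(v2)
    return all(v1.get(r, 0) >= v2.get(r, 0) for r in replicas)

def compare(v1, v2):
    if descends(v1, v2):
        return "v1 supersedes v2"
    if descends(v2, v1):
        return "v2 supersedes v1"
    return "concurrent: pass both to the conflict reconciliation function"

# Concurrent updates accepted at replicas A and B:
print(compare({"A": 2, "B": 1}, {"A": 1, "B": 2}))   # concurrent
```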

COPS is a geo-replicated datastore and it assigns a scalar clock to each object. Clients maintain the last clock value of all objects read in the causal past. Updates piggyback their dependencies when being propagated to other data centers. When a data center receives an update propagated by another data center, it only makes it visible when its dependencies are satisfied. COPS provides a partial form of transactions called causally consistent read-only transactions which return versions of the read objects that belong to a causally consistent snapshot. A two-round protocol implements these transactions. In the worst case the list of dependencies can grow and slow down the system.
Here is a link to my COPS review.
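Here is a toy sketch of that visibility check; the data structures and names are illustrative, not from the COPS implementation:

```python
# A remote update carries its dependencies as (object, version) pairs and is
# applied locally only after every dependency is already visible locally.

local_store = {}   # object -> latest visible version number
pending = []       # remote updates whose dependencies are not yet satisfied

def deps_satisfied(update):
    return all(local_store.get(obj, 0) >= ver for obj, ver in update["deps"])

def receive_remote_update(update):
    if deps_satisfied(update):
        apply_update(update)
    else:
        pending.append(update)             # re-checked as other updates arrive

def apply_update(update):
    local_store[update["obj"]] = update["version"]
    # Applying one update may unblock others; keep draining until stable.
    progressed = True
    while progressed:
        progressed = False
        for u in list(pending):
            if deps_satisfied(u):
                pending.remove(u)
                local_store[u["obj"]] = u["version"]
                progressed = True
```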

The GentleRain protocol aims to reduce the metadata piggybacked on update propagation and to eliminate dependency-checking procedures. The idea is to only allow a data center to make a remote update visible once all partitions (within the data center) have seen all updates up to the remote update's timestamp. Thus, a client that reads a version is automatically ensured to read causally consistent versions in subsequent reads, without the need to explicitly check dependencies or wait until a causally consistent version is ready. In other words, GentleRain shoehorns causality into physical clocks by delaying updates.
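A rough sketch of that visibility rule follows; the names are invented, and in the real protocol the Global Stable Time is aggregated periodically across partitions rather than computed on demand:

```python
# A remote update stamped with physical time ts becomes visible only after the
# data center's Global Stable Time (GST), the minimum update timestamp seen by
# all local partitions, has passed ts. No per-update dependency lists needed.

partition_clock = {"p0": 0, "p1": 0}   # partition -> latest update timestamp seen
remote_buffer = []                     # (ts, update) waiting to become visible
visible_store = {}                     # object -> value, the locally visible state

def global_stable_time():
    return min(partition_clock.values())

def on_remote_update(ts, update):
    remote_buffer.append((ts, update))

def make_visible():
    gst = global_stable_time()
    ready = [item for item in remote_buffer if item[0] <= gst]
    for ts, update in sorted(ready, key=lambda item: item[0]):
        visible_store[update["obj"]] = update["value"]   # now causally safe to expose
        remote_buffer.remove((ts, update))
```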

ORBE uses vector clocks, organized as a matrix, to represent dependencies. The vector clock has an entry per partition and data center. Physical clocks are used for generating read snapshot times, and ORBE can complete read-only transactions in one round by relying on physical clocks.

The use of clocks in strong consistency data stores

Clock-SI assumes loosely synchronized clocks that only move forward, and provides Snapshot Isolation consistency, where read-only transactions read from a consistent (possibly multi-versioned) snapshot, and other transactions commit if no object written by them was also written concurrently. To ensure safety against clock skews, Clock-SI introduces delays to read operations. Here is a link to my Clock-SI review.
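The read-delay rule can be sketched in a few lines (this is my illustration; the clock units and helper names are assumptions):

```python
import time

def read_at_snapshot(local_clock_ms, snapshot_ts_ms, do_read):
    """Delay a read until this partition's clock has passed the snapshot time.

    local_clock_ms: callable returning this partition's physical clock in ms.
    Reading earlier could miss a version that later commits with a smaller
    timestamp than snapshot_ts_ms.
    """
    while local_clock_ms() < snapshot_ts_ms:
        time.sleep(0.001)                  # wait out the clock skew
    return do_read(snapshot_ts_ms)
```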

Google Spanner employs TrueTime (which employs GPS and atomic clocks), and provides a strong consistency property: external consistency, which is also known as strict serializability. To ensure safety against clock skews, Spanner also introduces delays to read operations, and also delays commits in update operations to provide strict serializability. Here is a link to my Spanner review.

Finally CockroachDB, an open-source clone of Spanner, employs Hybrid Logical Clocks (HLC) in order to serialize transactions and ensure Snapshot Isolation and Serializable Snapshot Isolation. Hybrid Logical Clocks is my recent work in collaboration with Dr. Sandeep Kulkarni. HLC couples a physical clock with a scalar logical clock in order to efficiently order causally related transactions whose uncertainty intervals overlap. Here is a link to our Hybrid Logical Clocks (HLC) work.

I am quoting from the "On the use of Clocks to Enforce Consistency in the Cloud" paper about HLC: "Unlike Spanner, CockroachDB does not assume the availability of specialized hardware to ensure narrow bounds on clock synchronization, but relies on conventional NTP-based clock synchronization that frequently imposes clock skews of several tens of milliseconds. HLC is hence particularly beneficial in this case, as it allows for ensuring external consistency across causally related transactions while sparing from the costs of commit waits."
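For concreteness, here is a compact Python sketch of the HLC update rules from our paper: each node keeps an (l, c) pair, where l tracks the largest physical time heard so far and c is a bounded counter that breaks ties among events sharing the same l.

```python
import time

class HLC:
    def __init__(self, physical_clock=time.time):
        self.pt = physical_clock   # physical clock source
        self.l = 0                 # largest physical time heard so far
        self.c = 0                 # logical counter for events sharing the same l

    def send_or_local_event(self):
        l_old = self.l
        self.l = max(l_old, self.pt())
        self.c = self.c + 1 if self.l == l_old else 0
        return (self.l, self.c)    # timestamp attached to the message/event

    def receive_event(self, l_m, c_m):
        l_old = self.l
        self.l = max(l_old, l_m, self.pt())
        if self.l == l_old and self.l == l_m:
            self.c = max(self.c, c_m) + 1
        elif self.l == l_old:
            self.c += 1
        elif self.l == l_m:
            self.c = c_m + 1
        else:
            self.c = 0
        return (self.l, self.c)
```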


The paper has an intriguing discussion section. It makes the observation that we do not fully understand the trade-offs between logical and physical clocks yet, and mentions that HLC is an interesting and promising approach for investigating these tradeoffs. It gives some comparisons of the above protocols to show that time (in terms of its precision and comprehensiveness) is a resource that can be a factor in the performance and consistency tradeoffs in distributed data stores. The paper also talks about the costs of totally-ordered versus concurrent operations in distributed datastores. I found that this discussion makes similar points to my "distributed is not necessarily more scalable than centralized" post.

Use of clocks in distributed datastores for consistency/performance tradeoffs is certainly an interesting and fruitful research area nowadays.

So how does your favorite data store use clocks/version-stamps? How would changing to a different clock scheme affect performance versus consistency tradeoffs in that data store?

Earlier I discussed the use of clocks in Granola, and how upgrading to HLC can improve its performance and throughput.

Wednesday, April 22, 2015

Paper summary: A Taxonomy of Partitioned Replicated Cloud-based Database Systems

This paper is by Divy Agrawal, Amr El Abbadi, and Kenneth Salem, and appeared in IEEE Data Engineering journal in 2015.

This paper proposes a taxonomy of large-scale partitioned replicated transactional databases. Partitioned replicated means that the database is divided into partitions and the partitions are replicated across different sites, as in Figure 1. The motivation for partitioning is scalability, and the motivation for replication is to enable high availability even when some of the replicas are down. For geo-replicated databases, sites are maintained at different datacenters/regions, although the paper surveys non-geo-replicated databases as well.

The taxonomy

The taxonomy is based on the relationship between transaction management and replica management. This paper considers transactions that provide one-copy serializability guarantee, where concurrent transactions behave as if they execute sequentially on a single database. For a partitioned database, it is necessary to coordinate the database partitions to enforce transactional (ACID) guarantees. In addition, in order to support replication, the database system must also synchronize the database replicas so that replication can be hidden from the application.

Figure 2 shows the proposed taxonomy. Replicated object systems form a single leaf. Replicated transaction systems further divide into symmetric and asymmetric replicated systems.

Replicated object systems

Replicated object systems implement transaction management on top of replica management, which ensures that partitions are consistently replicated across sites. A prominent example of this category is Spanner; another is Granola (which is not geo-replicated).

If you would like to refresh your knowledge of these systems, here is a link to my Spanner review, and here is a link to my Granola review.

A tablet in Spanner is a collection of directories or fragments of directories. Tablets correspond to partitions of the database and they are replicated across sites. Spanner uses Paxos to synchronize the replicas of each partition across sites. Spanner uses a separate instance of Paxos, with a long-lived leader, for each partition.

To implement transactions (including multiple partition transactions) Spanner uses two-phase locking for concurrency control, and two-phase commit. The Paxos leader in each partition is responsible for participating in these protocols on behalf of that partition. It does so in much the same way that it would if its partition were not replicated, except that changes to the partition's state are replicated using Paxos instead of just being stored locally at the leader's server. The leaders serve as the link between the transaction protocols and Paxos by ensuring that both database updates and changes in the state of the transaction are replicated.

Replicated transaction systems

Replicated transaction systems implement replica management on top of transaction management. Transactions run over individual partition replicas, rather than one logical partition as was the case in replicated object systems.

Each site has a local transaction manager that ensures that its local transactions have ACID properties with respect to other local transactions at that site. Each local transaction is responsible for applying the effects of its parent global transaction to the replicas at its own local site.

In symmetric replicated transaction systems, all of the local transactions of a given parent transaction are the same and run concurrently at the different sites. In UCSB's replicated commit protocol, the global transaction is committed if the local coordinators at a majority of the sites vote to commit it. Correspondingly, a client that wishes to read data from a partition sends its read request to all replicas of that partition, waits for a majority of the replicas to respond, and chooses the latest version that it receives from that majority. (Similar to the idea in Attiya, Bar-Noy, and Dolev, 1995.)
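Here is a small sketch of that majority read (my own illustration, in the ABD spirit); the replica callables stand in for RPCs to the sites:

```python
# Ask every replica of the partition for its latest (version, value), wait
# for a majority to respond, and return the freshest value among them.

def majority_read(replicas):
    needed = len(replicas) // 2 + 1
    responses = []
    for ask in replicas:                   # in practice these RPCs are issued in parallel
        try:
            responses.append(ask())        # each returns a (version_number, value) pair
        except Exception:
            continue                       # tolerate unavailable sites
        if len(responses) >= needed:
            break
    if len(responses) < needed:
        raise RuntimeError("no majority of sites responded")
    return max(responses, key=lambda r: r[0])[1]   # value with the latest version
```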

In an asymmetric replicated system, one local master transaction runs first at a single site. If that local transaction is able to commit, then the remaining local transactions are run at the other sites. These remaining transactions are called update propagation transactions. Typically, the update propagation transactions perform only the updates of the master transaction, and do not perform any reads.

An example of an asymmetric replicated primary copy system is Microsoft's Cloud SQL Server, which is the database management system behind the SQL Azure cloud relational database service. Cloud SQL Server is not designed for geo-replication, so all of the database replicas ("sites") are located in the same datacenter. Transactions are limited to a single partition unless they are willing to run at the read committed SQL isolation level.

An example of an asymmetric replicated update anywhere system is Google Megastore, where database partitions (called entity groups) are replicated and geographically distributed. In Megastore, a client can initiate a single-partition transaction at any replica of that partition --typically, the client will use a nearby replica. For each partition, Megastore manages a transaction log, which is replicated to all sites using Paxos to ensure that transactions commit in the same order everywhere. Here is a link to my Megastore review.

I guess Yahoo PNUTS could be considered somewhere between a primary copy and an update anywhere system, due to its per-record master scheme.

What's next?

This taxonomy is useful for thinking about the cloud-based transactional database systems in a more systematic way. So where does your favorite transactional distributed database system fit? Are there inherent limitations/strengths to one category over another? Is it possible to have efficient multi-partition transaction capability for asymmetric replicated transaction systems?