Tuesday, February 21, 2017

1 million pageviews

My blog has recently reached 1 million pageviews. This warrants for a short retrospection.

I started the posting regularly on September 2010. I wanted to get into the cloud computing domain, so I needed to accumulate background on cloud computing work. I decided that as I read papers on cloud computing, I will post a summary to this blog. I thought if I could explain what I learned from the papers in my own words, I would internalize those lessons better. And if others read those summaries and benefit, that is an extra plus.

"Writing is nature's way of telling you how sloppy your thinking is." In fact, I learned a lot writing those paper reviews. Writing the reviews gave me a deeper understanding of the work done, beyond what I could achieve by passively reading them. Putting them on web was also a nice choice, because I could refer my students to some of these summaries when needed. And it turned out that I referred to those summaries myself very frequently to jog my memory. Since I have encoded the writing with my understanding, reading through my summary would get me refreshed about the important lessons from that work. Since my summaries were made available on the web, all I needed to do was google search for muratbuffalo and paper name.

(Side remark about my research journey: My research area at the first couple years of my PhD was distributed algorithms and self-stabilization. Then starting on 2002, wireless sensor networks has become my research area. I applied stabilizing distributed algorithms for in-network querying and tracking in wireless sensor networks. Around 2009 I started transitioning to crowdsourced sensing and collaboration using smartphones. And starting from 2010, I transitioned to large-scale cloud computing systems. Distributed systems has been the common theme through out. Here is a link to my research statement as of the end of 2016.)

Over time I included posts about my conference trips, book reviews, rants, and research advice for students. Putting research advice (reading, writing, presenting) is also beneficial because I can refer my students to it. And occasionally I receive emails from remote corners of the world about how some of these posts helped them or inspired them, and that makes me very happy for an entire day.

Some of the big hits 

The bursty traffic all came from Hacker News. The regular traffic came from many sources: Google searches, blog posts, twitter links.

Google tells me I can earn up to $18.50 per month by placing ads on my blog using AdSense. No thanks, for now.

Here are the top 10 posts in my blog as of now. Looks like anything mentioning Facebook is a big hit. Deep learning is also very hot. Glad to see our hybrid logical clocks work also up there. And glad to see interest for TLA+.

Saturday, February 18, 2017

Bowling your way to the top

"Oh, this is very American!" I said, when I finally understood how Bowling scoring works.

Bowling scoring is nonlinear

In a bowling game, there are 10 rounds. There are 10 pins, and you get 2 shoots in each round to knock as many as you can.

Even if you are novice, if you are eager and put effort in it, at each round you can knock down 6 pins. So that gives you a score of 6*10=60.

If you knock down 7 pins at each round, you get a score of 70.
8 pins, you get a score of 80.
9 pins, you get a score of 90.

Here is where things start to go nonlinear and you get accelerated returns. If you knock down all the 10 pins in your two shoots, this is called a spare. Your score for that round is not just 10, but the point you get from the next round is also added to it. So if you had a spare in round k, and got 7 in the next round k+1, you get 10+7 for round k, and 7 for round k+1, and in total of 17+7=24 points from these two rounds. If we were scoring this linearly, you would only get 10+7=17.

If you knock down all the 10 pins in your first shoot in a round, this is called a strike. Your score for that round is not just 10, but the points you get from the next *two* rounds get added to it. If you had a strike in round k, and got 7 in round k+1 and k+2, you get 10+7+7=24 points for round k, 7 for k+1, and 7 for k+2, and a total of 38 points from these 3 rounds. If we were scoring this linearly, you would only get 10+7+7=24 from these 3 rounds.

In the first game I played, I was knocking about 7 pins each round, so I thought, I should be in pretty good shape. Wrong. I got 4th place. The first two guys were experienced, they managed to hit sequences of strikes and spares, and their score grew very fast. My third place friend had managed to hit a series of spares towards the end of the game, and has beaten my score before I could understand my score was being beaten. I thought I was comfortably ahead.

It is more important to hit occasional strikes and spares than hitting a constantly comfortable 7 average.

So let's get back to where we left, the transition from linear to nonlinear scoring.
All 8 pins at all rounds, you get a total score of 80.
All 9, you get a total score of 90.
All spares, you get a total score of 200, instead of 100.
All strikes, you get a total score of 300, instead of 100.

And that last one is called a perfect game. 
Here is a short video of a perfect game.

OK so what?

Why am I wasting my time and your time telling you about bowling scoring?

If you were born and raised in US, you might have yawned reading through the above text. You might be taking the scoring for granted. In fact, when I realized the scoring works in a "funny" way, I asked my American friends to explain. They didn't have much previous practice explaining the scoring. One of them said, after stalling for some time, "Hmm, I realize this is the first time I am explaining Bowling scoring to someone." And this guy has played in a Bowling league for a couple years :-)

After a couple more takes of explaining/questioning with a second friend, when I finally understood what is going on, I blurted: "Oh, this is very American!", which surprised my friend.

If you take this scoring for granted, all I will tell you is this: "Three points for a win" is a relatively recent adoption in soccer scoring. Before that, it was 0 points for loss, 1 points for draw, and 2 points for win. And the games were so boring.

Teams would go for a draw, because the prospect of gaining one extra point by putting effort into attacking was not worth risking your defensive stance which could make you lose the game and get no points. The transition to three points for a win started only after 1980 taking up to 2000 in some countries. And this led to a significant increase of average goals scored in the games.

This is not about bowling, isn't it?

Yes, you see, free markets are inherently nonlinear scoring markets. Nonlinear scoring applies especially for the current information technology markets, where "the best performers are able to capture a very large share of the rewards, and the remaining competitors are left with very little". In such a winner-take-all economy, you run the risk of being overlooked if your products are mediocre. You need to hit some strikes.

This is also true in academia. Yes, you need to show that you are publishing productively, and there is some pebble counting. But in order for those pebbles to count, you need some occasional gems in between. You need to hit some strikes.

It is more important to hit occasional strikes and spares than hitting a constantly comfortable 7 average.

You need to think big, aim big, and go for a strike, so you can achieve nonlinear returns occasionally. 

Other related links

1. Wait a minute? Didn't I tell you a couple days ago "worse is better"?
Yes, I did. But this is how I concluded that post: "Worse is better takes a simplistic/minimalist approach. Simple and minimal can be powerful, if it is not done too ugly. Is worse always better? No. As I said earlier systems design is all about tradeoffs. It is important to analyze and decide in advance what the priorities are."

In fact, a worse-is-better system hits a strike in a priority dimension, such as being minimalist and going viral. On the other hand, a do-the-right-thing system may get stuck with hitting constantly comfortable of 7 average in all dimensions.

2. This nonlinear return idea also reminds me of the high-intensity interval training (HIT) idea. Tim Ferris had a very interesting interview with Prof. Martin Gibala on this.  The idea in HIT is that you get accelerated returns for the short nonlinear effort you put into your training.

Thursday, February 16, 2017

Mesos: A platform for fine-grained resource sharing in the data center

This paper appeared in NSDI 11 and introduced the Mesos job management and scheduling platform which proved to be very influential in the big data processing ecosystem. Mesos has seen a large following because it is simple and minimalist. This reminds me of the "worse is better" approach to system design. This is an important point and I will ruminate about this after I explain you the Mesos platform.

The problem 

We need to make multiple frameworks coexist and share the computing resources in a cluster. Yes, we have submachine scheduling abstractions: first the virtual machines and then containers. But we still need a coordinator/arbiter to manage/schedule jobs submitted from these frameworks  to make sure that we don't underutilize or overload/overtax the resources in the cluster.

Offer-based scheduling

Earlier, I have talked about Borg which addressed this cluster management problem. While Borg (and later Kubernetes) takes a request-based scheduling approach, Mesos chooses to provide an offer-based scheduling approach.

In the request-based scheduling, the frameworks provide their scheduling needs to the scheduler/controller and the scheduler/controller decides where to place the tasks and launches them. This can arguably make the design of the controller overcomplicated. The scheduler/controller may need to understand too many details about multiple frameworks in order to perform their requests adequately. This may not scale well as the number of frameworks to support grows. (And we have an abundance of big data processing frameworks.)

In stark contrast, Mesos delegates the control over scheduling to the frameworks. The Mesos master (i.e., the controller) provides resource offers to the frameworks,  and the frameworks decide which resources to accept and which tasks to run on them.

In other words, Mesos takes a small-government, libertarian approach to cluster management :-)

Since Mesos is minimalist, it is simple and nicely decoupled from the various frameworks it serves. This made Mesos go viral and achieve high-adoption. But systems design is an exercise in choosing which tradeoffs you make. Let's study the drawbacks. (I am putting on my critical hat, I will talk about the benefits of Mesos again toward the end of this post.)

The long-running tasks, and the big tasks strain this offer-based scheduling model. Some frameworks may schedule tasks that can overstay their welcome, and take advantage of the too trusting and hands-off Mesos. This would be unfair to other client frameworks. (Of course the Mesos master may take into account "organizational policies such as fair sharing" when extending offers to the frameworks, and can even kill long running tasks.)

Moreover, since Mesos is hands-off, it does not provide fault-tolerance support for long-running tasks, which are more likely to experience failure in their lifetimes as they run longer. Mesos punts the ball to the client frameworks which will need to carry the burden. And doing this for each client framework may lead to redundant/wasted effort. Fortunately other helper systems like Marathon emerged to address this issue and provide support for long running tasks.

Even assuming that the client frameworks are not-greedy and on their best cooperating behavior, they may not have enough information about other tasks/clients of Mesos to make optimal scheduling decisions. The paper mentions that: "While this decentralized scheduling model may not always lead to globally optimal scheduling, we have found that it performs surprisingly well in practice, allowing frameworks to meet goals such as data locality nearly perfectly."

Related to this problem, another very simple idea makes a cameo appearance in the paper: "We used delay scheduling to achieve data locality by waiting for slots on the nodes that contain task input data. In addition, our approach allowed us to reuse Hadoop's existing logic for re-scheduling of failed tasks and for speculative execution (straggler mitigation)."

Is that too simple a technique? Well, it is hard to argue with results. The delay scheduling paper has received 1000+ citations since 2010. Delay scheduling: A simple technique for achieving locality and fairness in cluster scheduling. In EuroSys 10, 2010. 

Mesos architecture

Mesos means middle or intermediate, from Greek misos. Nice name.

Mesos master is ZooKeeper guarded, so a hot standby can get in and take over if the Mesos master fails. The Mesos master manages the resources by talking to Mesos slaves/workers on the machines in the cluster. This is similar to how BorgMaster manages resources talking to Borglets on the machines.

So where is the scheduler in this architecture? This responsibility is punted to the client frameworks. As we mentioned above, the Mesos master provides offers to the client frameworks, and it is upto the client framework to accept an offer.

Here is how things work from the client framework's perspective. Each framework intending to use Mesos needs to implement two components: a scheduler that registers with the Mesos master to be offered resources, and an executor that is launched on Mesos worker nodes to run the framework’s tasks.

This table shows the callbacks and actions to implement to write the scheduler and the executor components. (To Python users, there is pyMesos to help you write the scheduler and executor components in Python.)

In the resourceOffer callback, the scheduler should implement the functionality to select which of the offered resources to reject and which to use along with how to pass Mesos a description of the tasks it wants to launch on them. What if that offer became unavailable in the meanwhile? The Mesos master will then warn the Scheduler via the offerRescinded callback that the offer has been rescinded, and it is the client framework's responsibility to handle this and reschedule the job using the next offers from the Mesos master.

The implementation of the scheduler gets more and more involved if the framework would like to keep track of tasks for a submitted job and provide the users of the framework this information. The scheduler gets callbacks on statusUpdate of the tasks, but it needs to piece together and track which job these tasks correspond to. For example, the scheduler gets a callback when a task is finished, and then it is the responsibility of the scheduler to check and mark a job as completed when all its tasks are finished.

This scheduler/executor abstraction can also get leaky. The paper mentions this about the Hadoop port, which came to a total of 1500 lines of code: "We also needed to change how map output data is served to reduce tasks. Hadoop normally writes map output files to the local filesystem, then serves these to reduce tasks using an HTTP server included in the TaskTracker. However, the TaskTracker within Mesos runs as an executor, which may be terminated if it is not running tasks. This would make map output files unavailable to reduce tasks. We solved this problem by providing a shared file server on each node in the cluster to serve local files. Such a service is useful beyond Hadoop, to other frameworks that write data locally on each node."

If you are thin like Mesos, you can (should?) add on weight later when it is warranted.

A side remark: What is it with the "fine-grained" in the title?

The title is a humble and conservative title: "Mesos: A platform for fine-grained resource sharing in the data center". The paper seems insecure about this issue, and keeps referring back to this to emphasize that Mesos works best with fine-grained short tasks. This gets peculiar for a careful reader.

Well, I think I know the reason for this peculiarity. Probably the authors may have been burned before about this from an over-critical reviewer (it is always Reviewer 2!), and so they are trying to preemptively dismantle the same criticism to be aired again. This is a very solid and important paper, but I wouldn't be surprised even a paper of this caliber may have been rejected earlier and this version may be their second (or even third) submission. Reviewer 2 might have told the authors  not too subtly that the paper is claiming too much credit (which is always a big pet peeve of reviewer 2), and the current version of the paper is written defensively to guard against this criticism.

Oh, the joys of academia. I wouldn't be surprised if Reviewer 2 also found the paper low on novelty and suitable more for industrial research and not for academic research.


Yes, Mesos is too eager to punt the ball to the clients. But this is not necessarily a bug, it can be a feature. Mesos is thin, and  gives your frameworks control over how to schedule things. Mesos doesn't step in your way and provides your frameworks  low-level control over scheduling and management decisions.

Mesos reminds me of the "worse-is-better" approach to system design. (Ok, read that link, it is important. I will wait.) Since Mesos is minimalist and simple it is a viral platform. (Much like MapReduce and Hadoop were.)

Borg/Kubernetes aims to do "the-right-thing". They provide a blackbox cluster manager that provides a lot of features, optimal scheduling, fault-tolerance, etc. This is great if you fit into the workloads that they cater to, which covers most of the web-services workloads. But this approach may actually get in your way if you like to have low-layer control on scheduling/management decisions.

I read the "worse is better" when I was a fresh graduate student in 1999 working on the theory side of distributed algorithms and self-stabilization. I was a Dijkstra fan, and this article was a real eye opener for me. It made me to question my faith :-)

Worse is better takes a simplistic/minimalist approach. Simple and minimal can be powerful, if it is not done too ugly. Is worse always better? No. As I said earlier systems design is all about tradeoffs. It is important to analyze and decide in advance what the priorities are.

I feel like I will pick up on this thread at another time.

Saturday, February 11, 2017

Large-scale cluster management at Google with Borg

This paper from Google appeared on Eurosys'15. The paper presents Borg, the cluster management system Google used since 2005. The paper includes a section at the end about the good and bad lessons learned from using Borg, and how these led to the development of Kubernetes container-management system which empowers the Google Cloud Platform and App Engine.

Borg architecture

This is the Borg. Resistance is futile.

A median Borg cell is 10K machines. And all those machines in a cell are served by a logically centralized control: the Borgmaster.

Where is the bottleneck in the centralized Borg architecture? The paper says it is still unclear whether this architecture would hit a practical scalability limit. Anytime Borg was given a scalability target, they managed to achieve it by applying basic techniques: caching, loose-synchronization, and aggregation.

What helped the most for achieving scalability was decoupling the scheduler component from the Borgmaster. The scheduler is loosely-synchronized with the Borgmaster: it operates on a cached cached copy of the cell state and acts as a counsel/advisor to the Borgmaster. If the scheduler makes a decision that is not feasible (because it is based of an outdated state: machine failed, resource gone, etc.), the Borgmaster will not take that advice and ask the scheduler to reschedule the job this time hopefully with better up-to-date state.

To provide high-availability, the Borgmaster is Paxos-replicated over 5 machines. Replicas serve read-only RPC calls to reduce the workload on the Borgmaster leader. In addition to the Paxos log, there is also periodic checkpoints/snapshots to restore the Borgmaster's state to an arbitrary point in the past. A fauxmaster can also use this functionality in debugging of the Borgmaster and scheduling performance.

A Borglet is the local Borg agent on every machine in a cell. (In Mesos this corresponds to the Mesos slave, or in the new terminology the Mesos agent.) Borgmaster replica runs a stateless link shard to handle the communication with some subset of borglets. The link shard aggregates and compresses and reports only diffs to the state machines to reduce update load at the elected master.

Jobs and tasks

A job consists of many tasks (which are same binary programs). 50% of machines run 9+ tasks, and 90%ile machine has ~25 tasks and run ~4500 threads.

Google's Borg workload consists of 2 main categories. Production jobs are long running services serving short user requests and they require low-latency. Batch jobs on the other hand are less-sensitive to performance fluctuations. The workload has dynamic surges: batch jobs come and go, and productions jobs have a diurnal pattern. (A representative Borg workload trace is publicly available.) Borg needs to handle this dynamic demand while providing as high utilization of the cluster machines as possible.

It turns out tight-packing scheduling is not optimal for high-utilization, because it is too strict and fails to accommodate for bursty loads and misestimations from Borg clients. Instead a hybrid packing is used, which provides 5% better packing efficiency than the tight-packing/best-fit policy. Borg uses priorities for tasks. If a machine runs out of resources to accommodate its assigned tasks (e.g., due to burst in demands), lower priority tasks on that machine are killed and added to the scheduler's pending queue for re-placement.

Users operate on jobs by issuing remote procedure calls (RPCs) to Borg, most commonly from a command-line tool or from other Borg jobs. To help users manage their jobs, Borg provides declarative job specification language, and job monitoring/management tools. Borg uses the concept of allocation set for a job, which corresponds to the concept of pod in Kubernetes.

Task startup latency at a machine is about 25seconds, 20 sec of which is package installation time. To reduce the latency from package installation, Borg tries to schedule tasks where the packages are already available. In addition, Borg employs tree and torrent-like protocols to distributes packages to machines in parallel. Finally, Borg also tries to schedule tasks to reduce correlation of failures for a given job.

Almost every task contains a builtin HTTP server that publishes health and performance info. Borg monitors the health-check URL and restarts tasks that fail to respond.

Two-phase commit and beyond

In this post, we model and explore the two-phase commit protocol using TLA+. The two-phase commit protocol is practical and is used in man...