Thursday, February 16, 2017

Mesos: A platform for fine-grained resource sharing in the data center

This paper appeared in NSDI 11 and introduced the Mesos job management and scheduling platform which proved to be very influential in the big data processing ecosystem. Mesos has seen a large following because it is simple and minimalist. This reminds me of the "worse is better" approach to system design. This is an important point and I will ruminate about this after I explain you the Mesos platform.

The problem 

We need to make multiple frameworks coexist and share the computing resources in a cluster. Yes, we have submachine scheduling abstractions: first the virtual machines and then containers. But we still need a coordinator/arbiter to manage/schedule jobs submitted from these frameworks  to make sure that we don't underutilize or overload/overtax the resources in the cluster.

Offer-based scheduling

Earlier, I have talked about Borg which addressed this cluster management problem. While Borg (and later Kubernetes) takes a request-based scheduling approach, Mesos chooses to provide an offer-based scheduling approach.

In the request-based scheduling, the frameworks provide their scheduling needs to the scheduler/controller and the scheduler/controller decides where to place the tasks and launches them. This can arguably make the design of the controller overcomplicated. The scheduler/controller may need to understand too many details about multiple frameworks in order to perform their requests adequately. This may not scale well as the number of frameworks to support grows. (And we have an abundance of big data processing frameworks.)

In stark contrast, Mesos delegates the control over scheduling to the frameworks. The Mesos master (i.e., the controller) provides resource offers to the frameworks,  and the frameworks decide which resources to accept and which tasks to run on them.

In other words, Mesos takes a small-government, libertarian approach to cluster management :-)

Since Mesos is minimalist, it is simple and nicely decoupled from the various frameworks it serves. This made Mesos go viral and achieve high-adoption. But systems design is an exercise in choosing which tradeoffs you make. Let's study the drawbacks. (I am putting on my critical hat, I will talk about the benefits of Mesos again toward the end of this post.)

The long-running tasks, and the big tasks strain this offer-based scheduling model. Some frameworks may schedule tasks that can overstay their welcome, and take advantage of the too trusting and hands-off Mesos. This would be unfair to other client frameworks. (Of course the Mesos master may take into account "organizational policies such as fair sharing" when extending offers to the frameworks, and can even kill long running tasks.)

Moreover, since Mesos is hands-off, it does not provide fault-tolerance support for long-running tasks, which are more likely to experience failure in their lifetimes as they run longer. Mesos punts the ball to the client frameworks which will need to carry the burden. And doing this for each client framework may lead to redundant/wasted effort. Fortunately other helper systems like Marathon emerged to address this issue and provide support for long running tasks.

Even assuming that the client frameworks are not-greedy and on their best cooperating behavior, they may not have enough information about other tasks/clients of Mesos to make optimal scheduling decisions. The paper mentions that: "While this decentralized scheduling model may not always lead to globally optimal scheduling, we have found that it performs surprisingly well in practice, allowing frameworks to meet goals such as data locality nearly perfectly."

Related to this problem, another very simple idea makes a cameo appearance in the paper: "We used delay scheduling to achieve data locality by waiting for slots on the nodes that contain task input data. In addition, our approach allowed us to reuse Hadoop's existing logic for re-scheduling of failed tasks and for speculative execution (straggler mitigation)."

Is that too simple a technique? Well, it is hard to argue with results. The delay scheduling paper has received 1000+ citations since 2010. Delay scheduling: A simple technique for achieving locality and fairness in cluster scheduling. In EuroSys 10, 2010. 

Mesos architecture

Mesos means middle or intermediate, from Greek misos. Nice name.

Mesos master is ZooKeeper guarded, so a hot standby can get in and take over if the Mesos master fails. The Mesos master manages the resources by talking to Mesos slaves/workers on the machines in the cluster. This is similar to how BorgMaster manages resources talking to Borglets on the machines.

So where is the scheduler in this architecture? This responsibility is punted to the client frameworks. As we mentioned above, the Mesos master provides offers to the client frameworks, and it is upto the client framework to accept an offer.

Here is how things work from the client framework's perspective. Each framework intending to use Mesos needs to implement two components: a scheduler that registers with the Mesos master to be offered resources, and an executor that is launched on Mesos worker nodes to run the framework’s tasks.

This table shows the callbacks and actions to implement to write the scheduler and the executor components. (To Python users, there is pyMesos to help you write the scheduler and executor components in Python.)

In the resourceOffer callback, the scheduler should implement the functionality to select which of the offered resources to reject and which to use along with how to pass Mesos a description of the tasks it wants to launch on them. What if that offer became unavailable in the meanwhile? The Mesos master will then warn the Scheduler via the offerRescinded callback that the offer has been rescinded, and it is the client framework's responsibility to handle this and reschedule the job using the next offers from the Mesos master.

The implementation of the scheduler gets more and more involved if the framework would like to keep track of tasks for a submitted job and provide the users of the framework this information. The scheduler gets callbacks on statusUpdate of the tasks, but it needs to piece together and track which job these tasks correspond to. For example, the scheduler gets a callback when a task is finished, and then it is the responsibility of the scheduler to check and mark a job as completed when all its tasks are finished.

This scheduler/executor abstraction can also get leaky. The paper mentions this about the Hadoop port, which came to a total of 1500 lines of code: "We also needed to change how map output data is served to reduce tasks. Hadoop normally writes map output files to the local filesystem, then serves these to reduce tasks using an HTTP server included in the TaskTracker. However, the TaskTracker within Mesos runs as an executor, which may be terminated if it is not running tasks. This would make map output files unavailable to reduce tasks. We solved this problem by providing a shared file server on each node in the cluster to serve local files. Such a service is useful beyond Hadoop, to other frameworks that write data locally on each node."

If you are thin like Mesos, you can (should?) add on weight later when it is warranted.

A side remark: What is it with the "fine-grained" in the title?

The title is a humble and conservative title: "Mesos: A platform for fine-grained resource sharing in the data center". The paper seems insecure about this issue, and keeps referring back to this to emphasize that Mesos works best with fine-grained short tasks. This gets peculiar for a careful reader.

Well, I think I know the reason for this peculiarity. Probably the authors may have been burned before about this from an over-critical reviewer (it is always Reviewer 2!), and so they are trying to preemptively dismantle the same criticism to be aired again. This is a very solid and important paper, but I wouldn't be surprised even a paper of this caliber may have been rejected earlier and this version may be their second (or even third) submission. Reviewer 2 might have told the authors  not too subtly that the paper is claiming too much credit (which is always a big pet peeve of reviewer 2), and the current version of the paper is written defensively to guard against this criticism.

Oh, the joys of academia. I wouldn't be surprised if Reviewer 2 also found the paper low on novelty and suitable more for industrial research and not for academic research.


Yes, Mesos is too eager to punt the ball to the clients. But this is not necessarily a bug, it can be a feature. Mesos is thin, and  gives your frameworks control over how to schedule things. Mesos doesn't step in your way and provides your frameworks  low-level control over scheduling and management decisions.

Mesos reminds me of the "worse-is-better" approach to system design. (Ok, read that link, it is important. I will wait.) Since Mesos is minimalist and simple it is a viral platform. (Much like MapReduce and Hadoop were.)

Borg/Kubernetes aims to do "the-right-thing". They provide a blackbox cluster manager that provides a lot of features, optimal scheduling, fault-tolerance, etc. This is great if you fit into the workloads that they cater to, which covers most of the web-services workloads. But this approach may actually get in your way if you like to have low-layer control on scheduling/management decisions.

I read the "worse is better" when I was a fresh graduate student in 1999 working on the theory side of distributed algorithms and self-stabilization. I was a Dijkstra fan, and this article was a real eye opener for me. It made me to question my faith :-)

Worse is better takes a simplistic/minimalist approach. Simple and minimal can be powerful, if it is not done too ugly. Is worse always better? No. As I said earlier systems design is all about tradeoffs. It is important to analyze and decide in advance what the priorities are.

I feel like I will pick up on this thread at another time.

1 comment:

Kartik said...

Indeed the work was rejected from OSDI first; it was called "Nexus" (instead of Mesos) at the time. Bryan Cantrill tells more of the story in his Usenix ATC '16 talk:

Web page for audio: