Large-scale cluster management at Google with Borg

This paper from Google appeared on Eurosys'15. The paper presents Borg, the cluster management system Google used since 2005. The paper includes a section at the end about the good and bad lessons learned from using Borg, and how these led to the development of Kubernetes container-management system which empowers the Google Cloud Platform and App Engine.

Borg architecture



This is the Borg. Resistance is futile.

A median Borg cell is 10K machines. And all those machines in a cell are served by a logically centralized control: the Borgmaster.

Where is the bottleneck in the centralized Borg architecture? The paper says it is still unclear whether this architecture would hit a practical scalability limit. Anytime Borg was given a scalability target, they managed to achieve it by applying basic techniques: caching, loose-synchronization, and aggregation.

What helped the most for achieving scalability was decoupling the scheduler component from the Borgmaster. The scheduler is loosely-synchronized with the Borgmaster: it operates on a cached cached copy of the cell state and acts as a counsel/advisor to the Borgmaster. If the scheduler makes a decision that is not feasible (because it is based of an outdated state: machine failed, resource gone, etc.), the Borgmaster will not take that advice and ask the scheduler to reschedule the job this time hopefully with better up-to-date state.

To provide high-availability, the Borgmaster is Paxos-replicated over 5 machines. Replicas serve read-only RPC calls to reduce the workload on the Borgmaster leader. In addition to the Paxos log, there is also periodic checkpoints/snapshots to restore the Borgmaster's state to an arbitrary point in the past. A fauxmaster can also use this functionality in debugging of the Borgmaster and scheduling performance.

A Borglet is the local Borg agent on every machine in a cell. (In Mesos this corresponds to the Mesos slave, or in the new terminology the Mesos agent.) Borgmaster replica runs a stateless link shard to handle the communication with some subset of borglets. The link shard aggregates and compresses and reports only diffs to the state machines to reduce update load at the elected master.

Jobs and tasks



A job consists of many tasks (which are same binary programs). 50% of machines run 9+ tasks, and 90%ile machine has ~25 tasks and run ~4500 threads.

Google's Borg workload consists of 2 main categories. Production jobs are long running services serving short user requests and they require low-latency. Batch jobs on the other hand are less-sensitive to performance fluctuations. The workload has dynamic surges: batch jobs come and go, and productions jobs have a diurnal pattern. (A representative Borg workload trace is publicly available.) Borg needs to handle this dynamic demand while providing as high utilization of the cluster machines as possible.

It turns out tight-packing scheduling is not optimal for high-utilization, because it is too strict and fails to accommodate for bursty loads and misestimations from Borg clients. Instead a hybrid packing is used, which provides 5% better packing efficiency than the tight-packing/best-fit policy. Borg uses priorities for tasks. If a machine runs out of resources to accommodate its assigned tasks (e.g., due to burst in demands), lower priority tasks on that machine are killed and added to the scheduler's pending queue for re-placement.

Users operate on jobs by issuing remote procedure calls (RPCs) to Borg, most commonly from a command-line tool or from other Borg jobs. To help users manage their jobs, Borg provides declarative job specification language, and job monitoring/management tools. Borg uses the concept of allocation set for a job, which corresponds to the concept of pod in Kubernetes.

Task startup latency at a machine is about 25seconds, 20 sec of which is package installation time. To reduce the latency from package installation, Borg tries to schedule tasks where the packages are already available. In addition, Borg employs tree and torrent-like protocols to distributes packages to machines in parallel. Finally, Borg also tries to schedule tasks to reduce correlation of failures for a given job.

Almost every task contains a builtin HTTP server that publishes health and performance info. Borg monitors the health-check URL and restarts tasks that fail to respond.

Comments

Unknown said…
While Kubernetes is certainly a successor of Borg in terms of lessons learned in design, I don't see anything in the paper or elsewhere that Kubernetes has become Google's cluster manager at large. Do you have other evidence that they have replaced Borg with Kubernetes? This would be interesting, as Kubernetes currently has relatively small limits on size of clusters that would seem to prevent Google from using it as their main cluster manager.
Murat said…
Mike, thanks, good point. I don't have good information on this, so I edited the text so that I don't seem to claim Kubernetes replaced Borg totally.
Elvin said…
Anecdotally, I recall hearing that Google's internal ecosystem is too coupled to Borg, that migrating from it would take too much time/effort to happen any time soon.

Popular posts from this blog

The end of a myth: Distributed transactions can scale

Hints for Distributed Systems Design

Foundational distributed systems papers

Learning about distributed systems: where to start?

Metastable failures in the wild

Scalable OLTP in the Cloud: What’s the BIG DEAL?

The demise of coding is greatly exaggerated

SIGMOD panel: Future of Database System Architectures

Dude, where's my Emacs?

There is plenty of room at the bottom