Posts

Showing posts from April, 2015

Arrakis: The OS is the control plane

Image
This paper (authored by Simon Peter, Jialin Li, Irene Zhang, Dan R. K. Ports, Doug Woos, Arvind Krishnamurthy, and Thomas Anderson, University of Washington; Timothy Roscoe, ETH Z├╝rich) was awarded a best paper award in OSDI 2014. 

The paper "described and evaluated Arrakis, a new operating system designed to remove the kernel from the I/O data path without compromising process isolation. Unlike a traditional operating system, which mediates all I/O operations to enforce process isolation and resource limits, Arrakis uses device hardware to deliver I/O directly to a customized user-level library. The Arrakis kernel operates in the control plane, configuring the hardware to limit application misbehavior."

The Arrakis paper avoids mentioning containers, but what they propose has a lot of applicability to the containers technology. Containers aim to provide isolation/portability of VM without incurring the overhead of VMs. So containers run an application set on the OS and raw …

Large-scale cluster management at Google with Borg

Image
This paper is by Abhishek Verma, Luis Pedrosa, Madhukar Korupolu, David Oppenheimer, Eric Tune, and John Wilkes and it appeared recently in EuroSys 2015.

Google's Borg is a cluster manager that admits, schedules, starts, restarts, and monitors all applications that Google runs. Borg runs 100K of jobs across a number of clusters each with 10K of machines.

Borg cells (1000s of machines that belong to a single cluster and are managed as a unit) run a heterogenous workload with two main parts. The first is long-running services that should never go down, and handle quick requests: e.g., Gmail, Google Docs, web search, BigTable. The second is user submitted batch jobs. Each job consists of multiple tasks that all run the same program (binary).

Each task maps to a set of Linux processes running in a container on a machine. The vast majority of the Borg workload does not run inside virtual machines (VMs) in order to avoid the cost of virtualization. Containers are so hot right now.

Borgm…

Paper Summary: On the use of Clocks to Enforce Consistency in the Cloud

This paper is by Manuel Bravo, Nuno Diegues, Jingna Zeng, Paolo Romano, Luis Rodrigues, and appeared in IEEE Data Engineering Bulletin 2015.

The purpose of this paper is to revisit how the logical and physical clock concepts are applied in the context of developing distributed data store systems for the cloud and review the choice of clocks in relation to consistency/performance tradeoffs.

The use of clocks in weak consistency data stores  Dynamo employs sloppy quorums and hinted hand-off and uses version vector (a special case of vector clocks) to track causal dependencies within the replication group of each key. A version vector contains one entry for each replica (thus the size of clocks grows linearly with the number of replicas). The purpose of this metadata is to detect conflicting updates and to be used in the conflict reconciliation function.
Here is a link to my Dynamo review.

COPS is a geo-replicated datastore and it assigns a scalar clock to each object. Clients maintain th…

Paper summary: A Taxonomy of Partitioned Replicated Cloud-based Database Systems

Image
This paper is by Divy Agrawal, Amr El Abbadi, and Kenneth Salem, and appeared in IEEE Data Engineering journal in 2015.

This paper proposes a taxonomy of large scale partitioned replicated transactional databases. Partitioned replicated means, the database is divided in partitions and the partitions are replicated across different sites as in Figure 1. The motivation for partitioning is scalability, and the motivation for replication is to enable high availability even when some of the replicas are down. For geo-replicated databases, sites are maintained at different datacenters/regions, although the paper surveys non-geo-replicated databases as well.
The taxonomy The taxonomy is based on the relationship between transaction management and replica management. This paper considers transactions that provide one-copy serializability guarantee, where concurrent transactions behave as if they execute sequentially on a single database. For a partitioned database, it is necessary to coordina…

GraphX: Graph processing in a distributed dataflow framework

Image
This paper appeared in OSDI'14, and is authored by Joseph E. Gonzalez, University of California, Berkeley; Reynold S. Xin, University of California, Berkeley, and Databricks; Ankur Dave, Daniel Crankshaw, and Michael J. Franklin, University of California, Berkeley; Ion Stoica, University of California, Berkeley, and Databricks. This link includes video and slides which are useful to understand the paper.

This paper comes from the AMP lab at UC Berkeley. (Nice name! AMP stands for Algorithms, Machines, and People.) This lab brought to us Spark and GraphLab. And this paper is a logical successor. This paper is about marrying Spark (dataflow systems) with GraphLab (graph-processing systems).

Motivation Here is the motivation for this merger. In large-scale computation, we need both dataflow processing and graph processing systems. Graph-processing systems outperform dataflow processing systems by an order of magnitude for iterative computations on graphs (e.g., connected-component ana…

All file systems are not created equal: On the complexity of crafting crash-consistent applications

Image
This paper appeared in OSDI'14 and is authored by Thanumalayan Sankaranarayana Pillai, Vijay Chidambaram, Ramnatthan Alagappan, Samer Al-Kiswany, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau at University of Wisconsin–Madison.

A previous OSDI'14 paper we discussed had said almost every failure is due to bad exception/error-handling. But this paper shows that even when you divine the correct error-handling/recovery code, it may still not work. The layering abstraction leaks, and the filesystem underneath may do funny things in a crash.

The paper considers an important and timely problem, because many important applications, including databases such as SQLite and key-value stores such as LevelDB, are currently implemented on top of file systems instead of directly on raw disks. Such data-management applications must be crash consistent, but achieving this goal atop modern file systems is challenging because the exact guarantees provided by file systems are unclear and un…

Popular posts from this blog

I have seen things

SOSP19 File Systems Unfit as Distributed Storage Backends: Lessons from 10 Years of Ceph Evolution

PigPaxos: Devouring the communication bottlenecks in distributed consensus

Learning about distributed systems: where to start?

My Distributed Systems Seminar's reading list for Fall 2020

Fine-Grained Replicated State Machines for a Cluster Storage System

My Distributed Systems Seminar's reading list for Spring 2020

Cross-chain Deals and Adversarial Commerce

Book review. Tiny Habits (2020)

Zoom Distributed Systems Reading Group