Posts

Book Review: "Rework", 2010, Jason Fried & David Heinemeier Hansson

This is a book from the founders of 37signals. The book is full of simple, commonsense advice for business and work. Yet this advice also feels very fresh. How come? Because we have been trying, inventing, and pushing complex, complicated rules and approaches for business and work in lieu of simple, straightforward ones, the commonsense advice in the book comes across as surprisingly fresh. To give one example, on culture, the book says: "You can't build it, it occurs. Whatever you value and do (actions speak louder than words) will become the culture of your company/workers over time." Commonsense? Yes. But only after you take a step back and think do you say, "Huh. That is obvious." Another obvious-in-hindsight piece of advice: underdo your competition; just do fewer things, but do them better. You should stand for something. A strong stand is how you attract superfans. Covering everything means you don't have focus. Provide the main functionalit...

How to run effective paper reading groups

Every year, I offer a distributed systems reading group seminar, where we discuss recent interesting research papers. (Here is the list of papers we covered this Spring.) Reading group seminars are a lot of fun when everything clicks. But they can easily turn into soul-draining, boring meetings when a couple of things go wrong. Here are some common bad meeting patterns: (1) the presenter goes on and on with a dry presentation; (2) lacking common background, the participants bombard the presenter with questions just to get the context of the work, and a lot of time is wasted before the discussion of the paper even starts; (3) the audience drifts away (some fall into their laptop screens, some start to fiddle with their phones); and (4) in the discussion phase an awkward silence sets in and crickets chirp. I have been tinkering with the format of my reading group meetings to avoid these problems and improve the odds that everything clicks together. And over time I have been learn...

Arrakis: The OS is the control plane

This paper (authored by Simon Peter, Jialin Li, Irene Zhang, Dan R. K. Ports, Doug Woos, Arvind Krishnamurthy, and Thomas Anderson, University of Washington; Timothy Roscoe, ETH Zürich) received a best paper award at OSDI'14. The paper "described and evaluated Arrakis, a new operating system designed to remove the kernel from the I/O data path without compromising process isolation. Unlike a traditional operating system, which mediates all I/O operations to enforce process isolation and resource limits, Arrakis uses device hardware to deliver I/O directly to a customized user-level library. The Arrakis kernel operates in the control plane, configuring the hardware to limit application misbehavior." The Arrakis paper avoids mentioning containers, but what it proposes is highly applicable to container technology. Containers aim to provide the isolation and portability of VMs without incurring the overhead of VMs. So containers run an application set on the OS and r...
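To make the control plane/data plane split concrete, here is a toy Python sketch (my own illustration, not Arrakis code; all class and parameter names are made up): the kernel-like control plane configures a virtualized device once, and after that the application sends on its own queue without any kernel call.

```python
# Toy illustration of the Arrakis control-plane/data-plane split.
# All names here are hypothetical; real Arrakis relies on hardware
# virtualization features (e.g., virtualized NIC queues), not Python.

class ControlPlane:
    """Kernel role: configure the hardware once, then get out of the way."""
    def create_virtual_nic(self, app_id, rate_limit, allowed_macs):
        # Protection and limits are installed into the device up front,
        # instead of being checked in software on every packet.
        return VirtualNIC(rate_limit, allowed_macs)

class VirtualNIC:
    """User-level data plane: I/O without kernel mediation."""
    def __init__(self, rate_limit, allowed_macs):
        self.rate_limit = rate_limit
        self.allowed_macs = allowed_macs
        self.tx_queue = []

    def send(self, frame):
        # The application writes directly to its own hardware queue;
        # enforcement was configured by the control plane and is
        # carried out by the device itself.
        self.tx_queue.append(frame)

kernel = ControlPlane()
vnic = kernel.create_virtual_nic(app_id=42, rate_limit="1Gbps",
                                 allowed_macs={"aa:bb:cc:dd:ee:ff"})
vnic.send(b"payload")   # data-plane operation: no kernel call involved
```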

Large-scale cluster management at Google with Borg

This paper is by Abhishek Verma, Luis Pedrosa, Madhukar Korupolu, David Oppenheimer, Eric Tune, and John Wilkes, and it appeared recently in EuroSys 2015. Google's Borg is a cluster manager that admits, schedules, starts, restarts, and monitors all applications that Google runs. Borg runs hundreds of thousands of jobs across a number of clusters, each with tens of thousands of machines. Borg cells (thousands of machines that belong to a single cluster and are managed as a unit) run a heterogeneous workload with two main parts. The first is long-running services that should never go down and that handle quick requests: e.g., Gmail, Google Docs, web search, BigTable. The second is user-submitted batch jobs. Each job consists of multiple tasks that all run the same program (binary). Each task maps to a set of Linux processes running in a container on a machine. The vast majority of the Borg workload does not run inside virtual machines (VMs), in order to avoid the cost of virtualization. Containers are so hot right now...
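Here is a minimal sketch of this cell/job/task vocabulary in Python (my own illustration; Borg's actual configuration language and scheduler look nothing like this):

```python
# Hypothetical data model mirroring the Borg vocabulary from the paper:
# a cell runs jobs; a job is a set of tasks that all run the same binary;
# each task lands in a container on one machine. Names are illustrative.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Task:
    index: int
    binary: str          # all tasks of a job run the same program
    machine: str = ""    # assigned by the scheduler

@dataclass
class Job:
    name: str
    kind: str            # "service" (long-running) or "batch"
    tasks: List[Task] = field(default_factory=list)

@dataclass
class Cell:
    machines: List[str]
    jobs: List[Job] = field(default_factory=list)

    def submit(self, job: Job):
        # Real Borg admits, schedules, restarts, and monitors jobs;
        # this sketch only round-robins tasks onto machines.
        for i, task in enumerate(job.tasks):
            task.machine = self.machines[i % len(self.machines)]
        self.jobs.append(job)

cell = Cell(machines=[f"m{i}" for i in range(4)])
cell.submit(Job("websearch", "service",
                [Task(i, "search_server") for i in range(8)]))
```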

Paper Summary: On the use of Clocks to Enforce Consistency in the Cloud

This paper is by Manuel Bravo, Nuno Diegues, Jingna Zeng, Paolo Romano, and Luis Rodrigues, and it appeared in the IEEE Data Engineering Bulletin in 2015. The purpose of this paper is to revisit how the logical and physical clock concepts are applied in the context of developing distributed data store systems for the cloud, and to review the choice of clocks in relation to consistency/performance tradeoffs. The use of clocks in weak consistency data stores: Dynamo employs sloppy quorums and hinted handoff, and uses version vectors (a special case of vector clocks) to track causal dependencies within the replication group of each key. A version vector contains one entry for each replica (thus the size of clocks grows linearly with the number of replicas). The purpose of this metadata is to detect conflicting updates and to be used in the conflict reconciliation function. Here is a link to my Dynamo review. COPS is a geo-replicated datastore and it assigns a scalar clock to each object. Clien...
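As a quick refresher on the mechanism, here is a minimal version-vector sketch in Python (my own illustration, not Dynamo's code): one counter per replica, bumped on local updates, with a comparison that flags concurrent updates for reconciliation.

```python
# Minimal version-vector sketch (illustrative, not Dynamo's code).
# One counter per replica; clock size grows linearly with replicas.

def bump(vv, replica):
    """Record a local update at `replica`."""
    vv = dict(vv)
    vv[replica] = vv.get(replica, 0) + 1
    return vv

def dominates(a, b):
    """True if version vector a has seen everything b has."""
    return all(a.get(r, 0) >= n for r, n in b.items())

def conflict(a, b):
    """Concurrent updates: neither vector dominates the other.
    This is what triggers Dynamo-style conflict reconciliation."""
    return not dominates(a, b) and not dominates(b, a)

v0 = {}
v1 = bump(v0, "replicaA")          # {'replicaA': 1}
v2 = bump(v0, "replicaB")          # {'replicaB': 1}
assert conflict(v1, v2)            # concurrent writes detected
merged = {r: max(v1.get(r, 0), v2.get(r, 0)) for r in {*v1, *v2}}
assert dominates(merged, v1) and dominates(merged, v2)
```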

Paper summary: A Taxonomy of Partitioned Replicated Cloud-based Database Systems

This paper is by Divy Agrawal, Amr El Abbadi, and Kenneth Salem, and it appeared in the IEEE Data Engineering Bulletin in 2015. The paper proposes a taxonomy of large-scale partitioned replicated transactional databases. Partitioned replicated means that the database is divided into partitions, and the partitions are replicated across different sites, as in Figure 1. The motivation for partitioning is scalability, and the motivation for replication is to enable high availability even when some of the replicas are down. For geo-replicated databases, sites are maintained at different datacenters/regions, although the paper surveys non-geo-replicated databases as well. The taxonomy: The taxonomy is based on the relationship between transaction management and replica management. This paper considers transactions that provide the one-copy serializability guarantee, where concurrent transactions behave as if they executed sequentially on a single database. For a partitioned database, it is necessary to co...
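A tiny sketch of what "partitioned replicated" means as a data layout (mine, with made-up site names, echoing the structure of Figure 1): keys hash to partitions, and each partition has replicas at several sites.

```python
# Illustrative layout for a partitioned, replicated database (my sketch):
# partitioning gives scalability; replicating each partition at several
# sites gives availability when some replicas are down.

import zlib

partitions = {
    "P1": ["us-east", "eu-west", "ap-south"],
    "P2": ["us-east", "eu-west", "ap-south"],
    "P3": ["us-east", "eu-west", "ap-south"],
}

def replicas_for(key, num_partitions=3):
    # Hash-partition keys, then look up the partition's replica sites.
    pid = f"P{zlib.crc32(key.encode()) % num_partitions + 1}"
    return pid, partitions[pid]

print(replicas_for("user:42"))
```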

GraphX: Graph processing in a distributed dataflow framework

This paper appeared in OSDI'14, and is authored by Joseph E. Gonzalez, University of California, Berkeley; Reynold S. Xin, University of California, Berkeley, and Databricks; Ankur Dave, Daniel Crankshaw, and Michael J. Franklin, University of California, Berkeley; Ion Stoica, University of California, Berkeley, and Databricks. This link includes video and slides, which are useful for understanding the paper. This paper comes from the AMP Lab at UC Berkeley. (Nice name! AMP stands for Algorithms, Machines, and People.) This lab brought us Spark and GraphLab, and this paper is a logical successor: it is about marrying Spark (dataflow systems) with GraphLab (graph-processing systems). Motivation: Here is the motivation for this merger. In large-scale computation, we need both dataflow processing and graph processing systems. Graph-processing systems outperform dataflow processing systems by an order of magnitude for iterative computations on graphs (e.g., connected-comp...
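To see why a graph computation can be expressed as dataflow operators, which is GraphX's thesis, here is a tiny connected-components sketch in plain Python (my own stand-in; GraphX expresses the same pattern as joins over Spark RDDs):

```python
# Connected components expressed in dataflow style (illustrative sketch;
# GraphX does this with joins over a vertex table and an edge table).

edges = [(1, 2), (2, 3), (4, 5)]           # edge "table"
labels = {v: v for e in edges for v in e}  # vertex "table": label = own id

changed = True
while changed:                             # iterate until fixpoint
    changed = False
    # "join" edges with vertex labels and propagate the minimum label
    for u, v in edges:
        low = min(labels[u], labels[v])
        for w in (u, v):
            if labels[w] > low:
                labels[w] = low
                changed = True

print(labels)   # {1: 1, 2: 1, 3: 1, 4: 4, 5: 4}
```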

All file systems are not created equal: On the complexity of crafting crash-consistent applications

This paper appeared in OSDI'14 and is authored by Thanumalayan Sankaranarayana Pillai, Vijay Chidambaram, Ramnatthan Alagappan, Samer Al-Kiswany, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau at the University of Wisconsin–Madison. A previous OSDI'14 paper we discussed said that almost every failure is due to bad exception/error handling. But this paper shows that even when you divine the correct error-handling/recovery code, it may still not work: the layering abstraction leaks, and the file system underneath may do funny things in a crash. The paper considers an important and timely problem, because many important applications, including databases such as SQLite and key-value stores such as LevelDB, are currently implemented on top of file systems instead of directly on raw disks. Such data-management applications must be crash consistent, but achieving this goal atop modern file systems is challenging because the exact guarantees provided by file systems are unclear ...
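For context, here is the textbook crash-consistent update pattern that applications attempt (a generic sketch, not code from the paper). Which of these fsync calls are actually needed, and whether the rename stays ordered and durable without them, is exactly the file-system-specific behavior the paper studies.

```python
# The classic "atomic update via rename" pattern that applications use
# to stay crash consistent (generic sketch; the paper examines which of
# these fsyncs each file system actually requires).

import os

def atomic_overwrite(path, data: bytes):
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())        # 1) persist the new contents
    os.rename(tmp, path)            # 2) atomically switch names
    dirfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dirfd)             # 3) persist the directory entry
    finally:
        os.close(dirfd)

atomic_overwrite("config.json", b'{"version": 2}')
```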

Facebook's Mystery Machine: End-to-end Performance Analysis of Large-scale Internet Services

This paper appeared in OSDI'14, and is authored by Michael Chow, University of Michigan; David Meisner, Facebook, Inc.; Jason Flinn, University of Michigan; Daniel Peek, Facebook, Inc.; and Thomas F. Wenisch, University of Michigan. The goal of this paper is very similar to that of Google Dapper (you can read my summary of Google Dapper here). Both works try to figure out bottlenecks in performance in high-fanout large-scale Internet services, and both use similar methods. However, this work (the Mystery Machine) tries to accomplish the task relying on less instrumentation than Google Dapper. The novelty of the Mystery Machine work is that it tries to infer the component call graph implicitly by mining the logs, whereas Google Dapper instrumented each call in a meticulous manner and explicitly obtained the entire call graph. The motivation for this approach is that comprehensive instrumentation as in Google Dapper requires standardization....and I am quoting the rest from the pa...
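The core inference idea is simple enough to sketch (my toy version, with made-up event names and only two traces): hypothesize a happens-before relationship between every pair of logged event types, then reject any hypothesis that some observed log contradicts.

```python
# Toy sketch (my illustration) of Mystery-Machine-style causal inference:
# hypothesize "A always precedes B" for every ordered pair of event
# types, then reject any hypothesis that an observed log contradicts.

from itertools import permutations

# Two toy request traces; real input would be millions of production logs.
logs = [
    ["dns", "connect", "request", "response"],
    ["dns", "connect", "response", "request"],
]

events = {e for log in logs for e in log}
hypotheses = {(a, b) for a, b in permutations(events, 2)}

for log in logs:
    position = {e: i for i, e in enumerate(log)}
    for a, b in list(hypotheses):
        if position[a] >= position[b]:   # counterexample: b did not follow a
            hypotheses.discard((a, b))

# "request"/"response" appear in both orders, so neither ordering
# survives; the dns/connect orderings remain as inferred invariants.
print(sorted(hypotheses))
```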

Paper review: "Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems"

This paper appeared in OSDI'14 and is written by Ding Yuan, Yu Luo, Xin Zhuang, Guilherme Renna Rodrigues, Xu Zhao, Yongle Zhang, Pranay U. Jain, and Michael Stumm, University of Toronto. While the title mentions "analysis of production failures", the paper is not concerned with root cause analysis of failures; instead, it is preoccupied with the problems in error handling that lead to failures. I enjoyed this paper a lot. The paper has interesting counterintuitive results. It opens the error-handling issue, and particularly Java exception handling, to discussion. Digesting and interpreting the findings in the paper will take time. To contribute to this process, here is my take on the findings, after a quick summary of the paper. The setup and terminology: The paper studied 198 randomly sampled, user-reported failures of 5 open-source data-intensive distributed systems: Cassandra, HBase, Hadoop Distributed File System (HDFS), Hadoop MapReduce, an...
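The paper's headline finding is that trivially bad handlers (empty ones, or ones marked TODO/FIXME) account for a surprising share of catastrophic failures, and that simple static checks catch them. Here is a crude Python-flavored analogue of such a check (my sketch; the paper's own checker, Aspirator, targets Java code):

```python
# Crude analogue of the simple static checks the paper advocates
# (my sketch, not the paper's Aspirator tool). Flags exception
# handlers that silently swallow errors with a bare `pass`.

import ast, textwrap

def empty_handlers(source: str):
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ExceptHandler):
            if len(node.body) == 1 and isinstance(node.body[0], ast.Pass):
                findings.append(f"line {node.lineno}: empty handler")
    return findings

code = textwrap.dedent("""
    try:
        commit()
    except Exception:
        pass        # TODO: handle this properly
""")
print(empty_handlers(code))   # ['line 4: empty handler']
```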

My crazy proposal for achieving lightweight distributed consensus

Distributed consensus is a hard problem. See some of my previous posts for discussions of impossibility results on distributed consensus: "Paxos taught", "Perspectives on the CAP theorem", and "Attacking Generals problem". The reason distributed consensus is hard is that the parties involved don't have access to the same knowledge (the same point of view) of the system state. Halpern's work on knowledge and common knowledge in distributed systems provides a useful framework to explain this. Here is a summary of the common knowledge paper by Halpern, and here is a nice talk on that paper. Of course, we have distributed consensus solutions, Paxos and ZooKeeper/ZAB, that always ensure safety and provide progress whenever the system conditions step outside the territory of the impossibility results (with respect to synchrony, channel reliability/reachability, and number of up nodes). But these solutions come with a performance cost, as they need serialization of all update requests from a ...

Eidetic systems

This paper appeared in OSDI'14. The authors are all from the University of Michigan: David Devecsery, Michael Chow, Xianzheng Dou, Jason Flinn, and Peter M. Chen. This paper presents transformative systems work, in that it introduces a practical eidetic system implementation on a Linux computer/workstation. This paper is a tour de force: it undertakes a huge implementation effort to deliver a very useful and novel eidetic memory/system service. The authors should be commended for their audaciousness. An eidetic computer system can recall any past state that existed on that computer, including all versions of all files, the memory and register state of processes, interprocess communication, and network input. An eidetic computer system can explain the lineage of each byte of current and past state. (This is related to the concept of data provenance, which I mention briefly at the end of my review.) Motivation: One use case for an eidetic system is to track where/how erroneou...
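As a toy illustration of the "recall any past state" capability (mine, and not how the paper does it: the paper achieves this for a whole machine via deterministic record-and-replay rather than by storing versions explicitly), consider a store that keeps every version of every key and answers queries as of any past time:

```python
# Toy illustration (my sketch) of eidetic recall: every version of
# every key is retained, and any past state can be queried.

import bisect

class EideticStore:
    def __init__(self):
        self.history = {}   # key -> sorted list of (t, value)

    def put(self, t, key, value):
        self.history.setdefault(key, []).append((t, value))

    def get(self, key, t):
        """Value of key as it existed at time t (None if not yet set)."""
        versions = self.history.get(key, [])
        i = bisect.bisect_right(versions, (t, chr(0x10FFFF)))
        return versions[i - 1][1] if i else None

s = EideticStore()
s.put(1, "f.txt", "v1"); s.put(5, "f.txt", "v2")
assert s.get("f.txt", 3) == "v1"    # past state is recallable
assert s.get("f.txt", 9) == "v2"
```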

Extracting more concurrency from Distributed Transactions

This paper appeared in OSDI'14. The authors are Shuai Mu, Yang Cui, Yang Zhang, Wyatt Lloyd, and Jinyang Li. The paper, presentation slides, and video are accessible here. The paper proposes a concurrency control protocol for distributed transactions and evaluates its performance in comparison with two-phase locking (2PL) and optimistic concurrency control (OCC). The protocol: The protocol introduced, ROCOCO (ReOrdering COnflicts for COncurrency), targets extracting more concurrency under heavily contended workloads than 2PL and OCC can handle. In fact, ROCOCO's improvements over OCC and 2PL come after the peak throughput point even of ROCOCO. One of the questions after the OSDI presentation was about this: "That region where the system is thrashing is not a region you want to be in. Why would you not employ admission control to refuse extra workload that pushes the system past peak performance?" ROCOCO assumes that a distributed transaction consists of a set of ...
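ROCOCO's reordering details fall past where this excerpt cuts off, but for contrast here is the OCC baseline in miniature (my sketch, not the paper's code): OCC validates read versions at commit and aborts on conflict, which is precisely the wasted work under heavy contention that ROCOCO's reordering aims to avoid.

```python
# Miniature OCC baseline (my sketch, for contrast with ROCOCO): each
# transaction records the versions it read and validates them at commit;
# under heavy contention, validation keeps failing and work is wasted.

store = {"x": (0, 0)}        # key -> (version, value)

class Txn:
    def __init__(self):
        self.reads, self.writes = {}, {}

    def read(self, key):
        version, value = store[key]
        self.reads[key] = version
        return value

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        # validate: abort if anything we read has since changed
        for key, version in self.reads.items():
            if store[key][0] != version:
                return False          # conflict -> abort and retry
        for key, value in self.writes.items():
            store[key] = (store[key][0] + 1, value)
        return True

t1, t2 = Txn(), Txn()
t1.write("x", t1.read("x") + 1)
t2.write("x", t2.read("x") + 10)
assert t1.commit() is True
assert t2.commit() is False   # t2 read a stale version; OCC aborts it
```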
