Showing posts from September, 2017

Paper Summary. Proteus: agile ML elasticity through tiered reliability in dynamic resource markets

This paper proposes an elastic ML system, Proteus, that can add and remove transient workers on the fly to exploit the transient availability of cheap but revocable resources, reducing the cost and latency of computation. The paper appeared in EuroSys'17 and is authored by Aaron Harlap, Alexey Tumanov, Andrew Chung, Gregory R. Ganger, and Phillip B. Gibbons. Proteus has two components: AgileML and BidBrain. AgileML extends the parameter-server ML architecture to run on a dynamic mix of stable and transient machines, taking advantage of the opportunistic availability of cheap but preemptible AWS Spot instances. BidBrain is the resource allocation component that decides when to acquire and drop transient resources by monitoring current market prices and bidding on new resources when their addition would increase work-per-dollar. Before delving into AgileML and BidBrain, let's first review the AWS Spot model. See Spot run AWS provides always available compute instances,

Paper summary. Distributed Deep Neural Networks over the Cloud, the Edge, and End Devices

This paper is by Surat Teerapittayanon, Bradley McDanel, and H.T. Kung at Harvard University and appeared in ICDCS'17. The paper is about partitioning a DNN for inference between the edge and the cloud. There has been other work on edge-partitioning of DNNs, most recently the Neurosurgeon paper. The goal there was to figure out the most energy-efficient or fastest-response-time partitioning of a DNN model for inference between the edge device and the cloud. This paper adds a very nice twist to the problem. It adds an exit/output layer at the edge, so that if there is high confidence in the classification output, the DNN replies early with the result, without going all the way to the cloud and processing the entire DNN to get a result. In other words, samples can be classified and exited locally at the edge when the system is confident, and offloaded to the cloud when additional processing is required. This early exit at the edge is achieved by jointly training a single DNN with
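The early-exit idea described in this excerpt can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the paper's implementation: the confidence measure (normalized entropy of the softmax output), the threshold value, and the names `classify`, `edge_model`, and `cloud_model` are all hypothetical and do not come from the excerpt.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(probs):
    """Entropy divided by log(#classes): 0 = fully confident, 1 = uniform."""
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))


def classify(sample, edge_model, cloud_model, threshold=0.3):
    """Return (label, exit_point).

    edge_model and cloud_model are hypothetical callables that map a sample
    to a list of class logits. The threshold is an assumed tuning knob.
    """
    edge_probs = softmax(edge_model(sample))
    if normalized_entropy(edge_probs) <= threshold:
        # Confident enough: answer locally, no round trip to the cloud.
        label = max(range(len(edge_probs)), key=edge_probs.__getitem__)
        return label, "edge"
    # Not confident: offload to the cloud for the remaining processing.
    cloud_probs = softmax(cloud_model(sample))
    label = max(range(len(cloud_probs)), key=cloud_probs.__getitem__)
    return label, "cloud"
```

In this sketch a lower threshold sends more samples to the cloud (fewer early exits), while a higher threshold keeps more answers at the edge at the risk of lower accuracy.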

Paper summary. Gaia: Geo-Distributed Machine Learning Approaching LAN Speeds

This paper appeared in NSDI'17 and is authored by Kevin Hsieh, Aaron Harlap, Nandita Vijaykumar, Dimitris Konomis, Gregory R. Ganger, Phillip B. Gibbons, and Onur Mutlu. Motivation This paper proposes a framework to distribute an ML system across multiple datacenters, and train models at the same datacenter where the data is generated. This is useful because it avoids the need to move big data over wide-area networks (WANs), which can be slow (WAN bandwidth is about 15x less than LAN bandwidth), costly (AWS does not charge for intra-datacenter communication but charges for WAN communication), and also prone to privacy or ownership concerns. Google's Federated Learning had similar motivations, and set out to reduce WAN communication. It worked as follows: 1) smartphones are sent the model by the master/datacenter parameter-server, 2) smartphones compute an updated model based on their local data over some number of iterations, 3) the updated models are sen

Paper summary. OpenCL Caffe: Accelerating and enabling a cross platform machine learning framework

This 2016 paper presents an OpenCL branch/port of the deep learning framework Caffe. More specifically, this branch replaces the CUDA-based backend of Caffe with an open standard OpenCL backend. The software was first located at , then graduated to . Once we develop a DNN model, we would ideally like to be able to deploy it for different applications across multiple platforms (servers, NVIDIA GPUs, AMD GPUs, ARM GPUs, or even smartphones and tablets) with minimal development effort. Unfortunately, most deep learning frameworks (including Caffe) are integrated with CUDA libraries for running on NVIDIA GPUs, and that limits portability across platforms. OpenCL helps with the portability of heterogeneous computing across platforms, since it is supported by a variety of commercial chip manufacturers: Altera, AMD, Apple, ARM Holdings, Creative Technology, IBM, Imagination Technologies, Intel, Nvidia,

Retroscope: Retrospective Monitoring of Distributed Systems (part 2)

This post, part 2, focuses on monitoring distributed systems using Retroscope. This is a joint post with Aleksey Charapko. If you are unfamiliar with hybrid logical clocks and distributed snapshots, give part 1 a read first. Monitoring and debugging distributed applications is a grueling task. When you need to debug a distributed application, you will often be required to carefully examine the logs from different components of the application and try to figure out how these components interact with each other. Our monitoring solution, Retroscope, can help you with aligning/sorting these logs and searching/focusing on the interesting parts. In particular, Retroscope captures a progression of globally consistent distributed states of a system and allows you to examine these states and search for global predicates. Let’s say you are working on debugging ZooKeeper, a popular coordination service. Using Retroscope you can easily add instrumentation to the ZooKeeper nodes to log and

Web Data Management in RDF Age

This was the keynote on ICDCS'17 Day 2, by Tamer Ozsu. Below are my notes from his talk. The slides for his presentation are available here. Querying web data presents challenges due to its lack of a schema, its volatility, and its sheer scale. There have been several recent approaches to querying web data, including XML, JSON, and fusion tables. This talk is about another approach to maintaining and querying web data: RDF and SPARQL. This approach is the one recommended by the W3C (World Wide Web Consortium) and is a building block for the semantic web and Linked Open Data (LOD). Here is a diagram denoting the LOD datasets and links between them as of 2014. Resource Description Framework (RDF) In RDF, everything is a uniquely named resource (URI). The ID for Jack Nicholson's resource is JN29704. Resources have defined attributes: y:JN29704 hasName = "Jack Nicholson", y:JN29704 BornOnDate = "1937-04-22". The relationships with other resources can be defined via
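The attribute statements in this excerpt can be read as (subject, predicate, object) triples. The sketch below stores them as plain Python tuples and matches them against a pattern with wildcards, which is roughly what a single SPARQL triple pattern does. It assumes nothing beyond the two statements quoted in the excerpt; the `match` helper is hypothetical and is not part of any RDF library.

```python
# The two statements from the excerpt, written as (subject, predicate, object).
triples = [
    ("y:JN29704", "hasName", "Jack Nicholson"),
    ("y:JN29704", "BornOnDate", "1937-04-22"),
]


def match(store, s=None, p=None, o=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# "What is the name of resource y:JN29704?"
names = [obj for (_, _, obj) in match(triples, s="y:JN29704", p="hasName")]
```

A real deployment would use a triple store and SPARQL rather than a list scan; the point here is only that every fact, whether an attribute or a relationship, has the same three-part shape.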


The other day I took my little daughter to the library. There was a lady tutoring her son. The boy was slightly over-active and argumentative. He was wearing a wool Buffalo hat. It was warm in the library, so that seemed out of place. It was also strange that the boy, who was around 7-8 years old, was not in school at that time of the day. But I couldn't put 2 and 2 together. I assumed that the lady was home-schooling the boy. Since I was interested in how home-schooling works (what works and what doesn't), I asked her about it. (Although I am a shy introvert, I don't shy away from asking people questions when I am curious about something. That is a little oddity I have.) She said she was not homeschooling, but her son had missed a lot of classes, so she was trying to help him catch up. At that moment, I noticed the thick book she was reading: "CANCER"! My heart sank. I tried to seem upbeat though. I asked the boy his name, Nicholas, and wished him best of luck with his studies

Paper summary. Untangling Blockchain: A Data Processing View of Blockchain Systems

This is a recent paper submitted to arXiv (Aug 17, 2017). The paper presents a hype-free, very accessible, yet technical survey on blockchain. It takes a data-processing centric view of blockchain technology, treating concurrency issues and efficiency of consensus as first class citizens in blockchain design. Approaching from that perspective, the paper tries to forge a connection between blockchain technologies and distributed transaction processing systems. I think this perspective is promising for grounding blockchain research better. When we connect a new area to an area with a large literature, we have opportunities to compare/contrast and perform efficient OODA loops. Thanks to this paper, I am finally getting excited about blockchains. I had a huge prejudice about blockchains based on the Proof-of-Work approach being used in Bitcoin. Nick Szabo tried to defend this huge inefficiency, citing that it is necessary for attestation/decentralized/etc, but that is a false dicho
