Showing posts from November, 2016

How does one get motivated for teaching?

A friend recently complained that, even after years in academia, he never got quite adjusted to teaching. He said he doesn't see much incentive in teaching and found it boring, and asked about how one can get motivated for teaching. This is actually a good question and it is important to get in to the habit of questioning and checking oneself. Here is how I answer this question. I don't need to get motivated for teaching. Teaching is a responsibility bestowed upon me as an academician. So I look at this from a responsibility perspective. I also have responsibilities as a parent, as a husband, or as a driver (if I am the designated driver), and for matters of responsibility, I don't need motivation. I know I should rise to the occasion. Teaching is my service to the students, who really need it. Today students have many alternatives. They can watch a lecture on YouTube, and can find a variety of study material on the Internet. But it is only I that stand in front of

My Distributed Systems Seminar's reading list for Spring 2017

Below is the first draft list of papers I plan to discuss in my distributed systems seminar in the Spring semester. If you have some suggestions on some good/recent papers to cover, please let me know in the comments. Datacenter Operating System Firmament: Fast, Centralized Cluster Scheduling at Scale (OSDI 16) Large-scale cluster management at Google with Borg (Eurosys 15) Apache Hadoop YARN: yet another resource negotiator (SOCC 13) Slicer: Auto-Sharding for Datacenter Applications (OSDI 16) Monitoring Pivot Tracing: Dynamic Causal Monitoring for Distributed Systems (SOSP 15) Shasta: Interactive Reporting At Scale (SIGMOD 16) Adaptive Logging: Optimizing Logging and Recovery Costs in Distributed In-memory Databases (SIGMOD 16) Consistency The many faces of consistency (2016) The SNOW Theorem and Latency-Optimal Read-Only Transactions (OSDI 16) Incremental Consistency Guarantees for Replicated Objects  (OSDI 16) Just Say NO to Paxos Overhead: Replacing Consensus wi

Modeling a Replicated Storage System in TLA+, Project 1

Why a TLA+ project? The first project assignment in my distributed systems class this semester was modeling a replicated storage system in TLA+. Assigning a TLA+ project makes me a rarity among distributed systems professors. A common project would be a MapReduce programming assignment or a project to implement a simple distributed service (such as a key-value store) in Java. I think that a MapReduce deployment project does not teach much about distributed systems, because MapReduce is a very crude abstraction and hides all things distributed under the hood. Using MapReduce for the distributed systems class project would be like handing people a mechanics certification upon successful completion of a driving test. Implementing a simple distributed service, on the other hand, would teach students that indeed programming and debugging distributed systems is very hard. However, I would suspect that much of the hardship in that project would be due to accidental complexities of the i

Book Review: Einstein (by Walter Isaacson)

I read this book recently and liked it a lot. It was a surprisingly engaging read. I thought I knew a lot about Einstein, and the book would be redundant and bland. But the book proved me wrong. I learned a lot of new things about Einstein, especially about his personality and his research perspective and style. The book also did a great job on explaining Einstein's theories, scientific achievements in an accessible and interesting manner. Einstein's personality Let's get this out of the way. Einstein didn't fail math. He was always a very smart and hardworking student. In primary school, he was at the top of his class and "far above the school requirements" in math. And before he was 15 he had mastered differential and integral calculus. What may have helped propagate the myth was Einstein's unruly and defiant personality. Einstein would do great at the things he likes, and not good at the things he doesn't like. He had excelled in his universi

How Complex Systems Fail

This is a 4 page report about the nature of failures in complex systems. It is a gloomy report. It says that complex systems are always ridden with faults, and will fail when some of these faults conspire and cluster. In other words, complex systems constantly dwell on the verge of failures/outages/accidents. The writing of the report is peculiar. It is written as a list of 18 items (ooh, everyone loves lists). But the items are not independent. For example, it is hard to understand items 1 and 2, until you read item 3. Items 1 and 2 are in fact laying the foundations for item 3. The report is written by an MD, and is primarily focused on healthcare related complex systems, but I think almost all of the points also apply for other complex systems, and in particular cloud computing systems. In two recent posts ( Post1 , Post2 ), I had covered papers that investigate failures in cloud computing systems, so I thought this report would be a nice complement to them. 1) Complex syste

Modeling Paxos and Flexible Paxos in Pluscal and TLA+

The first part of this post describes modeling Paxos in Pluscal. The second part shows how to modify that model to achieve a flexible quorum Paxos . The idea for flexible quorums was introduced in a paper in 2016 summer.  This simple and surprising result says "majority agreement is not required in Paxos and the sets of acceptors required to participate in agreement (known as   quorums) do not even need to intersect with each other" . Modeling Paxos in Pluscal While there are many examples of Paxos modeling in TLA+ available, I haven't found any Pluscal modeling of Paxos, except this one from Lamport , which helped me come up with my Pluscal model below. The problem with TLA+ is that it is too low-level (i.e., too declarative and math-like) for writing--and reading-- distributed algorithms. The PlusCal language provides a higher-level pseudocode, which is easier to follow. However, as you go through my Pluscal model below, you will find that it doesn't follow yo

[Paper review] TaxDC: A Taxonomy of nondeterministic concurrency bugs in datacenter distributed systems

This paper appeared in ASPLOS 2016 and the authors are Leesatapornwongsa, Lukman, Lu, and Gunawi. The paper provides a comprehensive study on real-world distributed concurrency bugs and is a good complement to the paper I reviewed before: "Why does cloud stop computing" . While the previous paper looked at all possible bugs that lead to service outages, this paper focuses only on distributed concurrency (DC) bugs. This type of bug happens to be my favorite kind of bug. I am fascinated with "failures", and with DC bugs doubly so. DC bugs are execution timing/ordering/scheduling related, they occur nondeterministically, and are extremely difficult to detect, diagnose, and fix in production systems. I love seeing those long traces of improbable DC bugs surface when I model check distributed algorithms in TLA+ . The experience keeps me humble. The paper presents analysis of 104 DC bugs from Cassandra, HBase, Hadoop MapReduce, and ZooKeeper. These bugs are categorize

Paper Review: Why Does the Cloud Stop Computing? Lessons from Hundreds of Service Outages

This paper conducts a cloud outage study of 32 popular Internet services, and analyzes outage duration, root causes, impacts, and fix procedures. The paper appeared in SOCC 2016, and the authors are Gunawi, Hao, Suminto Laksono, Satria, Adityatama, and Eliazar. Availability is clearly very important for cloud services. Downtimes cause financial and reputation damages. As our reliance to cloud services increase, loss of availability creates even more significant problems. Yet, several outages occur in cloud services every year. The paper tries to answer why outages still take place even with pervasive redundancies. To answer that big question, here are the more focused questions the paper answers first. How many services do not reach 99% (or 99.9%) availability? Do outages happen more in mature or young services? What are the common root causes that plague a wide range of service deployments? What are the common lessons that can be gained from various outages? What would be

Modeling the hygienic dining philosophers algorithm in TLA+

This post on hygienic dining philosophers algorithm is the continuation of the post on dining philosophers. If you haven't seen that one, it is better to read that first before you continue with this one. The hygienic dining philosophers algorithm (due to Chandy & Misra 1984) generalizes the ring topology of the dining philosophers algorithm to an arbitrary undirected graph. The most important contribution of the hygienic philosophers algorithm is that it introduces a priority concept that ensures "No Starvation" even under weak fairness. (NoStarvation == \A p \in Procs: state[p]="Hungry" ~> state[p]="Eating") The scheme deprioritizes a process that has eaten so the process cannot greedily beat its neighbors to eating again. This is achieved by flagging the forks of a process that has eaten as "dirty". Dirty forks denote that the process has just eaten, therefore that process needs to provide the fork to any of its neighbors that ask

Popular posts from this blog

Learning about distributed systems: where to start?

Hints for Distributed Systems Design

Foundational distributed systems papers

Metastable failures in the wild

Scalable OLTP in the Cloud: What’s the BIG DEAL?

SIGMOD panel: Future of Database System Architectures

The end of a myth: Distributed transactions can scale

There is plenty of room at the bottom

Distributed Transactions at Scale in Amazon DynamoDB

Dude, where's my Emacs?