Showing posts from January, 2011

Lessons from giant-scale services

This is a 2001 paper by Eric Brewer summarizing the lessons he learned from designing and developing giant-scale internet services ( see Inktomi ) between 1995-2001. This paper does not mention the CAP theorem at all! This is very surpising given that CAP theorem is what Brewer is famous to most people in the internet services domain. (CAP theorem is very famous and has been very influential for the design of Internet services and cloud computing systems. CAP theorem was first mentioned in publication in 1998 paper , and then presented in PODC'00 keynote by Brewer. See these two posts for a detailed treatment of the CAP theorem.) Instead this paper is all about DQ principle as a design guideline for internet services. The paper mentions Harvest and Yield which may be seen as finer granularity versions of Consistency and Availability respectively, and may connect back to the CAP theorem. I discuss that connection at the end. DQ principle In the web services model, clie

Hints for computer system design, ACM-OS'83

My seminar on cloud computing systems has started today with two papers. Here is the summary of the first paper we covered in the seminar. Designing a computer system is very different from designing an algorithm: The external interface (that is, the requirement) is less precisely defined, more complex, and more subject to change. The system has much more internal structure, and hence many internal interfaces. The measure of success is much less clear. In this 1983 paper , Butler Lampson gives hints for computer system design based on his experience on building several systems in Xerox PARC labs. It is amazing how relevant and how fresh these hints are after 30 years of their publication. While these hints were not specifically targeting distributed systems design, several of these hints are applicable (and widely used) for cloud computing systems. Most of the text below is verbatim copied from the paper. I omitted about half of the hints, and details/justifications about the

Crash-only software, HOTOS'03

Here is the summary for the second paper we covered in the first class of our seminar. The paper has a provocative title, and in fact there is a wikipedia entry on this title : "Crash-only software refers to computer programs that handle failures by simply restarting, without attempting any sophisticated recovery. Correctly written components of crash-only software can microreboot to a known-good state without the help of a user. Since failure-handling and normal startup use the same methods, this can increase the chance that bugs in failure-handling code will be noticed..." (We had previously seen this observation in the last section of Ousterhout's "The role of distributed state" paper .) Motivation Since crashes are unavoidable, software must be at least as well prepared for a crash as it is for a clean shutdown. But then --in the spirit of Occam's Razor-- if software is crash-safe, why support additional, non-crash mechanisms for shutting down? A frequ

My Spring'11 Seminar on Cloud Computing

I am offering a seminar on Cloud Computing this semester. Below is the list of papers I plan to discuss in the seminar. I have also put up a course webpage here . If you have some suggestions on other good/recent papers to cover, please let me know in the comments. (My colleague, Tevfik Kosar , who joined the department this semester, is also offering a seminar on Data Intensive Scientific Discovery . He will be posting his reading list soon, in the meanwhile, here is a link to his 2006 seminar reading list . ) WEEK 1 Hint for Computer System Design, ACM OS'83 [Systems] Crash-Only Software, HotOS 2003 [Systems] WEEK 2 The role of distributed state, 1991 [Filesystems] Harvest, Yield, and Scalable Tolerant Systems, HOTOS'99 [Architectures] WEEK 3 Serverless network filesystems, ACM ToCS'96 [Filesystems] A History of the Virtual Synchrony Replication Model, Replication'10 [Replication] WEEK 4 SEDA: An Architecture for Highly Concurrent Server Applications, SOSP&#

CRDTs: Consistency without concurrency control

This paper appeared in ACM SIGOPS Operating Systems Review in April 2010. One of the authors on this paper is Marc Shapiro. We have previously discussed the "optimistic replication" survey by Shapiro here . (If you haven't read that survey, you should take a moment now to fill this gap. I'll wait.) This paper is also related to optimistic replication. Recall that there are two approaches to replication due to the CAP theorem . One approach insists on maintaining a strong consistency among replicas and requires consensus for serializing all updates on the replicas. Unfortunately this approach does not scale beyond a small cluster. The alternative, optimistic replication, ensures scalability by giving up consistency guarantees, however in the absence of consistency, application programmers are faced with overwhelming complexity. Yes, for some applications eventual consistency may suffice, but even there "complexity and consensus are hiding under a different guis

Popular posts from this blog

Learning about distributed systems: where to start?

Hints for Distributed Systems Design

Foundational distributed systems papers

Metastable failures in the wild

Scalable OLTP in the Cloud: What’s the BIG DEAL?

SIGMOD panel: Future of Database System Architectures

The end of a myth: Distributed transactions can scale

There is plenty of room at the bottom

Distributed Transactions at Scale in Amazon DynamoDB

Dude, where's my Emacs?