Monday, December 23, 2013

Research is a capricious mistress

Research is a capricious mistress. You should be totally devoted to it, and you should be thinking about it all the time. Otherwise, it won't divulge its secrets to you.

Research has this nonlinear return on investment (ROI) function. It awards persistence and dedication by folds and punishes apathy and laziness severely.

If you give research your 100%, you will get at least 200% ROI back. Eventually, that is. It will first test you to see if you are worthy, if you really mean it, if you really have it in you. You will get several setbacks, false alarms. Yet, if you persist and overcome the failures, it will start giving back to you by folds.

If you give research your 50% you will get 20% back. You won't make the cut, and you will be operating a business with deficiency. Usually someone else (the university that employs you) pays the tab.

So what does giving 100% mean? It doesn't mean working overtime (sometimes that will be needed, and you will be doing it though). The productive focused working time (i.e., deep thinking time) is limited to 4 hours a day more or less. There seems to be a bound after which your brain refuses to make progress; it is as if your brain needs to wait to adjust for and catch up to what it had produced. It won't go further even if you push more. I liken this to climbing a mountain; you have to rest daily to let your metabolism adjust. (This 4 hour bound has been anectodally mentioned by many researchers, but I don't know of any detailed analysis of this. For example, I don't know if it is possible to fit in 8 concentrated hours as 4+4 on two different projects. I haven't been able to succeed so far.)

Let's stick with that 4 hours daily. Are you giving that focused 4 hours to your research? Only a small fraction of researchers can answer this affirmatively. How can you cultivate this deep thought? It is hard work. Be persistent and try different things until you find what works best for you. Here are some necessary conditions to get you started. First, turn all distractions off. Don't check your email, twitter, browser in those precious hours. Go offline if possible. Secondly, you should be writing and taking notes. You need to be writing to be able to think in a concentrated manner. If you work with pen and paper (or a whiteboard), you can get more visual, you can doodle, you can link concepts. If you use a good word editor (I swear by Emacs orgmode) and keep typing as you work, the bonus is that this often will be your zeroth draft for your research paper, and you won't get stuck in thinking how/where to start writing.

After paying your dues to your research in those 4 hours, you are still not off the hook. What you should do is you should think about your research in that remaining time. This is the digestion and cudding process. At the background of your brain, you should revisit your research at different times of the day. You do this to see 1) if what you produced holds water, and 2) if by approaching from different angles you can make more progress. You come back to your research again and again during the day (while driving, in the shower) to see if you can catch your research off-guard and get more secrets out. But, you should not let this thinking/reconsidering turn to worrying. Worrying is unproductive and harmful.

If you make some progress regularly and persistently, you will get to see the many folds return on your investment.

Saturday, November 9, 2013

My notes from SOSP13 welcome and awards

The ACM Symposium on Operating Systems Principles (SOSP) is arguably the top conference in the computer systems area. The SOSP conference took a start with a welcome talk from the General Chair, Michael Kaminsky (Intel Labs), and PC Chair, Mike Dahlin (Google and UT Austin).

The chairs started by thanking to the many sponsors (platinum, gold, silver, bronze level) for the conference.

This year SOSP had 628 registrations, which made it the biggest SOSP as of yet, with a 16% increase over 2011 SOSP (which was yet biggest till then). Attendance distribution to SOSP is 76% from North America, 15% Europe, and 11% Asia. Among those attending 42% is faculty, 42% students, and 15% industry. There were 7 workshops on the weekend preceding the SOSP (LADIS was one of them), and 40% of attendants also attended workshops.

This year, for the first time, SOSP had full open access conference proceedings (the cost, $1100 per paper has been paid by SIGOPS), and this announcement got a huge applause from the audience.

Among the 150+ papers submitted, 30 papers are accepted to SOSP. There was a 3 round review process, with a total of 740 reviews, and on average 5 reviews per paper. 70 papers that fell in the middle were discussed in the depth PC meeting.

Three papers have been awarded with the best paper awards:

  1. The Scalable Commutativity Rule: Designing Scalable Software for Multicore Processors, Austin T. Clements, M. Frans Kaashoek, Nickolai Zeldovich, Robert Morris (MIT CSAIL), Eddie Kohler (Harvard)
  2. Towards Optimization-Safe Systems: Analyzing the Impact of Undefined Behavior, Xi Wang, Nickolai Zeldovich, M. Frans Kaashoek, Armando Solar-Lezama (MIT CSAIL)
  3. Naiad: A Timely Dataflow System, Derek G. Murray, Frank McSherry, Rebecca Isaacs, Michael Isard, Paul Barham, Martin Abadi (Microsoft Research)

The accepted 30 papers were presented within the duration of 2.5 days as single track. Each paper got a 30 minute time slot for which 22-25 minutes is for presentation, and 5-8 minutes are dedicated to question-answering session. This year Google-moderator tool has also been employed for taking questions from the audience, in addition to walk-to-microphone questions. The tool enables the audience to vote on the questions and get the questions with most votes rise to the top.

A majority of the presentations were very good (which is not the case in a typical conference where a big majority of the presentations are dull). It was clear that presenters had practiced extensively before the presentations, which helped them to deliver polished 22-25 minutes talks. Question-answering sessions were lively and there were several insightful questions asked.

The papers and  presentation slides (and soon talk videos) are available from

Some of my top picks (reflecting my prejudicies and research interests) from the conference are as follows:

  • The scalable commutativity rule: Designing scalable software for multicore processors
  • Dandelion: a compiler and runtime for heterogenous systems
  • Sparrow: distributed, low latency scheduling
  • From L3 to seL4 what have we learnt in 20 years of L4 microkernels?
  • An analysis of facebook photo caching
  • Towards Optimization-Safe Systems: Analyzing the Impact of Undefined Behavior
  • Transactions chains: achieving serializability with low-latency in geo-distributed storage systems
  • Consistency-Based Service Level Agreements for Cloud Storage
  • Tango: Distributed Data Structures over a Shared Log
  • There Is More Consensus In Egalitarian Parliaments
  • Naiad: A Timely Dataflow System

I will try to edit/organize my notes about some of these talks and share them soon. Especially the last five papers on this list appeal a lot to me since they are more relevant and related to my research interests large-scale distributed systems.

Tuesday night there was a banquet and award ceremony. Stefan Savage has been awarded with the 2013 Mark Weiser award. He gave a humble yet powerful talk, and shared personal stories about how his career has benefited a lot from interacting with the late Mark Weiser. Two recent PhD thesis were awarded with the Dennis Ritchie award.

Finally, the following five papers have been added to SIGOPS Hall of Fame:

  1. “Tenex, A Paged Time Sharing System for the PDP-10”, Daniel G. Bobrow, Jerry D. Burchfiel, Daniel L. Murphy and Raymond S. Tomlinson, Communications of the ACM 15(3), March 1972. 
  2. “A NonStop Kernel”, Joel Bartlett,  in Proceedings of the Eighth ACM Symposium on Operating Systems Principles (SOSP’81), Pacific Grove, California, December 1981 
  3. K. Mani Chandy and Leslie Lamport, “Distributed Snapshots: Determining Global States of a Distributed System”, ACM Transactions on Computer Systems 3(1), February 1985. 
  4. Kenneth P. Birman and Thomas A. Joseph, “Exploiting Virtual Synchrony in Distributed Systems”, in Proceedings of the Eleventh ACM Symposium on Operating Systems Principles (SOSP’87), Austin, Texas, November 1987. 
  5. Eddie Kohler, Robert Morris, Benjie Chen, John Janotti and Frans Kaashoek, “The Click Modular Router”, ACM Transactions on Computer Systems (TOCS), 18(3), August 2000.

Friday, November 8, 2013

LADIS 2013 keynotes

I attended and presented a paper at LADIS 2013, which was colocated with SOSP13. I will talk about my paper in a later post. Here I just want to share brief summaries of the LADIS keynotes.

1. LADIS keynote: Cloud-scale Operational Excellence, by Peter Vosshall, distinguished engineer

What is operational excellence? It is anticipating and addressing problems.

For us, operational excellence arises from a combination of culture + tools + processes.

1.1 Culture

Amazon leadership principles are:
  1. Customer obsession
  2. Ownership (Amazon has a strong ownership culture, known as devops!)
  3. Insisting on the highest standards

1.2 Tools

Amazon has tools for software deployment, monitoring, visualization, ticketing, risk auditing.

In 1995, Amazon had a single web-server operation, and had a website-push perl script. This was managed by a small centralized team, named Houston.

The team invested in a tool called Apollo for automating deployments. As a result, it was easy to do deployments. Some 2011 numbers are as follows. Mean time between deployments: 11.6 secs, Max number of deployments in an hour: 1079, Mean number of hosts simultaneously receiving deployments: 10000.

Another tool for enabling continuous deployment is pipelines, which automate the path the code takes from check-in to production: packages -> version set -> beta -> 1box -> production.

1.3 Processes

When you ask for good intentions, you are not asking for a change. "Good intentions don't work, mechanisms work!"

Similar to the "Andon cord" in Toyota that stops serial line to address issues, Amazon has an Andon cord that can be pulled by the customer service department. The category owner for the Andon cord pulled needs to address immediately: the product is removed from Amazon website until category owner addresses the problem.

Correction of errors (COE) is another process. This is a mechanism Amazon employs to learn from mistakes. COE started as emails documenting errors and what is learned frankly. Anatomy of a COE today: what happened, what was the impact, the 5 whys, what were the lessons learned, what are the corrective actions.

2 LADIS keynote Baidu: Big data and infrastructure

I didn't take much notes from this talk, but here is an interesting tidbit to share.

It is known that 90% of hardware failures are caused by hard disk drives. So in some sense memory is more reliable than the disk. 3-way memory replication is enough for most applications, and that is what Baidu uses. Fast recovery for a replica is more important at the end of the day.

3 LADIS: Lessons from an internet-scale notification system, Atul Adya, Google

Thialfi is a notification service. Thialfi was first presented in SOSP11. Since then Thialfi scaled by several orders of magnitude. The team has learned unexpected lessons, and Atul talked about these lessons.

Thialfi overview: App registers for X, this is recorded at data center if X is updated app gets notification. (This is much better than busy polling by app.) Thialfi abstraction: Object unique id, and monotonically increasing version number 64 bit. Thialfi is built around soft-state. It recovers registration state from clients if needed.
Some lessons learned from operating Thialfi.

3.1 Lesson1: Is this thing on? Working for everyone?

You can never know! You need continuous testing in production. For example, look at server graphs to infer end to end latency. Chrome sync was the first real customer for Thialfi. For a big customer like Chrome, it is even possible to monitor Twitter for complaints.

3.2 Lesson2: And you thought you could debug?

In such a large scale system, you have to log selectively. When a specific user has problem, it may look like searching for a needle in a haystack. The team had to write custom production code for some customers.

3.3 Lesson3: Clients considered harmful

If you rely on client-side computations enticed by the lightweight/scalable servers promise, you will have problems with old versions of client apps. You cannot update the client code, so don't put code on clients.

3.4 Lesson4: Getting your code in the door is important

Build a feature if only customers care about it. A corollary is that you may need unclean features (weakest semantics) to get customers. For example, in one case, when they found that version numbers were not feasible for many systems, they modified Thialfi to allow time instead of version numbers.

3.5 Lesson 5: You are building your castle on sand

Use non-optimal consistent hashing (not geo-aware), rather than optimal but flapping/dithering optimal balancing.

3.6 Lesson 6: The customer is not always right

The example given here was with respect to strict latency and SLAs.

3.7 Lesson 7: You cannot anticipate the hard parts

Hard parts of Thialfi actually turned out to be:
  1. Registrations: getting client and data center to agree on registration state is hard.
  2. Wide-area routing.
  3. Client library and its protocol.
  4. Handling overload.

3.8 Question answer section

Q: Should one design a service properly at the start or make it grow organically?
A: Atul said that he was a fan of designing properly in the first place, but this failed for Thialfi. They revisited the design three times. His new rule is: if you are building on top of other sytems (as it was the case with Thialfi), don't spend months on design.
The third rewrite of Thialfi is ongoing. In this revision, they will use Google Spanner for synchronous replication of the registration state!

Q: What about Dec 2012 Chrome crashes, did that have anything to do with Thialfi.
A: Nothing to do with Thialfi, Google Sync was blamed for it. Thialfi was not implicated in a PR level failure yet.

Thursday, August 15, 2013

How I write

Writing is easy. All you do is sit staring at the blank sheet of paper until the drops of blood form on your forehead.
-- Red Smith
Let's get this cleared first. Writing is hard. Whoever says it is easy is lying. There is a lovely book by Sophy Burnham called "For Writers Only" which includes countless quotes from writers on the exquisite pain of the writing life. I highly recommend the book. All the quotations in this post are taken from that book.
I see but one rule: to be clear.
-- Stendhal
Of course, I am not going to talk about how to write great works of literature. For that I am unqualified. I will talk about technical/scientific writing, which is a bit easier and manageable. For technical writing you do not necessarily need to be creative, playful, and poetic. You just have to communicate your points clearly. (Unfortunately, this is easier said than done.)
We do not write in order to be understood, we write in order to understand.
-- C. Day-Lewis
Writing is very important for our careers as researchers. Writing helps us make sense of our research and findings, so we should first and foremost write for ourselves. Writing is also how we document/communicate our results so that others can benefit and build on them. Consequently, writing is what we are evaluated on as academics.
If we had to say what writing is, we would have to define it essentially as an act of courage.
--Cynthia Ozick
While writing is very important, it is often neglected and poorly done.  This is because writing is hard work; it exposes how weak and unorganized our thoughts really are. After all those years of writing, I still feel uneasy about writing.

My trick to writing

When I was a rookie graduate student in the distributed systems field, I was a huge Dijkstra fan. I was a romantic and would strive to write like Dijkstra: think hard, get the entire composition ready in your mind, and write at once (you won't even need to revise because you shall get it right the first time). After a while, I found that this type of writing was not sustainable for me. It was very hard to get started. I would freeze with the pressure of getting it perfect the first time. So, over time, I developed a trick to get my job easier. I became pragmatic and a believer of rapid prototyping.
The more a man writes, the more he can write.
-- William Hazlitt
My trick is not novel. It is essentially freewriting: Lower your expectations and write freely in bulk, and you will be amazed at what you produce.

There are two steps to this type of writing: Mess up and tidy up. The main idea is to separate drafting (messing up) from revising (tidying up).

Mess up

Beginning to write, you discover what you have to write about.
-- Kit Reed
This is the exploration phase. The goal is to jot down any useful idea that can lead to other useful ideas and to produce in bulk. In this phase don't criticize or restrain yourself. Don't worry about good prose, clarity, style, or better wording. Don't correct the typos. You don't need to capitalize, you don't even need to write in full sentences. Just write and accumulate text. After you follow some false leads and take unfruitful paths, you will eventually start putting down some useful ideas, and write more and produce more on top of those ideas.
It's a messy business. You wind up with shoe boxes of scrap paper.
-- Cormac McCarthy
In order to loosen up, you need to go further and convince yourself that you are writing to throw these away, and not to use them. Of course the reality is that usually what you write is mostly not that bad and after some tidying up you reach an acceptable draft.

Tidy up

Kill your darlings, kill your darlings, even when it breaks your egocentric little scribbler’s heart, kill your darlings.
-- Stephen King
This is the focusing phase. The goal is to remove the useless parts and revise the useful parts. Less is more when it comes to writing. In order to keep the writing focused, you have to remove a lot of cruft. Revel in deleting text (it is like deleting code :-), you should be happy if you can explain what you want in less words. (If you are uncomfortable with removing text, try first commenting it out or moving it at a separate junk-text buffer so you can still consider adding it back later. Often you will find that you will be OK removing that text as you grow less attached to it.)

A paragraph is the unit of composition, so start at the paragraph level. Watch the paragraphs form as you get related text together. Reorder the paragraphs and remove some of them. Check that the flow of the writing (the high-level train of thought) is there. While doing this, sympathize with the reader and consider the reader's point of view. Can the reader follow the flow? Or are there places where a reader (a distracted one, or one with different set of assumptions/expectations/background) may diverge. Work out those kinks, or (if you still feel weak/lazy/unsure) write self notes/comments about these issues so you can start addressing them in later revisions.
Easy reading is damned hard writing.
-- Nathaniel Hawthorne
Now that you have a zeroth draft, you can start reviewing and revising. You are now the editor. Forget you wrote the text produced earlier and imagine that another guy wrote it. (Shake your head rigorously, if that works for you.) Do a harsh review of the writing and identify the issues with it. Then go over the writing and try to fix as much as you can.
I have often rewritten --often several times-- every word I have ever published. My pencils outlast their erasers.
-- Vladimir Nabokov
You are almost done. Repeat the revising step a couple more times until you feel satisfied with the writing. Do one more revision even after then. If this is a research paper, send it out to your friends/collaborators to review it for you. They may be able to find issues that escaped your senses.

Bonus tip: Gear up

Give me six hours to chop down a tree and I will spend the first four sharpening the axe.
-- Abraham Lincoln
You will (and should) spend an enormous amount of time writing during your career, so identifying the right tools for your writing matters a lot. An outliner helps greatly for this type of writing (drafting then revising). I found that Emacs org-mode is great for writing and organizing my thoughts. Take time to find what works best for you. If you feel lazy about this, consider this: the physicists had to build telescopes and the Hadron collider as their tools, so they can get things done. Tools are really important, and you should realize that time spent mastering/customizing a tool is not time wasted.
Ever tried? Ever failed? No matter. Try again. Fail again. Fail better.
-- Samuel Beckett
Since writing is also a tool, the advice on honing your tools also applies to writing. The best way to get better at writing is by writing a lot. Write daily. By writing a lot of bad prose, you will eventually learn how to write better and get more comfortable with writing. Take every opportunity to write. Maintain a blog. Read a lot. All the while, be mindful of your writing. Always consider how you can improve it further.

How do I write a research paper?

Writing a research paper requires more discussion: How do you write in parallel with your research? How do you accumulate enough text to get started? How do you structure/organize your paper into appropriate sections. But these will have to come at a later post. (UPDATE: See How to write your research paper post.)
The best part about writing is stopping.
-- Colin Walters

Related posts

Writing advice
How to write your research paper 
How I read a research paper
My advice to my graduate students
My advice to my undergraduate students

Thursday, July 25, 2013

How I read a research paper

I can't tell you how you should read a research paper. Probably a different and customized style would work best for you.  But I can tell you how I read a research paper. And if you think some of the things below make sense, you can try them for yourself and see if they work for you as well.

So, how do I read a paper? The first answer that pops to my mind is "I fight with it". I strive to completely understand and "Grok" the paper by arguing with it. I resist to believe the claims in the paper at face value, I try to poke holes at it. I believe that the paper has to earn my trust, and I only accept the parts that I am obliged to accept.

My algorithm for reading a paper

0. Print the paper
1. Ask questions and argue with the paper
2. Write a review
3. Fight with the paper
4. Sleep on it and come back to it

Print the paper

I like to physically touch the paper and handwrite and doodle on the paper. I have highlighter and different color pens ready with me, when I am reading a paper.

Ask questions

I am a slow and deliberate reader. I ask a lot of questions while reading the paper. Why did the paper define it this way? Was this the only way to define it? Was this the best way to define it? Do I believe it? Why should I believe it? I continuously have some questions in my mind going through the paper. If nothing else, I try to guess where the next paragraph will be (or should be) leading the paper. This is called critical reading.

I get very detail-oriented when reading the paper. I highlight, I underline, I cross over sentences. I mark important paragraphs with stars. I use the margins to take notes about my arguments and about some connections that occur to me. Obviously, there will be several places I won't understand in my first reading. I mark these with WDYM (what do you mean). I hypothesize (make up guesses) about things I don't understand, and revise my hypothesis as I read further in to the paper.

I am a bit ashamed of this, but I often write provocative things on the margin, like "This claim is stupid", "This is just bullshit". It is silly, but I do this to keep myself engaged with the paper and be emotionally involved as well. This way, I never fall asleep reading a paper.

Write a review

"Writing is nature's way of telling you how sloppy your thinking is."

I write a short review of the paper to test how much I understand of it. Since I already mark enough text from the paper and write enough notes in the margins while reading the paper critically, it becomes easy to start writing my personalized review of the paper using these.

I am often surprised by how little I understood the paper in my first reading. When writing the review, since I try to explain things myself in my own way, I often realize that I didn't really understand the paper.

To elaborate this point, I will borrow from my advice to undergraduate students:
There are different levels of knowing/understanding. The first level is "knowing by word". With passive learning (like passively reading), you get the most limited form of understanding/knowing. It is ephemeral, and you are not able to reconstruct that on your own. The second level is "knowing by first hand experience". At this level, you are able to understand/witness and if necessary rediscover the knowledge on your own. Finally, the third level is "knowing by heart". At this level, you internalized the knowledge so that it becomes part of your nature. You grok it.

Fight with the paper

Now that I somewhat understand the paper, it is time for me to think about it more in depth to understand it better. How would I do it? Could I find a better/simpler solution to this problem? What would be the major disadvantages of the proposed solution? Or, even, considering the problem the paper addressed, was this the right question to ask in the first place?

I find that this exercise also has benefits for my own writing/research. Trying to poke holes in others' papers helps me become a better writer and write stronger/better-defended papers. (If you are a graduate student, this will help you write a well-defended strong thesis, and you can get by fighting a small snake in the snake fight portion of your thesis defense.)

It is now time for me to read the paper a second time. This time I try to test/verify my major objections about the paper and also try to understand the places in the paper I had marked as WDYM (or sometimes as WDYFM).

Sleep on it and come back to it

At this point, I might have been overly negative and hard on the paper, and now is the time to empathize with the paper and put it back in context. It is important not to confuse critical reading with being hypercritical about a paper and dismissing the contributions made by the paper.

So I sleep on it, I take time off and go on with other things (life?) for a day or so. But during this time, my mind comes back unconsciously to the paper and reconsiders my judgments about it. And often, as a result I come to respect the paper and appreciate it more. Even though I fought with the paper and argued with it fiercely, now I try to respect and truly learn from the paper.

Maybe this quote from "Ender's game" explains this best. Ender says "In the moment when I truly understand my enemy, understand him well enough to defeat him, then in that very moment I also love him."


How do other researcher read papers?
Unfortunately, I don't know too much about how other researchers read. It would be nice to learn tricks and approaches used by others. But I guess these things are not discussed much explicitly.

I think my approach to paper reading was shaped strongly by my PhD advisor Anish Arora. He would also write on the paper, underline carefully, and argue in the margins. He was also extremely meticulous about paper reviewing for conference committees. He would insist that we understand every aspect of the paper, and sometimes understand its contributions better than the authors that wrote the paper :-)

I also had a chance to observe my postdoc advisor Nancy Lynch reading papers. She is a very detail-oriented person as well. She was able to find even the tiniest smallest mistakes in the papers with ease. She once told me that her mind worked like a debugger when reading a paper, and these bugs jumped at her. The way she worked with students was that she would dedicate herself solely on a student/paper for the duration of an entire week. That week, she would avoid thinking or listening other works/students. This is because, she wanted to immerse and keep every parameter about the paper she is working on in her mind, and grok it.

After reading this post, Ted Herman shared some of his useful tricks. While reading, he would think about how he would rewrite the paper in a way that would have cleared up the misunderstandings. He would also ask these questions of the paper: "Why is this interesting? What are the surprises? Should we even be surprised?"

There is also this short CCR paper about how to read a paper.

How much time do I spend reading a paper this way?
Many many hours. Of course if the subject is easy or I know the domain well, I can read a paper in a more relaxed manner. But I don't remember getting a lot of benefit from such readings. I get the best return on investment on the papers I had to struggle with and truly grok.

Isn't this wasteful for bad papers?
Yes, it is. I do this for reading papers that are important for my research and/or reading good quality papers that appeared at good venues. Or for papers I am stuck with reviewing for a conference or journal.

Related earlier posts:
My advice to graduate students
My advice to undergraduate students
Tell me about your thought-process, not just your results

Sunday, July 21, 2013

Apps are selfish parasites! How can we get truly collaborative apps on smartphones?

While there has been good progress and wide availability of the devices (smartphones, tablets, sensors) to fulfill the ubiquitous computing vision, the-state-of-the-art in software and integration is lagging far behind. Consider DARPA's 2009 network grand challenge on the occasion of the 40th anniversary of the Internet. The challenge was to accurately find 10 weather balloons deployed in arbitrary locations of the U.S. within a day. There was an award of $40,000 for the team that would first report the locations of the 10 balloons accurately, and this challenge was solved within 9 hours. The winning team employed social networks and a multilevel incentive structure, but had to prepare, campaign, and publicize aggressively for an entire month before the challenge day.

This points to a big gap between the potential of the smartphones and the reality of the smartphone software today. Why are the existing apps so limited and person-centric? Why can we not have an app that is able to solve similar active collaboration and coordination problems automatically?

We argue that the reason for this gap is the lack of an infrastructure to task/utilize these devices for collaboration and coordination. In the absence of such an infrastructure, the state-of-the-art today is for each device to connect to Internet to download/upload data and accomplish an individual task that does not require collaboration and coordination. In contrast, providing an infrastructure for publish/subscribe and tasking of these devices would enable any device to utilize the data published by several devices in a region, as well as to task several devices in a region to acquire the needed data, if the data is not already being published to the infrastructure.

In order to task/utilize ubiquitous devices for collaboration, we propose a ubiquitous computing middleware, Eywa, which provides an open publish-subscribe infrastructure for smartphones, and paves the way for crowdsourced sensing and collaboration applications. Eywa differs from previous work as its emphasis is on active and focused crowdsourcing and it serves as an enabler for users/devices to task other devices automatically to collect the needed data.

Read on for the rest of our position paper here.

A related paper to this is the LineKing paper.

Ramblings on serializability

Here is a very raw/immature idea I have been toying with recently. I decided to write it down and share in order to get feedback about whether this is worth exploring further.


Serializability of reads and writes (sequential consistency) is an important feature in distributed systems. A significant application of serializability is in transaction processing in distributed databases (e.g., Spanner).

Paxos serialization versus serializability

Paxos is a method for fault-tolerant state machine replication. In order to achieve fault-tolerant state machine replication, Paxos employs consensus to serialize operations at a leader and apply the operations at each replica in this exact serialized order (dictated by the leader).

While serialization is not an end in Paxos but rather a means, Paxos is seen nowadays as a defacto method for serialization rather than a method for fault-tolerant state machine replication. Maybe the ZooKeeper implementation of a Paxos-based lock-service help promote this confusion and blur the lines between Paxos's true goal (fault-tolerant replication) with its side effect (serialization).

In fact Paxos serialization is overkill, it is too strong. Paxos will serialize operations in a total order, which is not necessarily needed for sequential consistency. Today in many applications where knowing the total order and replicated-logging of that order is not important, Paxos is still (ab)used.

Serialization is not and should not be tightly-related to a replication solution.

OK, so what? (What is wrong with using a drug for its side effect?)

This confusion may permeate the misconception that serialization is inherently a reactive process. Many people have now been thinking about serialization from the Paxos perspective and as an inherently on-demand process.

(Probably there are other problems that arise due to this confusion, but for now I am only picking on this one.)

Proactive serialization

Serialization can actually be proactive and anticipatory. The workers does not need to make a serialization request to a Paxos leader. By getting rid of such a request (albeit in a fraction of the cases), latency can be reduced and throughput can be improved.

By adopting the proactive serialization philosophy, a lock service can provide the locks to some workers proactively so those workers can go ahead. Embracing this philosophy, a lock-service master can anticipate a future request and provide some locks ahead of time. The benefit of this is to eliminate the need for an explicit request (again in a fraction of the cases) and improve throughput and reduce latency.

A couple paragraphs above I said that "Today in many applications where knowing the total order and replicated-logging of that order is not important, Paxos is still (ab)used." Picking up on that thread I will claim further (without much evidence for now) that in those cases, proactive serialization could do a better job.

What are some ways to achieve proactive serialization?

How can proactive serialization be achieved? One approach is to employ machine learning. The lock service learns the patterns of requests for locks over time and then starts distributing these locks speculatively before they are requested.

Employing presignaling can be another approach to achieve proactive serialization. The nodes can anticipate which locks they will be needing in the next round or minute and ask for those locks in bulk speculatively. The lock service can choose to honor some of these requests and achieve proactive serialization to improve performance.

Another approach to achieve proactive serialization can be to do a static analysis of workers' programs/actions and come up with patterns/sequence of accesses. When the lock service observes one of the detected patterns, it can give the locks in advance to the respective workers that are mentioned in the rest of the pattern.

Did I pick on Paxos unfairly?

Actually, none of these proactive approaches are ruled out in Paxos. The Paxos leader itself could be making these proactive/speculative lock giving requests in advance. So, did I really need to pick on Paxos? Maybe not.

However, Paxos serialization in total order is still overkill and can cause problems in a Paxos-based solution. Google Spanner used a smart solution of assigning separate conflict domains (at the tablet level) to Paxos groups. This way it was able to provide locks in parallel across separate conflict domains.

If we do not tightly-couple serialization with a Paxos-based replication service, we can find more efficient and new solutions for serialization.

Thursday, July 4, 2013

Spanner: Google's Globally-Distributed Database

The Spanner paper by Google (appeared in OSDI'12) is cryptic and hard to understand. When I first read it, I thought I understood the main idea, and that the benefit of TrueTime was to enable lock-free read-only transactions in Spanner. Then, I slowly realized things didn't check; it was possible to achieve lock-free read-only transactions without TrueTime as well. I did another read, and thought for some time, and had a better understanding of how TrueTime benefits Spanner, and how to improve its shortcomings.

I will first provide a summary of the Spanner work (borrowing sentences and figures from the Spanner paper), and then talk about what TrueTime is actually good for.

In a nutshell

Spanner is Google's scalable, multi-version, globally-distributed, and synchronously-replicated database. Spanner supports non-blocking reads in the past, lock-free read-only transactions, and atomic schema changes. In order to support externally-consistent distributed transactions at global scale, it uses a novel TrueTime API that exposes clock uncertainty.

From NoSQL to NewSQL!

The need to support semi-relational tables and synchronous replication in Spanner has been motivated by the popularity of Megastore. Many applications at Google (e.g., Gmail, Picasa, Calendar, Android Market, and AppEngine) chose to use Megastore because of its semi-relational data model and synchronous replication, despite its poor write throughput.

Spanner evolved from a Bigtable-like versioned key-value store into a temporal multi-version database. Data is stored in semi-relational tables, and Spanner provides a SQL-based query language and supports general-purpose long-lived transactions (e.g, for report generation —on the order of minutes). The Spanner team believes it is better to have application programmers deal with performance problems due to overuse of transactions as bottlenecks arise, rather than always coding around the lack of transactions.

Spanner distributed database

Data is versioned, and each version is automatically timestamped with its commit time by the TrueTime API. Spanner provides externally consistent reads and writes, and globally-consistent reads across the database at a timestamp. External consistency (or equivalently, linearizability) is defined as follows: if a transaction T1 commits before another transaction T2 starts, then T1's commit timestamp is smaller than T2's. Using TrueTime Spanner is able to assign globally-meaningful commit timestamps to transactions, which reflect the serialization order.

2 TrueTime API returns a TTinterval that is guaranteed to contain the absolute time during which was invoked. In other words, TrueTime guarantees that for an invocation tt =, tt.earliest < tabs(enow) < tt.latest, where enow is the invocation event and tabs(enow) is the absolute time of event enow. The instantaneous error bound is denoted as ε, which is half of the interval’s width.

Google keeps uncertainty small (bounded by around 6ms) by using multiple modern clock references (GPS and atomic clocks). TrueTime is implemented by a set of time master machines per datacenter and a time slave daemon per machine. The majority of masters have GPS receivers with dedicated antennas. The remaining masters (Armageddon masters) are equipped with atomic clocks.

Every daemon polls a variety of masters to reduce vulnerability to errors from any one master. Between synchronizations, a daemon advertises a slowly increasing time uncertainty. ε is derived from conservatively applied worst-case local clock drift. The daemon's poll interval is currently 30 seconds, and the current applied drift rate is set at 200 μ sec/second, which together account for the bound on uncertainty at 6ms.

The TrueTime API directly exposes clock uncertainty, and the guarantees on Spanner's timestamps depend on the bounds that the implementation provides. If the uncertainty is large, Spanner slows down to wait out that uncertainty.

3 Spanner Implementation

A zone has 1 zonemaster and 100 to 1000s of spanservers. Zonemaster assigns data to spanservers; spanserver serves data to clients. Location proxies help clients to locate the spanservers assigned to serve their data. The universe master displays status information about all the zones for interactive debugging. The placement driver handles automated movement of data across zones on the timescale of minutes.

Each spanserver is responsible for 100 to 1000 tablets. A tablet implements a bag of the following mappings: (key:string, timestamp:int64) → string. To support replication, each spanserver implements a single Paxos state machine on top of each tablet.

At every replica that is a leader, each spanserver implements: a lock table (mapping ranges of keys to lock states) to implement concurrency control, and a transaction manager to support distributed transactions. If a transaction involves only one Paxos group (as is the case for most transactions), it can bypass the transaction manager, since the lock table and Paxos together provide transactionality.

If a transaction involves more than one Paxos group, those groups' leaders coordinate to perform 2-phase commit. One of the participant groups is chosen as the coordinator: the participant leader of that group will be referred to as the coordinator leader, and the slaves of that group as coordinator slaves.

4 Concurrency control

The Spanner implementation supports read-write transactions, read-only transactions, and snapshot reads. Standalone writes are implemented as read-write transactions; non-snapshot standalone reads are implemented as read-only transactions. A snapshot read is a read in the past that executes without locking.

4.1 Read-Write transactions

Read-write transactions use 2-phase locking and 2-phase commit. First, the client issues reads to the leader replica of the appropriate group, which acquires read locks and then reads the most recent data. When a client has completed all reads and buffered all writes, it starts 2-phase commit. Read-write transactions can be assigned commit timestamps by the coordinator leader at any time when all locks have been acquired, but before any locks have been released. For a given transaction, Spanner assigns it the timestamp that the coordinating leader assigns to the Paxos write that represents the transaction commit. To wait out the uncertainty in TrueTime, there is a Commit Wait: The coordinator leader ensures that clients cannot see any data committed by Ti until TT.after(si) is true.

4.2 Read-only transactions

Reads in a read-only transaction execute at a system-chosen timestamp without locking, so that incoming writes are not blocked. A read-only transaction executes in 2 phases: assign a timestamp sread, and then execute the transaction's reads as snapshot reads at sread. The snapshot reads can execute at any replicas that are sufficiently up-to-date.

Serving reads at a timestamp. Every replica tracks a value called safe time tsafe which is the maximum timestamp at which a replica is up-to-date. A replica can satisfy a read at a timestamp t, if t ≤ tsafe. We define tsafe = min(tPaxos, tTM), where each Paxos state machine has a safe time tPaxos and each transaction manager has a safe time tTM.

tPaxos is the timestamp of the highest-applied Paxos write. Because timestamps increase monotonically and writes are applied in order, writes will no longer occur at or below tPaxos with respect to Paxos.

tTM is ∞ at a replica if there are zero prepared (but safe not committed) transactions—that is, transactions in between the two phases of 2-phase commit. Otherwise, for every participant group g, over all transactions Ti prepared at g, tTM= mini (spreparei,g)-1. In other words, tTM denotes the request timestamp of the earliest prepared but not committed transaction.

Assigning timestamps to read-only transactions. The simple assignment of sread = to a read-only transaction preserves external consistency. However, such a timestamp may require the execution of the data reads at sread to block if tsafe has not advanced sufficiently. To reduce the chances of blocking, Spanner should assign the oldest timestamp that preserves external consistency. (External consistency constraint dictates that you cannot use an older version of variable to read, and you cannot assign a timestamp earlier than pending read-write transaction on any of the variables involved in the read-only transaction). Spanner implements a simpler choice when multiple Paxos groups are involved. The client avoids a negotiation round, and just has its reads execute at sread=, which may wait for safe time to advance.

4.3 Refinements

tTM as defined above has a weakness, in that a single prepared transaction prevents tsafe from advancing. Such false conflicts can be removed by augmenting tTM with a fine-grained mapping from key ranges to prepared-transaction timestamps. When a read arrives, it only needs to be checked against the fine-grained safe time for key ranges with which the read conflicts.

tPaxos is also advanced by heartbeats to help tsafe advance at the replicas. (This does not require high precision clock synchronization, and NTP easily suffices for this.)

5 TrueTime, what is it good for?

The paper does not directly discuss what TrueTime buys for Spanner. It says this in the conclusions: "One aspect of our design stands out: the linchpin of Spanner's feature set is TrueTime. We have shown that reifying clock uncertainty in the time API makes it possible to build distributed systems with much stronger time semantics." What does this mean exactly? What TrueTime buys Spanner is left unclear in the paper.

After re-reading the paper only with this question in mind, I was left more puzzled. In my first read of the paper, I thought TrueTime enabled lock-free reads in Spanner. After the second reading, I realized that lock-free reads could be implemented without TrueTime, only by using version numbers, because read-only transactions were also serialized by coordinating leaders and Paxos groups. TrueTime wasn't speeding up read-only transactions either. Even using TrueTime, read-only transaction still needs to wait to hear/learn the commit information from any variable/tablet-overlapping pending/prepared read-write transaction.

Maybe TrueTime benefited by making Spanner implementation simple, but Section 4.2.4 lists several unavoidable implementation hardships even with TrueTime. It looks like using version numbers wouldn't be significantly more complicated. Also, for schema change and paxos leader replacement (which the paper claims TrueTime simplified a lot), the NTP synchronization (several tens of ms accuracy) easily suffices. We could have easily avoided the more precise TrueTime implementation with GPS and atomic clocks for these.

TrueTime and consistent-cuts in the past

After I was almost convinced that TrueTime was not good for anything, I realized this: TrueTime benefits snapshot reads (reads in the past) the most! By just giving a time in the past, the snapshot read can get a consistent cut read of all the variables requested at that given time. This is not an easy feat to accomplish in a distributed system without using TrueTime and high-precision synchronized clocks, as it would require capturing and recording causality relationships across many different versions of the variables involved so that a consistent cut can be identified for all the variables requested in the snapshot read. That would certainly be highly prohibitive to store in the multiversion database and very hard to query as well. TrueTime provides a convenient and succinct way of encoding and accessing past consistent-cuts of the Spanner multiversion database.

But can we find any clever solutions to circumvent that problem without resorting to the high-precision clocks in TrueTime?

My close friend and frequent-collaborator Sandeep Kulkarni at Michigan State University had proposed HybridClocks in 2001. HybridClocks also exposed clock uncertainty ε in physical clocks, but it also used logical clocks in addition to the physical clocks to capture the finer-grain causality relationships that falls into the ε gray area.

Using HybridClocks it can be possible to relieve the atomic clock requirement in Spanner and use NTP instead. Even with ε at 100 msecs, we can still track finer-grain dependencies between variables using HybridClocks. The good news is that the size of the HybridClocks can be kept very succinct and can be recorded in Spanner database as timestamp to enable snapshot reads easily.

HybridClocks idea may also help speed up Spanner's high-precision clock-based implementation. In Spanner ε determines the rate of read-write transactions on a tablet. This is because, the coordinating leaders delay read-write transactions to commit with at least ε time apart in order to ensure that there is no uncertainty in past snapshot reads. It is possible to avoid this wasteful waiting by adding some logical clocks information to TrueTime as prescribed in HybridClocks.

Sandeep and I are currently exploring these ideas.


Our two followup work building on this topic are:
Beyond TrueTime: Using AugmentedTime for Improving Google Spanner
Hybrid Logical Clocks

Thursday, June 6, 2013

Antifragility from an engineering perspective

I read Nassim Taleb's "Antifragility: Things That Gain from Disorder" book a while ago, and enjoyed it a lot. I have been thinking about antifragility in engineering systems, and thought it would be good to put what I have come up with so far in writing to be able to contribute to the discussion. There are some nice reviews of the book in various places. My intention here is not to review the book, but to try to look at antifragility from an engineering perspective. Unfortunately, this came out mostly as rambling, but here it is for what it is worth.

Engineered systems: 

Let's start with giving examples from the mechanical world. I will try to give examples for three increasingly-superior levels of reliability: robust-yet-fragile < resilient < antifragile.

Robust-yet-fragile. A good example here is the glass. Glass (think of automobile glasses or gorilla glass) is actually very tough/robust material. You can throw pebbles, and even bigger rocks at it, and it won't break or scratch, well, up to a point that is. The glass is very robust to the anticipated faults (stressor) up to a point. But, exceed that point, and then the glass is in shambles.  That shows an unanticipated stressor (a black swan event in Taleb's jargon) for the glass: a ninja stone. The ninja stone is basically a piece of ceramic that you take from the spark plug, and is denser than glass. So if you gently throw this very little piece of ceramic to your car window, it breaks in shambles. This is a well-known trick to break into cars.

This is called a robust-yet-fragile structure, and this is actually why we had the Titanic disaster. Titanic, the ship, had very robust panels, but again upto a point. When Titanic exceeded that point a little bit (with the iceberg hitting it), the panels broke into shambles, very much like the glass meeting ninja stone. Modern ships after Titanic, went for resilient, instead of robust (yet fragile) panels. The resilient panels bend easier, but they don't break as miserably. They still hold together to the face of an extreme stressor. Think of plastic; it is less robust but more resilient than glass.

The robust-yet-fragile effect is also known as highly optimized tolerance. If you optimize tolerance for one anticipated stressor, you become very vulnerable to another unanticipated fault. (Much like the closed Australian ecosystem.) There is some literature on robust yet fragile systems.

Resilient. Robust systems mask stressors (upto a point), and they optimize for certain anticipated stressors and fail miserably for unanticipated stressors. The resilient system approach is the opposite. It prescribes not to mask stressors; the stressors perturb the system somewhat, but you eventually recover from them. A good slogan for resilience is robust-yet-flexible. These systems stretch somewhat with the stressor, but not fail completely or discretely.

Engineers embrace resilience today. We can see this idea in construction, in bridges, skycrapers, which are built in a flexible way to sway somewhat with wind and earthquakes but not fail completely. In fact, by flexing/stretching a bit with stressors, these systems remain unharmed and last longer, as they tolerate the shocks by rolling with them instead of absorbing the shocks fully.

Today, it is more or less established that systems should aim to expose stressors/faults/problems in some suitable manner, rather than hide them. Even as a system masks problems, it should at least report/log them. Otherwise your system will die slowly and you won't know (such accounts are more common than you would imagine).

Antifragility in mechanical systems. Taleb emphasizes repeatedly that the antifragility idea is not robust and not just resilient. An antifragile system thrives under failures, not just tolerate it. Resilient is better than robust(-yet-fragile), and antifragile is better than resilience in this respect.

Taleb says in the book that one of the few examples of antifragile materials he knows is carbon nanotubes, which gets stronger when faced with a stressor. Here is another antifragile material, a non-newtonian fluid.

There are not many antifragility examples to mechanical systems. In fact, Taleb gives the washing machine versus the cat section in the book to illustrate this point. The cat (living organism) is antifragile. The cat becomes stronger with stressors: It is use it or lose it for the cat, as her muscles would atrophy in the absence of use. So living organisms benefit from stressors. But for the washing machine (mechanical system) it is use it and lose it, as there is wear and tear associated with usage. A mechanical system does not gain much from stressors. (Actually my old car sitting in the garage disagrees. When a car is not used, it starts developing some problems, and you can say that the car benefits to a degree from being used and stressors. This is similar to the state of a vacant house versus an occupied house.)

For engineering of mechanical systems, maybe antifragility is not that applicable. With the wisdom of hundreds of thousands of man-years in engineering (practical systems and technology), why aren't we already seeing examples of antifragility?

Antifragility is not well formalized in the book, especially from the engineering perspective. So, there is some gray area for which systems we can categorize as antifragile. Does antifragility involve effects of randomization, white-noise effects? If that is the case, you can say that engineered systems benefit from some randomness. It was found that Norbert Weiner's mechanical target-tracking calculators/computers performed better in the plane (with noise from vibration) than on the ground (with no noise). There is this thing called stochastic resonance, which has been harvested to achieve some gains in practice. These are sort of white noise asyncronized noise to the systems. On the other hand, it is also well-known that synchronized resonance (an army marching in unison) can bring down bridges. Maybe by harvesting resonance and learning to use it, antifragile mechanical systems can be engineered. Unfortunately I don't know much about this domain, and I am not sure if there are some good progress and results in this domain.

Antifragility in cyber world

Antifragility is actually all about environment actions and reaction by the system. An antifragile system reacts to environment actions to keep a utility function high, and sometimes achieve a very high utility function (this is when stressors help to improve the system).  This is called the barbell strategy in the book: Be safe with the utility function, and sometimes improve it drastically.

Then we can define antifragility as the ability to improve the utility function to the face of "bad" environment actions. But what is bad? One man's food can be another man's poison. Bad can be defined with respect to (comparative to) other systems: When other systems are badly affected by the environment actions, if you are gaining then you are antifragile. (This also implies that for zero-sum environments, you are being antifragile to the expense of the fragility of others.)

Probabilistic algorithms. The CS literature is full of examples of probabilistic algorithms that do much better than any deterministic algorithm. The impossibility results (such as FLP and attacking generals) are circumscribed by probabilistic algorithms. Some algorithms specifically benefit from the randomness: the random the process the better they fare. A nice book has been written by Rajiv Motwani and Prabhakar Raghawan on randomized algorithms.

Randomness is especially great for breaking ties, and used in networking and distributed systems literature for that. For example, in the ethernet and wifi protocols, random backoffs are used so that communication over a shared medium is possible at all. Do these examples count as antifragility?

This also reminds me of an example from the Sync book by Strogatz (which is an excellent book about research, read it if you haven't already). Strogatz has formulated a sync problem for runners on a running track. "Each runner has his own speed, which is analogous to the frequency of an oscillator, and all the runners shout at and are heard by every other runner, which is analogous to the coupling between the oscillators. Depending on the initial conditions and the setup of the coupling, a group of runners may synchronize into a single block all running at the same speed, fall into chaos everybody running on her own, or anything in between." It was found that when the runners (speeds and positions) are not very uniform but rather somewhat random, synchronization was possible and achieved faster. It was also found that there is a sudden phase transition non-sync and sync outcomes. I don't know what (if anything) this implies for antifragility.

BitTorrent example. For an antifragile system, the more you try to stress the system, the stronger it grows. A great example is bittorrent. In bittorrent, the more a file becomes traffic hotspot, the faster it gets to download it. Bittorrent streams gains from hotspots and contention by exploiting the Network effect to provide scaling. If a file is popular for downloads, then its parts are available from more peers, so it will be faster to download it from many peers available.

Fault-tolerant computing angle

Self-stabilization. In the fault-tolerant computing domain, self-stabilization comes to mind immediately as an example of resiliency. Self-stabilization, first proposed by Dijkstra in 1970s, calls for not categorizing faults but instead to design tolerance for anticipated faults. Treat all faults uniformly as a perturbation to the system, and design your system to be tolerant to perturbation. So regardless of faults or combinations of faults, your system is going to recover. This is the self-stabilization view. There has been a lot of work on self-stabilizing computer systems in the literature. The trivial kinds of stabilizing systems are soft-state systems, and restartable systems.

Self-stabilization does not fit the antifragility definition. Since the state corruption abstraction is too abstract and well-defined, the "corruption helps" becomes an oxymoron. Corruption is pre-defined to be the bad thing, so it is hard to play that game. Maybe you have zones of perturbation, and the further you are perturbed away, the faster you recover from it. But, if you have a fast recovery method why not use it for the other regions as well?

Self-adaptive systems, a more recent concept can fit the antifragility idea. But self-adaptive systems are not well defined/formalized, and I am not aware of any big success stories from that line of thinking yet.  And I guess the philosophical difference between self-adaptive versus antifragile systems is that, you can still have a predefined/constant antifragile system that is not self-adaptive. You can have an antifragile system that uses the barbell idea, and does not do bad in any input, but does great in some inputs. That system is not adaptive, but it is still antifragile.

Software rejuvenation. Software rejuvenation can be an example of antifragility in the fault-tolerant computing domain. The software rejuvenation idea is to reset the software occassionally to get rid of memory leaks, unoptimal or incorrect bugs. The antifragility angle is that, if the number of faults increase, you start a software rejuvenation. So the increased number of faults helps for faster recovery/rejuvenation of the software. Specifically the Microsoft bug reporting paper comes to my mind, which is a really interesting piece of work that appeared in SOSP 2009. The idea is that if a bug is more prevalent, it is reported more automatically and is fixed first.

Wednesday, June 5, 2013

RAMCloud reloaded: Log-structured Memory for DRAM-based Storage

I had written a review about "the case for RAMCloud" paper in 2010.  RAMCloud advocates storing all the data in the RAM over distributed nodes in a datacenter. A RAMCloud is not a cache like memcached and data is not stored on an I/O device; DRAM is the permanent home for data.  Obviously, storing everything in RAM could yield a very high-throughput (100-1000x) and very low-latency (100-1000x) system compared to disk-based systems.

In the last 3 years, the RAMCloud group headed by John Ousterhout at Stanford has done significant work on this project, and this is a good time to write another review on RAMCloud. My review mostly uses (shortened) text form their papers, and focuses on some of the significant ideas in their work.

State of the RAMCloud 

Data model. RAMCloud provides a simple key-value data model consisting of uninterpreted data blobs called objects. Objects are grouped into tables that may span one or more servers in the cluster; a subset of a table stored on a single server is called a tablet. Objects must be read or written in their entirety. RAMCloud is optimized for small objects --a few hundred bytes or less-- but supports objects up to 1 MB.

Each master's memory contains a hash table and a collection of objects stored in DRAM. The hash table contains one entry for each object stored on that master; it allows any object to be located quickly, given its table and key.

RAMCloud recovery. To ensure data durability against crashes and power failures, each master must keep backup copies of its objects on the secondary storage of other servers. The backup data is organized as a log for maximum efficiency. Each master has its own log, which is divided into 8 MB pieces called segments. Each segment is replicated on several backups (typically two or three).  A master uses a different set of backups to replicate each segment, so that its segment replicas end up scattered across the entire cluster. E.g., this segment replicated at nodes 11 and 44, and the next segment at nodes 26 and 37.

When a master receives a write request from a client, it adds the new object to its memory and forwards information about that object to the backups. The backups append the new object to segment replicas stored in nonvolatile buffers; they respond to the master as soon as the object has been copied into their buffer, without issuing an I/O to secondary storage. Once the master has received replies from all the backups, it responds to the client. Each backup accumulates data in its buffer until the segment is complete, at which point it writes the segment to secondary storage and reallocates the buffer for another segment.

RAMCloud recovery is then possible by reading from the segments and constructing the hashtables at the masters. The paper describing this recovery process appeared in SOSP 2011.

RAMCloud log cleaner. RAMCloud uses a log cleaner to reclaim free space that accumulates in the logs when objects are deleted or over-written. Each master runs a separate cleaner. The cleaner selects several segments to clean (based on the amount of free space & age of data). The cleaner then scans these segments stored in memory and copies any live objects to new survivor segments. Liveness is determined by checking for a reference to the object in the hash table. The cleaner makes the old segments' memory available for new segments, and notifies the backups for those segments that they can reclaim the storage for the replicas.

In this post, we will look at the paper for the RAMCloud log cleaner, which has been published as a technical report and is under submission at SOSP'13.

The problem with Memory allocation & Garbage collection

Memory allocators fall into two general classes: noncopying allocators and copying allocators.  Non-copying allocators such as malloc cannot move an object once it has been allocated, so they are vulnerable to fragmentation. Non-copying allocators work well for individual applications with a consistent distribution of object sizes, but they can easily waste half of memory when allocation patterns change.  Copying allocators are those that can move objects after they have been created. In principle, garbage collecting (GC) can solve the fragmentation problem by moving live data to coalesce free heap space.

GC comes with a trade-off: at some point all of these collectors (even those that label themselves as "incremental") must walk all live data, relocate it, and update references. This is an expensive operation that scales poorly, so GC delay global collections until a large amount of garbage has accumulated. As a result, they typically require 1.5-5x as much space as is actually used in order to maintain high performance. This erases any space savings gained by defragmenting memory.

Pause times are another concern with copying allocators with GC. At some point all collectors must halt the processes' threads to update references when objects are moved. Despite work on real-time garbage collectors, even state-of-art solutions have maximum pause times of hundreds of microseconds, or even milliseconds --this is 100 to 1,000 times longer than the round-trip time for a RAMCloud RPC.

Log Structured Memory (LSM)

Existing allocators are not able to use memory efficiently, particularly in the face of changing access patterns, so are not applicable for RAMCloud. An ideal memory allocator for a DRAM-based storage system such as RAMCloud should have two properties: 1) It must be able to copy objects in order to eliminate fragmentation. 2) It must not require a global scan of memory: instead, it must be able to perform the copying incrementally, garbage collecting small regions of memory independently with cost proportional to the size of a region. This paper shows how to use a log-structured approach to memory management to achieve fragmentation-free and fast memory allocation.

In RAMCloud it was a natural choice to use a logging approach on disk to backup the data stored in main memory (given also that log-structured file-system (LSF) has been introduced by Ousterhout in 1991, it was inevitable actually :-). However, it was surprising to discover that logging also makes sense as a technique for managing the data in DRAM: Log-structured memory (LSM) takes advantage of /the restricted use of pointers/ in storage systems /to eliminate the global memory scans/ that fundamentally limit existing garbage collectors. The result is an efficient and highly incremental form of copying garbage collector that allows memory to be used efficiently even at utilizations of 80-90%.

RAMCloud uses a single log-based approach for managing both disk and main memory, with small policy differences that optimize the usage of each medium. Combining LSM and LSF, RAMCloud adopts a 2-level approach to cleaning with different policies for cleaning data in memory versus secondary storage. Since log data is immutable once appended, the log cleaner can run concurrently with normal read and write operations. Furthermore, multiple cleaners can run in separate threads. In the rest of this post, we will discuss each of these ideas: the LSM logs, 2-level cleaning, and parallel cleaning.

Log metadata and log cleaning

Each new log segment contains a log digest that describes the entire log. Every segment has a unique identifier, allocated in ascending order within a log (see Fig 3 above). Each object in the log must be self-identifying: it contains the table identifier, key, and version number for the object in addition to its value. Log metadata also contains tombstones that identify deleted objects. When an object is deleted or modified, RAMCloud does not modify the object's existing record in the log. Instead, it appends a tombstone record to the log.

To reclaim free space, the log cleaner should copy live data out of the segments it chooses for cleaning. Unfortunately, the cost of log cleaning at the disk rises rapidly as memory utilization approaches 100%.  E.g., if segments are cleaned when 80% of their data are still live, the cleaner must copy four bytes of live data for every byte it frees. If segments are cleaned at 90% utilization, the cleaner must copy 9 bytes of live data for every byte it frees.

2-Level cleaning

In the original implementation of RAMCloud, disk and memory cleaning were tied together: cleaning in memory was mirrored to the backup copies on disk. This made it impossible to achieve both high memory utilization and high write throughput. The solution to this is to clean in-memory and on-disk logs independently: 2-level cleaning. This way, memory can have higher utilization than disk. The cleaning cost for memory will be high, but DRAM can easily provide the bandwidth required to clean at 90% utilization or higher.  Disk cleaning happens less often. The disk log becomes larger than the in-memory log, so it has lower overall utilization (50%), and this reduces the bandwidth required for cleaning.

The first level of cleaning, called segment compaction, operates only on the in-memory segments on masters. It compacts a single segment at a time, copying its live data into a smaller region of memory and freeing the original storage for new segments. Segment compaction maintains the same logical log in memory and on disk: each segment in memory still has a corresponding segment on disk. However, the segment in memory takes less space because deleted objects and obsolete tombstones have been removed.

The second level of cleaning is called combined cleaning. If the disk log is allowed to grow until it consumes twice as much space as the log in memory, the utilization of segments cleaned on disk will never be greater than 50%, which makes cleaning relatively efficient.

Segments and seglets. With compaction, however, segments in memory can have different sizes. Each RAMCloud master divides its memory into fixed-size 64KB seglets. A segment consists of a collection of seglets, and the number of seglets varies with the size of the segment. Segment compaction cannot reorganize data, since it must preserve the 1-to-1 mapping between segments in memory and those on disk. Combined cleaning is there to enable segment reorganization.

Parallel cleaning

Parallel cleaning in RAMCloud is greatly simplified by the use of a log structure and simple metadata: Since segments are immutable after they are created, the cleaner never needs to worry about objects being modified while the cleaner is copying them. This means that the basic cleaning mechanism is very straightforward: the cleaner copies live data to new segments, atomically updates references in the hash table, and frees the cleaned segments.

There are 3 points of contention between cleaner threads and service threads handling read and write requests: 1) both cleaner and service threads need to add data at the head of the log, 2) the threads may conflict in updates to the hash table, 3) the cleaner must not free segments that are still in use by service threads. We consider these 3 contention points next.

Head of log contention. The most obvious way to perform cleaning is to copy the live data to the head of the log, but this would create contention for the log head between cleaner threads and service threads that are writing new data.  RAMCloud's solution is for the cleaner to write survivor data to different segments than the log head. Each cleaner thread allocates a separate set of segments for its survivor data. Synchronization is required when allocating segments, but once segments are allocated, each cleaner thread can copy data to its own survivor segments without additional synchronization.

Hash table contention. Hash table is used both by service threads and cleaner threads. The cleaner uses the hash table to check whether an object is alive (by seeing if the hash table currently points to that exact object). If the object is alive, the cleaner copies it and updates the hash table to refer to the new location in a survivor segment. To ensure consistency while reducing contention, RAMCloud currently uses fine-grained locks on individual hash table buckets. Although contention for these locks is low (only 0.1% under heavy write and cleaner load) it still means that the cleaner must acquire and release a lock for every object it scans, and read and write requests must also acquire an extra lock. Lockless solution is future work.

Freeing segments in memory. Once a cleaner thread has cleaned a segment, the segment's storage in memory can be freed for reuse. However, it is possible that a service thread had begun using the data in the segment before the cleaner updated the hash table; if so, the cleaner must not free the segment until the service thread has finished using it.  A simple solution is to use a reader-writer lock for each segment, with service threads holding reader locks while using a segment and cleaner threads holding writer locks before freeing a segment. However, this would increase the latency of all read requests by adding another lock in the critical path.

Instead, RAMCloud uses a mechanism based on epochs, which avoids locking in service threads. The only additional overhead for service threads is to read a global epoch variable and store it with the RPC.  When a cleaner thread finishes a cleaning pass, it increments the epoch and then tags each cleaned segment with the new epoch (after this point, no new request will use the cleaned segment).  The cleaner occasionally scans the epochs of active RPCs and frees the segment when all RPCs with epochs less than the segment's epoch have completed. This approach creates additional overhead for segment freeing, but these operations are infrequent and run in a separate thread where they don't impact read and write times.

Evaluation and concluding remarks

In the most stressful workload, a single RAMCloud server can support 230,000-400,000 durable 100-byte writes per second at 90% memory utilization. The two-level approach to cleaning improves performance by 2-8x over a single-level approach at high memory utilization, and reduces disk bandwidth overhead by 2-20x.  Parallel cleaning effectively hides the cost of cleaning: an active cleaner adds only about 3% to the latency of typical client write requests. For detailed evaluation results, you can check the paper.

To recap, this work showed that logging also makes sense as a technique for managing the data in DRAM: Log-structured memory takes advantage of /the restricted use of pointers/ in storage systems /to eliminate the global memory scans/ that fundamentally limit existing garbage collectors. The result is an efficient and highly incremental form of copying garbage collector that allows memory to be used efficiently even at utilizations of 80-90%.

Questions for further thinking along these lines:
"Each object in the log must be self-identifying; when the log is scanned during crash recovery, this information allows RAMCloud to identify the most recent version of an object and reconstruct the hash table." Then, why not backup (copy reconstruct) the hash-table as well?

"Log-structured memory takes advantage of the restricted use of pointers in storage systems to eliminate the global memory scans that fundamentally limit existing garbage collectors." Can you think of similar "restricted use of pointers" approaches to enable log compaction and parallel cleaning?

LSM worked for storage systems. What other applications/use-cases would this be appropriate? What applications/use-cases would this be inappropriate?

Monday, May 27, 2013

IPDPS'13 day1 graph algorithms

Here are some of my notes from the first day of IPDPS.

Optimizations & analysis of BSP graph processing models on public clouds

Mapreduce/Hadoop is not very suitable for graph processing (which requires iterating over and over on the same graph), and this led to the Pregel graph processing framework by Google. Pregel is based on the Bulk Synchronous Parallelism (BSP) model. Here is a summary of Pregel if you are not familiar with it. In short, Pregel uses a vertex-centric graph processing model, where the same code is executed on all vertices concurrently. Pregel uses message passing along edges and barrier synchronization at supersteps (i.e., rounds) to iterate over the graph. This paper looks at optimizations and analysis of BSP graph processing frameworks.

This group had access to Microsoft Azure cloud computing framework, and they wanted to experiment with Pregel there, so they implemented Pregel (following the description in the Pregel paper) from scratch in .NET environment. They early on noticed that Pregel gets fairly memory intensive as it holds on to all the messages sent to all the vertices in the worker. They started analyzing it further to see how the memory usage changes in the lifetime of a Pregel program over many supersteps. They discovered that there is a camel hump in the middle of the program lifetime (i.e., in the middle supersteps) for most traditional graph programs, such as all-pairs shortest path and betweenness centrality. This is because these graph programs tend to exchange more messages towards the middle supersteps as the computation flourishes and the number of messages exchanged subdues again as the computation comes closer to termination. (It turns out this hump is not present for PageRank.) This hump, of course, is important as it has an impact on how many workers you need to provision, because you need to provision for the worst-case memory usage of the workers.

So the group goes on to look into how they can constrain this hump to have a predictable memory usage through all supersteps and to facilitate managing the memory constraints of the workers. To this end, they come up with the swath concept. Swath is a subset of vertices of entire graph on which the algorithm is being initiated. Their goal is to pick the swath size that is the best fit into the main memory (amplitude) of the workers. They work on identifying swath initiation heuristics (when are subsequent swaths of verices activated) and swath size heuristics (how many vertices are active concurrently in a swath). They experiment with two approaches, sampling approach and adaptive approach, to determine when the next swath is initiated. By breaking computation into swaths of vertices and using our sizing heuristics, they are able to achieve up to 3.5x speedup over the maximum swath size that does not cause the a failure. Of course, a limitation of the swath approach is that it assumes that the program execution is embarassingly parallel and you can execute the program over swaths distributed in time without causing any correctness issues. So this approach is applicable only to those type of graph programs, such as all-pairs shortest path and betweenness centrality.

The hump observation in memory usage of BSP-based graph processing frameworks is a nice and useful one. We have also worked on BSP-based graph processing frameworks and focused on improving the opensource Giraph implementation of Pregel. We provided serializability to Giraph by introducing an optimization: internal vertices in a worker do not message each other but rather read each others' state directly from the memory of the worker they reside. Our optimization not only provides a stronger serializability to Giraph, but it also prevents this memory camel-hump haunting BSP programs as well. Our paper has been accepted to EuroPar13, and I will write a detailed post about our paper soon.

Multithreaded graph partitioning 

This paper describes the design and development of an extension to the Metis partitioner to enable multithreading, and while doing so the paper also thoroughly explores the Metis design space.

The Metis partitioner is a tool for divide a graph into minimally connected and roughly equal parts. While partitioning the graph, the constraint is that the largest partition produced should be smaller than a given size (the worker memory size), and this makes the problem an NP-hard problem. The way Metis works is via using a multilevel paradigm of 1) coarsening, 2) initial partitioning, and 3) uncoarsening. In the first two steps an approximate partitioning is made via coarsening, and then Metis does a refinement on "the bordering vertices" to find better partitioning in the last step. Since the coarse partitioning works over all vertices instead of just border vertices it is generally the bottleneck step.

The paper investigated several alternatives for each of the 3 steps above. For the coarsening step they looked at fine-grain matching (locking based), multi-pass matching, and unprotected matching (which requires a conflict resolution at the end, but this is no problem because only a small percentage of conflicts occurs). For the initial partitioning they tried parallel recursive bisectioning, and parallel k-sectioning. For the refinement step they tried coarse-grain and fine-grain approaches. They give an experimental evaluation of all these different approaches on graph datasets (roadmap and vlsi circuit) that consist of millions of vertices. They evaluated for performance and cut quality, and showed that their multithreaded metis is a clear winner. 

One of the lessons learned from the multithreaded metis is that using unprotected operations (for coarsening step) is not that dangerous or crazy, because cleaning up after race conditions turned out to be faster than preventing them. This group made their code open source at

Finally, some ramblings 

I never understood the people that go all that trouble to travel to a conference who only then sit in the lobby or the room to websurf and do emails hunching on their laptops. If there is a sensible explanation for this behavior can someone tell me, so I can stop wondering about this? Yes, presenters are not always doing a great job at explaining things, but after all that trouble to traveling to the conference, those people owe it to themselves to get the most out of the conference by being there as a whole and not just physically.

My theory about the low quality of some presentations is that the presenters often give presentations to impress and not to teach/educate/communicate. (Ted Herman once warned me about this when I was a graduate student, and I tried to do my best not to fall into that trap ever since.) I believe that by just focusing on the message to communicate and being committed to communicating it, most presenters would be able to do a decent job. Instead presenters seem to feel like they need to show off how technically through and how clever they had been, how sophisticated they are, and the result is a dry, defensive, and incomprehensible presentation. Look, you have been thinking about that problem for the last one year at least, and I am hearing/seeing about it the first time here, and you expect me to understand all the subtleties in that problem space in 10 minutes? If you are committed to teaching/educating/communicating in your allotted 25 minute time slot, you should focus on explaining the insight and the most important technical results (not all of them) in the simplest of terms. You can mention that the problem has several subtleties and you are referring the audience to the technical paper for the full details. I am not saying that you shouldn't be technical; you can be technical but not to the expense of being cryptic or exclusive.

Two-phase commit and beyond

In this post, we model and explore the two-phase commit protocol using TLA+. The two-phase commit protocol is practical and is used in man...