Wednesday, May 23, 2012

My advice to the 2012 class

I feel sad at graduations. When I graduated college in 1997, all I felt was sadness and melancholy, rather than joy. I chose not to attend my MS and PhD graduation ceremonies, but I remember I felt sad after both defenses. It turns out, I also feel sad at my students' graduations. I graduated 3 PhD students and a couple MS students with the thesis option. It was always sad to see them depart.

This semester, I have been teaching distributed systems to the graduating class (seniors) at my sabbatical institute, Bilkent University. They are bright and talented (for example, Ollaa was developed by 3 of them). For the last class of their semester and their college lives as well, instead of discussing yet more about distributed systems, we walked out of the classroom, sat on the grass outside, and had a nice chat together.

I gave my students some parting advice. I tried to convey what I thought would be most useful to them as they pursue careers as knowledge/IT workers and researchers. Here is that advice in 3 stories and lessons.

1st Story: How I almost flunked the differential equations course

I was taking differential equations course in the first year of college. I was feeling comfortable with the course because I had seen some of the content in high school. I had gotten an A in the first midterm, so I didn't pay attention to the course until the finals. For the finals I had to go through a lot of content. I read/followed the practice questions in the textbook rather than solving questions on my own. As I followed the solutions of the practice questions, they seemed obvious and easy to do. Of course, when I took the final exam I was baffled since I couldn't solve any of the questions on my own. What seemed obvious when I was following the book turned out to be not obvious at all when I had to come up with them myself. I did so bad that I almost flunked the course. I retook the course in the summer, this time I didn't repeat the same mistake and got an A.

1st Lesson: The best way of learning is by doing

There are different levels of knowing. The first level is "knowing by word". With passive learning (like I did for studying for the diff final), you get the most limited form of understanding/knowing. It is ephemeral, and you are not able to reconstruct that on your own. The second level is "knowing by first hand experience". At this level, you are able to understand/witness and if necessary rediscover the knowledge on your own. Finally, the third level is "knowing by heart". At this level, you internalized the knowledge so that it becomes part of your nature. (See Grok).

The best way of learning is by doing. You need to work hands-on, make a lot of mistakes so you can start to internalize that knowledge. That means you should make, and produce things continuously. For this to work in a sustainable manner, you need to change your default mode from the consumer mode to the producer mode. Always have a side-project to sharpen the saw. Start your day by working for an hour or so on your side-project, instead of browsing news/twitter/etc. At night, spare some time for your side-project as well.

2nd Story: (Ok, a quotation actually, not a story)

Once you get a B.S., you think "you know everything". Once you get an M.S., you realize "you know nothing". Once you get a Ph.D., you realize that "yes, you know nothing, but that is not a problem, because nobody knows anything!"

2nd Lesson: We are all equally stupid

This is the summary of the PhD experience, so this bears repeating: "you know nothing, but that is not a problem, because nobody knows anything!"

We are all equally stupid. Our brains consist of 1.3 kg (3pounds) of gooey material. It is hard to hold 10 variables in our brains simultaneously. Our brains are very poor at reasoning about even very short concurrent programs (I know from first-hand experience :-). Our brains are also very susceptible to biases and fallacies.

The people you see as experts/geniuses are that way because they have been working/thinking/internalizing these topics for more than a decade. Those experts become baffled if you ask them a question a little bit outside of the frame they are accustomed to.

So here is the trick, "you can level the playing field by work on new things/technologies".

3rd Story: Kung-fu master and his student

M: You must pass one more test: What is the meaning of the Black Belt?
S: The end of my journey, a well-deserved reward for my hard work.
M: You are not ready for the Black Belt. Return in one year.

One year passes...
S: It is a symbol of distinction and the highest achievement in our art
M: You are not ready for the Black Belt. Return in one year.

Another year passes...
S: The Black Belt represents not the end, but the beginning, the start of a never-ending journey of discipline, work and the pursuit of an ever higher standard.
M: You are now ready to receive the Black Belt and begin your work.

3rd Lesson: Just ask

There are actually two lessons to this story. The first one is that, like the black belt, the Ph.D. is the beginning, not the culmination, of your career. But the more important second lesson is: If you bother to talk to and learn from the people who have already gone through this process, you might graduate two years earlier :-) :-)

So, for whatever goal you are chasing, be effective. Learn the criteria for the next step and optimize for it. This again connects back to the first lesson. A shortcut to getting the first-hand knowledge is to ask to the people with the first-hand knowledge. This is of course not the same thing as first-hand knowledge, but it is the closest it gets to that. (Steve Jobs also has the following advice about asking.)

I will write another blog post specifically about how to succeed in PhD, but that may take some time.

Wednesday, May 2, 2012

Replicated/Fault-tolerant atomic storage

The cloud computing community has been showing a lot of love for replicated/fault-tolerant storage these days. Examples of replicated storage at the datacenter level are GFS, Dynamo, Cassandra, and at the WAN level PNUTS, COPS, and Walter. I was searching for foundational distributed algorithms on this topic, and  found this nice tutorial paper on replicated atomic storage: Reconfiguring Replicated Atomic Storage: A Tutorial, M. K. Aguilera, I. Keidar, D. Malkhi, J-P Martin, and A. Shraer, 2010.

Replication provides masking fault-tolerance to crash failures. However, this would be a limited/transient fault-tolerance unless you reconfigure your storage service to add a new replica to replace the crashed node. It turns out, this on-the-fly reconfiguration of a replicated storage service is a subtle/complicated issue due to the concurrency and fault-tolerance issues involved. This team at MSR @ Slicon Valley has been working on reconfiguration issues in replicated storage for some time. But, in this post I am not going to talk about their reconfiguration work, and instead will just focus on the replicated/fault-tolerant atomic storage part.

Majority replication algorithm
There is this elegant algorithm for replicated/fault-tolerant atomic storage that I think every distributed systems researcher/developer should know about. It is simple and powerful. And, it is fun; I promise your brain will feel better about itself after you learn this majority replication algorithm. The algorithm originally appeared in: Sharing memory robustly in message-passing systems, H. Attiya, A. Bar-Noy, and D. Dolev, 1995. Here I will summarize the algorithm based on the discussion provided in the tutorial paper.

The algorithm employs majority replication to provide atomic read/write operations in the presence of crash failures under an asynchronous execution model. (Of course, the FLP result states the impossibility of solving consensus under this model, but this is a weaker problem than solving consensus.) Here, atomic means that the system provides linearizability, a strong type of consistency that guarantees that a read returns the most recent version of data. This single-copy consistency is stronger than Amazon Dynamo's eventual consistency and even GFS's consistency. The algorithm is on the CP side of the CAP triangle; availability is sacrificed when a majority of replicas are unreachable.

Write operation
Each storage node keeps a local copy of what it believes to be the most recent value stored by a client, together with a timestamp indicating the freshness of the value. A vt-pair refers to a pair of (value, timestamp), which a storage node keeps. To execute a write(v) operation, the client proceeds in two phases: it executes a get phase followed by a set phase.

get phase:
vt-set= read vt pairs from majority of storage nodes
select unique t' such that t' > max (t in vt-set)

set phase:
write_request (v, t') on storage nodes
storage nodes store vt' only if t' > their stored t
storage nodes send ack
when majority acks are received, return OK

(Uniqueness of t' can be ensured by adjoining the client-id to the timestamp, so that a timestamp consists of a pair with a number and a client-id, ordered lexicographically.)

Read operation
The read() operation is very similar to the write operation. The client also executes the get and set phases back to back. The only difference is that in the set phase, the client writes back the maximum timestamp vt pair it learns in the get phase.

get phase:
vt-set= read vt pairs from majority of storage nodes
select vt' such that t' = max (t in vt-set)

set phase:
write_request (v,t') on storage nodes
storage nodes store vt' only if t' > their stored t
storage nodes send ack
when majority acks are received, return v

The Set phase in read() is needed to prevent oscillating reads due to storage node failures, in which successive reads oscillate between a new and an old value while a write is in progress-- which is a violation of atomicity. The Set phase ensures that a subsequent read() will return a value at least as recent as the value returned by the current read().  The key intuition here is that any two majorities of storage nodes always have at least one storage node in common. Therefore if some client stores value v at a majority of storage nodes then another client is guaranteed to see v when it queries any majority.

Relation to Paxos
The majority replication algorithm seems closely related to the Paxos consensus algorithm. The t in the vt-pair corresponds to ballot number in Paxos. The Get and Set phases correspond to the phase1 and phase2 of Paxos. (Of course since majority replication is not for consensus, there is nothing corresponding to phase3:commit of Paxos.) Finally, the read operation corresponds to the learning operation in Paxos. Now the differences. In majority replication clients do not coordinate for the write operation, whereas in Paxos, leaders are constrained to re-propose/rewrite the value with the highest t. Also to avoid the dueling-leaders problem, Paxos relies on a leader election service so that the system eventually converges to one leader that can safely anchor/finalize a value as the decision value. (The leader election service in Paxos needs some partial-synchrony to make progress, so consensus is achieved only then.) In summary, majority replication is a building block for Paxos consensus.

This relation is explained in more detail in the "Perspectives on the CAP theorem" paper.

Concluding remarks
The nice thing about this elegant algorithm is that it can tolerate/mask the crash of a minority of storage nodes and an arbitrary number of client nodes, and it works in an "asynchronous" system. That the correctness of this algorithm does not depend on a synchronous system makes this algorithm really robust for deployment in distributed systems, especially WAN systems.

Consensus/Paxos based algorithms can make reconfiguration of replication service possible. Examples are RAMBO algorithm, and FAB: Building Distributed Enterprise Disk Arrays from Commodity Components, which provides an implementation of these ideas. But, the reconfiguration tutorial paper explains that it is also possible to implement reconfiguring of replication under the asynchronous model (without consensus)!!

Two-phase commit and beyond

In this post, we model and explore the two-phase commit protocol using TLA+. The two-phase commit protocol is practical and is used in man...