Saturday, December 31, 2016

Selected blog posts from 2016

This is my 42nd post for the year. As is the custom in a year-end post, I mention some highlights among my 41 posts in 2016.

Machine Learning

Facebook papers


Distributed Coordination

TLA+/Pluscal Modeling


Monday, December 26, 2016

Learning Machine Learning: A beginner's journey

I have been learning about machine learning and deep learning (ML/DL) for the last year. I think ML/DL is here to stay. I don't think this is a fad or bubble! Here is why:

  1. ML/DL has results. It is hard to argue against success.
  2. ML/DL has been on the rise organically since 1985 (with the backpropagation algorithms) and went through another phase of acceleration after 2005 (with the wide availability of big data and distributed data processing platforms). The rise of ML/DL is following a rising curve pattern, not the pattern for a hyped ephemeral bubble. Since it grew gradually over many years, I am betting it will be around for at least the same amount of time. 
  3. ML/DL has been co-developed with applications. It has developed very much on the practice side with trial and error, and its theory is still lagging a bit behind and is unable explain many things. According to Nassim Taleb's heuristics ML/DL is antifragile.
  4. ML/DL has the market behind it. Big money provides big incentive and has been attracting a lot of smart people. This many smart people cannot be wrong.

Certainly there is also a lot of hype about ML/DL. ML/DL proved viable for specific sets of applications, and it is a hyperbole to claim that the general AI has arrived. We are far from it. But that is a good thing, because we will have a lot of juicy problems to work on.

So I am doubling down on ML/DL.

Here are my first impressions learning about ML/DL. ML/DL uses a very different toolkit and approach than the distributed systems field I grew up in. I was initially surprised and taken aback by the very experimental and trial and error nature of ML/DL. ML/DL is dealing with noisy/fuzzy/messy real world data and naturally the field produced statistical and probabilistic tools. Validation is only via showing performance on the test set. The data set is the king. Debugging is a mess, and learning is very opaque. On the other hand, I really like the dynamism in the ML/DL area. There are a lot of resources and platforms and a lot of very interesting applications.

My interest in ML/DL is in its interactions with distributed systems. I am not interested in writing image/text/speech processing applications. I learned about ML/DL to think about two questions:
  1. How can we build better distributed systems/architectures to improve the performance of ML/DL systems/applications?
  2. How can we use ML/DL to build better distributed systems?
These are big questions and will take long to answer properly, so I hope to revisit them later. Below I talk about how I went about learning ML/DL, and in the coming days I hope to write brief summaries introductory ML/DL concepts and mechanisms.

How I went about learning ML/DL

In January, I started following Andrew Ng's machine learning course at Coursera. (Alternatively, here is Ng's course material for CS 229 at Stanford.)  After the kids went to sleep, I spent an hour each night following Ng's class videos. Andrew Ng has a nice and simple way of explaining ML concepts. He is a very good teacher.

On a side note, if you like to learn a bit about Ng's thinking process and his approach to life, creativity, and failure, I  recommend this interview. It is a very good read.

I really liked the first 3 weeks of Ng's course: Introduction and Linear Regression, Linear Regression with multiple features, and Logistic Regression and regularization. But as the course went to logistic regression with nonlinear decision boundaries, I started to get overwhelmed with the amount of information and complication. And as the course progressed to neural networks, I started to get lost. For example, I could not form a good mental model and picture of forward and backward propagation in neural networks. So those parts didn't stick with me. (I was also not following the programming assignments well.)

I think the problem was that Ng was explaining the neural networks concepts in a general/generic way. That sounded too abstract to me. It might have worked better if he had settled on a small concrete use case and explained the concepts that way.

Recently, I started auditing a deep learning course on Udacity with a Google Engineer, Vincent Vanhoucke. This course offered a simpler introduction to deep learning. The course started with multinomial logistic classification. Since I knew about logistic regression, I could follow this easily.  I liked the softmax function, one-hot encoding, and cross entropy ideas as they are all very practical and concrete concepts. The course presented these with the use case of MNIST letter classification for the first 10 letters. 

Then using the same MNIST example, the course introduced rectified linear units (ReLu) as a simple way of introducing nonlinearity and showed how to chain multinomial logistic classification with ReLus to construct a deep network that solves the letter classification task much better. This time, I was able to follow the forward and backward propagation ideas much better. Instead of explaining deep networks in a general abstract way, this course explained it in a ReLu-specific way and reusing the letter classification example built with logistical regression. (In Ng's class ReLu came on week 7, when introducing support vector machines).

As a quick way to overview Vincent's course contents, you may watch this YouTube video from another Google engineer. (Watch at 1.5x speed.) The talk uses the same MNIST example and similar approach. But it doesn't go into explanation of forward/backward propagation and deriving the ReLus, instead it focuses more on giving you introduction to TensorFlow skills as well as introducing the basic neural network concepts. Within 1 hour, the talk gets you learning about convolutional networks. The presentation is nice and easy to follow.

Reflecting back, it is interesting how much YouTube helped me to learn about ML/DL. I normally like reading papers better than listening/watching videos, or maybe that is because I am more accustomed to learning that way. But for learning about ML/DL, these YouTube videos have been very helpful. It looks like lecturing is making a come back.

Here is a bonus video of Andrew Ng talking about nuts and bolts of applying deep learning. Ng is a great teacher, so it is easy to follow the concepts presented and learn a lot from this talk.

In the coming days I hope to write brief summaries of the introductory ML/DL concepts I learned from these courses. In my first couple posts on this, I plan to follow the first 3 weeks of Ng's class. There are very good course notes of that course here, and I will summarize even more briefly to mention the big ideas. Then I will switch to Vanhoucke's course to introduce the ideas in multinomial logistical regression and the generalization to neural networks and deep learning from there. I will use the #mlbegin tag for the series. Let's see how that goes.

UPDATE (1/11/17): I wrote them up. Here are the introductory ML/DL concepts I learned from those courses:

Sunday, December 25, 2016

Book review: Intuition pumps and other tools for thinking

The title of this book grabbed my attention immediately. Intuition pumps is a very visual term, and who doesn't like to learn about tools for thinking. The premise of the book is given in the first quote:
"You can't do much carpentry with your bare hands and you can't do much thinking with your bare brain." -- Bo Dahlbom.

The book is by Philosopher Daniel Dennett. The book is surprisingly readable for a philosophy book, which are full of jargon and big words. Dennett took special care in writing in a simple an clean way. For this, he recruited help from undergraduate students in his university, Tufts. The book content was discussed at an undergraduate seminar Dennett offered, and he then got help from these students review the book. This is later revealed as one of Dennett's thinking tools: "Explain to nonexperts: use a decoy audience". I think that worked: the book is accessible to an undergraduate, but a motivated one.

The first chapter explained about intuition pumps and thinking tools. An intuition pump is a simple mind tool. Dennett mentions Galileo's thought experiment that concluded small and big things fall with the same speed as an intuition pump. (Galileo thought about tying a light rock to a heavy rock. If you accept the faulty premise, since the combined system is more heavy it should fall faster, but on the other hand, since the light rock is supposed to fall slower, shouldn't it be slowing down the heavy rock it is tied to. Contradiction.)

A good story/narrative, such as "Sour grapes by Aesop", qualifies as an intuition pump for thinking about some behavioral motivations. Dennett says scientists often underestimate the use of informal tools of prose & poetry as intuition pumps. But this is probably rightly so, because some intuition pumps are misleading. Dennett calls these "Boom Crutch" (he has a catchy name for everything), and uses those to replace the technical jargon.

Chapter 2 is about general thinking tools. Dennett says "History of philosophy is smart men making tempting mistakes". But he is of course not saying that as a negative. He later continues to say the following. Making mistakes is the key to making progress. In contrast to animals, humans can remember their previous thinking, and reflect on their previous thinking and learn. For writing, blurt it out, then you have something to work with. You need to make mistakes to find the right questions. Philosophy is what you do to figure out the right questions.

To give examples of thinking tools, Dennett talks about reductio ad absurdum ("argument to absurdity"). He talks about Occam's razor: "Don't add unnecessary parameters to overfit the data". He also talks about the dual of Occam's razor: Occam's broom with which one whisks away data/facts that doesn't fit the theory. This commits the omission fallacy, and is an example of a Boom crutch. He also mentions other Boom crutches, "rathering" and "surely" which are used for dictating a false dichotomy. Finally, Dennett talks about "Jumping out of the system", a.k.a. Jootsing (he has a short funny name for everything). He says that for there to be creativity, there needs to be rules to rebel to.

I stopped reading at Chapter 3: Thinking tools for meaning or context. This is a big book at 460 pages. Although it is readable, I wasn't motivated enough as the thinking tools mentioned became more specialized for philosophy. The book talks mostly about philosophers' toolkit, and these tools are not as useful for nonphilosophers.

After reading the book, I got a pretty good idea about how the philosophers work. A friend once mentioned me a quote "Philosophy is a race to see who can think the slowest". It is said in jest of course, and probably a better way of putting "slowest" is "deepest/most exhaustively".

For us the nonphilosophers, the practical minded, the question should be: what are the thinking tools that can make our lives better?

I think some thinking tools carry across domains, but most are not. That is why we specialize in domains, and learn thinking tools that apply for those domains. And often our minds are shaped for good or bad by these tools.  My mind is shaped by computational thinking, a psychologists mind is shaped by behavioral thinking, etc. I was surprised the first time I saw this in action: For the same problem of making a toy car follow a circular trace, my Electronics Engineer friend had devised a differential formula and control theory solution, whereas I had an algorithmic/programmatic solution. I guess now the trending paradigm is to devise a machine learning solution.  (Sapir-Whorf hypothesis anyone?)

The thinking tools may not necessarily be internal to our brains, they could be prosthetics. A simple but powerful prosthetics is writing. A good example is checklists, as Atul Gawande pointed out. As a more complicated prosthetics, Steve Jobs once called the Mac as a bicycle for the mind. A thought amplifier.
"I think one of the things that really separates us from the high primates is that we’re tool builders. I read a study that measured the efficiency of locomotion for various species on the planet. The condor used the least energy to move a kilometer. And, humans came in with a rather unimpressive showing, about a third of the way down the list. It was not too proud a showing for the crown of creation. So, that didn’t look so good. But, then somebody at Scientific American had the insight to test the efficiency of locomotion for a man on a bicycle. And, a man on a bicycle, a human on a bicycle, blew the condor away, completely off the top of the charts.
And that’s what a computer is to me. What a computer is to me is it’s the most remarkable tool that we’ve ever come up with, and it’s the equivalent of a bicycle for our minds.” -- Steve Jobs

Thursday, December 22, 2016

TLA+ project2 solution (2016)

In a previous post, I had given the description of project2. Here is the TLA+ project2 solution made available on github. Below I will briefly explain the solution.

If you are not familiar with the chain-replication protocol, read this brief summary. 

The setup and variables

The first part is about the setup. We declare the process IDs for storage nodes (denoted as Nodes), the clients, and the configurator process.

The nodes maintain "db" each, where the item is updated and queried. For simplicity, our distributed key-value store maintains only a single item, and initialized to ver=-1, val=-1, cli=-1.

The nodes also have an auxiliary variable "up", which is used for modeling the status of the node. A node can go "down" at any time provided that less than FAILNUM nodes are currently down.

Each node and client has a message box, initially all message boxes are empty.

Finally, there is a global variable chain. Initially it is an empty sequence. As the configurator populates the chain, the chain will list in order the storage nodes: head, interim nodes, and the tail storage node. The configurator maintains the chain by removing the IDs of crashed nodes from the chain, and adding healthy nodes to the tail of the chain.

(In practice this can be implemented in several ways. chain could be the API of the configurator process, and can be obtained by an RPC call to the configurator. Or chain could be cached at the client-side library, and if/when the configurator changes the chain, it can update the chain variable at the library.)

The define block

Here comes the macros. We write this to simplify our modeling. As I wrote earlier, once we decide on the macros, we are halfway done with our modeling. "I find that once you get the "define" block right, the rest of the PlusCal algorithm practically writes itself. But you may need to make a couple unsuccessful takes before you figure out the most suitable /define/ block that leads to a succinct and elegant model."

IsUp is a function that takes an ID of a storage node, and returns whether that storage is up or not. Its implementation is very simple. Since we model a node being up or not using the up variable for that node, the macro just returns the up variable for the corresponding node. Why did we have a macro for this? We will use this boolean function as a filter for SelectSeq operator when we discuss the configurator.

UpNodes return a set that consists of the storage nodes that are up. We could have also written it as {n \in Nodes: IsUp(n)}. InChain(s) predicate returns whether node ID "s" is included in the chain or not. I implemented this by using the Seq operator. If chain is one of the subsequences of Nodes \ {s}, then s is not in the chain. Otherwise s is in the chain.

ChainNodes returns the set of all nodes s where InChain(s) is satisfied. FreeUpNode returns an "up" node that is not in the chain. GetIndex(s) returns the index of the node s in the chain.GetNext(s) returns the successor of node s in the chain.

The client

The client sends write requests to the storage nodes. The client gets to its work only after a chain is formed by the configurator. To keep our model checking short/finite, we make the client to send STOP number of requests.

Before sending an update request, the client first reads the current version of the item from the key value store. This is done in label CLR inside a while loop. The read request is sent to the node at the tail of the chain, which is given by chain[Len(chain)]. The client sends the request by writing to the messagebox of the tail node. Why is this inside a while loop? Because we leave the fault-tolerance to the client. The client's read request may be dropped. This can happen either via a message loss, or when the tail node crushes after receiving the read request. If the client does not get a response back to its read request, it re-sends the read request to the current tail of the chain, until it receives a message back to its message box. Then, the client sets the hver to be the latest version of the item+1, and removes the message from its message box, by resetting the message box to empty.

In the CLW tag, the client writes the update to the item to the head of the chain, i.e., chain[1], as per the chain replication protocol. This is also done in a while loop, because an update can be lost (if the head or interim node in chain crashes), and it is the client's responsibility to retry the update until it receives back an acknowledgment for that particular update.

After the write is acknowledged, the client removes the acknowledgment message from its messagebox, and increments the local variable counter, which maintains the number of successful updates the client made.

The storage nodes

The storage node actions are in two parts. The second part (starting with label NDF) is the modeling of a crash or recovery of a node. If FAILNUM number of crashes are not reached, any node can crash by setting its up variable False, and any crashed node can recover back by setting its up variable to True.

The first part actions show how a node reacts to a request in its message box. The preconditions to serving a request are that the node has a request in its message box, the node is up and part of the chain. Then, we check the type of the message: val=-1 means this is a read message, otherwise it is an update message and the db is updated with the message. The message is then propagated to the messagebox of the next node in the chain. (This is done regardless of whether it is a read message or update message). If this is the tail, then the message is written to the messagebox of the client indicated on the message (as that client has sent this request to the system). Finally the node sets its messagebox to empty when it finishes processing this message.

The configurator

The configurator process is infallible. (In practice it would be implemented as a Paxos box.) Its job is to monitor and configure the chain. (If you are not familiar with the chain-replication protocol, read this brief summary.)

It first checks if the chain is of length=3 and if there is an available up node to append to chain it will add it to chain to get the chain of length=3. When a new node is appended as the tail, its db is also initialized by copying the db of the current tail of the chain.

After repopulating the chain, the configurator checks the health of the nodes in the chain, and removes the crashed nodes in the chain. For this I used the SelectSeq operation which filters the chain based on the output of IsUp predicate on each member of the chain.

Model checking

The system is to satisfy single-copy consistency. Single-copy consistency means the storage nodes appear to outside as if it is a single virtual infallible node, even though upto FAILNUM of physical storage nodes can fail. More specifically, the highest version number result returned by a read from the tail should match the item stored by the most recently acknowledged write operation on the system.

If there is only one client of the system (you can configure this by setting C=1 in model checking parameters), then there is a shortcut to checking single-copy consistency. Since the single client is using cntr to keep track of the writes completed and updates the item val with cntr+1, the version number, hver, and val of the item should always match. The Consistent invariant checks that this is true at all storage nodes. "Consistent" can be violated if the client reads an old ver, and cntr is incremented as usual, so that ver becomes lower than cntr = val. However, if we implemented our chained-storage to implement single-copy consistency correctly, Consistent should hold as invariant in the presence of a single Client.

Of course that is a very restrictive invariant (assumes a single client, and assumes val=cntr). Let's look for more general invariants. CCon denotes that an earlier node in the chain will have more recent information "hver" than a later node in the chain. CPro is an end-to-end invariant at the client. It says that for any client, the hver maintained at the client is nondecreasing. That is, a client never reads an older/smaller hver from the system after reading a recent/larger hver. CPro is actually written as a temporal formula, so when model-checking CPro should be added to the temporal formulas section in the model. It has a peculiar form: hver' means the next state value of hver. And the formula reads as: It is always the case that for any client, the next state value of hver is greater than or equal to hver value at the current state.

When I model check with C=1 single client I see that Consistent, CCon, and CPro holds for this model. (Of course it took me many iterations to get there. There were subtle bugs that led to referencing nonexistent nodes in chain, etc.). When I check with C=2, the model checker complains that Consistent is violated. Of course this is expected, so I uncheck Consistent from the invariants, so that I can check that CCon and CPro holds.

Surprise! CCon is violated with two clients. (The same violation also applies for CPro of course.) The model checker spits out a 35 step counter example. Here is the gist of the problem.

Initially both client1 and client2 reads version as -1, and they both start an update with version 0. Client1 writes with version 0, client2's write is dropped in the messagebox at the head as Client1 overwrites that. Client1 then reads version as 0, and starts another write version=1, and succeeds. Client2 hasn't acted yet. Finally, Client2 retries its write with version 0 and manages to write with version 0 to the chain. And the chain went from version 1 down to version 0. CCon is violated. (CPro would also be violated, because Client1 would read a version=0 after it has read version=1.)

What went wrong? Modeling the messagebox with length=1 played a role in the counterexample. Only one message can be in the message box at a time, and if another message arrives before this message is read/consumed by the node, that first message gets lost. If the implementation uses TCP, this overwriting problem would be taken care of. However, it is still possible for Client2's write to get lost and CCon and CPro be violated: Client2 might write to the head which subsequently fails, then Client1 writes to the new head and succeeds with version=0 and version=1, then Client2 retries and overwrites the chain db with version=0.

As one way to fix the problem, we can consider changing the blind rewriting at label CLW. We may consider adding the client another action (which could be added to the process with an either clause) to check if the write succeeds and if not to retry the write but starting from CLR so that the client reads the new version (if any). But this is still prone to a race condition, as we make assumptions about relative speeds of Client1 and Client2. After Client2 reads a version, before it starts the CLW portion of the write, Client1 may have completed its write incrementing the version and causing the same counterexample. It seems like the proper way to handle this problem is at the storage node level. The storage node should reject an update that downgrades the version number of the item, and the client should get a rejection for its update request. I haven't put that update in my model to show the counterexample with two client case.

Some other subtle points in the modeling

I chose to model crash detection of nodes using an auxiliary variable "up", and I made it a global variable so that the configurator can readily access this information. I didn't put emphasis on a realistic failure detection part, and instead focused on the correctness modeling of the chain-based key-value store. A more realistic, closer to implementation way of doing this would be to not to use an auxiliary variable at all. For example,  the configurator could rely on periodic heartbeat messages from the nodes to implement failure detection of the nodes.

I also model "db" as a global variable. I tried to avoid this at first, but then I needed a mechanism for copying the db of the tail, when I append a new node to be the new tail of the chain. This could have been done by message passing and inventing a special request message, but that would make the model longer. I went for a shorter model and exposed db of the storage node. The only place I use this is for copying db when appending a new tail, and this would be something to refine further as we move towards implementing our protocol.

One way to refine this could be as follows. The new tail can send a "read" message to the current tail, as if it is a client. This would propagate the db of the current tail to the candidate tail, and that is copied at the candidate tail. But then we need a signaling mechanism between the candidate tail and the configurator so that the candidate tail can be declared as the new tail. But this task  is race condition prone, so this should be modeled in PlusCal and checked before it makes its way to the implementation. For this proposed refinement, the race condition can occur as follows: the current tail acknowledges a write to the client, and then the new tail is added which did not get this update, the client may query the new tail for a read and get an old version of the item. Single copy serializability is violated.

So we unearthed a concurrency race condition for 2 clients, and another race for adding a candidate tail. What other types of concurrency problems could be there with a sloppy protocol? Many different types. It is fun to go through the modeling process with TLA+ as you can observe how things may go wrong in ways you haven't anticipated. The TaxDC paper gives a good study of concurrency bugs in production use, well-debugged open source systems. Here are the common misconceptions about distributed systems that leads to these bugs:
+ One hop is faster than two hops.
+ No hop is faster than one hop.
+ Interactions between multiple protocols seem to be safe.
+ What you thought as an atomic block of execution wasn't, because another process was scheduled to execute something concurrently and that other process changed the system state in a way you didn't anticipate.

Sunday, December 18, 2016

Feedback from my distributed systems class

I am done with my distributed systems class this semester. At the end of the semester, I polled my students for feedback about the class. I was especially interested in hearing about their feedback on using TLA+ in the course and course projects, and using research paper reviewing for homework assignments. Of course, I didn't mention these when I solicited their feedback. I asked them to write a brief review for what they have learned in my distributed systems class this semester.

I received overwhelmingly positive feedback. Below I share some of the comments about TLA+ and the paper review assignments. (I removed the flattering feedback about my teaching, while I enjoyed reading them.  I put a lot of effort in my teaching and I am happy when the students recognize how much I care. That keeps me motivated.)

Feedback about using TLA+

I assigned two TLA+ projects. The first project was modeling Voldemort key-value store with client-side routing, and the second project was about extending the first with chain-replication and server-side routing. I had given the students TLA+ demonstrations in a couple classes, and shared with them PlusCal modeling of several of the distributed algorithms I covered in the class.

(I had wrote about how and why I decided to integrate TLA+ to my distributed systems class in a couple earlier posts including: Using TLA+ for teaching distributed systems, My experience with using TLA+ in distributed systems class. I also posted several TLA+/PlusCal modeling of distributed algorithms in my blog. You can reach all of my posts about TLA+/PlusCal here.)

Most of the students mentioned that they appreciated the TLA+ projects.  The students said TLA+ showed them how tricky distributed algorithms can get, and without TLA+ they wouldn't be able to figure out the intricacies of those algorithms.

TLA+ is a great tool to learn to use if you're planning on pursuing a career in computer science. Its raw ability to detect flaws in algorithms is unparalleled. I am very pleased with having been introduced to it through this class. Of all my takeaways from the class through the semester the opportunity to become familiar with this is the one I believe will help me the most in my future.
Working on the TLA+ projects taught me the most. Sitting in the classroom and going through the whole algorithm gave me an insight of the algorithm but I realized the depth of the algorithm, the problems related to it and the capability of the algorithm only when I worked on the projects. 
TLA seems to be a fantastic way to reason about systems and I really enjoyed using it for the class. I aspire to use TLA+/formal verification all along my career when I get an opportunity. 
I liked the invariant based way of reasoning about the correctness of a program. I liked the expressiveness of PlusCal to model complex distributed algorithms with very short lines of code.

At the same time, many students complained about the learning curve of TLA+, and that they had to spent a lot of time figuring the ropes. They said that it would help if I had arranged for more TLA+ demonstrations, and  if I had assigned them some smaller projects to get them warmed up with TLA+.
I found the element of the class I really had to put a lot of time into was reading the PlusCal manual and the Dynamic help in the eclipse toolbox. There was a great deal of functionality in this programming environment I didn't know how to properly use until the end of the semester. I think the class can greatly benefit from more in depth classes or recitations on TLA+ from the syntax of PlusCal to the options and tools in the model checker. 
Regarding the projects, I really enjoyed learning TLA and PlusCal and I  really liked building Voldemort and Voldchain. The caveat consists in that sometimes it was difficult and frustrating, but I learned a new way of thinking programmatically and mathematically, and it was very satisfying.
Initially, I needed a lot of time to get acquainted to the PlusCal syntax. But once I got used to it, things stared flowing smoothly. I realized that TLA+ does force the designers to clearly describe and understand the design before they actually write the executable code for it. Since TLA+ has numerous benefits and advantages I hope it gets widely adopted in the industry. 
The code translate check cycle was sometimes very long. To be fair this was usually a problem with the code itself or using too much state to design the model. 
Some errors were very difficult to comprehend and debug in TLC model checker. Some error messages were cryptic.

I like this suggestion from one of the students. It is an ambitious idea for an introductory distributed systems course with limited time and much ground to cover, but it would be a great improvement if we can pull that off.
It would be great if the project could be a semester long one, where in the first half we design a distributed algorithm and verify its consistency properties using TLA+ and in the second half we do an implementation of the system in a high-level programming language (like Java or C++).

Reviewing papers

I assigned 6 homework this semester, 4 of which involved reviewing research papers about the topics we covered. Many students commented that they didn't like reading/reviewing papers at first, but then started to find the reviewing assignments useful and enjoyable.
I just needed something to force me to do it [review papers] a few times before I could get the hang of it. Considering how much the industry changes and the absolute need to continue learning from others in the filed I think this will help me later in my career.
Because of this course I have started reading research papers on a regular basis, just because I like reading them now. 
I really enjoyed the paper reviews we did. After reading the papers and writing the reviews, I actually felt like I had a thorough understanding for the content of the papers, which was a very cool feeling. Before this class, I have never read an academic paper and felt like I actually understood it. 
Looking back, I am so glad that I had the opportunity to review the papers. At first I was intimidated, and didn't want to do it, but I feel much more confident and familiar with reading and evaluating research papers as a result. Part of the reason is that you posted information on how to go about this process, and it really helped to have a good method for performing this function. It also helped to know that I'm not the only one who doesn't understand the paper on the first or second pass. That you for your honesty.

I think this last student is referring to my "How I read a research paper" post.

Other comments

Favorite distributed algorithms for my students have been Paxos, hygienic dining philosophers, and Dijkstra's stabilizing token ring algorithms.

Almost all of the students mentioned that they liked the reenacting of the algorithms on the board, and those made the algorithms memorable and more easily understandable. The students all seemed to benefit from the YouTube videos I show at the beginning of lectures.
I loved the fact that the Professor showed YouTube videos in every lecture. Although at times it took me a few minutes to understand how the videos connected to the topic at hand (especially the one with Mr. Miyagi and the karate kid).

This comment was particularly interesting.
I appreciated that the professor would like to share his thinking processes with us. That was really helpful for me. As for a beginner stepping in this area, I lack the proper mind in considering the problems in distributed systems. However, professor was always offering his logic to us. That was what I could take it into my life with the knowledge I learned from the course.

I am intrigued that the student was able to catch this. Once I had wrote a post about why we should be talking more about our thought processes. 

Teaching is hard work, it has its ups and downs, but looking back on the semester, I am very proud to have raised another batch of distributed systems geeks.

Thursday, December 15, 2016

Modeling a Replicated Storage System in TLA+, Project 2

Two weeks ago, I had described phase 1 of the TLA+ project I assigned in my distributed systems class. I promised to give you the solution soon. I put the solution for phase1 on github.

In this post, I will describe the phase 2 of the project. In phase 2 of the project, we use Chain Replication to ensure persistence and consistency of the objects stored. Read my chain replication summary to refresh on the algorithm.

Before performing the write, the client reads from the tail of chain to learn the highest versioned write completed. The client then increments the version number, and performs the write to the head of chain with this version number. Since a node failure may lead to a request being lost (be it a read request, or update request), the client needs to repeat the request until a response is returned from the tail. The client cannot start a new request before it receives a reply to its earlier request.

A configurator process (an infallible process which would in practice be implemented as a Paxos cluster) will be used for configuring the chain. One responsibility of the configurator is to remove a down node from the chain, and the other is to add a new node to the tail of the chain if the length of the chain is less than 3. The chain is initially empty, and the configurator populates the chain using the "add a new node to the tail" action.

A storage node can crush (provided that FAILNUM is not exceeded), and recover at any time. For simplicity we assume the client writes to only one item, so we omit the key part of the key-value pair item, and model the database db at each node to consist of one item. The newer version of the item will writeover the old version.

In this phase of the project, the storage system performs server-side routing, and modeling of the system is done using message passing instead of shared memory. Once a storage node receives a new request in its message-box msg, it will consult the configurator to figure out its successor in the chain and propagate the request to its successor. If the storage node is the tail it will send a reply to the client. An update request modifies the db with the newer version of the item. A read request does not change the db.

The students are asked to write a PlusCal program to model this algorithm. I provide them the below template as a starting point, and they fill in the redacted parts, use the toolkit to translate their code to TLA+, write invariant properties and model-check for correctness. I also ask them to list their observations about phase2 "Voldchain" versus phase1 "Voldemort quorums" version of the protocol. How do they compare? Is Voldchain capable of tolerating more failures with less number of nodes? How do you explain the difference? What is the analog of write quorum, and read quorum in Voldchain?

I will wait a week or so before I share my solution for the Project2.

Saturday, December 3, 2016

Emerging trends in big data software

Mike Franklin, a famous expert on data science, had visited our department at University at Buffalo in May to talk about emerging trends in big data software. I had taken some notes during his talk, and decided to summarize and share them here.

Mike recently joined University of Chicago as the chair of Computer Science Department, and before that he was the head of UC Berkeley's AMPLab (Algorithms, Machines and People Laboratory), where he was involved with the Spark and Mesos projects which had wide academic and industrial impact. Naturally, the talk included a lot of discussion about AMPLab projects, and in particular Spark.

Mike described AMPLab as a federation of faculty that collaborate around an interesting emerging trend. AMPLab has in total 90 faculty and students. Half of the funding comes from government, and the other half from industrial sponsors. The industrial sponsors also provide constant feedback about what the lab works on and how it matters for them. As an AMPLab student graduates, a system he/she worked on also graduates from the lab. Mike credits this model with wide academic and industrial impact.

While I don't have Mike's slides from his talk at Buffalo, I found his slides for a keynote he delivered in March on the same topic. Below I provide very brief highlights from Mike's talk. See his slides for more information.

Motivation for big data

Big data is defined as datasets typically consisting of billions of trillions of records. Mike argues that big data is a big resource. For example, we knew that Tamoxifen is 80% effective for breast cancer, but thanks to big data, now we know that it is 100% at 70-80% of people and ineffective in the rest. Even 1% effective drugs could save lives; with enough of the right data we can determine precisely who the treatment will work for.

Big data spurted a lot of new software framework development. Data processing technology has fundamentally changed to become massively scalable, start using flexible schema, and provide easier integration of search query and analysis with a variety of languages. All these changes drive further innovation in big data area.

The talk continues on summarizing some important trends in big data software.

Trend1: Integrated stacks vs silos

Stonebraker famously said "one size doesn't fit all in DBMS development any more". But, Mike says that in their experience, they found that it is possible to build a single system that solves a lot problems. Of course Mike is talking about their platform Berkeley Data Analytics Stack (BDAS). Mike cites their Spark user survey to support his claim that one size fits many: Among 1400 respondents, 88% use at least 2 components, 60% at least 3, 27% at least 4.

Mike explains AMPLab's unification strategy as generalizing the MapReduce model. This leads to
1. richer programming model (fewer systems to master)
2. better data sharing (less data movement)

Here Mike talked about RDDs (Resilient Distributed Datasets) for improving over the inefficiency of MapReduce redundantly loading and writing data at each iteration. An RDD is a read-only partitioned collection of records distributed across a set of machines. Spark allows users to cache frequently used RDDs in-memory to avoid the overhead of writing intermediate data to disk and achieving up to 10-100x faster performance than MapReduce.

Spark dataflow API provides coarse grained transformations on RDDs such as map groupby, join, sort, filter, sample. RDDs are able to get good fault-tolerance without using the disk, by logging the transformations used to build an RDD and reapplying transformations from earlier RDDs to reconstruct that RDD in case it got lost/damaged.

Trend2: "Real-time" redux

One approach for handling real-time is the lambda architecture, which proposes using real-time speed layer to accompany and complement the traditional batch processing+serving layer.

Mike's complaint about this architecture is that it leads to duplication of work: you need to write processing both for the batch layer and the real-time speed layer, and when you need to modify something you need to do it again both for the batch layer and the real-time speed layer. Instead Mike mentions the kappa architecture based on Spark (first advocated by Jay Kreps) which gets rid off a separate batching layer and uses Spark streaming as both the batching and the real-time speed layer. Spark streaming uses microbatch approach to provide low latency. It introduces additional "windowed" operations. Mike says that Spark streaming doesn't provide everything a fullblown streaming system does, but it does provide most of it most of the time.

Trend 3: Machine learning pipelines

For this part Mike briefly talked about KeystoneML framework which enables the developers to specify machine learning pipelines (using domain specific and general purpose logical operators) on Spark.

Wednesday, November 30, 2016

How does one get motivated for teaching?

A friend recently complained that, even after years in academia, he never got quite adjusted to teaching. He said he doesn't see much incentive in teaching and found it boring, and asked about how one can get motivated for teaching.

This is actually a good question and it is important to get in to the habit of questioning and checking oneself. Here is how I answer this question.

  1. I don't need to get motivated for teaching. Teaching is a responsibility bestowed upon me as an academician. So I look at this from a responsibility perspective. I also have responsibilities as a parent, as a husband, or as a driver (if I am the designated driver), and for matters of responsibility, I don't need motivation. I know I should rise to the occasion.
  2. Teaching is my service to the students, who really need it. Today students have many alternatives. They can watch a lecture on YouTube, and can find a variety of study material on the Internet. But it is only I that stand in front of them in person and I get to serve as an example and inspiration to them. I take this role seriously and try to show the students how a curious researcher approaches problems and concepts. I try to show them, through diffusion, the love and fun of figuring out something new. At each class, I make sure that I take enough time to motivate my students about a question because this motivation determines the amount of attention my students will devote to learning that subject. (Yes there will inevitably be some students that look uninterested and doze off. With experience, I learned not to get demoralized by a couple such students, because that would be a disservice to the rest of the students who needs to see me motivated and engaging. I remind myself that I don't know the situation of the disengaged students---they may have other problems--, and I hope to be able to engage them at another time.)
  3. I regard teaching as a challenge and try to improve myself. I figured out early on that teaching is a good practice for giving better conference talks. When I was a fresh faculty, I felt very anxious and nervous before conference talks and before classes. But with practice gained via teaching this went away gradually, and I started to really enjoy teaching and giving conference talks. Nowadays I started to see teaching as a performance art. (Sure enough, I wasn't the first to think this, and there exists a book with this title.) When teaching, I look at the students' eyes and I try to connect. I try to see that they understand.  I try to focus on the message that I want to communicate, and detach my ego out of the way. I had written on this at a previous post titled "How to present your work". Ironically if you put your effort in impressing the audience you will fail at it, but if you focus on teaching to your audience, you will impress them.
  4. I see teaching as an opportunity to simplify the material. When teaching I try to keep the message simple. I try to figure out the gist of the material, and challenge myself to communicate that to the students. This reductionist approach should be familiar. This is also how we do research. We try to reduce complex material by throwing away the accidental complexity and focus on the intrinsic complexity. We then attack intrinsic complexity to reduce it to simpler principles. Approached this way, teaching also serves a good practice for research. This type of practice helps for research in general and sometimes leads to a new paper in particular.

Some disclaimers are in order to put my advice in context. My teaching load as a professor in a research university is pretty light. I teach one class a semester. So, maybe some of what I say may look impractical for more heavy teaching duties. I teach mostly graduate courses, so advice number 4 about teaching helping research may be less applicable when teaching undergraduate courses.

Finally, here are some practical tricks I picked up about teaching:

  • Prepare well. If I go underprepared to a class, I give a bland lecture, and I feel bad about it all day. Instead of wasting time feeling bad about bad teaching, I shift that time to preparing better for the class, so I enjoy teaching, and enjoy the rest of my day.
  • Concentrate on a simpler message, and communicate that really well in several ways to leave no doubt of transmission. When I have a bad class, it is often because I tried to cram too many things and didn't distill my message to its essence. So I learned to leave out the unessential, and communicate the essentials as clearly and rigorously as possible. As you may have noticed, Powerpoint makes it easy to cram many things to lecture notes, so you have to actively resist the temptation.
  • Slow down, pace yourself better. I learned to empathize with the students who are seeing the material for the first time, and need some time and immersion to wrap their brains around it.
  • Ask questions to the class frequently. To get the students engaged, I direct questions to the class frequently, and wait patiently until I start hearing some answers or guesses. I comment on these replies, suggest alternatives, and ask more specific questions to direct my students to think harder to provide better answers. In order to learn the subject matter, the students should be forced to do some thinking.
  • Use reenactments and use interactive material. To engage the students, I invite the students to the front of the class to reenact some algorithms by acting as the processes involved in the algorithm. I also don't shy away from showing YouTube videos that help communicate a point.

Sunday, November 27, 2016

My Distributed Systems Seminar's reading list for Spring 2017

Below is the first draft list of papers I plan to discuss in my distributed systems seminar in the Spring semester. If you have some suggestions on some good/recent papers to cover, please let me know in the comments.

Datacenter Operating System

Firmament: Fast, Centralized Cluster Scheduling at Scale (OSDI 16)
Large-scale cluster management at Google with Borg (Eurosys 15)
Apache Hadoop YARN: yet another resource negotiator (SOCC 13)
Slicer: Auto-Sharding for Datacenter Applications (OSDI 16)


Pivot Tracing: Dynamic Causal Monitoring for Distributed Systems (SOSP 15)
Shasta: Interactive Reporting At Scale (SIGMOD 16)
Adaptive Logging: Optimizing Logging and Recovery Costs in Distributed In-memory Databases (SIGMOD 16)


The many faces of consistency (2016)
The SNOW Theorem and Latency-Optimal Read-Only Transactions (OSDI 16)
Incremental Consistency Guarantees for Replicated Objects  (OSDI 16)
Just Say NO to Paxos Overhead: Replacing Consensus with Network Ordering  (OSDI 16)
FaSST: Fast, Scalable and Simple Distributed Transactions with Two-Sided (RDMA) Datagram RPCs  (OSDI 16)


The Honey Badger of BFT Protocols (2016)
The Bitcoin Backbone Protocol: Analysis and Applications (2015)
XFT: Practical Fault Tolerance beyond Crashes (OSDI 16)


2016 Seminar reading list
2015 Seminar reading list

Thursday, November 24, 2016

Modeling a Replicated Storage System in TLA+, Project 1

Why a TLA+ project?

The first project assignment in my distributed systems class this semester was modeling a replicated storage system in TLA+. Assigning a TLA+ project makes me a rarity among distributed systems professors. A common project would be a MapReduce programming assignment or a project to implement a simple distributed service (such as a key-value store) in Java.

I think that a MapReduce deployment project does not teach much about distributed systems, because MapReduce is a very crude abstraction and hides all things distributed under the hood. Using MapReduce for the distributed systems class project would be like handing people a mechanics certification upon successful completion of a driving test.

Implementing a simple distributed service, on the other hand, would teach students that indeed programming and debugging distributed systems is very hard. However, I would suspect that much of the hardship in that project would be due to accidental complexities of the implementation language rather than the intrinsic complexity of reasoning about distributed systems in the presence of concurrent execution and failures. In a distributed systems deployment, it would be hard for the students to test exhaustively for race conditions, concurrency bugs, and exercise many possible combinations of node failures and message losses. That is notoriously hard for even the professionals, as shown by this survey of 104 distributed concurrency bugs from Cassandra, HBase, Hadoop MapReduce, and ZooKeeper.

I am of course not against assigning an implementation project in a distributed systems class. I see the utility and necessity in giving students hands-on programming experience on distributed systems. I think an implementation project would be suitable for an advanced distributed systems class. The class I am teaching is the first distributed systems class for the students, so  an implementation project would be unnecessarily complicated and burdensome for these students, who are also taking 3 other classes.

I teach the distributed systems class with emphasis on reasoning about the correctness of distributed algorithms, so a TLA+ is a good fit and complement for my course. Integrating TLA+ to the class gave students a way to get a hands-on experience in algorithms design and dealing with the intrinsic complexities of distributed systems: concurrent execution, asymmetry of information, concurrency bugs, and a series of untimely failures.

TLA+ has a lot to offer for practice and implementation of distributed systems. At Amazon, the engineers used TLA+ for modeling S3, DynamoDB, and some other production systems that see a lot of updates and new features. TLA+ helped the engineers find several critical bugs introduced by updates/features in the design stage, which if not found would have resulted in large amount of engineering effort later on. These are detailed in a couple of articles from Amazon engineers. I have been hearing other TLA+ adoption cases in the industry, and hope to see increasingly more adoption in the coming years.

Modeling Voldemort replicated storage system with client-side routing in TLA+

I wanted to assign the TLA+ modeling project on a practical useful application. So I chose modeling of a replicated storage system. I assigned the students to model Voldemort with client-side routing as their first project.

Here is the protocol. The client reads the highest version number "hver" from the read quorum (ReadQ). The client then writes to the write quorum nodes (WriteQ) the store the updated record with "hver+1" version number. The storage nodes can crash or recover, provided that no more than FAILNUM number of nodes are crashed at any moment. Our WriteQ and ReadQ selection will consist of the lowest id storage nodes that are up (currently not failed).

I asked the students to model check with different combinations of ReadQ, WriteQ, and FAILNUM, and figure out the relation that needs to be satisfied among these configuration parameters in order to ensure that the protocol satisfies the single-copy consistency property. I wanted my students to see how consistency can be violated as a result of a series of unfortunate events (such as untimely node death and recoveries). The model checker is very good for producing counterexamples where consistency is violated.

Simplifying assumptions and template to get the students started

I tried to keep the first project easy. We simplified things by modeling the storing (and updating) of just a single data item, so we didn't have to model the hashing part. We also used shared memory. The client directly writes (say via an RPC) to the db of the storage nodes. (It is possible to model a message-box at each process, and I assigned that for the second project.)

I also gave the students the template for the model. This helps my TAs a lot in grading. Without any template the students can go in wildly different directions with their modeling. Below is the template. In case you want to give this a try and see how you do with it, I will wait a week or so before I share my solution for the project1.

Extending to the second project

I have assigned the second project again on the modeling of a replicated storage system. But this time instead of a quorum, "chain replication" is used for ensuring persistence and consistency of storage. In the second project, the replication is done by server-side routing, and the modeling includes message passing.


Using TLA+ for teaching distributed systems (Aug 2014)
My experience with using TLA+ in distributed systems class (Jan 2015)
Modeling the hygienic dining philosophers algorithm in TLA+

There is a vibrant Google Groups forum for TLA+ :!forum/tlaplus
Clicking on label "tla" at the end of the post you can reach all my posts about TLA+

Monday, November 21, 2016

Book Review: Einstein (by Walter Isaacson)

I read this book recently and liked it a lot. It was a surprisingly engaging read. I thought I knew a lot about Einstein, and the book would be redundant and bland. But the book proved me wrong. I learned a lot of new things about Einstein, especially about his personality and his research perspective and style. The book also did a great job on explaining Einstein's theories, scientific achievements in an accessible and interesting manner.

Einstein's personality

Let's get this out of the way. Einstein didn't fail math. He was always a very smart and hardworking student. In primary school, he was at the top of his class and "far above the school requirements" in math. And before he was 15 he had mastered differential and integral calculus.

What may have helped propagate the myth was Einstein's unruly and defiant personality. Einstein would do great at the things he likes, and not good at the things he doesn't like. He had excelled in his university classes that he liked, and did average in the ones he disliked (such as physics experiments classes). Einstein was certainly a black sheep and kept clashing with authority. He was an outspoken peace activist and social democrat. He was brave, and not afraid to speak his mind even when he was getting dead threats and the wars were brewing in Europe. This defiant/black-sheep personality also helped him formulate novel theories that revolutionized the prevalent physics views/understanding at the time.

True to the stereotype, Einstein was goofy and forgot stuff all the time. He had spent a lot of time in this brain, thinking about physics all the time. "I can already imagine the fun we will have," he wrote to Maric about a prospective vacation. "And then we'll start in on Helmholtz's electromagnetic theory of light." He was immersed with physics/ideas all the time, and his daily life almost came as interludes to his physics thinking. This has also been a prevalent pattern in  Richard Feynman's life.

Contrary to the nerd professor stereotype, Einstein was also very outdoorsy, sailed, hiked, and liked to live a good life. He was a very good violin player and enjoyed giving concerts to friends. Teaching was initially hard and awkward for Einstein, but he gradually got accustomed to it, and found his own unique style and voice for teaching.

Einstein enjoyed company and attention of his friends. He founded the Olympia Academy with group of friends in Bern, Switzerland, who met in his apartment in order to discuss philosophy and physics. Einstein used this group to bounce off his ideas, and study and learn new things very much  like Benjamin Franklin's Junto group. (I also recommend the Benjamin Franklin's biography written by Walter Isaacson.)

Einstein's research

Einstein was a very intuitive and creative person. He famously said "Imagination is more important than knowledge." Einstein was a big ideas person. He was a deep thinker, but not a brute force technical/math/hardwork physicist type. His papers are not tour-de-force papers, rather they were simple, and more like position papers. He was a theorist. He read a lot of papers, including many experimental papers, and thought a lot and suggested elegant theories to explain the phenomena reported in those papers. The way he developed those theories is mainly by running thought experiments (gedankenexperiment).

The chapter about Einstein's miracle year (1905) is a very good read. This was where Einstein caught his first breakthrough. Einstein, then 26, published 4 groundbreaking papers (short position papers), on the photoelectric effect, Brownian motion, special relativity, and the equivalence of mass and energy. Einstein kept this momentum and kept publishing interesting papers at a good rate, including his new theory of general relativity in 1911.

What factors led to his breakthrough miracle year? Einstein had been churning some ideas and gestating theories starting from his college years. He had also been bouncing ideas off his friends in Olympia Academy. And finally, he was employed at the patent office which stimulated Einstein's thinking. Working at the patent office was like listening to high-tech startup pitches everyday, except the topics were mostly on electric/magnetic waves and clock synchronization, which instigated his thinking on photoelectric effect and relativity.

Einstein's relentless focus

It was very interesting to read about the promising young Einstein trying to get a job, but failing to get even a small teaching job. His attempts at PhD dissertations were also turned down. Unappreciated genius indeed. In addition to those difficulties, Einstein also lived through very turbulent times in Europe. His marriage and personal life also went through very tough periods. Throughout all this, he was always focused on his research and found solace in his studies. He was a man with a mission. He was afraid he would die before he could build his relativity theory, and was OK with dying afterwards.

On page 162, the book talks about how deep Einstein could concentrate. When his students dropped by his apartment, they found that Einstein was finishing work on a complicated problem, while he had his small son in his lap crying. Not exactly an ideal work environment, but Einstein was so focused, he wasn't bothered by the son crying, and solved the problem. A couple page later, there was a heartwarming description of Einstein and Lorentz doing research together. Two passionate physics geeks discussing ideas and bouncing questions back and forth.

Thursday, November 17, 2016

How Complex Systems Fail

This is a 4 page report about the nature of failures in complex systems. It is a gloomy report. It says that complex systems are always ridden with faults, and will fail when some of these faults conspire and cluster. In other words, complex systems constantly dwell on the verge of failures/outages/accidents.

The writing of the report is peculiar. It is written as a list of 18 items (ooh, everyone loves lists). But the items are not independent. For example, it is hard to understand items 1 and 2, until you read item 3. Items 1 and 2 are in fact laying the foundations for item 3.

The report is written by an MD, and is primarily focused on healthcare related complex systems, but I think almost all of the points also apply for other complex systems, and in particular cloud computing systems. In two recent posts (Post1, Post2), I had covered papers that investigate failures in cloud computing systems, so I thought this report would be a nice complement to them.

1) Complex systems are intrinsically hazardous systems.
I think the right wording here should be "high-stakes" rather than "hazardous". For example, cloud computing is not "hazardous" but it is definitely "high-stakes".

2) Complex systems are heavily and successfully defended against failure.
Is there an undertone here which implies these defense mechanisms contribute to make these high-stakes systems even more complex?

3) Catastrophe requires multiple failures – single point failures are not enough.
This is because the anticipated failure modes are already well guarded.

4) Complex systems contain changing mixtures of failures latent within them.
"Eradication of all latent failures is limited primarily by economic cost but also because it is difficult before the fact to see how such failures might contribute to an accident. The failures change constantly because of changing technology, work organization, and efforts to eradicate failures." This is pretty much the lesson from the cloud outages study. Old services fail as much as new services, because the playground keeps changing.

5) Complex systems run in degraded mode.
"A corollary to the preceding point is that complex systems run as broken systems."

6) Catastrophe is always just around the corner.

7) Post-accident attribution accident to a ‘root cause’ is fundamentally wrong.
"Because overt failure requires multiple faults, there is no isolated ‘cause’ of an accident. There are multiple contributors to accidents. Each of these is necessary insufficient in itself to create an accident. Only jointly are these causes sufficient to create an accident."

8) Hindsight biases post-accident assessments of human performance.
The "everything is obvious in hindsight" fallacy was covered well in this book.

9) Human operators have dual roles: as producers & as defenders against failure.
10) All practitioner actions are gambles.
11) Actions at the sharp end resolve all ambiguity.

12) Human practitioners are the adaptable element of complex systems.
What about software agents? They can also react adaptively to the developing situations.  And today with machine learning and deep learning, especially so.

13) Human expertise in complex systems is constantly changing.
14) Change introduces new forms of failure.
The cloud outages survey has showed that updates and configuration changes and human factors account for more than 1/3rd of outages.

15) Views of ‘cause’ limit the effectiveness of defenses against future events.
Case-by-case addition of fault-tolerance is not very effective. "Instead of increasing safety, post-accident remedies usually increase the coupling and complexity of the system."

16) Safety is a characteristic of systems and not of their components
Safety is a system-level property, unit testing of components is not enough.

17) People continuously create safety.
18) Failure free operations require experience with failure.
What doesn't kill you makes you stronger. In order to grow, you need to push the limits, and stress the system. Nassim Taleb's book about antifragility makes similar points.
And this short video on resilience is simply excellent.

Thursday, November 10, 2016

Modeling Paxos and Flexible Paxos in Pluscal and TLA+

The first part of this post describes modeling Paxos in Pluscal. The second part shows how to modify that model to achieve a flexible quorum Paxos. The idea for flexible quorums was introduced in a paper in 2016 summer. This simple and surprising result says "majority agreement is not required in Paxos and the sets of acceptors required to participate in agreement (known as quorums) do not even need to intersect with each other".

Modeling Paxos in Pluscal

While there are many examples of Paxos modeling in TLA+ available, I haven't found any Pluscal modeling of Paxos, except this one from Lamport, which helped me come up with my Pluscal model below. The problem with TLA+ is that it is too low-level (i.e., too declarative and math-like) for writing--and reading-- distributed algorithms. The PlusCal language provides a higher-level pseudocode, which is easier to follow.

However, as you go through my Pluscal model below, you will find that it doesn't follow your expectations of an implementation in your favorite imperative language. This is OK, Pluscal is meant to just model the algorithm so that we can model check for correctness against concurrency bugs. I had written more about modeling at a higher abstraction level earlier in this post and this post.

Leader denotes the range used for the ids of the leader processes, and Acceptor denotes the range used for the ids of the acceptor processes. Slot is the range of slots, and Ballots is the range of ballots.

Acceptors are simple, they just react to leaders' Phase1, Phase2, Phase3 messages sent with various ballot numbers. To this end, the acceptor body calls macros, which are inlined while the Pluscal code is being translated to TLA+ for model checking.

An acceptor keeps a variable for remembering the maximum ballot number maxBal it promised. It also remembers all values it accepted at Phase2a using hVal, a set of <slot, ballot, value> tuples. Finally an acceptor stores the decided proposals at each slot as a set. Of course if the agreement property of Paxos holds, the decided set for a slot has cardinality <=1.

The leader loops through the 3 phases of a round for each slot. It tries to dominate in Phase1, so it can go to Phase2. After the leader is elected, Phase1 can be skipped in subsequent slots, if the leader is not preempted by another leader. An elected leader can get preempted any time and CollectP2 can fail, so the leader checks this before it can decide a value at SendP3.

The ballot number of the leader b is incremented modulo M (the number of leaders) so it remains unique across leaders. The variable pVal is a set to store the values accepted in earlier slots, so a suitable value can be re-proposed in Phase2a. (See CollectP1 and SendP2 macros.)

AccMsg denotes set of acceptor messages sent and LMsg denotes set of leader messages sent. Instead of actually sending messages in channels and to each acceptor, sending message is modeled as adding a message in a messageboard, where other processes can nondestructively read the message. (This idea resembles the Linda tuplespaces idea.)

The macros SentXX returns a set of messages in the messageboard that match a specific filter. SuitVal is a macro for identifying the proposal with the highest ballot id accepted for a given slot.

SendP1(b) lets a leader put a Phase1a message with ballot number b to the AccMsg messageboard. ReplyP1(b) lets acceptors react to a Phase1a message with ballot number b by writing a reply back to LMsg messageboard. CollectP1(b) lets a leader to proceed as elected from Phase1 if a majority of acceptors said OK. Or the leader may learn that it has been preempted and retries Phase1 with a higher ballot number. The await statement serves as a guard: if the await predicate is not satisfied, the rest of the macro is not executed.

SendP2(b,s) lets a leader put a Phase2a message with ballot number b, slot number s to the AccMsg messageboard. The message proposes self as value, or SuitVal as value if applicable. ReplyP2(b) lets acceptors react to a Phase2a message with ballot number b by writing a reply back to LMsg messageboard.

Using CollectP2(b,s) a leader can learn that its proposal was accepted by majority of acceptors, or else it can learn that it has been preempted by a higher ballot number. SendP3 and ReceiveP3 macros implement Phase3 of Paxos.

The model checking took 7 minutes with the parameters I mentioned in the comments. Not bad.

Flexible Paxos

The flexible quorums idea was introduced in a paper in 2016 summer.  It says that we can weaken the Paxos requirement that "all quorums intersect" to require that "only quorums from different phases intersect". That is, majority quorums are not necessary, provided that Phase1 quorums intersect with Phase2 quorums.

For example  in a system of 10 acceptors, we can safely allow any set of only 3 acceptors to participate in Phase2, provided that we require 8 acceptors to participate for Phase1. This decreasing of Phase2 quorums at the cost of increasing Phase1 quorums is called as simple quorums.

Or alternatively, we can use grid quorums, where every column forms a Phase1 quorum, and every row a Phase2 quorum. In grid quorums, the quorums within either phase do not intersect with each other.

The flexible Paxos paper gave a TLA+ model, but didn't give a Pluscal model. But as I show next, we can implement the flexible Paxos quorums with a simple modification of our Paxos Pluscal model.

Quorum1 denotes Phase1 quorums, and Quorum2 denotes Phase2 quorums. Quorum1 quorums must intersect with Quorum2 quorums, but neither needs to be a majority quorum. Also Quorum1 quorums need not intersect among each other, and Quorum2 quorums need not intersect among each other. In the comments, I provide example parameters for model-checking the flexible Paxos extension.

Before the leader can get elected in Phase1, it checks to see if acceptors from a Quorum1 quorum said OK. SatQ1(b) and SatQ2(b) are macros for checking whether acceptors from a Quorum1 and Quorum2 quorum responded back for a given ballot number b.

Related links

Paxos and FPaxos Pluscal programs available on Github.

Using TLA+ for teaching distributed systems

My experience with using TLA+ in distributed systems class

There is a vibrant Google Groups forum for TLA+:!forum/tlaplus

By clicking on label "tla" at the end of the post you can reach all my posts about TLA+

Two-phase commit and beyond

In this post, we model and explore the two-phase commit protocol using TLA+. The two-phase commit protocol is practical and is used in man...