Tuesday, December 25, 2018

Two-phase commit and beyond

In this post, we model and explore the two-phase commit protocol using TLA+.

The two-phase commit protocol is practical and is used in many distributed systems today. Yet it is still brief enough that we can model it quickly, and learn a lot from modeling it. In fact, we see that through this example we get to illustrate a fundamental impossibility result in distributed systems directly.

The two-phase commit problem

A transaction is performed over resource managers (RMs). All RMs must agree on whether the transaction is committed or aborted.

The transaction manager (TM) finalizes the transaction with a commit or abort decision. For the transaction to be committed, each participating RM must be prepared to commit it. Otherwise, the transaction must be aborted.

Some notes about modeling

We perform this modeling in the shared memory model, rather than in message passing, to keep things simple. This also ensures that the model checking is fast. But we add nonatomicity to the "read from neighbor and update your state" actions in order to capture the interesting behaviors that would happen in a message passing implementation.

A RM can only read the TM's state and read/update its own state. It cannot read other RM's state.  A TM can read all RM nodes' state and read/update its own state.


Lines 9-10 set the initial rmState for each RM to "working" and that of the TM to "init".

The canCommit predicate is defined to be true iff all RMs are "prepared" (they are ready for a commit decision). If there exists an RM with "abort" state, canAbort becomes truthified.

TM model

The TM modeling is straightforward. TM checks if it can commit or can abort, and updates tmState accordingly.

TM can also fail making tmState "unavailable", but this gets exercised only if the constant TMMAYFAIL is set to true before the model checking starts. In TLA+, the labels determine the granularity of atomicity. By putting the labels F1 and F2, we denote that the corresponding statements are executed nonatomically (after some indeterminate time passes) with respect to the previous statements.

RM model

The RM model is also simple. Since "working" and "prepared" states are nonterminal states, the RM nondeterministically chooses among the enabled actions until a terminal state is reached. A "working" RM may transition to an "aborted" or "prepared" state. A "prepared" RM waits to hear the commit and abort decision from the TM and acts accordingly. The figure below shows the state transitions possible for one RM. But note that we have multiple RMs, each of which is going through their state transitions at their pace without the knowledge of the state of another RM.

Model checking the two-phase commit

We are interested in checking that our two-phase commit is consistent: there are no two RMs such that one says "committed" and another "aborted".

The predicate Completed checks that the protocol does not hang on forever: eventually each RM reaches to a terminal "committed" or "aborted" state.

With that, we are ready to model check this protocol. Initially we set TMMAYFAIL=FALSE, RM=1..3 to run the protocol with three RMs and one TM that is reliable. The model checker takes 15 seconds and tells us that there are no errors. Both Consistency and Completed are satisfied by any possible execution of the protocol with different interleaving of enabled RM actions and TM actions.

Now, we set  TMMAYFAIL=TRUE and restart the model checker. The model checker is quick to give us a counterexample trace where the RMs in this execution is stuck waiting to hear back from the TM which has become unavailable.

We see that on State=4 the RM2 transitions to aborted, on State=7 the RM3 transitions to aborted, on State=8 the TM transitions to "abort", and crashes on State=9. On State=10, the system is stuck because the RM1 is in the prepared state waiting forever to learn a decision from the TM which has crashed.

BTM modeling

In order to avoid any downtime for many transactions the TM may be commandeering, we add a Backup TM  (BTM) to quickly take over when the TM becomes unavailable. The BTM uses the same logic as the TM to make decisions. And for simplicity we assume the BTM does not fail.

When we model check with the BTM process added, we get a new error message.

BTM cannot rule for a commit, because our original, canCommit condition asserted that all RMstates should be "prepared" and didn't account for the condition when some RMs already learned the "commit" decision from the original TM before the BTM takes over. So we revise our canCommit condition to account for this.

Success! When we check the model, we find that both Consistency and Completed are satisfied, as the BTM can take over and finalizes the transaction when TM fails. Here is the 2PCwithBTM model in TLA+ (initially the BTM and the second line of canCommit commented out). Here is the pdf corresponding to the pdf.

What if RMs could also fail?

We assumed the RMs are reliable. Now we relax that to see how the protocol behaves in the presence of RM failure. We add an "unavailable" state to model for a crash. In order to explore more behavior and model intermittent loss of availability, we allow a crashed RM to recover back and continue the protocol by reading its state from its log. Here is the RM state transition diagram again with the newly added "unavailable" state and transitions marked with red. And below that is the revised model for the RMs.

We also need to strengthen canAbort to take into account the unavailable state. The TM can rule for "abort" if any of the RMs is in "aborted" or "unavailable" state. If we omit this condition, an RM that has crashed (and never recovers) may make the transaction lose progress. Of course precaution should be taken to ensure that BTM cannot rule for an abort if there exists a RM that has learned of a "committed" decision from the original TM.

Model checking

When we check this model, we discover an inconsistency problem! How did this happen? Let's follow the execution trace for the counterexample.

On State=6, all the RMs are in prepared state, the TM had made a commit decision, and RM1 has seen that decision and moved to label RC, meaning that it is ready to change its state to "committed". (Keep RM1 in mind, this gun will fire at the last act.) Unfortunately, the TM crashes in State=7, and RM2 becomes =unavailable= in State=8. In State 9, the BTM takes over reads the RMstates as <prepared, unavailable, prepared> and rules for an abort in State=10. Remember RM1, it finally acts on the commit decision it saw from the original TM, and transitions to committed in State=11. In State=13, RM3 heeds the abort decision from the BTM and transitions to aborted, and we have the Consistency violated with RM1 deciding for committed and RM3 for aborted.

In this case, the BTM made a decision that violated consistency. On the other hand, if we had made BTM to wait until there are no "unavailable" RMs, it could be waiting forever on a crashed node, and this would then violate the progress condition.

The updated TLA+ model file is available here, and the corresponding pdf here.

FLP impossibility

So, what happened?  We hit the Fischer, Lynch, Paterson (FLP) impossibility result: In an asynchronous system, it is impossible to solve consensus (satisfy both safety and progress conditions) in the presence of crash faults.

In our example, the BTM cannot correctly decide whether RM2 is crashed or not, and rules incorrectly for an abort. If there was only one decision maker, just the TM, the inaccurate failure detector would not be an issue. The RMs would follow whatever the TM decides and both Consistency and Progress would be satisfied.

What surfaces and illustrates the problem is that we have two decision makers TM and BTM looking at the state of the RMs at different times, and making different decisions. This asymmetry of information is the root of all evil in distributed systems.

This problem doesn't go away even with the three-phase commit extension. Here is the three-phase commit extension modeled in TLA+ (here is the pdf version), and below is the error trace that shows this time Progress is violated. (The wikipedia page on three-phase commit says in a timeout after receiving the precommit order the RMs, i.e., RM2 and RM3, should go ahead and commit, but that would instead cause a consistency problem in this case.)

Paxos, making the world a better place

Not all hope is lost. We have Paxos. Paxos treads carefully within bounds of the FLP result. The innovation in Paxos is that it is always safe (even in presence of inaccurate failure detectors, asynchronous execution, faults), and it eventually makes progress when consensus gets in the realm of possibility.

You can emulate the TM with a Paxos cluster of 3 nodes and that will solve the inconsistent TM/BTM problem. Or as Gray and Lamport show in the transaction commit paper, if the RMs use the Paxos box to store their decisions concurrently with replying to the TM, this shaves off the extra one message delay from the obvious protocol.

Monday, December 17, 2018

Master your questioning skills

My new years resolution for 2018 was to "ask more/better/crazier questions."

To realize this resolution, I knew I needed to implement it as a system. So I committed to "ask at least a couple of MAD questions in each blog post." After close to 12 months and 70 posts, these are what I learned from this exercise.

1. Questioning is very useful

Adding the MAD questions section significantly extended/complemented my posts, and with little effort. Just by adding this section, I went for another 15-20 minutes, forcing myself out of my default mode. Sometimes I grudgingly did it, just because I had committed to it, but almost always I was surprised by how useful it was.

Many times the MAD questions section got longer than the main post body. And on average it was about 25% of the main body. Questioning also improved the main content, and I had incorporated the findings/leads from some of the MAD questions in the main body of the post. I keep many half-baked posts in my blog.org file, and had it not been for the questioning that extended the content, I might not have published many of those posts.

2. Questioning is a frame of mind

I found that questioning delivered those good results with relatively little effort.

I think the reason for this is because our default mode is to not question things. We have been implicitly taught this at our family, school, and work. Asking too many questions, especially hard ones, is frowned upon in polite company. Moreover, the human brain is programmed to save energy and go easy on itself, so it prefers shallow tasks and tries to avoid intense thinking sessions required for many creative tasks. By asking questions it is possible to prime the pump and keep the brain more engaged.

Coming up with the questions is not difficult, once you get out of your default mode and give yourself the license to indulge in the questioning frame of mind.

3. To ask better questions, make it a MAD question

By calling these questions MAD questions, I gave myself the license/protection to go wild, uninhibited by traditional assumptions/constraints. This gave rise to reducing the bar, and approaching the topic with a beginner mind, as an outsider. This made it easier to ask more original questions.

By detaching and discarding the baggage of tradition, you can start working from the first principles. You should remember that questioning is the beginning, not the end. There is a separation of concerns: you first focus on what needs to be done, and not how it needs to be done. So, initially, you don't need to be afraid of the difficulty of the answers as it is not your concern for now.

So, don't feel intimidated to question the basic assumptions that people in your field take as granted. They may as well be due for replacement anyway.

4. To get to the good questions, get over with the bad questions

Bad questions lead to good questions. So roll with them and exhaust them to get to the good questions. Don't quit prematurely, as good things will happen if you go a bit deeper. And do this in a stress-free manner being assured that something will eventually come if you persist a bit more.

You should approach this like brainstorming and freewriting.

5. A good question often involves a perspective change 

A good question is like a good joke. It has an unexpected element and provides a perspective change.  After I get through some questions and I eventually arrive to a good question, I have a visceral reaction similar to that of hearing a good joke.

A good question gives you a perspective change that is productive. It opens a new area to explore. It creates new relations/connections.

After following the above suggestions for asking questions, the why and what if questions can help you get to the perspective-changing questions.

A change in perspective is worth 80 IQ points. --Alan Kay

MAD questions

1. Is questioning how we think? If so, can we think better by questioning better?
I feel like my thinking is done mostly via asking questions. But of course I also need to be thinking to ask those questions in the first place, don't I? So I don't know who is the real hero here. (This reminds me of the joke: "My belt holds my pants up, but the belt loops hold my belt up. I don't really know what's happening down there. Who is the real hero?" ---Mitch Hedberg)

To be fair, let's call questioning as a catalyst for thinking. It seems like they bootstrap each other.

Questions also serve as an interrupt for sanity check, self-inspection, and going meta on the task.

Finally, questions also play an important role in the way we master/internalize things. I relate very close to Barry Diller's approach to learning by deconstructing and understanding from the fundamental elements.
DILLER:​ ​By purpose or by temperament, I'm only interested in those things where I haven’t figured it out, and I really do think that however it happened, that when I was presented endlessly with things I didn’t understand, the only thing I could do—because my brain is slower, and therefore is more literal—and therefore my process is, I have to get it down to its tiniest particle, or else... I can’t come in and understand an equation, if you can put it in equation terms, unless I de-equation it—I can’t pick it up. So I’m forced – by a lack of brain matter, I am forced to – no I’m not saying it – it’s true! To break it down as hardly low as I can get it, and only then—and that’s learning. That’s real – that is joyous work to me, is getting through those layers, down to something, and then once I’m down there, once I’m actually at the very, very base of it, I can actually start to do something good.

2. Can we treat questioning as a tool?
I think questioning can be viewed as a tool, and we will increasingly start to treat it as a valuable tool in the age of big data and machine learning information technology markets. The information is already out there, we just need to ask the right questions to the search engines, expert systems, and machine learning based digital assistants. The easy/usual questions will already be asked by the machine learning systems. To differentiate we should learn how to ask the good questions. In Synthetic Serendipity, Vernor Vinge short sci-fi story, the highschool had a course called "Search and Analysis" that aims to teach students to ask the right questions to the search engines and expert systems.

Even when the information is not there, if you ask the right question, find the right perspective to approach to it, the solution bit is not that hard. Questioning acts as a tool for framing/reframing. This is what the master of gedankenexperiment has to say on questioning.
If I had an hour to solve a problem and my life depended on the solution, I would spend the first 55 minutes determining the proper question to ask... for once I know the proper question, I could solve the problem in less than five minutes. --Albert Einstein

So, going forward, shouldn't we have more tools/support for questioning as a tool? What are some tools (mindmaps, etc.) you know that are designed to help support questioning/framing?

3. Is it possible to teach how to ask great questions?
Answers are teachable, so we teach answers. But we don't teach how to question. In kindergarten and primary school, the kids are already pretty good at questioning, but come middle school most kids stop asking questions. I don't know maybe students that ask questions come across hard to manage and contrarian, and get discouraged by teachers and adults.

We (academicians) try to teach questioning during PhD by way of examples/apprenticeship. I am not aware of any organized/formal way that asking better questions is being taught. There is no course or good book on it, as far as I can see.

However, I still believe questioning is a skill and can be teachable. What if we could create a course on this? It will likely be a hands-on learn-by-case-studies class, but at least we can provide more support and material than saying "watch me ask questions and start imitating".

Friday, December 14, 2018

Wednesday, December 12, 2018

How to fix scholarly peer review

All academics invariably suffer from an occasional bad and unfair review. We pour our sweat (and sometimes tears) over a paper for many months to watch it summarily and unfairly shot-down by a reviewer. Opening a decision email to read such a review feels so sickening that it hurts in the stomach. Even after 20 years, I am unable to desensitize myself to the pain of being shot-down by an unfair review. I suspect quite many people quit academia being frustrated over bad reviews.

Peer review is a complex and sensitive topic, and this post will inevitably fall short of capturing some important aspects of it. But I believe this topic is so important that it deserves more attention and conversation. Here I first write a bit about how to handle bad reviews. Then I outline some problems with our current peer review process and suggest some fixes to start a conversation on this.

The first rule of review club

The first rule of review club is to always give the benefit of doubt to the reviewers. We are all partial about our work: if we didn't think highly of it, we wouldn't have worked on it so hard, and submitted it to a  good venue. I suggest that you avoid responding to the reviews the first day. Chances are that you will be too busy processing your emotions to have bandwidth to process the reviews calmly. So sleep on it, read the reviews again the next day. Give it your best shot to see what gets criticized. Try to see what got misunderstood. If some reviewers had this problem, several of your readers will have similar problems as well. Can you improve your presentation to avoid that problem? How can you address/disarm the criticisms from the reviewers? What is the opportunity to improve here? The idea is to learn whatever you can learn from the reviews, and become so good they can't ignore you.

The second rule of review club

The second rule of the review club is that occasionally you get a review so malicious, so bad that you are stumped. This is inevitable. Welcome to the club!

What do you do now?

Again exercise caution. Give it a day, meet with your coauthors. Try to give the benefit of the doubt to the reviewers. If you still cannot justify in any way that the reviewer is not malicious and unfair, you should contact the PC chairs or Editor to complain about the behavior. This often gets ignored, so don't expect much.

Basically you swallow this abuse of power and move on. You try not to get too upset by many months wasted waiting for this rejection, and try to plan ahead for another publication venue.  Maybe you make a mental note to avoid this conference and journal. But isn't that penalizing yourself, and not the guilty party?

If you like to fight the unfairness, you don't have much leverage. You can make the reviews public to expose the unfairness to community. But that is about it, and this probably won't make much difference either. Yes, very frustrating. Very unfair. I know.

The current peer review process

This is how the current scholarly peer review works. You submit your paper, and it gets reviewed by (often) 3 anonymous reviewers. After 3 months (or 6 months for the journals), you get the reviews and the decision in the email. The rating is generally average of the reviewers, and a bad review damages your chances especially severely for conferences.

The blind reviewing works well when the reviewers are conscientious, and the PC-chairs/editors take meticulous care in their job to oversee the reviewers. But note that there is an asymmetric power relation here. The problem is that anonymous reviewers don't have much skin in the game, and the process is open to abuse.

Good intentions don't work, good systems work. And there is obviously something broken with our reviewing system today. The quality of reviews within the same conference or journal is uneven and varies wildly. Reviewers, who are overworked academics, are overwhelmed with lots of papers to review. Competent reviewers are especially overwhelmed. As an associate editor at IEEE TCC, when I try to find reviewers for papers, my usual success rate is 3 over 8-10. I need to contact up to 10 reviewers to find 3 that accepts to review the paper.

The basic recipe 

Don't tear down a fence before you understand what it does. As tempting it is, we can't rush to do away with blind review. Disclosing the reviewer names may also cause problems. The authors are partial about their work, and some may unfairly retaliate to the reviewer. Secondly, without the anonymity cloak, the reviewers may not be critical enough of some submissions from certain high-profile faculty/schools.

So with due caution and awareness of the complexity of peer review, I provide some suggestions to improve the system in good faith. This proposal has holes and difficulties in implementation, but I hope it serves to start the conversation.

While I am not confident in the specifics of the implementation, I think the high level solution is clear. We need to incentivize the reviewers and then demand accountability from them. So here are my suggestions to this end.

1. Compensate the reviewers

Reviewing is a thankless job. Conference reviewers get some recognition, as their name appears in the program committee page of the conference (but this gets old quickly). Journal reviewers don't even get that.

The reviewers do these reviews pro bono and are overworked. You may not realize it from outside but academics work really hard between teaching, research, departmental administration duties, and other professional duties.

Did I mention that reviewers and Editors don't get paid a dime? We are talking about domain experts here, who could easily make \$200 an hour consulting. I think we owe it to the reviewers to compensate them for their contribution in the publishing process.

Can the conferences and professional organizations afford to pay reviewers? I don't see why not. I have not been involved in any conference organization, so I may be wrong here, but please humor me.  A conference with 1000 attendees with \$1000 per registration fee (not uncommon) makes a total of  \$1 million. Where does this money go? Definitely not to the dubious quality conference swag. Hotels may be expensive, but not that much. If hotels are taking most of that money, we need tougher negotiators on the conference organization committees. The hotels already get a good deal with attendees staying there for the duration of the conference.

Remember Chesterton's fence. We should of course consider what type of side-effects compensating the reviewers may have. Could this lead to a clique of friends who recruit each other for reviewing gigs? I don't know.  If the compensation is not high, and if we keep reviewers accountable with respect to the quality of the reviews, this may not be a problem. If money is too messy, provide free conference registration to the reviewers.

Even if we don't compensate the reviewers, at least we need to set up a system to prevent freeloading. If you don't help with reviewing, you don't get to have your papers reviewed. If you are a conscientious reviewer, maybe you get to have a fourth reviewer?

2. Provide an author-feedback phase

Conferences should provide a rebuttal phase to give the authors a chance to respond to the reviewers' criticisms.  Although limited in their effectiveness/consequence, this response phase still gives a voice to the authors. As an extra point, I really liked what SIGMOD did with their author feedback; they explicitly asked the authors to report on any bad/offending reviews.

In journals, there is no rush. So even for rejected papers, the journal may provide a response opportunity, and the authors get to present their responses to the reviewers.

3. Form a grievance committee

To give more voice/faculty to the authors, a grievance committee can be formed to inspect the reviews suspected of foul play. The committee shall inspect the situation, consult with the other reviewers on the paper, and write a report on the decision.

Maybe this is on the crazy side, but here it is: It may even be possible to publicize the name of a malicious or severely negligent reviewer. (There may be a decentralized cryptographic signing solution under which two reviewers may make the name of the third reviewer visible if they agree on neglect/abuse by the third reviewer. Crypto geeks, is it possible to implement a multisig solution for this on hotcrp soon?)

4. Take ownership/responsibility for the reviews 

As PC chairs or journal editors, you should take responsibility of the quality of the reviews provided to the authors. You should not blindly sign on the reviews, as at the end of the day the quality of the reviews provided to the authors is your responsibility.

In addition, in-person PC meetings (when feasible) is good for enforcing accountability for the reviewers. Again the travel for PC members should be paid for by the conference registration fees, if an in-person PC meeting is established.

Reviews can be rated by other reviewers and feedback in the form of blind review can be provided to the reviewers. These feedback can help train reviewers to avoid common mistakes: being hypercritical on a minor point, failing to see the overall value provided, being prejudiced towards certain methods/approaches, making unsupported claims, etc. We may even consider pair-reviewing to complement peer-reviewing.

Finally, as a reviewer, you should consider voluntarily signing your names on your review. The idea here is to keep yourself accountable by voluntarily giving up your anonymity. The signing of the name decision should be made before the review assignments and one should not be allowed to sign reviews only for acceptances.

I have seen some people do this. And I will give this a try myself. In my experience, a reject decision doesn't hurt if the reviewer supports her position well and put in the work to understand and fairly evaluate the work with respect to the cohort of papers submitted. So I am OK signing my name on a reject decision.

Two-phase commit and beyond

In this post, we model and explore the two-phase commit protocol using TLA+. The two-phase commit protocol is practical and is used in man...