Tuesday, December 25, 2018

Two-phase commit and beyond

In this post, we model and explore the two-phase commit protocol using TLA+.

The two-phase commit protocol is practical and is used in many distributed systems today. Yet it is still brief enough that we can model it quickly, and learn a lot from modeling it. In fact, we see that through this example we get to illustrate a fundamental impossibility result in distributed systems directly.

The two-phase commit problem

A transaction is performed over resource managers (RMs). All RMs must agree on whether the transaction is committed or aborted.

The transaction manager (TM) finalizes the transaction with a commit or abort decision. For the transaction to be committed, each participating RM must be prepared to commit it. Otherwise, the transaction must be aborted.

Some notes about modeling

We perform this modeling in the shared memory model, rather than in message passing, to keep things simple. This also ensures that the model checking is fast. But we add nonatomicity to the "read from neighbor and update your state" actions in order to capture the interesting behaviors that would happen in a message passing implementation.

A RM can only read the TM's state and read/update its own state. It cannot read other RM's state.  A TM can read all RM nodes' state and read/update its own state.


Lines 9-10 set the initial rmState for each RM to "working" and that of the TM to "init".

The canCommit predicate is defined to be true iff all RMs are "prepared" (they are ready for a commit decision). If there exists an RM with "abort" state, canAbort becomes truthified.

TM model

The TM modeling is straightforward. TM checks if it can commit or can abort, and updates tmState accordingly.

TM can also fail making tmState "unavailable", but this gets exercised only if the constant TMMAYFAIL is set to true before the model checking starts. In TLA+, the labels determine the granularity of atomicity. By putting the labels F1 and F2, we denote that the corresponding statements are executed nonatomically (after some indeterminate time passes) with respect to the previous statements.

RM model

The RM model is also simple. Since "working" and "prepared" states are nonterminal states, the RM nondeterministically chooses among the enabled actions until a terminal state is reached. A "working" RM may transition to an "aborted" or "prepared" state. A "prepared" RM waits to hear the commit and abort decision from the TM and acts accordingly. The figure below shows the state transitions possible for one RM. But note that we have multiple RMs, each of which is going through their state transitions at their pace without the knowledge of the state of another RM.

Model checking the two-phase commit

We are interested in checking that our two-phase commit is consistent: there are no two RMs such that one says "committed" and another "aborted".

The predicate Completed checks that the protocol does not hang on forever: eventually each RM reaches to a terminal "committed" or "aborted" state.

With that, we are ready to model check this protocol. Initially we set TMMAYFAIL=FALSE, RM=1..3 to run the protocol with three RMs and one TM that is reliable. The model checker takes 15 seconds and tells us that there are no errors. Both Consistency and Completed are satisfied by any possible execution of the protocol with different interleaving of enabled RM actions and TM actions.

Now, we set  TMMAYFAIL=TRUE and restart the model checker. The model checker is quick to give us a counterexample trace where the RMs in this execution is stuck waiting to hear back from the TM which has become unavailable.

We see that on State=4 the RM2 transitions to aborted, on State=7 the RM3 transitions to aborted, on State=8 the TM transitions to "abort", and crashes on State=9. On State=10, the system is stuck because the RM1 is in the prepared state waiting forever to learn a decision from the TM which has crashed.

BTM modeling

In order to avoid any downtime for many transactions the TM may be commandeering, we add a Backup TM  (BTM) to quickly take over when the TM becomes unavailable. The BTM uses the same logic as the TM to make decisions. And for simplicity we assume the BTM does not fail.

When we model check with the BTM process added, we get a new error message.

BTM cannot rule for a commit, because our original, canCommit condition asserted that all RMstates should be "prepared" and didn't account for the condition when some RMs already learned the "commit" decision from the original TM before the BTM takes over. So we revise our canCommit condition to account for this.

Success! When we check the model, we find that both Consistency and Completed are satisfied, as the BTM can take over and finalizes the transaction when TM fails. Here is the 2PCwithBTM model in TLA+ (initially the BTM and the second line of canCommit commented out). Here is the pdf corresponding to the pdf.

What if RMs could also fail?

We assumed the RMs are reliable. Now we relax that to see how the protocol behaves in the presence of RM failure. We add an "unavailable" state to model for a crash. In order to explore more behavior and model intermittent loss of availability, we allow a crashed RM to recover back and continue the protocol by reading its state from its log. Here is the RM state transition diagram again with the newly added "unavailable" state and transitions marked with red. And below that is the revised model for the RMs.

We also need to strengthen canAbort to take into account the unavailable state. The TM can rule for "abort" if any of the RMs is in "aborted" or "unavailable" state. If we omit this condition, an RM that has crashed (and never recovers) may make the transaction lose progress. Of course precaution should be taken to ensure that BTM cannot rule for an abort if there exists a RM that has learned of a "committed" decision from the original TM.

Model checking

When we check this model, we discover an inconsistency problem! How did this happen? Let's follow the execution trace for the counterexample.

On State=6, all the RMs are in prepared state, the TM had made a commit decision, and RM1 has seen that decision and moved to label RC, meaning that it is ready to change its state to "committed". (Keep RM1 in mind, this gun will fire at the last act.) Unfortunately, the TM crashes in State=7, and RM2 becomes =unavailable= in State=8. In State 9, the BTM takes over reads the RMstates as <prepared, unavailable, prepared> and rules for an abort in State=10. Remember RM1, it finally acts on the commit decision it saw from the original TM, and transitions to committed in State=11. In State=13, RM3 heeds the abort decision from the BTM and transitions to aborted, and we have the Consistency violated with RM1 deciding for committed and RM3 for aborted.

In this case, the BTM made a decision that violated consistency. On the other hand, if we had made BTM to wait until there are no "unavailable" RMs, it could be waiting forever on a crashed node, and this would then violate the progress condition.

The updated TLA+ model file is available here, and the corresponding pdf here.

FLP impossibility

So, what happened?  We hit the Fischer, Lynch, Paterson (FLP) impossibility result: In an asynchronous system, it is impossible to solve consensus (satisfy both safety and progress conditions) in the presence of crash faults.

In our example, the BTM cannot correctly decide whether RM2 is crashed or not, and rules incorrectly for an abort. If there was only one decision maker, just the TM, the inaccurate failure detector would not be an issue. The RMs would follow whatever the TM decides and both Consistency and Progress would be satisfied.

What surfaces and illustrates the problem is that we have two decision makers TM and BTM looking at the state of the RMs at different times, and making different decisions. This asymmetry of information is the root of all evil in distributed systems.

This problem doesn't go away even with the three-phase commit extension. Here is the three-phase commit extension modeled in TLA+ (here is the pdf version), and below is the error trace that shows this time Progress is violated. (The wikipedia page on three-phase commit says in a timeout after receiving the precommit order the RMs, i.e., RM2 and RM3, should go ahead and commit, but that would instead cause a consistency problem in this case.)

Paxos, making the world a better place

Not all hope is lost. We have Paxos. Paxos treads carefully within bounds of the FLP result. The innovation in Paxos is that it is always safe (even in presence of inaccurate failure detectors, asynchronous execution, faults), and it eventually makes progress when consensus gets in the realm of possibility.

You can emulate the TM with a Paxos cluster of 3 nodes and that will solve the inconsistent TM/BTM problem. Or as Gray and Lamport show in the transaction commit paper, if the RMs use the Paxos box to store their decisions concurrently with replying to the TM, this shaves off the extra one message delay from the obvious protocol.

Monday, December 17, 2018

Master your questioning skills

My new years resolution for 2018 was to "ask more/better/crazier questions."

To realize this resolution, I knew I needed to implement it as a system. So I committed to "ask at least a couple of MAD questions in each blog post." After close to 12 months and 70 posts, these are what I learned from this exercise.

1. Questioning is very useful

Adding the MAD questions section significantly extended/complemented my posts, and with little effort. Just by adding this section, I went for another 15-20 minutes, forcing myself out of my default mode. Sometimes I grudgingly did it, just because I had committed to it, but almost always I was surprised by how useful it was.

Many times the MAD questions section got longer than the main post body. And on average it was about 25% of the main body. Questioning also improved the main content, and I had incorporated the findings/leads from some of the MAD questions in the main body of the post. I keep many half-baked posts in my blog.org file, and had it not been for the questioning that extended the content, I might not have published many of those posts.

2. Questioning is a frame of mind

I found that questioning delivered those good results with relatively little effort.

I think the reason for this is because our default mode is to not question things. We have been implicitly taught this at our family, school, and work. Asking too many questions, especially hard ones, is frowned upon in polite company. Moreover, the human brain is programmed to save energy and go easy on itself, so it prefers shallow tasks and tries to avoid intense thinking sessions required for many creative tasks. By asking questions it is possible to prime the pump and keep the brain more engaged.

Coming up with the questions is not difficult, once you get out of your default mode and give yourself the license to indulge in the questioning frame of mind.

3. To ask better questions, make it a MAD question

By calling these questions MAD questions, I gave myself the license/protection to go wild, uninhibited by traditional assumptions/constraints. This gave rise to reducing the bar, and approaching the topic with a beginner mind, as an outsider. This made it easier to ask more original questions.

By detaching and discarding the baggage of tradition, you can start working from the first principles. You should remember that questioning is the beginning, not the end. There is a separation of concerns: you first focus on what needs to be done, and not how it needs to be done. So, initially, you don't need to be afraid of the difficulty of the answers as it is not your concern for now.

So, don't feel intimidated to question the basic assumptions that people in your field take as granted. They may as well be due for replacement anyway.

4. To get to the good questions, get over with the bad questions

Bad questions lead to good questions. So roll with them and exhaust them to get to the good questions. Don't quit prematurely, as good things will happen if you go a bit deeper. And do this in a stress-free manner being assured that something will eventually come if you persist a bit more.

You should approach this like brainstorming and freewriting.

5. A good question often involves a perspective change 

A good question is like a good joke. It has an unexpected element and provides a perspective change.  After I get through some questions and I eventually arrive to a good question, I have a visceral reaction similar to that of hearing a good joke.

A good question gives you a perspective change that is productive. It opens a new area to explore. It creates new relations/connections.

After following the above suggestions for asking questions, the why and what if questions can help you get to the perspective-changing questions.

A change in perspective is worth 80 IQ points. --Alan Kay

MAD questions

1. Is questioning how we think? If so, can we think better by questioning better?
I feel like my thinking is done mostly via asking questions. But of course I also need to be thinking to ask those questions in the first place, don't I? So I don't know who is the real hero here. (This reminds me of the joke: "My belt holds my pants up, but the belt loops hold my belt up. I don't really know what's happening down there. Who is the real hero?" ---Mitch Hedberg)

To be fair, let's call questioning as a catalyst for thinking. It seems like they bootstrap each other.

Questions also serve as an interrupt for sanity check, self-inspection, and going meta on the task.

Finally, questions also play an important role in the way we master/internalize things. I relate very close to Barry Diller's approach to learning by deconstructing and understanding from the fundamental elements.
DILLER:​ ​By purpose or by temperament, I'm only interested in those things where I haven’t figured it out, and I really do think that however it happened, that when I was presented endlessly with things I didn’t understand, the only thing I could do—because my brain is slower, and therefore is more literal—and therefore my process is, I have to get it down to its tiniest particle, or else... I can’t come in and understand an equation, if you can put it in equation terms, unless I de-equation it—I can’t pick it up. So I’m forced – by a lack of brain matter, I am forced to – no I’m not saying it – it’s true! To break it down as hardly low as I can get it, and only then—and that’s learning. That’s real – that is joyous work to me, is getting through those layers, down to something, and then once I’m down there, once I’m actually at the very, very base of it, I can actually start to do something good.

2. Can we treat questioning as a tool?
I think questioning can be viewed as a tool, and we will increasingly start to treat it as a valuable tool in the age of big data and machine learning information technology markets. The information is already out there, we just need to ask the right questions to the search engines, expert systems, and machine learning based digital assistants. The easy/usual questions will already be asked by the machine learning systems. To differentiate we should learn how to ask the good questions. In Synthetic Serendipity, Vernor Vinge short sci-fi story, the highschool had a course called "Search and Analysis" that aims to teach students to ask the right questions to the search engines and expert systems.

Even when the information is not there, if you ask the right question, find the right perspective to approach to it, the solution bit is not that hard. Questioning acts as a tool for framing/reframing. This is what the master of gedankenexperiment has to say on questioning.
If I had an hour to solve a problem and my life depended on the solution, I would spend the first 55 minutes determining the proper question to ask... for once I know the proper question, I could solve the problem in less than five minutes. --Albert Einstein

So, going forward, shouldn't we have more tools/support for questioning as a tool? What are some tools (mindmaps, etc.) you know that are designed to help support questioning/framing?

3. Is it possible to teach how to ask great questions?
Answers are teachable, so we teach answers. But we don't teach how to question. In kindergarten and primary school, the kids are already pretty good at questioning, but come middle school most kids stop asking questions. I don't know maybe students that ask questions come across hard to manage and contrarian, and get discouraged by teachers and adults.

We (academicians) try to teach questioning during PhD by way of examples/apprenticeship. I am not aware of any organized/formal way that asking better questions is being taught. There is no course or good book on it, as far as I can see.

However, I still believe questioning is a skill and can be teachable. What if we could create a course on this? It will likely be a hands-on learn-by-case-studies class, but at least we can provide more support and material than saying "watch me ask questions and start imitating".

Friday, December 14, 2018

Wednesday, December 12, 2018

How to fix scholarly peer review

All academics invariably suffer from an occasional bad and unfair review. We pour our sweat (and sometimes tears) over a paper for many months to watch it summarily and unfairly shot-down by a reviewer. Opening a decision email to read such a review feels so sickening that it hurts in the stomach. Even after 20 years, I am unable to desensitize myself to the pain of being shot-down by an unfair review. I suspect quite many people quit academia being frustrated over bad reviews.

Peer review is a complex and sensitive topic, and this post will inevitably fall short of capturing some important aspects of it. But I believe this topic is so important that it deserves more attention and conversation. Here I first write a bit about how to handle bad reviews. Then I outline some problems with our current peer review process and suggest some fixes to start a conversation on this.

The first rule of review club

The first rule of review club is to always give the benefit of doubt to the reviewers. We are all partial about our work: if we didn't think highly of it, we wouldn't have worked on it so hard, and submitted it to a  good venue. I suggest that you avoid responding to the reviews the first day. Chances are that you will be too busy processing your emotions to have bandwidth to process the reviews calmly. So sleep on it, read the reviews again the next day. Give it your best shot to see what gets criticized. Try to see what got misunderstood. If some reviewers had this problem, several of your readers will have similar problems as well. Can you improve your presentation to avoid that problem? How can you address/disarm the criticisms from the reviewers? What is the opportunity to improve here? The idea is to learn whatever you can learn from the reviews, and become so good they can't ignore you.

The second rule of review club

The second rule of the review club is that occasionally you get a review so malicious, so bad that you are stumped. This is inevitable. Welcome to the club!

What do you do now?

Again exercise caution. Give it a day, meet with your coauthors. Try to give the benefit of the doubt to the reviewers. If you still cannot justify in any way that the reviewer is not malicious and unfair, you should contact the PC chairs or Editor to complain about the behavior. This often gets ignored, so don't expect much.

Basically you swallow this abuse of power and move on. You try not to get too upset by many months wasted waiting for this rejection, and try to plan ahead for another publication venue.  Maybe you make a mental note to avoid this conference and journal. But isn't that penalizing yourself, and not the guilty party?

If you like to fight the unfairness, you don't have much leverage. You can make the reviews public to expose the unfairness to community. But that is about it, and this probably won't make much difference either. Yes, very frustrating. Very unfair. I know.

The current peer review process

This is how the current scholarly peer review works. You submit your paper, and it gets reviewed by (often) 3 anonymous reviewers. After 3 months (or 6 months for the journals), you get the reviews and the decision in the email. The rating is generally average of the reviewers, and a bad review damages your chances especially severely for conferences.

The blind reviewing works well when the reviewers are conscientious, and the PC-chairs/editors take meticulous care in their job to oversee the reviewers. But note that there is an asymmetric power relation here. The problem is that anonymous reviewers don't have much skin in the game, and the process is open to abuse.

Good intentions don't work, good systems work. And there is obviously something broken with our reviewing system today. The quality of reviews within the same conference or journal is uneven and varies wildly. Reviewers, who are overworked academics, are overwhelmed with lots of papers to review. Competent reviewers are especially overwhelmed. As an associate editor at IEEE TCC, when I try to find reviewers for papers, my usual success rate is 3 over 8-10. I need to contact up to 10 reviewers to find 3 that accepts to review the paper.

The basic recipe 

Don't tear down a fence before you understand what it does. As tempting it is, we can't rush to do away with blind review. Disclosing the reviewer names may also cause problems. The authors are partial about their work, and some may unfairly retaliate to the reviewer. Secondly, without the anonymity cloak, the reviewers may not be critical enough of some submissions from certain high-profile faculty/schools.

So with due caution and awareness of the complexity of peer review, I provide some suggestions to improve the system in good faith. This proposal has holes and difficulties in implementation, but I hope it serves to start the conversation.

While I am not confident in the specifics of the implementation, I think the high level solution is clear. We need to incentivize the reviewers and then demand accountability from them. So here are my suggestions to this end.

1. Compensate the reviewers

Reviewing is a thankless job. Conference reviewers get some recognition, as their name appears in the program committee page of the conference (but this gets old quickly). Journal reviewers don't even get that.

The reviewers do these reviews pro bono and are overworked. You may not realize it from outside but academics work really hard between teaching, research, departmental administration duties, and other professional duties.

Did I mention that reviewers and Editors don't get paid a dime? We are talking about domain experts here, who could easily make \$200 an hour consulting. I think we owe it to the reviewers to compensate them for their contribution in the publishing process.

Can the conferences and professional organizations afford to pay reviewers? I don't see why not. I have not been involved in any conference organization, so I may be wrong here, but please humor me.  A conference with 1000 attendees with \$1000 per registration fee (not uncommon) makes a total of  \$1 million. Where does this money go? Definitely not to the dubious quality conference swag. Hotels may be expensive, but not that much. If hotels are taking most of that money, we need tougher negotiators on the conference organization committees. The hotels already get a good deal with attendees staying there for the duration of the conference.

Remember Chesterton's fence. We should of course consider what type of side-effects compensating the reviewers may have. Could this lead to a clique of friends who recruit each other for reviewing gigs? I don't know.  If the compensation is not high, and if we keep reviewers accountable with respect to the quality of the reviews, this may not be a problem. If money is too messy, provide free conference registration to the reviewers.

Even if we don't compensate the reviewers, at least we need to set up a system to prevent freeloading. If you don't help with reviewing, you don't get to have your papers reviewed. If you are a conscientious reviewer, maybe you get to have a fourth reviewer?

2. Provide an author-feedback phase

Conferences should provide a rebuttal phase to give the authors a chance to respond to the reviewers' criticisms.  Although limited in their effectiveness/consequence, this response phase still gives a voice to the authors. As an extra point, I really liked what SIGMOD did with their author feedback; they explicitly asked the authors to report on any bad/offending reviews.

In journals, there is no rush. So even for rejected papers, the journal may provide a response opportunity, and the authors get to present their responses to the reviewers.

3. Form a grievance committee

To give more voice/faculty to the authors, a grievance committee can be formed to inspect the reviews suspected of foul play. The committee shall inspect the situation, consult with the other reviewers on the paper, and write a report on the decision.

Maybe this is on the crazy side, but here it is: It may even be possible to publicize the name of a malicious or severely negligent reviewer. (There may be a decentralized cryptographic signing solution under which two reviewers may make the name of the third reviewer visible if they agree on neglect/abuse by the third reviewer. Crypto geeks, is it possible to implement a multisig solution for this on hotcrp soon?)

4. Take ownership/responsibility for the reviews 

As PC chairs or journal editors, you should take responsibility of the quality of the reviews provided to the authors. You should not blindly sign on the reviews, as at the end of the day the quality of the reviews provided to the authors is your responsibility.

In addition, in-person PC meetings (when feasible) is good for enforcing accountability for the reviewers. Again the travel for PC members should be paid for by the conference registration fees, if an in-person PC meeting is established.

Reviews can be rated by other reviewers and feedback in the form of blind review can be provided to the reviewers. These feedback can help train reviewers to avoid common mistakes: being hypercritical on a minor point, failing to see the overall value provided, being prejudiced towards certain methods/approaches, making unsupported claims, etc. We may even consider pair-reviewing to complement peer-reviewing.

Finally, as a reviewer, you should consider voluntarily signing your names on your review. The idea here is to keep yourself accountable by voluntarily giving up your anonymity. The signing of the name decision should be made before the review assignments and one should not be allowed to sign reviews only for acceptances.

I have seen some people do this. And I will give this a try myself. In my experience, a reject decision doesn't hurt if the reviewer supports her position well and put in the work to understand and fairly evaluate the work with respect to the cohort of papers submitted. So I am OK signing my name on a reject decision.

Sunday, November 18, 2018

HotNets'18: Networking in Space

HotNets'18 was held at Microsoft Research, Building 99. This is walking distance to my office at Cosmos DB, where I am working at my sabbatical. So I got tempted and crushed this workshop for a couple sessions. And oh my God, am I happy I did it. The space session was particularly very interesting, and definitely worth writing about.

My God, it is full of satellites!

According to a 2018 estimate, there are 4,900 satellites in orbit, of which about 1,900 operational, while the rest have lived out their useful lives and become space debris. Approximately 500 operational satellites are in low-Earth orbit, 50 are in medium-Earth orbit (at 20,000 km), and the rest are in geostationary orbit (at 36,000 km).

The low earth orbit LEO satellites are not stationary and fast moving around the earth at 1.5 hour per rotation. We are talking about the lowest ring in this picture, where International Space Station (ISS) resides.

Since LEO satellites are close to Earth, this makes their communication latency low. Furthermore, if we take into account that the speed of light in vacuum is 1.5 times faster than in fiber/glass, communicating over the LEO satellites becomes a viable alternative to communicating over fiber, especially for reducing latency in WAN deployments.

When this gets built, it will change Internet: in some accounts up to 50% traffic may take this route in the future.

And, it is actually getting built soon.

Starlink: SpaceX's satellite constellation

Starlink is a satellite constellation development project underway by SpaceX, to develop a low-cost, high-performance satellite bus and requisite customer ground transceivers to implement a new space-based Internet communication system. Development began in 2015, and prototype test-flight satellites were launched on 22 February 2018. Initial operation of the constellation could begin in 2020 with satellite deployment beginning mid 2019.

In Starlink’s initial phase, 1,600 satellites in 1,150 km altitude orbits will provide connectivity to all except far north and south regions of the world. A second phase adds another 2,825 satellites in orbits ranging from 1,100 km altitude to 1325 km, increasing density of coverage at lower latitudes and providing coverage at least as far as 70 degrees North.


And guess what! It looks like these satellites will communicate with each other using fricking ``lasers''!

Delay is Not an Option: Low Latency Routing in Space

Mark Handley (University College London) tried to reverse-engineer SpaceX's FCC filings to figure out what is possible with Starlink. It was the most interesting talk at the conference (at least among the talks I attended). People listened to the talk breathlessly and mesmerized. Mark had such exquisite visualizations and darkened the room for us to appreciate them better. It was a 20 minute trip to space and to 10 years in the future to deliberate about networking in space. (Here is a link to the paper.)

A special note about his slides is in order. He coded the satellites and the routing algorithms using the Unity framework. His slides were not showing video of simulations, but rather running the simulator in real-time. Bold and beautiful way to present.

The FCC filings mention "silicon carbide" communication components, which point to laser communication. Since it would be hard to infer bandwidth without more information, Mark took on the question of figuring out what the latency could be, and how would it change as satellites move, and what kind of use would this enable.

Each satellite has 5 inter-satellite communication links. The phase 1 satellites are northeast bound. And the phase 2 satellites are southeast bound.

The coverage is not uniform. London would be able to communicate with 30 LEO satellites at any given time.

Routing over satellites multihop via laser 90ms latency is achievable, compared to 160ms over fiber communication. This is a big improvement, for which financial markets would pay good money for.

Mark also considered how many multiple paths could be run over these satellites? He found that 10 multipaths is feasible. But 20 simultaneous paths not possible in phase1 of constellations.

With the additions in second phase (satellites that are southeast bound), London to Johannesberg latency can come down to 80ms from 190ms. These phase 2 additions will also help for providing better multipath capacity. With the second phase additions, FCC required SpaceX to cover the Alaska north region. This may also serve the purpose of routing over the pole, for example for the NY to Beijing route.

The Starlink deployments open many research questions for networking:

  • how do you avoid building queues? (probably via a form of source routing)
  • how do you coordinate multipath route changes?
  • how do you avoid reordering without increasing latency?
  • how do you make topology adaptive?

The other papers in the session

The "Networking, in Heaven as on Earth" paper considers the interdomain routing problem with satellite constellations. The vision there is to full integration of satellite networks in Internet Control Plane (via BGP). But satellites move very fast which leads to frequent BGP updates. Filtering reduces updates but introduces connectivity problems. The paper mentions that a proactive routing strategy (that leverages predictability of satellite orbits) rather than reactive could work better

The "Gearing up for the 21st century space race" paper talks about miscellaneous issues in space networking. The talk mentioned that some trade activity shows outstripping of fiber speed communications from NY to Frankfurt and that people might be using short-wave (microwave) radios to beat fiber-optical speeds (where light travels 2/3rds slower than it does in vacuum). Then the talk speculated whether it is possible to establish multihop microwave routing using in-flight planes. It turns out it is possible to do it with 20 hops across the globe (east to west) with low-stretch and good latency.

MAD questions

1. Maybe we are getting there, huh?
This session reminded me of the Seveneves book by Neal Stephenson (great read, recommend highly). In Part 1, of the book there was very good coverage of Space orbits, maneuvering in space, and how dangerous space junk could get. Coincidentally, one of the talks mentioned that space is garbage tracked: anything bigger than a marble is tracked. At first I didn't buy this, didn't sound feasible. But turns out radar is used for learning trajectories of the space junk and the trajectories are maintained at the databases in space agencies to help make the  space station and the satellites to avoid them. So the satellites will be routing packets and simultaneously try to route around occasional space junk. We are getting ready to become a space-faring species, and that is very exciting.

2. What is next?
Faster than light quantum communication, anyone? Ender's Game series mentioned such communication. And of course there is a wikipedia page for faster than light communication.

3. Would it be feasible to do store and forward communication via the satellites?
You know the thought-experiment about the plane full-of-disks, right? It has very good throughput. Since these satellites are already moving at a fast speed, could they be used for data mules to improve throughput for big data networking, say between the Hadron collider in CERN and datacenters in US? Think of beaming up data to a row of satellites (one after another) that store this data and in 45 minutes or so dump these at the US datacenters.  Could this be a feasible alternative to fiber? Probably not so much, since the uplink and downlink are still limited.

Thursday, November 15, 2018

My Emacs journey

This is a follow up to my "Master your tools" post. As an example of one of my tools, I talk about my Emacs use.

I have been using Emacs for the last 20 years. At this point, I don't even know Emacs, my fingers do. If you ask me the shortcut for something, I will need to let my fingers do it and try to observe what they are doing. And sometimes ---as in the story of the caterpillar who forgets how to walk when asked to demonstrate it--- I forget about how to do something when I try to attempt it consciously.

From text-editing to text-wrangling

I have been learning Emacs at a glacial pace, but I think that worked for me better. I figured I can internalize so much at a time, so I didn't rush things. I initially used Emacs mostly for power editing LaTeX files.

It was only around 2008 that I started with the Emacs org-mode. I loved its outlining feature, and started using and customizing it ever since. It has been a big part of my thinking and writing process for the last 10 years. You can say that it became my out-of-core memory execution primitive.

When I write a blog post or an article, I use the org-mode outline headers to organize/departmentalize and text-wrangle the content. I have a JUNK header where I move extra text, this helps me overcome my kill-your-darlings syndrome. I have an INSERT header for noting down what I like to insert. I visit these later to decide what is the best place in the article to insert them, or whether I should move them to the junk header.

So this does work like my out-of-core memory module. At any time I only keep a small number of things in my mind, and use the headers as my swap memory. I go forward with one decision/issue at a time without overwhelming myself. This is how I try to scale my attention in my offline thinking mode. I don't have a large working memory (I also suspect you don't either) and this helps immensely. Text wrangling for the win!

Getting things done with org-mode

As for using org-mode (org-agenda) for TODO lists and getting things done, I had gone through 3 unsuccessful attempts before I finally made it to stick. After my failed attempts, I thought I am hopelessly disorganized and this is too much of a hassle. After I saw my colleague Jason use it regularly, I gave it another shot. Incorporating org-agenda to my workflow did wonders for my peace-of-mind, if not for my organization and timeliness. I wrote about this a little here.
Before integrating the emacs org-agenda to my life, I always had open loop tasks that caused me worries, and eating up cycles in my brain: "Oh, I should remember doing this", "Woe to me, I am procrastinating on this", etc. After successfully adopting emacs org-mode as my to-do list and agenda manager (which, took a couple years, and several aborted tries), the benefit I got out was the clarity of mind, and the release of all that background anxiety.

Other assorted Emacs tricks I use

I love the org-export-to-latex functionality of org-mode. This way I can get a pdf file shareable with others anytime, while still staying in org-mode where I do my thinking. In my org-file I would have COMMENTed out the JUNK header, INSERT header, and the META header (for questions/connections) to capture my thinking and provide a snapshot of my brain. The exported LaTeX article hides all of those, but I still need those as my documentation of my thought process and to evolve my work further. Frankly I don't get how Word/Pages/etc users deal with not having COMMENT sections in their documents.

I also use the org-export-to-beamer mode for quickly preparing presentations in Beamer. This helped me survive teaching. Preparing powerpoint presentations takes a lot of time and is painful. On the other hand, due to its integration to my thinking/writing process and due to the COMMENTing/evolving benefits I mentioned above, exporting the org document to Beamer makes things easy/frictionless for me.

Another hack I use is to maintain a notes.org file in any folder to keep track of that project. I use this as a lab-notebook to record everything about the project, and meta-thoughts, concerns, etc. I also add timestamps to my entries with my Emacs shortcut (Wed, 14 Nov 18 - - - 21:07) cntrl-c-t. Of course I had to try this with my fingers first to learn the command I use.

I use M-x-tomatinho for keeping track of my pomodoros in Emacs. It is visual, and it gives me a good picture of how my day is going. In my self.org file, I keep an org-mode table where I note which pomodoro number corresponds to what time and task. This gives me a candid picture of how my day is going and how my week has gone. This post from 3 years ago describes the pomodoro workflow I had then, which is obsolete now.

When I am working on large documents in org-mode, I use hot links (radio-targets) for  definitions, so that when I write the word at a later point in the file, Emacs automatically inserts the link to where it is defined first.

I also use predefined hi-lock for "!!" and "??" to easily highlight some interesting findings  and questions in the text. And I sometimes use impromptu M-x highlight-phrase or M-x highlight-regexp to highlight other things.

I use <f3> and <f4> to define and use keyboard macros for ad hoc custom editing needs.

I wrote a bit about how I prepare my Emacs setup here.

MAD questions

1. What should my Emacs learning/mastering pace be?
I don't consider myself a power Emacs user. I am far from it. I don't go shortcut crazy. I don't try to automate everything. And if some new Emacs tricks/shortkeys don't stick with me, I take it that I don't really need them. As a counterpoint to this though, I couldn't get the org-agenda to stick for a long time, and now that I use it I realize how much I needed it. So where should I draw the line about how much to push to learn/adopt new things?

2. Benefits of drawing/sketching versus typing
As much as I love Emacs, it is linear and text. Although org-mode helps for making things non-linear with its headers, it doesn't give the same visual thinking benefits from drawing/sketching. I think I will try to incorporate more doodling/sketching drawing to my workflow in the coming months. Let me know if you have good suggestions for this.

Monday, November 12, 2018

Is Twitter causally-consistent?

For the past 7-8 years, several research papers have used this example to motivate causal consistency. You must have read this example, right?

  • Alice removes her boss from her friend list, and posts that on her feed that she is looking for a new job. 
  • Tom removes his mom from the friend list, and posts his Spring Break photos.

Well, being the empirical researchers we are, Aleksey and I wanted to put Twitter in to test for this scenario.

On September 25, we performed this test. (I also have a video recording of this. But since I can't stand to hear myself talk in recordings, I am not posting it. I sound really weird, man.)

I first blocked Aleksey on my Twitter account, and then tweeted that Aleksey drinks a lot of tea (it's true). When we checked Aleksey's timeline, we saw that his timeline indeed did not display my tweet.

So, this was kind of an anticlimax. Twitter passed the causal-consistency test easily. No need to publish more causal-consistency papers, right?

Well,  maybe not quite. Maybe this was because Aleksey and I were in the same region and our accounts fall in to the same datacenter. We thought maybe we should repeat the test with Aleksey connected to Twitter via a proxy, but then since Aleksey's laptop was away that day, we decided to test this at another time. (Maybe we should find someone from another region and arrange a cross-continental Twitter causality test this time.)

Anyways, this story gets more interesting. Keep reading.

I unblocked Aleksey, and Aleksey checked that he was following me again, and we called it a day.

But a couple days later, I was tweeting something and wanted to include a mention to @AlekseyCharapko in my tweet. But Twitter didn't autocomplete for me. I finally found his account after some searching on Twitter, and saw that Aleksey is not following me. What!? My student doesn't follow me on Twitter? Impossible. I showed this to Aleksey and he was also caught by surprise.

We then remembered about the Twitter test we did a couple days ago.

It must be that although I unblocked Aleksey after our test, a nightly batch job blocked him from my account and made his account unfollow me.

Reality is often stranger than the clean models we have. Real systems have a lot of back channels and processes.

MAD questions

0) Some clarification on this from someone at Twitter

There was discussion on this at Hacker News.

Isn’t this a known feature of Twitter, usually called a soft block or forced unfollow? You block and unblock someone quickly and it forces them to unfollow you (and more importantly, does not signal them that it happened). I have heard of people fearing harassment to use this technique to get themselves out of a person’s timeline (out of sight, out of mind) without the pseudo-confrontation of a block.

This is correct.When we process a block request from user A -> user B, we remove the follow edges between user A -> user B and user B -> user A, and then add a block edge from user A -> user B.
When we process an unblock request from user A -> user B, we remove the block edge from user A -> user B.
I imagine that the "Aleksey checked that he was following me again" was either client caching, or eventual consistency latency. There's no nightly batch job or anything doing that.
Source: I work on the social graph service at Twitter.   
Here is the video segment that shows that after I unblocked Aleksey, he was able to see my tweets in his timeline. I don't know what happened after that which made his account unfollow mine again.  It might be that since my tweet included a mention to him, Aleksey clicked on the notification and could see my tweets after that, and not in his timeline. We repeated the experiment, and found that after I blocked and unblocked him, he was indeed still unfollowing me.

1) What is the right way to test this?

Sending two consecutive tweets from one account and monitoring from another does not seem sufficient to test causal-consistency to me. The tweets are likely partitioned with the user_id, and so just session consistency or per-key ordering would give the impression of causal consistency, right?

To offset for this, we tried another test, where I tweet something, Ailidani reads it and presses the button on his drafted tweet, and Aleksey checks his feed for the result. It was no contest again. Twitter is so fast that it displayed my tweet on Aleksey's timeline immediately and Ailidani's tweet didn't stand a chance.

(Another thing to note for completeness. Under content preferences, we disable the "show me the best tweets first" to get a time ordered tweet stream.)

2) Where would you put your money?
So let's say we set up this 3-person cross continental Twitter causal consistency test. Where would you put your money on?

Can we infer something from what we know of Twitter architecture? This is a good start for reasoning. But again, in our 3 person test, per-key/object ordering does not help.

Saturday, November 10, 2018

How to be a good machine learning product manager

There are a lot of interesting meetups at Seattle, and I try to attend one every couple weeks. Ruben Lozano Aguilera was the speaker for this meetup on Oct 17. Ruben is a product manager at Google Cloud, and before that he was a product manager at Amazon.

What is ML?

Programming transforms data + rules into answers.

Machine learning turns data + answers into rules.

When should you use ML?

Use ML if the problem:

  • handles complex logic
  • scales up really fast
  • requires specialized personalization
  • adapts in real-time

For example ML is a good fit for the "search" problem.  Search requires complex logic, for which it is not easy to develop rules.  It scales up really fast in terms of new keywords, combinations and content. It requires personalization depending on the context, and has some real-time adaptation component as well.

Another important point is that the problem should have existing examples of actual answers. When you bootstrap from a good enough dataset, you can scale further, because data -> predictions -> customer experience -> more traffic -> more data.

Some popular ML problems are ranking, recommendation, classification, regression, clustering, and anomaly detection.

Don't use ML when your problem:

  • can be solved by simple rules
  • does not adapt to new data
  • requires 100% accuracy
  • requires full interpretability/why-provenance

The data requires further consideration: Can you use data? Is it available, accessible, and sufficient? Is high quality? relevant, fresh, representative, unbiased? Is it appropriate to use the data: privacy, security concerns?

For the following, can you use ML or not?

  • What apparel items should be protected by copyright? No. This is dangerous financially, you need to get 100% accuracy.
  • Which resumes should we prioritize to interview for our candidate pipeline? No, this may be based on biased data.
  • What products should be exclusively sold to Hispanics in the US? No. This is discriminatory and creepy.
  • Which sellers have the greatest revenue potential? Yes.
  • Where should Amazon build next head quarters? No. This is not a repeatable problem; there is only one label: Seattle.
  • Which search queries should we scope for the Amazon fresh store? Yes.

What is the ML lifecycle?

For productizing ML, you need people, processes, and tools/systems.

The people come from two domains:

  • Math, statistics: ml scientist, applied scientist, resarch scientist, data scientist 
  • Software, programming: business intelligence engineer, data engineer, software engineer, dev manager, technical program manager

The ML lifecycle involves 4 phases: problem, data, features, and model.

To formulate the problem, you need to clarify what to solve, establish measurable goals, and determine what to predict.

For the data phase, you need to

  • select data: available, missing data, discarding data (data cleaning)
  • preprocess data: formatting, cleaning, sampling

For the features phase, you need to consider scaling, decomposition, aggregation, and discard any features that are not relevant.

Finally, for the model phase, you first divide the data set into training data and test data, could be 70+30 or 90+10. Then comes the model training (using whatever algorithm you are using), which produces the ML model. You then test this output ML model with the test data.

To productize your model, you should integrate the ML solution with existing software and keep it running over time. At this point considerations about the deployment environment, data storage, security and privacy, monitoring & maintenance come in to play. Some great ML solutions cannot be productized due to high implementation costs or inability to be tested in practice.

The product manager is very much involved in the first 2 phases: formulating the problem and selecting the data. Product manager is also involved in feature selection but not much involved with the final model phase.

MAD questions

1) Umm, deep learning?
Since the presentation didn't mention any deep learning specific problems/tasks/observations, I asked Ruben about what significance deep learning had on the projects he worked on. Turns out, he didn't use any. He said that simpler ML models were enough for the tasks they undertook so he never needed a deep-learning solution. He also said that deep-learning was very expensive up until a couple years ago, and that was also a factor.

With TensorFlow, Google is supposedly using deep-learning a lot, likely more for image and voice processing. But, is there a study about the prominence of deep-learning use among ML solutions in the industry?

2) How do you troubleshoot issues with productizing ML?
As we covered above there are many things that can go wrong, such as  unanticipated bias in your data, in your method, conclusions. How do you check for these? Ruben answered they brainstorm and think very deeply about what could go wrong, and identify these issues. It seems like this needs more processes and tool support. Having seen how TLA+ specifications and model checking create wonders for checking problems with distributed/concurrent systems, I am wondering if similar design level tool support could be developed for ML solutions.

3) How do we learn/teach empathy?
Ruben was a great speaker. He used beautifully designed slides. After all he is a product manager and sympathizes with the users/audience. In Q&A session he mentioned that empathy is the most important skill for a ML product manager. I believe empathizing with your audience also goes a long way in public speaking. How do we learn/teach empathy? This is so basic that you expect/hope we learn this as kids. But it looks like we keep forgetting about this and fail to empathize. Also, there is always levels to things. How do we get better at this?

4) Is ML/DL too application-coupled?
I have a some understanding of ML/DL domain, since I started learning about it in 2016. I am still amazed at how tightly application-coupled is the ML/DL work. On one hand this is good, this makes ML/DL very practical and very applicable. On the other hand, this makes it harder to study the principles and systematize knowledge.

Thursday, November 8, 2018

SDPaxos: Building efficient semi-decentralized geo-replicated state machines

In the last decade, the Paxos protocol family grew with the addition of new categories.

  • Rotating leader: Mencius
  • Leaderless: EPaxos, Fast Paxos
  • Paxos federations: Spanner, vertical Paxos 
  • Dynamic key-leader: WPaxos 

This paper, which appeared in SOCC 18, proposes SDPaxos which prescribes separating the control plane (single leader) from the replication plane (multiple leaders). SD in SDPaxos stands for "semi-decentralized".

The motivation for this stems from the following observation. Single leader Paxos approach has a centralized leader and runs into performance bottleneck problems. On the other hand, the leaderless (or opportunistic multileader) approach is fully decentralized but suffers from the conflicting command problems. Taking a hybrid approach to capture the best of both worlds, SDPaxos makes the command-leaders to be decentralized (the closest replica can lead the command), but the ordering-leader (i.e., the sequencer) is still centralized/unique in the system.

Below I give a brief explanation of the Paxos protocol categories before I discuss how SDPaxos compares and contrasts with those.

Plain vanilla Paxos

Paxos provides fault tolerant consensus among a set of nodes.

  • Agreement: No two correct nodes can decide on different values.
  • Validity: If all initial values are same, nodes must decide that value.
  • Termination: Correct nodes decide eventually.

Paxos runs in 3 phases: propose (phase-1), accept (phase-2), and commit (phase-3).

  1. A node tries to become the leader by proposing a unique ballot number b to its followers with a phase-1a message. The followers acknowledge a leader with the highest ballot seen so far, or reject it with a ballot seen with a number greater than b. Receiving any rejection fails the candidate. 
  2. In the absence of a rejection, a node becomes leader and advances to phase-2 after receiving a majority quorum of acknowledgments. In this phase, the leader chooses a suitable value v for its ballot. The value would be some uncommitted value associated with the highest ballot learned in previous phase, or a new value if no pending value exists. The leader commands its followers to accept the value v and waits for acknowledgement messages. Once the majority of followers acknowledge the value, it becomes anchored and cannot be revoked. Again a single rejection message (carrying an observed higher ballot number) received in phase-2b nullifies the leadership of the node, and sends it back to phase-1 to try with a higher ballot number. 
  3. Finally, the leader sends a commit message in phase-3 that allows the followers to commit and apply the value to their respected state machines.

It's important to see that after phase-2, an anchored value cannot be overridden later as it is guaranteed that any leader with higher ballot number would learn it as part of its phase-1 before proposing a value in its phase-2.

You can find more information on Paxos in this blog, and especially Modeling Paxos in TLA+ could be of interest to you.

Single leader approach

The traditional single-leader Paxos protocol employs a centralized leader to process all client requests and propose commands. The single leader takes a significantly heavier load than the other replicas, and becomes a performance bottleneck. Moreover, in geo-replication, clients not co-located with the leader need to send the requests to the remote leader, which incurs significantly higher wide-area network latency.

Mencius approach

Mencius is a multileader version of Paxos that aims to eliminate the single leader bottleneck in Paxos. Mencius achieves load balancing by partitioning the consensus instances among multiple servers. E.g., if we have 3 servers, server 0 is responsible for acting as a leader for consensus instances numbered 0,3,6, server 1 for 1,4,7, and server 2 for 2,5,8, etc. Mencius tries to avoid the straggler problem by making the replicas skip their turns when they fall behind, however, it cannot fully eliminate the slow-down. Since it uses multiple leaders, Mencius also loses out on the "serve reads locally at the leader" optimization possible in Paxos.

Optimistic leaders approach

EPaxos is a  solution, where every node can opportunistically become a leader for some command and commit it. When a command does not interfere with other concurrent commands, it is committed in a single round after receiving the acks from a fast quorum (which is approximately 3/4ths of all nodes). In a sense, EPaxos compresses the phase-2 to be a part of phase-1 when there are no conflicts. However, if the fast quorum detects a conflict between the commands, EPaxos defaults back to the traditional Paxos mode and proceeds with a second phase to establish order on the conflicting commands.

Unfortunately, services like E-commerce and social network can generate high-contention workload, with many interfering commands on the same object from multiple clients. This problem is aggrevated in wide area network deployments: since requests take much longer time to finish, the probability of contention rises.

Multileader approach

Spanner and CockroachDB are examples of databases that uses a federation of Paxos groups to work on different partitions/shards. These partitioned consensus systems employ another solution on top (such as vertical Paxos) for relocating/assigning data from one Paxos group to another.

WPaxos uses sharding of the key space and takes advantage of flexible quorums idea to improve WAN performance, especially in the presence of access locality. In WPaxos, every node can own some objects/microshards and operate on these independently. Unlike Vertical Paxos, WPaxos does not consider changing the object/microshard ownership as a reconfiguration operation and does not require an external consensus group. Instead WPaxos performs object/microshard migration between leaders by carrying out a phase-1 across the WAN with a higher ballot number, and commands are committed via phase-2 within the region or neighboring regions.

The SDPaxos approach

The root of the inefficiency of leaderless and multileader protocols is the decentralized coordination pattern. Although decentralization addresses the single-leader bottleneck as every replica can propose commands, the replicas still need to agree on a total order on conflicting commands proposed by different replicas to avoid inconsistent state.

To address this issue, SDPaxos divides the consensus protocol in 2 parts: durably replicating each command across replicas without global order (via C-instance Paxos), and ordering all commands to enforce the consistency guarantee (via O-instance Paxos). Replicating via C-instance Paxos is completely decentralized where every replica can freely propose commands and replicate them to other replicas. This evenly distributes the load among replicas, and enables clients to always contact the nearest one. On the other hand, as part of O-instance Paxos, one replica is elected as the sequencer and handles the ordering in a centralized manner: the global view enables this replica to always order commands appropriately. Provided that the ordering messages are smaller than replication messages, the load on the sequencer will not be as severe as that on the single leader in Paxos.

Fault tolerance is provided as both the replicating and ordering instances are conducted based on Paxos. Each replica proposes commands in a series of C-instances of its own to produce its partial log. The sequencer proposes replicas' IDs in O-instances to produce an assignment log. Based on the assignment log, all replicas' partial logs are finally merged into a global log.

Comparison with other protocols

The separation between C-instances and O-instances is the source of SDPaxos's advantages over existing protocols. The decentralization distributes load evenly across replicas, and allows clients to always contact the nearest replica in geo-replication to serve as the command leader. The O-instance leader, i.e., the sequencer, provides conflict free operation.

So, SDPaxos is like Paxos, but it has local leader in each region. This way it avoids the cost of going to the leader, and back.

Also SDPaxos is like EPaxos but with no conflicts, ever!

In SDPaxos, the sequencer is one node, but is backed by O-instances. A new sequencer can be chosen easily in a fault-tolerant way using Phase-1 of Paxos over O-instances. This alleviates the availability problems due to the serializer failure in systems that use Paxos for serializing the log in a central region. In such systems (e.g., Calvin) if the single log serializer is Paxos-replicated within a region, then the availability suffers on region failure. Instead, if the serializer is Paxos-replicated across regions then the performance suffers.

The protocol

In this example, upon receiving a client request for a command, replica R0 becomes the command leader of this command, picks one of its own C-instance and replicates the command to others (using the C-accept, i.e., Accept phase message of the C-instance). In the meantime, this C-accept also informs the sequencer (R2) to start an O-instance for this command. Then R2 proposes R0’s ID in the next (e.g., the jth) O-instance and sends O-accepts to others, to assign this command to the jth global slot. Replicas will then accept these instances and send C-ACKs and O-ACKs to R0; R2 also sends an O-ACK as it has sent an O-accept to itself. The algorithm denotes the ith C-instance of Rn as Cni, and the jth O-instance as Oj.

A C-instance can come from any node without a Paxos phase-1a, because each replica has its own distinct replication log for C-instance. The C-instance messages do not conflict with each other and gets accepted immediately. The C-instance messages do not even need a ballotnum; the ballotnum used is that of the O-instance to denote epoch (i.e., which sequencer the sender thinks is still in-charge).

A command being ready requires the C-instance and enough number of O-instances be committed. The conditions of an instance being committed and a command being ready are defined in lines 18 through 31, which we discuss next. There are two questions here.

  • The safety question is: How do we ensure that the replication is anchored (performed at the majority quorum) from the command leader perspective?
  • The performance question is: How do we achieve consensus in one round while satisfying the safety concern? 

The 1-round feat

In the best case when O-instance of the sequencer overlaps perfectly with the C-instance of the command replication leader, consensus is achieved in one round-trip time ---the optimal possible. But, since the O-instance starts half a round trip later than the C-instance for non-sequencer replicas, it is not always possible to optimize the O-instance completion to just half a round trip to achieve the one-round-trip latency. But the paper shows how this can be achieved for N=3 and N=5 replicas. In groups with more than 5 replicas, the O-instances still need one round trip, thus the overall latency remains at 1.5 round-trips.

In 3-replica groups, when the command leader receives O-ACK from the sequencer, the majority (2 out of 3) is readily established for O-instance completion. This provides the one-round-trip consensus. (For the case, the command leader is also the sequencer, sequencing can be done in advance and one round-trip is satisfied as well.)

In 5-replica groups, a command can also be ready in one round trip, but unlike the case of three replicas, an O-instance cannot be accepted by a majority in half a round trip. Instead, SDPaxos lets each non-sequencer replica commit an O-instance, upon receiving the O-accept from the sequencer (line 29). Here, the O-instance does not rigorously follow Paxos, which raises a complication for recovery: if this non-sequencer replica and the sequencer fail, we cannot recover this O-instance simply by Paxos because the other alive replicas may have not seen the O-accept yet. However, the paper discusses a way for SDPaxos to correctly recover the O-instances of all replicas' ready commands even in such cases (omitted in my review), and is able to allow for 1-round-trip commits for N=5.

Note that the dynamic per key leader approach in WPaxos still has an edge over SDPaxos when there is good locality in the workload (which is often the case in practice) and/or when the number of replicas is greater than 5 (which is often the case for geo-replicated databases). It may be possible to use WPaxos for coordination across regions and integrate SDPaxos for coordination within the region upto 5 replicas.


As an optimization for reads, SDPaxos uses sequencer leases to authorize the sequencer to directly reply to the read requests. In contrast, such an optimization is not possible for leaderless approaches, as there is no single/dedicated leader to lease and read from.

As another optimization, in some cases, it is possible to divide the responsibility of sequencer to all replicas for more load balancing. For example, in a key-value store, we can partition the key space using approaches like consistent hashing, then make each replica order the commands on one partition (commands on different keys can be out-of-order). Again, in this case, it would be possible to use the WPaxos approach for safe key-stealing among the sequencers, and dynamically adapt to the access pattern, rather than being confined to static partitioning.


They implemented a prototype of SDPaxos, and compared its performance with typical single-leader (Multi-Paxos) and multileader (Mencius, EPaxos) protocols. The experiment results demonstrate that SDPaxos achieves: (1) 1.6× the throughput of Mencius with a straggler, (2) stable performance under different contention degrees and 1.7× the throughput of EPaxos even with a low contention rate of 5%, (3) 6.1× the throughput of Multi-Paxos without straggler or contention, (4) 4.6× the throughput of writes when performing reads, and (5) up to 61% and 99% lower wide-area latency of writes and reads than other protocols.

MAD questions

1. Does SDPaxos help with the leader bottleneck significantly?
I wrote above that "Provided that the ordering messages are smaller than the replication messages, the load on the sequencer will not be as severe as that on the single leader in Paxos." But on close inspection I don't think I believe that sentence anymore.  Ailidani, Aleksey, and I have done a detailed bottleneck analysis of Paxos protocol categories (under submission), and we found that the outcast messages are not the biggest source of bottleneck for the leader, as they are serialized once before being sent out to the replicas. The incast messages contribute most to the bottleneck, as the CPU needs to process them one by one and they queue up. Moreover, the incast messages are ACK messages, which are already small, and SDPaxos does not make them smaller. So, maybe SDPaxos does not improve significantly on the single-leader bottleneck in Paxos. On the other hand, it is true that SDPaxos helps distribute the client request load to the C-instance replicas relieving that bottleneck, and it definitely helps with lowering the latency in WAN.

2. How would you build reconfiguration of participants for this protocol? 
Reconfiguration of participants is important to re-establish fault-tolerance capability by replacing failed replicas by accepting fresh new replicas to the system. How would reconfiguration work for SDPaxos? Would Raft's 2-phase reconfiguration method apply readily for SDPaxos?

3. What are additional challenges for efficient strongly-consistent geo-replication implementation at scale?
I am at Microsoft Azure Cosmos DB for my sabbatical. Cosmos DB recently introduced general-availability of multiple write regions. While providing strong-consistency with multi-master writes allowed from any region has cost across the globe, could SDPaxos ideas help improve efficiency further?

Tuesday, November 6, 2018

An unexpected phone call at the elevator

Sometime ago, I was visiting an apartment block. This was a relatively new apartment block, I think less than 5 years old.

I entered the elevator, and I heard "Hello, hello... Are you there?" booming from the speakers. This never happened to me before of course; elevators don't talk to me often. I thought maybe someone had pressed the call for help button, and left the elevator, and I can help sort this out.

I said, "Hi, yes!"

The voice continued saying:
-"The reason I am calling you is because of your last energy bill".

After a second of cognitive dissonance, I started laughing:
-"Umm, I don't know what happened, but I am in an elevator now, and your voice is literally streaming from the elevator."

The voice stopped for a brief couple of seconds, then continued explaining something about the bill, following the script he was given.

-"Dude, this is my floor, and I got to go now. Ok, bye."

It turns out most elevator phones ---which are mandated by law to allow you to call for emergency services in case you get stuck in the elevator--- are public network connected phones. And this occasionally leads to receiving phone calls at the elevator. This short youtube video is so funny. Unfortunately I was so shocked by the phone call it didn't occur to me to record it.

From now on, this is how I will answer any unwanted spam call I receive: "Dude, I don’t know what happened, but I am in an elevator now, and your voice is literally streaming from the elevator. This is my floor, I gotto go now."

MAD questions

1) If we have Internet-of-Things vision realized, do I have this to look forward to? Collection calls at the elevator? Or when trying to drive my car?

2) What is the worst malicious thing you can do if you have (or can find) the phone numbers for elevators at apartment buildings or workplaces?

Huh, of course, there is prior art on the Web.

Two-phase commit and beyond

In this post, we model and explore the two-phase commit protocol using TLA+. The two-phase commit protocol is practical and is used in man...