Sunday, November 18, 2018

HotNets'18: Networking in Space

HotNets'18 was held at Microsoft Research, Building 99. This is walking distance to my office at Cosmos DB, where I am working at my sabbatical. So I got tempted and crushed this workshop for a couple sessions. And oh my God, am I happy I did it. The space session was particularly very interesting, and definitely worth writing about.

My God, it is full of satellites!

According to a 2018 estimate, there are 4,900 satellites in orbit, of which about 1,900 operational, while the rest have lived out their useful lives and become space debris. Approximately 500 operational satellites are in low-Earth orbit, 50 are in medium-Earth orbit (at 20,000 km), and the rest are in geostationary orbit (at 36,000 km).


The low earth orbit LEO satellites are not stationary and fast moving around the earth at 1.5 hour per rotation. We are talking about the lowest ring in this picture, where International Space Station (ISS) resides.

Since LEO satellites are close to Earth, this makes their communication latency low. Furthermore, if we take into account that the speed of light in vacuum is 1.5 times faster than in fiber/glass, communicating over the LEO satellites becomes a viable alternative to communicating over fiber, especially for reducing latency in WAN deployments.

When this gets built, it will change Internet: in some accounts up to 50% traffic may take this route in the future.

And, it is actually getting built soon.

Starlink: SpaceX's satellite constellation

Starlink is a satellite constellation development project underway by SpaceX, to develop a low-cost, high-performance satellite bus and requisite customer ground transceivers to implement a new space-based Internet communication system. Development began in 2015, and prototype test-flight satellites were launched on 22 February 2018. Initial operation of the constellation could begin in 2020 with satellite deployment beginning mid 2019.

In Starlink’s initial phase, 1,600 satellites in 1,150 km altitude orbits will provide connectivity to all except far north and south regions of the world. A second phase adds another 2,825 satellites in orbits ranging from 1,100 km altitude to 1325 km, increasing density of coverage at lower latitudes and providing coverage at least as far as 70 degrees North.

via GIPHY

And guess what! It looks like these satellites will communicate with each other using fricking ``lasers''!

Delay is Not an Option: Low Latency Routing in Space

Mark Handley (University College London) tried to reverse-engineer SpaceX's FCC filings to figure out what is possible with Starlink. It was the most interesting talk at the conference (at least among the talks I attended). People listened to the talk breathlessly and mesmerized. Mark had such exquisite visualizations and darkened the room for us to appreciate them better. It was a 20 minute trip to space and to 10 years in the future to deliberate about networking in space. (Here is a link to the paper.)


A special note about his slides is in order. He coded the satellites and the routing algorithms using the Unity framework. His slides were not showing video of simulations, but rather running the simulator in real-time. Bold and beautiful way to present.

The FCC filings mention "silicon carbide" communication components, which point to laser communication. Since it would be hard to infer bandwidth without more information, Mark took on the question of figuring out what the latency could be, and how would it change as satellites move, and what kind of use would this enable.

Each satellite has 5 inter-satellite communication links. The phase 1 satellites are northeast bound. And the phase 2 satellites are southeast bound.

The coverage is not uniform. London would be able to communicate with 30 LEO satellites at any given time.

Routing over satellites multihop via laser 90ms latency is achievable, compared to 160ms over fiber communication. This is a big improvement, for which financial markets would pay good money for.

Mark also considered how many multiple paths could be run over these satellites? He found that 10 multipaths is feasible. But 20 simultaneous paths not possible in phase1 of constellations.

With the additions in second phase (satellites that are southeast bound), London to Johannesberg latency can come down to 80ms from 190ms. These phase 2 additions will also help for providing better multipath capacity. With the second phase additions, FCC required SpaceX to cover the Alaska north region. This may also serve the purpose of routing over the pole, for example for the NY to Beijing route.

The Starlink deployments open many research questions for networking:

  • how do you avoid building queues? (probably via a form of source routing)
  • how do you coordinate multipath route changes?
  • how do you avoid reordering without increasing latency?
  • how do you make topology adaptive?


The other papers in the session

The "Networking, in Heaven as on Earth" paper considers the interdomain routing problem with satellite constellations. The vision there is to full integration of satellite networks in Internet Control Plane (via BGP). But satellites move very fast which leads to frequent BGP updates. Filtering reduces updates but introduces connectivity problems. The paper mentions that a proactive routing strategy (that leverages predictability of satellite orbits) rather than reactive could work better

The "Gearing up for the 21st century space race" paper talks about miscellaneous issues in space networking. The talk mentioned that some trade activity shows outstripping of fiber speed communications from NY to Frankfurt and that people might be using short-wave (microwave) radios to beat fiber-optical speeds (where light travels 2/3rds slower than it does in vacuum). Then the talk speculated whether it is possible to establish multihop microwave routing using in-flight planes. It turns out it is possible to do it with 20 hops across the globe (east to west) with low-stretch and good latency.


MAD questions

1. Maybe we are getting there, huh?
This session reminded me of the Seveneves book by Neal Stephenson (great read, recommend highly). In Part 1, of the book there was very good coverage of Space orbits, maneuvering in space, and how dangerous space junk could get. Coincidentally, one of the talks mentioned that space is garbage tracked: anything bigger than a marble is tracked. At first I didn't buy this, didn't sound feasible. But turns out radar is used for learning trajectories of the space junk and the trajectories are maintained at the databases in space agencies to help make the  space station and the satellites to avoid them. So the satellites will be routing packets and simultaneously try to route around occasional space junk. We are getting ready to become a space-faring species, and that is very exciting.

2. What is next?
Faster than light quantum communication, anyone? Ender's Game series mentioned such communication. And of course there is a wikipedia page for faster than light communication.

3. Would it be feasible to do store and forward communication via the satellites?
You know the thought-experiment about the plane full-of-disks, right? It has very good throughput. Since these satellites are already moving at a fast speed, could they be used for data mules to improve throughput for big data networking, say between the Hadron collider in CERN and datacenters in US? Think of beaming up data to a row of satellites (one after another) that store this data and in 45 minutes or so dump these at the US datacenters.  Could this be a feasible alternative to fiber? Probably not so much, since the uplink and downlink are still limited.

Thursday, November 15, 2018

My Emacs journey

This is a follow up to my "Master your tools" post. As an example of one of my tools, I talk about my Emacs use.

I have been using Emacs for the last 20 years. At this point, I don't even know Emacs, my fingers do. If you ask me the shortcut for something, I will need to let my fingers do it and try to observe what they are doing. And sometimes ---as in the story of the caterpillar who forgets how to walk when asked to demonstrate it--- I forget about how to do something when I try to attempt it consciously.


From text-editing to text-wrangling

I have been learning Emacs at a glacial pace, but I think that worked for me better. I figured I can internalize so much at a time, so I didn't rush things. I initially used Emacs mostly for power editing LaTeX files.

It was only around 2008 that I started with the Emacs org-mode. I loved its outlining feature, and started using and customizing it ever since. It has been a big part of my thinking and writing process for the last 10 years. You can say that it became my out-of-core memory execution primitive.

When I write a blog post or an article, I use the org-mode outline headers to organize/departmentalize and text-wrangle the content. I have a JUNK header where I move extra text, this helps me overcome my kill-your-darlings syndrome. I have an INSERT header for noting down what I like to insert. I visit these later to decide what is the best place in the article to insert them, or whether I should move them to the junk header.

So this does work like my out-of-core memory module. At any time I only keep a small number of things in my mind, and use the headers as my swap memory. I go forward with one decision/issue at a time without overwhelming myself. This is how I try to scale my attention in my offline thinking mode. I don't have a large working memory (I also suspect you don't either) and this helps immensely. Text wrangling for the win!

Getting things done with org-mode

As for using org-mode (org-agenda) for TODO lists and getting things done, I had gone through 3 unsuccessful attempts before I finally made it to stick. After my failed attempts, I thought I am hopelessly disorganized and this is too much of a hassle. After I saw my colleague Jason use it regularly, I gave it another shot. Incorporating org-agenda to my workflow did wonders for my peace-of-mind, if not for my organization and timeliness. I wrote about this a little here.
Before integrating the emacs org-agenda to my life, I always had open loop tasks that caused me worries, and eating up cycles in my brain: "Oh, I should remember doing this", "Woe to me, I am procrastinating on this", etc. After successfully adopting emacs org-mode as my to-do list and agenda manager (which, took a couple years, and several aborted tries), the benefit I got out was the clarity of mind, and the release of all that background anxiety.

Other assorted Emacs tricks I use

I love the org-export-to-latex functionality of org-mode. This way I can get a pdf file shareable with others anytime, while still staying in org-mode where I do my thinking. In my org-file I would have COMMENTed out the JUNK header, INSERT header, and the META header (for questions/connections) to capture my thinking and provide a snapshot of my brain. The exported LaTeX article hides all of those, but I still need those as my documentation of my thought process and to evolve my work further. Frankly I don't get how Word/Pages/etc users deal with not having COMMENT sections in their documents.

I also use the org-export-to-beamer mode for quickly preparing presentations in Beamer. This helped me survive teaching. Preparing powerpoint presentations takes a lot of time and is painful. On the other hand, due to its integration to my thinking/writing process and due to the COMMENTing/evolving benefits I mentioned above, exporting the org document to Beamer makes things easy/frictionless for me.

Another hack I use is to maintain a notes.org file in any folder to keep track of that project. I use this as a lab-notebook to record everything about the project, and meta-thoughts, concerns, etc. I also add timestamps to my entries with my Emacs shortcut (Wed, 14 Nov 18 - - - 21:07) cntrl-c-t. Of course I had to try this with my fingers first to learn the command I use.

I use M-x-tomatinho for keeping track of my pomodoros in Emacs. It is visual, and it gives me a good picture of how my day is going. In my self.org file, I keep an org-mode table where I note which pomodoro number corresponds to what time and task. This gives me a candid picture of how my day is going and how my week has gone. This post from 3 years ago describes the pomodoro workflow I had then, which is obsolete now.

When I am working on large documents in org-mode, I use hot links (radio-targets) for  definitions, so that when I write the word at a later point in the file, Emacs automatically inserts the link to where it is defined first.

I also use predefined hi-lock for "!!" and "??" to easily highlight some interesting findings  and questions in the text. And I sometimes use impromptu M-x highlight-phrase or M-x highlight-regexp to highlight other things.

I use <f3> and <f4> to define and use keyboard macros for ad hoc custom editing needs.

I wrote a bit about how I prepare my Emacs setup here.

MAD questions

1. What should my Emacs learning/mastering pace be?
I don't consider myself a power Emacs user. I am far from it. I don't go shortcut crazy. I don't try to automate everything. And if some new Emacs tricks/shortkeys don't stick with me, I take it that I don't really need them. As a counterpoint to this though, I couldn't get the org-agenda to stick for a long time, and now that I use it I realize how much I needed it. So where should I draw the line about how much to push to learn/adopt new things?

2. Benefits of drawing/sketching versus typing
As much as I love Emacs, it is linear and text. Although org-mode helps for making things non-linear with its headers, it doesn't give the same visual thinking benefits from drawing/sketching. I think I will try to incorporate more doodling/sketching drawing to my workflow in the coming months. Let me know if you have good suggestions for this.

Monday, November 12, 2018

Is Twitter causally-consistent?

For the past 7-8 years, several research papers have used this example to motivate causal consistency. You must have read this example, right?

  • Alice removes her boss from her friend list, and posts that on her feed that she is looking for a new job. 
  • Tom removes his mom from the friend list, and posts his Spring Break photos.

Well, being the empirical researchers we are, Aleksey and I wanted to put Twitter in to test for this scenario.

On September 25, we performed this test. (I also have a video recording of this. But since I can't stand to hear myself talk in recordings, I am not posting it. I sound really weird, man.)

I first blocked Aleksey on my Twitter account, and then tweeted that Aleksey drinks a lot of tea (it's true). When we checked Aleksey's timeline, we saw that his timeline indeed did not display my tweet.

So, this was kind of an anticlimax. Twitter passed the causal-consistency test easily. No need to publish more causal-consistency papers, right?

Well,  maybe not quite. Maybe this was because Aleksey and I were in the same region and our accounts fall in to the same datacenter. We thought maybe we should repeat the test with Aleksey connected to Twitter via a proxy, but then since Aleksey's laptop was away that day, we decided to test this at another time. (Maybe we should find someone from another region and arrange a cross-continental Twitter causality test this time.)

Anyways, this story gets more interesting. Keep reading.

I unblocked Aleksey, and Aleksey checked that he was following me again, and we called it a day.

But a couple days later, I was tweeting something and wanted to include a mention to @AlekseyCharapko in my tweet. But Twitter didn't autocomplete for me. I finally found his account after some searching on Twitter, and saw that Aleksey is not following me. What!? My student doesn't follow me on Twitter? Impossible. I showed this to Aleksey and he was also caught by surprise.

We then remembered about the Twitter test we did a couple days ago.

It must be that although I unblocked Aleksey after our test, a nightly batch job blocked him from my account and made his account unfollow me.

Reality is often stranger than the clean models we have. Real systems have a lot of back channels and processes.

MAD questions

0) Some clarification on this from someone at Twitter

There was discussion on this at Hacker News.

Isn’t this a known feature of Twitter, usually called a soft block or forced unfollow? You block and unblock someone quickly and it forces them to unfollow you (and more importantly, does not signal them that it happened). I have heard of people fearing harassment to use this technique to get themselves out of a person’s timeline (out of sight, out of mind) without the pseudo-confrontation of a block.

This is correct.When we process a block request from user A -> user B, we remove the follow edges between user A -> user B and user B -> user A, and then add a block edge from user A -> user B.
When we process an unblock request from user A -> user B, we remove the block edge from user A -> user B.
I imagine that the "Aleksey checked that he was following me again" was either client caching, or eventual consistency latency. There's no nightly batch job or anything doing that.
Source: I work on the social graph service at Twitter.   
Here is the video segment that shows that after I unblocked Aleksey, he was able to see my tweets in his timeline. I don't know what happened after that which made his account unfollow mine again.  It might be that since my tweet included a mention to him, Aleksey clicked on the notification and could see my tweets after that, and not in his timeline. We repeated the experiment, and found that after I blocked and unblocked him, he was indeed still unfollowing me.

1) What is the right way to test this?

Sending two consecutive tweets from one account and monitoring from another does not seem sufficient to test causal-consistency to me. The tweets are likely partitioned with the user_id, and so just session consistency or per-key ordering would give the impression of causal consistency, right?

To offset for this, we tried another test, where I tweet something, Ailidani reads it and presses the button on his drafted tweet, and Aleksey checks his feed for the result. It was no contest again. Twitter is so fast that it displayed my tweet on Aleksey's timeline immediately and Ailidani's tweet didn't stand a chance.

(Another thing to note for completeness. Under content preferences, we disable the "show me the best tweets first" to get a time ordered tweet stream.)


2) Where would you put your money?
So let's say we set up this 3-person cross continental Twitter causal consistency test. Where would you put your money on?

Can we infer something from what we know of Twitter architecture? This is a good start for reasoning. But again, in our 3 person test, per-key/object ordering does not help.

Saturday, November 10, 2018

How to be a good machine learning product manager

There are a lot of interesting meetups at Seattle, and I try to attend one every couple weeks. Ruben Lozano Aguilera was the speaker for this meetup on Oct 17. Ruben is a product manager at Google Cloud, and before that he was a product manager at Amazon.

What is ML?

Programming transforms data + rules into answers.

Machine learning turns data + answers into rules.

When should you use ML?

Use ML if the problem:

  • handles complex logic
  • scales up really fast
  • requires specialized personalization
  • adapts in real-time

For example ML is a good fit for the "search" problem.  Search requires complex logic, for which it is not easy to develop rules.  It scales up really fast in terms of new keywords, combinations and content. It requires personalization depending on the context, and has some real-time adaptation component as well.

Another important point is that the problem should have existing examples of actual answers. When you bootstrap from a good enough dataset, you can scale further, because data -> predictions -> customer experience -> more traffic -> more data.

Some popular ML problems are ranking, recommendation, classification, regression, clustering, and anomaly detection.

Don't use ML when your problem:

  • can be solved by simple rules
  • does not adapt to new data
  • requires 100% accuracy
  • requires full interpretability/why-provenance


The data requires further consideration: Can you use data? Is it available, accessible, and sufficient? Is high quality? relevant, fresh, representative, unbiased? Is it appropriate to use the data: privacy, security concerns?

For the following, can you use ML or not?

  • What apparel items should be protected by copyright? No. This is dangerous financially, you need to get 100% accuracy.
  • Which resumes should we prioritize to interview for our candidate pipeline? No, this may be based on biased data.
  • What products should be exclusively sold to Hispanics in the US? No. This is discriminatory and creepy.
  • Which sellers have the greatest revenue potential? Yes.
  • Where should Amazon build next head quarters? No. This is not a repeatable problem; there is only one label: Seattle.
  • Which search queries should we scope for the Amazon fresh store? Yes.


What is the ML lifecycle?

For productizing ML, you need people, processes, and tools/systems.

The people come from two domains:

  • Math, statistics: ml scientist, applied scientist, resarch scientist, data scientist 
  • Software, programming: business intelligence engineer, data engineer, software engineer, dev manager, technical program manager

The ML lifecycle involves 4 phases: problem, data, features, and model.

To formulate the problem, you need to clarify what to solve, establish measurable goals, and determine what to predict.

For the data phase, you need to

  • select data: available, missing data, discarding data (data cleaning)
  • preprocess data: formatting, cleaning, sampling


For the features phase, you need to consider scaling, decomposition, aggregation, and discard any features that are not relevant.

Finally, for the model phase, you first divide the data set into training data and test data, could be 70+30 or 90+10. Then comes the model training (using whatever algorithm you are using), which produces the ML model. You then test this output ML model with the test data.

To productize your model, you should integrate the ML solution with existing software and keep it running over time. At this point considerations about the deployment environment, data storage, security and privacy, monitoring & maintenance come in to play. Some great ML solutions cannot be productized due to high implementation costs or inability to be tested in practice.

The product manager is very much involved in the first 2 phases: formulating the problem and selecting the data. Product manager is also involved in feature selection but not much involved with the final model phase.

MAD questions

1) Umm, deep learning?
Since the presentation didn't mention any deep learning specific problems/tasks/observations, I asked Ruben about what significance deep learning had on the projects he worked on. Turns out, he didn't use any. He said that simpler ML models were enough for the tasks they undertook so he never needed a deep-learning solution. He also said that deep-learning was very expensive up until a couple years ago, and that was also a factor.

With TensorFlow, Google is supposedly using deep-learning a lot, likely more for image and voice processing. But, is there a study about the prominence of deep-learning use among ML solutions in the industry?

2) How do you troubleshoot issues with productizing ML?
As we covered above there are many things that can go wrong, such as  unanticipated bias in your data, in your method, conclusions. How do you check for these? Ruben answered they brainstorm and think very deeply about what could go wrong, and identify these issues. It seems like this needs more processes and tool support. Having seen how TLA+ specifications and model checking create wonders for checking problems with distributed/concurrent systems, I am wondering if similar design level tool support could be developed for ML solutions.

3) How do we learn/teach empathy?
Ruben was a great speaker. He used beautifully designed slides. After all he is a product manager and sympathizes with the users/audience. In Q&A session he mentioned that empathy is the most important skill for a ML product manager. I believe empathizing with your audience also goes a long way in public speaking. How do we learn/teach empathy? This is so basic that you expect/hope we learn this as kids. But it looks like we keep forgetting about this and fail to empathize. Also, there is always levels to things. How do we get better at this?

4) Is ML/DL too application-coupled?
I have a some understanding of ML/DL domain, since I started learning about it in 2016. I am still amazed at how tightly application-coupled is the ML/DL work. On one hand this is good, this makes ML/DL very practical and very applicable. On the other hand, this makes it harder to study the principles and systematize knowledge.

Thursday, November 8, 2018

SDPaxos: Building efficient semi-decentralized geo-replicated state machines

In the last decade, the Paxos protocol family grew with the addition of new categories.

  • Rotating leader: Mencius
  • Leaderless: EPaxos, Fast Paxos
  • Paxos federations: Spanner, vertical Paxos 
  • Dynamic key-leader: WPaxos 

This paper, which appeared in SOCC 18, proposes SDPaxos which prescribes separating the control plane (single leader) from the replication plane (multiple leaders). SD in SDPaxos stands for "semi-decentralized".

The motivation for this stems from the following observation. Single leader Paxos approach has a centralized leader and runs into performance bottleneck problems. On the other hand, the leaderless (or opportunistic multileader) approach is fully decentralized but suffers from the conflicting command problems. Taking a hybrid approach to capture the best of both worlds, SDPaxos makes the command-leaders to be decentralized (the closest replica can lead the command), but the ordering-leader (i.e., the sequencer) is still centralized/unique in the system.

Below I give a brief explanation of the Paxos protocol categories before I discuss how SDPaxos compares and contrasts with those.

Plain vanilla Paxos

Paxos provides fault tolerant consensus among a set of nodes.

  • Agreement: No two correct nodes can decide on different values.
  • Validity: If all initial values are same, nodes must decide that value.
  • Termination: Correct nodes decide eventually.


Paxos runs in 3 phases: propose (phase-1), accept (phase-2), and commit (phase-3).

  1. A node tries to become the leader by proposing a unique ballot number b to its followers with a phase-1a message. The followers acknowledge a leader with the highest ballot seen so far, or reject it with a ballot seen with a number greater than b. Receiving any rejection fails the candidate. 
  2. In the absence of a rejection, a node becomes leader and advances to phase-2 after receiving a majority quorum of acknowledgments. In this phase, the leader chooses a suitable value v for its ballot. The value would be some uncommitted value associated with the highest ballot learned in previous phase, or a new value if no pending value exists. The leader commands its followers to accept the value v and waits for acknowledgement messages. Once the majority of followers acknowledge the value, it becomes anchored and cannot be revoked. Again a single rejection message (carrying an observed higher ballot number) received in phase-2b nullifies the leadership of the node, and sends it back to phase-1 to try with a higher ballot number. 
  3. Finally, the leader sends a commit message in phase-3 that allows the followers to commit and apply the value to their respected state machines.

It's important to see that after phase-2, an anchored value cannot be overridden later as it is guaranteed that any leader with higher ballot number would learn it as part of its phase-1 before proposing a value in its phase-2.

You can find more information on Paxos in this blog, and especially Modeling Paxos in TLA+ could be of interest to you.

Single leader approach

The traditional single-leader Paxos protocol employs a centralized leader to process all client requests and propose commands. The single leader takes a significantly heavier load than the other replicas, and becomes a performance bottleneck. Moreover, in geo-replication, clients not co-located with the leader need to send the requests to the remote leader, which incurs significantly higher wide-area network latency.

Mencius approach

Mencius is a multileader version of Paxos that aims to eliminate the single leader bottleneck in Paxos. Mencius achieves load balancing by partitioning the consensus instances among multiple servers. E.g., if we have 3 servers, server 0 is responsible for acting as a leader for consensus instances numbered 0,3,6, server 1 for 1,4,7, and server 2 for 2,5,8, etc. Mencius tries to avoid the straggler problem by making the replicas skip their turns when they fall behind, however, it cannot fully eliminate the slow-down. Since it uses multiple leaders, Mencius also loses out on the "serve reads locally at the leader" optimization possible in Paxos.

Leaderless approach

EPaxos is a leaderless solution, where every node can opportunistically become a leader for some command and commit it. When a command does not interfere with other concurrent commands, it is committed in a single round after receiving the acks from a fast quorum (which is approximately 3/4ths of all nodes). In a sense, EPaxos compresses the phase-2 to be a part of phase-1 when there are no conflicts. However, if the fast quorum detects a conflict between the commands, EPaxos defaults back to the traditional Paxos mode and proceeds with a second phase to establish order on the conflicting commands.

Unfortunately, services like E-commerce and social network can generate high-contention workload, with many interfering commands on the same object from multiple clients. This problem is aggrevated in wide area network deployments: since requests take much longer time to finish, the probability of contention rises.

Multileader approach

Spanner and CockroachDB are examples of databases that uses a federation of Paxos groups to work on different partitions/shards. These partitioned consensus systems employ another solution on top (such as vertical Paxos) for relocating/assigning data from one Paxos group to another.

WPaxos uses sharding of the key space and takes advantage of flexible quorums idea to improve WAN performance, especially in the presence of access locality. In WPaxos, every node can own some objects/microshards and operate on these independently. Unlike Vertical Paxos, WPaxos does not consider changing the object/microshard ownership as a reconfiguration operation and does not require an external consensus group. Instead WPaxos performs object/microshard migration between leaders by carrying out a phase-1 across the WAN with a higher ballot number, and commands are committed via phase-2 within the region or neighboring regions.

The SDPaxos approach

The root of the inefficiency of leaderless and multileader protocols is the decentralized coordination pattern. Although decentralization addresses the single-leader bottleneck as every replica can propose commands, the replicas still need to agree on a total order on conflicting commands proposed by different replicas to avoid inconsistent state.

To address this issue, SDPaxos divides the consensus protocol in 2 parts: durably replicating each command across replicas without global order (via C-instance Paxos), and ordering all commands to enforce the consistency guarantee (via O-instance Paxos). Replicating via C-instance Paxos is completely decentralized where every replica can freely propose commands and replicate them to other replicas. This evenly distributes the load among replicas, and enables clients to always contact the nearest one. On the other hand, as part of O-instance Paxos, one replica is elected as the sequencer and handles the ordering in a centralized manner: the global view enables this replica to always order commands appropriately. Provided that the ordering messages are smaller than replication messages, the load on the sequencer will not be as severe as that on the single leader in Paxos.

Fault tolerance is provided as both the replicating and ordering instances are conducted based on Paxos. Each replica proposes commands in a series of C-instances of its own to produce its partial log. The sequencer proposes replicas' IDs in O-instances to produce an assignment log. Based on the assignment log, all replicas' partial logs are finally merged into a global log.

Comparison with other protocols

The separation between C-instances and O-instances is the source of SDPaxos's advantages over existing protocols. The decentralization distributes load evenly across replicas, and allows clients to always contact the nearest replica in geo-replication to serve as the command leader. The O-instance leader, i.e., the sequencer, provides conflict free operation.

So, SDPaxos is like Paxos, but it has local leader in each region. This way it avoids the cost of going to the leader, and back.

Also SDPaxos is like EPaxos but with no conflicts, ever!

In SDPaxos, the sequencer is one node, but is backed by O-instances. A new sequencer can be chosen easily in a fault-tolerant way using Phase-1 of Paxos over O-instances. This alleviates the availability problems due to the serializer failure in systems that use Paxos for serializing the log in a central region. In such systems (e.g., Calvin) if the single log serializer is Paxos-replicated within a region, then the availability suffers on region failure. Instead, if the serializer is Paxos-replicated across regions then the performance suffers.

The protocol


In this example, upon receiving a client request for a command, replica R0 becomes the command leader of this command, picks one of its own C-instance and replicates the command to others (using the C-accept, i.e., Accept phase message of the C-instance). In the meantime, this C-accept also informs the sequencer (R2) to start an O-instance for this command. Then R2 proposes R0’s ID in the next (e.g., the jth) O-instance and sends O-accepts to others, to assign this command to the jth global slot. Replicas will then accept these instances and send C-ACKs and O-ACKs to R0; R2 also sends an O-ACK as it has sent an O-accept to itself. The algorithm denotes the ith C-instance of Rn as Cni, and the jth O-instance as Oj.



A C-instance can come from any node without a Paxos phase-1a, because each replica has its own distinct replication log for C-instance. The C-instance messages do not conflict with each other and gets accepted immediately. The C-instance messages do not even need a ballotnum; the ballotnum used is that of the O-instance to denote epoch (i.e., which sequencer the sender thinks is still in-charge).

A command being ready requires the C-instance and enough number of O-instances be committed. The conditions of an instance being committed and a command being ready are defined in lines 18 through 31, which we discuss next. There are two questions here.

  • The safety question is: How do we ensure that the replication is anchored (performed at the majority quorum) from the command leader perspective?
  • The performance question is: How do we achieve consensus in one round while satisfying the safety concern? 


The 1-round feat

In the best case when O-instance of the sequencer overlaps perfectly with the C-instance of the command replication leader, consensus is achieved in one round-trip time ---the optimal possible. But, since the O-instance starts half a round trip later than the C-instance for non-sequencer replicas, it is not always possible to optimize the O-instance completion to just half a round trip to achieve the one-round-trip latency. But the paper shows how this can be achieved for N=3 and N=5 replicas. In groups with more than 5 replicas, the O-instances still need one round trip, thus the overall latency remains at 1.5 round-trips.


In 3-replica groups, when the command leader receives O-ACK from the sequencer, the majority (2 out of 3) is readily established for O-instance completion. This provides the one-round-trip consensus. (For the case, the command leader is also the sequencer, sequencing can be done in advance and one round-trip is satisfied as well.)


In 5-replica groups, a command can also be ready in one round trip, but unlike the case of three replicas, an O-instance cannot be accepted by a majority in half a round trip. Instead, SDPaxos lets each non-sequencer replica commit an O-instance, upon receiving the O-accept from the sequencer (line 29). Here, the O-instance does not rigorously follow Paxos, which raises a complication for recovery: if this non-sequencer replica and the sequencer fail, we cannot recover this O-instance simply by Paxos because the other alive replicas may have not seen the O-accept yet. However, the paper discusses a way for SDPaxos to correctly recover the O-instances of all replicas' ready commands even in such cases (omitted in my review), and is able to allow for 1-round-trip commits for N=5.


Note that the dynamic per key leader approach in WPaxos still has an edge over SDPaxos when there is good locality in the workload (which is often the case in practice) and/or when the number of replicas is greater than 5 (which is often the case for geo-replicated databases). It may be possible to use WPaxos for coordination across regions and integrate SDPaxos for coordination within the region upto 5 replicas.

Optimizations

As an optimization for reads, SDPaxos uses sequencer leases to authorize the sequencer to directly reply to the read requests. In contrast, such an optimization is not possible for leaderless approaches, as there is no single/dedicated leader to lease and read from.

As another optimization, in some cases, it is possible to divide the responsibility of sequencer to all replicas for more load balancing. For example, in a key-value store, we can partition the key space using approaches like consistent hashing, then make each replica order the commands on one partition (commands on different keys can be out-of-order). Again, in this case, it would be possible to use the WPaxos approach for safe key-stealing among the sequencers, and dynamically adapt to the access pattern, rather than being confined to static partitioning.

Evaluation

They implemented a prototype of SDPaxos, and compared its performance with typical single-leader (Multi-Paxos) and multileader (Mencius, EPaxos) protocols. The experiment results demonstrate that SDPaxos achieves: (1) 1.6× the throughput of Mencius with a straggler, (2) stable performance under different contention degrees and 1.7× the throughput of EPaxos even with a low contention rate of 5%, (3) 6.1× the throughput of Multi-Paxos without straggler or contention, (4) 4.6× the throughput of writes when performing reads, and (5) up to 61% and 99% lower wide-area latency of writes and reads than other protocols.



MAD questions

1. Does SDPaxos help with the leader bottleneck significantly?
I wrote above that "Provided that the ordering messages are smaller than the replication messages, the load on the sequencer will not be as severe as that on the single leader in Paxos." But on close inspection I don't think I believe that sentence anymore.  Ailidani, Aleksey, and I have done a detailed bottleneck analysis of Paxos protocol categories (under submission), and we found that the outcast messages are not the biggest source of bottleneck for the leader, as they are serialized once before being sent out to the replicas. The incast messages contribute most to the bottleneck, as the CPU needs to process them one by one and they queue up. Moreover, the incast messages are ACK messages, which are already small, and SDPaxos does not make them smaller. So, maybe SDPaxos does not improve significantly on the single-leader bottleneck in Paxos. On the other hand, it is true that SDPaxos helps distribute the client request load to the C-instance replicas relieving that bottleneck, and it definitely helps with lowering the latency in WAN.

2. How would you build reconfiguration of participants for this protocol? 
Reconfiguration of participants is important to re-establish fault-tolerance capability by replacing failed replicas by accepting fresh new replicas to the system. How would reconfiguration work for SDPaxos? Would Raft's 2-phase reconfiguration method apply readily for SDPaxos?


3. What are additional challenges for efficient strongly-consistent geo-replication implementation at scale?
I am at Microsoft Azure Cosmos DB for my sabbatical. Cosmos DB recently introduced general-availability of multiple write regions. While providing strong-consistency with multi-master writes allowed from any region has cost across the globe, could SDPaxos ideas help improve efficiency further?

Tuesday, November 6, 2018

An unexpected phone call at the elevator

Sometime ago, I was visiting an apartment block. This was a relatively new apartment block, I think less than 5 years old.

I entered the elevator, and I heard "Hello, hello... Are you there?" booming from the speakers. This never happened to me before of course; elevators don't talk to me often. I thought maybe someone had pressed the call for help button, and left the elevator, and I can help sort this out.

I said, "Hi, yes!"

The voice continued saying:
-"The reason I am calling you is because of your last energy bill".

After a second of cognitive dissonance, I started laughing:
-"Umm, I don't know what happened, but I am in an elevator now, and your voice is literally streaming from the elevator."

The voice stopped for a brief couple of seconds, then continued explaining something about the bill, following the script he was given.

-"Dude, this is my floor, and I got to go now. Ok, bye."


It turns out most elevator phones ---which are mandated by law to allow you to call for emergency services in case you get stuck in the elevator--- are public network connected phones. And this occasionally leads to receiving phone calls at the elevator. This short youtube video is so funny. Unfortunately I was so shocked by the phone call it didn't occur to me to record it.

From now on, this is how I will answer any unwanted spam call I receive: "Dude, I don’t know what happened, but I am in an elevator now, and your voice is literally streaming from the elevator. This is my floor, I gotto go now."

MAD questions

1) If we have Internet-of-Things vision realized, do I have this to look forward to? Collection calls at the elevator? Or when trying to drive my car?

2) What is the worst malicious thing you can do if you have (or can find) the phone numbers for elevators at apartment buildings or workplaces?

Huh, of course, there is prior art on the Web.

Monday, November 5, 2018

Judoing the Dunning-Kruger effect: the "surprisingly-popular option" strategy for crowdsourcing

Let's say you want to crowdsource the answer to the question:
(Q1) What is the capital of Brazil?

The surprisingly-popular option strategy for crowdsourcing suggests piggybacking a control question to the real question Q1:
(Q2) What do you think the majority of other people will respond to this question?

Many people (non-Brazilian and non-geography-nerds) will answer with Rio for both Q1 and Q2. But there will be some people that will answer with Brasilia for Q1 and yet Rio for Q2.

The first set of people that replied with Rio to both questions did not know much about Brazil, and went with what they know as the most prominent city in Brazil. The second set of people not only knew of the correct answer, Brasilia, but they also anticipated that the majority of participants will go wrong by answering with Rio.

Rio is the popular option for Q1, but Brasilia is the surprisingly popular option because the respondents for Brasilia had anticipated that Rio would be the popular option.

The lesson is: "People who expect to be in the minority opinion deserve some extra attention."


Here is an article about this algorithm, and here is the nature paper.
(I wish this would have occurred to us 4 years ago, when we were collecting crowdsourced data from our "who wants to be a millionaire" app.)

MAD questions

1. What are similar judoing techniques?
I am fascinated with this technique, because this is like applying a judo-throw on the Dunning-Kruger and reversing its effects.

It takes a weakness observed in the Dunning-Kruger effect (that people of low ability suffer from illusory superiority and mistakenly assess their cognitive ability as greater than it is), and transforms that in to a strength for identifying the expert voices among the participants, the people who knew the answer to Q2 but provided a minority answer for Q1.

What are similar judo moves in computer science/technology and in general?


2. What about the conspiracy nuts?
This surprisingly-popular option technique has a conspiracy nuts problem, isn't it? There is inevitably a strongly opinionated group that will be in the minority and they will correctly anticipate that they will be in the minority. This technique does not filter for them, right?

So, let's take anti-vaxxers. They will say "no" to (Q1) "Should I vaccinate my kids?" and they will correctly guess the popular option "yes" to (Q2) "What do you think the majority of other people will respond to this question?"

How do we fix this issue with the method? Maybe we can add a (Q3) What do you think the minority choice is and why it is wrong? Of course we are moving away from the automatic aggregation in the original method, but the essence of the idea still holds: Are you aware of the other opinions on this question, and can you explain and refute them?

Can you not only strawman, but also steelman the other arguments and address them?

3. How does this apply to me as I assess my views/positions?
For the hygiene of the mind, it is important to reflect on your beliefsets/behaviors/positions and reassess them occasionally. Am I able to explain why I am in the majority or minority side for my positions? Do I understand the other positions and not only refute them but can also appreciate some valuable points in them?

Some other related posts from my blog on this are: