## Sunday, November 18, 2018

### HotNets'18: Networking in Space

HotNets'18 was held at Microsoft Research, Building 99. This is walking distance to my office at Cosmos DB, where I am working at my sabbatical. So I got tempted and crushed this workshop for a couple sessions. And oh my God, am I happy I did it. The space session was particularly very interesting, and definitely worth writing about.

## My God, it is full of satellites!

According to a 2018 estimate, there are 4,900 satellites in orbit, of which about 1,900 operational, while the rest have lived out their useful lives and become space debris. Approximately 500 operational satellites are in low-Earth orbit, 50 are in medium-Earth orbit (at 20,000 km), and the rest are in geostationary orbit (at 36,000 km).

The low earth orbit LEO satellites are not stationary and fast moving around the earth at 1.5 hour per rotation. We are talking about the lowest ring in this picture, where International Space Station (ISS) resides.

Since LEO satellites are close to Earth, this makes their communication latency low. Furthermore, if we take into account that the speed of light in vacuum is 1.5 times faster than in fiber/glass, communicating over the LEO satellites becomes a viable alternative to communicating over fiber, especially for reducing latency in WAN deployments.

When this gets built, it will change Internet: in some accounts up to 50% traffic may take this route in the future.

And, it is actually getting built soon.

Starlink is a satellite constellation development project underway by SpaceX, to develop a low-cost, high-performance satellite bus and requisite customer ground transceivers to implement a new space-based Internet communication system. Development began in 2015, and prototype test-flight satellites were launched on 22 February 2018. Initial operation of the constellation could begin in 2020 with satellite deployment beginning mid 2019.

In Starlink’s initial phase, 1,600 satellites in 1,150 km altitude orbits will provide connectivity to all except far north and south regions of the world. A second phase adds another 2,825 satellites in orbits ranging from 1,100 km altitude to 1325 km, increasing density of coverage at lower latitudes and providing coverage at least as far as 70 degrees North.

via GIPHY

And guess what! It looks like these satellites will communicate with each other using fricking lasers''!

## Delay is Not an Option: Low Latency Routing in Space

Mark Handley (University College London) tried to reverse-engineer SpaceX's FCC filings to figure out what is possible with Starlink. It was the most interesting talk at the conference (at least among the talks I attended). People listened to the talk breathlessly and mesmerized. Mark had such exquisite visualizations and darkened the room for us to appreciate them better. It was a 20 minute trip to space and to 10 years in the future to deliberate about networking in space. (Here is a link to the paper.)

A special note about his slides is in order. He coded the satellites and the routing algorithms using the Unity framework. His slides were not showing video of simulations, but rather running the simulator in real-time. Bold and beautiful way to present.

The FCC filings mention "silicon carbide" communication components, which point to laser communication. Since it would be hard to infer bandwidth without more information, Mark took on the question of figuring out what the latency could be, and how would it change as satellites move, and what kind of use would this enable.

Each satellite has 5 inter-satellite communication links. The phase 1 satellites are northeast bound. And the phase 2 satellites are southeast bound.

The coverage is not uniform. London would be able to communicate with 30 LEO satellites at any given time.

Routing over satellites multihop via laser 90ms latency is achievable, compared to 160ms over fiber communication. This is a big improvement, for which financial markets would pay good money for.

Mark also considered how many multiple paths could be run over these satellites? He found that 10 multipaths is feasible. But 20 simultaneous paths not possible in phase1 of constellations.

With the additions in second phase (satellites that are southeast bound), London to Johannesberg latency can come down to 80ms from 190ms. These phase 2 additions will also help for providing better multipath capacity. With the second phase additions, FCC required SpaceX to cover the Alaska north region. This may also serve the purpose of routing over the pole, for example for the NY to Beijing route.

The Starlink deployments open many research questions for networking:

• how do you avoid building queues? (probably via a form of source routing)
• how do you coordinate multipath route changes?
• how do you avoid reordering without increasing latency?
• how do you make topology adaptive?

## The other papers in the session

The "Networking, in Heaven as on Earth" paper considers the interdomain routing problem with satellite constellations. The vision there is to full integration of satellite networks in Internet Control Plane (via BGP). But satellites move very fast which leads to frequent BGP updates. Filtering reduces updates but introduces connectivity problems. The paper mentions that a proactive routing strategy (that leverages predictability of satellite orbits) rather than reactive could work better

The "Gearing up for the 21st century space race" paper talks about miscellaneous issues in space networking. The talk mentioned that some trade activity shows outstripping of fiber speed communications from NY to Frankfurt and that people might be using short-wave (microwave) radios to beat fiber-optical speeds (where light travels 2/3rds slower than it does in vacuum). Then the talk speculated whether it is possible to establish multihop microwave routing using in-flight planes. It turns out it is possible to do it with 20 hops across the globe (east to west) with low-stretch and good latency.

1. Maybe we are getting there, huh?
This session reminded me of the Seveneves book by Neal Stephenson (great read, recommend highly). In Part 1, of the book there was very good coverage of Space orbits, maneuvering in space, and how dangerous space junk could get. Coincidentally, one of the talks mentioned that space is garbage tracked: anything bigger than a marble is tracked. At first I didn't buy this, didn't sound feasible. But turns out radar is used for learning trajectories of the space junk and the trajectories are maintained at the databases in space agencies to help make the  space station and the satellites to avoid them. So the satellites will be routing packets and simultaneously try to route around occasional space junk. We are getting ready to become a space-faring species, and that is very exciting.

2. What is next?
Faster than light quantum communication, anyone? Ender's Game series mentioned such communication. And of course there is a wikipedia page for faster than light communication.

3. Would it be feasible to do store and forward communication via the satellites?
You know the thought-experiment about the plane full-of-disks, right? It has very good throughput. Since these satellites are already moving at a fast speed, could they be used for data mules to improve throughput for big data networking, say between the Hadron collider in CERN and datacenters in US? Think of beaming up data to a row of satellites (one after another) that store this data and in 45 minutes or so dump these at the US datacenters.  Could this be a feasible alternative to fiber? Probably not so much, since the uplink and downlink are still limited.

## Thursday, November 15, 2018

### My Emacs journey

This is a follow up to my "Master your tools" post. As an example of one of my tools, I talk about my Emacs use.

I have been using Emacs for the last 20 years. At this point, I don't even know Emacs, my fingers do. If you ask me the shortcut for something, I will need to let my fingers do it and try to observe what they are doing. And sometimes ---as in the story of the caterpillar who forgets how to walk when asked to demonstrate it--- I forget about how to do something when I try to attempt it consciously.

## From text-editing to text-wrangling

I have been learning Emacs at a glacial pace, but I think that worked for me better. I figured I can internalize so much at a time, so I didn't rush things. I initially used Emacs mostly for power editing LaTeX files.

It was only around 2008 that I started with the Emacs org-mode. I loved its outlining feature, and started using and customizing it ever since. It has been a big part of my thinking and writing process for the last 10 years. You can say that it became my out-of-core memory execution primitive.

When I write a blog post or an article, I use the org-mode outline headers to organize/departmentalize and text-wrangle the content. I have a JUNK header where I move extra text, this helps me overcome my kill-your-darlings syndrome. I have an INSERT header for noting down what I like to insert. I visit these later to decide what is the best place in the article to insert them, or whether I should move them to the junk header.

So this does work like my out-of-core memory module. At any time I only keep a small number of things in my mind, and use the headers as my swap memory. I go forward with one decision/issue at a time without overwhelming myself. This is how I try to scale my attention in my offline thinking mode. I don't have a large working memory (I also suspect you don't either) and this helps immensely. Text wrangling for the win!

## Getting things done with org-mode

As for using org-mode (org-agenda) for TODO lists and getting things done, I had gone through 3 unsuccessful attempts before I finally made it to stick. After my failed attempts, I thought I am hopelessly disorganized and this is too much of a hassle. After I saw my colleague Jason use it regularly, I gave it another shot. Incorporating org-agenda to my workflow did wonders for my peace-of-mind, if not for my organization and timeliness. I wrote about this a little here.
Before integrating the emacs org-agenda to my life, I always had open loop tasks that caused me worries, and eating up cycles in my brain: "Oh, I should remember doing this", "Woe to me, I am procrastinating on this", etc. After successfully adopting emacs org-mode as my to-do list and agenda manager (which, took a couple years, and several aborted tries), the benefit I got out was the clarity of mind, and the release of all that background anxiety.

## Other assorted Emacs tricks I use

I love the org-export-to-latex functionality of org-mode. This way I can get a pdf file shareable with others anytime, while still staying in org-mode where I do my thinking. In my org-file I would have COMMENTed out the JUNK header, INSERT header, and the META header (for questions/connections) to capture my thinking and provide a snapshot of my brain. The exported LaTeX article hides all of those, but I still need those as my documentation of my thought process and to evolve my work further. Frankly I don't get how Word/Pages/etc users deal with not having COMMENT sections in their documents.

I also use the org-export-to-beamer mode for quickly preparing presentations in Beamer. This helped me survive teaching. Preparing powerpoint presentations takes a lot of time and is painful. On the other hand, due to its integration to my thinking/writing process and due to the COMMENTing/evolving benefits I mentioned above, exporting the org document to Beamer makes things easy/frictionless for me.

Another hack I use is to maintain a notes.org file in any folder to keep track of that project. I use this as a lab-notebook to record everything about the project, and meta-thoughts, concerns, etc. I also add timestamps to my entries with my Emacs shortcut (Wed, 14 Nov 18 - - - 21:07) cntrl-c-t. Of course I had to try this with my fingers first to learn the command I use.

I use M-x-tomatinho for keeping track of my pomodoros in Emacs. It is visual, and it gives me a good picture of how my day is going. In my self.org file, I keep an org-mode table where I note which pomodoro number corresponds to what time and task. This gives me a candid picture of how my day is going and how my week has gone. This post from 3 years ago describes the pomodoro workflow I had then, which is obsolete now.

When I am working on large documents in org-mode, I use hot links (radio-targets) for  definitions, so that when I write the word at a later point in the file, Emacs automatically inserts the link to where it is defined first.

I also use predefined hi-lock for "!!" and "??" to easily highlight some interesting findings  and questions in the text. And I sometimes use impromptu M-x highlight-phrase or M-x highlight-regexp to highlight other things.

I use <f3> and <f4> to define and use keyboard macros for ad hoc custom editing needs.

I wrote a bit about how I prepare my Emacs setup here.

1. What should my Emacs learning/mastering pace be?
I don't consider myself a power Emacs user. I am far from it. I don't go shortcut crazy. I don't try to automate everything. And if some new Emacs tricks/shortkeys don't stick with me, I take it that I don't really need them. As a counterpoint to this though, I couldn't get the org-agenda to stick for a long time, and now that I use it I realize how much I needed it. So where should I draw the line about how much to push to learn/adopt new things?

2. Benefits of drawing/sketching versus typing
As much as I love Emacs, it is linear and text. Although org-mode helps for making things non-linear with its headers, it doesn't give the same visual thinking benefits from drawing/sketching. I think I will try to incorporate more doodling/sketching drawing to my workflow in the coming months. Let me know if you have good suggestions for this.

## Monday, November 12, 2018

For the past 7-8 years, several research papers have used this example to motivate causal consistency. You must have read this example, right?

• Alice removes her boss from her friend list, and posts that on her feed that she is looking for a new job.
• Tom removes his mom from the friend list, and posts his Spring Break photos.

Well, being the empirical researchers we are, Aleksey and I wanted to put Twitter in to test for this scenario.

On September 25, we performed this test. (I also have a video recording of this. But since I can't stand to hear myself talk in recordings, I am not posting it. I sound really weird, man.)

I first blocked Aleksey on my Twitter account, and then tweeted that Aleksey drinks a lot of tea (it's true). When we checked Aleksey's timeline, we saw that his timeline indeed did not display my tweet.

So, this was kind of an anticlimax. Twitter passed the causal-consistency test easily. No need to publish more causal-consistency papers, right?

Well,  maybe not quite. Maybe this was because Aleksey and I were in the same region and our accounts fall in to the same datacenter. We thought maybe we should repeat the test with Aleksey connected to Twitter via a proxy, but then since Aleksey's laptop was away that day, we decided to test this at another time. (Maybe we should find someone from another region and arrange a cross-continental Twitter causality test this time.)

Anyways, this story gets more interesting. Keep reading.

I unblocked Aleksey, and Aleksey checked that he was following me again, and we called it a day.

But a couple days later, I was tweeting something and wanted to include a mention to @AlekseyCharapko in my tweet. But Twitter didn't autocomplete for me. I finally found his account after some searching on Twitter, and saw that Aleksey is not following me. What!? My student doesn't follow me on Twitter? Impossible. I showed this to Aleksey and he was also caught by surprise.

We then remembered about the Twitter test we did a couple days ago.

It must be that although I unblocked Aleksey after our test, a nightly batch job blocked him from my account and made his account unfollow me.

Reality is often stranger than the clean models we have. Real systems have a lot of back channels and processes.

0) Some clarification on this from someone at Twitter

There was discussion on this at Hacker News.
 Isn’t this a known feature of Twitter, usually called a soft block or forced unfollow? You block and unblock someone quickly and it forces them to unfollow you (and more importantly, does not signal them that it happened). I have heard of people fearing harassment to use this technique to get themselves out of a person’s timeline (out of sight, out of mind) without the pseudo-confrontation of a block.
 This is correct.When we process a block request from user A -> user B, we remove the follow edges between user A -> user B and user B -> user A, and then add a block edge from user A -> user B. When we process an unblock request from user A -> user B, we remove the block edge from user A -> user B. I imagine that the "Aleksey checked that he was following me again" was either client caching, or eventual consistency latency. There's no nightly batch job or anything doing that. Source: I work on the social graph service at Twitter.
Here is the video segment that shows that after I unblocked Aleksey, he was able to see my tweets in his timeline. I don't know what happened after that which made his account unfollow mine again.  It might be that since my tweet included a mention to him, Aleksey clicked on the notification and could see my tweets after that, and not in his timeline. We repeated the experiment, and found that after I blocked and unblocked him, he was indeed still unfollowing me.

1) What is the right way to test this?

Sending two consecutive tweets from one account and monitoring from another does not seem sufficient to test causal-consistency to me. The tweets are likely partitioned with the user_id, and so just session consistency or per-key ordering would give the impression of causal consistency, right?

To offset for this, we tried another test, where I tweet something, Ailidani reads it and presses the button on his drafted tweet, and Aleksey checks his feed for the result. It was no contest again. Twitter is so fast that it displayed my tweet on Aleksey's timeline immediately and Ailidani's tweet didn't stand a chance.

(Another thing to note for completeness. Under content preferences, we disable the "show me the best tweets first" to get a time ordered tweet stream.)

2) Where would you put your money?
So let's say we set up this 3-person cross continental Twitter causal consistency test. Where would you put your money on?

Can we infer something from what we know of Twitter architecture? This is a good start for reasoning. But again, in our 3 person test, per-key/object ordering does not help.

## Saturday, November 10, 2018

### How to be a good machine learning product manager

There are a lot of interesting meetups at Seattle, and I try to attend one every couple weeks. Ruben Lozano Aguilera was the speaker for this meetup on Oct 17. Ruben is a product manager at Google Cloud, and before that he was a product manager at Amazon.

## What is ML?

Programming transforms data + rules into answers.

Machine learning turns data + answers into rules.

## When should you use ML?

Use ML if the problem:

• handles complex logic
• scales up really fast
• requires specialized personalization

For example ML is a good fit for the "search" problem.  Search requires complex logic, for which it is not easy to develop rules.  It scales up really fast in terms of new keywords, combinations and content. It requires personalization depending on the context, and has some real-time adaptation component as well.

Another important point is that the problem should have existing examples of actual answers. When you bootstrap from a good enough dataset, you can scale further, because data -> predictions -> customer experience -> more traffic -> more data.

Some popular ML problems are ranking, recommendation, classification, regression, clustering, and anomaly detection.

Don't use ML when your problem:

• can be solved by simple rules
• does not adapt to new data
• requires 100% accuracy
• requires full interpretability/why-provenance

The data requires further consideration: Can you use data? Is it available, accessible, and sufficient? Is high quality? relevant, fresh, representative, unbiased? Is it appropriate to use the data: privacy, security concerns?

For the following, can you use ML or not?

• What apparel items should be protected by copyright? No. This is dangerous financially, you need to get 100% accuracy.
• Which resumes should we prioritize to interview for our candidate pipeline? No, this may be based on biased data.
• What products should be exclusively sold to Hispanics in the US? No. This is discriminatory and creepy.
• Which sellers have the greatest revenue potential? Yes.
• Where should Amazon build next head quarters? No. This is not a repeatable problem; there is only one label: Seattle.
• Which search queries should we scope for the Amazon fresh store? Yes.

## What is the ML lifecycle?

For productizing ML, you need people, processes, and tools/systems.

The people come from two domains:

• Math, statistics: ml scientist, applied scientist, resarch scientist, data scientist
• Software, programming: business intelligence engineer, data engineer, software engineer, dev manager, technical program manager

The ML lifecycle involves 4 phases: problem, data, features, and model.

To formulate the problem, you need to clarify what to solve, establish measurable goals, and determine what to predict.

For the data phase, you need to

• select data: available, missing data, discarding data (data cleaning)
• preprocess data: formatting, cleaning, sampling

For the features phase, you need to consider scaling, decomposition, aggregation, and discard any features that are not relevant.

Finally, for the model phase, you first divide the data set into training data and test data, could be 70+30 or 90+10. Then comes the model training (using whatever algorithm you are using), which produces the ML model. You then test this output ML model with the test data.

To productize your model, you should integrate the ML solution with existing software and keep it running over time. At this point considerations about the deployment environment, data storage, security and privacy, monitoring & maintenance come in to play. Some great ML solutions cannot be productized due to high implementation costs or inability to be tested in practice.

The product manager is very much involved in the first 2 phases: formulating the problem and selecting the data. Product manager is also involved in feature selection but not much involved with the final model phase.

1) Umm, deep learning?
Since the presentation didn't mention any deep learning specific problems/tasks/observations, I asked Ruben about what significance deep learning had on the projects he worked on. Turns out, he didn't use any. He said that simpler ML models were enough for the tasks they undertook so he never needed a deep-learning solution. He also said that deep-learning was very expensive up until a couple years ago, and that was also a factor.

With TensorFlow, Google is supposedly using deep-learning a lot, likely more for image and voice processing. But, is there a study about the prominence of deep-learning use among ML solutions in the industry?

2) How do you troubleshoot issues with productizing ML?
As we covered above there are many things that can go wrong, such as  unanticipated bias in your data, in your method, conclusions. How do you check for these? Ruben answered they brainstorm and think very deeply about what could go wrong, and identify these issues. It seems like this needs more processes and tool support. Having seen how TLA+ specifications and model checking create wonders for checking problems with distributed/concurrent systems, I am wondering if similar design level tool support could be developed for ML solutions.

3) How do we learn/teach empathy?
Ruben was a great speaker. He used beautifully designed slides. After all he is a product manager and sympathizes with the users/audience. In Q&A session he mentioned that empathy is the most important skill for a ML product manager. I believe empathizing with your audience also goes a long way in public speaking. How do we learn/teach empathy? This is so basic that you expect/hope we learn this as kids. But it looks like we keep forgetting about this and fail to empathize. Also, there is always levels to things. How do we get better at this?

4) Is ML/DL too application-coupled?
I have a some understanding of ML/DL domain, since I started learning about it in 2016. I am still amazed at how tightly application-coupled is the ML/DL work. On one hand this is good, this makes ML/DL very practical and very applicable. On the other hand, this makes it harder to study the principles and systematize knowledge.

## Thursday, November 8, 2018

### SDPaxos: Building efficient semi-decentralized geo-replicated state machines

In the last decade, the Paxos protocol family grew with the addition of new categories.

• Paxos federations: Spanner, vertical Paxos

This paper, which appeared in SOCC 18, proposes SDPaxos which prescribes separating the control plane (single leader) from the replication plane (multiple leaders). SD in SDPaxos stands for "semi-decentralized".

The motivation for this stems from the following observation. Single leader Paxos approach has a centralized leader and runs into performance bottleneck problems. On the other hand, the leaderless (or opportunistic multileader) approach is fully decentralized but suffers from the conflicting command problems. Taking a hybrid approach to capture the best of both worlds, SDPaxos makes the command-leaders to be decentralized (the closest replica can lead the command), but the ordering-leader (i.e., the sequencer) is still centralized/unique in the system.

Below I give a brief explanation of the Paxos protocol categories before I discuss how SDPaxos compares and contrasts with those.

## Plain vanilla Paxos

Paxos provides fault tolerant consensus among a set of nodes.

• Agreement: No two correct nodes can decide on different values.
• Validity: If all initial values are same, nodes must decide that value.
• Termination: Correct nodes decide eventually.

Paxos runs in 3 phases: propose (phase-1), accept (phase-2), and commit (phase-3).

1. A node tries to become the leader by proposing a unique ballot number b to its followers with a phase-1a message. The followers acknowledge a leader with the highest ballot seen so far, or reject it with a ballot seen with a number greater than b. Receiving any rejection fails the candidate.
2. In the absence of a rejection, a node becomes leader and advances to phase-2 after receiving a majority quorum of acknowledgments. In this phase, the leader chooses a suitable value v for its ballot. The value would be some uncommitted value associated with the highest ballot learned in previous phase, or a new value if no pending value exists. The leader commands its followers to accept the value v and waits for acknowledgement messages. Once the majority of followers acknowledge the value, it becomes anchored and cannot be revoked. Again a single rejection message (carrying an observed higher ballot number) received in phase-2b nullifies the leadership of the node, and sends it back to phase-1 to try with a higher ballot number.
3. Finally, the leader sends a commit message in phase-3 that allows the followers to commit and apply the value to their respected state machines.

It's important to see that after phase-2, an anchored value cannot be overridden later as it is guaranteed that any leader with higher ballot number would learn it as part of its phase-1 before proposing a value in its phase-2.

You can find more information on Paxos in this blog, and especially Modeling Paxos in TLA+ could be of interest to you.

The traditional single-leader Paxos protocol employs a centralized leader to process all client requests and propose commands. The single leader takes a significantly heavier load than the other replicas, and becomes a performance bottleneck. Moreover, in geo-replication, clients not co-located with the leader need to send the requests to the remote leader, which incurs significantly higher wide-area network latency.

## Mencius approach

Mencius is a multileader version of Paxos that aims to eliminate the single leader bottleneck in Paxos. Mencius achieves load balancing by partitioning the consensus instances among multiple servers. E.g., if we have 3 servers, server 0 is responsible for acting as a leader for consensus instances numbered 0,3,6, server 1 for 1,4,7, and server 2 for 2,5,8, etc. Mencius tries to avoid the straggler problem by making the replicas skip their turns when they fall behind, however, it cannot fully eliminate the slow-down. Since it uses multiple leaders, Mencius also loses out on the "serve reads locally at the leader" optimization possible in Paxos.

EPaxos is a leaderless solution, where every node can opportunistically become a leader for some command and commit it. When a command does not interfere with other concurrent commands, it is committed in a single round after receiving the acks from a fast quorum (which is approximately 3/4ths of all nodes). In a sense, EPaxos compresses the phase-2 to be a part of phase-1 when there are no conflicts. However, if the fast quorum detects a conflict between the commands, EPaxos defaults back to the traditional Paxos mode and proceeds with a second phase to establish order on the conflicting commands.

Unfortunately, services like E-commerce and social network can generate high-contention workload, with many interfering commands on the same object from multiple clients. This problem is aggrevated in wide area network deployments: since requests take much longer time to finish, the probability of contention rises.

Spanner and CockroachDB are examples of databases that uses a federation of Paxos groups to work on different partitions/shards. These partitioned consensus systems employ another solution on top (such as vertical Paxos) for relocating/assigning data from one Paxos group to another.

WPaxos uses sharding of the key space and takes advantage of flexible quorums idea to improve WAN performance, especially in the presence of access locality. In WPaxos, every node can own some objects/microshards and operate on these independently. Unlike Vertical Paxos, WPaxos does not consider changing the object/microshard ownership as a reconfiguration operation and does not require an external consensus group. Instead WPaxos performs object/microshard migration between leaders by carrying out a phase-1 across the WAN with a higher ballot number, and commands are committed via phase-2 within the region or neighboring regions.

## The SDPaxos approach

The root of the inefficiency of leaderless and multileader protocols is the decentralized coordination pattern. Although decentralization addresses the single-leader bottleneck as every replica can propose commands, the replicas still need to agree on a total order on conflicting commands proposed by different replicas to avoid inconsistent state.

To address this issue, SDPaxos divides the consensus protocol in 2 parts: durably replicating each command across replicas without global order (via C-instance Paxos), and ordering all commands to enforce the consistency guarantee (via O-instance Paxos). Replicating via C-instance Paxos is completely decentralized where every replica can freely propose commands and replicate them to other replicas. This evenly distributes the load among replicas, and enables clients to always contact the nearest one. On the other hand, as part of O-instance Paxos, one replica is elected as the sequencer and handles the ordering in a centralized manner: the global view enables this replica to always order commands appropriately. Provided that the ordering messages are smaller than replication messages, the load on the sequencer will not be as severe as that on the single leader in Paxos.

Fault tolerance is provided as both the replicating and ordering instances are conducted based on Paxos. Each replica proposes commands in a series of C-instances of its own to produce its partial log. The sequencer proposes replicas' IDs in O-instances to produce an assignment log. Based on the assignment log, all replicas' partial logs are finally merged into a global log.

## Comparison with other protocols

The separation between C-instances and O-instances is the source of SDPaxos's advantages over existing protocols. The decentralization distributes load evenly across replicas, and allows clients to always contact the nearest replica in geo-replication to serve as the command leader. The O-instance leader, i.e., the sequencer, provides conflict free operation.

So, SDPaxos is like Paxos, but it has local leader in each region. This way it avoids the cost of going to the leader, and back.

Also SDPaxos is like EPaxos but with no conflicts, ever!

In SDPaxos, the sequencer is one node, but is backed by O-instances. A new sequencer can be chosen easily in a fault-tolerant way using Phase-1 of Paxos over O-instances. This alleviates the availability problems due to the serializer failure in systems that use Paxos for serializing the log in a central region. In such systems (e.g., Calvin) if the single log serializer is Paxos-replicated within a region, then the availability suffers on region failure. Instead, if the serializer is Paxos-replicated across regions then the performance suffers.

## The protocol

In this example, upon receiving a client request for a command, replica R0 becomes the command leader of this command, picks one of its own C-instance and replicates the command to others (using the C-accept, i.e., Accept phase message of the C-instance). In the meantime, this C-accept also informs the sequencer (R2) to start an O-instance for this command. Then R2 proposes R0’s ID in the next (e.g., the jth) O-instance and sends O-accepts to others, to assign this command to the jth global slot. Replicas will then accept these instances and send C-ACKs and O-ACKs to R0; R2 also sends an O-ACK as it has sent an O-accept to itself. The algorithm denotes the ith C-instance of Rn as Cni, and the jth O-instance as Oj.

A C-instance can come from any node without a Paxos phase-1a, because each replica has its own distinct replication log for C-instance. The C-instance messages do not conflict with each other and gets accepted immediately. The C-instance messages do not even need a ballotnum; the ballotnum used is that of the O-instance to denote epoch (i.e., which sequencer the sender thinks is still in-charge).

A command being ready requires the C-instance and enough number of O-instances be committed. The conditions of an instance being committed and a command being ready are defined in lines 18 through 31, which we discuss next. There are two questions here.

• The safety question is: How do we ensure that the replication is anchored (performed at the majority quorum) from the command leader perspective?
• The performance question is: How do we achieve consensus in one round while satisfying the safety concern?

## The 1-round feat

In the best case when O-instance of the sequencer overlaps perfectly with the C-instance of the command replication leader, consensus is achieved in one round-trip time ---the optimal possible. But, since the O-instance starts half a round trip later than the C-instance for non-sequencer replicas, it is not always possible to optimize the O-instance completion to just half a round trip to achieve the one-round-trip latency. But the paper shows how this can be achieved for N=3 and N=5 replicas. In groups with more than 5 replicas, the O-instances still need one round trip, thus the overall latency remains at 1.5 round-trips.

In 3-replica groups, when the command leader receives O-ACK from the sequencer, the majority (2 out of 3) is readily established for O-instance completion. This provides the one-round-trip consensus. (For the case, the command leader is also the sequencer, sequencing can be done in advance and one round-trip is satisfied as well.)

In 5-replica groups, a command can also be ready in one round trip, but unlike the case of three replicas, an O-instance cannot be accepted by a majority in half a round trip. Instead, SDPaxos lets each non-sequencer replica commit an O-instance, upon receiving the O-accept from the sequencer (line 29). Here, the O-instance does not rigorously follow Paxos, which raises a complication for recovery: if this non-sequencer replica and the sequencer fail, we cannot recover this O-instance simply by Paxos because the other alive replicas may have not seen the O-accept yet. However, the paper discusses a way for SDPaxos to correctly recover the O-instances of all replicas' ready commands even in such cases (omitted in my review), and is able to allow for 1-round-trip commits for N=5.

Note that the dynamic per key leader approach in WPaxos still has an edge over SDPaxos when there is good locality in the workload (which is often the case in practice) and/or when the number of replicas is greater than 5 (which is often the case for geo-replicated databases). It may be possible to use WPaxos for coordination across regions and integrate SDPaxos for coordination within the region upto 5 replicas.

## Optimizations

As an optimization for reads, SDPaxos uses sequencer leases to authorize the sequencer to directly reply to the read requests. In contrast, such an optimization is not possible for leaderless approaches, as there is no single/dedicated leader to lease and read from.

As another optimization, in some cases, it is possible to divide the responsibility of sequencer to all replicas for more load balancing. For example, in a key-value store, we can partition the key space using approaches like consistent hashing, then make each replica order the commands on one partition (commands on different keys can be out-of-order). Again, in this case, it would be possible to use the WPaxos approach for safe key-stealing among the sequencers, and dynamically adapt to the access pattern, rather than being confined to static partitioning.

## Evaluation

They implemented a prototype of SDPaxos, and compared its performance with typical single-leader (Multi-Paxos) and multileader (Mencius, EPaxos) protocols. The experiment results demonstrate that SDPaxos achieves: (1) 1.6× the throughput of Mencius with a straggler, (2) stable performance under different contention degrees and 1.7× the throughput of EPaxos even with a low contention rate of 5%, (3) 6.1× the throughput of Multi-Paxos without straggler or contention, (4) 4.6× the throughput of writes when performing reads, and (5) up to 61% and 99% lower wide-area latency of writes and reads than other protocols.

1. Does SDPaxos help with the leader bottleneck significantly?
I wrote above that "Provided that the ordering messages are smaller than the replication messages, the load on the sequencer will not be as severe as that on the single leader in Paxos." But on close inspection I don't think I believe that sentence anymore.  Ailidani, Aleksey, and I have done a detailed bottleneck analysis of Paxos protocol categories (under submission), and we found that the outcast messages are not the biggest source of bottleneck for the leader, as they are serialized once before being sent out to the replicas. The incast messages contribute most to the bottleneck, as the CPU needs to process them one by one and they queue up. Moreover, the incast messages are ACK messages, which are already small, and SDPaxos does not make them smaller. So, maybe SDPaxos does not improve significantly on the single-leader bottleneck in Paxos. On the other hand, it is true that SDPaxos helps distribute the client request load to the C-instance replicas relieving that bottleneck, and it definitely helps with lowering the latency in WAN.

2. How would you build reconfiguration of participants for this protocol?
Reconfiguration of participants is important to re-establish fault-tolerance capability by replacing failed replicas by accepting fresh new replicas to the system. How would reconfiguration work for SDPaxos? Would Raft's 2-phase reconfiguration method apply readily for SDPaxos?

3. What are additional challenges for efficient strongly-consistent geo-replication implementation at scale?
I am at Microsoft Azure Cosmos DB for my sabbatical. Cosmos DB recently introduced general-availability of multiple write regions. While providing strong-consistency with multi-master writes allowed from any region has cost across the globe, could SDPaxos ideas help improve efficiency further?

## Tuesday, November 6, 2018

### An unexpected phone call at the elevator

Sometime ago, I was visiting an apartment block. This was a relatively new apartment block, I think less than 5 years old.

I entered the elevator, and I heard "Hello, hello... Are you there?" booming from the speakers. This never happened to me before of course; elevators don't talk to me often. I thought maybe someone had pressed the call for help button, and left the elevator, and I can help sort this out.

I said, "Hi, yes!"

The voice continued saying:
-"The reason I am calling you is because of your last energy bill".

After a second of cognitive dissonance, I started laughing:
-"Umm, I don't know what happened, but I am in an elevator now, and your voice is literally streaming from the elevator."

The voice stopped for a brief couple of seconds, then continued explaining something about the bill, following the script he was given.

-"Dude, this is my floor, and I got to go now. Ok, bye."

It turns out most elevator phones ---which are mandated by law to allow you to call for emergency services in case you get stuck in the elevator--- are public network connected phones. And this occasionally leads to receiving phone calls at the elevator. This short youtube video is so funny. Unfortunately I was so shocked by the phone call it didn't occur to me to record it.

From now on, this is how I will answer any unwanted spam call I receive: "Dude, I don’t know what happened, but I am in an elevator now, and your voice is literally streaming from the elevator. This is my floor, I gotto go now."

1) If we have Internet-of-Things vision realized, do I have this to look forward to? Collection calls at the elevator? Or when trying to drive my car?

2) What is the worst malicious thing you can do if you have (or can find) the phone numbers for elevators at apartment buildings or workplaces?

Huh, of course, there is prior art on the Web.

## Monday, November 5, 2018

### Judoing the Dunning-Kruger effect: the "surprisingly-popular option" strategy for crowdsourcing

Let's say you want to crowdsource the answer to the question:
(Q1) What is the capital of Brazil?

The surprisingly-popular option strategy for crowdsourcing suggests piggybacking a control question to the real question Q1:
(Q2) What do you think the majority of other people will respond to this question?

Many people (non-Brazilian and non-geography-nerds) will answer with Rio for both Q1 and Q2. But there will be some people that will answer with Brasilia for Q1 and yet Rio for Q2.

The first set of people that replied with Rio to both questions did not know much about Brazil, and went with what they know as the most prominent city in Brazil. The second set of people not only knew of the correct answer, Brasilia, but they also anticipated that the majority of participants will go wrong by answering with Rio.

Rio is the popular option for Q1, but Brasilia is the surprisingly popular option because the respondents for Brasilia had anticipated that Rio would be the popular option.

The lesson is: "People who expect to be in the minority opinion deserve some extra attention."

(I wish this would have occurred to us 4 years ago, when we were collecting crowdsourced data from our "who wants to be a millionaire" app.)

1. What are similar judoing techniques?
I am fascinated with this technique, because this is like applying a judo-throw on the Dunning-Kruger and reversing its effects.

It takes a weakness observed in the Dunning-Kruger effect (that people of low ability suffer from illusory superiority and mistakenly assess their cognitive ability as greater than it is), and transforms that in to a strength for identifying the expert voices among the participants, the people who knew the answer to Q2 but provided a minority answer for Q1.

What are similar judo moves in computer science/technology and in general?

2. What about the conspiracy nuts?
This surprisingly-popular option technique has a conspiracy nuts problem, isn't it? There is inevitably a strongly opinionated group that will be in the minority and they will correctly anticipate that they will be in the minority. This technique does not filter for them, right?

So, let's take anti-vaxxers. They will say "no" to (Q1) "Should I vaccinate my kids?" and they will correctly guess the popular option "yes" to (Q2) "What do you think the majority of other people will respond to this question?"

How do we fix this issue with the method? Maybe we can add a (Q3) What do you think the minority choice is and why it is wrong? Of course we are moving away from the automatic aggregation in the original method, but the essence of the idea still holds: Are you aware of the other opinions on this question, and can you explain and refute them?

Can you not only strawman, but also steelman the other arguments and address them?

3. How does this apply to me as I assess my views/positions?
For the hygiene of the mind, it is important to reflect on your beliefsets/behaviors/positions and reassess them occasionally. Am I able to explain why I am in the majority or minority side for my positions? Do I understand the other positions and not only refute them but can also appreciate some valuable points in them?

Some other related posts from my blog on this are:

## Monday, October 8, 2018

### Everything is broken

Last Wednesday, I attended one of the monthly meetings of the "Everything is Broken" meet up at Seattle. It turns out I selected a great meeting to attend, because both speakers, Charity Majors and Tammy Butow, were excellent.

Here are some select quotes without context.

## Observability-driven development - Charity Majors

Chaos engineering is testing code in production. "What if I told you: you could test both in and before production."

Deploying code is not a binary switch; deploying code is a process of increasing your confidence in your code.

"Microservices are hard!" as a caption for a figure comparing the LAMP stack 2005 versus the complexity of the Parse stack 2015.

We are all distributed systems engineers and unknowns outnumber the knowns!
Distributed systems have an infinite number of almost-impossible failures!

Without observability you don't have chaos engineering, you have a chaos.

Monitoring systems have not changed significantly in 20 years, from Nagios. Complexity is exploding everywhere, but our tools are designed for a predictable world.

Observability for software engineers: can you understand what is happening inside your systems, just by asking questions from the outside? Can you debug your code and its behavior using its output?

For the LAMP stack monitoring was sufficient for identifying the problems.
For microservices, it is unclear what we are supposed to monitor for. We need observability!
The hard part is not debugging your code, but to find which part to debug!

Facebook's  Scuba was ugly, but it helped us slice and dice and improve our debugging! It improved things a lot. I understand Scuba was hacked to deal with MySQL problems.

You don't know what you don't know, so dashboards are very limited utility. Dashboards are only for anticipated cases: every dashboard is an artifact of past failures. There are too many dashboards, and they are too slow.

Aggregates are the kiss of death; important details get lost.

Black swans are the norm; you must care about 99.9%, epsilons, corner cases.

Watch things run in production in the normal case; get used to observing your systems when they aren't on fire.

## Building Resilient Systems Using Chaos Engineering - Tammy Butow

Chaos engineering is "thoughtful planned experiments designed to show weak points in the system".

Top 5 popular ways to use chaos engineering now: kubernetes, kafka, aws ecs, cassandra, elasticsearch.

Fullstack chaos engineering: inject faults at api, app, cache, database, os, host, network, power

We are exploring a new direction and collaborating with the UI engineers on ways to hide impact of faults.

prerequisites for chaos engineering:
1. monitoring & observability
2. on-call & incident management
3. know the cost of your downtime per hour (British Airlines's 1 day outage costed \$150 millon)

How to choose a chaos experiment?
+ identify top 5 critical systems
+ choose 1 system
+ whiteboard the system
+ select attack: resource/state/network
+ determine scope

How to run your own gameday: http://gremlin.com/gameday

Outage post-mortems: https://github.com/danluu/post-mortems

First chaos engineering conference this year: http://twitter.com/chaosconf

## Some notes about the venue: Snap Inc

There were fancy appetizers, very fancy. They had a kitchen there at the fifth floor (and every floor?). Do they provide free lunch to snap employees?

At the 5th floor, where the meeting took place, we had a great view of Puget Sound bay. The Snap building is just behind the Pike Market Place. There were about 80-100 people. I think the 30+ folks outnumbered 40+ folks, but not severely. Good show up from female engineers. There was ambient music in the beginning from 6-6:30pm, but it was loud.

By the way, I never used snapchat... I am old. But I don't have a Facebook account, so maybe I am not that old.

1. Do you need to test in production?
The act of sabotaging parts of your system/availability may sound crazy to some people. But it puts forth a very firm commitment in place. You should be ready for these faults, as they will happen in one of these Thursdays. It establishes a discipline that you would test, gets you prepared with writing the instrumentation for observability, and toughens you up. It puts you into a useful paranoid mindset: the enemy is always at bay and never sleeps, I should be ready to face attacks. (Hmm, here is an army analogy: should you train with live ammunition? It is still controversial because of the lives on the line.)

Why not wait till faults occur in production by themselves, they will happen anyways. But when you do chaos testing, you have control in the inputs/failures, so you already know the root cause. And this can be give you much better opportunity to observe the percolation effects.

2. Analogies for chaos engineering
I have heard vaccination used as an analogy. It is a tactful analogy (much better than the live firing analogy). Nobody can argue against usefulness of vaccinations.

Other things chaos testing evokes could be blood letting and antifragility. I had read somewhere that the athletes in ancient Greek would induce a diarrhea on purpose a couple weeks before competitions, so that their body can recover and get much stronger at the time of the competition. I guess the reasoning goes as "too much of a monotone is a bad thing" and it is beneficial to stress/shake the system to avoid a local maxima. That reminds me of this YouTube video I show in my distributed systems class on the topic of resilience.

3. Debugging designs with TLA+
Even after you have a verified design, the implementation can still introduce errors, so using chaos engineering tools is valuable and important even then.

It helps even for "verified" systems for its nonverified parts:
Folks encouraged us to try testing verified file systems; we were skeptical we would find anything, but to our surprise, when we tested MIT’s FSCQ file system, we found it did not persist data on fdatasync()! Apparently they had a bug in the un-verified portion of their code (Haskell-C bindings), which was caught by Crashmonkey! This shows that even verified file systems have un-verified components which are often complex, and which will have bugs.

4. Chaos tag
Turns out I have several posts mentioning chaos engineering, so I am creating a chaos tag to be available for use for future posts.

## Sunday, October 7, 2018

### Debugging designs with TLA+

This post talks about why you should model your systems and exhaustively test these models/designs with the TLA+ framework. In the first part, I will discuss why modeling your designs is important and beneficial, and in the second part I will explain why TLA+ is a very suitable framework for modeling, especially for distributed and concurrent systems.

## Modeling is important

If you have worked on a large software system, you know that they are prone to corner cases, failed assumptions, race conditions, and cascading faults.

There are many corner cases because there are many parameters, and these do interfere in unanticipated ways with each other. The corner cases violate your seemingly reasonable implicit assumptions about the system components and environment, e.g.,"1-hop is faster than 2-hops", "0-hop is faster than 1-hop", and "processes work with the same rate". There are abundant race conditions because today (with the rise of SOA, cloud, and microservices) all systems are distributed systems. Code that is supposedly "atomic block of execution" fails due to other processes executing concurrently. Finally, faults happen and their effects are almost always underestimated pre-deployment. Faults take your system to unanticipated states, and from there on with the interleaving of recovery actions with normal system actions, the system may be thrown to even more unanticipated states.

In large software systems, which are inevitably distributed systems, there are many unknown-unknowns and an infinite number of highly-improbable ways things can go wrong. Human brain and reasoning cannot scale to handle all these possibilities. To alleviate these problems, the industry developed tools for better observability and even testing in production for improving availability. These tools are very important and indispensable. But by the time you figure out some inherent problems with your design it may be too hard and expensive to fix things. What you thought would be the last 10% of the project ends up taking 90% of your time at production and operations.

If you model your designs first and exhaustively test and debug these models for correctness against corner cases, failed assumptions, concurrency, and failures, you can catch errors at the design time and fix them before they develop into problems and become costly to fix.

• Modeling first does not extend your development time, on the contrary it saves you time by reducing futile development attempts. Embarking on development with a flawed design almost always ensures that the implementation is flawed. While having a precise and correct model at hand does not guarantee that your implementation of the model is correct, it helps you avoid the big/intricate problems and also provides a good reference for testing your implementation against.
• Constructing a precise model of your system gives you clarity of thinking and supports your development immensely.  By modeling you discover about the inherent complexities of the problem; that helps you focus your attention and ignore accidental/byproduct complexities.
• The model also helps you to communicate precisely with your team and others as you avoid the ambiguity of natural language and the hand-waving and generalizations involved.
• Finally with the model at hand, you also have a chance to gradually introduce design decisions, and see alternative ways to implement the design.

## TLA+ is great for modeling

TLA+ is a formal language for describing and reasoning about distributed and concurrent systems. It is developed by Dr. Leslie Lamport, Turing Award winner 2013. Lamport is a very important figure in distributed systems due to his logical clocks work and Paxos work among many others. For the last decade, he is very involved with improving the TLA+ framework to help make distributed systems more manageable.

TLA+ uses basic math to model and reason about algorithms: practical logic, set theory, and temporal logic are used for specifying  systems. Best of all, the framework integrates a model checker that exhaustively tests your models to the face of corner cases, failed assumptions, concurrency, and failures. The model checker tries all executions possible for your model and tells you for which executions, your invariants and system guarantees break.

Invariant-based reasoning
TLA+ framework promotes invariant-based reasoning to prevent the problems that arise from operational reasoning. In operational reasoning, you start with a "happy path", and then you try to figure out "what can go wrong?" and how to prevent them. Of course, you always fall short in that enumeration of problem scenarios and overlook corner cases, race conditions, and cascading failures. In contrast, invariant-based reasoning focuses on "what needs to go right?" and how to ensure this properties as invariants of your system at all times. Invariant-based reasoning takes a principled state-based rather than operation/execution-based view of your system.

To attain invariant-based reasoning, we specify safety and liveness properties for our models. Safety properties specify "what the system is allowed to do". For example, at all times, all committed data is present and correct. Liveness properties specify "what the system should eventually do". For example, whenever the system receives a request, it must eventually respond to that request. In other words, safety properties are concerned with "nothing bad happens", and liveness properties with "something good eventually happens".

Modeling with TLA+
The TLA+ framework supports you in building a model and figuring out its invariant properties in two major ways. Firstly, the math-based formal language helps you achieve precision while still working with high-level declarative statements. Secondly, the integrated model checker exhaustively debugs your model to the face of concurrency and failures, and produces counterexamples for which your candidate invariants fail. (After years of working with TLA+, I am still surprised about the counterexamples the model checkers spit out for my models: It is very easy to overlook some scenarios, but the model checker sets you straight.) You address these problems by improving your model or sometimes by relaxing your candidate invariants, and after many iterations converge to an exhaustively debugged model which guarantees the invariants.

Building a TLA+ model is beneficial even for systems that are already implemented and running. Through building the model, you learn about your system better, and figure out some latent failure modes and correct them before they occur in production.

Finally, maintaining a TLA+ model of your system provides important benefits for continuous development. While software systems need to be extended with new features frequently, these extensions may interfere in unanticipated way with the system and lead to downtimes. With the TLA+ model at hand, you can first add these features to your model, and catch/debug the problems at the design-level using the model-checker. This way you resolve potential issues before they even become problems.

TLA+ is practical
Since using TLA+ actually saves time for building large software systems, TLA+ modeling is adopted as a practice by many software companies.

I am on sabbatical at Cosmos DB, Microsoft globally distributed cloud-native database. The team has been using TLA+ to model the replication and global distribution protocols and exhaustively tests the designs for correctness against failures. We have recently published the customer-facing part of the model which precisely defines the 5 consistency levels offered by Cosmos DB.

Amazon has also used TLA+ modeling for some of their AWS offerings and has written a nice experience report on this. There are also reports of using TLA+ for modeling hardware systems as well.

For the last 4 years, I have been incorporating TLA+ in my distributed systems classes. TLA+ enables students to learn about concurrency and invariant-based reasoning and it provides them hands-on experience with distributed protocols. I also use TLA+ exhaustively in my research on new distributed algorithms.

In my experience, it is possible to pick up TLA+ up in a couple weeks. This is firstly because TLA+ adopts a very simple state-machine approach to model systems. A system consists of: (1) A set of variables which define the state of the system, and (2) A finite set of assignments/actions that serves to transition the system from one state to another.

Furthermore, PlusCal provides syntactic a sugar for the TLA+, which has a tendency to grow long (due to its low-level state-transition centric syntax) and look cryptic for some people. PlusCal is a pseudocode for writing algorithms at a higher-level of abstraction, and it is translated to the underlying TLA+ specifications for model checking. To give you some idea about the PlusCal, here is an example of a PlusCal code for a database replica process. While this is a straightforward code, you can see a nondeterministic choice construct "either or" in action. The model checker will exhaustively test all possible combinations of these "either or" actions and check if a certain sequence would break one of your safety and liveness specifications.

There is a very active TLA+ forum at Google Groups. Leslie Lamport chimes in several threads.

LearnTLA provides a user-friendly introduction to TLA+/PlusCal.

## Wednesday, September 26, 2018

### The last mile problem in trust

Blockchains are supposed to solve the trust problem. But blockchains attack only the easy part of the trust problem, and avoid the hard part. The easy part is to store the transactions in a tamper-resistant database. The hard part is to attest to physical world actions and state.

The blockchain is a database technology and it does not attempt to attest to physical world actions/state. It solves the problem of tamper-proofing the state after it is added to the database. It doesn't attempt to validate/test/certify if the state is correct as it is added to the database. If humans create the state, there is inherently a trust problem: Were the lettuce bad before it was loaded to the trucks, or are the truck conditions to blame? Did the farmer or the trucker lie?

If sensors create the state, this is still a very hard problem, but not because the sensors may have been tampered with ---that is a relatively easy problem to solve in hardware. The problem is hard because of the corner-cases involved; how do you even start to pretend that the sensors have complete coverage (or good/fair sampling) and the detection/verdict is accurate? It is really a very complex and messy problem. As far as complete coverage of food supply-chains are concerned, you need DNA-sequencing and metagenomics.

This is a classic last mile problem. The last mile problems are always hardest to solve because of the massive fan-out both in terms of scale and in terms of corner cases to handle. The last mile problems haunted many domains, most notoriously the telecommunications and transportation domains.

## Walmart, Lettuce, and Blockchains

A couple days ago there was a lot of hype about Walmart starting to use blockchain in its supply chain, to pinpoint where the lettuce come from in an E.Coli contamination event.

Ok, let's get to the bottom of this. "Walmart, Lettuce, Blockchain." It felt very weird to type this in Google search, but I did it anyways... for science.

See, I knew there was a lot of hype: "The giant retailer will begin requiring lettuce and spinach suppliers to contribute to a blockchain database that can rapidly pinpoint contamination."

De-hyped, this just says Walmart wants the farmers to record transactions in a database.  And actually the article makes sense if you replace blockchain with database:  "Walmart says it now has a better system for pinpointing which batches of leafy green vegetables might be contaminated. After a two-year pilot project, the retailer announced on Monday that it would be using a blockchain, the type of database technology behind Bitcoin, database to keep track of every bag of spinach and head of lettuce."

I blame IBM's over-excitement in blockchains for the hype in the article. Supply-chains is a very complex topic, and this use of a database to record information doesn't come close to scratching the surface of it. There are many automation and logistics problems that remain to be solved. And the dreaded last mile problem of course.

### 1. What is the nature of trust?

What or who do you trust?

Trusting a deterministic machine with few inputs/environmental-parameters is reasonable. Especially if you verified and validated it, and tested it extensively.

But what would make you trust humans? Humans are complex nondeterministic beings, and the input and environment surrounding humans are also very complex.

Reid Hoffman defines trust as consistency through time. But this is assuming the conditions don't change. If conditions change, that is the inputs/environmental conditions change, the other side can change its actions.

The answer to the trust puzzle has got to do with "consequences", right?

It is easier to trust in a situation where you have little to lose, but the other side has a lot at stake. And ironically, this makes the other side have problems trusting you, since you have little at stake, and she is risking a lot. For mutual trust and better collaboration, all parties should have skin in the game.

So what is at stake? This can be reputation, if reputation is a currency valued by the individual and his environment. What is at stake can be jail time, if one breaks laws and get caught. This is assuming one doesn't enjoy jail. Under certain conditions, people commit crimes to get into jail to get fed and have reliable healthcare, and even not to feel lonely.

I think trust is not complicated, rather the calculation, alignment, and managing of consequences/incentives is complicated. And this again harkens back to the last mile problem in trust.

I believe the parties involved are going to push the limits of what they can get away with as long as the deterrents do not outweigh the incentives.

I don't know if there is a technology solution here.

At a recent A16Z podcast, one speaker was rightfully complaining that we have a lot of trust issues and fight among complementary business rather than substitute/alternative business. For example even though iphone apps and iphone platform are complementary businesses, there is a lot of fight there. Or consider the Yelp versus Google fight. Or the fights Facebook, the platform, picks with the applications it enables. The speaker was implying that with the right incentivization and cuts from cryptocurrencies like ethereum gas, the parties will actually synergize and grow together rather than fight.

This sounds nice and simple, but I don't think I buy this. The fights are due to the greedy nature of humans and companies. To repeat what I said said above, I believe the parties involved are going to push the limits of what they can get away with as long as the deterrents do not outweigh the incentives. Even if cryptocurrencies and Ethereum gas is used between platforms and applications enabled, next we will see fights over how much of the payment is fair etc. I don't know if technology can fix that. Maybe this is supposed to be a dynamic equilibrium with constant push-backs and small-battles erupting from the parties involved.

### 2. What is the verdict?

I don't hate/despise blockchains, as I have seen some colleagues do. That is a radical and unreasonable position. There are many smart people working on this domain, they cannot be all and completely wrong.

I am still ambivalent about blockchains. I believe there is still a big contribution potential coming from blockchains and smartconracts. But the hype news make things harder to see.

## Saturday, September 8, 2018

### Book review. Ignorance: How it drives science

I picked this up from my local library, because the title was interesting. I wrote about this earlier.
Once you get a B.S., you think "you know everything". Once you get an M.S., you realize "you know nothing". Once you get a Ph.D., you realize that "yes, you know nothing, but that is not a problem, because nobody knows anything!"
This turned out to be a nice read. The author, Stuart Firestein, has a very interesting background. He was working at a theater, and started a biology undergraduate at 30, and got his PhD at 40.

Here are some tidbits from the book.

Leibniz. page 38
The 17th-century German philosopher and mathematician Gottfried Leibniz, one of the inventors of calculus, had a lifelong project to construct a "basic alphabet of human thoughts" that would allow one to take combinations of simple thoughts and form any complex idea, just as a limited number of words can be combined endlessly to form any sentence -- including sentences never before heard or spoken. Thus, with a few primary simple thoughts and the rules of combination one could generate computationally (although in Leibniz's day it would have been mechanically) all the possible human thoughts. It was Leibniz's idea that this procedure would allow one to determine immediately if a thought were true or valuable or interesting in much the same way these judgments can be made about a sentence of an equation -- is it properly formed, does it make sense, is it interesting? He was famously quoted as saying that any dispute could be settled by calculating-- "Let us calculate!" he was apparently known to blurt out in the middle of a bar brawl. It was this obsession that led Leibniz to develop the branch of mathematics known today as combinatorics. This in turn sprang from the original insight that all truths can be deduced from a smaller number of primary or primitive statements, which could be made no simpler, and that mathematical operations (multiplication was the one Leibniz proposed but also prime factorization) could derive all subsequents thoughts. In many ways this was the beginning of modern logic' indeed, some consider his /On the Art of Combinations/ the major step leading from Aristotle to modern logic, although Leibniz himself never made such claims.

Godel. page 41
What Godel showed, using a strange new correspondence between mathematics and logic that he invented, was that if a system were the rules of that system. This means that something that could be shown to be true using the system could not in fact be proved to be so. Since proofs are the foundation of mathematics, it is quite curious when obviously true statements cannot be proved.

Godel. page 42
Was this the end of the messianic program to establish the primacy of mathematics and of logical thinking? As it turns out, quite the contrary. Godel's small, by comparison, but revolutionary output is so asttonishing because of the technical and philosophical research opportunities it has created. Previously unconsidered ideas about reccursiveness, paradox, algorithms, and even consciousness owe their foundations to Godel's ideas about imcompleteness. What at first seems like a negative --eternal incompleteness-- turns out to be fruitful beyond imagining. Perhaps much of computer science, an area one might think was most dependent on empirical statements of unimpeachable logic, could not have progressed without the seminal ideas of Godel. Indeed, unknowability and incompleteness are the best things that ever happened to science.

Hilbert. page 48
In fact, one of the most predictable things about predictions is how often they're wrong. Nonetheless, they are a measure, even if somewhat imprecise, of our ignorance. They are a catalog of what we think the important ignorance is, and perhaps also a judgment of what we think is the most solvable ignorance.

Ignorance is not just an excuse for poor planning. We must think about how ignorance works, and we have to be explicit about how to make it work to our advantage. While for many experienced scientists this is intuitive, it is not so obvious to the layperson, and it often seems not so apparent to young scientists starting out their career and worrying about grant support and tenure.

Grants. page 59
How do scientists ponder these big questions about ignorance? How do they get from these and other interesting and important issues to an actual scientific research program? Well, at the most pedestrian, but nonetheless critical level, there are grant proposals. Every scientist spends a significant percentage of his or her time writing grants. Many complain about this, but I actually think it's a good idea. These documents are, after all, a detailed statement of what the scientist hopes to know, but doesn't, as well as  a rudimentary plan for finding it out.

Models. page 70
This strategy of using smaller questions to ask larger ones, is, if not particular to science, one of its foundations. In scientific parlance this is called using a "model system". As Marvin Minsky, one of the fathers of artificial intelligence, points out, "In science one can learn the most by studying the least". Think how much more we know about viruses and how they work than about elephants and how they work. The brain, for example, is a very complicated piece of biological machinery. Figuring out how it works is understandably one of humankind's great quests. But, unlike a real machine, a man-made, designed machine, we have no schematic. We have to discover, uncover, the inner workings by dissection-- we have to take it apart. Not just physically but also functionally. That's a tall order since there are some 80 billion nerve cells that make up the human brain, and they make about 100 trillion connections with each other. ... So instead of a human brain, neuroscientists study rat and mouse brains, fly brains because they  can do some very fancy genetics on them, or even the nervous system of the nematode worm, which has exactly 302 neurons.

1. What do you feel ignorant about in your line of work?

## Friday, August 31, 2018

### TLA+ specification of the bounded staleness and strong consistency guarantees

In my previous post, I had presented a TLA+ modeling of distributed data store that provides the consistent prefix property. In this post, I extend this model slightly to build bounded and strong consistency. In fact the strong consistency specification is achieved when we take the Delta on the bounded consistency as 1.

The TLA+ (well, PlusCal to be more accurate) code for this model is available at https://www.dropbox.com/s/qvmhhgjf9iycaca/boundedstrongP.tla?dl=0

## The system model

As in the previous post, we assume there is a write region (with ID=1) and read regions (with IDs 2 through NumRegions). The write region performs the write and copies it to the read regions. There are FIFO communication channels between the write and read regions.
WriteRegion == 1
chan = [n \in 1..NumRegions |-> <<>>];

We use D to denote the Delta on the bounded staleness consistency. Bounded staleness ensures that read results are not too stale. That is, the read operation is guaranteed to see at least all writes that precedes D  number of updates before the read started. The read may potentially see some more recently written values.

Strong consistency ensures that a read operation returns the value that was last written for a given object. This is achieved by using D=1.

## The write region actions

The write region has 3 actions to perform: Commit the write in the write region, Multicast the write to the read regions, and Receive/process an ack message from a read region.

These actions can be performed in any arbitrary order inside the process loop, and the model checker will methodically explore all possible combinations of these actions interleaved with the read region actions to expose any flaws in the protocol.

The first action selection is to commit a write in the write region. (We don't get into how that is implemented in the region; maybe commit is done by replicating the update to a write quorum in the region.) As a result the CommittedLSN is incremented. Note that, in contrast to prefix consistency model, in the bounded staleness model the write is throttled to not advance more than D ahead of any of the read regions.

The second action selection is to multicast a committed write to the read regions through the FIFO channels. and this is an asynchronous replication. These may be sent whenever this action is chosen, and is not blocked on waiting any acknowledgements from the read regions.

This action is exactly the same as in the prefix consistency model. The SentLSN variable denotes the last CommittedLSN write that is forwarded to the read regions and is used for ensuring ordered replication.

The final action selection is to receive an Ack message from the channel from a read region. The Progress variable keeps track of the Ack messages, and the CompletedLSN is updated to reflect the highest write that is acknowledged by all the read regions. Harking back to action 1, notice that the write region lazily disseminates this CompletedLSN information with the read regions by piggybacking this to the commit-write messages. In this model the read-regions do not utilize this CompletedLSN information, but as I start to explore in the MAD questions, this can be useful.

The read regions only react to the replicated messages from the write region. The first action selection is to receive a message pending in the channel from the write region. The second action selection is to send back an Ack message for any replication message that is not acknowledged yet. The actions are almost the same except for the line updating the CompletedLSN at the read region.

## Invariant checking and testing

The consistent prefix invariant still holds for this model as we refined that model to obtain this one.
CP == [][\A i \in ReadRegions:
CommittedLSN'[i] = CommittedLSN[i]
\/ CommittedLSN'[i] = CommittedLSN[i] + 1]_vars

The BoundedC invariant is to check that the read regions are always maintained to be within the staleness bound of the most recent CommittedLSN.  (Since I used CommittedLSN variable for both read and write regions, the TLA translator assigned CommittedLSN_ "with underscore" to the write region's version to distinguish it from that of the read regions.)
BoundedC  == \A i \in ReadRegions :
CommittedLSN[i]=< CommittedLSN_[1]
/\ CommittedLSN[i]>= CommittedLSN_[1] -D

The SyncStep invariant is to check the relationship between the CompletedLSN at the write region and the copies maintained at the read regions.
SyncStep  == \A i \in ReadRegions :
CompletedLSN[i] =< CompletedLSN_[1]
\/ CompletedLSN[i] > CompletedLSN_[1] -D

I first wrote this predicate with "CompletedLSN[i] > CompletedLSN_[1] -1" but the model checker was quick to tell me I was wrong. This is bounded by D and "not 1" as receive operations at the read regions can be asynchronous within the D staleness bound. Here the write region received Acks for its two commits back to back so the CompletedLSN at the write region was 2 versions ahead of those in the read regions.