Wednesday, December 12, 2018

How to fix scholarly peer review

All academics invariably suffer from an occasional bad and unfair review. We pour our sweat (and sometimes tears) over a paper for many months to watch it summarily and unfairly shot-down by a reviewer. Opening a decision email to read such a review feels so sickening that it hurts in the stomach. Even after 20 years, I am unable to desensitize myself to the pain of being shot-down by an unfair review. I suspect quite many people quit academia being frustrated over bad reviews.

Peer review is a complex and sensitive topic, and this post will inevitably fall short of capturing some important aspects of it. But I believe this topic is so important that it deserves more attention and conversation. Here I first write a bit about how to handle bad reviews. Then I outline some problems with our current peer review process and suggest some fixes to start a conversation on this.

The first rule of review club

The first rule of review club is to always give the benefit of doubt to the reviewers. We are all partial about our work: if we didn't think highly of it, we wouldn't have worked on it so hard, and submitted it to a  good venue. I suggest that you avoid responding to the reviews the first day. Chances are that you will be too busy processing your emotions to have bandwidth to process the reviews calmly. So sleep on it, read the reviews again the next day. Give it your best shot to see what gets criticized. Try to see what got misunderstood. If some reviewers had this problem, several of your readers will have similar problems as well. Can you improve your presentation to avoid that problem? How can you address/disarm the criticisms from the reviewers? What is the opportunity to improve here? The idea is to learn whatever you can learn from the reviews, and become so good they can't ignore you.

The second rule of review club

The second rule of the review club is that occasionally you get a review so malicious, so bad that you are stumped. This is inevitable. Welcome to the club!

What do you do now?

Again exercise caution. Give it a day, meet with your coauthors. Try to give the benefit of the doubt to the reviewers. If you still cannot justify in any way that the reviewer is not malicious and unfair, you should contact the PC chairs or Editor to complain about the behavior. This often gets ignored, so don't expect much.

Basically you swallow this abuse of power and move on. You try not to get too upset by many months wasted waiting for this rejection, and try to plan ahead for another publication venue.  Maybe you make a mental note to avoid this conference and journal. But isn't that penalizing yourself, and not the guilty party?

If you like to fight the unfairness, you don't have much leverage. You can make the reviews public to expose the unfairness to community. But that is about it, and this probably won't make much difference either. Yes, very frustrating. Very unfair. I know.

The current peer review process

This is how the current scholarly peer review works. You submit your paper, and it gets reviewed by (often) 3 anonymous reviewers. After 3 months (or 6 months for the journals), you get the reviews and the decision in the email. The rating is generally average of the reviewers, and a bad review damages your chances especially severely for conferences.

The blind reviewing works well when the reviewers are conscientious, and the PC-chairs/editors take meticulous care in their job to oversee the reviewers. But note that there is an asymmetric power relation here. The problem is that anonymous reviewers don't have much skin in the game, and the process is open to abuse.

Good intentions don't work, good systems work. And there is obviously something broken with our reviewing system today. The quality of reviews within the same conference or journal is uneven and varies wildly. Reviewers, who are overworked academics, are overwhelmed with lots of papers to review. Competent reviewers are especially overwhelmed. As an associate editor at IEEE TCC, when I try to find reviewers for papers, my usual success rate is 3 over 8-10. I need to contact up to 10 reviewers to find 3 that accepts to review the paper.

The basic recipe

Don't tear down a fence before you understand what it does. As tempting it is, we can't rush to do away with blind review. Disclosing the reviewer names may also cause problems. The authors are partial about their work, and some may unfairly retaliate to the reviewer. Secondly, without the anonymity cloak, the reviewers may not be critical enough of some submissions from certain high-profile faculty/schools.

So with due caution and awareness of the complexity of peer review, I provide some suggestions to improve the system in good faith. This proposal has holes and difficulties in implementation, but I hope it serves to start the conversation.

While I am not confident in the specifics of the implementation, I think the high level solution is clear. We need to incentivize the reviewers and then demand accountability from them. So here are my suggestions to this end.

1. Compensate the reviewers

Reviewing is a thankless job. Conference reviewers get some recognition, as their name appears in the program committee page of the conference (but this gets old quickly). Journal reviewers don't even get that.

The reviewers do these reviews pro bono and are overworked. You may not realize it from outside but academics work really hard between teaching, research, departmental administration duties, and other professional duties.

Did I mention that reviewers and Editors don't get paid a dime? We are talking about domain experts here, who could easily make \$200 an hour consulting. I think we owe it to the reviewers to compensate them for their contribution in the publishing process. Can the conferences and professional organizations afford to pay reviewers? I don't see why not. I have not been involved in any conference organization, so I may be wrong here, but please humor me. A conference with 1000 attendees with \$1000 per registration fee (not uncommon) makes a total of  \$1 million. Where does this money go? Definitely not to the dubious quality conference swag. Hotels may be expensive, but not that much. If hotels are taking most of that money, we need tougher negotiators on the conference organization committees. The hotels already get a good deal with attendees staying there for the duration of the conference. Remember Chesterton's fence. We should of course consider what type of side-effects compensating the reviewers may have. Could this lead to a clique of friends who recruit each other for reviewing gigs? I don't know. If the compensation is not high, and if we keep reviewers accountable with respect to the quality of the reviews, this may not be a problem. If money is too messy, provide free conference registration to the reviewers. Even if we don't compensate the reviewers, at least we need to set up a system to prevent freeloading. If you don't help with reviewing, you don't get to have your papers reviewed. If you are a conscientious reviewer, maybe you get to have a fourth reviewer? 2. Provide an author-feedback phase Conferences should provide a rebuttal phase to give the authors a chance to respond to the reviewers' criticisms. Although limited in their effectiveness/consequence, this response phase still gives a voice to the authors. As an extra point, I really liked what SIGMOD did with their author feedback; they explicitly asked the authors to report on any bad/offending reviews. In journals, there is no rush. So even for rejected papers, the journal may provide a response opportunity, and the authors get to present their responses to the reviewers. 3. Form a grievance committee To give more voice/faculty to the authors, a grievance committee can be formed to inspect the reviews suspected of foul play. The committee shall inspect the situation, consult with the other reviewers on the paper, and write a report on the decision. Maybe this is on the crazy side, but here it is: It may even be possible to publicize the name of a malicious or severely negligent reviewer. (There may be a decentralized cryptographic signing solution under which two reviewers may make the name of the third reviewer visible if they agree on neglect/abuse by the third reviewer. Crypto geeks, is it possible to implement a multisig solution for this on hotcrp soon?) 4. Take ownership/responsibility for the reviews As PC chairs or journal editors, you should take responsibility of the quality of the reviews provided to the authors. You should not blindly sign on the reviews, as at the end of the day the quality of the reviews provided to the authors is your responsibility. In addition, in-person PC meetings (when feasible) is good for enforcing accountability for the reviewers. Again the travel for PC members should be paid for by the conference registration fees, if an in-person PC meeting is established. Reviews can be rated by other reviewers and feedback in the form of blind review can be provided to the reviewers. These feedback can help train reviewers to avoid common mistakes: being hypercritical on a minor point, failing to see the overall value provided, being prejudiced towards certain methods/approaches, making unsupported claims, etc. We may even consider pair-reviewing to complement peer-reviewing. Finally, as a reviewer, you should consider voluntarily signing your names on your review. The idea here is to keep yourself accountable by voluntarily giving up your anonymity. The signing of the name decision should be made before the review assignments and one should not be allowed to sign reviews only for acceptances. I have seen some people do this. And I will give this a try myself. In my experience, a reject decision doesn't hurt if the reviewer supports her position well and put in the work to understand and fairly evaluate the work with respect to the cohort of papers submitted. So I am OK signing my name on a reject decision. Sunday, November 18, 2018 HotNets'18: Networking in Space HotNets'18 was held at Microsoft Research, Building 99. This is walking distance to my office at Cosmos DB, where I am working at my sabbatical. So I got tempted and crushed this workshop for a couple sessions. And oh my God, am I happy I did it. The space session was particularly very interesting, and definitely worth writing about. My God, it is full of satellites! According to a 2018 estimate, there are 4,900 satellites in orbit, of which about 1,900 operational, while the rest have lived out their useful lives and become space debris. Approximately 500 operational satellites are in low-Earth orbit, 50 are in medium-Earth orbit (at 20,000 km), and the rest are in geostationary orbit (at 36,000 km). The low earth orbit LEO satellites are not stationary and fast moving around the earth at 1.5 hour per rotation. We are talking about the lowest ring in this picture, where International Space Station (ISS) resides. Since LEO satellites are close to Earth, this makes their communication latency low. Furthermore, if we take into account that the speed of light in vacuum is 1.5 times faster than in fiber/glass, communicating over the LEO satellites becomes a viable alternative to communicating over fiber, especially for reducing latency in WAN deployments. When this gets built, it will change Internet: in some accounts up to 50% traffic may take this route in the future. And, it is actually getting built soon. Starlink: SpaceX's satellite constellation Starlink is a satellite constellation development project underway by SpaceX, to develop a low-cost, high-performance satellite bus and requisite customer ground transceivers to implement a new space-based Internet communication system. Development began in 2015, and prototype test-flight satellites were launched on 22 February 2018. Initial operation of the constellation could begin in 2020 with satellite deployment beginning mid 2019. In Starlink’s initial phase, 1,600 satellites in 1,150 km altitude orbits will provide connectivity to all except far north and south regions of the world. A second phase adds another 2,825 satellites in orbits ranging from 1,100 km altitude to 1325 km, increasing density of coverage at lower latitudes and providing coverage at least as far as 70 degrees North. via GIPHY And guess what! It looks like these satellites will communicate with each other using fricking lasers''! Delay is Not an Option: Low Latency Routing in Space Mark Handley (University College London) tried to reverse-engineer SpaceX's FCC filings to figure out what is possible with Starlink. It was the most interesting talk at the conference (at least among the talks I attended). People listened to the talk breathlessly and mesmerized. Mark had such exquisite visualizations and darkened the room for us to appreciate them better. It was a 20 minute trip to space and to 10 years in the future to deliberate about networking in space. (Here is a link to the paper.) A special note about his slides is in order. He coded the satellites and the routing algorithms using the Unity framework. His slides were not showing video of simulations, but rather running the simulator in real-time. Bold and beautiful way to present. The FCC filings mention "silicon carbide" communication components, which point to laser communication. Since it would be hard to infer bandwidth without more information, Mark took on the question of figuring out what the latency could be, and how would it change as satellites move, and what kind of use would this enable. Each satellite has 5 inter-satellite communication links. The phase 1 satellites are northeast bound. And the phase 2 satellites are southeast bound. The coverage is not uniform. London would be able to communicate with 30 LEO satellites at any given time. Routing over satellites multihop via laser 90ms latency is achievable, compared to 160ms over fiber communication. This is a big improvement, for which financial markets would pay good money for. Mark also considered how many multiple paths could be run over these satellites? He found that 10 multipaths is feasible. But 20 simultaneous paths not possible in phase1 of constellations. With the additions in second phase (satellites that are southeast bound), London to Johannesberg latency can come down to 80ms from 190ms. These phase 2 additions will also help for providing better multipath capacity. With the second phase additions, FCC required SpaceX to cover the Alaska north region. This may also serve the purpose of routing over the pole, for example for the NY to Beijing route. The Starlink deployments open many research questions for networking: • how do you avoid building queues? (probably via a form of source routing) • how do you coordinate multipath route changes? • how do you avoid reordering without increasing latency? • how do you make topology adaptive? The other papers in the session The "Networking, in Heaven as on Earth" paper considers the interdomain routing problem with satellite constellations. The vision there is to full integration of satellite networks in Internet Control Plane (via BGP). But satellites move very fast which leads to frequent BGP updates. Filtering reduces updates but introduces connectivity problems. The paper mentions that a proactive routing strategy (that leverages predictability of satellite orbits) rather than reactive could work better The "Gearing up for the 21st century space race" paper talks about miscellaneous issues in space networking. The talk mentioned that some trade activity shows outstripping of fiber speed communications from NY to Frankfurt and that people might be using short-wave (microwave) radios to beat fiber-optical speeds (where light travels 2/3rds slower than it does in vacuum). Then the talk speculated whether it is possible to establish multihop microwave routing using in-flight planes. It turns out it is possible to do it with 20 hops across the globe (east to west) with low-stretch and good latency. MAD questions 1. Maybe we are getting there, huh? This session reminded me of the Seveneves book by Neal Stephenson (great read, recommend highly). In Part 1, of the book there was very good coverage of Space orbits, maneuvering in space, and how dangerous space junk could get. Coincidentally, one of the talks mentioned that space is garbage tracked: anything bigger than a marble is tracked. At first I didn't buy this, didn't sound feasible. But turns out radar is used for learning trajectories of the space junk and the trajectories are maintained at the databases in space agencies to help make the space station and the satellites to avoid them. So the satellites will be routing packets and simultaneously try to route around occasional space junk. We are getting ready to become a space-faring species, and that is very exciting. 2. What is next? Faster than light quantum communication, anyone? Ender's Game series mentioned such communication. And of course there is a wikipedia page for faster than light communication. 3. Would it be feasible to do store and forward communication via the satellites? You know the thought-experiment about the plane full-of-disks, right? It has very good throughput. Since these satellites are already moving at a fast speed, could they be used for data mules to improve throughput for big data networking, say between the Hadron collider in CERN and datacenters in US? Think of beaming up data to a row of satellites (one after another) that store this data and in 45 minutes or so dump these at the US datacenters. Could this be a feasible alternative to fiber? Probably not so much, since the uplink and downlink are still limited. Thursday, November 15, 2018 My Emacs journey This is a follow up to my "Master your tools" post. As an example of one of my tools, I talk about my Emacs use. I have been using Emacs for the last 20 years. At this point, I don't even know Emacs, my fingers do. If you ask me the shortcut for something, I will need to let my fingers do it and try to observe what they are doing. And sometimes ---as in the story of the caterpillar who forgets how to walk when asked to demonstrate it--- I forget about how to do something when I try to attempt it consciously. From text-editing to text-wrangling I have been learning Emacs at a glacial pace, but I think that worked for me better. I figured I can internalize so much at a time, so I didn't rush things. I initially used Emacs mostly for power editing LaTeX files. It was only around 2008 that I started with the Emacs org-mode. I loved its outlining feature, and started using and customizing it ever since. It has been a big part of my thinking and writing process for the last 10 years. You can say that it became my out-of-core memory execution primitive. When I write a blog post or an article, I use the org-mode outline headers to organize/departmentalize and text-wrangle the content. I have a JUNK header where I move extra text, this helps me overcome my kill-your-darlings syndrome. I have an INSERT header for noting down what I like to insert. I visit these later to decide what is the best place in the article to insert them, or whether I should move them to the junk header. So this does work like my out-of-core memory module. At any time I only keep a small number of things in my mind, and use the headers as my swap memory. I go forward with one decision/issue at a time without overwhelming myself. This is how I try to scale my attention in my offline thinking mode. I don't have a large working memory (I also suspect you don't either) and this helps immensely. Text wrangling for the win! Getting things done with org-mode As for using org-mode (org-agenda) for TODO lists and getting things done, I had gone through 3 unsuccessful attempts before I finally made it to stick. After my failed attempts, I thought I am hopelessly disorganized and this is too much of a hassle. After I saw my colleague Jason use it regularly, I gave it another shot. Incorporating org-agenda to my workflow did wonders for my peace-of-mind, if not for my organization and timeliness. I wrote about this a little here. Before integrating the emacs org-agenda to my life, I always had open loop tasks that caused me worries, and eating up cycles in my brain: "Oh, I should remember doing this", "Woe to me, I am procrastinating on this", etc. After successfully adopting emacs org-mode as my to-do list and agenda manager (which, took a couple years, and several aborted tries), the benefit I got out was the clarity of mind, and the release of all that background anxiety. Other assorted Emacs tricks I use I love the org-export-to-latex functionality of org-mode. This way I can get a pdf file shareable with others anytime, while still staying in org-mode where I do my thinking. In my org-file I would have COMMENTed out the JUNK header, INSERT header, and the META header (for questions/connections) to capture my thinking and provide a snapshot of my brain. The exported LaTeX article hides all of those, but I still need those as my documentation of my thought process and to evolve my work further. Frankly I don't get how Word/Pages/etc users deal with not having COMMENT sections in their documents. I also use the org-export-to-beamer mode for quickly preparing presentations in Beamer. This helped me survive teaching. Preparing powerpoint presentations takes a lot of time and is painful. On the other hand, due to its integration to my thinking/writing process and due to the COMMENTing/evolving benefits I mentioned above, exporting the org document to Beamer makes things easy/frictionless for me. Another hack I use is to maintain a notes.org file in any folder to keep track of that project. I use this as a lab-notebook to record everything about the project, and meta-thoughts, concerns, etc. I also add timestamps to my entries with my Emacs shortcut (Wed, 14 Nov 18 - - - 21:07) cntrl-c-t. Of course I had to try this with my fingers first to learn the command I use. I use M-x-tomatinho for keeping track of my pomodoros in Emacs. It is visual, and it gives me a good picture of how my day is going. In my self.org file, I keep an org-mode table where I note which pomodoro number corresponds to what time and task. This gives me a candid picture of how my day is going and how my week has gone. This post from 3 years ago describes the pomodoro workflow I had then, which is obsolete now. When I am working on large documents in org-mode, I use hot links (radio-targets) for definitions, so that when I write the word at a later point in the file, Emacs automatically inserts the link to where it is defined first. I also use predefined hi-lock for "!!" and "??" to easily highlight some interesting findings and questions in the text. And I sometimes use impromptu M-x highlight-phrase or M-x highlight-regexp to highlight other things. I use <f3> and <f4> to define and use keyboard macros for ad hoc custom editing needs. I wrote a bit about how I prepare my Emacs setup here. MAD questions 1. What should my Emacs learning/mastering pace be? I don't consider myself a power Emacs user. I am far from it. I don't go shortcut crazy. I don't try to automate everything. And if some new Emacs tricks/shortkeys don't stick with me, I take it that I don't really need them. As a counterpoint to this though, I couldn't get the org-agenda to stick for a long time, and now that I use it I realize how much I needed it. So where should I draw the line about how much to push to learn/adopt new things? 2. Benefits of drawing/sketching versus typing As much as I love Emacs, it is linear and text. Although org-mode helps for making things non-linear with its headers, it doesn't give the same visual thinking benefits from drawing/sketching. I think I will try to incorporate more doodling/sketching drawing to my workflow in the coming months. Let me know if you have good suggestions for this. Monday, November 12, 2018 Is Twitter causally-consistent? For the past 7-8 years, several research papers have used this example to motivate causal consistency. You must have read this example, right? • Alice removes her boss from her friend list, and posts that on her feed that she is looking for a new job. • Tom removes his mom from the friend list, and posts his Spring Break photos. Well, being the empirical researchers we are, Aleksey and I wanted to put Twitter in to test for this scenario. On September 25, we performed this test. (I also have a video recording of this. But since I can't stand to hear myself talk in recordings, I am not posting it. I sound really weird, man.) I first blocked Aleksey on my Twitter account, and then tweeted that Aleksey drinks a lot of tea (it's true). When we checked Aleksey's timeline, we saw that his timeline indeed did not display my tweet. So, this was kind of an anticlimax. Twitter passed the causal-consistency test easily. No need to publish more causal-consistency papers, right? Well, maybe not quite. Maybe this was because Aleksey and I were in the same region and our accounts fall in to the same datacenter. We thought maybe we should repeat the test with Aleksey connected to Twitter via a proxy, but then since Aleksey's laptop was away that day, we decided to test this at another time. (Maybe we should find someone from another region and arrange a cross-continental Twitter causality test this time.) Anyways, this story gets more interesting. Keep reading. I unblocked Aleksey, and Aleksey checked that he was following me again, and we called it a day. But a couple days later, I was tweeting something and wanted to include a mention to @AlekseyCharapko in my tweet. But Twitter didn't autocomplete for me. I finally found his account after some searching on Twitter, and saw that Aleksey is not following me. What!? My student doesn't follow me on Twitter? Impossible. I showed this to Aleksey and he was also caught by surprise. We then remembered about the Twitter test we did a couple days ago. It must be that although I unblocked Aleksey after our test, a nightly batch job blocked him from my account and made his account unfollow me. Reality is often stranger than the clean models we have. Real systems have a lot of back channels and processes. MAD questions 0) Some clarification on this from someone at Twitter There was discussion on this at Hacker News.  Isn’t this a known feature of Twitter, usually called a soft block or forced unfollow? You block and unblock someone quickly and it forces them to unfollow you (and more importantly, does not signal them that it happened). I have heard of people fearing harassment to use this technique to get themselves out of a person’s timeline (out of sight, out of mind) without the pseudo-confrontation of a block.  This is correct.When we process a block request from user A -> user B, we remove the follow edges between user A -> user B and user B -> user A, and then add a block edge from user A -> user B. When we process an unblock request from user A -> user B, we remove the block edge from user A -> user B. I imagine that the "Aleksey checked that he was following me again" was either client caching, or eventual consistency latency. There's no nightly batch job or anything doing that. Source: I work on the social graph service at Twitter. Here is the video segment that shows that after I unblocked Aleksey, he was able to see my tweets in his timeline. I don't know what happened after that which made his account unfollow mine again. It might be that since my tweet included a mention to him, Aleksey clicked on the notification and could see my tweets after that, and not in his timeline. We repeated the experiment, and found that after I blocked and unblocked him, he was indeed still unfollowing me. 1) What is the right way to test this? Sending two consecutive tweets from one account and monitoring from another does not seem sufficient to test causal-consistency to me. The tweets are likely partitioned with the user_id, and so just session consistency or per-key ordering would give the impression of causal consistency, right? To offset for this, we tried another test, where I tweet something, Ailidani reads it and presses the button on his drafted tweet, and Aleksey checks his feed for the result. It was no contest again. Twitter is so fast that it displayed my tweet on Aleksey's timeline immediately and Ailidani's tweet didn't stand a chance. (Another thing to note for completeness. Under content preferences, we disable the "show me the best tweets first" to get a time ordered tweet stream.) 2) Where would you put your money? So let's say we set up this 3-person cross continental Twitter causal consistency test. Where would you put your money on? Can we infer something from what we know of Twitter architecture? This is a good start for reasoning. But again, in our 3 person test, per-key/object ordering does not help. Saturday, November 10, 2018 How to be a good machine learning product manager There are a lot of interesting meetups at Seattle, and I try to attend one every couple weeks. Ruben Lozano Aguilera was the speaker for this meetup on Oct 17. Ruben is a product manager at Google Cloud, and before that he was a product manager at Amazon. What is ML? Programming transforms data + rules into answers. Machine learning turns data + answers into rules. When should you use ML? Use ML if the problem: • handles complex logic • scales up really fast • requires specialized personalization • adapts in real-time For example ML is a good fit for the "search" problem. Search requires complex logic, for which it is not easy to develop rules. It scales up really fast in terms of new keywords, combinations and content. It requires personalization depending on the context, and has some real-time adaptation component as well. Another important point is that the problem should have existing examples of actual answers. When you bootstrap from a good enough dataset, you can scale further, because data -> predictions -> customer experience -> more traffic -> more data. Some popular ML problems are ranking, recommendation, classification, regression, clustering, and anomaly detection. Don't use ML when your problem: • can be solved by simple rules • does not adapt to new data • requires 100% accuracy • requires full interpretability/why-provenance The data requires further consideration: Can you use data? Is it available, accessible, and sufficient? Is high quality? relevant, fresh, representative, unbiased? Is it appropriate to use the data: privacy, security concerns? For the following, can you use ML or not? • What apparel items should be protected by copyright? No. This is dangerous financially, you need to get 100% accuracy. • Which resumes should we prioritize to interview for our candidate pipeline? No, this may be based on biased data. • What products should be exclusively sold to Hispanics in the US? No. This is discriminatory and creepy. • Which sellers have the greatest revenue potential? Yes. • Where should Amazon build next head quarters? No. This is not a repeatable problem; there is only one label: Seattle. • Which search queries should we scope for the Amazon fresh store? Yes. What is the ML lifecycle? For productizing ML, you need people, processes, and tools/systems. The people come from two domains: • Math, statistics: ml scientist, applied scientist, resarch scientist, data scientist • Software, programming: business intelligence engineer, data engineer, software engineer, dev manager, technical program manager The ML lifecycle involves 4 phases: problem, data, features, and model. To formulate the problem, you need to clarify what to solve, establish measurable goals, and determine what to predict. For the data phase, you need to • select data: available, missing data, discarding data (data cleaning) • preprocess data: formatting, cleaning, sampling For the features phase, you need to consider scaling, decomposition, aggregation, and discard any features that are not relevant. Finally, for the model phase, you first divide the data set into training data and test data, could be 70+30 or 90+10. Then comes the model training (using whatever algorithm you are using), which produces the ML model. You then test this output ML model with the test data. To productize your model, you should integrate the ML solution with existing software and keep it running over time. At this point considerations about the deployment environment, data storage, security and privacy, monitoring & maintenance come in to play. Some great ML solutions cannot be productized due to high implementation costs or inability to be tested in practice. The product manager is very much involved in the first 2 phases: formulating the problem and selecting the data. Product manager is also involved in feature selection but not much involved with the final model phase. MAD questions 1) Umm, deep learning? Since the presentation didn't mention any deep learning specific problems/tasks/observations, I asked Ruben about what significance deep learning had on the projects he worked on. Turns out, he didn't use any. He said that simpler ML models were enough for the tasks they undertook so he never needed a deep-learning solution. He also said that deep-learning was very expensive up until a couple years ago, and that was also a factor. With TensorFlow, Google is supposedly using deep-learning a lot, likely more for image and voice processing. But, is there a study about the prominence of deep-learning use among ML solutions in the industry? 2) How do you troubleshoot issues with productizing ML? As we covered above there are many things that can go wrong, such as unanticipated bias in your data, in your method, conclusions. How do you check for these? Ruben answered they brainstorm and think very deeply about what could go wrong, and identify these issues. It seems like this needs more processes and tool support. Having seen how TLA+ specifications and model checking create wonders for checking problems with distributed/concurrent systems, I am wondering if similar design level tool support could be developed for ML solutions. 3) How do we learn/teach empathy? Ruben was a great speaker. He used beautifully designed slides. After all he is a product manager and sympathizes with the users/audience. In Q&A session he mentioned that empathy is the most important skill for a ML product manager. I believe empathizing with your audience also goes a long way in public speaking. How do we learn/teach empathy? This is so basic that you expect/hope we learn this as kids. But it looks like we keep forgetting about this and fail to empathize. Also, there is always levels to things. How do we get better at this? 4) Is ML/DL too application-coupled? I have a some understanding of ML/DL domain, since I started learning about it in 2016. I am still amazed at how tightly application-coupled is the ML/DL work. On one hand this is good, this makes ML/DL very practical and very applicable. On the other hand, this makes it harder to study the principles and systematize knowledge. Thursday, November 8, 2018 SDPaxos: Building efficient semi-decentralized geo-replicated state machines In the last decade, the Paxos protocol family grew with the addition of new categories. • Rotating leader: Mencius • Leaderless: EPaxos, Fast Paxos • Paxos federations: Spanner, vertical Paxos • Dynamic key-leader: WPaxos This paper, which appeared in SOCC 18, proposes SDPaxos which prescribes separating the control plane (single leader) from the replication plane (multiple leaders). SD in SDPaxos stands for "semi-decentralized". The motivation for this stems from the following observation. Single leader Paxos approach has a centralized leader and runs into performance bottleneck problems. On the other hand, the leaderless (or opportunistic multileader) approach is fully decentralized but suffers from the conflicting command problems. Taking a hybrid approach to capture the best of both worlds, SDPaxos makes the command-leaders to be decentralized (the closest replica can lead the command), but the ordering-leader (i.e., the sequencer) is still centralized/unique in the system. Below I give a brief explanation of the Paxos protocol categories before I discuss how SDPaxos compares and contrasts with those. Plain vanilla Paxos Paxos provides fault tolerant consensus among a set of nodes. • Agreement: No two correct nodes can decide on different values. • Validity: If all initial values are same, nodes must decide that value. • Termination: Correct nodes decide eventually. Paxos runs in 3 phases: propose (phase-1), accept (phase-2), and commit (phase-3). 1. A node tries to become the leader by proposing a unique ballot number b to its followers with a phase-1a message. The followers acknowledge a leader with the highest ballot seen so far, or reject it with a ballot seen with a number greater than b. Receiving any rejection fails the candidate. 2. In the absence of a rejection, a node becomes leader and advances to phase-2 after receiving a majority quorum of acknowledgments. In this phase, the leader chooses a suitable value v for its ballot. The value would be some uncommitted value associated with the highest ballot learned in previous phase, or a new value if no pending value exists. The leader commands its followers to accept the value v and waits for acknowledgement messages. Once the majority of followers acknowledge the value, it becomes anchored and cannot be revoked. Again a single rejection message (carrying an observed higher ballot number) received in phase-2b nullifies the leadership of the node, and sends it back to phase-1 to try with a higher ballot number. 3. Finally, the leader sends a commit message in phase-3 that allows the followers to commit and apply the value to their respected state machines. It's important to see that after phase-2, an anchored value cannot be overridden later as it is guaranteed that any leader with higher ballot number would learn it as part of its phase-1 before proposing a value in its phase-2. You can find more information on Paxos in this blog, and especially Modeling Paxos in TLA+ could be of interest to you. Single leader approach The traditional single-leader Paxos protocol employs a centralized leader to process all client requests and propose commands. The single leader takes a significantly heavier load than the other replicas, and becomes a performance bottleneck. Moreover, in geo-replication, clients not co-located with the leader need to send the requests to the remote leader, which incurs significantly higher wide-area network latency. Mencius approach Mencius is a multileader version of Paxos that aims to eliminate the single leader bottleneck in Paxos. Mencius achieves load balancing by partitioning the consensus instances among multiple servers. E.g., if we have 3 servers, server 0 is responsible for acting as a leader for consensus instances numbered 0,3,6, server 1 for 1,4,7, and server 2 for 2,5,8, etc. Mencius tries to avoid the straggler problem by making the replicas skip their turns when they fall behind, however, it cannot fully eliminate the slow-down. Since it uses multiple leaders, Mencius also loses out on the "serve reads locally at the leader" optimization possible in Paxos. Leaderless approach EPaxos is a leaderless solution, where every node can opportunistically become a leader for some command and commit it. When a command does not interfere with other concurrent commands, it is committed in a single round after receiving the acks from a fast quorum (which is approximately 3/4ths of all nodes). In a sense, EPaxos compresses the phase-2 to be a part of phase-1 when there are no conflicts. However, if the fast quorum detects a conflict between the commands, EPaxos defaults back to the traditional Paxos mode and proceeds with a second phase to establish order on the conflicting commands. Unfortunately, services like E-commerce and social network can generate high-contention workload, with many interfering commands on the same object from multiple clients. This problem is aggrevated in wide area network deployments: since requests take much longer time to finish, the probability of contention rises. Multileader approach Spanner and CockroachDB are examples of databases that uses a federation of Paxos groups to work on different partitions/shards. These partitioned consensus systems employ another solution on top (such as vertical Paxos) for relocating/assigning data from one Paxos group to another. WPaxos uses sharding of the key space and takes advantage of flexible quorums idea to improve WAN performance, especially in the presence of access locality. In WPaxos, every node can own some objects/microshards and operate on these independently. Unlike Vertical Paxos, WPaxos does not consider changing the object/microshard ownership as a reconfiguration operation and does not require an external consensus group. Instead WPaxos performs object/microshard migration between leaders by carrying out a phase-1 across the WAN with a higher ballot number, and commands are committed via phase-2 within the region or neighboring regions. The SDPaxos approach The root of the inefficiency of leaderless and multileader protocols is the decentralized coordination pattern. Although decentralization addresses the single-leader bottleneck as every replica can propose commands, the replicas still need to agree on a total order on conflicting commands proposed by different replicas to avoid inconsistent state. To address this issue, SDPaxos divides the consensus protocol in 2 parts: durably replicating each command across replicas without global order (via C-instance Paxos), and ordering all commands to enforce the consistency guarantee (via O-instance Paxos). Replicating via C-instance Paxos is completely decentralized where every replica can freely propose commands and replicate them to other replicas. This evenly distributes the load among replicas, and enables clients to always contact the nearest one. On the other hand, as part of O-instance Paxos, one replica is elected as the sequencer and handles the ordering in a centralized manner: the global view enables this replica to always order commands appropriately. Provided that the ordering messages are smaller than replication messages, the load on the sequencer will not be as severe as that on the single leader in Paxos. Fault tolerance is provided as both the replicating and ordering instances are conducted based on Paxos. Each replica proposes commands in a series of C-instances of its own to produce its partial log. The sequencer proposes replicas' IDs in O-instances to produce an assignment log. Based on the assignment log, all replicas' partial logs are finally merged into a global log. Comparison with other protocols The separation between C-instances and O-instances is the source of SDPaxos's advantages over existing protocols. The decentralization distributes load evenly across replicas, and allows clients to always contact the nearest replica in geo-replication to serve as the command leader. The O-instance leader, i.e., the sequencer, provides conflict free operation. So, SDPaxos is like Paxos, but it has local leader in each region. This way it avoids the cost of going to the leader, and back. Also SDPaxos is like EPaxos but with no conflicts, ever! In SDPaxos, the sequencer is one node, but is backed by O-instances. A new sequencer can be chosen easily in a fault-tolerant way using Phase-1 of Paxos over O-instances. This alleviates the availability problems due to the serializer failure in systems that use Paxos for serializing the log in a central region. In such systems (e.g., Calvin) if the single log serializer is Paxos-replicated within a region, then the availability suffers on region failure. Instead, if the serializer is Paxos-replicated across regions then the performance suffers. The protocol In this example, upon receiving a client request for a command, replica R0 becomes the command leader of this command, picks one of its own C-instance and replicates the command to others (using the C-accept, i.e., Accept phase message of the C-instance). In the meantime, this C-accept also informs the sequencer (R2) to start an O-instance for this command. Then R2 proposes R0’s ID in the next (e.g., the jth) O-instance and sends O-accepts to others, to assign this command to the jth global slot. Replicas will then accept these instances and send C-ACKs and O-ACKs to R0; R2 also sends an O-ACK as it has sent an O-accept to itself. The algorithm denotes the ith C-instance of Rn as Cni, and the jth O-instance as Oj. A C-instance can come from any node without a Paxos phase-1a, because each replica has its own distinct replication log for C-instance. The C-instance messages do not conflict with each other and gets accepted immediately. The C-instance messages do not even need a ballotnum; the ballotnum used is that of the O-instance to denote epoch (i.e., which sequencer the sender thinks is still in-charge). A command being ready requires the C-instance and enough number of O-instances be committed. The conditions of an instance being committed and a command being ready are defined in lines 18 through 31, which we discuss next. There are two questions here. • The safety question is: How do we ensure that the replication is anchored (performed at the majority quorum) from the command leader perspective? • The performance question is: How do we achieve consensus in one round while satisfying the safety concern? The 1-round feat In the best case when O-instance of the sequencer overlaps perfectly with the C-instance of the command replication leader, consensus is achieved in one round-trip time ---the optimal possible. But, since the O-instance starts half a round trip later than the C-instance for non-sequencer replicas, it is not always possible to optimize the O-instance completion to just half a round trip to achieve the one-round-trip latency. But the paper shows how this can be achieved for N=3 and N=5 replicas. In groups with more than 5 replicas, the O-instances still need one round trip, thus the overall latency remains at 1.5 round-trips. In 3-replica groups, when the command leader receives O-ACK from the sequencer, the majority (2 out of 3) is readily established for O-instance completion. This provides the one-round-trip consensus. (For the case, the command leader is also the sequencer, sequencing can be done in advance and one round-trip is satisfied as well.) In 5-replica groups, a command can also be ready in one round trip, but unlike the case of three replicas, an O-instance cannot be accepted by a majority in half a round trip. Instead, SDPaxos lets each non-sequencer replica commit an O-instance, upon receiving the O-accept from the sequencer (line 29). Here, the O-instance does not rigorously follow Paxos, which raises a complication for recovery: if this non-sequencer replica and the sequencer fail, we cannot recover this O-instance simply by Paxos because the other alive replicas may have not seen the O-accept yet. However, the paper discusses a way for SDPaxos to correctly recover the O-instances of all replicas' ready commands even in such cases (omitted in my review), and is able to allow for 1-round-trip commits for N=5. Note that the dynamic per key leader approach in WPaxos still has an edge over SDPaxos when there is good locality in the workload (which is often the case in practice) and/or when the number of replicas is greater than 5 (which is often the case for geo-replicated databases). It may be possible to use WPaxos for coordination across regions and integrate SDPaxos for coordination within the region upto 5 replicas. Optimizations As an optimization for reads, SDPaxos uses sequencer leases to authorize the sequencer to directly reply to the read requests. In contrast, such an optimization is not possible for leaderless approaches, as there is no single/dedicated leader to lease and read from. As another optimization, in some cases, it is possible to divide the responsibility of sequencer to all replicas for more load balancing. For example, in a key-value store, we can partition the key space using approaches like consistent hashing, then make each replica order the commands on one partition (commands on different keys can be out-of-order). Again, in this case, it would be possible to use the WPaxos approach for safe key-stealing among the sequencers, and dynamically adapt to the access pattern, rather than being confined to static partitioning. Evaluation They implemented a prototype of SDPaxos, and compared its performance with typical single-leader (Multi-Paxos) and multileader (Mencius, EPaxos) protocols. The experiment results demonstrate that SDPaxos achieves: (1) 1.6× the throughput of Mencius with a straggler, (2) stable performance under different contention degrees and 1.7× the throughput of EPaxos even with a low contention rate of 5%, (3) 6.1× the throughput of Multi-Paxos without straggler or contention, (4) 4.6× the throughput of writes when performing reads, and (5) up to 61% and 99% lower wide-area latency of writes and reads than other protocols. MAD questions 1. Does SDPaxos help with the leader bottleneck significantly? I wrote above that "Provided that the ordering messages are smaller than the replication messages, the load on the sequencer will not be as severe as that on the single leader in Paxos." But on close inspection I don't think I believe that sentence anymore. Ailidani, Aleksey, and I have done a detailed bottleneck analysis of Paxos protocol categories (under submission), and we found that the outcast messages are not the biggest source of bottleneck for the leader, as they are serialized once before being sent out to the replicas. The incast messages contribute most to the bottleneck, as the CPU needs to process them one by one and they queue up. Moreover, the incast messages are ACK messages, which are already small, and SDPaxos does not make them smaller. So, maybe SDPaxos does not improve significantly on the single-leader bottleneck in Paxos. On the other hand, it is true that SDPaxos helps distribute the client request load to the C-instance replicas relieving that bottleneck, and it definitely helps with lowering the latency in WAN. 2. How would you build reconfiguration of participants for this protocol? Reconfiguration of participants is important to re-establish fault-tolerance capability by replacing failed replicas by accepting fresh new replicas to the system. How would reconfiguration work for SDPaxos? Would Raft's 2-phase reconfiguration method apply readily for SDPaxos? 3. What are additional challenges for efficient strongly-consistent geo-replication implementation at scale? I am at Microsoft Azure Cosmos DB for my sabbatical. Cosmos DB recently introduced general-availability of multiple write regions. While providing strong-consistency with multi-master writes allowed from any region has cost across the globe, could SDPaxos ideas help improve efficiency further? Tuesday, November 6, 2018 An unexpected phone call at the elevator Sometime ago, I was visiting an apartment block. This was a relatively new apartment block, I think less than 5 years old. I entered the elevator, and I heard "Hello, hello... Are you there?" booming from the speakers. This never happened to me before of course; elevators don't talk to me often. I thought maybe someone had pressed the call for help button, and left the elevator, and I can help sort this out. I said, "Hi, yes!" The voice continued saying: -"The reason I am calling you is because of your last energy bill". After a second of cognitive dissonance, I started laughing: -"Umm, I don't know what happened, but I am in an elevator now, and your voice is literally streaming from the elevator." The voice stopped for a brief couple of seconds, then continued explaining something about the bill, following the script he was given. -"Dude, this is my floor, and I got to go now. Ok, bye." It turns out most elevator phones ---which are mandated by law to allow you to call for emergency services in case you get stuck in the elevator--- are public network connected phones. And this occasionally leads to receiving phone calls at the elevator. This short youtube video is so funny. Unfortunately I was so shocked by the phone call it didn't occur to me to record it. From now on, this is how I will answer any unwanted spam call I receive: "Dude, I don’t know what happened, but I am in an elevator now, and your voice is literally streaming from the elevator. This is my floor, I gotto go now." MAD questions 1) If we have Internet-of-Things vision realized, do I have this to look forward to? Collection calls at the elevator? Or when trying to drive my car? 2) What is the worst malicious thing you can do if you have (or can find) the phone numbers for elevators at apartment buildings or workplaces? Huh, of course, there is prior art on the Web. Monday, November 5, 2018 Judoing the Dunning-Kruger effect: the "surprisingly-popular option" strategy for crowdsourcing Let's say you want to crowdsource the answer to the question: (Q1) What is the capital of Brazil? The surprisingly-popular option strategy for crowdsourcing suggests piggybacking a control question to the real question Q1: (Q2) What do you think the majority of other people will respond to this question? Many people (non-Brazilian and non-geography-nerds) will answer with Rio for both Q1 and Q2. But there will be some people that will answer with Brasilia for Q1 and yet Rio for Q2. The first set of people that replied with Rio to both questions did not know much about Brazil, and went with what they know as the most prominent city in Brazil. The second set of people not only knew of the correct answer, Brasilia, but they also anticipated that the majority of participants will go wrong by answering with Rio. Rio is the popular option for Q1, but Brasilia is the surprisingly popular option because the respondents for Brasilia had anticipated that Rio would be the popular option. The lesson is: "People who expect to be in the minority opinion deserve some extra attention." Here is an article about this algorithm, and here is the nature paper. (I wish this would have occurred to us 4 years ago, when we were collecting crowdsourced data from our "who wants to be a millionaire" app.) MAD questions 1. What are similar judoing techniques? I am fascinated with this technique, because this is like applying a judo-throw on the Dunning-Kruger and reversing its effects. It takes a weakness observed in the Dunning-Kruger effect (that people of low ability suffer from illusory superiority and mistakenly assess their cognitive ability as greater than it is), and transforms that in to a strength for identifying the expert voices among the participants, the people who knew the answer to Q2 but provided a minority answer for Q1. What are similar judo moves in computer science/technology and in general? 2. What about the conspiracy nuts? This surprisingly-popular option technique has a conspiracy nuts problem, isn't it? There is inevitably a strongly opinionated group that will be in the minority and they will correctly anticipate that they will be in the minority. This technique does not filter for them, right? So, let's take anti-vaxxers. They will say "no" to (Q1) "Should I vaccinate my kids?" and they will correctly guess the popular option "yes" to (Q2) "What do you think the majority of other people will respond to this question?" How do we fix this issue with the method? Maybe we can add a (Q3) What do you think the minority choice is and why it is wrong? Of course we are moving away from the automatic aggregation in the original method, but the essence of the idea still holds: Are you aware of the other opinions on this question, and can you explain and refute them? Can you not only strawman, but also steelman the other arguments and address them? 3. How does this apply to me as I assess my views/positions? For the hygiene of the mind, it is important to reflect on your beliefsets/behaviors/positions and reassess them occasionally. Am I able to explain why I am in the majority or minority side for my positions? Do I understand the other positions and not only refute them but can also appreciate some valuable points in them? Some other related posts from my blog on this are: Monday, October 8, 2018 Everything is broken Last Wednesday, I attended one of the monthly meetings of the "Everything is Broken" meet up at Seattle. It turns out I selected a great meeting to attend, because both speakers, Charity Majors and Tammy Butow, were excellent. Here are some select quotes without context. Observability-driven development - Charity Majors Chaos engineering is testing code in production. "What if I told you: you could test both in and before production." Deploying code is not a binary switch; deploying code is a process of increasing your confidence in your code. "Microservices are hard!" as a caption for a figure comparing the LAMP stack 2005 versus the complexity of the Parse stack 2015. We are all distributed systems engineers and unknowns outnumber the knowns! Distributed systems have an infinite number of almost-impossible failures! Without observability you don't have chaos engineering, you have a chaos. Monitoring systems have not changed significantly in 20 years, from Nagios. Complexity is exploding everywhere, but our tools are designed for a predictable world. Observability for software engineers: can you understand what is happening inside your systems, just by asking questions from the outside? Can you debug your code and its behavior using its output? For the LAMP stack monitoring was sufficient for identifying the problems. For microservices, it is unclear what we are supposed to monitor for. We need observability! The hard part is not debugging your code, but to find which part to debug! Facebook's Scuba was ugly, but it helped us slice and dice and improve our debugging! It improved things a lot. I understand Scuba was hacked to deal with MySQL problems. You don't know what you don't know, so dashboards are very limited utility. Dashboards are only for anticipated cases: every dashboard is an artifact of past failures. There are too many dashboards, and they are too slow. Aggregates are the kiss of death; important details get lost. Black swans are the norm; you must care about 99.9%, epsilons, corner cases. Watch things run in production in the normal case; get used to observing your systems when they aren't on fire. Building Resilient Systems Using Chaos Engineering - Tammy Butow Chaos engineering is "thoughtful planned experiments designed to show weak points in the system". Top 5 popular ways to use chaos engineering now: kubernetes, kafka, aws ecs, cassandra, elasticsearch. Fullstack chaos engineering: inject faults at api, app, cache, database, os, host, network, power We are exploring a new direction and collaborating with the UI engineers on ways to hide impact of faults. prerequisites for chaos engineering: 1. monitoring & observability 2. on-call & incident management 3. know the cost of your downtime per hour (British Airlines's 1 day outage costed$150 millon)

How to choose a chaos experiment?
+ identify top 5 critical systems
+ choose 1 system
+ whiteboard the system
+ select attack: resource/state/network
+ determine scope

How to run your own gameday: http://gremlin.com/gameday

Outage post-mortems: https://github.com/danluu/post-mortems

First chaos engineering conference this year: http://twitter.com/chaosconf

Some notes about the venue: Snap Inc

There were fancy appetizers, very fancy. They had a kitchen there at the fifth floor (and every floor?). Do they provide free lunch to snap employees?

At the 5th floor, where the meeting took place, we had a great view of Puget Sound bay. The Snap building is just behind the Pike Market Place. There were about 80-100 people. I think the 30+ folks outnumbered 40+ folks, but not severely. Good show up from female engineers. There was ambient music in the beginning from 6-6:30pm, but it was loud.

By the way, I never used snapchat... I am old. But I don't have a Facebook account, so maybe I am not that old.

1. Do you need to test in production?
The act of sabotaging parts of your system/availability may sound crazy to some people. But it puts forth a very firm commitment in place. You should be ready for these faults, as they will happen in one of these Thursdays. It establishes a discipline that you would test, gets you prepared with writing the instrumentation for observability, and toughens you up. It puts you into a useful paranoid mindset: the enemy is always at bay and never sleeps, I should be ready to face attacks. (Hmm, here is an army analogy: should you train with live ammunition? It is still controversial because of the lives on the line.)

Why not wait till faults occur in production by themselves, they will happen anyways. But when you do chaos testing, you have control in the inputs/failures, so you already know the root cause. And this can be give you much better opportunity to observe the percolation effects.

2. Analogies for chaos engineering
I have heard vaccination used as an analogy. It is a tactful analogy (much better than the live firing analogy). Nobody can argue against usefulness of vaccinations.

Other things chaos testing evokes could be blood letting and antifragility. I had read somewhere that the athletes in ancient Greek would induce a diarrhea on purpose a couple weeks before competitions, so that their body can recover and get much stronger at the time of the competition. I guess the reasoning goes as "too much of a monotone is a bad thing" and it is beneficial to stress/shake the system to avoid a local maxima. That reminds me of this YouTube video I show in my distributed systems class on the topic of resilience.

3. Debugging designs with TLA+
Even after you have a verified design, the implementation can still introduce errors, so using chaos engineering tools is valuable and important even then.

It helps even for "verified" systems for its nonverified parts:
Folks encouraged us to try testing verified file systems; we were skeptical we would find anything, but to our surprise, when we tested MIT’s FSCQ file system, we found it did not persist data on fdatasync()! Apparently they had a bug in the un-verified portion of their code (Haskell-C bindings), which was caught by Crashmonkey! This shows that even verified file systems have un-verified components which are often complex, and which will have bugs.

4. Chaos tag
Turns out I have several posts mentioning chaos engineering, so I am creating a chaos tag to be available for use for future posts.

Sunday, October 7, 2018

Debugging designs with TLA+

This post talks about why you should model your systems and exhaustively test these models/designs with the TLA+ framework. In the first part, I will discuss why modeling your designs is important and beneficial, and in the second part I will explain why TLA+ is a very suitable framework for modeling, especially for distributed and concurrent systems.

Modeling is important

If you have worked on a large software system, you know that they are prone to corner cases, failed assumptions, race conditions, and cascading faults.

There are many corner cases because there are many parameters, and these do interfere in unanticipated ways with each other. The corner cases violate your seemingly reasonable implicit assumptions about the system components and environment, e.g.,"1-hop is faster than 2-hops", "0-hop is faster than 1-hop", and "processes work with the same rate". There are abundant race conditions because today (with the rise of SOA, cloud, and microservices) all systems are distributed systems. Code that is supposedly "atomic block of execution" fails due to other processes executing concurrently. Finally, faults happen and their effects are almost always underestimated pre-deployment. Faults take your system to unanticipated states, and from there on with the interleaving of recovery actions with normal system actions, the system may be thrown to even more unanticipated states.

In large software systems, which are inevitably distributed systems, there are many unknown-unknowns and an infinite number of highly-improbable ways things can go wrong. Human brain and reasoning cannot scale to handle all these possibilities. To alleviate these problems, the industry developed tools for better observability and even testing in production for improving availability. These tools are very important and indispensable. But by the time you figure out some inherent problems with your design it may be too hard and expensive to fix things. What you thought would be the last 10% of the project ends up taking 90% of your time at production and operations.

If you model your designs first and exhaustively test and debug these models for correctness against corner cases, failed assumptions, concurrency, and failures, you can catch errors at the design time and fix them before they develop into problems and become costly to fix.

• Modeling first does not extend your development time, on the contrary it saves you time by reducing futile development attempts. Embarking on development with a flawed design almost always ensures that the implementation is flawed. While having a precise and correct model at hand does not guarantee that your implementation of the model is correct, it helps you avoid the big/intricate problems and also provides a good reference for testing your implementation against.
• Constructing a precise model of your system gives you clarity of thinking and supports your development immensely.  By modeling you discover about the inherent complexities of the problem; that helps you focus your attention and ignore accidental/byproduct complexities.
• The model also helps you to communicate precisely with your team and others as you avoid the ambiguity of natural language and the hand-waving and generalizations involved.
• Finally with the model at hand, you also have a chance to gradually introduce design decisions, and see alternative ways to implement the design.

TLA+ is great for modeling

TLA+ is a formal language for describing and reasoning about distributed and concurrent systems. It is developed by Dr. Leslie Lamport, Turing Award winner 2013. Lamport is a very important figure in distributed systems due to his logical clocks work and Paxos work among many others. For the last decade, he is very involved with improving the TLA+ framework to help make distributed systems more manageable.

TLA+ uses basic math to model and reason about algorithms: practical logic, set theory, and temporal logic are used for specifying  systems. Best of all, the framework integrates a model checker that exhaustively tests your models to the face of corner cases, failed assumptions, concurrency, and failures. The model checker tries all executions possible for your model and tells you for which executions, your invariants and system guarantees break.

Invariant-based reasoning
TLA+ framework promotes invariant-based reasoning to prevent the problems that arise from operational reasoning. In operational reasoning, you start with a "happy path", and then you try to figure out "what can go wrong?" and how to prevent them. Of course, you always fall short in that enumeration of problem scenarios and overlook corner cases, race conditions, and cascading failures. In contrast, invariant-based reasoning focuses on "what needs to go right?" and how to ensure this properties as invariants of your system at all times. Invariant-based reasoning takes a principled state-based rather than operation/execution-based view of your system.

To attain invariant-based reasoning, we specify safety and liveness properties for our models. Safety properties specify "what the system is allowed to do". For example, at all times, all committed data is present and correct. Liveness properties specify "what the system should eventually do". For example, whenever the system receives a request, it must eventually respond to that request. In other words, safety properties are concerned with "nothing bad happens", and liveness properties with "something good eventually happens".

Modeling with TLA+
The TLA+ framework supports you in building a model and figuring out its invariant properties in two major ways. Firstly, the math-based formal language helps you achieve precision while still working with high-level declarative statements. Secondly, the integrated model checker exhaustively debugs your model to the face of concurrency and failures, and produces counterexamples for which your candidate invariants fail. (After years of working with TLA+, I am still surprised about the counterexamples the model checkers spit out for my models: It is very easy to overlook some scenarios, but the model checker sets you straight.) You address these problems by improving your model or sometimes by relaxing your candidate invariants, and after many iterations converge to an exhaustively debugged model which guarantees the invariants.

Building a TLA+ model is beneficial even for systems that are already implemented and running. Through building the model, you learn about your system better, and figure out some latent failure modes and correct them before they occur in production.

Finally, maintaining a TLA+ model of your system provides important benefits for continuous development. While software systems need to be extended with new features frequently, these extensions may interfere in unanticipated way with the system and lead to downtimes. With the TLA+ model at hand, you can first add these features to your model, and catch/debug the problems at the design-level using the model-checker. This way you resolve potential issues before they even become problems.

TLA+ is practical
Since using TLA+ actually saves time for building large software systems, TLA+ modeling is adopted as a practice by many software companies.

I am on sabbatical at Cosmos DB, Microsoft globally distributed cloud-native database. The team has been using TLA+ to model the replication and global distribution protocols and exhaustively tests the designs for correctness against failures. We have recently published the customer-facing part of the model which precisely defines the 5 consistency levels offered by Cosmos DB.

Amazon has also used TLA+ modeling for some of their AWS offerings and has written a nice experience report on this. There are also reports of using TLA+ for modeling hardware systems as well.

For the last 4 years, I have been incorporating TLA+ in my distributed systems classes. TLA+ enables students to learn about concurrency and invariant-based reasoning and it provides them hands-on experience with distributed protocols. I also use TLA+ exhaustively in my research on new distributed algorithms.

In my experience, it is possible to pick up TLA+ up in a couple weeks. This is firstly because TLA+ adopts a very simple state-machine approach to model systems. A system consists of: (1) A set of variables which define the state of the system, and (2) A finite set of assignments/actions that serves to transition the system from one state to another.

Furthermore, PlusCal provides syntactic a sugar for the TLA+, which has a tendency to grow long (due to its low-level state-transition centric syntax) and look cryptic for some people. PlusCal is a pseudocode for writing algorithms at a higher-level of abstraction, and it is translated to the underlying TLA+ specifications for model checking. To give you some idea about the PlusCal, here is an example of a PlusCal code for a database replica process. While this is a straightforward code, you can see a nondeterministic choice construct "either or" in action. The model checker will exhaustively test all possible combinations of these "either or" actions and check if a certain sequence would break one of your safety and liveness specifications.

There is a very active TLA+ forum at Google Groups. Leslie Lamport chimes in several threads.

LearnTLA provides a user-friendly introduction to TLA+/PlusCal.

Wednesday, September 26, 2018

The last mile problem in trust

Blockchains are supposed to solve the trust problem. But blockchains attack only the easy part of the trust problem, and avoid the hard part. The easy part is to store the transactions in a tamper-resistant database. The hard part is to attest to physical world actions and state.

The blockchain is a database technology and it does not attempt to attest to physical world actions/state. It solves the problem of tamper-proofing the state after it is added to the database. It doesn't attempt to validate/test/certify if the state is correct as it is added to the database. If humans create the state, there is inherently a trust problem: Were the lettuce bad before it was loaded to the trucks, or are the truck conditions to blame? Did the farmer or the trucker lie?

If sensors create the state, this is still a very hard problem, but not because the sensors may have been tampered with ---that is a relatively easy problem to solve in hardware. The problem is hard because of the corner-cases involved; how do you even start to pretend that the sensors have complete coverage (or good/fair sampling) and the detection/verdict is accurate? It is really a very complex and messy problem. As far as complete coverage of food supply-chains are concerned, you need DNA-sequencing and metagenomics.

This is a classic last mile problem. The last mile problems are always hardest to solve because of the massive fan-out both in terms of scale and in terms of corner cases to handle. The last mile problems haunted many domains, most notoriously the telecommunications and transportation domains.

Walmart, Lettuce, and Blockchains

A couple days ago there was a lot of hype about Walmart starting to use blockchain in its supply chain, to pinpoint where the lettuce come from in an E.Coli contamination event.

Ok, let's get to the bottom of this. "Walmart, Lettuce, Blockchain." It felt very weird to type this in Google search, but I did it anyways... for science.

See, I knew there was a lot of hype: "The giant retailer will begin requiring lettuce and spinach suppliers to contribute to a blockchain database that can rapidly pinpoint contamination."

De-hyped, this just says Walmart wants the farmers to record transactions in a database.  And actually the article makes sense if you replace blockchain with database:  "Walmart says it now has a better system for pinpointing which batches of leafy green vegetables might be contaminated. After a two-year pilot project, the retailer announced on Monday that it would be using a blockchain, the type of database technology behind Bitcoin, database to keep track of every bag of spinach and head of lettuce."

I blame IBM's over-excitement in blockchains for the hype in the article. Supply-chains is a very complex topic, and this use of a database to record information doesn't come close to scratching the surface of it. There are many automation and logistics problems that remain to be solved. And the dreaded last mile problem of course.

1. What is the nature of trust?

What or who do you trust?

Trusting a deterministic machine with few inputs/environmental-parameters is reasonable. Especially if you verified and validated it, and tested it extensively.

But what would make you trust humans? Humans are complex nondeterministic beings, and the input and environment surrounding humans are also very complex.

Reid Hoffman defines trust as consistency through time. But this is assuming the conditions don't change. If conditions change, that is the inputs/environmental conditions change, the other side can change its actions.

The answer to the trust puzzle has got to do with "consequences", right?

It is easier to trust in a situation where you have little to lose, but the other side has a lot at stake. And ironically, this makes the other side have problems trusting you, since you have little at stake, and she is risking a lot. For mutual trust and better collaboration, all parties should have skin in the game.

So what is at stake? This can be reputation, if reputation is a currency valued by the individual and his environment. What is at stake can be jail time, if one breaks laws and get caught. This is assuming one doesn't enjoy jail. Under certain conditions, people commit crimes to get into jail to get fed and have reliable healthcare, and even not to feel lonely.

I think trust is not complicated, rather the calculation, alignment, and managing of consequences/incentives is complicated. And this again harkens back to the last mile problem in trust.

I believe the parties involved are going to push the limits of what they can get away with as long as the deterrents do not outweigh the incentives.

I don't know if there is a technology solution here.

At a recent A16Z podcast, one speaker was rightfully complaining that we have a lot of trust issues and fight among complementary business rather than substitute/alternative business. For example even though iphone apps and iphone platform are complementary businesses, there is a lot of fight there. Or consider the Yelp versus Google fight. Or the fights Facebook, the platform, picks with the applications it enables. The speaker was implying that with the right incentivization and cuts from cryptocurrencies like ethereum gas, the parties will actually synergize and grow together rather than fight.

This sounds nice and simple, but I don't think I buy this. The fights are due to the greedy nature of humans and companies. To repeat what I said said above, I believe the parties involved are going to push the limits of what they can get away with as long as the deterrents do not outweigh the incentives. Even if cryptocurrencies and Ethereum gas is used between platforms and applications enabled, next we will see fights over how much of the payment is fair etc. I don't know if technology can fix that. Maybe this is supposed to be a dynamic equilibrium with constant push-backs and small-battles erupting from the parties involved.

2. What is the verdict?

I don't hate/despise blockchains, as I have seen some colleagues do. That is a radical and unreasonable position. There are many smart people working on this domain, they cannot be all and completely wrong.

I am still ambivalent about blockchains. I believe there is still a big contribution potential coming from blockchains and smartconracts. But the hype news make things harder to see.