Saturday, April 28, 2018

Book review. How to write a lot: A practical guide to productive academic writing

This book, by Paul J. Silvia, a psychology professor, provides a great prescription to enable academics to write more. And let's be frank, we all should be writing more.

I give this little book to my graduate students so they get a healthy perspective on writing, an integral part of a researcher's job description, before they develop bad habits and PTSD from bad experiences with writing.

The book prescribes simple, straightforward solutions to writing woes. It has a no-BS, let's-get-to-business attitude. It gives some practical writing style advice as well, but if you need more help developing a good writing/composition style, you should also look into other books, such as "The Elements of Style" and "On Writing Well".

Here are some selected pieces from the book. I recommend the book to all struggling academic writers.

On finding time to write (page 12)

You have a teaching schedule, and you never miss it. If you think that writing time is lurking somewhere, hidden deep within your weekly schedule, you will never write a lot. If you think that you won't be able to write until a big block of time arrives, such as spring break or the summer months, then you'll never write a lot. Finding time is a destructive way of thinking about writing. Never say this again. Instead of finding time to write, allot time to write. Prolific writers make a schedule and stick to it. It's that simple.

Scheduling is the way (page 17)

Perhaps you're surprised by the notion of scheduling. "Is that really the trick?" you ask. "Isn't there another way to write a lot?" Nope---making a schedule and sticking to it is the only way. There is no other way to write a lot.

The best kind of self-control (page 22)

My wife has fast Internet access in her home office, but I don't have anything. Writing time is for writing, not for checking email, reading the news, or browsing the latest issues of journals. ... The best kind of self-control is to avoid situations that require self-control.

Writing breeds good ideas for writing (page 23)

If you believe that you should write only when you feel like writing, ask yourself some simple questions: How has this strategy worked so far? Are you happy with how much you write? Do you feel stressed about finding time to write or about completing half-finished projects? Do you sacrifice your evenings and weekends for writing?

It's easy to demolish this specious barrier: Research has shown that waiting for inspiration doesn't work. Boice (1990) conducted a study with profound implications for every binge writer who waits for inspiration. He gathered a sample of college professors who struggled with writing, and he randomly assigned them to use different writing strategies. People in an abstinence condition were forbidden from all nonemergency writing; people in a spontaneous condition scheduled 50 writing sessions but wrote only when they felt inspired; and people in a contingency management condition scheduled 50 writing sessions and were forced to write during each session. The dependent variables were the number of pages written per day and the number of creative ideas per day.

This is what Boice found. First, people in the contingency management condition wrote a lot: They wrote 3.5 times as many pages as people in the spontaneous condition and 16 times as much as those in the abstinence condition. People who wrote "when they felt like it" were barely more productive than people told not to write at all---inspiration is overrated. Second, forcing people to write enhanced their creative ideas for writing. The typical number of days between creative ideas was merely 1 day for people who were forced to write; it was 2 days for people in the spontaneous condition and 5 days for people in the abstinence condition. Writing breeds good ideas for writing.

Make a list of writing projects (page 47)

After you've committed to a writing schedule, you need to make a list of your project goals and write them down. When you sit down to write, spend a minute thinking about what you want to do that day. Setting priorities among your project goals will take the stress out of managing several projects at once. And monitoring your writing will keep you focused on your goals, motivate you not to miss a day, inform you about how well you're doing, and give you hard facts that you can show to your binge-writing colleagues who are doubters and unbelievers. Anyone who combines the tips in this chapter with a regular schedule will write a lot.

Write first, revise later (page 75)

Generating text and revising text are distinct parts of writing---don't do both at once. The goal of text generation is to throw confused, wide-eyed words on a page; the goal of text revision is to scrub the words clean so that they sound nice and make sense. Some writers---invariably struggling writers---try to write a pristine first draft, one free of flaws and infelicities. The quest for the perfect first draft is misguided. Writing this way is just too stressful: These writers compose a sentence; worry about it for 5 minutes; delete it; write it again; change a few words; and then, exasperated, move on to the next sentence. Perfectionism is paralyzing.

Furthermore, writing sentence by sentence makes your text sound disjointed. The paragraph, not the sentence, is the basic unit of writing.

Tuesday, April 24, 2018

Paper summary. Step by Step Towards Creating a Safe Smart Contract: Lessons and Insights from a Cryptocurrency Lab

This paper, on our seminar reading list, was our first real introduction to smart contracts, so we appreciated that it was written to be very accessible.

A smart contract is a program that is executed by all miners, with its outputs incorporated into the blockchain. A contract consists of program code, a storage file, and an account balance. The program code of a contract is fixed when the contract is created, and cannot be changed.

The contract's code is executed whenever it receives a message from a user or from another contract. While executing its code, the contract may read from or write to its storage file. The contract's storage file is stored on the public blockchain. A contract can also receive money into its account balance, and send money to other contracts or users. A contract's entire state is visible to the public.

Ethereum uses the concept of "gas" (bought by currency) to discourage over-consumption of resources. During the execution of a transaction, every program instruction consumes some amount of gas. If the gas runs out before the transaction reaches an ordinary stopping point, it is treated as an exception: the state is reverted as though the transaction had no effect, but the Ether used to purchase the gas is not refunded!
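To make the revert-but-keep-the-fee behavior concrete, here is a toy Python sketch (my own illustration, not real EVM semantics):

```python
class OutOfGas(Exception):
    pass

def run_transaction(state, instructions, gas_limit):
    snapshot = dict(state)       # so effects can be undone
    gas = gas_limit
    try:
        for op, cost in instructions:   # each instruction has a gas cost
            if cost > gas:
                raise OutOfGas()
            gas -= cost
            op(state)                   # mutate contract storage
    except OutOfGas:
        state.clear()
        state.update(snapshot)   # state reverted as if nothing happened...
        gas = 0                  # ...but the purchased gas is NOT refunded
    return gas_limit - gas       # gas units actually charged

# A transaction that runs out of gas leaves state untouched but
# still pays for the full gas limit.
state = {"x": 0}
def inc(s): s["x"] += 1
charged = run_transaction(state, [(inc, 30), (inc, 30)], gas_limit=40)
# state["x"] == 0 (reverted) and charged == 40 (full gas consumed)
```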

This program in the Ethereum Serpent language (Ethereum now uses Solidity as the programming language) implements a simple financial "swap" instrument. The contract allows two parties, Alice and Bob, to take opposing bets about the price of a stock at some future time. Both parties initially deposit equal amounts of money (as units of Ether currency). After a deadline has passed, the current price of the stock is queried by interacting with a designated stock price authority (implemented as a smart contract, StockPriceAuthority). Depending on the price at that time, the entire combined deposit is awarded to either Alice or Bob.

On lines 1 and 2, the contract's storage allocates space for 1) the public keys of Alice and Bob, and 2) the deadline and threshold of the swap contract. The contract also defines a function determine_outcome, which any party may invoke.
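The paper's Serpent listing isn't reproduced here, but the contract's logic can be sketched in Python (names and structure are my own approximation, not the actual code):

```python
class SwapContract:
    """Toy sketch of the paper's swap instrument: Alice and Bob each
    deposit equal stakes; after the deadline, the stock price decides
    who receives the whole pot."""
    def __init__(self, alice, bob, deposit, deadline, threshold):
        self.alice, self.bob = alice, bob
        self.balance = 2 * deposit          # both parties' stakes
        self.deadline = deadline
        self.threshold = threshold
        self.settled = False

    def determine_outcome(self, now, stock_price, send):
        # Any party may invoke this; it only acts after the deadline.
        if self.settled or now < self.deadline:
            return
        winner = self.alice if stock_price >= self.threshold else self.bob
        send(winner, self.balance)          # entire combined deposit
        self.balance = 0
        self.settled = True

payouts = {}
c = SwapContract("alice", "bob", deposit=10, deadline=100, threshold=50)
c.determine_outcome(now=101, stock_price=60,
                    send=lambda who, amt: payouts.update({who: amt}))
# payouts == {"alice": 20}
```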

Pitfalls of Smart Contract Programming

Section 4 describes prominent errors made when coding smart contracts. It uses a "rock, paper, scissors" smart contract and shows a surprising number of things that can go wrong in even that simple problem.

In Section 4.1, concurrency-caused errors take center stage. As the paper "Blockchains from a distributed computing perspective" argued, smart contracts have a surprising amount of concurrency in them despite the so-called chain-serialized transactions. That paper put it this way: "We have seen that the notion that smart contracts do not need a concurrency model because execution is single-threaded is a dangerous illusion."

In the "rock, paper, scissors" contract, what happens when a third player attempts to join and sends money to the contract? That money becomes inaccessible to anyone.

Another thing that was surprising to me is that when you are writing a smart contract, you need to take incentives into account as another constraint in the problem. In the "rock, paper, scissors" contract, for example, one party can wait for the other to open its commitment. Upon seeing that it will lose, that party may elect to abort (after all, revealing its committed input costs gas). But this denies payment to the other player as well. The solution is to require an additional security deposit and add a deadline after which not revealing implies losing.
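Here is a toy Python sketch of the commit-reveal-with-deposit fix (my own illustration; the paper's actual contract differs):

```python
import hashlib
import secrets

def commit(move: str):
    # Hide the move behind a hash until reveal time.
    nonce = secrets.token_hex(16)
    digest = hashlib.sha256(f"{move}:{nonce}".encode()).hexdigest()
    return digest, nonce

class RPSContract:
    """Commit-reveal with security deposits: a player who never
    reveals forfeits the deposit after the deadline."""
    def __init__(self, deposit, reveal_deadline):
        self.deposit = deposit
        self.reveal_deadline = reveal_deadline
        self.commitments = {}   # player -> commitment hash
        self.revealed = {}      # player -> move

    def submit(self, player, commitment):
        self.commitments[player] = commitment

    def reveal(self, player, move, nonce, now):
        if now > self.reveal_deadline:
            return False        # too late: counts as not revealing
        digest = hashlib.sha256(f"{move}:{nonce}".encode()).hexdigest()
        if digest == self.commitments.get(player):
            self.revealed[player] = move
            return True
        return False

    def settle(self, now):
        # After the deadline, non-revealers' deposits are split among
        # players who did reveal, so aborting is no longer free.
        if now <= self.reveal_deadline:
            return None
        honest = set(self.revealed)
        forfeits = len(self.commitments) - len(honest)
        share = self.deposit * (1 + forfeits / max(len(honest), 1))
        return {p: share for p in honest}

c = RPSContract(deposit=10, reveal_deadline=100)
h1, n1 = commit("rock")
c.submit("alice", h1)
h2, n2 = commit("paper")
c.submit("bob", h2)
c.reveal("alice", "rock", n1, now=50)   # Bob aborts and never reveals
payout = c.settle(now=101)              # Alice also gets Bob's deposit
```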

Step 1: You have a problem
Step 2: You introduce a blockchain to solve problem
Step 3: You now have an incentives problem, a trust distribution problem, a community management problem, a regulation problem, a speculation problem, an upgrade path problem.

Of course there are additional problems due to cryptography, how to keep the committed inputs secret, etc. Finally, Section 4.4 of the paper discusses Ethereum-specific bugs. I am not sure how many of these still apply after 3 years, in 2018.

MAD questions

1. Every miner needing to store all contracts' data storage may limit scalability, right? Is there a way to provide a sharding service for smart contracts as well? Is it possible to apply the techniques introduced in Aspen for this?

2. Every miner needing to execute the smart contract for validation is also very wasteful. How can that be relaxed? Is it possible to devise a way where the output can be checked quickly without having to execute the entire code?

PS: Relevant to this paper, here is the DAO smart contract attack modeling I wrote about earlier.

Friday, April 20, 2018

Book review. Crypto: How the code rebels beat the government---saving privacy in the digital age.

In graduate school, I read "Hackers: Heroes of the Computer Revolution" by Steven Levy and enjoyed it a lot. (I still keep the dog-eared paper copy with affection.) So I should have read Steven Levy's Crypto book a long time ago. But for some reason, I didn't... even though I was aware of the book. I guess that was due to a stupid quirk of mine; I had some aversion to security/cryptography research. I don't know why. Maybe it was because I had sat through a couple of bad security/cryptography talks (a similar aversion happened to me after a bad networking course). Another reason, I regret to admit, may be that I had some distributed systems snobbery going on at that time. I was so into the distributed systems/algorithms area that I was quick to label AI, security, and this and that as uninteresting or useless *to me*. I wish I had been more open-minded. I am sure reading this book back then would have changed my outlook toward security and cryptography.

(Side Remark: The lesson, kids, is to always keep an open mind. Being a snob is stupid. I have been seeing snobbery against blockchain work among some systems researchers, and that is wrong. I have criticized some parts of blockchain and fully-decentralized approaches many times on this blog, but I know better than being a snob. My brain is open, I am reading about it, and when I find something interesting and suitable for my skillset, I will be happy to work more on it and contribute.)

Coming back to the book, I recommend this book very highly. The book skillfully combines very personal stories of the researchers involved in crypto work with simple explanations of the important technical materials. Levy sure knows how to tell stories.

Now, to be hypercritical and to nitpick, the writing felt somewhat rushed in some places. The writing in the Hackers book was more skilled and better refined. My guess is Levy had to rush this one to publication. There were occasional ambiguous sentences, which a careful editor and proofreader would have caught.

Some selected parts from the book

I am not going to do a proper book review. Instead I find it more fun to include tidbits from the book without providing any context.

Pages 21&24: Whit Diffie is a careful reader

By now Diffie had finally gotten around to reading David Kahn's The Codebreakers. Since Diffie was a very slow, methodical reader, tackling a book of a thousand densely packed pages was a major undertaking for him. "He traveled everywhere with that book in hand," says his friend Harriet Fell. "If you invited him to dinner, he'd come with The Codebreakers." But Diffie found the hundreds of hours he spent on the book to be well worth the trouble.
"I read it more carefully than anyone had ever read it. Kahn's book to me is like the Vedas," he explains, citing the centuries-old Indian text. "There's an expression I learned: 'If a man loses his cow, he looks for it in the Vedas.' "
Why had Diffie's once-intermittent interest become such a consuming passion? Behind every great cryptographer, it seems, there is a driving pathology. ... "I had been looking all my life for some great mystery. ... I think somewhere deep in my mind is the notion that if I could learn just the right thing, I would be saved."

Pages 31-33. Diffie meets Hellman

"It was a meeting of the minds," says Hellman.
The half-hour meeting went on for an hour, two hours, longer. Hellman simply didn't want it to end, and Diffie, too, seemed eager to continue as long as possible. Hellman had promised his wife he'd be home by late afternoon to watch their two small children while she went off, so finally he asked Diffie back to his house. No problem! Diffie called Mary and she came over to have dinner with Whit and all the Hellmans, and it wasn't until 11:00 or so that night that the dialogue broke up.

Page 67. Diffie's existential crisis

Mary Fischer recalls the lowest point. One day she walked into the McCarthys' bedroom and found Diffie with his head in his hands, weeping. "I asked him what was wrong," she says, "and he told me he was never going to amount to anything, that I should find someone else, that he was --and I remember this exact term-- a broken-down old researcher."

Page 69. Diffie's breakthrough

That spring, Diffie had settled into a routine at the McCarthy house. Every morning he would make breakfast for Mary and Sarah, McCarthy's fourteen-year-old daughter. Then Mary would go off to work, Sarah would go off to school, and Diffie would stay home. One day in May 1975, he spent the morning hours thinking. After a lunch break, he returned to his mental work. For the umpteenth time, he had been thinking about the problem of establishing a secure log-in password on a computer network. Again, there was that old problem of having to trust the administrator with the secret password. How could you shut that third party out of the scheme entirely? Sometime in the afternoon, things suddenly became clear to Diffie: devise a system that could not only provide everything in Diffie's recently envisioned one-way authentication scheme but could also deliver encryption and decryption in a novel manner. It would solve the untrustworthy administrator problem, and much, much more.
He would split the key.
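As an aside: the key exchange Diffie later published with Hellman, which grew out of this insight, fits in a few lines. Here is a toy sketch with textbook-sized parameters (real deployments use primes of 2048 bits or more):

```python
import random

# Each party keeps its exponent secret and sends only g**secret mod p
# over the open channel -- the key is "split" into a public half and
# a private half. Textbook parameters p=23, g=5; illustration only.
p, g = 23, 5

a = random.randrange(1, p - 1)      # Alice's private exponent
b = random.randrange(1, p - 1)      # Bob's private exponent
A = pow(g, a, p)                    # public values: safe to send
B = pow(g, b, p)                    # in the clear

alice_shared = pow(B, a, p)         # (g**b)**a mod p
bob_shared = pow(A, b, p)           # (g**a)**b mod p
assert alice_shared == bob_shared   # same secret, never transmitted
```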

Pages 77-78. Merkle the Berkeley undergraduate student

Instead, for reasons that remain unclear but are probably related to Merkle's unconventional mind, he fixated on what struck him as a weird, somewhat challenging aspect of a more basic dilemma. The essential cryptographic scenario assumed that the channel of communication was vulnerable. ... But what measures could you exploit if you wanted to communicate with someone who wasn't already in possession of a pre-arranged, secure symmetrical key?
Merkle, unpolluted with knowledge about theory or history of crypto, was unaware of the apparent impossibility of his mission. He simply tried to solve the problem.
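The scheme Merkle arrived at, now known as Merkle's puzzles, can be sketched like this (a toy illustration of my own, with a deliberately brute-forceable 12-bit "cipher"):

```python
import hashlib
import random
import secrets

def weak_encrypt(payload: bytes, secret: int) -> bytes:
    # Stand-in for a deliberately weak cipher: XOR against a pad
    # derived from a secret small enough to brute-force (12 bits).
    pad = hashlib.sha256(secret.to_bytes(2, "big")).digest()
    return bytes(a ^ b for a, b in zip(payload, pad))

def make_puzzle(key_id: int, key: int) -> bytes:
    payload = b"MERK" + key_id.to_bytes(2, "big") + key.to_bytes(4, "big")
    return weak_encrypt(payload, random.randrange(2 ** 12))

def solve_puzzle(puzzle: bytes):
    for secret in range(2 ** 12):            # feasible for Bob, once
        payload = weak_encrypt(puzzle, secret)
        if payload[:4] == b"MERK":           # marker: brute force hit
            return (int.from_bytes(payload[4:6], "big"),
                    int.from_bytes(payload[6:10], "big"))

# Alice publishes n puzzles; Bob solves ONE picked at random and tells
# Alice the key id in the clear. An eavesdropper must solve ~n/2
# puzzles on average to find the same key.
alice_keys = {i: secrets.randbits(32) for i in range(100)}
puzzles = [make_puzzle(i, k) for i, k in alice_keys.items()]
key_id, shared_key = solve_puzzle(random.choice(puzzles))
# alice_keys[key_id] == shared_key
```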

Pages 90-100. Enter Ron Rivest and the RSA algorithm

Even so, "New Directions in Cryptography" [Diffie-Hellman 1976 paper] turned out to be more than interesting to Rivest: it thrilled him. Ultimately, it changed his life.
The paper appealed to Rivest's heart as well as his head. Rivest was a theoretician, but one for whom simple abstractions were not enough. The ideal for him was actually putting the ethereal mechanics of math to work, of making a tangible difference in the world of flesh and dirt.

On April 3, 1977, a graduate student named Anni Bruce held a Passover seder at her home. Rivest was there, and Shamir, and Adleman. For several hours ideas of mathematical formulas and factoring were put aside for a recapitulation of the escape of the Jewish people from Egypt. As is customary with seders, people downed a lot of wine. It was nearly midnight when Rivest and his wife returned home. While Gail Rivest got ready for bed, Ron stretched out on the couch and began thinking about the problem that had consumed him and his colleagues for months. He would often do that---lie flat on the sofa with his eyes closed, as if he were deep in sleep. Sometimes he'd sit up and flip through the pages of a book, not really looking, but reworking the numbers. He had a computer terminal at home, but that night he left it off. "I was just thinking," he says.

That was when it came to him---the cognitive lightning bolt known as the Eureka Moment. He had a scheme! It was similar to some of their more recent attempts in that it used number theory and factoring. But this was simpler, more elegant. Warning himself not to get overexcited---Shamir and Adleman, after all, had broken many of his previous proposals---he jotted down some notes. He did allow himself the luxury of saying to his wife that he'd come up with an idea that just might work. He doesn't remember phoning the guys that night. Adleman, though, insists that he received a call sometime after midnight.

Rivest insisted that it was a joint project, that Shamir's and Adleman's contributions were crucial, that the scheme was the final point in an evolutionary process. To Rivest, it was as if the three of them had been in a boat together, all taking turns rowing and navigating in search of a new land. Rivest might have stepped out of the boat first, but they all deserved credit for the discovery.

Where is the sequel?

OK, I am tired of typing. The book is 300 pages, and I only wrote up a few of the most interesting parts from the first 100 pages. Please go read the book; I think you will like it.

I hope Steven Levy writes a followup book on Bitcoin and blockchain. The book stops around 2000 after discussing David Chaum's Digicash. So cryptocurrencies and blockchain would be a natural sequel to this one. 

MAD questions

Both Diffie's and Rivest's breakthroughs came after many months of intense work and thinking. But after the breakthrough insight, the schemes become easy to derive and explain. That is sort of like a one-way trapdoor function, isn't it? A trapdoor function is a function that is easy to compute in one direction, yet difficult to compute in the opposite direction (finding its inverse) without special information, called the "trapdoor". Trapdoor functions are widely used in cryptography.
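To make the trapdoor notion concrete, here is textbook RSA with toy parameters (illustration only; real keys are thousands of bits):

```python
# The forward direction (computing m**e mod n) is easy for anyone;
# inverting it is believed hard unless you hold the trapdoor d,
# which requires knowing the factorization of n.
p, q = 61, 53
n = p * q                  # 3233; publishing n does not reveal p, q
e = 17                     # public exponent, coprime with (p-1)(q-1)
phi = (p - 1) * (q - 1)    # 3120; computable only by factoring n
d = pow(e, -1, phi)        # 2753, the trapdoor (Python 3.8+)

m = 65                     # a message
c = pow(m, e, n)           # easy direction: anyone can encrypt
assert pow(c, d, n) == m   # inverting is easy only with d
```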

I don't think there was chance involved in these discoveries. It seemed like those ideas were up in the air, vaguely hovering, and they had to go through a laborious condensation period before they materialized.

But here is an interesting page on the role of chance/serendipity in scientific discoveries.  Psychologist Kevin Dunbar and colleagues estimate that between 30% and 50% of all scientific discoveries are accidental in some sense.

Here is another interesting page on multiple discoveries / simultaneous invention. Merton believed that it is multiple discoveries, rather than unique ones, that represent the common pattern in science.

Thursday, April 12, 2018

Notes from USENIX NSDI Days 2 & 3

Before talking about Days 2 & 3, here are my notes from NSDI day 1 if you are interested.

Again all of the papers mentioned (and omitted) below are accessible publicly at the NSDI web page. Enjoy!

NSDI day 2 

Day 2 had the following sessions:

  • web and video
  • performance isolation and scaling
  • congestion control
  • cloud

Here are the papers I took detailed notes on. I ordered them by how interesting I found them.

Towards Battery-Free HD Video Streaming

This paper is by Saman Naderiparizi, Mehrdad Hessar, Vamsi Talla, Shyamnath Gollakota, and Joshua R Smith, University of Washington.

This work was mind-blowing for me. This is a strong group that has developed the Wisp motes before, but even so, the battery-free video streaming sensors presented were very impressive, and the talk included a live demo of them.

So here is the deal. The goal of this work is to design sticker-form-factor, battery-free cameras. But that is crazy, right? Video streaming is power hungry; how can you go battery-free?

The paper looks closely at the costs of the components of a video streaming camera. It turns out the image sensor, at 85 microwatts, is very low power. On the other hand, the radio, at 100 milliwatts, is very high power.

If we could only offload the task of communication away from the sensor, we could pull this off!

The inspiration for the idea comes from the Russian great seal bug. Inside the great seal was a very thin membrane that vibrated in response to speech. The Russian spies then used a directional remote radio to receive those analog vibrations and reconstruct the sound. The following is from the Wikipedia page on this seal bug, called "the Thing":
The "Thing" consisted of a tiny capacitive membrane connected to a small quarter-wavelength antenna; it had no power supply or active electronic components. The device, a passive cavity resonator, became active only when a radio signal of the correct frequency was sent to the device from an external transmitter. Sound waves (from voices inside the ambassador's office) passed through the thin wood case, striking the membrane and causing it to vibrate. The movement of the membrane varied the capacitance "seen" by the antenna, which in turn modulated the radio waves that struck and were re-transmitted by the Thing. A receiver demodulated the signal so that sound picked up by the microphone could be heard, just as an ordinary radio receiver demodulates radio signals and outputs sound.
The group applies the same principle to cameras to make them battery-free. They showed on stage the first demo of analog video backscatter, which sends pixels directly to the antenna. The range for the prototype of the system was 27 feet, and it streamed 112x112-resolution real-time video. A software-defined radio was used to recreate the backscattered analog video.

For the analog hardware, the speaker said they got inspiration from human brain signals and created pulse-width-modulated pixels. More engineering went into performing intra-frame compression, leveraging the fact that adjacent pixels are fairly similar.

Vesper: Measuring Time-to-Interactivity for Web Pages

This paper is by Ravi Netravali and Vikram Nathan, MIT CSAIL; James Mickens, Harvard University; Hari Balakrishnan, MIT CSAIL.

This work asks the question of "what does it mean for a page to load quickly?" In other words, how should we define the load time?

Page loads work like this. The URL typed in the browser prompts the server to send the HTML. The browser runs the HTML + JavaScript (requesting other embedded items as it encounters them in the HTML, but that is the topic of Ravi and James's other paper in the session). In the browser there is a JavaScript engine and a rendering engine, which constructs the DOM tree. The JS engine and DOM tree interact via the DOM API.

Often the page load metric used is the page load time. Of course, this is very conservative, because only some of the page content is immediately visible; the invisible part doesn't matter, so why care about the load time of the invisible parts? Making this observation, Google came up with the speed index: time to render above-the-fold, i.e., in the visible part of the browser.

But the speed index is also deficient, because it doesn't account for the JS code running time. The JS code running time affects the user experience, say via autocomplete, etc. An interactive page is not usable without JS, and today a large fraction of pages are interactive. The talk gave some statistics: a median of 182 handlers, and a 95th percentile of 1,252 handlers, in the pages surveyed.

To improve on the speed index, this work comes up with the ready index, which is defined as page time-to-interactivity in terms of visibility and functionality.

But the challenge is, nobody knows a good way to automatically identify the interactive state: is the JavaScript working yet?

The system, Vesper, uses a 2-phase approach:

  • identify the visible elements, their event handlers, and the state those handlers access when fired
  • track the loading progress of the interactive state identified in phase 1

As a side note, the speaker, Ravi, spoke in a relaxed and clear way. The talk didn't feel rushed, but covered a lot of stuff. I really loved the delivery.

Performance Analysis of Cloud Applications

This paper is by Dan Ardelean, Amer Diwan, and Chandra Erdman, Google.

This work from Google considers the question of how we should evaluate a change before we deploy it in production. Yes, the accepted approach is to use A/B testing over a small fraction of the users, but is that enough?

The abstract has this to say:
"Many popular cloud applications are large-scale distributed systems with each request involving tens to thousands of RPCs and large code bases. Because of their scale, performance optimizations without actionable supporting data are likely to be ineffective: they will add complexity to an already complex system often without chance of a benefit. This paper describes the challenges in collecting actionable data for Gmail, a service with more than 1 billion active accounts.
Using production data from Gmail we show that both the load and the nature of the load changes continuously. This makes Gmail performance difficult to model with a synthetic test and difficult to analyze in production. We describe two techniques for collecting actionable data from a production system. First, coordinated bursty tracing allows us to capture bursts of events across all layers of our stack simultaneously. Second, vertical context injection enables us to combine high-level events with low-level events in a holistic trace without requiring us to explicitly propagate this information across our software stack."
The vertical context injection roughly means collecting the trace at the kernel level, using ftrace, where the layers above the kernel inject information into the kernel via stylized syscalls with payloads.
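My rough mental model of this trick, as a sketch (illustrative only; the paper's actual mechanism and choice of syscall may differ):

```python
import os

MAGIC = 0x6D757261   # marker so trace analysis can spot our events

def inject_event(event_id: int) -> None:
    try:
        # pread on fd -1 fails harmlessly with EBADF, but the syscall
        # -- carrying a magic offset that encodes event_id -- still
        # shows up in the kernel's ftrace stream, interleaving this
        # high-level event with the low-level kernel events.
        os.pread(-1, 0, (MAGIC << 32) | event_id)
    except OSError:
        pass

inject_event(7)   # harmless no-op that leaves a marker in the trace
```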

The paper concludes with these observations. For meaningful performance experiments:

  • do experiments in production
  • use controlled A/B tests with 10 million users (fewer is not very meaningful)
  • use long-lived tests to capture the changing mix of requests
  • use creative approaches (vertical context injections) for collecting rich data cheaply.

LHD: Improving Cache Hit Rate by Maximizing Hit Density

This paper is by Nathan Beckmann, Carnegie Mellon University; Haoxian Chen, University of Pennsylvania; Asaf Cidon, Stanford University and Barracuda Networks.

Who knew... Cache eviction policies still require work and you can achieve big improvements there.

To motivate the importance of cache hit rate research, Asaf mentioned the following. The key-value cache is 100x faster than the database. For Facebook, improving its cache hit rate of 98% by just an additional 1% would improve performance by 35%.
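A back-of-the-envelope model shows why a single point of hit rate matters so much: going from 98% to 99% halves the miss rate, and misses dominate latency when the backing store is ~100x slower. (This toy arithmetic gives ~1.5x; the 35% figure comes from Facebook's real workload, not this model.)

```python
def avg_latency(hit_rate, cache_cost=1, db_cost=100):
    # Expected cost per request: hits pay the cache cost,
    # misses pay the (100x slower) database cost.
    return hit_rate * cache_cost + (1 - hit_rate) * db_cost

before = avg_latency(0.98)   # 0.98 + 2.00 = 2.98
after = avg_latency(0.99)    # 0.99 + 1.00 = 1.99
speedup = before / after     # ~1.5x in this simple model
```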

Here is the abstract:
Cloud application performance is heavily reliant on the hit rate of datacenter key-value caches. Key-value caches typically use least recently used (LRU) as their eviction policy, but LRU’s hit rate is far from optimal under real workloads. Prior research has proposed many eviction policies that improve on LRU, but these policies make restrictive assumptions that hurt their hit rate, and they can be difficult to implement efficiently.
We introduce least hit density (LHD), a novel eviction policy for key-value caches. LHD predicts each object’s expected hits-per-space-consumed (hit density), filtering objects that contribute little to the cache’s hit rate. Unlike prior eviction policies, LHD does not rely on heuristics, but rather rigorously models objects’ behavior using conditional probability to adapt its behavior in real time.
To make LHD practical, we design and implement RankCache, an efficient key-value cache based on memcached. We evaluate RankCache and LHD on commercial memcached and enterprise storage traces, where LHD consistently achieves better hit rates than prior policies. LHD requires much less space than prior policies to match their hit rate, on average 8x less than LRU and 2–3x less than recently proposed policies. Moreover, RankCache requires no synchronization in the common case, improving request throughput at 16 threads by 8x over LRU and by 2x over CLOCK.
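My reading of LHD's core idea, as a sketch (not the authors' code): rank each cached object by its hit density, the expected hits per unit of space and time it occupies, and evict the lowest.

```python
def hit_density(expected_hits, size, expected_lifetime):
    # Hits contributed per byte of cache, per unit of time occupied.
    return expected_hits / (size * expected_lifetime)

def choose_victim(objects):
    # objects: key -> (expected_hits, size, expected_lifetime)
    return min(objects, key=lambda k: hit_density(*objects[k]))

cache = {
    "a": (0.9, 100, 50),    # likely hit again soon: keep
    "b": (0.9, 100, 500),   # same hit odds but occupies space longer
    "c": (0.05, 100, 50),   # unlikely to be hit again: evict first
}
victim = choose_victim(cache)   # "c"
```

In the real system, the expected hits and lifetimes are not given; LHD estimates them online from conditional probabilities over object ages, which is the hard part this sketch omits.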

Poster session

There was a poster session at the end of day 2. I wish there were more of the preliminary but bold work in the poster session, because most of the posters were just a poster accompanying the presented paper in the main track.

I liked these two posters the most. They are both very interesting works in progress.
  • Distributed Test Case Generation using Model Inference. Stewart Grant and Ivan Beschastnikh, University of British Columbia
  • High Performance and Usable RPCs for Datacenter Fabrics. Anuj Kalia, Carnegie Mellon University; Michael Kaminsky, Intel Labs; David G. Andersen, Carnegie Mellon University 

NSDI Day 3

The day 3 had these sessions:

  • Network monitoring and diagnosis
  • Fault-Tolerance
  • Physical Layer
  • Configuration Management

Since the sessions were networking-specific, and since I was tired of the firehose of information spewed at me in the first 2 days, I didn't take many notes on day 3. So I will just include my notes on the Plover paper from the fault-tolerance session.

PLOVER: Fast, Multi-core Scalable Virtual Machine Fault-tolerance

This paper is by Cheng Wang, Xusheng Chen, Weiwei Jia, Boxuan Li, Haoran Qiu, Shixiong Zhao, and Heming Cui, The University of Hong Kong.

This work builds on the Remus paper, which appeared in NSDI'08 and received a test-of-time award this year at NSDI.

The two limitations of the Remus primary/backup VM replication approach were that:

  • too many memory pages need to be copied and transferred, and
  • a split brain is possible due to partitioning.

This work, Plover, uses Paxos to address these problems. Paxos helps with both. By using 3 nodes, it doesn't suffer from the split brain that the primary/backup approach suffers from. By totally ordering the requests seen by replicas, it can avoid copying memory pages. Replicas executing the same sequence of inputs should have the same state --- well, of course, assuming deterministic execution.

The drawback with Paxos is that it cannot handle nondeterminism in request execution. To fix this, Plover invokes VM page synchronization periodically before releasing replies.

The good news is that using Paxos to totally order requests makes most memory pages identical: the paper reports 97% of the pages being the same. So the VM synchronization is lightweight, because it only needs to take care of the remaining 3% of pages.
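Why total ordering makes the sync cheap can be sketched with page hashing (my own illustration, not Plover's implementation):

```python
import hashlib

def page_digest(pages):
    # One hash per memory page.
    return {i: hashlib.sha256(p).hexdigest() for i, p in enumerate(pages)}

def pages_to_transfer(primary_pages, backup_digests):
    # Ship only the pages whose contents diverged; if replicas
    # executed the same totally-ordered requests, that is few pages.
    digests = page_digest(primary_pages)
    return [i for i, h in digests.items() if backup_digests.get(i) != h]

primary = [b"A" * 4096, b"B" * 4096, b"C" * 4096]
backup = [b"A" * 4096, b"B" * 4096, b"X" * 4096]   # diverged on 1 page
diff = pages_to_transfer(primary, page_digest(backup))   # [2]
```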

Plover is available on GitHub.

I wonder, since Plover already does VM synchronization, does it really need 100% totally-ordered delivery of requests to the replicas via Paxos? Would a relaxed but faster solution be possible? The Tapir project explored relaxed ordering of operations for storage systems, and some of its lessons may be applicable here.

MAD questions

Ok, the MAD questions today pick up the thread from last time. How do you improve the conference experience? Are conferences cramming too many technical sessions into a day? What can be done differently to improve the interactivity and networking of the conference participants?

A major reason I go to conferences is to meet people doing interesting work and converse with them, to learn about their perspectives and thought processes.

During the 3 days at NSDI, I talked with 14 people. That is not a bad number for me. I am not from the networking and NSDI community, so I don't know most people there. I get more chances to interact if I go to a conference where I know more people. Unfortunately, since I kept switching research areas (theoretical distributed systems 98-00, wireless sensor networks 00-10, smartphones/crowdsourcing 10-13, cloud computing 13-18), I don't have a home conference where I know most people.

Out of these 14 people, I only knew 3 before. Of the remaining, I knew a couple from interacting on Twitter, but the majority were cold-hello first-time interactions.

The cold-hello interactions are hard, and as an introvert and a shy person (except when I am curious) I had to force myself to have these first-time interactions. I assume the people I approach are also interested in talking to people (that is what conferences are supposed to be about), and that we can have nice, interesting conversations since we share an interest in distributed systems, or at least in research. I would say 75% of the conversations I had were interesting and not superficial. But sometimes a conversation bombs, and that gets awkward. And instead of being happy about the nice interactions you had, it is easy to focus on the awkward ones and feel bad about them.

Although I am happy with meeting 14 interesting people, this is much lower than the number of people I meet and talk with at HPTS. If you look at my posts about HPTS, you can see that I made it a point to emphasize how much I enjoyed its interactivity.

I think a major way HPTS makes this happen is by setting the intentions clear and stating them explicitly on the first day. Pat Helland takes the stage and says that "the point of HPTS is to meet other people and interact, and the sessions are just a break from meeting/networking with other people". Since HPTS makes the cold-hello the norm, it does not feel weird anymore. I never had an awkward conversation at HPTS.

I am sure there are many ways to build interactivity and networking into conferences. Why don't we make the poster session a long session in the afternoon, rather than after 6pm? Are there any ice-breaker activities that conferences can adopt? I remember that at an activity with 20 people, the moderator asked everyone to say something uniquely quirky about themselves. That broke the ice pretty quickly; I still remember some of the quirks people mentioned. To scale to larger groups, it may be possible to have open-mike crazy-ideas, dangerous-ideas, and hot-takes sessions. Maybe we need to get some professionals to help, say improv comedians or capable event ice-breakers. (I assume big companies like Google have skilled HR people who help tech people interact better, right?)

Ok, this can get long, and I am not really knowledgeable in this area, so I will stop here. But here is my ask: next time you see me at a conference, please approach me and say hello. I am sure we will have a nice conversation and things to learn from each other.

Maybe next time I should wear a badge to make this explicit: "Please say hi to me, I would love to meet and talk with you."

Wednesday, April 11, 2018

Notes from USENIX NSDI 18 First Day

I have been attending USENIX NSDI, one of the premier conferences on networking, at Seattle, WA. Here are some notes from the first day, Monday, April 9.

Pre-session announcements

NSDI accepted 40 out of 255 submitted papers. There was a mention of 1000 reviews done for the conference. That is a lot of reviews by very highly qualified people. It is a shame those reviews are not shared openly: they could be really useful for the community to learn from, and providing them might also expose any sub-par reviews. There is a movement for an open review process, and I hope it catches on at a wider scale.

The best paper award went to "NetChain: Scale-Free Sub-RTT Coordination" by Xin Jin, Johns Hopkins University; Xiaozhou Li, Barefoot Networks; Haoyu Zhang, Princeton University; Nate Foster, Cornell University; Jeongkeun Lee, Barefoot Networks; Robert Soulé, Università della Svizzera italiana; Changhoon Kim, Barefoot Networks; Ion Stoica, UC Berkeley.

The community award (for the best paper whose code and dataset are made publicly available) went to "Stateless Datacenter Load-balancing with Beamer" by Vladimir Olteanu, Alexandru Agache, Andrei Voinescu, and Costin Raiciu, University Politehnica of Bucharest.

NSDI makes all papers publicly available. So if any of the papers here interest you, you can download and read them.

First session 

The first session was on new hardware. The main theme was how to get the performance that hardware solutions offer with the programmability of software solutions. I provide short summaries of the presentations of two papers. The session also included two other papers, titled "PASTE: A Network Programming Interface for Non-Volatile Main Memory" and "Azure Accelerated Networking: SmartNICs in the Public Cloud" (from Microsoft).

Approximating fair queueing on reconfigurable switches

The paper is by Naveen Kr. Sharma and Ming Liu, University of Washington; Kishore Atreya, Cavium; Arvind Krishnamurthy, University of Washington.

Congestion control today is done via end-to-end protocols; the switches are dumb. What if they were smarter? That would provide benefits for the end hosts, fairness, etc.

But smart switches are challenging to realize for high-speed networks. In high-speed networks, it is hard to:

  • maintain a sorted packet buffer
  • store per-flow counters
  • access and modify the current round number.

The work implements simulated fair queueing (fair queueing without per-flow queues) in high-speed switches. The approach is based on approximate fair queueing: simulating a bit-by-bit round-robin scheme with key approximations.

The approximate flow counters are stored using a variation of the count-min sketch. The results show that approximate fair queueing is achieved with 12-14 queues. The evaluation also shows that approximate fair queueing leads to a 4-8x improvement in flow completion times.
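To give a feel for why a count-min sketch fits the switch setting, here is a minimal textbook-style sketch in Python (my own illustration; the paper uses a hardware-friendly variation). The structure maintains per-flow counters in fixed memory: estimates never undercount, but may overcount due to hash collisions, which is the approximation the switch tolerates.

```python
import random

class CountMinSketch:
    """Approximate per-flow counters in fixed memory.
    Estimates never undercount; they may overcount on hash collisions."""
    def __init__(self, depth=4, width=256, seed=42):
        rng = random.Random(seed)
        self.salts = [rng.getrandbits(32) for _ in range(depth)]
        self.width = width
        self.table = [[0] * width for _ in range(depth)]

    def _cols(self, flow_id):
        # One hashed column per row; salts make the rows independent.
        return [hash((salt, flow_id)) % self.width for salt in self.salts]

    def add(self, flow_id, count=1):
        for row, col in zip(self.table, self._cols(flow_id)):
            row[col] += count

    def estimate(self, flow_id):
        # The minimum over rows is the least-collided, tightest estimate.
        return min(row[col] for row, col in zip(self.table, self._cols(flow_id)))

cms = CountMinSketch()
for _ in range(100):
    cms.add("flow-a")
cms.add("flow-b", 5)
print(cms.estimate("flow-a") >= 100)  # → True (never undercounts)
```

The memory cost is depth × width counters regardless of how many flows exist, which is what makes per-flow state feasible on a switch.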

NetChain: Scale-Free Sub-RTT Coordination

The paper is by Xin Jin, Johns Hopkins University; Xiaozhou Li, Barefoot Networks; Haoyu Zhang, Princeton University; Nate Foster, Cornell University; Jeongkeun Lee, Barefoot Networks; Robert Soulé, Università della Svizzera italiana; Changhoon Kim, Barefoot Networks; Ion Stoica, UC Berkeley.

This work received the best paper award. It is also in the distributed coordination area I am interested in, so I provide relatively long coverage of it.

Conventional wisdom says coordination is expensive. NetChain aims to provide lightning-fast coordination enabled by programmable switches. Some applications of such a coordination service are configuration management, group membership, locking, and barrier synchronization.

Right now these are done via a coordination service running Paxos, which is often implemented over a strongly-consistent, fault-tolerant key-value store. Can we do better in terms of latency and throughput?

The opportunity for in-network coordination is that distributed coordination is communication-heavy rather than computation-heavy. The idea is to run coordination in the switches using consensus. The design goals are high throughput and strong consistency.

How do they build a strongly consistent, fault-tolerant, in-network (which means in the switches) key-value store? They use the chain replication protocol. That is, there is a master configurator managing the chain configuration, and the storage nodes on the chain replicate the key-values.
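The chain replication write path can be sketched in a few lines. This is a generic illustration of the technique (node names and the in-memory store are made up, and NetChain implements this on switches, not in Python): writes enter at the head and flow down the chain, an ack is produced only once the tail has applied the write, and reads go to the tail so they always see fully replicated values.

```python
class ChainNode:
    """One replica in a chain; `next` points to its successor."""
    def __init__(self, name):
        self.name = name
        self.store = {}
        self.next = None

    def write(self, key, value):
        self.store[key] = value
        if self.next is not None:
            return self.next.write(key, value)  # propagate toward the tail
        return "ack"  # tail reached: the write is fully replicated

# A three-node chain: head -> mid -> tail.
head, mid, tail = ChainNode("s1"), ChainNode("s2"), ChainNode("s3")
head.next, mid.next = mid, tail

assert head.write("x", 42) == "ack"
print(tail.store["x"])  # → 42  (reads at the tail see committed writes)
```

The appeal for switches is that this write path is pure forwarding plus a local store update, with no leader election or quorum counting on the data path.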

The in-network (that is, on the switch) key-value storage builds on the SOSP 17 paper titled "NetCache: Balancing Key-Value Stores with Fast In-Network Caching" which leverages register arrays in the switches.

There is a problem of possible out-of-order delivery between consecutive switches in the chain, which can lead to inconsistency. The presentation says they solve this with serialization via sequence numbers, dropping out-of-order packets. The onus is on the client to retry when its request doesn't get a reply in time.

Of course, the complicated part is handling a switch failure. The chain replication technique for recovery is adapted, so that the master configurator (which is Paxos-maintained) can reconfigure the chain by removing the crashed node. But then, to preserve fault-tolerance, a new node needs to be added later. The paper says the master first copies the state to the newly added node and then adds it to the chain. Of course, there is a catching-up problem for the newly added switch if the previous node keeps receiving and inserting new items. There needs to be some blocking to coordinate this, probably via 2-phase commit. I quickly scanned the paper and didn't see this discussed.

The extended arXiv version of the paper has a TLA+ specification. I checked the TLA+ model, and it assumes copying state to the new switch is done atomically, which seems to be an over-simplification of what needs to be implemented in reality.

The work is evaluated on 4 Barefoot Tofino switches and 4 commodity servers. I think the presenter said that up to 100K key-value pairs could be stored in the 8Mb of storage available on the switches. Compared to ZooKeeper, the solution provided 3 orders of magnitude improvement in throughput, 1 order of magnitude improvement in read latency, and 2 orders of magnitude in write latency.

The presenter concluded by asking what kind of new applications could be enabled by a faster coordination service in datacenters. I am still unclear what sub-RTT means in the title, but I will read the paper and write a longer review later.

Second session

The second session was on distributed systems, my favorite topic. I have brief summaries of the three papers presented in this session.

zkLedger: Privacy-Preserving Auditing for Distributed Ledgers

This paper is by Neha Narula, MIT Media Lab; Willy Vasquez, University of Texas at Austin; Madars Virza, MIT Media Lab.

Neha presented this paper, and pulled it off without using the word blockchain even once.

Verification provided by distributed ledgers should not mean everything is in the open. Companies require some privacy to keep business strategies and positions secret. On the other hand, too much secrecy is also a bad thing; there needs to be some auditability to prevent bad actors.

zkLedger provides practical privacy and complete auditing.

A big challenge here is that a bad-actor bank could omit transactions. zkLedger jointly addresses the privacy and auditability requirements by keeping an entry for every bank in every transaction. To hide values, zkLedger uses Pedersen commitments. The key insight is that the auditor audits every transaction. zkLedger also uses an interactive map/reduce paradigm over the ledger with non-interactive zero-knowledge proofs (NIZKs) to compute measurements that go beyond sums.
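The property that makes Pedersen-style commitments useful for auditing is additive homomorphism: the product of two commitments commits to the sum of the committed values, so an auditor can check sums over a ledger column without seeing the individual values. The toy sketch below illustrates only that algebraic property; the parameters are tiny made-up numbers, not a secure group, and real Pedersen commitments require carefully chosen group parameters.

```python
# Toy Pedersen-style commitment: commit(v, r) = G^v * H^r mod P.
# The blinding r hides v; the homomorphism enables auditing sums.
P = 1000003  # small prime modulus (insecure, for illustration only)
G, H = 2, 3  # toy "generators" (not a secure choice)

def commit(value, blinding):
    return (pow(G, value, P) * pow(H, blinding, P)) % P

# Homomorphism: the product of commitments commits to the sum of values
# (and the sum of blindings), without revealing either addend.
c1 = commit(40, 7)
c2 = commit(2, 11)
print((c1 * c2) % P == commit(42, 18))  # → True
```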

The abstract of the paper provides a good summary, so here it is:
Banks create digital asset transactions that are visible only to the organizations party to the transaction, but are publicly verifiable. An auditor sends queries to banks, for example "What is the outstanding amount of a certain digital asset on your balance sheet?" and gets a response and cryptographic assurance that the response is correct. zkLedger has two important benefits over previous work. First, zkLedger provides fast, rich auditing with a new proof scheme using Schnorr-type non-interactive zero-knowledge proofs. Unlike zk-SNARKs, our techniques do not require trusted setup and only rely on widely-used cryptographic assumptions. Second, zkLedger provides completeness; it uses a columnar ledger construction so that banks cannot hide transactions from the auditor, and participants can use rolling caches to produce and verify answers quickly. We implement a distributed version of zkLedger that can produce provably correct answers to auditor queries on a ledger with a hundred thousand transactions in less than 10 milliseconds.

Exploiting a Natural Network Effect for Scalable, Fine-grained Clock Synchronization

This paper is by Yilong Geng, Shiyu Liu, and Zi Yin, Stanford University; Ashish Naik, Google Inc.; Balaji Prabhakar and Mendel Rosenblum, Stanford University; Amin Vahdat, Google Inc.

This work aims to provide accurate timestamping as a primitive (with 10-nanosecond accuracy) in datacenters at scale. Tightly-synchronized clocks have applications in distributed systems, particularly for distributed databases such as Spanner and CockroachDB.

The system they develop, Huygens, takes NIC timestamps (supported by most current-generation NICs) and doesn't require specialized switches like PTP does. Other than the NIC timestamping support, Huygens is software-based.

Huygens leverages three key ideas. First, coded probes are used for identifying and rejecting impure probe data that suffered queuing delays. To detect this, Huygens sends 2 consecutive probe packets with a known gap of 10 microseconds (using NIC timestamping) and checks the gap between them on the receiving end. Only the probes that arrive with the original 10-microsecond gap are accepted as pure, since they most likely experienced zero queueing delay. Since the queues change fast, it is very unlikely that both of the consecutive packets were subject to the same nonzero queueing delay.
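The coded-probe filter boils down to a gap comparison. Here is my own sketch of the idea (the tolerance value and timestamp numbers are made up for illustration; the paper's actual filtering is more involved): a pair whose receive-side gap is stretched or compressed must have had a packet queued, so it is rejected.

```python
GAP_US = 10          # gap the sender encodes between the probe pair
TOLERANCE_US = 0.1   # assumed slack for NIC timestamp jitter

def pure_probes(probe_pairs):
    """Keep only probe pairs whose receive-side gap matches the sent gap.
    A distorted gap means one packet queued, so the pair is 'impure'
    and not used for clock-offset estimation."""
    return [(t1, t2) for (t1, t2) in probe_pairs
            if abs((t2 - t1) - GAP_US) <= TOLERANCE_US]

pairs = [(100.0, 110.0),   # gap preserved: pure
         (200.0, 217.5),   # second packet queued: impure
         (300.0, 310.05)]  # within tolerance: pure
print(len(pure_probes(pairs)))  # → 2
```

Filtering at the receiver is cheap, so Huygens can probe aggressively and still feed only clean samples to the later estimation stages.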

Second, Huygens processes the purified data with Support Vector Machines, a widely-used and powerful classifier, to accurately estimate one-way propagation times. (Huygens assumes the delay between two servers is symmetric.)

Finally, to detect and correct synchronization errors even further, Huygens exploits the network effect that a group of pair-wise synchronized clocks must be transitively synchronized.

One of the questions asked was about the high probing rate employed by this work. Another question was whether this could be extended to the WAN. The presenter mentioned 10-microsecond accuracy in a WAN experiment, but I wonder if that is due to Google datacenters having private, high-speed links.

One very interesting question was whether this could be used for measuring the temperature in datacenters. This is a great question, not a crazy one. High temperature makes a local clock run slower, and there is a predictable linear relationship. So if you have tightly synchronized clocks, you can measure the drift from ideal time and infer temperature increases/decreases.
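To make the inference concrete, here is a back-of-the-envelope sketch. The drift coefficient and baseline below are made-up numbers for illustration, not datasheet values; the point is only that a linear drift model can be inverted to turn a measured drift into a temperature estimate.

```python
# Hypothetical numbers: oscillator drift is modeled as linear in
# temperature near the operating point. Both constants are assumptions.
PPM_PER_DEG_C = -0.04   # assumed drift slope: ppm per degree Celsius
BASELINE_TEMP_C = 25.0  # temperature at which the clock runs true

def inferred_temperature(observed_drift_ppm: float) -> float:
    """Invert the linear drift model to estimate temperature from the
    drift a synchronized-clock system measures against ideal time."""
    return BASELINE_TEMP_C + observed_drift_ppm / PPM_PER_DEG_C

# A clock measured to run 0.4 ppm slow suggests ~10 degrees above baseline.
print(round(inferred_temperature(-0.4), 6))  # → 35.0
```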

I will read and summarize this paper later, since I am interested in the mechanisms and applications of clock synchronization in distributed systems. After our work on hybrid logical clocks (HLC), we built a distributed monitoring system, Retroscope, using HLC.

SnailTrail: Generalizing Critical Paths for Online Analysis of Distributed Dataflows

This paper is by Moritz Hoffmann, Andrea Lattuada, John Liagouris, Vasiliki Kalavri, Desislava Dimitrova, Sebastian Wicki, Zaheer Chothia, and Timothy Roscoe, ETH Zurich.

I provided a summary of this work earlier on my blog; please see that post.

Third session

The third session was on traffic management. I am not really a networking guy, but I listened to these talks to get a better understanding of what is going on in this domain.

The papers presented were:

  • "Balancing on the Edge: Transport Affinity without Network State", 
  • "Stateless Datacenter Load-balancing with Beamer", 
  • "Larry: Practical Network Reconfigurability in the Data Center", and
  • "Semi-Oblivious Traffic Engineering: The Road Not Taken"

Fourth session 

The fourth session was on network function virtualization and hardware.

The papers presented were:

  • Metron: NFV Service Chains at the True Speed of the Underlying Hardware
  • G-NET: Effective GPU Sharing in NFV Systems
  • SafeBricks: Shielding Network Functions in the Cloud
All of these papers are available on the NSDI'18 webpage.

MAD questions

Really, you read this far into this long post, and still expect me to write some MAD questions? Ok, here is just one, so that my promise is not broken.

The afternoon, and especially the late afternoon, isn't really great for listening to conference talks. Following talks requires a lot of mental effort, because the deliveries are very quick. Moreover, the context changes drastically from talk to talk, and that also depletes attention.

I wonder if at least the late afternoons would be better spent with panels, or other more lively and less concentration-demanding activities.

I have heard of a book on these topics, "When: The Scientific Secrets of Perfect Timing" by Daniel Pink. I plan to check out that book.

Tuesday, April 3, 2018

The Stellar Consensus Protocol: A Federated Model for Internet-level Consensus

Last week in our seminar we discussed the Stellar consensus paper.

The paper is long, 32 pages. It looks like the paper is written in the way the protocol was conceived and reasoned about. First comes a section on the federated Byzantine agreement (FBA) model, which talks about quorum slices and the quorums that result from them. The next section talks about the requirements for quorum intersections, and defines the dispensable sets with associated theorems. Then comes the federated voting section, with subsections titled: Voting with open membership, Blocking sets, Accepting statements, Accepting is not enough, Statement confirmation, and Liveness and neutralization. On page 19, the Stellar Consensus Protocol (SCP) section starts, with more definitions and proofs intertwined with the protocol description. At this point, the reader is already overwhelmed, trying to maintain in mind theories about how the previous 19 pages might connect back to this protocol, and is confronted with the task of reading through a 9-page-long SCP protocol section.

The paper would be significantly improved if it were rewritten top-down: not in the way the protocol was conceived and proved, but in a reader-friendly manner prioritizing clear communication of the protocol basics.

It was hard reading through this paper, and I didn't read it thoroughly. Here is what I understand:
The Stellar Consensus Protocol is PBFT using quorums derived from the quorum slices of federated participants, instead of traditional quorums from a closed set of participants.
It would be nice if SCP provided a mechanism that prevents participants from selecting bad quorum slices that lead to bad quorums.

Background and context

Traditional Byzantine agreement protocols have closed membership: the number of participating nodes and their identities (via public/private keys or via non-transferable symmetric-key MACs) are fixed.

SCP is a federated Byzantine agreement (FBA) protocol, which allows open membership instead of requiring closed membership. In a federated open-membership model, we don't know the number of nodes in the entire system or all of their identities. Nodes may join and leave.

One way to deal with open membership is proof-of-work based blockchains, as in Bitcoin. That approach has problems: excessive energy consumption, poor throughput scalability, and, due to the probabilistic nature of commit, long wait times to get a good guarantee of irreversibility of a transaction.

SCP does not use proof-of-work based blockchains. It adapts the PBFT protocol to work in an open-membership federated environment. PBFT is a 3-phase deterministic Byzantine consensus protocol. It has similarities with Paxos: in fact, if you extend Paxos, which just tolerates crash faults, to tolerate Byzantine faults, you get pretty much the PBFT protocol.

The federated model

The federated model means that a node can reach PBFT/consensus agreement with a set of nodes it specifies, without involving all the nodes in the network in this agreement.

To this end, each node specifies a quorum slice in its config file. The quorum slice consists of the nodes it trusts, and is hopefully a diverse, well-balanced portfolio. By declaring its quorum slice, the node says that it finds the consortium of these nodes (not necessarily each one individually) trustworthy, will rely on this consortium to convince itself of agreement, and will rely on them to bless/endorse its transactions. Traditional non-federated Byzantine agreement requires all nodes to accept the same slices; in FBA, the key innovation is enabling each node to choose its own quorum slice set.

These quorum slices are used for constructing quorums. A quorum is a set of nodes sufficient to reach agreement.

For safety, any two quorums in the network need to intersect, and the intersection should contain non-Byzantine nodes. If a quorum intersection consists entirely of Byzantine nodes, then SCP cannot guarantee safety.

The onus is on the users to provide good quorum slices. SCP does not provide a way to check the soundness/integrity of the quorum slices that give rise to quorums. Again, if a quorum intersection consists entirely of Byzantine nodes, safety is violated, and SCP doesn't accept responsibility for that. To quote the paper: "SCP can only guarantee safety when nodes choose adequate quorum slices."
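The safety condition on quorums can be stated as a small check. This is my own illustration, not code from the paper: given a set of quorums and a set of suspected-Byzantine nodes, verify that every pair of quorums intersects in at least one non-Byzantine node.

```python
from itertools import combinations

def safe_quorum_system(quorums, byzantine):
    """Check the FBA safety condition: every pair of quorums must
    intersect, and each intersection must contain at least one
    non-Byzantine node."""
    for q1, q2 in combinations(quorums, 2):
        overlap = set(q1) & set(q2)
        if not overlap or overlap <= set(byzantine):
            return False  # disjoint, or overlap is entirely Byzantine
    return True

quorums = [{"a", "b", "c"}, {"c", "d", "e"}]
print(safe_quorum_system(quorums, byzantine={"c"}))  # → False: only 'c' overlaps
print(safe_quorum_system(quorums, byzantine={"d"}))  # → True
```

Of course, in the real federated setting no single party sees all the quorums, which is exactly why SCP cannot run such a global check itself.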

The SCP protocol starts with a nomination phase, which if run long enough, eventually produces the same set of candidate values at every intact node, which means nodes can combine the candidate values in a deterministic way to produce a single composite value for the slot. Upon predicted/approximated convergence of nomination phase, the nodes start the ballot phase to perform federated voting (PBFT) to commit and abort ballots associated with composite values.

When intact nodes agree to commit a ballot, the value associated with the ballot will be externalized for the slot in question. When they agree to abort a ballot, the ballot's value becomes irrelevant. If a ballot gets stuck in a state where one or more intact nodes cannot commit or abort it, then nodes try again with a higher ballot; they associate the new ballot with the same value as the stuck one in case any node believes the stuck ballot was committed.

Safety results from ensuring that all stuck and committed ballots are associated with the same value. Liveness follows from the fact that a stuck ballot can be neutralized by moving to a higher ballot. The good news about SCP is that provided that the quorum condition is satisfied, any committed decision is instantly irreversible.

Here are some videos on SCP (well, mostly the motivation and setup of federated Byzantine agreement, without the SCP protocol description).

MAD questions

1. What are the scalability limits of SCP? 
The requirement that quorums intersect is a limitation. If quorum selection is not done carefully, you may need majority quorums for the system, and a PBFT-based protocol would suffer beyond 20 nodes per quorum and 40 nodes in the system. But there are better ways to select your quorum slices: instead of a flat system, if you use a hierarchical system with tier-1, tier-2, and tier-3 nodes, and choose your quorums through this hierarchy, you can satisfy the quorum property with about log(N) nodes, in contrast to N/2 nodes. Hierarchies actually work pretty well for scaling. Think of a tree with 10 children per node: at level 4 there will be 10,000 nodes, and at level 5, 100,000 nodes.
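A back-of-the-envelope sketch of that comparison (my own rough model, not SCP's mechanism): in a flat system a majority quorum needs N/2 + 1 nodes, while in a tree with fanout 10, taking a majority of children at each of the log(N) levels yields a quorum of only a few dozen nodes.

```python
def tree_levels(n, fanout):
    """Number of levels in a tree with the given fanout covering n nodes."""
    levels = 0
    while n > 1:
        n = (n + fanout - 1) // fanout
        levels += 1
    return levels

def flat_quorum_size(n):
    """Majority quorum in a flat system: more than half the nodes."""
    return n // 2 + 1

def hierarchical_quorum_size(n, fanout=10):
    """Rough quorum size when agreement is taken level by level down a
    hierarchy: a majority of the fanout at each level, O(log n) total."""
    return tree_levels(n, fanout) * (fanout // 2 + 1)

print(flat_quorum_size(100_000))          # → 50001
print(hierarchical_quorum_size(100_000))  # → 30
```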

2. There has been a lot of work on quorum systems and probabilistic quorum systems. Can those be employed to help with the scalability problem in SCP? 

Maybe even some graph algorithms can be relevant, like the "Sparse Partitions" work by Baruch Awerbuch and David Peleg. 

3. Is it possible to come up with a service for SCP that provides checks and prevents nodes from selecting bad quorum slices that lead to bad quorums? 
But then, why would you trust that service? The service itself should be built in a trustless way.

4. How can we implement sharding per service in SCP?
In the SCP model described in the paper, all transactions are related and potentially interfering/dependent on each other, since no sharding per service is considered. How can we implement sharding support for SCP that provides parallelism inside the services, but also allows occasional cross-service transactions and prevents double spending? Would it be possible to build something similar to the Aspen model for SCP?
