Thursday, July 25, 2019

Paper summary. CORFU: A shared log design for flash clusters

By: Mahesh Balakrishnan, Dahlia Malkhi, Vijayan Prabhakaran, Ted Wobber, Michael Wei, John D. Davis, Appeared in NSDI'2012
This paper applies VPaxos ideas (of using an auxiliary Paxos box for reconfiguration) and chain replication ideas in the context of Flash SSDs. The vision is that Corfu's novel client-centric design eliminates storage servers in favor of simple, efficient and inexpensive flash chips that attach directly to the network. The clients directly write to storage nodes, similar to what happens in Dynamo/Cassandra/Voldemort replication, but linearizability is still guaranteed.

Previously I had summarized the Tango paper, for maintaining distributed data structures over a shared log. Tango builds on the Corfu log abstraction.

Corfu involves three main functions:

  • A mapping function (maintained at the VPaxos box) from logical positions in the log to flash pages on the cluster of flash units
  • A tail-finding mechanism (using a sequencer node) for finding the next available logical position on the log for new data
  • A replication protocol (chain replication!) to write a log entry consistently on multiple flash pages

Mapping in Corfu

Each Corfu client maintains a local, read-only replica of a data structure called a projection that carves the logical log into disjoint ranges. Each such range is mapped to a list of extents within the address spaces of individual flash units.

The example above maps each log position to a single flash page; for replication, each extent is associated with a replica set of flash units rather than just one unit. For example, for two-way replication the extent F0: 0:20K would be replaced by F0/F0′:0:20K and the extent F1:0:20K would be replaced by F1/F1':0:20K.

When some event occurs that necessitates a change in the mapping --for example, when a flash unit fails, or when the tail of the log moves past the current active range-- a new projection (a new view with a new epoch number) has to be installed on all clients in the system.

To maintain and reconfigure this mapping, Corfu uses VPaxos. There is a mapping from logical log to physical SSD extents/ranges. VPaxos keeps that mapping, and updates that mapping on failures, and on extent full.

This VPaxos-based auxiliary-driven reconfiguration involves two distinct steps:
1. Sealing the current projection: When a client Cr decides to reconfigure the system from the current projection Pi to a new projection Pi+1, it first seals Pi; this involves sending a seal command to a subset of the flash units in Pi. Sealing ensures that flash units will reject in-flight messages --writes as well as reads-- sent to them in the context of the sealed projection.

2. Writing the new projection at the VPaxos box: Once the reconfiguring client Cr has successfully sealed the current projection Pi, it attempts to write the new projection Pi+1 at the (i + 1)th position in the VPaxos box. If some other client has already written to that position, client Cr aborts its own reconfiguration, reads the existing projection at position (i + 1), and uses it as its new current projection.

Finding tail in Corfu

To eliminate contention at the tail of the log, Corfu uses a dedicated sequencer that assigns clients 'tokens', corresponding to empty log positions. To append data, a client first goes to the sequencer, which returns its current value and increments itself. The sequencer is merely an optimization to reduce contention in the system and is not required for either safety or progress.

Replication in Corfu

Corfu uses a simple chaining protocol (a client-driven variant of Chain Replication) to achieve safety-under-contention and durability. When a client wants to write to a replica set of flash pages, it updates them in a deterministic replica order, waiting for each flash unit to respond before moving to the next one. If two clients attempt to concurrently update the same replica set of flash pages, one of them will arrive second at the first unit of the chain and receive an error overwrite.

To read from the replica set, clients go to the last unit of the chain. If the last unit has not yet been updated, it will return an error unwritten.

To fill holes (which is important for RSM maintenance from log), the client starts by checking the first unit of the chain to determine if a valid value exists in the prefix of the chain. If such a value exists, the client walks down the chain to find the first unwritten replica, and then completes the append by copying over the value to the remaining unwritten replicas in chain order. Alternatively, if the first unit of the chain is unwritten, the client writes the junk value to all the replicas in chain order.

MAD questions

1. Do SSDs still work this way?
I am not current on my SSD knowledge. The paper makes use of properties of Flash SSDs: it assumes specific error codes to be returned for "no item", "item", and "junk", and in effect, it treats the SSDs as write-once registers for the purpose of the log. In return, it also tries to account for some of its limitations like uneven wear problem, and tries to load-balance the wear.

Did anything change in the way SSDs work that change these assumptions/requirements?

2. How can we improve on some drawbacks?
A big drawback in Corfu is that any time a fault occurs, everything stalls and a reconfiguration is performed before reads/writes can proceed on the active extent. This is also a problem with chain replication based protocols in general.

Would there be some simple solutions to amend Corfu to address this?
For example, would it be possible to come up with a more clever, single node crush tolerant mapping? Ceph had a clever hierarchical hashing called Crush, maybe something along those lines.

As I have mentioned in the previous blog post, MAD questions, Cosmos DB has operationalized a fault-masking streamlined version of replication via nested replica-sets deployed in fan-out topology. Rather than doing offline updates from a log, Cosmos DB updates database at the replicas online, in place, to provide strong consistent and bounded-staleness consistency reads among other read levels. On the other hand, Cosmos DB also maintains a change log by way of a witness replica, which serves several useful purposes, including fault-tolerance, remote storage, and snapshots for analytic workload.

Sunday, July 21, 2019

Dissecting performance bottlenecks of strongly-consistent replication protocols

Dissecting performance bottlenecks of strongly-consistent replication protocols
Ailidani Ailijiang, Aleksey Charapko, and Murat Demirbas.

Hey, this is our paper! This appeared in Sigmod 2019 couple weeks back. This paper came out of the dissertation work of Ailidani Ailijiang. He has build the Paxi framework in Go, available on GitHub, to prototype any Paxos flavor quickly. His dissertation is called: "Strongly Consistent Coordination for Wide Area Networks". 

Writing blog posts about one's own papers is harder than writing posts about others' papers. When you write a summary of your work, you want to include everything, and cannot detach yourself from specifics easily. I found that I neglected posting about many of our papers, even though it is important to provide brief and accessible summaries of these papers to enhance their reach. It is important to reach more people, because then we can see whether the paper can stand the test of time and push the state of our understanding a bit further. "Science happens only when published work resists critique, otherwise it is speculative fiction. -Frank McSherry"


The reason Paxos is popular is due to its excellent fault-tolerance properties. Paxos (and Paxos flavors) preserve safety to the face of fully-asynchronous environment, any sequence of faults, and even to the face of network partitions. Paxos and its derivatives are often used for replication in strongly-consistent databases, e.g., CockroachDB, Spanner, YugaByte, PaxosStore. As such, performance of Paxos protocols become important for the performance of distributed databases/systems. However, Paxos variants have widely different performance inherently, and even for the same protocol, different workload, topology, and network conditions result into widely varying performance.

In this work, we study the performance of Paxos protocols. We take a two-pronged approach, and provide both analytic and empirical evaluations which corroborate and complement each other. We then distill these results to give back-of-the-envelope formulas for estimating the throughput scalability of different Paxos protocols.

Paxos background

Before we discuss Paxos bottlenecks, here is a brief refresher for the Paxos protocol and variants.

The figure illustrates a single leader (vanilla) Paxos protocol. Yes, three phases look expensive, but in Multi-Paxos, things get a lot better because Phase1 is skipped in the presence of a stable leader. In Multi-Paxos, for upcoming slots (i.e., consensus instances), the leader skips Phase1 and just goes with Phase2. As another optimization, Phase3 messages are piggybacked to the Phase2 messages of upcoming slots rather than being sent separately.

But, even in Multi-Paxos (which we consider henceforth), there is an obvious bottleneck at the leader. The leader is doing a disproportionately large amount of the work, while the followers are slacking off. The followers receive one message and send one message back for each slot. In contrast, the poor leader needs to send N messages in Phase2a, and receive at least a quorum of messages from followers in Phase2b. It turns out, in practice, the overhead of Phase2b is worse than that of Phase2a. For sending Phase1a messages, the leader serializes the message once, and the network card takes care of sending them. For receiving the messages in Phase2b, the leader node needs to deserialize and process each message separately.


Of course many researchers noticed this bottleneck at the leader, and they proposed Paxos flavors to alleviate this issue. EPaxos used opportunistic leaders: any node becomes a leader when it receives a request, and tries to get a ~3/4ths quorum of nodes accept the request to finalize it. In EPaxos, a conflict is possible with concurrent and noncommutative commands, and that requires another round to resolve.

WanKeeper deploys Paxos groups hierarchically. This helps for scalability because key-ranges are sharded to Paxos groups. Using a simpler versions of this idea, Spanner and CockroachDB statically assign keyranges to Paxos groups, and use another service (such as Movedir) to modify the assignments.

WPaxos provides a more decentralized version of sharding.  It uses multileaders, and partitions the object-space among these multileaders. Unlike statically partitioned multiple Paxos deployments, WPaxos is able to adapt to the changing access locality through object stealing. Multiple concurrent leaders coinciding in different zones steal ownership of objects from each other using Phase1 of Paxos, and then use Phase2 to commit update-requests on these objects locally until they are stolen by other leaders. To achieve fast Phase2 commits, WPaxos adopts the flexible quorums idea in a novel manner, and appoints Phase2 acceptors to be close to their respective leaders. Here is the link to the journal version of our WPaxos paper for more details.

Analytical modeling using Queueing Theory

Ok, after that brief detour, we continue with analytical modeling of Paxos protocol performance. We use modeling with queueing theory for our analytical results. To fit a queueing model on Paxos protocols, we bootstrap from our experimental results with Paxi. We first fit the model on  Multi-Paxos, and then use the queueing model simulations to come up with performance evaluations for other Paxos variants. Then to cross-validate the queueing theory model results, we compare them with the experiment results for the corresponding Paxos variants.

But what is there to validate/corroborate our experimental results in the first place? For this we compare experimental results of our Paxi MultiPaxos implementation with etcd/Raft implementation. We find that both implementations reach the throughput bottleneck around the same point.

Another thing to observe in this Paxos throughput graph is that, as the throughput approaches system limit, the latency starts to grow exponentially. Different Paxos flavors would have different limiting throughput, and a protocol is more scalable if it has a higher limiting throughput.

If we find a way to plot the increase of latency of protocols, we can determine the limiting throughput of those protocols. The question then becomes: "For a given throughput, what is the average latency for each request?" Queueing theory comes handy for addressing this question, and that is why we employed it for our analytical modeling.

We find that this corresponds to a simple M/D/1 queueing model. M/D/1 represents the queue length in a system having a single server, where arrivals are determined by a Poisson process (occurring at rate $\lambda$) and job service times are fixed and deterministic (serving at rate $\mu$ = 1/s).

Using M/D/1, the model for Multi-Paxos is set up as follows. (For other flavors of Paxos, we extend the model accordingly, and get simulation results respectively.) Latency consists of 3 parts, W + s + rtt, where
  • W: average waiting time in queue
  • s: request service time (determined by the size of quorum the leader manages)
  • rtt: network latency to reach the quorum
Under M/D/1, after a protocol is chosen, s and rtt becomes fixed. The formula for $W$ is given as $W=\rho / (2 * \mu * (1-\rho)$, where $\rho = \lambda / \mu$ : utilization of the server. That means, as $\lambda$ (request arrival rate) increases, $W$ increases, and this contributes to the exponential growth. Moreover, different Paxos flavors would have different $s$, which plays in the $W$ formulas since $\mu=1/s$, and leads to the different limiting throughput.

Empirical modeling

We compare/cross-validate the results from queueing theory with the experimental results we obtain from our Paxi framework.The diagram shows main components in our Paxi framework. The developer can easily prototype a distributed coordination/replication protocol by filling in the messages and replica components, shown as shaded blocks.

To facilitate getting experiment results,  the Paxi benchmarker can (1) generate workloads by tuning the read-to-write ratios, creating hot objects, conflicting objects, & locality of access, (2) measure latency & throughput, (3) test scalability by adding more nodes & by increasing dataset size, (4) test availability by injecting faults, and (5) verify the serializability of protocol output utilizin a simple offline read/write linearizability checker.

Evaluation results

The figure shows modeled throughput with queueing theory, with 50% write, 50% read, on 1000 objects. As throughput increases the conflict probability also increases, and EPaxos starts to suffer from that. WPaxos shows better scalability than Multi-Paxos.

This figure shows experimental throughput from Paxi implementations under the same conditions. This matches the modeled throughput.

The above figure shows modeled throughput in WANs. EPaxos is hurt by the increase of conflict ratio, and with increased conflict ratio may even perform worse than Multi-Paxos. WPaxos achieves high scalability and low-latency by sharding the key-ranges to leaders and doing this in an access-locality adaptive way.

The figure below shows the experimental results in a WAN deployment, evaluating latency under increased conflict ratios. Paxos is unfazed by conflicts, because the single leader does not experience any conflicts. EPaxos latency remains lower than that of Paxos up to 40% conflict ratio. Conflicts in WPaxos means key-ranges need to be relocated from one leader to another, which involves WAN latency. So as the relocation ratio increases WPaxos latency increases gradually. Similarly VPaxos and WanKeeper also has a gradual increase of latency with respect to increased need to relocate key-ranges.

Please see our paper for many more graphs and results.

Forecasting throughput scalability

Here we give approximate back-of-the-envelope formulas for predicting the limiting throughput of different protocols. These formulas are not the whole story, as they don't show how latency changes as throughput increases as does our queueing theory and empirical results show. But they are useful to relatively rank the scalability of Paxos variants with respect to each other.

Through our analytical modeling and Paxi experiments, we find that the throughput of a protocol is inversely proportional to the load on the busiest node, which is by definition the leader or a leader. The throughput scalability of protocols improve as the load decreases.

Let $L$ be the number of leaders, and Q be quorum size, and c the conflict probability. Then we can approximate Load as follows:
Load = ( 1 + c ) * ( Q-1 ) *  1/L +  ( 1 + c ) *  1-1/L

The first part in the summation denotes that a leader is responsible for only 1/L of the requests, and it needs to process messages by Q-1 nodes. Moreover if there is a conflict, with conflict probability c, those fraction of requests incur another round of load.

The second part says that a leader is also responsible for serving as participant in other leaders protocol, adding one message processing cost for the 1-1/L of the requests. Furthermore we also account for $c$ fractional load due to conflicts.

The above formula simplifies to Load = ( 1 + c ) * ( Q + L - 2 ) / L

Recall that, the lower the load, the more scalable is the protocol. Therefore to improve scalability, increase the number of leaders, L. This way each leader get to deal with only a fraction of the requests. However, while increasing L, it is important to make sure this does not increase the conflict rate, c, because each conflict means additional work for the leaders.

For Multi-Paxos, L=1,  and the leader is responsible for all requests. But the good news is c=0, because there is no conflicts when there is only one leader. Therefore for each request, the load on the leader is Q-1.

If we take N=9, the load for Multi-Paxos comes to 4. For EPaxos, the load comes to 4/3 *(1+c). For c=1 the load becomes 8/3, and for c=0, the load becomes 4/3.

For WPaxos, c=0, and L=3 and Q=3, so the load comes to 4/3. That means, if EPaxos has no conflict workload, it can has as high throughput as WPaxos, otherwise, WPaxos would have higher throughput. For WanKeeper, c=0, L=3, and Q=3, and a group does not do extra/side work for another, so the load comes to 1.

Note that the load for these protocols matches with the relative throughput scalability of the corresponding protocols.

MAD questions

1. What would you prototype with Paxi?
The Paxi framework is general, and it is possible to implement more than Paxos protocols using Paxi. For example, we implemented the ABD protocol. We haven't implemented any Byzantine Paxos solution, but it would be possible to implement and get results from Byzantine Paxos protocols. It would even be possible to implement gossip protocols with Paxi, maybe the Avalanche protocol.

If you have an idea to implement and benchmark a protocol with Paxi, and have questions, let us know.

2. What are other techniques for alleviating the bottlenecks in Paxos protocols?
Of course you can circumvent Paxos bottlenecks, by not using Paxos. For example, by using  chain replication (which has its own drawbacks) you can employ Paxos only in the control path for maintaining the replication topology, and achieve strongly-consistent replication without a bottleneck at leader. Cosmos DB further avoids the downsides of chain replication, and achieves high-throughput, WAN scalable, fault-masking strongly-consistent replication by using nested replicasets in a fanout topology.

Coming back to our question of techniques for alleviating the bottlenecks in Paxos, we have some new promising ideas for improving the dissemination/aggregation paths. Aleksey is exploring these ideas, and we hope to report on them when we get results.

3. Paxos jokes
Here is the cat tax for this long technical post on Paxos. Yes, these are Paxos cats.

Also, if you made it this far, here are some Paxos jokes you might enjoy.

Recently, Aleksey went to the trouble of buying the domain, and building up a website, which you can read more jokes, and submit your jokes about Paxos protocols and distributed systems in general.

Friday, July 12, 2019

Paxos jokes

For some reason Aleksey finds any joke about Paxos hilarious, whereas Ailidani and I are indifferent to Paxos jokes. However, today out of nowhere I came up with some decent Paxos jokes, and shared them on Twitter.  In the evening I attended the USENIX ATC reception and shared these jokes with Benjamin Reed and Alexander Shraer of ZooKeeper and ZooNet fame, and cracked them up.

Here they are for perpetuity. It is a bad idea to explain jokes. But for pedantic purposes, and to get people interested in Paxos, I provide some explanations. Paxos jokes made simple... err.. moderately complex.

I think the jokes get funnier if you read them in a Russian accent. So give that a try.

Leader - I tell you Paxos joke, if you accept me as leader.
Quorum - Ok comrade.

Leader - Here is joke! (*Transmits joke*)
Quorum - Oookay...

Leader - (*Laughs* hahaha). Now you laugh!!
Quorum - Hahaha, hahaha.

The conversation corresponds to phase 1, phase 2, and phase 3. In Paxos, the leader commits first, and tells the participants to commit via a phase 3 message.

If you followed the basic protocol, here is a riff on that.

Leader - I tell you Paxos joke, if you accept me as leader.
Quorum - Ok, but I heard "this one Paxos joke" before.

Leader - Here is joke! (*Transmits "this one Paxos joke" back.*)
Quorum - Oookay...

Leader - (*Laughs* hahaha). Now you laugh!!
Quorum - Hahaha, hahaha.

The joke here is that the leader should re-propose the same value (with the highest ballot number) that is pointed out from phase 1b message.

Here is another one.

Leader1 - I tell you Paxos joke, if you accept me as leader.
Quorum - Ok comrade.

Leader2 - I tell you Paxos joke, if you accept me as leader.
Quorum - Ok comrade.

Leader1 - No! I tell you Paxos joke.
Leader2 - No! I tell you Paxos joke.
(*dueling leaders* ... ad infinitum)

This can happen in the presence of a fully asynchronous environment, where even the diamond W failure detector is not implementable.

I have nerdier jokes for other Paxos variants, but yeah, they get more convoluted. So let's not go there.

To end the post, I have a Raft joke for you.

Well, it is really a Paxos joke, but easier to follow.

At least to some people... mostly programmers.

UPDATE: Aleksey went to the trouble of buying the domain, and building up a website, where you can read more jokes, and submit your own jokes about Paxos protocols and distributed systems in general.

Tuesday, June 18, 2019

Book review. Range: Why Generalists Triumph in a Specialized World

This is a very recent book, released on May 28, 2019. I got drawn to this book due to its interesting and controversial title: "Why Generalists Triumph in a Specialized World". The blurb about the book says:
"If you take a closer look at the world's top performers, from professional athletes to Nobel laureates, you'll find that early specialization is the exception, not the rule.
[David Epstein] discovered that in most fields--especially those that are complex and unpredictable--generalists, not specialists, are primed to excel. Generalists often find their path late, and they juggle many interests rather than focusing on one. They're also more creative, more agile, and able to make connections their more specialized peers can't spy from deep in their hyperfocused trenches. As experts silo themselves further while computers master more of the skills once reserved for highly focused humans, people who think broadly and embrace diverse experiences and perspectives will increasingly thrive."

Another thing that drew me to the book is the author. Epstein's previous book was "Sports Gene". Epstein is very careful and diligent with his research. He always goes to the source and reads many journal papers as part of his research. The notes at the end of the book take the last 30% of the book. Epstein doesn't refrain from questioning the validity of popular understanding and beliefs. In this book, he does that to the popularization of the grit research results and the 10,000 hour rule results.

The book was a very good read, but it is long, at 351 pages. It is 12 chapters, in addition to introduction and conclusion. I thought I learned a lot from the book, and I was only at Chapter 5. The book could have been sweeter had it been shorter. I think it would be better to remove 3-4 chapters toward the end. To keep my post short and manageable, I left out majority of my highlights. Out of the 14 chapters I only provide highlights from 5 chapters.

The book clearly communicates the dangers of overstating/overemphasizing hyperspecialization and the benefits of being an effective generalist, but there aren't actionable lessons on how to go about that. I highly recommend you read this book, and make your own mind about it. At the end of this post, I discuss what I learned from the book and my take and opinions on this subject.

Highlights from the book

Introduction: Roger vs. Tiger

Tiger [Woods] has come to symbolize the idea that the quantity of deliberate practice determines success—and its corollary, that the practice must start as early as possible.

The push to focus early and narrowly extends well beyond sports. We are often taught that the more competitive and complicated the world gets, the more specialized we all must become (and the earlier we must start) to navigate it.

Moving high-ranking government officials between departments, he wrote, “is no less absurd than rotating Tiger Woods from golf to baseball to football to hockey.” Except that Great Britain’s massive success at recent Summer Olympics, after decades of middling performances, was bolstered by programs set up specifically to recruit adults to try new sports and to create a pipeline for late developers--“slow bakers,” as one of the officials behind the program described them to me. Apparently the idea of an athlete, even one who wants to become elite, following a Roger [Federer] path and trying different sports is not so absurd.
Eventual elites typically devote less time early on to deliberate practice in the activity in which they will eventually become experts. Instead, they undergo what researchers call a “sampling period.” They play a variety of sports, usually in an unstructured or lightly structured environment; they gain a range of physical proficiencies from which they can draw; they learn about their own abilities and proclivities; and only later do they focus in and ramp up technical practice in one area.
One study showed that early career specializers jumped out to an earnings lead after college, but that later specializers made up for the head start by finding work that better fit their skills and personalities.

I dove into work showing that highly credentialed experts can become so narrow-minded that they actually get worse with experience, even while becoming more confident --a dangerous combination. And I was stunned when cognitive psychologists I spoke with led me to an enormous and too often ignored body of work demonstrating that learning itself is best done slowly to accumulate lasting knowledge, even when that means performing poorly on tests of immediate progress. That is, the most effective learning looks inefficient; it looks like falling behind.

Starting something new in middle age might look that way too. Mark Zuckerberg famously noted that “young people are just smarter.” And yet a tech founder who is fifty years old is nearly twice as likely to start a blockbuster company as one who is thirty, and the thirty-year-old has a better shot than a twenty-year-old. Among the fastest-growing start-ups, the average age of a founder was forty-five when the company was launched.
One revelation in the aftermath of the 2008 global financial crisis was the degree of segregation within big banks. Legions of specialized groups optimizing risk for their own tiny pieces of the big picture created a catastrophic whole. “No one imagined silos like that inside banks,” a government adviser said later. Overspecialization can lead to collective tragedy even when every individual separately takes the most reasonable course of action.
Highly specialized health care professionals have developed their own versions of the “if all you have is a hammer, everything looks like a nail” problem. Interventional cardiologists have gotten so used to treating chest pain with stents—metal tubes that pry open blood vessels—that they do so reflexively even in cases where voluminous research has proven that they are inappropriate or dangerous. A recent study found that cardiac patients were actually less likely to die if they were admitted during a national cardiology meeting, when thousands of cardiologists were away; the researchers suggested it could be because common treatments of dubious effect were less likely to be performed.

Increasing specialization has created a “system of parallel trenches” in the quest for innovation. Everyone is digging deeper into their own trench and rarely standing up to look in the next trench over, even though the solution to their problem happens to reside there.
Scientists and members of the general public are about equally likely to have artistic hobbies, but scientists inducted into the highest national academies are much more likely to have avocations outside of their vocation. And those who have won the Nobel Prize are more likely still. Compared to other scientists, Nobel laureates are at least twenty-two times more likely to partake as an amateur actor, dancer, magician, or other type of performer.
... electrical engineer Claude Shannon ... launched the Information Age thanks to a philosophy course he took to fulfill a requirement at the University of Michigan. In it, he was exposed to the work of self-taught nineteenth-century English logician George Boole, who assigned a value of 1 to true statements and 0 to false statements and showed that logic problems could be solved like math equations. It resulted in absolutely nothing of practical importance until seventy years after Boole passed away, when Shannon did a summer internship at AT&T’s Bell Labs research facility.

Learning, fast and slow

One of those desirable difficulties is known as the “generation effect.” Struggling to generate an answer on your own, even a wrong one, enhances subsequent learning. Socrates was apparently on to something when he forced pupils to generate answers rather than bestowing them. It requires the learner to intentionally sacrifice current performance for future benefit.

Metcalfe and colleagues have repeatedly demonstrated a “hypercorrection effect.” The more confident a learner is of their wrong answer, the better the information sticks when they subsequently learn the right answer. Tolerating big mistakes can create the best learning opportunities.
Struggling to retrieve information primes the brain for subsequent learning, even when the retrieval itself is unsuccessful. The struggle is real, and really useful.

... there was a group of Calculus I professors whose instruction most strongly boosted student performance on the Calculus I exam, and who got sterling student evaluation ratings. Another group of professors consistently added less to student performance on the exam, and students judged them more harshly in evaluations. But when the economists looked at another, longer-term measure of teacher value added—how those students did on subsequent math and engineering courses that required Calculus I as a prerequisite—the results were stunning. The Calculus I teachers who were the best at promoting student overachievement in their own class were somehow not great for their students in the long run. “Professors who excel at promoting contemporaneous student achievement,” the economists wrote, “on average, harm the subsequent performance of their students in more advanced classes.” What looked like a head start evaporated.
The economists suggested that the professors who caused short-term struggle but long-term gains were facilitating “deep learning” by making connections. They “broaden the curriculum and produce students with a deeper understanding of the material.” It also made their courses more difficult and frustrating, as evidenced by both the students’ lower Calculus I exam scores and their harsher evaluations of their instructors.
Desirable difficulties like testing and spacing make knowledge stick. It becomes durable. Desirable difficulties like making connections and interleaving make knowledge flexible, useful for problems that never appeared in training. All slow down learning and make performance suffer, in the short term.

The trouble with too much grit

Malamud’s conclusion: “The benefits to increased match quality outweigh the greater loss in skills.” Learning stuff was less important than learning about oneself. Exploration is not just a whimsical luxury of education; it is a central benefit.
It should come as no surprise that more students in Scotland ultimately majored in subjects that did not exist in their high schools, like engineering. In England and Wales, students were expected to pick a path with knowledge only of the limited menu they had been exposed to early in high school. That is sort of like being forced to choose at sixteen whether you want to marry your high school sweetheart. At the time it might seem like a great idea, but the more you experience, the less great that idea looks in hindsight. In England and Wales, adults were more likely to get divorced from the careers they had invested in because they settled down too early. If we treated careers more like dating, nobody would settle down so quickly.
For professionals who did switch, whether they specialized early or late, switching was a good idea. “You lose a good fraction of your skills, so there’s a hit,” Malamud said, “but you do actually have higher growth rates after switching.” Regardless of when specialization occurred, switchers capitalized on experience to identify better matches.

In 2004, at the beginning of Beast, Duckworth gave 1,218 plebes in the incoming class the grit survey. They were asked to pick from five ratings how much each of twelve statements applied to them. Some of the statements were plainly about work ethic (“I am a hard worker”; “I am diligent”). Others probed persistence or singular focus (“I often set a goal but later choose to pursue a different one”; “My interests change from year to year”). Where the Whole Candidate Score failed to predict Beast dropouts, the Grit Scale was better. Duckworth extended the study to other domains, like the finals of the Scripps National Spelling Bee. She found that both verbal IQ tests and grit predicted how far a speller would get in the competition, but that they did so separately. It was best to have a ton of both, but spellers with little grit could make up for it with high verbal IQ scores, and spellers with lower verbal IQ scores could compensate with grit.

“I worry I’ve contributed, inadvertently, to an idea I vigorously oppose: high-stakes character assessment,” she wrote. That is not the only way in which grit research has been extended or exaggerated beyond its evidence.
The fact that cadets are selected based on their Whole Candidate Score leads to what statisticians call a “restriction of range.” That is, because cadets were selected precisely for their Whole Candidate Score, a group of people who are very alike on Whole Candidate Score measures were siphoned from the rest of humanity. When that happens, other variables that were not part of the selection process can suddenly look much more important in comparison. To use a sports analogy, it would be like conducting a study of success in basketball that included only NBA players as subjects; the study might show that height is not an important predictor of success, but determination is. Of course, the NBA had already selected tall men from the wider population, so the range of height in the study was restricted. Thus height appears not to matter as much as it truly does. Similarly, the relative predictiveness of grit and other traits in West Point cadets and spelling bee competitors may not look quite the same in less restricted populations. If a truly random sample of high school graduates was assessed for Whole Candidate Scores, not just those who were accepted to West Point, physical fitness, grades, and leadership experiences may well predict their Beast persistence, and perhaps more so than grit. Duckworth and her coauthors, to their credit, point out that by studying highly preselected groups, “we have necessarily limited the external validity of our investigation.”
The vast majority of plebes complete Beast, no matter their grit scores. In the first year Duckworth studied them, 71 out of 1,218 dropped out. In 2016, 32 of 1,308 plebes dropped out. The deeper question is whether dropping out might actually be a good decision.
Godin argued that “winners”—he generally meant individuals who reach the apex of their domain—quit fast and often when they detect that a plan is not the best fit, and do not feel bad about it. “We fail,” he wrote, when we stick with “tasks we don’t have the guts to quit.” Godin clearly did not advocate quitting simply because a pursuit is difficult. Persevering through difficulty is a competitive advantage for any traveler of a long road, but he suggested that knowing when to quit is such a big strategic advantage that every single person, before undertaking an endeavor, should enumerate conditions under which they should quit. The important trick, he said, is staying attuned to whether switching is simply a failure of perseverance, or astute recognition that better matches are available.

In return for a five-year active-duty service commitment, every West Point cadet gets a taxpayer-funded scholarship valued at around a half million dollars. That’s why it is particularly vexing to the Army that since the mid-1990s, about half of West Point graduates leave active military service after five years, which is as soon as they are allowed. It takes about five years just to offset the development costs for a trained officer. Three-quarters are gone before the twenty-year mark, which would bring them to their early forties having earned a lifetime pension.
The more likely the Army is to identify someone as a successful future officer and spend money on them, the more likely they are to leave as soon as possible. The Army’s goal is developing career senior officers, not simply Beast survivors. From the military’s perspective, this is all a major backfire.

Obviously, neither the academy nor ROTC are teaching cadets to leave. Did cadets suddenly lose the grit that had gotten them through Beast? It’s not that either. The authors of the monograph—a major, a retired lieutenant colonel, and a colonel, all current or former West Point professors—pinpointed the problem as a match quality conundrum. The more skilled the Army thought a prospective officer could become, the more likely it was to offer a scholarship. And as those hardworking and talented scholarship recipients blossomed into young professionals, they tended to realize that they had a lot of career options outside the military. Eventually, they decided to go try something else. In other words, they learned things about themselves in their twenties and responded by making match quality decisions.

The Army began offering retention bonuses—just cash payments to junior officers if they agreed to serve a few more years. It cost taxpayers $500 million, and was a massive waste. Officers who had planned to stay anyway took it, and those who already planned to leave did not. The Army learned a hard lesson: the problem was not a financial one; it was a matching one.
The Officer Career Satisfaction Program was designed so that scholarship-ROTC and West Point graduates can take more control of their own career progression. In return for three additional years of active service, the program increased the number of officers who can choose a branch (infantry, intelligence, engineering, dental, finance, veterinary, communication technology, and many more), or a geographic post. Where dangling money for junior officers failed miserably, facilitating match quality succeeded. In the first four years of the program, four thousand cadets agreed to extend their service commitments in exchange for choice.
A recent international Gallup survey of more than two hundred thousand workers in 150 countries reported that 85 percent were either “not engaged” with their work or “actively disengaged.” In that condition, according to Seth Godin, quitting takes a lot more guts than continuing to be carried along like debris on an ocean wave. The trouble, Godin noted, is that humans are bedeviled by the “sunk cost fallacy.”
Van Gogh was an example of match quality optimization, Robert Miller’s multi-armed bandit process come to life. He tested options with maniacal intensity and got the maximum information signal about his fit as quickly as possible, and then moved to something else and repeated, until he had zigzagged his way to a place no one else had ever been, and where he alone excelled. Van Gogh’s Grit Scale score, according to Naifeh’s assessment, was flush with hard work but low on sticking with every goal or project. He landed in the 40th percentile.

No one in their right mind would argue that passion and perseverance are unimportant, or that a bad day is a cue to quit. But the idea that a change of interest, or a recalibration of focus, is an imperfection and competitive disadvantage leads to a simple, one-size-fits-all Tiger story: pick and stick, as soon as possible. Responding to lived experience with a change of direction, like Van Gogh did habitually, like West Point graduates have been doing since the dawn of the knowledge economy, is less tidy but no less important. It involves a particular behavior that improves your chances of finding the best match, but that at first blush sounds like a terrible life strategy: short-term planning.

Lateral thinking with withered technology

They titled their study Superman or the Fantastic Four? “When seeking innovation in knowledge-based industries,” they wrote, “it is best to find one ‘super’ individual. If no individual with the necessary combination of diverse knowledge is available, one should form a ‘fantastic’ team.” Diverse experience was impactful when created by platoon in teams, and even more impactful when contained within an individual.
Novelist, screenwriter, and comics author Neil Gaiman has a similarly expansive range, from journalism and essays on art to a fiction oeuvre encompassing both stories that can be read to (or by) the youngest readers as well as psychologically complex examinations of identity that have enthralled mainstream adult audiences. Jordan Peele is not a comics creator, but the writer and first-time director of the extraordinarily unique surprise hit Get Out struck a similar note when he credited comedy writing for his skill at timing information reveals in a horror film. “In product development,” Taylor and Greve concluded, “specialization can be costly.”
Charles Darwin “could be considered a professional outsider,” according to creativity researcher Dean Keith Simonton. Darwin was not a university faculty member nor a professional scientist at any institution, but he was networked into the scientific community. Darwin only personally carried out experiments “opportune for experimental attack by a scientific generalist such as he was.” For everything else, he relied on correspondents, Jayshree Seth style. Darwin always juggled multiple projects, what Gruber called his “network of enterprise.” He had at least 231 scientific pen pals who can be grouped roughly into thirteen broad themes based on his interests, from worms to human sexual selection. He peppered them with questions. He cut up their letters to paste pieces of information in his own notebooks, in which “ideas tumble over each other in a seemingly chaotic fashion.” When his chaotic notebooks became too unwieldy, he tore pages out and filed them by themes of inquiry. Just for his own experiments with seeds, he corresponded with geologists, botanists, ornithologists, and conchologists in France, South Africa, the United States, the Azores, Jamaica, and Norway, not to mention a number of amateur naturalists and some gardeners he happened to know. As Gruber wrote, the activities of a creator “may appear, from the outside, as a bewildering miscellany,” but he or she can “map” each activity onto one of the ongoing enterprises. “In some respects,” Gruber concluded, “Charles Darwin’s greatest works represent interpretative compilations of facts first gathered by others.” He was a lateral-thinking integrator.

Conclusion: Expanding your range

When I began to write and speak about data indicating that athletes who go on to become elite are usually not early specializers, the reactions (particularly from parents) reliably fell into two categories: (1) Simple disbelief, can’t be true; and (2) “So, in one sentence, what is the advice?” What one sentence of advice can encapsulate the embrace of breadth and the journey of experimentation that is necessary if you want, like Van Gogh or Andre Geim or Frances Hesselbein, to arrive at a place optimized for you alone? Like the paths of those individuals, my exploration of breadth and specialization was inefficient, and what began as a search for one sentence of advice ended in this book.
That’s how it goes on the disorderly path of experimentation. Original creators tend to strike out a lot, but they also hit mega grand slams, and a baseball analogy doesn’t really do it justice. As business writer Michael Simmons put it, “Baseball has a truncated outcome distribution. When you swing, no matter how well you connect with the ball, the most runs you can get is four.” In the wider world, “every once in a while, when you step up to the plate, you can score 1,000 runs.” It doesn’t mean breakthrough creation is luck, although that helps, but rather that it is hard and inconsistent. Going where no one has is a wicked problem. There is no well-defined formula or perfect system of feedback to follow. It’s like the stock market that way; if you want the sky highs, you have to tolerate a lot of lows. As InnoCentive founder Alph Bingham told me, “breakthrough and fallacy look a lot alike initially.”
Finally, remember that there is nothing inherently wrong with specialization. We all specialize to one degree or another, at some point or other. My initial spark of interest in this topic came from reading viral articles and watching conference keynotes that offered early hyperspecialization as some sort of life hack, a prescription that will save you the wasted time of diverse experience and experimentation. I hope I have added ideas to that discussion, because research in myriad areas suggests that mental meandering and personal experimentation are sources of power, and head starts are overrated. As Supreme Court justice Oliver Wendell Holmes wrote a century ago, of the free exchange of ideas, “It is an experiment, as all life is an experiment.”

MAD questions

1. What is my take on this?

Without doubt, this was a very interesting read. There were many interesting anecdotes. But I can't just congratulate myself by having read a bunch of interesting things. What is the lesson I learned here? The book was too long and unfortunately there wasn't a resounding actionable lesson.

What the book communicates can be summed as "hyperspecialization can be dangerous and ineffective, and being an effective generalists is beneficial in many cases." But how can we become an effective generalist? Should we shun going deep and mastering a topic in favor of adding more breadth to our skillset? What is the balance there?

I think depth and mastery is important in any case. Without mastery in anything, it is hard to amount to something. You need to get acquainted with the process of mastery in some domain. And if you have mastery in one domain, it may be possible to apply and develop it in the context of another domain. So, maybe, the way to go about this is to master some tools in one domain, but then transfer it and apply to other domains as well. Don't just be a one-trick pony and don't be hyperspecialized. After you build mastery in some domain and tools, use your breadth/range to define the new problem/niche for you and dig your well there. When you go deep in new terrain, you can explore a new area, and differentiate and excel there.

I was looking forward to reading this book I think because I hoped it would validate and reaffirm my choices. I had hopped from topic to topic in my career: theory of distributed systems, self-stabilization, wireless sensor networks, crowdsource sensing and collaboration, cloud computing, distributed consensus and coordination. But of course, waiting for the book to reaffirm me was in vain. There is no overarching principle that covers every body and every case. It is important to keep questioning yourself and be deliberate about your choices. Life is about introspection and experimentation.

2. How much sharper can we make the book's thesis?

Heinlein famously said:
A human being should be able to change a diaper, plan an invasion, butcher a hog, conn a ship, design a building, write a sonnet, balance accounts, build a wall, set a bone, comfort the dying, take orders, give orders, cooperate, act alone, solve equations, analyse a new problem, pitch manure, program a computer, cook a tasty meal, fight efficiently, die gallantly. Specialization is for insects.
— Robert Heinlein, Time Enough for Love

This is a sharp hypothesis. Especially the "specialization is for insects" part. I wonder what would this book look like if it started with this more speculative thesis and try to prove this. Of course Epstein is a careful researcher, so he didn't want to be speculative. But I wonder if it is possible to make a case for this stronger thesis.

3. Should one do analogical thinking or not?

The book dedicates a whole chapter about how Kepler used analogical thinking, and promotes the importance of analogical thinking when working on a new domain.

I am a highly analogical/intuitive thinker. But I have always been very insecure about it. Dijkstra hated analogical thinking and wanted to shun it. He made a case against it in "The cruelty of teaching computer science." But, on the other hand, Feynman was a highly intuitive thinker and employed analogical thinking a lot. I think the jury is out on this one. I don't know if there has been rigorous research on the effectiveness and dangers of analogical thinking.

In any case, it is hard to fight your natural tendencies and peculiarities. So maybe instead of being insecure about this and fighting it, I should embrace my analogical thinking as a strategic advantage.

4. Omission errors are the hardest to catch. Did this book omit important stuff pertaining to this topic?

If you read something you don't agree with, you flag it immediately. But if it is not there and it is omitted, you may not catch it. Did this book omit any important stuff pertaining to the subject of being generalists versus specialists and achieving success one way or another?

I think community aspect is omitted. Without clusters, flowers don't spring. I think it is essential to be involved with a vibrant community to thrive and excel on a topic. There are many examples of this, the climbing community discussed in the "Valley Uprising" documentary, the hacker community discussed in Steven Levi's book.

Maybe, being involved with a vibrant community can give you both depth and width.
If you look back closely at history, many of the people who we think of as lone geniuses were actually part of “a whole scene of people who were supporting each other, looking at each other’s work, copying from each other, stealing ideas, and contributing ideas.” *Scenius* doesn’t take away from the achievements of those great individuals; it just acknowledges that good work isn’t created in a vacuum, and that creativity is always, in some sense, a collaboration, the result of a mind connected to other minds.

5. Why do I keep reading productivity/management books?

Recently I had read and discussed about "Loonshots", which had some parallel to this book. And before that I had discussed Creativity Inc., which is also very related.

I don't know why I keep reading these management(?) books. This is concerning... Help... I should be reading science fiction. Please suggest me good science fiction books. I think I should be checking out Neal Stephen's new book "Fall".

Monday, June 10, 2019

Is this consensus?

The specification for consensus is as follows. The first two are safety properties, the last one a liveness property.
  • Agreement: No two node can commit different decisions.
  • Validity (Non-triviality): If all initial values are same, nodes must commit that value.
  • Termination: Nodes commit eventually.
Below I talk about whether ABD or chain replication solve consensus, and whether it would be possible to implement state machine replication using them.

Does ABD solve consensus?

No, ABD does not solve consensus. I had written a summary of the ABD protocol in a 2012 post. And I had talked about why ABD is not consensus in a 2015 post. Below is a short recap of that followed by a discussion of whether ABD can still be employed to solve the state machine replication problem.

Consensus is prone to the FLP impossibility result, and it may lose progress under FLP conditions. In particular, for Paxos, if we can't determine whether the incumbent leader has failed in an asynchronous environment, then we may run into the dueling leader problem, which may continue until failure detection and consequently leader election stabilizes.

In contrast, ABD is not affected by the FLP result. This is because ABD is memoryless and hedonistic. ABD is happy with unresolved, partial acceptances in the past. Heck, it will completely overwrite a value that is accepted by all nodes if another write comes with a higher timestamp.

In comparison, Paxos is obsessed with the past. For each consensus instance, Paxos clings onto the value a node has seen (with the highest ballot). In the first and second phase of write in ABD there is no rejection/restart of the operation. In Paxos, both in the first phase and second phase, a leader/candidate may receive a rejection and goes back to retry phase 1.

Since ABD is memoryless and hedonistic, we can not implement replicated state machines (RSMs) with ABDs. But let's push on this a bit. So far, we were treating timestamps as ballots within the same instance. What if we treat the timestamps in ABD as slots in multiple instances of accepting values? By using "timestamp = counter + leaderID", we can implement total order on slots and have linearizability and strong consistent reads with multiple clients.

Can we then implement RSM over ABD participant nodes using these slot numbers? No! Because ABD skips past some slot numbers as unresolved since it is always eager to move ahead in a hedonistic manner. ABD goes with the highest timestamp of majority read, and it is not possible to go back to earlier slots and resolve them in case of ambiguity. In ABD, the nodes accept independently, and there is no commit/resolution phase, and the logs in the nodes diverge. We can't make a resolution about a slot's outcome even in God-view mode, where we can look at the values in a majority of the nodes (not all nodes, because up to a minority of nodes may be unavailable per fault-model). Let's say we see a value at node A, and no value at the other nodes constituting majority for this slot. The ambiguity that remains is as follows. Maybe node A was part of the majority that had this value and the other nodes are not reachable now. Or maybe A was the only node that had received this value. We cannot determine the difference. If we go with the first possibility, it is possible that, another God-view to a different majority may find that indeed the opposite was the case.

In contrast,  Paxos has a commit phase that marks that the value for that slot is resolved and finalized. Any commit (even read from one node's committed value) is a valid commit because the leader has observed that majority has accepted/stored it. So it is guaranteed that other nodes will also know (or learn) that commit. So Paxos relearns and does not leave any gap in the slot numbers while committing, because those slots numbers get executed as they are committed.

As the closing word on ABD, we should note that ABD is still useful for storage and linearizability, it solves the atomic storage problem. Here comes the difference between stateless operations (register operations put and get) versus stateful operations (commands in general that mutate state, which by definition depends on the state they are invoked/executed). For storage, we don't need stateful operations. Using ABD we achieve linearizability, and can serve strong-consistency reads via using ABD even with multiple clients.

Does chain replication solve consensus?

I had written up a description of the chain replication protocol in a 2011 post. Chain replication employs a Paxos-backed configuration box that maintains the configuration/topology of the chain nodes, and the chain nodes just replicate the values in a streamlined fashion. The beauty here is that Paxos is kept out of the data path, so it is not involved with any replication request. Paxos is employed in the control path, and is consulted only when a fault happens.

Does chain replication solve consensus? I haven't seen this question addressed in previous work neither in the original chain replication work, nor in any of the followup work. The answer is, no, chain replication does not solve the consensus problem! This is a trickier point to appreciate than the ABD case.

Chain replication does not violate agreement/safety property: for a given instance, no two nodes will have different commits because they copy the head of the chain. But chain replication will violate progress for the consensus instance in that slot if the chain topology changes. Let's say only the head and another node committed a value and they died or get disconnected, and as a result the chain topology is reconfigured by the config-box. No other node can commit another value for that instance because the epoch-number has changed with the configuration decision from the config-box. This is both good news and bad news. Safety is not violated but we lost progress/termination for that slot: the remaining nodes are not able to resurrect and resolve this particular consensus instance to termination. So although chain replication solves consensus in the absence of failures, in the presence of failures it deserts the consensus instance without culminating it to resolution and moves on. After the config-box appoints a new chain topology, the progress and safety are both satisfied for the next consensus instance (with the incremented epoch number).

To recap, chain replication gets things resolved/finalized and keeps the same log in the absence of faults, but in the presence of faults, the logs in participants may diverge. Consider a node that accepts a value, and then due to failure and chain reconfiguration it has been pushed out of the chain. How does that node learn whether what it has accepted before it crushed is skipped over or finalized? There is no commit in chain replication (of course the ack-backpropagation in the CRAQ optimization may work as commit)... Even with the plain chain replication, we can argue that, that node is now an incorrect node as marked by the config-box, so we don't care about its consistency. And if that node joins the chain again, it will join as tail, learn the same log as the other nodes. From that perspective, and by seeding off of the Paxos-backed config-box, we can argue that RSM can be implemented over chain replication.


The way to understand these things is by sparring with colleagues. I am grateful for Ailidani Ailijiang and Aleksey Charapko for the discussion. It is not easy to reason about distributed systems --but is certainly rewarding after the fact. It took us two or three animated discussion sessions over coffee to get to the bottom of this.

Tuesday, June 4, 2019

Book review. Loonshots: How to nurture the crazy ideas that win wars, cure diseases, and transform industries

This book, by Safi Bahcall, is about how to nurture radical breakthroughs in science and technology.

The book draws inspiration from the innovations Vannevar Bush made possible Office of Scientific Research and Development (OSRD), created in 1941, and the innovations Theodore N. Vail enabled at Bell.

OSRD's portfolio of accomplishments is impressive indeed. The war against Nazis is won through superiority in the field of science. The bombers' microwave radar cut through darkness and fog to detect German U-Boats, and rendered them ineffective in a matter of weeks.

The book compiles insights from the organizational principles Bush and Vail employed as Bush-Vail rules. The main concept here is of a dynamic equilibrium, where the organization maintains well-separated and equally strong loonshot and franchise groups (phase separation) continuously exchanging projects and ideas in both directions.

Summary of the The Bush-Vail rules

1. Separate the phases
  • separate your artists and soldiers
  • tailor the tools to the phase
  • watch your blind side: nurture both types of loonshots
2. Create dynamic equilibrium
  • love your artists and soldiers equally
  • manage the transfer, not the technology: be a gardener, not a Moses
  • appoint and train project champions to bridge the divide

3. Spread a system mindset
  • keep asking why the organization made the choices it did
  • keep asking how the decision-making process can be improved
  • identify teams with outcome mindset and help them adopt system mindset

4. Raise the magic number (Dunbar's number 150)
  • reduce return-on-politics
  • use soft equity (nonfinancial rewards)
  • increase project-skill fit
  • fix the middle (reduce perverse incentives for middle managers)
  • bring a gun to knife fight (engage a chief incentives officer)
  • fine-tune the spans (wide for loonshots groups; narrow for franchise groups)

Other examples in the book include: Peniciline citrium by Akira Endo, cancer drugs by Judah-Falkman, and Pan-Am's story. All of these were very engaging stories and I didn't know any of these before. One of my favorite quotes in the book is: "It is not a good drug unless it's been killed by three times."

Toward the end, the book talks about the Joseph Needham question: "Why didn't the Scientific Revolution take place in China (or India or Ottoman Empire), despite all its advantages?"

The book attributes the emergence of the scientific method in Europe to the ripe loonshot conditions in Europe.

  1. phase separation: separate loonshot and franchise groups
  2. dynamic equilibrium: seamless exchange between the two groups
  3. critical mass: a loonshot group large enough to ignite

MAD questions

1. What are the new things I learned from this book?
My 20 years in academic circles instilled in me the impression that you cannot manage innovation and research, instead you can only hope to cultivate it. My informal Twitter poll returned the following result. (I don't know of the ratio of the academic versus industrial people that voted on this.)
The book doesn't take an explicit position on my question above, but as the title "Bush-Vail rules" suggests, it tries to formulate rules for nurturing the loonshot/innovation process. But what good are these rules? I am certain that they are not sufficient for producing a successful loonshot. I am not sure if they are even necessary. But I agree that they would help increase your chances of success. And I also agree that they are more concrete than just suggesting "form strong teams and get out of their way." The question is how much more concrete advice is this from that bottomline?

In comparison, the book I read last month, "Creativity, Inc.: Overcoming the Unseen Forces That Stand in the Way of True Inspiration" focused on a much narrower domain, that of the loonshots accomplished within Pixar, but delivered more concrete advice for managing the creative process.

Lest you think I didn't like this book, I did enjoy the Loonshots book a lot and recommend it to anyone interested in building organizations that nurture creative work.

2. On a micro scale, does this explain the draft and revise principle in writing?
Drafting is the artist side. Revising is the soldier side. You can't have good writing unless you love both sides equally, and unless both sides interact with each other in a dynamic equilibrium. At some point, a phase transition occurs and you get to the correct narrative for your writing.

As Hemingway said: "The first draft of anything is shit."

Tuesday, May 28, 2019

Paper summary. Cloud programming simplified: A Berkeley view on Serverless Computing

This position paper by UC Berkeley RISE lab is about serverless computing, its shortcomings, and its potential. It is easy reading, and is still useful even if you have a pretty good understanding about serverless computing due to some insights and forecasts in the paper. As you will read below, the paper provides a very strong endorsement for serverless computing.

Instead of explaining the paper in my terms, I quote some of my highlights from the paper below, and at the end, in the MAD questions section, I discuss some of my thoughts on serverless computing.


We believe the main reason for the success of low-level virtual machines was that in the early days of cloud computing users wanted to recreate the same computing environment in the cloud that they had on their local computers to simplify porting their workloads to the cloud.

To set up your own environment in cloud (using virtual machines), you need to address these 8 issues.

  1. Redundancy for availability, so that a single machine failure doesn't take down the service. 
  2. Geographic distribution of redundant copies to preserve the service in case of disaster.
  3. Load balancing and request routing to efficiently utilize resources.
  4. Autoscaling in response to changes in load to scale up or down the system.
  5. Monitoring to make sure the service is still running well.
  6. Logging to record messages needed for debugging or performance tuning. 
  7. System upgrades, including security patching.
  8. Migration to new instances as they become available.

Compared to what it takes to set up the servers with the proper environment to run the code, the code to accomplish application logic might be dozens of lines of JavaScript.

In our definition, for a service to be considered serverless, it must scale automatically with no need for explicit provisioning, and be billed based on usage. Cloud functions are the general purpose element in serverless computing today, and lead the way to a simplified and general purpose programming model for the cloud.

While we are unsure which solutions will win, we believe all issues will all be addressed eventually, thereby enabling serverless computing to become the face of cloud computing.

Emergence of Serverless Computing

Serverless programming provides an interface that greatly simplifies cloud programming, and represents an evolution that parallels the transition from assembly language to high-level programming languages. Automated memory management relieves programmers from managing memory resources, whereas serverless computing relieves programmers from managing server resources.

There are three critical distinctions between serverless and serverfull computing:

  1. Decoupled computation and storage. The storage and computation scale separately and are provisioned and priced independently. In general, the storage is provided by a separate cloud service and the computation is stateless.
  2. Executing code without managing resource allocation. Instead of requesting resources, the user provides a piece of code and the cloud automatically provisions resources to execute that code.
  3. Paying in proportion to resources used instead of for resources allocated. Billing is by some dimension associated with the execution, such as execution time, rather than by a dimension of the base cloud platform, such as size and number of VMs allocated.

We believe serverless computing represents significant innovation over platform as a service (PaaS) and other previous models. Among these factors, the autoscaling offered by AWS Lambda marked a striking departure from what came before. It tracked load with much greater fidelity than serverful autoscaling techniques, responding quickly to scale up when needed and scaling all the way down to zero resources, and zero cost, in the absence of demand. It charged in a much more fine-grained way, providing a minimum billing increment of 100 ms at a time when other autoscaling services charged by the hour.

Cloud functions, or functions as a service (FaaS), provide general compute and are complemented by an ecosystem of specialized Backend as a Service (BaaS) offerings such as object storage, databases, or messaging.

Unlike serverless computing, Kubernetes is a technology that simplifies management of serverful computing. Kubernetes can provide short-lived computing environments, like serverless computing, and has far fewer limitations, e.g., on hardware resources, execution time, and network communication. It can also deploy software originally developed for on-premise use completely on the public cloud with little modification. Serverless computing, on the other hand, introduces a paradigm shift that allows fully offloading operational responsibilities to the provider, and makes possible fine-grained multi-tenant multiplexing.

Recent surveys found that about 24% of serverless users were new to cloud computing and 30% of existing serverful cloud customers also used serverless computing.

\\ Murat's note: While 24% is an impressive number, what is the control here? Maybe traditional cloud computing is also getting new users at that rate?

\\ Murat's note: Chat bots are very popular use case of serverless, even more than IoT in total. They are sneaking under the radar, but are worth watching for their future ubiquitous applications. 

Limitations of today's serverless platforms

In this section, we present an overview of five research projects and discuss the obstacles that prevent existing serverless computing platforms from achieving state-of-the-art performance, i.e., matching the performance of serverful clouds for the same workloads.

Serverless SQLite: Databases. A strawman solution would be to run common transactional databases, such as PostgreSQL, Oracle, or MySQL inside cloud functions. However, that immediately runs into a number of challenges. First, serverless computing has no built-in persistent storage, so we need to leverage some remote persistent store, which introduces large latency.  Second, these databases assume connection-oriented protocols, e.g., databases are running as servers accepting connections from clients. This assumption conflicts with existing cloud functions that are running behind network address translators, and thus don't support incoming connections. Finally, while many high performance databases rely on shared memory, cloud functions run in isolation so cannot share memory. While shared-nothing distributed databases do not require shared memory, they expect nodes to remain online and be directly addressable.

Lack of fine-grained coordination. Applications are left with no choice but to either (1) manage a VM-based system that provides notifications, as in ElastiCache and SAND, or (2) implement their own notification mechanism, such as in ExCamera, that enables cloud functions to communicate with each other via a long-running VM-based rendezvous server. This limitation also suggests that new variants of serverless computing may be worth exploring, for example naming function instances and allowing direct addressability for access to their internal state (e.g., Actors as a Service).

Networking challenges. There may be several ways to address this challenge:

  1. Provide cloud functions with a larger number of cores, similar to VM instances, so multiple tasks can combine and share data among them before sending over the network or after receiving it.
  2. Allow the developer to explicitly place the cloud functions on the same VM instance. Offer distributed communication primitives that applications can use out-of-the-box so that cloud providers can allocate cloud functions to the same VM instance.
  3. Let applications provide a computation graph, enabling the cloud provider to co-locate the cloud functions to minimize communication overhead. 

Summary and predictions

By providing a simplified programming environment, serverless computing makes the cloud much easier to use, thereby attracting more people who can and will use it. [This is] a maturation akin to the move from assembly language to high-level languages more than four decades ago.

We predict that serverless use will skyrocket.

The first step is Serverless Ephemeral Storage, which must provide low latency and high IOPS at reasonable cost, but need not provide economical long term storage. A second class of applications would benefit from Serverless Durable Storage, which does demand long term storage. New non-volatile memory technologies may help with such storage systems. Other applications would benefit from a low latency signaling service and support for popular communication primitives.

Two challenges for the future of serverless computing are improved security and accommodating cost-performance advances that are likely to come from special purpose processors.

The future of serverful computing will be to facilitate BaaS. Applications that prove to be difficult to write on top of serverless computing, such as OLTP databases or communication primitives such as queues, will likely be offered as part of a richer set of services from all cloud providers.

MAD questions

1.  Is a very strong endorsement for serverless warranted?
The paper gives very strong endorsements for serverless:
We predict that serverless use will skyrocket.
While we are unsure which solutions will win, we believe all issues will all be addressed eventually, thereby enabling serverless computing to become the face of cloud computing.
Remember, when we read papers, we should fight vigorously with the claims, and play the devil's advocate. So let's challenge this claim. What could be the reasons this claim may not hold?

First of all, we need to quantify and limit the claim. What does skyrocket mean? What does it mean for serverless to become the face of cloud computing? And finally what does serverless mean? Is this claim true of today's cloud functions? If we don't have a stable definition of serverless, this claim is prone to the No True Scotsman fallacy. If serverless use does not skyrocket, it will be because we don't have "true" serverless yet.

Ok, assuming that the claim is quantified, what may be some reasons it could fail?

Serverless improves greatly on ease of use, and that alone may warrant a lot of use for serverless. But ease-of-use is not necessarily exclusive to serverless. BaaS managed services, like distributed databases, can get even easier to use. And some even support stored procedures, which helps meet some of the serverless needs.

When comparing with PaaS, the paper said that serverless differentiates itself due to its very quick autoscaling. But, this may not be such a strong differentiator for the customers. Most customers may not have very bursty  workloads that require quick and extreme scaling.

Another contender for the serverless lunch may be software as a service (SaaS), like instagram, icloud, etc. SaaS can be even simpler to use than serverless, and may be programmed with visual workflows using mouse clicks. SaaS may steal users from serverless would work if SaaS services play well with each other so customers can pipe output from one as input to others.

2. Could serverless ever work for stateful services?
It is easy to make FaaS serverless because it is stateless. But FaaS scalability is limited by the BaaS scalability it depends on. It is easy to scale storage, because it is also stateless. But, the story becomes murkier when it comes to scalability of stateful services. At the limits, this is likely to be impossible: You can't have extreme scalability and extreme state (requiring incessant coordination). But outside the extremes, with good engineering we can get quick scalability for stateful services.

3. "Berkeley view" papers
If you are into this stuff, here are two other Berkeley view papers.

A Berkeley view of systems challenges for AI

Above the Clouds: A Berkeley View of Cloud Computing

Also there was a recent CIDR paper by another group of UC Berkeley researchers on serverless computing titled: "Serverless Computing: One Step Forward, Two Steps Back", which I had covered before. This paper is worth reading for another perspective on serverless.

Friday, May 24, 2019

Paper summary. Scalable Consistency in Scatter

Here is the pdf for the paper. It is by Lisa Glendenning, Ivan Beschastnikh, Arvind Krishnamurthy, and Thomas Anderson, Department of Computer Science & Engineering University of Washington.

This paper is about peer-to-peer (P2P) systems. But the paper is from 2011, way after the P2P hype had died. This makes the paper more interesting, because it had the opportunity to consider things in hindsight. The P2P corpse was cold, and Dynamo had looted the distributed hash tables (DHT) idea from P2P and applied it in the context of datacenter computing. In return, this work liberates the Paxos coordination idea from the datacenter world and employs it in the P2P world. It replaces each node (or virtual node) in a P2P overlay ring with a Paxos group that consists of a number of nodes.

Ok, what problem do Paxos groups solve in the P2P systems? In the presence of high churn, DHTs in P2P systems suffer from inconsistent routing state and inconsistent name space partitioning issues (see Figure 1). By leveraging the Paxos group abstraction as a stable base to build these coordination operations (split, merge, migrate, repartition), Scatter achieves linearizable consistency even under adverse circumstances.

Group coordination 

Scatter supports the following multi-group operations:

  • split: partition the state of an existing group into two groups
  • merge: create a new group from the union of the state of two neighboring groups
  • migrate: move members from one group to a different group
  • repartition: change the key-space partitioning between two adjacent groups

Each multi-group operation in Scatter is structured as a distributed transaction. The paper calls this design pattern as nested consensus, and says: "We believe that this general idea of structuring protocols as communication between replicated participants, rather than between individual nodes, can be applied more generally to the construction of scalable, consistent distributed systems."

Nested consensus uses a two-tiered approach. At the top tier, groups execute a two-phase commit protocol (2PC), while within each group Paxos is used for agreeing on the actions that the group takes. Provided that a majority of nodes in each group remain alive and connected, the 2PC protocol will be non-blocking and terminate. (This is the same argument Spanner uses as it employs 2PC over Paxos groups.) For individual links in the overlay to remain highly available, Scatter maintains an additional invariant: a group can always reach its adjacent groups. To maintain this connectivity, Scatter enforces that every adjacent group of a group A has up-to-date knowledge of the membership of A.

Multi-group operations are coordinated by whichever group decides to initiate the transaction as a result of some local policy. The group initiating a transaction is called the coordinator group and the other groups involved are called the participant groups. This is the overall structure of nested consensus:

  1. The coordinator group replicates the decision to initiate the transaction.
  2. The coordinator group broadcasts a transaction prepare message to the nodes of the participant groups.
  3. Upon receiving the prepare message, a participant group decides whether or not to commit the proposed transaction and replicates its vote.
  4. A participant group broadcasts a commit or abort message to the nodes of the coordinator group.
  5. When the votes of all participant groups is known, the coordinator group replicates whether or not the transaction was committed.
  6. The coordinator group broadcasts the outcome of the transaction to all participant groups.
  7. Participant groups replicate the transaction outcome.
  8. When a group learns that a transaction has been committed then it executes the steps of the proposed transaction, the particulars of which depend on the multi-group operation.

Figure 5 shows an example of this template for group-split operation. After each group has learned and replicated the outcome (committed) of the split operation at time t3, the following updates are executed by the respective group: (1) G1 updates its successor pointer to G2a, (2) G3 updates its predecessor pointer to G2b, and (3) G2 executes a replicated state machine reconfiguration to instantiate the two new groups which partition between them G2's original key-range and set of member nodes.

The storage service (discussed next) continues to process client requests during the execution of group transactions except for a brief period of unavailability for any reconfiguration required by a committed transaction. Also, groups continue to serve lookup requests during transactions provided that the lookups are serialized with respect to the transaction commit.

Storage service

To improve throughput for put and get operations on keys, Scatter divides the key range assigned to the Paxos group into sub-ranges and assigns these sub-ranges to nodes within the Paxos group. Each key is only assigned to one primary and is serialized by that primary. The group leader replicates information regarding the assignment of keys to primaries using Paxos, as it does with the state for multi-group operations. Once an operation is routed to the correct group for a given key, then any node in the group will forward the operation to the appropriate primary. The primaries can run Paxos on the keys assigned to themselves concurrently with each other because this does not result in a conflict: it is OK to have different keys updated at the same time, since linearizability is a per key property.

Scatter provides linearizable storage within a given key and does not attempt to linearize multi-key application transactions.  A read is served by a primary within the Paxos group which is responsible for that key. The primary uses leader lease with the rest of the nodes. It is possible to provide weaker consistency reads, as is default in ZooKeeper, by reading from one node in the group.

Figure 7 plots the probability of group failure for different group sizes for two node churn rates with node lifetimes drawn from heavy-tailed Pareto distributions observed in typical peer-to-peer systems. The plot indicates that a modest group size of 8-12 prevents group failure with high probability. The prototype implementation in the paper demonstrates that even with these very short node lifetimes, it is possible to build a scalable and consistent system with practical performance. This was surprising to me.


They evaluate Scatter in a variety of configurations, for both micro-benchmarks and for a Twitter-style application. Compared to OpenDHT, Scatter provides equivalent performance with much better availability, consistency (i.e. linearizability), and adaptability even in very challenging environments. For example, if average node lifetimes are as short as 180 seconds, therefore triggering very frequent reconfigurations to maintain data durability, Scatter is able to maintain overall consistency and data availability, serving its reads in an average of 1.3 seconds in a typical wide area setting.

This is good performance, but to put things in context of datacenter computing, the evaluation is done with "small data". When you have many gigabytes (if not terabytes) of data assigned to each node, just to copy that data at line speed may take more time than the churn rate of the the nodes in a P2P environment.

The paper also compares Scatter against statically partitioned ZooKeeper groups. Here, the key-space partitioning was derived based on historical workload characteristics, but the inability to adapt to dynamic hotspots in the access pattern limits the scalability of the ZooKeeper-based groups deployment. Further, the variability in the throughput also increases with the number of ZooKeeper instances used in the experiment.

In contrast, Scatter's throughput scales linearly with the number of nodes, with only a small amount of variability due to uneven group sizes and temporary load skews. This is because Scatter uses ring and group operations to adapt to change in access patterns. Based on the load balancing policy in Scatter, the groups repartition their keyspaces proportionally to their respective loads whenever a group's load is a factor of 1.6 or above that of its neighboring group. As this check is performed locally between adjacent groups, it does not require global load monitoring, but it might require multiple iterations of the load-balancing operation to disperse hotspots.

Hat tip for @DharmaShukla for recommending the paper to me. The paper has inspired some design decisions in Cosmos DB.

MAD questions

1. What could be some alternative designs to solve this problem?
Instead of arranging the Paxos groups in a ring, why not have a vertical-Paxos group overseeing the Paxos groups? The vPaxos box would be assigning key ranges to Paxos groups, coordinating the group operations (split, merge, load-balance) and maintaining the configuration information of the Paxos groups. This would allow adapting to changes in workload and reconfiguring in reaction to node availability in a much faster manner than that of the P2P ring, where load-balancing is done by adjacent groups dispersing load to each other in multiple iterations.

Another problem with Scatter is that it lacks WAN locality optimization. A client may need to go across the globe to contact a Paxos group responsible for keys that it interacts with the most. WPaxos can learn and adopt to these patterns. So, while we are at it, why not replace the vanilla Paxos in the Paxos group with WPaxos to achieve client access locality adaptation in an orthogonal way. Then the final set up becomes VPaxos over-seeing groups of WPaxos deployments.

2. Would it ever be possible to replace datacenters with P2P technologies?
The paper in the introduction seems fairly optimistic: "Our interest is in building a storage layer for a very large scale P2P system we are designing for hosting planetary scale social networking applications. Purchasing, installing, powering up, and maintaining a very large scale set of nodes across many geographically distributed data centers is an expensive proposition; it is only feasible on an ongoing basis for those applications that can generate revenue. In much the same way that Linux offers a free alternative to commercial operating systems for researchers and developers interested in tinkering, we ask: what is the Linux analogue with respect to cloud computing?"

I am not very optimistic...

3. Why don't we invest in better visualizations/figures for writing papers?
This paper had beautiful figures for explaining concepts. Check Figure 4 below, it shows two groups considering different operations concurrently, visualized with thought bubbles. These figures go a long way. It is a shame we don't invest any effort in standardizing and teaching good illustration techniques to support exposition. It is even discouraged to use colors because they look faded/blended when printed in black and white. For God's sake, it is 2019, and we should level up our illustration game.

What are some other examples of papers with beautiful figures illustrating concepts? Please let me know. They are a treat to read.

Two-phase commit and beyond

In this post, we model and explore the two-phase commit protocol using TLA+. The two-phase commit protocol is practical and is used in man...