Tuesday, October 8, 2019

Recent book diet

The last couple of months I have been listening to books using the Libby app. I highly recommend Libby: it connects you to your public library and lets you search, borrow, and download audiobooks easily. I used to listen to a lot of podcasts, but since I downloaded Libby, I have mostly been listening to books. Here are some of the books I listened to.

The Science of Discworld III: Darwin's Watch (2005)

This book is by Terry Pratchett, Ian Stewart, and Jack Cohen. It is set on the Discworld, and it teaches solid science while being entertaining at the same time. The book covers not only evolution but also a lot of quantum physics, string theory, and time travel (yes, the science behind time travel).

The book gives a good account of Darwin's life as well as his contemporaries, like Wallace. I think Darwin's superpower was writing. This was the Victorian era, when writers suddenly started to capture and hold the mindshare of a large fraction of the population. Darwin seized this opportunity well, as he was a talented and prolific writer.

In contrast to the myth around him, Darwin was not an atheist. The Wikipedia article on the topic says that he had a Unitarian background, but "from around 1849 Darwin stopped attending church, but Emma (his wife) and the children continued to attend services. On Sundays Darwin sometimes went with them as far as the lych gate to the churchyard, and then he would go for a walk. During the service, Emma continued to face forward when the congregation turned to face the altar for the Creed, sticking to her Unitarian faith." Darwin still believed that "God was the ultimate lawgiver" even after writing the Origin of Species. He wrote "In my most extreme fluctuations I have never been an atheist in the sense of denying the existence of a God.— I think that generally (& more and more so as I grow older) but not always, that an agnostic would be the most correct description of my state of mind."

I highly recommend this book, though I think it could have been shorter and sweeter. It was also not very well organized, but it managed to stay engaging most of the time.

Ready Player One (2011)

This book was written by Ernest Cline. I liked this book, but it had an amateurish feel to it. It turns out this was Cline's first novel.

The book is written in the first-person narrative (which is also the case for The Hunger Games, which I discuss below). Do amateur writers prefer the first person because it is harder for them to pull off a third-person narrative? Is the first person supposed to be more engaging? Maybe the first-person narrative acts like a self-hypnosis session for the reader to role-play along.

In the book, the female protagonist, Art3mis, is caricatured to please the male target audience. She comes across as very submissive in her dialogues with Parzival, and this bugged me a lot.

The book's plot is significantly different from that of its movie adaptation. The book is all about 80s cultural references, with which the author seems very comfortable. I think he set up a nice futuristic world that still enabled him to make the book about 80s nerd-dom.

Overall, this is an OK book. When talking about the haptic VR suit, and the nerd culture of obsessing about stuff, the book gives a realistic but not very desirable view of the future.

The Hunger Games (2008)

This book was written by Suzanne Collins. It also felt rushed and underdeveloped, but it was engaging. The parts where the book talked about hunger and sisterly love tugged hard on the heartstrings. Maybe a bit too hard.

(SPOILERS!)
There were big gaping plot holes in the book. In the beginning of the games, Peeta collaborates with the Careers. Then he is a ruthless killer who finishes off a girl with a knife. He is also portrayed as very good at throwing knives. Later in the games, he comes off as a wimpy pacifist again. He feels very sad when he inadvertently kills Foxface, who swiped the poisonous wild berries he had collected. What was that all about? Did no one proofread the book before publishing?

Katniss is comically ignorant of others' feelings towards her. She is smart about other things but very dumb when it comes to reading social cues, especially the romantic interest of others. The only plausible alternative to bad writing (which I can't rule out) is that she has Asperger's or autism. (Google produces many hits for this, but the book did not develop this line.)

The setting of the book was promising, but I don't think it got developed/analyzed enough either. I realize there are two sequels, so maybe those books picked up on these threads. So I recommend either avoiding the book or listening/reading it with low expectations. It is easy reading/listening. I listened to it only because I had a long drive and needed to pass the time.

"What Technology Wants" (2010) and "Inevitable" (2016) by Kevin Kelly

I had high expectations going into these books, but I was disappointed by both. Frankly, the two books cover pretty much the same topics, and they merged in my mind into one (very, very long) book.

The first book talks about the technium, the ecology/culture around technology, and explores the characteristics and path of the technium. It tries to make a case that the technium is an organic/emergent organism and that, like a child outgrowing its parents, it will leave home one day. The book already starts talking about inevitabilities for the technium, leading the way to the second book. The discussion about DNA being a miracle molecule was interesting. The book also mentions that there are inevitable optimal valleys for evolution to gradient-descent into, and these lead to convergent evolution. This is then connected to theories about directed evolution, and whether the technium was somehow inevitably coded in our genes. Thought-provoking stuff, maybe, but it stalls there at the metaphorical level, without any further development, supporting arguments, or critical evaluation.

The Inevitable talks about "12 technological forces that will shape our future". These are outlined below, as quoted from the Wikipedia entry:
  1. Becoming: Moving from fixed products to always upgrading services and subscriptions
  2. Cognifying: Making everything much smarter using cheap powerful AI that we get from the cloud
  3. Flowing: Depending on unstoppable streams in real-time for everything
  4. Screening: Turning all surfaces into screens
  5. Accessing: Shifting society from one where we own assets, to one where instead we will have access to services at all times
  6. Sharing: Collaboration at mass-scale 
  7. Filtering: Harnessing intense personalization in order to anticipate our desires
  8. Remixing: Unbundling existing products into their most primitive parts and then recombining in all possible ways
  9. Interacting: Immersing ourselves inside our computers to maximize their engagement
  10. Tracking: Employing total surveillance for the benefit of citizens and consumers
  11. Questioning: Promoting good questions is far more valuable than good answers
  12. Beginning: Constructing a planetary system connecting all humans and machines into a global matrix
The problem with these books is that they are very long, with a lot of unnecessary filler text, and they get very boring and dry as the narrative drags on. If I were reading them I could skim and skip ahead, but listening to them I didn't get that opportunity. Another problem is that the topics/ideas are explored only in a general and abstract manner.

Kevin Kelly is a very interesting guy. I pay attention to his short-form writing. But these books were not good. They would be improved a lot by cutting two thirds of them. So my recommendation is to avoid these books and read summaries/highlights from them instead.

MAD questions

Can we write nonfiction books with story narratives?

You can easily spoil a fictional work/story, but you cannot do that with nonfiction. I was wondering whether we can write nonfiction with such an engaging story that readers would get angry if you gave them spoilers.

I think good science/technology journalists already do this. Instead of a generic instance, they focus on one individual and tell the story of a disease or invention in a more personalized way. I liked the "Undoing Project" and "Flash Boys" books by Michael Lewis a lot. I would have complained hard if someone had given me spoilers about those books while I was reading them. The Hackers book by Steven Levy was also somewhat like that.

Long ago one of my colleagues told me that he was tired of writing papers in the straightforward, boring format. He said he had tried to write a research paper where he built up gradually to the punchline and revealed it at the end, and this made the reviewers very unhappy. Today I asked him about the fate of that paper again. He told me that he made "the paper very formal and hideously complicated, so it got published".

There is a benefit to the traditional predictable format, because it makes the readers' and reviewers' job easier. But it also spoils the fun of reading the paper. I sometimes lose interest after an introduction where everything is revealed. Sometimes the only reason I read the rest of the paper is to learn some techniques in detail, or see where the authors cut corners and cheat by sneaking in assumptions/limitations not mentioned in the introduction.

They say the obvious UI is the best, but is our paper-writing format, where we give the spoilers in the abstract and the introduction, the obviously better way to do things?

Sunday, October 6, 2019

Paper review: Comprehensive and efficient runtime checking in system software through watchdogs

This paper by Chang Lou, Peng Huang, and Scott Smith appeared in HotOS 2019. The paper argues that system software needs intrinsic detectors that monitor internally for subtle issues specific to a process. In particular, the paper advocates employing intrinsic watchdogs as detectors. Watchdogs (also known as grenade timers) have been widely used in embedded devices. A watchdog uses a decrementing timeout counter which resets the processor when it reaches zero. To prevent a reset, the software must keep restarting the watchdog counter after performing/clearing some sanity checks.
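To make the mechanism concrete, here is a minimal sketch of the watchdog pattern in Go (my own illustration, not code from the paper; the type names and durations are made up). The monitored software "pets" the watchdog after its sanity checks pass; if no pet arrives before the deadline, the watchdog fires a reset action.

```go
package main

import (
	"log"
	"time"
)

// Watchdog fires a reset action unless it is petted before the deadline.
// Real watchdogs are often hardware countdown counters that reboot the
// processor when they reach zero; this sketch uses a software timer.
type Watchdog struct {
	pet      chan struct{}
	timeout  time.Duration
	onExpire func()
}

func NewWatchdog(timeout time.Duration, onExpire func()) *Watchdog {
	w := &Watchdog{pet: make(chan struct{}), timeout: timeout, onExpire: onExpire}
	go w.run()
	return w
}

func (w *Watchdog) run() {
	timer := time.NewTimer(w.timeout)
	for {
		select {
		case <-w.pet: // sanity checks passed; restart the countdown
			if !timer.Stop() {
				<-timer.C
			}
			timer.Reset(w.timeout)
		case <-timer.C: // no pet in time: assume the process is stuck
			w.onExpire()
			return
		}
	}
}

// Pet is called by the monitored software after its sanity checks clear.
func (w *Watchdog) Pet() { w.pet <- struct{}{} }

func main() {
	w := NewWatchdog(2*time.Second, func() { log.Fatal("watchdog expired: resetting process") })
	for {
		// ... do real work and run sanity checks here ...
		w.Pet()
		time.Sleep(500 * time.Millisecond)
	}
}
```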


Table 1 summarizes the comparison of crash failure detectors, intrinsic watchdogs, and error handlers. Failure detectors are too general; they just make "up-or-down" decisions. They are only good for achieving liveness, as they are too unreliable for making safety decisions.

The disadvantage with error handlers, the paper argues, is that liveness-related failures often do not have explicit error signals that can trigger a handler: there is no signal for, e.g., a write being blocked indefinitely, or some thread deadlocking or infinitely looping.

The mimicking approach for writing watchdog timers



The paper prescribes the mimicking approach, where the checker selects important operations from the main program and mimics them to detect errors. Since the mimic checker exercises similar code logic in a production environment, it has the potential to catch and locate bugs in the program as well as faults in the environment.

The challenge with this approach is to systematically select important/representative operations from the main program. To solve this problem, the paper proposes a method using static analysis to automatically generate mimic-type watchdogs, which it calls program logic reduction.

We instead propose to derive from P a reduced but representative version W, which nonetheless retains enough code to expose gray failures. Our hypothesis that such reduction is viable stems from two insights. First, most code in P need not be checked at runtime because its correctness is logically deterministic --such code is better suited for unit testing before production and thus should be excluded from W. Second, W’s goal is to catch errors rather than to recreate all the details of P’s business logic. Therefore, W does not need to mimic the full execution of P. For example, if P invoked write() many times in a loop, for checking purposes, W may only need to invoke write() once to detect a fault.
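As a concrete (and entirely hypothetical) illustration of this reduction: if the main program P calls write() many times in a loop, the derived checker W only needs to exercise that operation once, against a scratch file and with a deadline, to catch a stuck write that produces no error signal. The sketch below is mine, not AutoWatchdog output.

```go
// Sketch of a mimic-type checker for a write path (illustrative only, not
// AutoWatchdog output). The main program may call write() many times in a
// loop; the checker exercises the same operation once, against a scratch
// file in the same directory, and reports a fault if the call blocks past
// a deadline even though no explicit error signal is raised.
package main

import (
	"errors"
	"os"
	"time"
)

func mimicWriteCheck(dir string, timeout time.Duration) error {
	done := make(chan error, 1)
	go func() {
		f, err := os.CreateTemp(dir, "watchdog-probe-*")
		if err != nil {
			done <- err
			return
		}
		defer os.Remove(f.Name())
		defer f.Close()
		_, err = f.Write(make([]byte, 4096)) // mimic one representative write
		done <- err
	}()
	select {
	case err := <-done:
		return err // explicit error signal, if any
	case <-time.After(timeout):
		return errors.New("mimic write blocked past deadline") // liveness fault
	}
}
```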

The authors have built a prototype, called AutoWatchdog, and applied it to ZooKeeper, Cassandra, and HDFS, generating tens of checkers for each. Figure 2 shows an example from ZooKeeper.

MAD questions


1. If we detect the problems by mimicking the program, why don't we do the detection as part of the program rather than using a separate watchdog?

The watchdog detectors have a countdown timer for detecting when the program gets stuck on some operation. This, of course, could have been added to the program itself, but maybe that would make the program look complicated. I guess having this in the watchdog detectors provides modularity, as in aspect-oriented programming.

I guess another benefit of having a separate watchdog detector is that we have the flexibility to locate it more centrally, rather than colocating it with the execution in one process. The gray failures paper made a case about the asymmetry of information and the need for more end-to-end detection. With a separate watchdog detector, we could maybe put parts of it, or copies of it, in different processes to detect/check faults from different perspectives.

2. How do we make watchdogs compose?
One challenge that will surface when AutoWatchdog creates multiple watchdog detectors for a program is interference among the watchdogs. A reset triggered by one watchdog detector may lead to another reset triggered by another detector. And this may even continue as the two watchdog detectors keep triggering resets for each other. Of course, this is more about correction than detection, so it is outside the scope of the paper.

However, even for just detection, care must be taken that the mimicking in the watchdog detectors does not have side effects and mess up the correctness of the program. The paper cautions about this problem: "Executing watchdog checkers should not incur unintended side-effects or add significant cost to the normal execution. For example, in monitoring the indexer of kvs, the checkers may try to retrieve or insert some keys, which should not overwrite data produced from the normal execution or significantly delay normal request handling." I am not sure how it is possible to avoid this problem completely with automatically generated watchdogs. If sandboxes are used, the watchdogs would not be testing/monitoring the real production environment.

Friday, October 4, 2019

Book review: Three body problem

This book, by Cixin Liu, easily secures a place among the top 20 sci-fi books I have read. The book was 400 pages, but it was very engaging. I felt sad when I found that I had finished the book. I was happy again when I learned it has two sequels. I am looking forward to reading them.

The book starts in a very engaging way with the discussion of the Cultural Revolution. This was engaging for me because 1) I didn't know much about this period of Chinese history, and 2) I am an academic, and I was aghast at what was done to academics and intellectuals in that period. I couldn't fathom that such ignorance was possible. But given the path Turkey has been going down during the past 10+ years, and given the descent of US politics down a similar path, I followed these descriptions in horror. I couldn't imagine things could get that bad and stay that bad for that long.

The importance of basic science was very well emphasized in this book. The book has a meta message. It starts with the Cultural Revolution, where science was shunned/repressed, and after the events escalate on present-day Earth, the world again ends up in a place where science is repressed/undervalued. I did enjoy those parallels.

The book cuts back and forth to a VR game the protagonist plays, called the "Three Body" game. This reminded me of the back-and-forth cuts in "Ender's Game" with the game Ender plays on his device, the stories the girl's book tells in Diamond Age, and the prelude sections in the Gödel, Escher, Bach book. The cut-back technique used here was excellent. The fictional characters in the game refer to prominent scientists, and the game shows some parallels to the progression of science on Earth as well.

Cultural revolution 

Since I didn't know much about China's history, I first thought the Cultural Revolution descriptions in the book were fictional. They had to be, I thought, because it is so surreal how people bought into this stuff and how the entire country chose illiteracy over literacy and ceased to function for ten years. Secondly, I wondered how Cixin Liu, a Chinese author, could write freely about this period and criticize it so sharply. It turns out it is OK to criticize the Cultural Revolution and the Red Guards, provided that you do not blame Chairman Mao for it. Even though he started the Cultural Revolution to purge intellectuals (who must have been corrupted by Western influence), apparently he was innocent and unaware of the extent of the purge. (Four party officials --the Gang of Four-- got blamed for the events afterwards.) This is of course absolutely ridiculous. What is sad is that crimes and madness of this magnitude went unpunished, and everyone was (or had to be) OK with it. But such is life. Unfortunately, as I get older, I shed my unfounded faith in the goodness of humanity----which is also a theme explored in this book.

But for this mass struggle session, the victims were the reactionary bourgeois academic authorities. These were the enemies of every faction, and they had no choice but to endure cruel attacks from every side. Compared to other “Monsters and Demons,” reactionary academic authorities were special: During the earliest struggle sessions, they had been both arrogant and stubborn. That was also the stage in which they had died in the largest numbers. Over a period of forty days, in Beijing alone, more than seventeen hundred victims of struggle sessions were beaten to death. Many others picked an easier path to avoid the madness: Lao She, Wu Han, Jian Bozan, Fu Lei, Zhao Jiuzhang, Yi Qun, Wen Jie, Hai Mo, and other once-respected intellectuals had all chosen to end their lives.
“Relativity is part of the fundamental theories of physics,” Ye answered. “How can a basic survey course not teach it?” “You lie!” a female Red Guard by his side shouted. “Einstein is a reactionary academic authority. He would serve any master who dangled money in front of him. He even went to the American Imperialists and helped them build the atom bomb! To develop a revolutionary science, we must overthrow the black banner of capitalism represented by the theory of relativity!” 
- “Should philosophy guide experiments, or should experiments guide philosophy?” Ye’s sudden counterattack shocked those leading the struggle session.   
- “Of course it should be the correct philosophy of Marxism that guides scientific experiments!” one of the male Red Guards finally said. “Then that’s equivalent to saying that the correct philosophy falls out of the sky. This is against the idea that the truth emerges from experience. It’s counter to the principles of how Marxism seeks to understand nature.”                

The revolution eats its children first. The Red Guards didn't get punished, but they didn't prosper either.
“Of the four of us, three had signed the big-character poster at the high school attached to Tsinghua. Revolutionary tours, the great rallies in Tiananmen, the Red Guard Civil Wars, First Red Headquarters, Second Red Headquarters, Third Red Headquarters, Joint Action Committee, Western Pickets, Eastern Pickets, New Peking University Commune, Red Flag Combat Team, The East is Red—we went through every single milestone in the history of the Red Guards from birth to death.” 
“Then, we were sent to the wilderness!” The thickset woman raised her arms. “Two of us were sent to Shaanxi, the other two to Henan, all to the most remote and poorest corners. When we first went, we were still idealistic, but that didn’t last. After a day of laboring in the fields, we were so tired that we couldn’t even wash our clothes. We lay in leaky straw huts and listened to wolves cry in the night, and gradually we woke from our dreams. We were stuck in those forgotten villages and no one cared about us at all.” The one-armed woman stared at the ground numbly. “While we were down in the countryside, sometimes, on a trail across the barren hill, I’d bump into another Red Guard comrade or an enemy. We’d look at each other: the same ragged clothes, the same dirt and cow shit covering us. We had nothing to say to each other.”  
“Tang Hongjing was the girl who gave your father the fatal strike with her belt. She drowned in the Yellow River. There was a flood that carried off a few of the sheep kept by the production team. So the Party secretary called to the sent-down students, ‘Revolutionary youths! It’s time to test your mettle!’ And so, Hongjing and three other students jumped into the river to save the sheep.
The cold winter of the Cultural Revolution really was over, and everything was springing back to life. Even though the calamity had just ended, everything was in ruins, and countless men and women were licking their wounds. The dawn of a new life was already evident.  
Science and technology were the only keys to opening the door to the future, and people approached science with the faith and sincerity of elementary school students.



BIG SPOILERS FROM NOW ON!!

From here on there are big spoilers. Read the rest only if you have read the book. It is a great book, and you shouldn't spoil your enjoyment by reading my discussion of the plot.

The Cultural Revolution days set the background for Ye Wenjie, one of the key figures in the story, and her choice that affects the fate of humanity. From Ye Wenjie's days working at the radio observatory at Red Coast, the book skips to the present day, to another protagonist, Wang Miao, a nanotechnology professor who is contacted by a secretive "Frontiers of Science" group. After that, Wang starts seeing a countdown in his camera and then in his eyes/retinas. At this point things escalate quickly; the pace of the book goes from slow to uncomfortably fast. I was convinced that the only explanation for this "miracle" is the simulation hypothesis (that we are living in a simulation), because no natural force could pull off this miracle.

Later I learned that Ye Wenjie did invite the Trisolarans (an alien race living on a planet four light years away) to invade the Earth. Then I started to wonder how the book would be able to explain this countdown-timer miracle as a technology of the alien race. But the book delivered on this at the end. The answer was the sophon project, a 2-D supercomputer folded into an 11-D proton! This was simply awesome!

I really liked the reveal of the three suns as the cause of the chaotic and stable eras of the Trisolaran planetary system in the Three Body VR game. The game also revealed details about Trisolaran society, which was interesting: an autocratic and very socialist society (maybe sort of hinting back at the Chinese government again).

Shi Qiang, or detective Da Shi, was a very interesting character. The ship-slicing scene with nano-wires was unreal. It was very gory, and I don't think it was warranted. (After I read this, I had a nightmare about it.) But how could the people on the ship not detect the problem earlier? If they saw other people getting sliced 10 meters ahead, they could easily run back, sound an alarm, and warn others. Also, why didn't the sophons warn the ship about the trap? Given that they are super-AI, omnipresent, and all-seeing, they had a good chance to detect this, no?

Selected notes from the book

“In China, any idea that dared to take flight would only crash back to the ground. The gravity of reality is too strong.” 
“Theory is the foundation of application. Isn’t discovering fundamental laws the biggest contribution to our time?” 
“At the same time, they want to ruin science’s reputation in society. Of course some people have always engaged in anti-science activities, but now it’s coordinated.” 
“Everyone is afraid of something. The enemy must be, too. The more powerful they are, the more they have to lose to their fears.”
After calming himself and walking to the other end of the long table, Wang said, “It’s actually pretty simple. The reason why the sun’s motion seems patternless is because our world has three suns. Under the influence of their mutually perturbing gravitational attraction, their movements are unpredictable—the three-body problem. When our planet revolves around one of the suns in a stable orbit, that’s a Stable Era. When one or more of the other suns move within a certain distance, their gravitational pull will snatch the planet away from the sun it’s orbiting, causing it to wander unstably through the gravitational fields of the three suns. That’s a Chaotic Era. After an uncertain amount of time, our planet is once again pulled into a temporary orbit and another Stable Era begins. This is a football game at the scale of the universe. The players are the three suns, and our planet is the football.”
Pan studied everyone meaningfully, and then added in a soft voice, “How would you feel if Trisolaran civilization were to enter our world?” “I would be happy.” The young reporter was the first to break the silence. “I’ve lost hope in the human race after what I’ve seen in recent years. Human society is incapable of self-improvement, and we need the intervention of an outside force.” 
“And it is this: The human race is an evil species. Human civilization has committed unforgivable crimes against the Earth and must be punished. The ultimate goal of the Adventists is to ask our Lord to carry out this divine punishment: the destruction of all humankind.” 
“Pan-Species Communism. It’s an ideology I invented. Or maybe you can call it a faith. Its core belief is that all species on Earth are created equal.”
They are sealing off the progress of human science. Because of the existence of these two protons, humanity will not be able to make any important scientific developments during the four and a half centuries until the arrival of the Trisolaran Fleet. Evans once said that the day of arrival of the two protons was also the day that human science died. 
Charcoal inside a filter is three-dimensional. Their adsorbent surfaces, however, are two-dimensional. Thus, you can see how a tiny high-dimensional structure can contain a huge low-dimensional structure. 
You will never be as good at it as criminals, masters of out-of-the-box thinking.                
“That flower may be delicate, but it possesses peerless splendor. She enjoys freedom and beauty in the ease of paradise.” 
The science consul said, “Project Sophon, to put it simply, aims to transform a proton into a superintelligent computer.” 
“This is a science fantasy that most of us have heard about,” the agricultural consul said. “But can it be realized? I know that physicists can already manipulate nine of the eleven dimensions of the micro-scale world, but we still can’t imagine how they could stick a pair of tiny tweezers into a proton to build large-scale integrated circuits.
A particle seen from a seven-dimensional perspective has a complexity comparable to our Trisolaran stellar system in three dimensions. From an eight-dimensional perspective, a particle is a vast presence like the Milky Way. When the perspective has been raised to nine dimensions, a fundamental particle’s internal structures and complexity are equal to the whole universe. As for even higher dimensions, our physicists haven’t been able to explore them, so we cannot yet imagine the degree of complexity.” 
The science consul said, “A sophon has been born. We have endowed a proton with wisdom. This is the smallest artificial intelligence that we can make.”
             
“As soon as we increase the dimensionality of this proton, it will become very small.       
“Princeps, the sphere we see now is not the complete sophon. It’s only the projection of the sophon’s body into three-dimensional space. It is, in fact, a giant in four-space, and our world is like a thin, three-dimensional sheet of paper.
             
Yes. I can see the control center, everyone inside, and the organs inside everyone, even the organs inside your organs.
             
“A sophon observing three-space from six-space is akin to us looking at a picture on a two-dimensional plane. Of course it can see inside us.”      
“It’s not exactly going ‘through.’ Rather, it’s entering from a higher dimension. It can enter any enclosed space within our world. This is again similar to the relationship between us, existing in three-space, and a two-dimensional plane. We can easily enter any circle drawn on the plane by coming in from above. But no two-dimensional creature on the plane can do such a thing without breaking the circle.” 
“After the two sophons arrive on Earth, their first mission is to locate the high-energy particle accelerators used by humans for physics research and hide within them.
             
The study of the deep structure of matter is the foundation of the foundations of all other sciences. If there’s no progress here, everything else—I’ll put it your way—is bullshit.”

Wednesday, October 2, 2019

Frugal computing

Companies care about cheap computing.

Well, the first thing they care about is usability: Is it easy for average programmers to develop solutions using the framework?

And the second thing they care about is cheap computing: Does this break the bank? Can I do this cheaper?

Speed, efficiency, elegance, technical strength... These are often not of interest. People are OK with their analytics jobs returning results hours later. At the beginning of the Hadoop epidemic, we watched aghast as people used MapReduce to do analytics in a wasteful and slow manner.

I was recently thinking about this question: How can we trade off speed against the monetary cost of computing? If the constraint is that the user will not need the results for a couple of hours, but would like to get them within a day, what is the cheapest way to get this analytics job done in the cloud?

While distributed systems may be the answer to a lot of questions (such as providing fault-tolerance, low-latency access for geo-distributed deployment, scalability, etc.), they are not very advantageous for cheap computing. This is because distributed processing often comes with a large overhead for state synchronization, which takes a long time to amortize, as the "Scalability! But at what COST?" paper showed. With frugal computing, we should try to avoid the cost of state synchronization as much as possible. So work should be done on one machine if it is cheaper to do so and the generous time budget is not exceeded. Other machines should be involved only when it is cheaper to involve them, and when there is not a lot of state to synchronize.

Primitives for frugal computing

Memory is expensive but storage via local disk is not. And time is not pressing. So we can consider out-of-core execution, juggling between memory and disk.

Communication costs money. So batching communication, and trading off computation against communication when possible, would be useful for frugal computing. If something is computationally heavy, we can keep lookup tables on disk or in S3 (if still feasible monetarily).

We may then need schemes for data-naming (which may be more sophisticated than a simple key), so that a node can locate the result it needs in S3 instead of computing it itself. This can allow nodes to collaborate with other nodes in an asynchronous, offline, or delay-tolerant way. For this, maybe the Linda tuplespaces idea could be a good fit. We can have await-based synchronization. This may even allow a node or process to collaborate with itself. The node might switch to other threads when it is stuck waiting for a data item, and then pick that thread back up and continue execution when the awaited item becomes ready through the work of the other threads.
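Here is a rough sketch of the compute-or-fetch primitive I have in mind (the Store interface and all names are hypothetical): a result is addressed by a key derived from the operation and its inputs, so a node first checks the shared store and only computes --and publishes-- the result if nobody else has done the work yet.

```go
// Rough sketch of the "name the data, then compute or fetch" idea; the
// Store interface and all names here are hypothetical. A result is
// addressed by a key derived from the operation and its inputs, so a node
// can look for it in a shared store (local disk, S3, or a tuplespace)
// before spending compute on it, and publish it afterwards for others.
package frugal

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Store is a stand-in for local disk, S3, or a Linda-style tuplespace.
type Store interface {
	Get(key string) ([]byte, bool)
	Put(key string, val []byte)
}

// name derives a content-addressable key for an operation over named inputs.
func name(op string, inputs []string) string {
	h := sha256.New()
	fmt.Fprintf(h, "%s|%v", op, inputs)
	return hex.EncodeToString(h.Sum(nil))
}

// computeOrFetch returns the named result, computing it only if absent.
func computeOrFetch(s Store, op string, inputs []string, compute func() []byte) []byte {
	key := name(op, inputs)
	if val, ok := s.Get(key); ok {
		return val // some node (perhaps this one, earlier) already did the work
	}
	val := compute()
	s.Put(key, val) // publish for other nodes, tuplespace-style
	return val
}
```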

In frugal computing, we cannot afford to allocate extra resources for fault-tolerance, so we need to provide it in a way commensurate with the risk of a fault and the cost of restarting the computation from scratch. Snapshots that are saved for offline collaboration may be useful for building frugal fault-tolerance. A self-stabilizing approach can also be useful, because it provides forward error correction instead of costly roll-back error correction.

This is all raw stuff, and in the abstract. I wonder if it would be possible to start with an opensource data processing framework (such as Spark), and customize it to prioritize frugality over speed. How much work would that entail?

Thursday, September 26, 2019

Do leases buy us anything?

Consider the atomic storage problem, which is a simpler problem than consensus. The CAP theorem says that when there is a partition, you will need to sacrifice either strong consistency or high availability.

Using leases, it is possible to sacrifice availability for some time (until the lease expires), and then reconfigure the storage system (often by shrinking the quorum) to keep providing availability. Consistency is preserved throughout, and availability is sacrificed only during the lease expiration period. This is a good tradeoff. (I am going to bring up the question of whether this may help circumvent CAP at the end of the post.)

But is this power unique to leases? Is there a way to explain this in an alternate way using only failure detectors instead of leases? This is the question I want to explore here.

Many distributed systems implement leases with just countdown timers and without NTP timestamps. This works because, over the short term, the clock rates at different processes don't drift apart too much.
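As an illustration, here is a minimal sketch of a countdown-timer lease on the holder's side (my sketch, not any particular system's code). No timestamps are exchanged: the holder counts down locally from the moment it received the grant and gives the lease up a bit early to absorb clock-rate drift, while the grantor waits out the full duration before reconfiguring.

```go
// Minimal sketch of a countdown-timer lease on the holder's side (not any
// particular system's code). No NTP timestamps are exchanged: the holder
// counts down locally from the moment it received the grant and applies a
// safety margin so that it stops acting on the lease before the grantor,
// counting down on its own clock, considers the lease expired.
package lease

import "time"

type Lease struct {
	grantedAt time.Time     // local receipt time of the grant
	duration  time.Duration // lease length agreed with the grantor
	margin    time.Duration // guard against clock-rate drift
}

func NewLease(duration, margin time.Duration) *Lease {
	return &Lease{grantedAt: time.Now(), duration: duration, margin: margin}
}

// Valid reports whether the holder may still act on the lease (e.g., serve
// local reads). The holder gives up margin-early; the grantor waits the
// full duration before shrinking the quorum, so the two never overlap.
func (l *Lease) Valid() bool {
	return time.Since(l.grantedAt) < l.duration-l.margin
}

// Renew restarts the countdown after a successful renewal round-trip.
func (l *Lease) Renew() { l.grantedAt = time.Now() }
```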

So maybe we can simulate a lease expiration by a node suspecting itself as failed. If the other nodes know the timeout on the failure detector of this expired node, they can wait that time out and then start reconfiguration of the storage system. While a failure detector requires only a unilateral decision, this explanation requires that other nodes know the bound on the failure detector at the expired node. Let's see if we can do without that requirement.

For reconfiguration, two options are possible. One is decentralized reconfiguration implemented over the nodes themselves, and the other is by using a reconfiguration box (often implemented by Paxos) as the arbiter.

An example of the former is the "Dynamic atomic storage without consensus" (2011) setup. There, a majority is needed to pass a reconfiguration. But there, a majority is needed even for a consistent read. So we don't really need a failure detector at the partitioned node to drop itself out of the system: it won't be able to serve reads or writes by itself anyway.

An example of the latter is chain replication. Consider that the tail node is partitioned. The original chain cannot complete serving writes anymore, because the tail is partitioned. The tail, though, can still serve reads for some time to the clients that contact it directly.

Here the reconfiguration box can reconfigure the chain to remove the partitioned tail. To explain this with failure-detector terminology, let's say the reconfiguration box has its own failure detector. The failure detector suspects the tail, and the box passes a reconfiguration with a higher epoch and takes the tail off. But before reconfiguring the chain to remove the original tail, the reconfiguration box should make sure that the tail stops serving reads to clients. How can this be accomplished? The reconfiguration message will not reach the partitioned old tail. So the tail should know about the reconfiguration box's failure detector timeout duration. Without this knowledge, without tying the reconfiguration box's failure detector about the tail to the tail's failure detector about itself, we wouldn't know when it is safe to switch from the old configuration and start the new configuration. (The alternative is that the tail checks with the reconfiguration box for each operation, to confirm its status in the configuration. Even with this one, due to asymmetric message delay, the reconfiguration box may need to wait some duration before reconfiguring.)

Leases buy us time and optionality

Using leases, a node does not have to check its status in a configuration for each operation. Provided that the lease holds, the node's status in the configuration is unchanged. Using leases on the acceptors, a Paxos leader can serve reads locally, without checking with a quorum. And using leases, a tail in chain replication can serve reads without checking whether it is still the tail. This translates to efficiency, because checking your status in the configuration is not done for each operation, but rather batched and done once per lease renewal.

In terms of trading off some availability to get availability back, it seems like leases provide more information than a unilateral failure detector and can buy you consistency in the presence of a partitioned node. This comes with the loss of some availability, because reconfiguration for quorum shrinking needs to wait for the lease to expire.

Leases also provide an advantage for reconfiguration in the presence of partitions. Leases sacrifice availability to restore availability (while preserving safety) for the storage system. This functionality requires more than the unilateral decision taken by failure detectors; it requires bilateral information about the expiration of the timeouts.

MAD questions

1. Did we circumvent the CAP result?
CAP defines availability as "each request for read or write eventually receives a response." Even with leases and reconfiguration, there will be a request that does not receive a response: in my example, the old tail will not respond to read requests after the expiration of the lease. But since at that point the old tail is not part of the system anymore, why should that count against the availability of the system? The formulation of CAP is too strict, as it defines the system as the initial set of nodes in the system. That formulation prohibits any reconfiguration of the system, even when there are no partitions.

I think we need more refined versions of CAP. Its current formulation is too coarse-grained.

Tuesday, September 24, 2019

Some of my peculiarities

Daniel Lemire recently wrote about his kindergarten experience, how he couldn't tie his shoes and couldn't memorize numbers for counting. I had similar experiences. I was a smart kid (not just my mom's opinion :-), but I couldn't memorize how to count to 5 until first grade (which I started at 5 years old --which is another story). My mom was a primary school teacher, and she was worried that she couldn't teach me how to count. She invented some rhyming words for numbers to help me memorize them, but apparently that didn't work because I would say the rhyming words instead of the numbers.

Even up to third grade, I would occasionally put my shoes on the wrong feet. I couldn't learn how to tie my shoes properly till middle/high school, when I started doing a single-loop tie. I learned how to do a double-loop tie only after university. On the other hand, I had decent coordination. I played soccer and basketball fine.

I had a very hard time learning the days of the week and the months of the year (both in Turkish and later in English). It took me much, much longer than other kids to learn these. I am still not good with that stuff and occasionally mix up the months. For some holidays, I still don't know which month they fall in. And overall I don't have a good sense of time.

I am clumsy and awkward in a peculiar way. I accidentally set off the fire alarm in my rental apartment during my sabbatical.

I can also be dense about stuff I don't like to learn, such as the stock market, investing, etc. Once I label a topic as bad/boring/useless/tedious, it is as if I make it a point not to understand anything about it.

I procrastinate on some things for no good reason. In my twenties, I was very bad with paying bills, and got many late fee charges. This was a stupid form of procrastination. I would be stressed about these, but not do anything about them. (This wasn't depression. Luckily I have been able to avoid depression so far.) In my thirties, I learned how to be organized thanks to Emacs org-mode.

Being in academia, I was in the presence of people with similar peculiarities, and didn't stick out much. One of my lab mates at MIT kept procrastinating on paying the utility bills for his apartment, and his services got disconnected a couple of times. Then his mom started paying them instead. This guy was one of the smartest kids I knew, but he just couldn't pay his bills on time. Another person I know kept his car wash subscription for six months in a city he had moved out of, because he wouldn't make a simple phone call---even after I started pleading with him to make the call. (I also hate making phone calls.)

Ok, I am on a roll, let's keep going.

I can't touch peaches or apricots. Touching them makes me feel freezing cold. The mere mention of them used to make every hair on my arms stand up. I have gotten better at tolerating their mention and appearance. It is similar with touching chalk.

I have sensitive inner-ears. I get dizzy in an elevator and in a plane during takeoff. I cannot read anything in a car or bus.

When multiple conversations are going on at a dinner party, I cannot concentrate and understand what people say to me, because I can't tune out the other conversations. When there is a TV in a room, I can't function, even if the TV is on mute. Long flights are my nightmare: in every seat there is a TV playing, and I go into a crazed/tired state after a couple of hours.

I cannot function well in an open-office environment. I get depleted when I work in an open-office for a long duration. I love love love my private office.

I am not a very logical learner or a clear/organized thinker. I have a messy brain, and a messy learning/understanding style. I have much better intuition than rote/logical learning power. I can form relationships/analogies between different things quickly and effortlessly.

I am good at catching typos. It is as if typos on a paper jump from the page to my attention. And in general I have a good visual memory.

I think I am good at reading emotions, and I am a pretty (probably too) sensitive person when it comes to the emotions of others.

I think my peculiarities are just peculiarities, like the stereotypical absentminded/clumsy professor. But, maybe, I am somewhere on the spectrum. I am not too bothered by this, but it occasionally leads to problems. My wife thinks I am feigning ignorance about some of our house routines (like the whereabouts of things) and some of my behavior at social gatherings, even though I am genuinely confused and lost. Some of these, unfortunately, I cannot fix, even though I try hard. Probably some people find me peculiar, clumsy, or child-like, though I do a very good job of hiding this from people I interact with infrequently.

MAD questions

1. What are your peculiarities?
I expect that some of you share some of my peculiarities, and some of you have other quirks. I like people with quirks, as long as they are harmless and do not infringe on others.

2. Here is another good thing going for me, but probably this is a trick and not a peculiarity
I can stop my hiccups just by noticing that I have a hiccup and thinking it is silly to have a hiccup. This works reliably and within one second for me. I picked this trick up 5-6 years ago after I saw it mentioned on a subreddit. Ever since then I haven't had a hiccup session with more than two hiccups (at which point I notice the hiccup and use the trick). Let's not jinx this; this is one good thing going for me.

Sunday, September 22, 2019

Teaching Paxi

The Paxos family of protocols (which I refer to as Paxi) is immensely useful for building distributed systems due to their excellent fault-tolerance properties. Many cloud computing services and distributed databases employ Paxi for state machine replication (SMR). Paxi preserve the safety property of consensus (no two nodes commit different values for the same slot) even in the face of a fully asynchronous execution, crash faults, message losses, and network partitions. Paxi satisfy the liveness property of consensus (some value is eventually committed for the slot) when the system moves outside the realm of the coordinated attack and FLP impossibility results.

Paxi are perennially misunderstood and their sophistication underrated. While there has been a lot of work on Paxi, we have been able to explore only a fraction of the algorithm design space. Striking evidence of this arrived in 2016, when the flexible quorums breakthrough came after 30 years, anticipated by no one.

There is a need to unpack Paxi and explain them better to students and practitioners alike. Paxi provide a great opportunity for teaching the principles of distributed systems, and we should seize that opportunity.

Problems with teaching Paxi

Most coverage of Paxos in courses is dry and superficial: the Paxos protocol is described and the students memorize it. While the Paxos protocol looks simple, it has a lot of depth and subtlety. It is not possible to appreciate these and truly understand distributed consensus by just memorizing the Paxos protocol. To understand Paxos, you should understand not only how it works, but also why it works, what cornercases it prevents, and how else it could be realized.

Raft has been proposed as a simple explanation of consensus and Paxos. While many developers love the operationalized explanation style of Raft and the implementations that accompany it, tying the explanation to a constrained implementation is unnecessarily restrictive. The generality of the Paxos family of protocols is lost, and the context and principles of distributed consensus are not communicated satisfactorily.

We need better explanations of not just Paxi but also the context and derivation of these protocols, explaining why each action is needed and why this is a hard problem. However, explaining the protocol solely in a declarative way using derivation is also hard for students to follow, so some intuition should be provided as well. The students should also be given ample opportunity to get hands-on exercise and experience with the protocols, their implementation, and their integration into practical applications/systems.

Every year my favorite part of the distributed systems class is when I get to teach Paxos for two weeks. Every year, I am able to do a better job of it through gradual/gradient-descent improvement. But these days, I am planning an overhaul of how I teach Paxos, and I want to put some real effort behind realizing my ideal setup for teaching Paxi. Here is my explanation of this setup in terms of the content of the module and the supporting tools for it.

Course module on Paxi

The Paxi course module will provide context and teach the principles and derivation of the protocols.

To teach about the context, the module will cover questions such as: What makes the distributed consensus problem so hard? How has our understanding of distributed consensus changed after the "attacking generals" and FLP impossibility results? What are the cornercases that haunt correctness and liveness? Covering these will help the students appreciate the depth and ingenuity of Paxi.

To teach about the derivation and principles of the protocol, the module will employ a stepwise refinement from the high-level consensus specification to an intermediate round-based abstraction (for which Heidi's generalized consensus framework is a good candidate), and then to the Paxos algorithm. The module will explore both leader-based (as in Paxi) and non-leader-based (as in Ben-Or and Texel) refinements of this round-based intermediate specification, and will discuss the advantages and disadvantages of each approach.

The module will also relate distributed consensus to decentralized protocols in the blockchain domain. By showing a Texel-to-Avalanche transition, the module will tie consensus (where Texel shows an alternative to leader-based consensus as in Paxos) to blockchains (where Avalanche shows how to scale and operationalize Texel for large-scale decentralized environments).

Supporting tools

To allow the students to experiment with the protocols at the design level, we will provide TLA+/Pluscal modeling of Paxos and variants. With these, the students will be able to model-check Paxi protocols and experiment with modifications to see which safety and progress properties are satisfied under different environments.

To enable the students to experiment at the implementation level, we will use our Paxi framework implemented in Go (available as opensource at https://github.com/ailidani/paxi). Our Go Paxi framework provides a leveled playground for protocol evaluation and comparison. The protocols are implemented using common building blocks for networking, message handling, quorums, etc., and the developer only needs to fill in two modules describing the distributed coordination protocol. Paxi includes benchmarking support to evaluate the protocols in terms of their performance, availability, and scalability, as well as a linearizability checker to check the protocols for consistency. We have a dozen Paxos variants already implemented in Paxi. We will invite students to implement more, especially Byzantine versions of Paxos protocols, and to consider tie-ins to permissioned blockchain protocols. In order to link the high-level design to the implementation, we will provide a mapping from the TLA+ model to the Paxi implementation of distributed consensus.
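To give a flavor of this "fill in the protocol, reuse the rest" style, here is an illustrative interface of my own. It is not the actual Paxi API (see the repo above for that); it is just a sketch of the division of labor between the student-written protocol and the framework-provided machinery.

```go
// Illustrative sketch only: a made-up interface in the spirit of "fill in
// the coordination protocol, reuse the networking/quorum/benchmark
// machinery". This is NOT the actual Paxi API; see github.com/ailidani/paxi
// for the real one.
package sketch

// Message is whatever the protocol exchanges between replicas.
type Message interface{}

// Replica is the protocol-specific part a student would implement,
// e.g., the Paxos phase-1/phase-2 logic.
type Replica interface {
	// HandleRequest is called when a client command arrives at this node.
	HandleRequest(cmd []byte)
	// HandleMessage is called when a protocol message arrives from a peer.
	HandleMessage(from int, m Message)
}

// Framework is the reusable part the replica implementation calls into.
type Framework interface {
	Broadcast(m Message)
	Send(to int, m Message)
	Execute(cmd []byte) // apply a committed command to the replicated state machine
}
```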

In order to showcase the integration of Paxi protocols into distributed systems applications, we will use the globally distributed database FleetDB (https://github.com/acharapko/fleetdb) as the hands-on application. We will first extend FleetDB to be instantiable with different Paxi protocols as plugins, and make the Paxi protocols exchangeable based on workload and environment. FleetDB can lead the way and help distributed databases integrate Paxi protocols into their replication operation. Currently only a handful of databases (including Spanner and CockroachDB) use Paxos-family protocols as part of their replication. Although Paxos provides excellent fault-tolerance properties and prevents any loss of consistency, the vanilla Paxos protocol is not a good fit for WAN deployments, and performs poorly under certain topologies and workloads.

A couple of other tools are worth mentioning. One is the DSLabs tool from UW for model checking distributed systems project implementations. Another is the DistAlgo tool from SUNY Stony Brook.

MAD questions

1. What is your favorite teaching technique for Paxi?
While teaching Paxos, I bring five students to the board to perform live reenactments of the Paxos consensus algorithm (each one simulating a Paxos node) under several fault scenarios. This is the most fun and most beneficial part of my distributed systems course. I do this twice, in two different classes. Through these exercises the students learn how the protocol works and see how it deals with the cornercases.

Another favorite moment for me is to watch the students think they understand the protocol, and then forget about it and get confused again. This is inevitable, and a big part of how learning works. Learning needs self-evaluation and self-correction. Without doing the work yourself, you can't truly learn anything. Easy come, easy go. I also like watching the students learn how single-instance Paxos is applied in the MultiPaxos protocol, and then learn about some Paxos variants. The students then realize that what they learned was only the tip of the iceberg.

2. Are there other algorithms/systems you suggest can serve as capstone projects in a distributed systems class?

Thursday, September 19, 2019

Avalanche is a descendant of Texel

When I first skimmed the Texel paper (which was released on August 28, 2019), I could see parallels between Texel and Avalanche, and noted those in the third paragraph of my review. But I had missed the footnote in the Texel paper which said that Texel was originally written in 2010. I thought Texel was new, and I speculated that it might be a more deterministic version of Avalanche applied to crash-tolerant distributed consensus. After writing two more blog posts on modeling Texel in TLA+ and understanding it better, I now think Texel formed a basis from which Avalanche descended.

Texel provided an asynchronous and truly leaderless solution to consensus. Instead of appointing a leader to bring nodes to consensus (as in Paxos), Texel shows how each node can make up its own mind and still achieve consensus in an asynchronous system. By adopting a leaderless solution to asynchronous consensus, Texel avoids the disadvantages of solutions that appoint a leader. In a leader-based solution, what if the leader fails? In order to avoid getting stuck forever, the nodes should use a failure detector to suspect whether the leader is unavailable. Failure detectors are a liability in large-scale systems with a lot of turnover. Another big drawback with having a leader is that, for large-scale systems, the leader becomes a performance bottleneck.

Avalanche operationalized Texel's leaderless approach for large-scale decentralized consensus. It extended Texel's leaderless consensus approach in terms of scalability and quick finality (by using sampling and metastability), and applied the resultant decentralized algorithm in the blockchain domain.

But Texel did not consider Byzantine faults

Avalanche considers Byzantine faults, which Texel did not. The question is, what can a Byzantine node do in blockchains? Answer: it can try to perform double-spending. That translates to the node proposing two different transactions with the same UTXO for itself (the transactions need to be signed by the private key of the initiator).

The safety (i.e., agreement) property of the Texel approach says that no two nodes in the system decide differently for the same consensus instance. Translated to Avalanche terms, this means that no two correct nodes will decide on two different transactions with the same UTXO. And this rules out double-spending by a Byzantine initiator. Even when other Byzantine nodes in the system try to conspire with the Byzantine initiator and push some correct nodes to adopt different supporting values, because the threshold for adopting a supporting value is higher than the number of possible Byzantine nodes, the Texel approach and its adaptation in Avalanche avoid this problem.

Avalanche also does a very clever judo move on Texel's liveness problem and turns it into a feature. In my reviews of Texel, I mentioned that liveness (termination of consensus) is a problem for Texel. Avalanche adopts a similar approach to supporting value selection, and runs into a liveness problem when two different values are competing to be decided for the same consensus instance. In the blockchain domain, this corresponds to a Byzantine node pushing two different transactions with the same UTXO. And in this case the liveness violation is a feature, not a bug. Since correct clients follow the protocol as prescribed (and avoid double-spending), they are guaranteed both safety and liveness. In contrast, the protocol does not guarantee liveness (but still guarantees safety) for double-spending transactions submitted by Byzantine clients, which conflict with one another. As the Avalanche paper says, "such decisions may stall in the network, but have no safety impact on virtuous transactions."

What about the Sybil nodes problem? Avalanche deals with the Sybil problem using a PoS solution. It could adopt a PoW solution as well, because dealing with Sybil nodes is an orthogonal problem to solving consensus.

Scaling Texel

In Texel, to adopt a supporting value, you need to read it from more than F nodes. In the decentralized consensus setting Avalanche considers, N and F can be huge --thousands of nodes. So to find a supporting value, a node in Avalanche samples a bunch of nodes, a set much smaller than F. But the random sampling still enables tolerating the F faulty nodes: since F is a fraction of N, the faulty nodes cannot have too much effect on the sampling-based selection of a supporting value in Avalanche.
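To make the sampling idea concrete, here is a toy sketch of a repeated-sampling loop in the spirit of Avalanche's Slush/Snowball mechanism (my simplification; the parameter values and function names are made up). Each round queries a small random sample; a supermajority within the sample nudges the node's preference and confidence, and repeated rounds tip the whole network into agreement without any node contacting N-F peers.

```go
// Toy sketch of repeated-sampling consensus in the spirit of Avalanche's
// Slush/Snowball loop (my simplification; parameters and names are made up).
// Each round the node queries a small random sample of peers; if a
// supermajority of the sample prefers one value, the node adopts it and
// bumps its confidence. Since F is only a fraction of N, the faulty nodes
// rarely dominate a sample, and honest nodes drift toward the same value.
package sketch

import "math/rand"

const (
	k     = 20 // sample size per round (much smaller than F); assumes len(peers) >= k
	alpha = 15 // supermajority threshold within a sample
	beta  = 10 // consecutive supporting samples needed to finalize
)

// queryPreference asks peer p for its currently preferred value (stub).
func queryPreference(p int) int { return 0 }

// decide runs the sampling loop until the node's preference has been
// supported by beta consecutive sample supermajorities, then returns it.
func decide(peers []int, pref int) int {
	consecutive := 0
	for consecutive < beta {
		counts := map[int]int{}
		for _, i := range rand.Perm(len(peers))[:k] { // pick k random peers
			counts[queryPreference(peers[i])]++
		}
		supported := false
		for v, c := range counts {
			if c >= alpha { // a sample supermajority for value v
				if v == pref {
					consecutive++
				} else {
					pref, consecutive = v, 1 // switch preference, restart count
				}
				supported = true
				break
			}
		}
		if !supported {
			consecutive = 0 // inconclusive sample; keep current preference
		}
	}
	return pref
}
```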

How does Avalanche deal with the experiment-cancellation problem in Texel? Again, sampling and the metastability concept help here. Having a large-scale system becomes an advantage, because the risk of overlapping experiments reading from inconsistent cuts of each other (and being affected by it) diminishes. This way Avalanche avoids the agreement violation that inconsistent snapshot reads can cause when concurrent, overlapping experiments are not canceled.

Avalanche also applies the metastability concept to make the consensus finalization decision faster, without the need to contact N-F nodes.

Closing the loop

I will assign Texel as part of a TLA+ modeling project in my distributed systems class. The class topics are as follows:
  1. Introduction, 2 phase-commit
  2. Reasoning about distributed programs, safety/progress
  3. Consensus, Paxos
  4. Failure detectors, Faults and fault-tolerance
  5. Time: logical clocks, State: distributed snapshots
  6. Datacenter computing, Cloud computing
  7. NoSQL databases, CAP theorem, Distributed databases
  8. Big data, Big data analytics
  9. Decentralized ledgers and blockchains
The course starts with consensus and ends with blockchains. Showing the Texel-to-Avalanche transition is a good way to tie consensus (where Texel shows an alternative to leader-based consensus as in Paxos) to blockchains (where Avalanche shows how to scale and operationalize Texel for large-scale decentralized environments).

MAD questions

1. Could it be that Avalanche and Texel are unrelated?
No. Absolutely not!

Tobler's first law of geography says "everything is related to everything else, but near things are more related than distant things."

"Everything is related", so Avalanche and Texel are related :-)

"Near things are more related than distant things". Since both Texel an Avalanche have strong Cornel links, they are even more related.

Banter aside, I noticed that Texel in turn has several parallels to Ben-Or. Nothing comes out of the void. So you can also argue that Avalanche is a descendant of Ben-Or as well. But, as the law says, "everything is related", so I am still in the right. Here are the similarities I see between Texel and Ben-Or.
  1. Texel does not use rounds, but consistent cuts. Ben-Or uses rounds, but the rounds in Ben-Or are leaderless rounds: they are non-client-restricted, in contrast to the client-restricted rounds in leader-based solutions.
  2. Similar to a node experimenting in Texel to find a supporting value by talking to F+1 nodes, in Ben-Or a node goes through the first phase of a round to identify a supporting value by talking to F+1 nodes.
  3. Similar to the node finalizing a decision by finding it at N-F nodes in Texel, in Ben-Or a node finalizes its decision by finding it at N-F nodes.
As one difference, upon failure to reach a decision, Ben-Or makes some nodes change their vote before the next experimenting phase. This jolts/tilts the system toward a decision, so that the system eventually (probabilistically) converges to consensus.

Tuesday, September 17, 2019

Modeling a read-write version of Texel: an asynchronous consensus algorithm without rounds

In the previous post I gave a model of atomic Texel, where a node can atomically read all other nodes' decisions and update its own decision. Here is a refined version, where a node can atomically read the state of *one* other node and update its decision. This refined model shows why it is important for nodes to read from consistent cuts, how multiple concurrently experimenting nodes can violate this requirement, and how the Agreement property is violated as a result.

The model

This model builds on and extends the previous model. N stands for the number of nodes, and F denotes the number of nodes that can crash. We use f to keep track of the actual number of nodes that crash. In addition to the *decision* array that tracks the decision of each node, we now have an *exp* array that denotes the experimentation status of each node. Initially each node is in the experimenting state.

Each node starts with t=FALSE (the decision is not finalized), pollSet=Procs \{self} (the node can poll all nodes except itself), and tally=<<0,0>> (the number of votes from the polled nodes for "a" and "b" is initially 0 and 0).
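In plain TLA+ terms, the initial state looks roughly like this. This is a sketch using the variable names from the prose; the actual model is in PlusCal, and it hardwires the initial decision array, whereas here it is left nondeterministic.

  Init ==
    /\ decision \in [Procs -> {"a", "b"}]       \* each node starts with an initial value
    /\ exp = [p \in Procs |-> TRUE]             \* every node starts out experimenting
    /\ t = [p \in Procs |-> FALSE]              \* no decision is finalized yet
    /\ pollSet = [p \in Procs |-> Procs \ {p}]  \* a node can poll everyone but itself
    /\ tally = [p \in Procs |-> <<0, 0>>]       \* votes seen so far for "a" and "b"
    /\ f = 0                                    \* no node has crashed yet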


Each node has three actions that it can choose from and execute as long as the node has not finalized its decision or crashed.

The first action starts in Line 21. This action is enabled if the node is in the experimenting state. It picks (and removes) a node k from its pollSet. If k's decision is "a", it increases the tally for "a", and if k's decision is "b", it increases the tally for "b". After this, if either of these tallies constitutes a supporting decision, i.e., is greater than F (which means it is a majority of the N-F nodes), then the node adopts that value as its own decision.

Line 32 starts the second action. If f, the actual number of crashes, is still less than F, the allowed number of crashes, then a process can crash by setting its decision to "crash" permanently.

Line 36 starts the third action. If a node finds that its current decision is shared by at least N-F processes, then that decision is "anchored", and the node can finalize its decision by setting its t=TRUE. If no such anchor is in place, and the node is not in the experimenting state, the node switches back to the experimenting state (resetting pollSet and tally). By experimenting again, the node can potentially change its decision to another supporting decision, which may lead to progress and finalization of consensus.
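A rough TLA+ rendering of this third action (illustrative names, not the PlusCal code in the figure; UNCHANGED conjuncts are omitted for brevity):

  \* A value is anchored when at least N-F nodes currently hold it
  Anchored(v) == Cardinality({q \in Procs : decision[q] = v}) >= N - F

  FinalizeOrRetry(p) ==
    /\ ~t[p] /\ decision[p] \in {"a", "b"}
    /\ IF Anchored(decision[p])
       THEN t' = [t EXCEPT ![p] = TRUE]                      \* finalize the decision
       ELSE /\ ~exp[p]                                       \* otherwise, if not experimenting,
            /\ exp' = [exp EXCEPT ![p] = TRUE]               \* start a fresh experiment
            /\ pollSet' = [pollSet EXCEPT ![p] = Procs \ {p}]
            /\ tally' = [tally EXCEPT ![p] = <<0, 0>>]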

Safety violation

When I model check this protocol for N=4, F=1, the model violates the Agreement property. Two nodes can finalize their decisions with different values, because they experiment concurrently, and one of them reads from an inconsistent cut. In the trace below, node 2 builds its supporting decision on an inconsistent snapshot involving node 1, which changes its state after being read by node 2.

Here are the steps to the violation of Agreement. Initially the decision array of nodes is "a","b","b","a".
  1. Node 1 reads from node 2 the value "b".
  2. Node 2 reads from node 1 the value "a". (Note that two nodes are concurrently experimenting and reading state from each other which will get inconsistent soon.)
  3. Node 1 reads from node 3 the value "b", and since tally for "b" is more than F=1, node 1 changes its decision to "b", and concludes its experimentation.
  4. Node 1 finalizes its decision of "b", because it sees an anchored quorum (cardinality >= N-F) for "b".
  5. Node 2 reads from node 4 the value "a", and since tally for "a" is more than F=1 (including the now-invalid vote from node 1), node 2 changes its decision to "a", and concludes its experimentation.
  6. Node 3 reads from node 2 the value "a".
  7. Node 3 reads node 4 the value "a", and since tally for "a" is more than F=1, node 3 changes its decision to "a", and concludes its experimentation.
  8. Node 2 finalizes its decision of "a", because it sees an anchored quorum (cardinality >= N-F) for "a". This decision violates Agreement, because Node 1 has finalized its decision to "b", and we have conflicting decisions.

To fix the safety violation, we should disallow concurrent experimentation when it may lead to reading from inconsistent snapshots. This is possible by making the reads preemptive/destructive. (If, instead of using preemptive reads, we constrained nodes to read only from non-experimenting nodes, deadlock would result.) In the above trace, when node 2 reads from node 1, this should have halted node 1's already ongoing experiment. This is easy to achieve by extending/modifying the model above, and when I fixed the problem, I found that safety is always satisfied for N>=3*F+1. (I don't provide that model, because I am considering assigning the modeling of Texel and Ben-Or as a TLA+ project in my distributed systems class.)
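A sketch of what such a preemptive read could look like (this is just an illustration of the idea, not the withheld model; UNCHANGED conjuncts omitted):

  PreemptiveRead(p, k) ==
    /\ exp[p] /\ ~t[p] /\ k \in pollSet[p]
    /\ exp' = [exp EXCEPT ![k] = FALSE]              \* cancel k's in-progress experiment
    /\ pollSet' = [pollSet EXCEPT ![p] = @ \ {k}]    \* p has now polled k
    /\ tally' = [tally EXCEPT
                   ![p] = IF decision[k] = "a" THEN <<@[1] + 1, @[2]>>
                          ELSE IF decision[k] = "b" THEN <<@[1], @[2] + 1>>
                          ELSE @,
                   ![k] = <<0, 0>>]                  \* ...and k's partial tally is wiped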

Liveness violation

Liveness is also an interesting story. Even with F=0 and a starting state of <<"a","b","b","b">>, we can have a liveness violation. With F=0, reading from one node is enough to change your vote. So the value "a" may keep circulating in the system, since it can keep getting adopted by another minority of processes. The system may never anchor the majority value as the consensus value, and as a result cannot finalize a decision. Side note: when you appoint a leader for consensus (as in Paxos), this vote looping does not become an issue, because the leader breaks the symmetry by dictating the value it picks (or a suitable value) to the other nodes for adoption.

In that same setup (with <a,b,b,b>), if I make F=1, liveness is satisfied, because no node will copy "a": it would need to see another node with "a" before passing the threshold. So, in this case, increasing F did help liveness. This suggests that maybe we should introduce another free parameter to serve as the threshold for value adoption, rather than tying it strictly to F, the potential number of faults.
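In TLA+ terms this decoupling is a one-line change (Threshold is a hypothetical new constant; setting Threshold = F + 1 recovers the original rule):

  CONSTANT Threshold   \* adoption threshold, decoupled from F

  \* Value v is supported by the polled set S if at least Threshold nodes in S hold it
  Supported(S, v) == Cardinality({q \in S : decision[q] = v}) >= Threshold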

By restricting the problem to binary consensus (with only two values) and with proper selection of the adoption threshold, Texel may solve the problem of a minority value circulating in the system forever and breaking progress. But even then, we have another progress-violation problem. Once we introduce experiment cancellation to satisfy the Agreement property, nodes that keep interfering with and canceling each other's experiments will violate progress.

This is another place where having a leader for consensus provides an advantage. The leader by definition reads from a consistent state, because for that round the other nodes are passive. When each node polls for itself, coordinating these distributed transactions to read from clean, consistent states becomes very difficult (and maybe requires consensus itself).

MAD questions

1. What are the advantages and disadvantages of appointing a leader for solving consensus?
Picking up the thread of the previous discussion comparing Texel and Paxos, here are the pros and cons of appointing a leader node for solving consensus.

There may be symmetry, and no clear winner, when multiple initial values are present in the system. Using a leader breaks the symmetry, because nodes go with whatever the leader proposes as the vote to decide on. So with a leader, you can solve more than just binary consensus. Even with binary consensus, as we have seen in Texel, liveness can still be jeopardized by experiment cancellation. And in Ben-Or, liveness is facilitated by jolting the system with random changes of some values, so that the system eventually, probabilistically, converges to a consensus. In short, using a leader boosts liveness in the presence of multiple initial values. (Errr... when things go right. See below.)

On the other hand, trusting a leader to finish a round introduces a problem. What if the leader is dead? (The 2PC blocking problem!) In order to avoid getting stuck forever, nodes should use a failure detector. Then, upon suspicion of the leader's death, any node can start a new round to lead the rest of the nodes. But what if the suspicion is wrong? The FLP impossibility result strikes again! Fortunately, there is a way to circumvent the impossibility result by postponing liveness while still preserving safety. For example, Paxos preserves safety even with multiple leaders concurrently trying to lead rounds.

Another drawback of having a leader is that, if N is large, the leader becomes a performance bottleneck in deterministic, instant-decision consensus protocols like Paxos.

Sunday, September 15, 2019

Modeling an atomic version of Texel: an asynchronous consensus algorithm without rounds

I had written about my preliminary impressions of Texel, an asynchronous consensus algorithm without rounds.

Over the last couple of nights, I have modeled a shared memory version of the algorithm in TLA+, so that I can understand the algorithm better and learn more about the approach. I started with a very rough atomic version of the algorithm, where a node can atomically read the state of all other nodes and update its own state. This is not practical, but it is good for highlighting the essence of the Texel algorithm. In this post, I will talk about this atomic Texel model.

After I got this model down, it was easy for me to refine the atomicity. In the refined model, a process can atomically read from one other process and update its state. That refined model is just one step removed from the message-passing Texel algorithm presented in the paper, and it demonstrates the tricky issues that arise when multiple nodes concurrently try to update their states. In that read-write atomicity model, we see the need for reading states from a consistent cut, and why some concurrent experiments must be aborted to satisfy that condition. But that read-write atomicity Texel specification is the topic of my next post. Today we focus on the atomic Texel model.

The atomic Texel model

N stands for the number of nodes, and F denotes the number of nodes that can crash. At model checking time, the TLA+ toolkit asks you to enter values for N and F. I tried N=4 with F=0 and F=1. I also tried N=7 with F=0, 1, and 2. My Texel model satisfies both agreement and progress for N>=3F+1. Progress is always satisfied because CHOOSE is deterministic: a node will choose "a" when both "a" and "b" meet the SupDecision criteria. A supporting value (what SupDecision returns) is a value that a node can adopt based on its querying of other nodes, namely one that is shared by at least F+1 nodes. Note that, for N>=3F+1, it is always possible to find a supporting value for binary consensus, even when up to F nodes fail.
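Here is a sketch of how SupDecision can be written (the exact names and form are mine; the essential point is that CHOOSE is deterministic):

  \* The values that at least F+1 of the other nodes currently hold
  SupValues(p) == {v \in {"a", "b"} :
                     Cardinality({q \in Procs \ {p} : decision[q] = v}) >= F + 1}

  \* CHOOSE is deterministic: for a given set, TLC always picks the same element.
  \* The action using SupDecision must be guarded by SupValues(p) # {}.
  SupDecision(p) == CHOOSE v \in SupValues(p) : TRUE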

I use f to keep track of the actual number of nodes that crash. The model ensures that f <= F. The variable decision tracks the decision of each node. I hardwire its initial value for N=4 in my run; when I try N=7, I change this initial assignment.


In my specification, each node has three actions.

Line 23 gives the first action. A node reads the states of other nodes to find a supporting value and adopts it as its own decision value. But the decision is not finalized until the node sets its finality flag t to TRUE.

Line 24 starts the second action. If f, the actual number of crashes, is still less than F, the allowed number of crashes, then the node can crash by setting its decision to "crash" permanently.

Line 28 starts the third action. If a node finds that its current decision is shared by at least N-F processes, then that decision is "anchored", and the node can finalize its decision by setting its t=TRUE.



Here are the Agreement and Progress properties I check. Agreement says that if two nodes j and k finalized their decisions, then their decisions cannot differ. Progress says that eventually any non-crashed node will finalize its decision.
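In TLA+, the two properties read roughly as follows (a sketch matching the prose; the model's own definitions may differ in naming):

  \* If two nodes have both finalized, they cannot have decided differently
  Agreement ==
    \A j, k \in Procs :
      (t[j] /\ t[k]) => decision[j] = decision[k]

  \* Eventually every node has either crashed or finalized its decision
  Progress ==
    <>(\A p \in Procs : decision[p] = "crash" \/ t[p])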

For N>=3F+1, both Agreement and Progress are satisfied. Since the atomicity is coarse (a node can read the states of all other nodes atomically), Agreement holds without installing extra mechanisms for reading states from a consistent cut and aborting other nodes' concurrent experiments, because each experiment is done atomically and hence in an interleaved manner. Progress holds because the CHOOSE in SupDecision is deterministic, and it helps the nodes converge to one of the binary consensus values.

This is a very simple model, but it helped me come up with the refined read-write atomicity Texel specification quickly. In that refined model it becomes easy to see what can go wrong when we don't have additional mechanisms in place to ensure that nodes read from consistent states.

MAD questions

1. How does Texel compare with Paxos and Ben-Or and how does failure-detectors fit in this picture?

In Paxos, there are rounds, and the rounds are client-restricted. This means that a higher round preempts the lower rounds. A leader leads the rest of the nodes through a round, which means the nodes are held hostage by a leader that may be dead. Hence, failure detectors need to be utilized so that the nodes do not wait forever for a dead leader in an asynchronous model. However, if the failure detectors at the nodes are trigger-happy, the nodes will suspect whoever is currently the leader for no reason and start their own rounds, which preempt the leader's round. This leads to the dueling leaders problem and a violation of liveness, even when we have a bound on F (i.e., F < N/2).

In Texel, there is no need for a failure detector if we have a bound on F (i.e., F<N/3). This is because Texel is a decentralized consensus algorithm, and the nodes do not need to rely on or wait for a leader to lead a round; instead all nodes do their own polling and deciding. But as we will discuss in the read-write Texel model (wait for the next post), if the nodes are very snoopy and keep interfering with each other's experiments, then liveness is still violated. This is where having a leader to lead a round (as in Paxos) provides an advantage: the leader by definition reads from a consistent state, because for that round the other nodes are passive.

What if we had non-client-restricted rounds as in Fast Paxos? That is an opportunistic-leader-based solution, and progress is not guaranteed if multiple opportunistic leaders clash. Then we need to default to Paxos... which is subject to the failure-detector issue above for progress! Back to square one.

In the Ben-Or algorithm, there is no need for failure detectors if we have a bound on F (i.e., F<N/2), because that is also a decentralized algorithm. Ben-Or has rounds, but the rounds are not client-restricted and do not preempt each other. Also, a node does not seem to interfere with or cancel other nodes' progress in querying/experimenting. So Ben-Or does not have the disadvantages of Paxos or Texel. What gives? Ben-Or is a probabilistic algorithm: by using randomization, the system eventually and probabilistically converges to a consensus decision.

(While writing the read-write model of Texel algorithm, I found several parallels between Ben-Or and Texel. Those will also be interesting to investigate more closely.)

Friday, September 13, 2019

Paper review. Gray Failure: The Achilles' Heel of Cloud-Scale Systems

This paper (by Peng Huang, Chuanxiong Guo, Lidong Zhou, Jacob R. Lorch, Yingnong Dang, Murali Chintalapati, and Randolph Yao) appeared in HotOS 2017. The paper is an easy read at 5 pages, and considers the fault-tolerance problem of cloud-scale systems.

Although the cloud provides redundancy for masking and replacing failed components, this is useful only if those failures can be detected. But some partial and subtle failures remain undetected, and these "gray failures" lead to major availability breakdowns and performance anomalies in cloud environments. Examples of gray failures are performance degradation, random packet loss, flaky I/O, memory thrashing/leaking, capacity pressure, and non-fatal exceptions.



The paper identifies a key feature of gray failure as differential observability. Consider the setup in Figure 2. Within a system, an observer gathers information about whether the system is failing or not. Based on the observations, a reactor takes actions to recover the system. The observer and reactor are considered part of the system. A system is defined to experience gray failure when at least one app observes that the system is unhealthy, but the observer observes that the system is healthy.
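Stripped down, the definition boils down to a simple predicate. Here is a toy TLA+ formalization of it (Apps, appObs, and sysObs are made-up names, not from the paper):

  CONSTANT Apps                 \* the applications using the system
  VARIABLES appObs, sysObs      \* per-app observations and the system observer's verdict

  \* Gray failure: some app sees the system as failing while the observer does not
  GrayFailure ==
    /\ \E a \in Apps : appObs[a] = "unhealthy"
    /\ sysObs = "healthy"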

This setup may seem appealing, but I take issue with it. Before buying into this setup, did we consider a simpler and more general explanation? How about this one: gray failures occur due to a gap/omission in the specification of the system. If we had formally defined the "healthy" behaviors we want from the system, then we would be able to notice that our detectors/observers cannot detect the "unhealthy" behaviors sufficiently. And we would look into strengthening the observers with more checks or by adding some end-to-end observers to detect the unhealthy behaviors. In other words, if we had cared about those unhealthy behaviors, we should have specified them for detection and developed observers for them.

The paper states that recognizing differential observability as the cause of gray failures implies that, "to best deal with them, we should focus on bridging the gap between different components' perceptions of what constitutes failure."
Even for cases where the underlying problem is simply that the observer is doing a poor job of detecting failures (so the gray failures are extrinsic and could be avoided by fixing the observer), such distributed observation can also be helpful. Where to conduct such aggregation and inference is an interesting design question to explore. If it is done too close to the core of a system, it may limit what can be observed. If it is near the apps, the built-in system fault-tolerance mechanisms that try to mask faults may cause differential observability to be exposed too late. We envision an independent plane that is outside the boundaries of the core system but nevertheless connected to the observer or reactor.
But this is a tall order. And it doesn't narrow the problem domain or contribute to the solution. This is the classic instrumentation and logging dilemma: what should we log to be able to properly debug a distributed system? The answer depends on the system, what you care about in the system, and what you define as healthy behavior for it.

I think for dealing with gray failures, one guiding principle should be this: don't ever mask a fault silently. If you mask a fault, do complain about it (raise exceptions and log it somewhere). If you mask failures without pointing out the problems, you are in for an unexpected breakdown sooner or later. (This is also the case in human relationships.) Communication is key. Without complaining and giving feedback (backpressure), the system may be subject to a boiling-frog problem, as the baselines gradually slip and degrade.

Cloud anomalies with gray failures

In Section 2, the paper gives examples of cloud anomalies with gray failures, but I have problems with some of these examples.
The Azure IaaS service provides VMs to its customers using highly complex subsystems including compute, storage, and network. In particular, VMs run in compute clusters but their virtual disks lie in storage clusters accessed over the network. Even though these subsystems are designed to be fault-tolerant, parts of them occasionally fail. So, occasionally, a storage or network issue makes a VM unable to access its virtual disk, and thus causes the VM to crash. If no failure detector detects the underlying problem with the storage or network, the compute-cluster failure detector may incorrectly attribute the failure to the compute stack in the VM. For this reason, such gray failure is challenging to diagnose and respond to. Indeed, we have encountered cases where teams responsible for different subsystems blame each other for the incidents since no one has clear evidence of the true cause.
Isn't this a classic example of under-specification of healthy behavior? The VM should have specified its expectation (its rely condition) on the storage for its correct execution, and accordingly included a detector for when that expectation is violated.


I also have a problem with the "high redundancy hurts" example.
As a consequence, we sometimes see cases where increasing redundancy actually lowers availability. For example, consider the following common workload pattern: to process a request, a frontend server must fan out requests to many back-end servers and wait for almost all of them to respond. If there are n core switches, the probability that a certain core switch is traversed by a request is $1-(\frac{n-1}{n})^m$, where m is the fan-out factor. This probability rapidly approaches 100% as m becomes large, meaning each such request has a high probability of involving every core switch. Thus a gray failure at any core switch will delay nearly every front-end request. Consequently, increasing redundancy can counter-intuitively hurt availability because the more core switches there are, the more likely at least one of them will experience a gray failure. This is a classic case where considering gray failure forces us to re-evaluate the common wisdom of how to build highly available systems.
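To see how quickly this probability approaches 1, plug in some illustrative numbers (mine, not the paper's): with n = 10 core switches and a fan-out of m = 100, the probability that a given core switch is involved in a request is $1-(9/10)^{100} \approx 0.99997$, so essentially every request touches every core switch.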

I think this is not a redundancy problem; this is the "you hit corner cases faster in big deployments" problem. This setup is in fact an abuse of distribution, as it is the exact opposite of providing redundancy. It distributes conjunctively, making success depend on the success of every component, rather than disjunctively, where success depends on the success of only some components.

MAD questions

1. Is this related to the robust-yet-fragile concept?
This notion of masked latent faults later causing big disruptions reminds me of the robust-yet-fragile systems concept. Robust-yet-fragile is about highly optimized tolerance: if you optimize your tolerance only for crash failures but not for partial/gray failures, you will be very disappointed when you are faced with this unanticipated fault type.
A good example here is glass. Glass (think of automobile glass or Gorilla Glass) is actually a very tough/robust material. You can throw pebbles, and even bigger rocks, at it, and it won't break or scratch, well, up to a point that is. The glass is very robust to the anticipated stressors up to a point. But exceed that point, and the glass is in shambles. An unanticipated stressor (a black swan event in Taleb's jargon) for the glass is a ninja stone. The ninja stone is basically a piece of ceramic taken from a spark plug, and it is denser than glass. So if you gently throw this very little piece of ceramic at your car window, the window shatters.
This is called a robust-yet-fragile structure, and this is actually why we had the Titanic disaster. Titanic, the ship, had very robust panels, but again up to a point. When Titanic exceeded that point a little bit (with the iceberg hitting it), the panels broke into shambles, very much like the glass meeting the ninja stone. Modern ships after Titanic went for resilient, instead of robust-yet-fragile, panels. The resilient panels bend more easily, but they don't break as miserably; they still hold together in the face of an extreme stressor. Think of plastic: it is less robust but more resilient than glass.
The robust-yet-fragile effect is also known as highly optimized tolerance. If you optimize tolerance for one anticipated stressor, you become very vulnerable to another unanticipated fault. (Much like the closed Australian ecosystem.)

2. Is fuzzy logic applicable here?
It seems like, instead of binary detectors that output healthy or failed, it is better to have detectors that attach probabilities and confidence to their decisions. So I think the phi accrual failure detector is relevant work to consider for detecting gray failures. I don't know if there is any other fuzzy-detector work for identifying latent failures.



Update: Ryan Huang, one of the authors of the work, left a comment with an insightful response to my questions. In the response, he includes links to follow-up work as well. https://docs.google.com/document/d/18Du33J1v3wOhqj-Vcuv5-wPnaweGhvnWFzenmHoxVcc/edit

Wednesday, September 11, 2019

Paper review. Asynchronous consensus without rounds

This paper by Robbert van Renesse appeared on arXiv two weeks ago. (Update: Huh, I missed this earlier, but the paper has a footnote that says it was written in 2010.) The paper looks very interesting. I have only skimmed it so far, but I will give it a careful read later.

All published crash and Byzantine fault-tolerant asynchronous consensus protocols use rounds. (Yes, indeed... Paxos, Viewstamped Replication, even Nakamoto consensus and the Avalanche protocol all use rounds.) Rounds are such an inherent part of consensus algorithms that it is tempting to conclude that solving fault-tolerant consensus requires ordered rounds. This paper shows that such a conclusion would be wrong, by presenting an asynchronous consensus protocol that does not use rounds.

The protocol is named after Texel, an island of the Netherlands. Presumably this is because 1) Robbert is Dutch, and 2) he wants to name an alternative island to the Paxos island, in a sea farther away from the Ionian Sea. Texel provides binary consensus (it can only decide between 0 and 1 as input votes) without rounds. Nodes query/poll other nodes to make up their minds; if 2/3rds of the votes are the same, the value is anchored. Texel is reminiscent of Avalanche in that it is a decentralized binary consensus algorithm that works by nodes polling other nodes. However, in contrast to Avalanche, which uses rounds and is probabilistic, Texel does not use rounds and is deterministic. Instead of rounds, nodes query consistent cuts and change their votes. The consistent cuts are identified using vector clocks in the Texel algorithm.

Texel is not a very efficient algorithm, because each node makes up its own mind in a decentralized manner. In contrast, by having a leader coordinate which proposal the other nodes vote on in a given round, Paxos achieves efficiency and economy.

What is more, in Texel, processes are not allowed to respond to queries while they are experimenting themselves. How is this supposed to terminate/decide? In this respect, Texel again resembles the Avalanche algorithm: if two conflicting proposals exist, consensus is not guaranteed to terminate for Avalanche, and the same holds for Texel.

Texel requires 3f+1 processes, where f is the number of processes that can crash. Byzantine fault-tolerant Texel requires 5f+1 processes. (I need to check how Texel deals with vector clock integrity in the presence of Byzantine nodes. Maybe it is done by having the corresponding processes sign their entries, because vector clock entries are only maxed, not updated by processes other than their originators.)

Conclusion

Ok, so what do we have here? Texel is not very efficient. It may not terminate. It uses more processes to tolerate f faulty processes. This all makes me think that rounds are a great abstraction for distributed systems. Ain't nothing wrong with rounds. They are implementable via ballot numbers, as in Paxos, and you are done. The paper doesn't claim that there is something wrong with rounds either, and neither does it claim that solving consensus without rounds brings any advantages.

On the other hand, this is still a super exciting paper, because Texel proves that distributed fault-tolerant consensus is possible without rounds. Texel breaks new ground! It may be possible to find more useful instances of no-round consensus algorithms in the future. The Texel protocol is derived using stepwise refinement. (Robbert had also used stepwise refinement in his work on chain replication. It is a technique that keeps on giving.) Starting from a high-level specification of Consensus, an intermediate-level specification called ProtoConsensus, with no rounds, is shown to refine the Consensus specification, and Texel is shown to refine ProtoConsensus. It may be possible to search for alternative implementations refining ProtoConsensus.

I am happy that new territory has been explored for decentralized consensus over the last couple of years. These are exciting times for distributed systems and algorithms.

MAD questions

1. How could we extend this to multi-decree consensus?
Texel is a single-instance consensus algorithm. Is it possible to extend Texel to multi-instance consensus in an efficient way and implement state machine replication using it? Is it possible to do linearizable reads from that multi-instance algorithm? Given that there is no leader, and the commit time of an update operation is murky, this will be tricky.

2. Is it possible to use HLC instead of VC for finding consistent cuts?
I suppose it may be possible to use HLC for the non-Byzantine version of Texel and benefit from loosely synchronized clocks, but I don't know if there would be a big practical gain.

3. How do we implement this protocol in TLA+?
Implementing Texel in PlusCal won't be very hard. Modeling Texel in PlusCal may provide value, because it will let you test different invariants and temporal properties and explore variations on the protocol. If I can get the scaffold for this in place, I may even assign this as a course project this year.
