Antifragility from an engineering perspective

I read Nassim Taleb's "Antifragility: Things That Gain from Disorder" book a while ago, and enjoyed it a lot. I have been thinking about antifragility in engineering systems, and thought it would be good to put what I have come up with so far in writing to be able to contribute to the discussion. There are some nice reviews of the book in various places. My intention here is not to review the book, but to try to look at antifragility from an engineering perspective. Unfortunately, this came out mostly as rambling, but here it is for what it is worth.

Engineered systems: 

Let's start with giving examples from the mechanical world. I will try to give examples for three increasingly-superior levels of reliability: robust-yet-fragile < resilient < antifragile.

Robust-yet-fragile. A good example here is the glass. Glass (think of automobile glasses or gorilla glass) is actually very tough/robust material. You can throw pebbles, and even bigger rocks at it, and it won't break or scratch, well, up to a point that is. The glass is very robust to the anticipated faults (stressor) up to a point. But, exceed that point, and then the glass is in shambles.  That shows an unanticipated stressor (a black swan event in Taleb's jargon) for the glass: a ninja stone. The ninja stone is basically a piece of ceramic that you take from the spark plug, and is denser than glass. So if you gently throw this very little piece of ceramic to your car window, it breaks in shambles. This is a well-known trick to break into cars.

This is called a robust-yet-fragile structure, and this is actually why we had the Titanic disaster. Titanic, the ship, had very robust panels, but again upto a point. When Titanic exceeded that point a little bit (with the iceberg hitting it), the panels broke into shambles, very much like the glass meeting ninja stone. Modern ships after Titanic, went for resilient, instead of robust (yet fragile) panels. The resilient panels bend easier, but they don't break as miserably. They still hold together to the face of an extreme stressor. Think of plastic; it is less robust but more resilient than glass.

The robust-yet-fragile effect is also known as highly optimized tolerance. If you optimize tolerance for one anticipated stressor, you become very vulnerable to another unanticipated fault. (Much like the closed Australian ecosystem.) There is some literature on robust yet fragile systems.

Resilient. Robust systems mask stressors (upto a point), and they optimize for certain anticipated stressors and fail miserably for unanticipated stressors. The resilient system approach is the opposite. It prescribes not to mask stressors; the stressors perturb the system somewhat, but you eventually recover from them. A good slogan for resilience is robust-yet-flexible. These systems stretch somewhat with the stressor, but not fail completely or discretely.

Engineers embrace resilience today. We can see this idea in construction, in bridges, skycrapers, which are built in a flexible way to sway somewhat with wind and earthquakes but not fail completely. In fact, by flexing/stretching a bit with stressors, these systems remain unharmed and last longer, as they tolerate the shocks by rolling with them instead of absorbing the shocks fully.

Today, it is more or less established that systems should aim to expose stressors/faults/problems in some suitable manner, rather than hide them. Even as a system masks problems, it should at least report/log them. Otherwise your system will die slowly and you won't know (such accounts are more common than you would imagine).

Antifragility in mechanical systems. Taleb emphasizes repeatedly that the antifragility idea is not robust and not just resilient. An antifragile system thrives under failures, not just tolerate it. Resilient is better than robust(-yet-fragile), and antifragile is better than resilience in this respect.

Taleb says in the book that one of the few examples of antifragile materials he knows is carbon nanotubes, which gets stronger when faced with a stressor. Here is another antifragile material, a non-newtonian fluid.

There are not many antifragility examples to mechanical systems. In fact, Taleb gives the washing machine versus the cat section in the book to illustrate this point. The cat (living organism) is antifragile. The cat becomes stronger with stressors: It is use it or lose it for the cat, as her muscles would atrophy in the absence of use. So living organisms benefit from stressors. But for the washing machine (mechanical system) it is use it and lose it, as there is wear and tear associated with usage. A mechanical system does not gain much from stressors. (Actually my old car sitting in the garage disagrees. When a car is not used, it starts developing some problems, and you can say that the car benefits to a degree from being used and stressors. This is similar to the state of a vacant house versus an occupied house.)

For engineering of mechanical systems, maybe antifragility is not that applicable. With the wisdom of hundreds of thousands of man-years in engineering (practical systems and technology), why aren't we already seeing examples of antifragility?

Antifragility is not well formalized in the book, especially from the engineering perspective. So, there is some gray area for which systems we can categorize as antifragile. Does antifragility involve effects of randomization, white-noise effects? If that is the case, you can say that engineered systems benefit from some randomness. It was found that Norbert Weiner's mechanical target-tracking calculators/computers performed better in the plane (with noise from vibration) than on the ground (with no noise). There is this thing called stochastic resonance, which has been harvested to achieve some gains in practice. These are sort of white noise asyncronized noise to the systems. On the other hand, it is also well-known that synchronized resonance (an army marching in unison) can bring down bridges. Maybe by harvesting resonance and learning to use it, antifragile mechanical systems can be engineered. Unfortunately I don't know much about this domain, and I am not sure if there are some good progress and results in this domain.

Antifragility in cyber world

Antifragility is actually all about environment actions and reaction by the system. An antifragile system reacts to environment actions to keep a utility function high, and sometimes achieve a very high utility function (this is when stressors help to improve the system).  This is called the barbell strategy in the book: Be safe with the utility function, and sometimes improve it drastically.

Then we can define antifragility as the ability to improve the utility function to the face of "bad" environment actions. But what is bad? One man's food can be another man's poison. Bad can be defined with respect to (comparative to) other systems: When other systems are badly affected by the environment actions, if you are gaining then you are antifragile. (This also implies that for zero-sum environments, you are being antifragile to the expense of the fragility of others.)

Probabilistic algorithms. The CS literature is full of examples of probabilistic algorithms that do much better than any deterministic algorithm. The impossibility results (such as FLP and attacking generals) are circumscribed by probabilistic algorithms. Some algorithms specifically benefit from the randomness: the random the process the better they fare. A nice book has been written by Rajiv Motwani and Prabhakar Raghawan on randomized algorithms.

Randomness is especially great for breaking ties, and used in networking and distributed systems literature for that. For example, in the ethernet and wifi protocols, random backoffs are used so that communication over a shared medium is possible at all. Do these examples count as antifragility?

This also reminds me of an example from the Sync book by Strogatz (which is an excellent book about research, read it if you haven't already). Strogatz has formulated a sync problem for runners on a running track. "Each runner has his own speed, which is analogous to the frequency of an oscillator, and all the runners shout at and are heard by every other runner, which is analogous to the coupling between the oscillators. Depending on the initial conditions and the setup of the coupling, a group of runners may synchronize into a single block all running at the same speed, fall into chaos everybody running on her own, or anything in between." It was found that when the runners (speeds and positions) are not very uniform but rather somewhat random, synchronization was possible and achieved faster. It was also found that there is a sudden phase transition non-sync and sync outcomes. I don't know what (if anything) this implies for antifragility.

BitTorrent example. For an antifragile system, the more you try to stress the system, the stronger it grows. A great example is bittorrent. In bittorrent, the more a file becomes traffic hotspot, the faster it gets to download it. Bittorrent streams gains from hotspots and contention by exploiting the Network effect to provide scaling. If a file is popular for downloads, then its parts are available from more peers, so it will be faster to download it from many peers available.

Fault-tolerant computing angle

Self-stabilization. In the fault-tolerant computing domain, self-stabilization comes to mind immediately as an example of resiliency. Self-stabilization, first proposed by Dijkstra in 1970s, calls for not categorizing faults but instead to design tolerance for anticipated faults. Treat all faults uniformly as a perturbation to the system, and design your system to be tolerant to perturbation. So regardless of faults or combinations of faults, your system is going to recover. This is the self-stabilization view. There has been a lot of work on self-stabilizing computer systems in the literature. The trivial kinds of stabilizing systems are soft-state systems, and restartable systems.

Self-stabilization does not fit the antifragility definition. Since the state corruption abstraction is too abstract and well-defined, the "corruption helps" becomes an oxymoron. Corruption is pre-defined to be the bad thing, so it is hard to play that game. Maybe you have zones of perturbation, and the further you are perturbed away, the faster you recover from it. But, if you have a fast recovery method why not use it for the other regions as well?

Self-adaptive systems, a more recent concept can fit the antifragility idea. But self-adaptive systems are not well defined/formalized, and I am not aware of any big success stories from that line of thinking yet.  And I guess the philosophical difference between self-adaptive versus antifragile systems is that, you can still have a predefined/constant antifragile system that is not self-adaptive. You can have an antifragile system that uses the barbell idea, and does not do bad in any input, but does great in some inputs. That system is not adaptive, but it is still antifragile.

Software rejuvenation. Software rejuvenation can be an example of antifragility in the fault-tolerant computing domain. The software rejuvenation idea is to reset the software occassionally to get rid of memory leaks, unoptimal or incorrect bugs. The antifragility angle is that, if the number of faults increase, you start a software rejuvenation. So the increased number of faults helps for faster recovery/rejuvenation of the software. Specifically the Microsoft bug reporting paper comes to my mind, which is a really interesting piece of work that appeared in SOSP 2009. The idea is that if a bug is more prevalent, it is reported more automatically and is fixed first.


Henrik Warne said…
Good post! I also really liked Antifragile - it is such a powerful concept. It made me think a lot, both on antifragility in general, and how it applies to software and software development. I wrote down my thoughts here:

Popular posts from this blog

Graviton2 and Graviton3

Foundational distributed systems papers

Learning a technical subject

Learning about distributed systems: where to start?

Strict-serializability, but at what cost, for what purpose?

CockroachDB: The Resilient Geo-Distributed SQL Database

Amazon Aurora: Design Considerations + On Avoiding Distributed Consensus for I/Os, Commits, and Membership Changes

Warp: Lightweight Multi-Key Transactions for Key-Value Stores

Anna: A Key-Value Store For Any Scale

Your attitude determines your success