Friday, April 12, 2019

Azure Cosmos DB: Microsoft's Cloud-Born Globally Distributed Database

It has been almost 9 months since I started my sabbatical work with the Microsoft Azure Cosmos DB team. 

I knew what I signed up for then, and I knew it was overwhelming.
It is hard not to get overwhelmed. Cosmos DB provides a global highly-available low-latency all-in-one database/storage/querying/analytics service to heavyweight demanding businesses. Cosmos DB is used ubiquitously within Microsoft systems/services, and is also one of the fastest-growing services used by Azure developers externally. It manages 100s of petabytes of indexed data, and serves 100s of trillions of requests every day from thousands of customers worldwide, and enables customers to build highly-responsive mission-critical applications.
But I underestimated how much there is to learn, and how long it would take to develop a good sense of the big picture. By "developing a good sense of the big picture", I mean learning/internalizing the territory myself, and, when looking at the terrain, being able to appreciate the excruciatingly tiny and laborious details in each square-inch as well.

Cosmos DB has many big teams working on many different parts/fronts at any point. I am still amazed by the activity going on simultaneously on so many different fronts. Testing, security, core development, resource governance, multitenancy, query and indexing, scale-out storage, scale-out computing, Azure integration, serverless functions, telemetry, customer support, devops, etc. Initially all I cared about was the core development front and the concurrency control protocols there. But over time, via diffusion and osmosis, I came to learn, understand, and appreciate the challenges on the other fronts as well.

It may sound weird to say this after 9 months, but I am still a beginner. Maybe I have been slow, but when you join a very large project, it is normal that you start working on something small, and on a parallel thread you gradually develop a sense of an emerging big picture through your peripheral vision. This may take many months, and at Microsoft big teams are aware of this, and give you your space as you bloom.

This is a lot like learning to drive. You start in the parking lots, in less crowded suburbs, and then as you get more skilled you go into the highway networks and get a fuller view of the territory. It is nice to explore, but you wouldn't be able to do it until you develop your skills. Even when someone shows you the map view and tells you that these exist, you don't fully realize and appreciate them. They are just concepts to you, and you have a very superficial familiarity with them until you start to explore them yourself.

As a result, things seem to move slowly if you are an engineer working on a small thing, say the automatic datacenter-failover component, in one part of the territory. Getting this component right may take months of your time. But it is OK. You are building a new street in one of the suburbs, and this adds up to the big picture. The big picture keeps growing at a steady rate, and the colors get more vibrant, even though individually you feel like you are progressing slowly.

"A thousand details add up to one impression." -Cary Grant 
"The implication is that few (if any) details are individually essential, while the details collectively are absolutely essential." -McPhee

When I started in August, I had the suspicion that I needed to unpack this term: "cloud-native".
(The term "cloud-native" is a loaded key term, and the team doesn't use it lightly. I will try to unpack some of it here, and I will revisit this in my later posts.)
Again, I underestimated how long it would take me to get a better appreciation of this. After learning about the many fronts on which things are progressing, I was able to get a better understanding of how important and challenging this is.

So in the rest of the post, I will try to give an overview of things I learned about what a cloud-native distributed database service does. It is possible to write a separate blog post for each paragraph, and I hope to expand on each in the future.

What does a cloud-born distributed database look like?

On the quest to achieve cost-effective, reliable and assured, general-purpose and customizable/flexible operation, a globally distributed cloud database faces many challenges and mutually conflicting goals in different dimensions. But it is important to hit as high a mark in as many of the dimensions as possible.

While several databases hit one or two of these dimensions, hitting all these dimensions together is very challenging. For example, the scale challenge becomes much harder when the database needs to provide guaranteed performance at any point in the scale curve. Providing virtually unlimited storage becomes more challenging when the database also needs to meet stringent performance SLAs at any point. Finally, achieving these while providing tenant-isolation and managing resources to prevent any tenant from impacting the performance of others is extra challenging, but is required for providing a cost-efficient cloud database.

Trying to add support and realize substantial improvements for all of these dimensions does not work as an afterthought. Cosmos DB is able to hit a high mark in all these dimensions because it is designed from the ground up to meet all these challenges together. Cosmos DB provides a frictionless cloud-native distributed database service via:

  • Global distribution (with guaranteed consistency and latency) by virtue of transparent multi-master replication,
  • Elastic scalability of throughput and storage worldwide (with guaranteed SLAs) by virtue of horizontal partitioning, and
  • Fine grained multi-tenancy by virtue of highly resource-governed system stack all the way from the database engine to the replication protocol.

To realize these, Cosmos DB uses a novel nested distributed replication protocol, robust scalability techniques, and well-designed resource governance abstractions, and I will try to introduce these next.

Scalability, Global Distribution, and Resource Governance

To achieve high scalability and availability, Cosmos DB uses a novel protocol to replicate data across nodes and datacenters with minimal latency overheads and maximum throughput. For scalability, data is automatically sharded, guided by storage and throughput triggers, and the shards are assigned to different partitions. Each partition is served by a multi-node replica-set in each region, with one node acting as a primary replica to handle all write operations and replication within the replica-set. Clients read the data from a quorum of secondary replicas. (Soft-state scale-out storage and computation implementations provide independent scaling for storage-heavy and computation-heavy workloads.)
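To make the sharding idea above concrete, here is a toy sketch of hash-partitioned storage where a partition split is triggered when a partition exceeds a storage threshold. This is my own simplification for illustration (the class names, the naive rehash-everything split, and the tiny threshold are all made up); the actual system splits ranges incrementally and also triggers on throughput.

```python
# Toy sketch: hash-based sharding with storage-triggered splits.
import hashlib

STORAGE_LIMIT = 4  # max items per partition; tiny on purpose, for illustration

class Partition:
    def __init__(self):
        self.items = {}

class Store:
    def __init__(self, num_partitions=2):
        self.partitions = [Partition() for _ in range(num_partitions)]

    def _index(self, key):
        # Hash the partition key to pick a partition.
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return h % len(self.partitions)

    def put(self, key, value):
        p = self.partitions[self._index(key)]
        p.items[key] = value
        if len(p.items) > STORAGE_LIMIT:   # storage trigger fires
            self._split()

    def get(self, key):
        return self.partitions[self._index(key)].items.get(key)

    def _split(self):
        # Naive scale-out: double the partition count and rehash every item.
        old = [p.items for p in self.partitions]
        self.partitions = [Partition() for _ in range(2 * len(old))]
        for items in old:
            for k, v in items.items():
                self.partitions[self._index(k)].items[k] = v

store = Store()
for i in range(20):
    store.put(f"key{i}", i)
assert store.get("key7") == 7
assert len(store.partitions) >= 4  # splits happened as data grew
```

The point of the sketch is only the control loop: writes land by hash, and growth past a threshold transparently increases the partition count without the client noticing.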

Since the replica-set maintains multiple copies of the data, it can mask some replica failures in a region without sacrificing availability, even in the strong consistency mode. (While two simultaneous replica failures may be rare, it is more common to observe one replica failure while another replica is unavailable due to rolling upgrades, and masking two replica failures helps provide smooth, uninterrupted operation for the cloud database.) Each region contains an independent configuration manager to maintain system configuration and perform leader election for the partitions. Based on the membership changes, the replication protocol also reconfigures the size of read and write quorums.
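The quorum reconfiguration mentioned above rests on the classic quorum-intersection rule: with N replicas, picking write quorum W and read quorum R such that R + W > N guarantees every read quorum overlaps every write quorum, so a read observes the latest acknowledged write. A minimal sketch of that arithmetic (my illustration, not Cosmos DB internals):

```python
# Quorum sizing sketch: majority write quorum, and the smallest
# read quorum that still intersects it (R + W > N).
def quorums(n):
    w = n // 2 + 1       # majority write quorum
    r = n - w + 1        # smallest R satisfying R + W > N
    return r, w

# A replica-set of 4: W=3, R=2. Writes tolerate one failed replica
# (e.g., one down for a rolling upgrade).
r, w = quorums(4)
assert (r, w) == (2, 3) and r + w > 4

# After the configuration manager removes a failed replica (N=3),
# the protocol recomputes the quorum sizes.
r2, w2 = quorums(3)
assert (r2, w2) == (2, 2) and r2 + w2 > 3
```

On a membership change, rerunning this computation with the new N is what "reconfiguring the size of read and write quorums" amounts to at its core.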

In addition to the local replication inside a replica-set, there is also geo-replication, which implements distribution across any number of Azure regions (50+ of them). Geo-replication is achieved by a nested consensus distribution protocol across the replica-sets in different regions. Cosmos DB provides multi-master active-active replication in order to allow reads and writes from any region. For most consistency models, when a write originates in some region, it becomes immediately available in that region while being sent to an arbiter (ARB) for ordering and conflict resolution. The ARB is a virtual process that can co-locate in any of the regions. It uses the distribution protocol to copy the data to the primaries in each region, which then replicate the data in their respective regions. The Cosmos DB distribution and replication protocols are verified at the design level with the TLA+ model checker, and the implementations are further tested for consistency problems using (an MS Windows port of) Jepsen tests.
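To illustrate the ARB's role described above, here is a heavily simplified sketch (my own, not the actual nested consensus protocol): a write is visible locally right away, while the arbiter assigns a single global order and forwards the write to every other region. All names are illustrative.

```python
# Simplified sketch of arbiter-based geo-replication ordering.
class Region:
    def __init__(self, name):
        self.name = name
        self.log = []                       # locally visible writes

    def local_write(self, arb, key, value):
        self.log.append((key, value))       # immediately visible in-region
        arb.submit(self.name, key, value)   # sent to ARB for global ordering

class Arbiter:
    """Virtual process that totally orders writes and fans them out."""
    def __init__(self, regions):
        self.regions = regions
        self.seq = 0

    def submit(self, origin, key, value):
        self.seq += 1                       # one global sequence number
        for r in self.regions:
            if r.name != origin:
                r.log.append((key, value))  # deliver to other regions

east, west = Region("east"), Region("west")
arb = Arbiter([east, west])
east.local_write(arb, "x", 1)
west.local_write(arb, "x", 2)
# In this serial run, both regions converge on the same ordered log.
assert east.log == west.log == [("x", 1), ("x", 2)]
```

In a real deployment the two writes would race, and the interesting work (conflict resolution, agreeing on the order despite concurrency and failures) is exactly what the consensus protocol and the TLA+ verification mentioned above are for; this sketch only shows the data flow.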

Cosmos DB is engineered from the ground up with resource governance mechanisms in order to provide an isolated provisioned throughput experience (backed by SLAs), while achieving high density packing (where 100s of tenants share the same machine and 1000s share the same cluster). To this end, Cosmos DB defines an abstract rate-based currency for throughput, called the Request Unit or RU (plural, RUs), that provides a normalized model for accounting the resources consumed by a request, and charges customers for throughput across various database operations consistently and in a hardware-agnostic manner. Each RU combines a small share of CPU, memory, and storage IOPS. Tenants in Cosmos DB control the desired performance they need from their containers by specifying the maximum throughput RUs for a given container. Viewed from the lens of resource governance, Cosmos DB is a massively distributed queuing system with cascaded stages of components, each carefully calibrated to deliver predictable throughput while operating within the allotted budget of system resources, guided by the principles of Little's Law and the Universal Scalability Law. The implementation of resource governance leverages the backend partition-level capacity management and across-machine load balancing.
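The RU mechanism above behaves much like a rate limiter. Here is a hedged sketch of the idea as a per-container token bucket refilled at the provisioned RU/s rate; the class name and the RU costs are made up for illustration and are not the real RU charges.

```python
# Sketch of RU-based admission control as a token bucket.
class RUGovernor:
    def __init__(self, provisioned_rus):
        self.capacity = provisioned_rus
        self.tokens = provisioned_rus

    def refill(self):
        # Called once per second: the budget resets to the provisioned rate.
        self.tokens = self.capacity

    def try_admit(self, ru_cost):
        # Admit the request only if its normalized cost fits in the budget.
        if ru_cost <= self.tokens:
            self.tokens -= ru_cost
            return True
        return False   # rate-limited (the service would return a 429)

gov = RUGovernor(provisioned_rus=400)
READ_COST, WRITE_COST = 1, 5           # illustrative costs, not real charges
assert all(gov.try_admit(WRITE_COST) for _ in range(80))  # 80 * 5 = 400 RUs
assert not gov.try_admit(READ_COST)    # budget exhausted until the next refill
gov.refill()
assert gov.try_admit(READ_COST)
```

The normalization is the key design point: because every operation is priced in the same currency, the same governor isolates tenants regardless of whether their workload is read-heavy, write-heavy, or query-heavy.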

Contributions of Cosmos DB

By leveraging the cloud-native datastore foundations above, Cosmos DB provides an almost "one size fits all" database. Cosmos DB provides local access to data with low latency and high throughput, offers tunable consistency guarantees within and across datacenters, and provides a wide variety of data models and APIs leveraging powerful indexing and querying support. Since Cosmos DB supports a spectrum of consistency guarantees and works with a diverse set of backends, it resolves the integration problems companies face when multiple teams use different databases to meet different use-cases.

The vast majority of other datastores provide either eventual consistency, strong consistency, or both. In contrast, the spectrum of consistency guarantees Cosmos DB provides meets the consistency and availability needs of all enterprise applications, web applications, and mobile apps. Cosmos DB enables applications to decide what is best for them and to make different consistency-availability tradeoffs. An eventually-consistent store allows diverging state for a period of time in exchange for more availability and lower latency, and may be suitable for relaxed-consistency applications, such as recommendation systems or search engines. On the other hand, a shopping cart application may require a "read your own writes" property for correct operation. This is served by the session consistency guarantee in Cosmos DB. Extending this across many clients requires "read the latest writes" semantics, and hence strong consistency. This ensures coordination among many clients of the same application and makes the job of the application developer easy. The strong consistency is preserved both within a single region as well as across all associated regions. The bounded staleness consistency model guarantees that any read request returns a value within the most recent k versions or t time interval. It offers global total order except within the staleness window, therefore it is a slightly weaker guarantee than strong consistency.
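The bounded staleness definition above is easy to state as a predicate: a read is acceptable if the value it returns lags the latest write by at most k versions or at most t time. A toy check of that definition (my illustration of the guarantee, not service code; the function name and parameters are made up):

```python
# Toy check of the bounded staleness guarantee: a read is within bounds
# if its version lag is at most k OR its time lag is at most t seconds.
def within_staleness_bound(read_version, read_version_time,
                           latest_version, latest_write_time,
                           k, t_seconds):
    version_lag = latest_version - read_version
    time_lag = latest_write_time - read_version_time
    return version_lag <= k or time_lag <= t_seconds

# Read lags by 3 versions: acceptable with k=5 even though the
# time bound (5s) is exceeded.
assert within_staleness_bound(97, 0.0, 100, 10.0, k=5, t_seconds=5)

# Read lags by 10 versions AND 10 seconds: violates both bounds.
assert not within_staleness_bound(90, 0.0, 100, 10.0, k=5, t_seconds=5)
```

Within the staleness window reads may observe older values, which is why the guarantee sits between session consistency and strong consistency in the spectrum.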

Another way Cosmos DB manages to become an all-in-one database is by providing multiple APIs and serving different data models. Web and mobile applications need a spectrum of choices/alternatives for their data models: some applications take advantage of simpler models, such as key-value or column-family, and yet some other applications require more specialized data models, such as document and graph stores. Cosmos DB is designed to support APIs and data models by projecting the internal document store into different representations depending on the selected model. Clients can pick between document, SQL, Azure Table, Cassandra, and Gremlin graph APIs to interact with their datastore. This not only provides great flexibility for our clients, but also allows them to effortlessly migrate their applications over to Cosmos DB.

Finally, Cosmos DB provides the tightest SLAs in the industry. In contrast to many other databases that provide only availability SLAs, Cosmos DB also provides consistency SLAs, latency SLAs, and throughput SLAs. Cosmos DB guarantees that read and write operations take under 10ms at the 99th percentile for a typical 1KB object. The SLAs for throughput guarantee that clients receive throughput equivalent to the resources provisioned to the account via RUs.

Tuesday, April 9, 2019

Book Notes. Show Your Work! 10 Ways to Share Your Creativity and Get Discovered, by Austin Kleon

I recently read this book on Kindle. I really liked this short book, and in general, Austin Kleon's work. These are my notes, without context.

A new way of operating

But it’s not enough to be good. In order to be found, you have to be findable. I think there’s an easy way of putting your work out there and making it discoverable while you’re focused on getting really good at what you do.

Almost all of the people I look up to and try to steal from today, regardless of their profession, have built sharing into their routine. These people aren’t schmoozing at cocktail parties; they’re too busy for that. They’re cranking away in their studios, their laboratories, or their cubicles, but instead of maintaining absolute secrecy and hoarding their work, they’re open about what they’re working on, and they’re consistently posting bits and pieces of their work, their ideas, and what they’re learning online. Instead of wasting their time “networking,” they’re taking advantage of the network. By generously sharing their ideas and their knowledge, they often gain an audience that they can then leverage when they need it — for fellowship, feedback, or patronage.

If Steal Like an Artist was a book about stealing influence from other people, this book is about how to influence others by letting them steal from you.

You don't have to be a genius

If you look back closely at history, many of the people who we think of as lone geniuses were actually part of “a whole scene of people who were supporting each other, looking at each other’s work, copying from each other, stealing ideas, and contributing ideas.” *Scenius* doesn’t take away from the achievements of those great individuals; it just acknowledges that good work isn’t created in a vacuum, and that creativity is always, in some sense, a collaboration, the result of a mind connected to other minds.

Amateurs are not afraid to make mistakes or look ridiculous in public. They’re in love, so they don’t hesitate to do work that others think of as silly or just plain stupid.

Mediocrity is, however, still on the spectrum; you can move from mediocre to good in increments. The real gap is between doing nothing and doing something.

Think process, not product

The best way to get started on the path to sharing your work is to think about what you want to learn, and make a commitment to learning it in front of others. Find a scenius, pay attention to what others are sharing, and then start taking note of what they’re not sharing. Be on the lookout for voids that you can fill with your own efforts, no matter how bad they are at first. Don’t worry, for now, about how you’ll make money or a career off it. Forget about being an expert or a professional, and wear your amateurism (your heart, your love) on your sleeve. Share what you love, and the people who love the same things will find you.

Whether you share it or not, documenting and recording your process as you go along has its own rewards: You’ll start to see the work you’re doing more clearly and feel like you’re making progress. And when you’re ready to share, you’ll have a surplus of material to choose from.

Share something small every day

Once a day, after you’ve done your day’s work, go back to your documentation and find one little piece of your process that you can share. Where you are in your process will determine what that piece is. If you’re in the very early stages, share your influences and what’s inspiring you. If you’re in the middle of executing a project, write about your methods or share works in progress. If you’ve just completed a project, show the final product, share scraps from the cutting-room floor, or write about what you learned. If you have lots of projects out into the world, you can report on how they’re doing — you can tell stories about how people are interacting with your work.

Don’t show your lunch or your latte; show your work.

Don’t worry about everything you post being perfect. Science fiction writer Theodore Sturgeon once said that 90 percent of everything is crap. The same is true of our own work. The trouble is, we don’t always know what’s good and what sucks. That’s why it’s important to get things in front of others and see how they react.

Don’t say you don’t have enough time. We’re all busy, but we all get 24 hours a day. People often ask me, “How do you find the time for all this?” And I answer, “I look for it.” You find time the same place you find spare change: in the nooks and crannies.

Absolutely everything good that has happened in my career can be traced back to my blog.

Tell good stories

“The problem with hoarding is you end up living off your reserves. Eventually, you’ll become stale. If you give away everything you have, you are left with nothing. This forces you to look, to be aware, to replenish... Somehow the more you give away, the more comes back to you.” — Paul Arden

There’s not as big of a difference between collecting and creating as you might think. A lot of the writers I know see the act of reading and the act of writing as existing on opposite ends of the same spectrum: The reading feeds the writing, which feeds the reading.

If you want to be more effective when sharing yourself and your work, you need to become a better storyteller.

Teach what you know

Teaching doesn’t mean instant competition. Just because you know the master’s technique doesn’t mean you’re going to be able to emulate it right away.

In their book, Rework, Jason Fried and David Heinemeier Hansson encourage businesses to emulate chefs by out-teaching their competition. “What do you do? What are your ‘recipes’? What’s your ‘cookbook’? What can you tell the world about how you operate that’s informative, educational, and promotional?” They encourage businesses to figure out the equivalent of their own cooking show.

The minute you learn something, turn around and teach it to others. Share your reading list.

When you share your knowledge and your work with others, you receive an education in return.

Make stuff you love and talk about stuff you love and you’ll attract people who love that kind of stuff. It’s that simple.

Sunday, April 7, 2019

How I pulled the fire alarm in my apartment complex

This is an embarrassing story for me. But I guess I should talk about it for the sake of transparency. You have a right to know what kind of person's paper reviews, TLA+ models, distributed systems musings, Paxos commentary, and life/research advice you are reading. Yep, I am the kind of goofy guy that triggers the fire alarm for the entire complex unknowingly but inevitably.

The Background

This happened on Sunday, August 5, 2018. (Yes, I have been too embarrassed to post this any earlier.) It was only 6 days after we moved into our apartment as part of my sabbatical at Microsoft Cosmos DB.

So that morning, we were about to leave home for grocery shopping. With 3 kids, leaving the house is a ...process. You get them ready, you convince them to leave the house. And they can somehow detect desperation in you and drag their feet if you would like them to leave soon. It seems like my kids derive pleasure from being the last one to leave the house, the last one to put on shoes, or the last one to get into the car. Anyway, this takes a good 20 minutes some days.

So the process is happening and they are slowly preparing to leave. And I leave the apartment and start waiting for them outside the door, in order to show them how serious I am about leaving.

Just outside our door, there is this red box for the fire alarm. This is normally where the doorbell should be in any sensibly designed building. But for some reason, in our apartment complex, this is where they decided the fire alarm should be.

I am very distressed about this fire alarm being right next to our door. I am scared that one of my kids, maybe the 7-year-old, or the 3.5-year-old, would open this box and then set off the fire alarm.

Trainwreck in motion  

So I am thinking that I should take a closer look at this box, and see if my kids might accidentally set it off. This way I can warn them about this danger-box.

The box is dusty, so I dust it off. And I proceed to open the box, to see what kind of button or arm mechanism the box has inside. This is because I want to tell my kids to stay away from it. You know, for science. (Yeah, that is not very solid reasoning, which will become clear to me soon.)

Anyhow, I open the box; it opens easily. And in the 3 seconds after opening the box, I realize I have gone too far and done something very wrong. Of course the arm is the box cover itself; there is no arm or button inside the box!!

And 1, 2, 3... The shit hits the fan. All the fire alarms in the apartment complex start screaming!

I try to close the box, but of course it is triggered. And can't be closed.

The fire alarm is too loud. And at first I am in denial. This can't be happening. I couldn't have set off the fire alarm for the entire complex just by opening this box; I am innocent.

Everyone starts leaving their apartments, and starts looking around to see what is wrong. I am still standing in front of the box, trying to figure out what to do. I can't think of anything except that this cannot be happening.

The old lady above our apartment walks outside and asks me what happened. I tell her I set off the fire alarm accidentally. She says she will look for the number of the apartment manager so they can come reset it. Another lady next door comes out alarmed; I admit to her as well that I triggered the alarm.

Now there is a crowd in front of the apartment complex. And I have to come clean to them. An old guy from the adjoining apartment block says "Jesus" after hearing my explanation. I think he might have said "what a dork" under his breath. And he tells the others "He set off the alarm". Again, I am pretty desensitized to all this because I am still shocked by the whole thing.

The alarms are blaring, and the old lady tells me we should call 911 and ask the firefighters to come reset the thing. I call 911, and ask for the firefighters. We are next to the fire station so it takes them 3 minutes to get here. Of course they come with the big red truck, and all suited up. They cannot be unprepared. They always need to arrive with their entire equipment and tools. Those may be needed.

I show them the box. One of them takes a look. But then of course before resetting the box, they need to first disarm the control panel so the sound can stop. Our neighbor tells them about it. Turns out the plans they brought for the apartment complex are outdated, and they didn't have the panel marked up in there. So a couple of them go that way. And after a minute, the sounds stop. Oh, bliss, finally.

One of the firefighters says to the waiting crowd, "Thank you everyone for adhering to the rules. Going out is the right thing to do when the fire alarm goes off". The firefighter is friendly but rubs it in my face that I set off the alarm.

He then proceeds to the box with a ring of 20-30 keys. He tries them one by one on the box to find a fitting key. These must be the master keys for the companies that build fire alarms. One of them fits eventually. And the box is reset and shut closed again.

The mousetrap is set again.

Then another firefighter goes back to the control panel behind the building to arm the system again. They are done. This is probably all within 15 minutes from when I set it off to when they are done. The firefighter chief says this was not all in vain; they get to update their map and plan for the complex. Chaos engineering, anyone?

And a good thing is that there is no fee for setting off the fire alarm accidentally. Phew...

The aftermath 

My son, 11 years old, is very disappointed in me. He says "Dad, how could you not know? Even kids know about the fire alarms and to leave them alone." He spends the entire day being very disappointed in me. But things are back to normal after a day. And I start joking with him afterwards by pretending to reach for the box to get him mad.

My wife is only mildly angry with me. After the shock is over, and we are finally on our way to shopping for groceries, she tells me: "Why are you like this? And why am I not surprised you would do something this goofy?"

When I told the story to my PhD students, Ailidani and Aleksey, they were amused. But again they were not very surprised. Somehow this is something they expect their advisor would do: set off the fire alarm while checking it.

But in my defense, this box is like a mousetrap for the ADD-me. It is not at all obvious to me that the arm is the box cover. When I see a box, my instinct is to open it to see what is inside.

What kind of affordance is this? It is definitely not obvious that this is an arm.

Friday, April 5, 2019

The aging puzzle

Recently I came across this site. Looks like Sebastian Baltes has been doing interesting work. He studies the human aspects of software engineering. He interviews software engineers to learn about their work habits and processes, for example, how they achieve software development expertise by practicing through some tasks, and how this expertise helps them perform better in other software development tasks.

One of the things he investigated is how aging affects performance in software development. He inspects this through self-reporting by the developers, so the data is subjective rather than objectively measured. The developers reported they felt their short-term memory became limited and it got harder to write code as they aged. Some of the developers said "when you are young you are more competitive, but as you get old, you don't feel like competing that much."

Is this true? What do we have to look forward to as we age?

What does the data say?

"Young people are just smarter." 
--Mark Zuckerberg (2007), when he was 24

It turns out, among some other things, Mark Zuckerberg got this very wrong. Longstanding beliefs say the adult brain is best in its youth, but research now suggests otherwise. The middle-aged mind preserves many of its youthful skills and even develops some new strengths. I was surprised to learn that bilateralization of the brain is a real thing:
Several groups, including Grady’s, have also found that older adults tend to use both brain hemispheres for tasks that only activate one hemisphere in younger adults. Younger adults show similar bilateralization of brain activity if the task is difficult enough, Reuter-Lorenz says, but older adults use both hemispheres at lower levels of difficulty. 
The strategy seems to work. According to work published in Neuroimage (Vol. 17, No. 3) in 2002, the best-performing older adults are the most likely to show this bilateralization. Older adults who continue to use only one hemisphere don’t perform as well.

There have been studies of the effects of aging on professors' publication records. These studies show that there was no slowdown of publications with age.

But it looks like there is no data on whether or how developers' performance degrades as they age.

My speculations

The old professors I interacted with throughout my career were sharp, and some of them surprisingly sharp. I don't think aging puts a big strain on the brain. As you age, the body starts to suffer first, not the brain. If you take care of your body, most importantly if you are able to keep slim and sleep well at night, you can get a lot of mileage from your brain.

I think the self-reported observations from old knowledge workers may have many underlying causes. One cause could be confirmation bias. People may be getting more sensitive about age (which is especially the case in an ageist work environment), and pay more attention to this. Inevitably they see what they look for, and take this as real.

For some of the older developers, the loss of meaning could be a problem. When things get too monotonous and development work loses its novelty, it would be hard to extract meaning from the job.

It is often said that old people are slow to adopt new things. Douglas Adams has a famous quote, which you may have seen:
1. Anything that is in the world when you're born is normal and ordinary and is just a natural part of the way the world works.
2. Anything that's invented between when you're 15 and 35 is new and exciting and revolutionary and you can probably get a career in it.
3. Anything invented after you're 35 is against the natural order of things. 

While this is witty, it is not true for the 35+ year olds I interact with in academia and in the tech industry.

I think the first part is true. As a kid, you adapt quickly, and since you don't have much experience, you don't question or reject something. But as you grow up, you develop some taste, so you may reject some things, even when you are between 15-35. And after 35, maybe you have too many experiences and scars, which may make you more cautious and closed-minded about things.

In any case this may be an important point. To avoid falling behind technology and keep your edge as you get older, it would help to keep your curiosity alive. The other day, I got into an elevator with my 4 year old daughter, and she was utterly delighted by the elevator. I envied her and wished I could get that excited about things. But it is possible to keep curiosity alive and it is possible to look at things in a new light by being more mindful about this.

Finally, it is a fact that young people are statistically more likely to be risk takers.
But there is one talent that does decline over time—our willingness to take risks. For evolutionary reasons, risk-taking peaks between the ages of 17 to 27, then drops off precipitously.
Well, risk taking is not necessarily good, as it often does not pay off. Survivorship bias means that only the successful risk-takers get all the publicity. On the other hand, I agree that old people may tend to get overly conservative and cautious. But, do you know why old people check 3 times if they locked the door or turned off the oven? Because they have been bitten by it before.

To avoid becoming overly-cautious and overly risk-averse, we may need to reset our attitudes every 5-10 years or so. Some people say psychedelics help with paving over the old beaten tracks, and resetting bad habits/thoughts. I think just taking time off, going on a journey, and reflecting on our behavior/attitudes could be very effective solutions for this.

MAD Questions 

This entire thing was already very speculative anyways. So I will be lazy and leave it with one MAD question.

How do we dig our way out of ageism?

Many people say (and I agree) that ageism in the software industry is a real problem.
The software industry is overwhelmingly young. The median age of Google and Amazon employees is 30, whereas the median age of American workers is 42. A 2018 Stack Overflow survey of 100,000 programmers around the world found that three-quarters of them were under 35. Periodic posts on Hacker News ask, "What happens to older developers?" Anxious developers in their late thirties chime in and identify themselves as among the "older." 
Kevin Stevens, a 55-year-old programmer, faced a similar attitude when he applied for a position at Stack Exchange six years ago. He was interviewed by a younger engineer who told him, "I'm always surprised when older programmers keep up on technology." Stevens was rejected for the job. He now works as a programmer at a hospitality company where he says his age is not an issue.

How do we dig our way out of this situation? Could the capitalist free market offer a solution eventually? How likely is it that we will see a company that disrupts the ageist companies by hiring and making better use of older developers?

Wednesday, April 3, 2019

Timely Algorithms

This is what I (along with some collaborators) have been thinking about lately. This is still raw, but the idea is exciting enough that I wanted to share early and get feedback, recommendations, pointers.

Using synchronized time for distributed coordination

A fundamental problem in distributed systems is to coordinate the execution of nodes effectively to achieve performance while preserving correctness. Unfortunately, distributed coordination is a notoriously tricky problem. Since the nodes do not have access to a shared state and a common clock, they need to communicate and synchronize state frequently in order to coordinate on the order of events. However, excessive communication hinders the performance/scalability of a distributed system.

A key component of distributed coordination is the enforcement of consistent views at all nodes for the ordering of significant events. Distributed algorithms have predominantly employed logical clocks and vector clocks for this coordination. These are essentially event counters, entirely divorced from wallclock/physical clocks and real time. While logical/vector clocks are robust to asynchrony (unbounded divergence in the speed/rate of processing at nodes), explicit communication is required for the nodes to infer/deduce/learn each other's logical counters. (That is, communication is the only way to convey information in this approach.)
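As an aside, the event-counter nature of logical clocks can be sketched in a few lines of Python. The class and method names here are illustrative, not from any particular library; the point is that ordering information travels only through explicit messages:

```python
# A minimal sketch of Lamport logical clocks: counters that advance on
# local events and are merged on message receipt.

class LamportClock:
    def __init__(self):
        self.time = 0

    def local_event(self):
        self.time += 1
        return self.time

    def send(self):
        # The timestamp must be piggybacked on the outgoing message;
        # this is the only way the receiver learns about our counter.
        self.time += 1
        return self.time

    def receive(self, msg_time):
        # Merge: take the max of local and received time, then advance.
        self.time = max(self.time, msg_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
a.local_event()      # a.time == 1
t = a.send()         # a.time == 2; message carries timestamp 2
b.receive(t)         # b.time == max(0, 2) + 1 == 3
```

Note that without the `send`/`receive` message, `b` would have no way to learn that its counter lags `a`'s; this is the communication cost that synchronized physical clocks can help avoid.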

In our work, we advocate using time from tightly-synchronized clocks to convey information with the passage of time and reduce the communication costs of distributed algorithms. 

Previous work on using synchronized clocks

This is an attractive goal, so it has been pursued before, and some progress has been made. Barbara Liskov's 1991 paper on the practical use of synchronized clocks introduced the use of timeouts for leases. However, leases at the nodes do not require tightly synchronized clocks; they only use clocks to measure a timeout period. This is easy to implement because the skew of physical clocks is insignificant compared to millisecond- and second-level timeout periods.
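To see why leases tolerate clock skew, here is a minimal sketch (the `Lease` class and the period value are illustrative): the holder only measures an elapsed duration on its own local clock, never comparing timestamps across nodes.

```python
import time

# A minimal lease sketch. Only elapsed local time matters, so
# millisecond-level clock skew between nodes is harmless.

class Lease:
    def __init__(self, period):
        self.period = period      # lease duration in seconds
        self.granted_at = None

    def grant(self):
        # Record a local monotonic timestamp; no cross-node clocks involved.
        self.granted_at = time.monotonic()

    def is_valid(self):
        if self.granted_at is None:
            return False
        return time.monotonic() - self.granted_at < self.period

lease = Lease(period=0.05)
lease.grant()
assert lease.is_valid()   # fresh lease is valid
time.sleep(0.1)
assert not lease.is_valid()  # expired after the period elapses
```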

There was no significant progress for a long time after Liskov's 1991 paper; adopting synchronized clocks to improve distributed algorithms proved to be challenging. One reason was that synchronization was not tight enough or reliable enough. As clock technology improved, synchronization got tighter and more reliable.

The availability of tightly synchronized clocks led to some more progress in adopting synchronized clocks in distributed systems. Google Spanner and CausalSpartan use synchronized clocks to provide extra guarantees. Spanner uses them to provide linearizability of transactions across WANs, and CausalSpartan uses them to implement efficient causal consistency across WANs. Both also leverage synchronized clocks to provide distributed consistent snapshots.

Challenges for using time for distributed coordination

The reason there have not been more examples of using synchronized clocks to convey information and reduce the communication costs of distributed algorithms is the challenges involved. There are hidden factors that can invalidate the information conveyed via the passage of time:
(a) Crash of nodes,
(b) Loss of messages, and
(c) Asynchrony in clocks.

Consider a silent-consent commit algorithm. The transaction manager (TM) sends a message to the resource managers (RMs) asking them to commit the transaction at time T. If an RM needs to reject the transaction, it sends a reject message, and the TM then sends an Abort message to all the RMs to abort the transaction. Otherwise, the TM's first message to commit at T is the only message employed in the algorithm. Note that if a disagreeing RM has crashed, or if a message loss occurs, the safety of this silent-consent algorithm is violated.
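The failure mode can be made concrete with a toy sketch of the outcomes (the function and its parameters are illustrative, not an implementation of our algorithms): on the happy path a single broadcast suffices, but one lost reject message leaves the nodes with mixed commit/abort decisions.

```python
# Toy model of silent-consent commit outcomes. Each RM votes 'yes' or
# 'no'; reject_delivered models whether a 'no' vote reaches the TM.

def silent_consent_outcomes(rm_votes, reject_delivered):
    rejected = 'no' in rm_votes
    if rejected and reject_delivered:
        # TM learns of the rejection in time and broadcasts Abort.
        return ['abort'] * len(rm_votes)
    # Otherwise each RM silently commits at time T -- except a rejecting
    # RM, which knows its own vote and aborts locally.
    return ['abort' if v == 'no' else 'commit' for v in rm_votes]

# Happy path: unanimous yes, the TM's first message is the only message.
assert silent_consent_outcomes(['yes', 'yes', 'yes'], True) == ['commit'] * 3

# Fault: the reject message is lost, so the RMs reach mixed decisions
# and safety (all-commit or all-abort) is violated.
outcome = silent_consent_outcomes(['yes', 'no', 'yes'], False)
assert outcome == ['commit', 'abort', 'commit']
```

A crashed disagreeing RM has the same effect as the lost reject message: the TM never hears the objection, and the remaining RMs commit at T.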

Timely Algorithms

In our work, we formulate a systematic approach to circumvent these challenges and enable building the next generation of synchronized-clock-based distributed coordination algorithms, which we call timely algorithms. Our method is to first formally articulate what information we would like time to convey, and then to determine whether it is possible to convey that information in the presence of type-a, type-b, and type-c faults. If the information to be conveyed is not tolerant to the faults considered, there are three ways to resolve the situation.

One way to resolve this problem is to convey less ambitious information and still get some benefit from it. In that case safety holds in the face of faults. An alternative way is to employ extra/orthogonal mechanisms that mask these fault types. For example, using a Paxos group instead of a single node would mask type-a faults. Using datacenter networks with redundancy would mask type-b faults. Using better hardware- and software-backed clocks would mask type-c faults. It is then possible to provide silent-consent-like efficient timely algorithms.

Finally, another way to resolve the problem is to let safety be violated by the faults, but use another mechanism to roll back or reconcile the problem on the occasions when that happens. Provided that these faults are rare, the timely algorithm would still lead to large savings.

Using our method, we give three examples of timely algorithms. We show how these timely algorithms address the faults, save extra communication (avoiding incast storm problems), and increase the throughput of the system compared to their asynchronous counterparts.

Details to come.

Monday, April 1, 2019

Hacking the simulation

According to this piece, two billionaires (Elon Musk and Peter Thiel?) want to help break humanity out of a giant computer simulation.
"Many people in Silicon Valley have become obsessed with the simulation  hypothesis, the argument that what we experience as reality is in fact fabricated in a computer; two tech billionaires have gone so far as to secretly engage scientists to work on breaking us out of the simulation."
In this post, I indulge this view, and in the spirit of April 1st, I go with it.

Hacking the simulation by breaking its multitenancy and overloading the VMs

What do the aliens' computers look like? Do they have racks and datacenters? Or are their computers more at the molecular/analog level, avoiding the inefficiency of overlaying a digital layer on the physical one? Quantum bits seem so bizarre that I think they follow directly from an elemental/basic construct of the aliens' simulation hardware.

Anyways... Since these simulations are very detailed, and even simulate branching multi-verses, they should have distributed clusters dedicated to the simulation. And if the aliens are worth their salt, they should be using multitenancy to prevent underutilization of hardware and reduce the costs of these heavy simulations.

Is this simulation at the person level? I think it is at least at that level, if not at an even finer level. Anything larger would be too coarse. So let's bet on the person level.

If this is a distributed multitenant simulation deployment, many of us (i.e., many people) share the same machine for the simulation. And we are overpacked based on the statistical likelihood that when some of us use more resources, others use less. This way we co-inhabit the same machine in a crowded manner without any problem. This multitenancy setup helps keep computation costs low, a fraction of what they would be without multitenancy.

Based on these assumptions, we can try to trick the multitenancy system into overloading some machines. The trick is to first do nothing, and let the load-balancing system pack way too many of us together on the same machines. If, say, 100 million of us do nothing (maybe by closing our eyes and meditating and thinking nothing), then the forecasting load-balancing algorithms will pack more and more of us onto the same machines. The next step, then, is for all of us to get very active very quickly (doing something that requires intense processing and I/O) at the same time. This has a chance to overload some machines, making them run short of resources and unable to meet the computation/communication needs of the simulation. Upon being overloaded, some basic checks will start to be dropped, and the system will be open for exploitation during this period.
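In the spirit of the post, here is a back-of-the-envelope sketch of the overcommit gamble being exploited. All capacity and demand numbers are made up for illustration: the load balancer bets that only a few tenants are busy at once, and a coordinated spike breaks that bet.

```python
# Toy model of statistical overcommit. A machine is packed with more
# tenants than it could serve if they were all busy simultaneously.

CAPACITY = 100        # resource units one machine can serve
TENANTS = 50          # tenants packed onto the machine
QUIET, BUSY = 1, 10   # per-tenant demand in the two modes

def machine_load(busy_tenants):
    """Total demand when busy_tenants are active and the rest are quiet."""
    return busy_tenants * BUSY + (TENANTS - busy_tenants) * QUIET

# Normally only a handful of tenants are busy at once, so 50 tenants fit.
assert machine_load(5) <= CAPACITY        # 5*10 + 45*1 = 95
# Coordinated spike: everyone gets busy at once and the machine overloads.
assert machine_load(TENANTS) > CAPACITY   # 50*10 = 500
```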

*wink, wink* Maybe this is why @elonmusk invests his time in his Twitter account to reach many millions of followers. He keeps posting memes, rap singles, and controversial statements, and keeps increasing his follower count. Maybe one day he can convince his followers and coordinate them to give the above scheme a try.

How do we exploit the system in this vulnerable window?

In this vulnerable window, we can try to exploit concurrency corner cases. The system may not be able to perform all of its checks in an overloaded state.

Maybe then we should try to find corner cases like Douglas Adams's description of flying: you jump forward, and while going down you get distracted by something, miss the ground accidentally, and that is how you fly.

We can also try to break causality. Maybe by catching a ball before someone throws it to you. Or we can try to attack this by playing with the timing, trying to make things asynchronous. Time is already a little funny in our universe with the special relativity theory, and maybe in this vulnerable period, we can stretch these differences further to break things, or buy a lot of time.

What are other ways to hack the system in this vulnerable window? Can we hack the simulation by performing a buffer overflow? But where are the integers and floats in this simulation? What are the data types? How can we create a typecast error or an integer overflow?

Can we hack by fuzzing the input? Like by looking at things funny. By talking to the birds or jumping into the walls to confuse them.

What are other possible ways? I am not well versed with cracking/exploiting computer systems. Come on, we should all do our parts.
