Friday, May 29, 2020

Curiosity-driven research

It seems like 20th century was the golden age of science and technology. Just look at that list, so many breakthroughs in math, physics, astronomy, biology, medicine, energy, electronics, computing. Some of these include:
  • set theory, topology, abstract algebra, formal logic, incompleteness theorems, theory of computation
  • special relativity, general relativity, quantum mechanics, fundamental interactions, electroweak interaction, nuclear fusion
  • Big Bang theory, space probes, moon landing, space station, hubble space telescope
  • DNA structure, human genome project, antibiotics, many vaccines, many drugs
  • Transistor, semiconductor, integrated circuits, MOS image sensors, CPUs, radio, TV, Internet

What about the breakthroughs in the last 30 years, 1990 onwards? There isn't any significant breakthroughs we can point to in basic sciences except for detection of gravitational waves and Higgs boson. On the applied side of things we have the human genome project and personal computing revolution we can show. We are getting comfortable and cushy alright.

The middle of 20th century was the golden age of Science Fiction. There was so much optimism about the future. Recently I re-watched Back To The Future trilogy with my kids. In the second one, they travel to the future, 2015, where they show flying cars. It was so tragicomic watching that.

We wanted flying cars, instead we got 140 characters. In 2020, we are in the golden age of viral videos (I am serious, check the link). We are brought to our knees by a pesky virus. Oh, but no need to worry, we are good at serving ads and making people click at them to buy groceries online. And we have contact tracing apps. And social networks that are amplifying ignorant views. Last night George Soros was trending on Twitter for inciting Minneapolis riots, and a significant portion of the population believes Bill Gates want to chip people, and he started Covid-19, which is by the way fake and not to be believed anyways, but then again it is better to destroy 5G towers to be sure.

I am aware that we need some time for noticing the impact of some of the work that happened in the last 10 years. But still we are visibly behind, we got a lot of catching up to do. This is a very bad record, C+, at best.

What went wrong? And how can we right this?

We have been doing a lot of application and utility motivated research in the last several decades. All this application/utility inspired research depleted our basic research deposits/reserves, and our progress/returns slowed to an halt. We got so fascinated with application of techniques, that we neglected to refill our basic research deposits, and now we are in scientific debt. We don't have enough basic research deposits to leverage and apply anymore.

Many of the 20th century advances in science and technology built on the Golden Age of Physics, Chemistry, Math that came before it. Although there were necessity driven inventions accelerated by response to war, cold war, and economic development, these inventions had fully charged basic science reserves they could draw on. For example, GPS drew on atomic clocks, which drew on basic research on physics, and was motivated for testing relativity effects.

We need to fund basic research to pay our scientific debt, and build up our basic research reserves. We have been understanding to build, without replenishing our knowledge pool. It is time to build to understand, to pay our scientific debt.

Consider Pasteur's quadrant. We have been mostly doing engineering work (understanding to build) in Thomas Edison's quarter, and even that not very successfully recently. Funding agencies has been calling for use-inspired research to move one step away from application for application-sake. But that is not enough! We need to move closer to pure basic research (building to understand).

And what is wrong with fourth quadrant anyways? We should also accept some futility-inspired research. Many practical outcomes arose from theoretical math. Poor Hardy must be spinning in his grave considering how many applications number theory found. Here is a relevant presentation summarizing the ideas in a recent book: "Why greatness cannot be planned: the myth of the objective."

So here is what we should do. We should pick harder/deeper questions not driven by any immediate utility but driven by curiosity.

And there is good news. Working on basic research is not much harder than working on applied research. Applied researcher needs to jump through hoops to show novelty and distinguish from other work. If we just used the same energy instead to work on something that is inherently novel (but not necessarily practical), we would cover more distance.

Thursday, May 21, 2020

SLOG: serializable, low-latency, geo-replicated transactions

This paper is by Kun Ren, Dennis Li, and Daniel Abadi, and it appeared at VLDB 2019.

This paper is about providing strict serializability in geo-replicated databases. Strict serializability implies that all reads within a transaction must see the value of any writes that committed before the transaction began, no matter where that write was performed world-wide. Furthermore, if a transaction, A, begins after (in real time) transaction B completes, no client can see the effect of A without the effect of B.

Since a strict serializability system behaves like it is running on a single machine processing transactions sequentially, this reduces application code complexity and bugs. However, strict serializability comes with a cost. Current state-of-the-art geo-replicated systems cannot provide strict serializability  alongside low latency writes and high transactional throughput.

To achieve all three (strict-serializability, low-latency writes and high transactional throughput), SLOG uses locality in access patterns to assign a home region to each data granule. Reads and writes to nearby data occur rapidly, without cross-region communication. However, reads and writes to remote data, and transactions that access data from multiple regions (i.e., multi-home transactions), must pay cross-region communication costs. Nonetheless, SLOG uses a deterministic architecture to move most of this communication outside of conflict boundaries, enabling these transactions to be processed at high throughput. More specifically, SLOG relies on deterministic processing to avoid two phase commit. Once all parties agree to the plan, processing occurs (mostly) independently on each node, with the system relying on the plan's determinism in order to avoid replica divergence.

Unfortunately, in order to create a deterministic plan of execution, more knowledge about the transaction is needed prior to processing it relative to traditional nondeterministic systems. Most importantly, the entire transaction (information regarding which data will be accessed by the transaction) must be present during this planning process.

SLOG borrows the deterministic architecture from Calvin, and is implemented leveraging the open source Calvin codebase. However, SLOG improves on Calvin in the following ways. In Calvin, every transaction, no matter where it originates from, is sequenced by a global Paxos process. This enables Calvin to have complete knowledge of the input to the system while planning how a batch of transactions will be executed. Of course, this comes at the cost of requiring every transaction to pay the cross-region latency to run Paxos across regions. SLOG removes the global Paxos process in order to reduce latency, but this causes unawareness of transactions submitted to replicas located in different regions during the planning of transaction processing. We will see how SLOG coordinates the transactions, classifying them as single-home and multi-home, with as little communication as possible.

In my summary below I use many sentences lifted from the paper. The paper is well written, and I wouldn't be able to improve on many of these explanations.

SLOG overview

SLOG uses a master-oriented architecture, where every data item is mastered at a single "home" replica. Writes and linearizable reads of a data item must be directed to its home replica. Each SLOG region contains a number of servers over which data stored at that region is partitioned (and replicated). Some of this data is mastered by that region (it is the home region for that data) and the rest is a replica of data from a different home region.

Each region maintains a local input log which is implemented via Paxos across its servers. This local log only contains transactions that are expected to modify data mastered at that region. This input log is sent to all other regions in batches. Each batch is labeled with a sequence number, and the other regions use this sequence number to ensure that they have not missed any batches from that region.  Regions use deterministic transaction processing to replay the local log from other regions. By continuously replaying local logs from other regions, the system is able to support local snapshot reads of data mastered at other regions at any desired point in the versioned history.

Single-home transactions

  1. When a region receives a transaction to process, it sends a message to its Lookup Master to obtain its cached value for the home of each granule accessed by the transaction. The returned locations are stored inside the transaction code. If every granule is currently mapped to the same home region, the transaction becomes initially assumed to be a single-home transaction, and is shipped to that region.
  2. Once the (assumed) single-home transaction reaches its home region, it is appended into an in-memory batch of transactional input on the server at which it arrives, and this server appends this batch to that region's input log via Paxos.
  3. A separate Paxos process interleaves batches from the local log with batches from the local logs that are received from other regions in order to create that region's view of the global log. 

All three batches appear in the global log of all three regions. However, the order in which these batches appear in the three respective global logs is different. The only guarantee is that 1-2 will appear after 1-1, since they originated from the same region. If all transactions are single-home, then it is guaranteed that each region's local log will access a disjoint set of database system granules (i.e., transactions across local logs do not conflict with each other). Therefore, the limited degree to which the global logs are ordered differently across different regions will not cause replica divergence.

This being a deterministic system ensures that all data progress through the same sequence of updates at every replica, without runtime coordination. Since home metadata is part of the data granule, the metadata is updated at the same point within the sequence of updates of that granule at every region. Therefore, any assumed single-home transaction that is not actually single-home will be exposed as non-single-home independently at each region (i.e., each region will independently, without coordination, observe that it is not single-home, and will all agree to abort the transaction without any cross-region communication). The transaction is then restarted as a multi-home transaction. Similarly, if an assumed single-home transaction is indeed single-home, but the assumed home is incorrect, all regions will see the incorrect home and counter annotation and independently agree to abort the transaction.

Multi-home transactions

  1. All multi-home transactions, no matter where they originate, must be ordered with respect to each other. For this SLOG employs a global log for ordering the multihome transactions with respect to other multi-home transactions and sends them to the regions where they are ordered with respect to single-home transactions.
  2. A multi-home transaction exists in several different locations in a region's global log. There will be an entry containing the code for that transaction, and then separately will come several LockOnlyTxns entries, one from each region that houses data expected to be accessed by that transaction. The code can start to be processed when it arrives, but it will block whenever it tries to access data for which the corresponding LockOnlyTxn has yet to complete.
  3. LockOnlyTxns exist to specify how the multi-home transaction should be ordered relative to single-home transactions at that region. Depending on where the LockOnlyTxn gets placed in the local log of that region, it will ensure that the multi-home transaction will observe the writes of all transactions earlier than it in the local log, and none of the writes from transactions after it. When the region has all the locks in its log, it executes the code. 

The example in the figure assumes that the code for the multihome transaction is already disseminated to the regions, and it only shows the dissemination of the LockOnlyTxn to the regions. At Region 0, InsertIntoLocalLog(T2) is called after it has placed single-home T1 into its local log. It therefore places its generated LockOnlyTxn for T2 after T1. InsertIntoLocalLog(T2) is called at Region 1 between placing single home transactions T3 and T4 into its local log and thus places the LockOnlyTxn for T2 it generates there. T2's LockOnlyTxns are ordered differently at each region is not problematic since LockOnlyTxns always access disjoint data and thus commute.

SLOG's deterministic locking scheme acquires all locks prior to processing a transaction and releases all locks after commit. Thus it is a form of two-phase locking (2PL). In any schedule produced by a 2PL implementation, if locks are held until after the transaction commits, and all reads read the value of the most recent write to that data item, then the resulting schedule is strictly serializable.

The use of a global log for multi-home transaction ordering was an interesting point of discussion in our Zoom reading group. We discussed whether it could be possible to order the multi-home transactions by other means, say using version vectors included in their metadata. The global log provides a central point which increases latency, as we will see in the latency graphs in the evaluation.

Dynamic remastering

SLOG does not require any remastering of data to process multi-home transactions, but does perform dynamic data remastering as access patterns change over time. A remaster request updates metadata of a single data granule and is thus a single-home transaction. The request is sent to the home region for that granule, which will insert the request in its local log, which will eventually cause the request to appear in the global logs of all regions. When each region processes this request in its global log, it updates the granule metadata. The Lookup Master of that region is also asynchronously updated to reflect the new mastership information.

However, caution is required as this example demonstrates: Region 2 places the local log from region 0 that contains T3new before the local log entry from region 1 that contains T2. Thus, it sees a different serial order: T1, T3, T2. This leads to potential replica divergence. The counter part of the metadata is used to circumvent this danger. Prior to requesting a lock on a granule, SLOG compares the counter that was written into the transaction metadata by the LookupMaster at the region the transaction was submitted to with the current counter in storage at that region. If the counter is too low, it can be certain that the LookupMaster that annotated the transaction had out of date mastership information, and the transaction is immediately aborted (every region will independently come to this same conclusion). If the counter is too high, the transaction blocks until the region has processed a number of remaster requests for that data granule equal to the difference between the counters.


Since SLOG relies on deterministic processing, it was implemented over the open source Calvin codebase [2], by adding the two Paxos logs per region and the Lookup Master index, and processing transactions as described above.

We can see that SLOG is not able to do as high throughput as Calvin, because Calvin uses a single point for serialization and does not have aborts ever. On the other hand, due to exactly the same reason, we see that SLOG improves the latency of Calvin in WAN. Notice in the latency graph that for 100% multi-home transactions, SLOG latency degrades to that of Calvin.

Here SLOG is compared with Spanner. One point to keep in mind is that Spanner supports SQL-API and more general transactions, whereas SLOG is restricted to use a deterministic architecture.

Here is the paper presentation video from our Zoom reading group.

Saturday, May 16, 2020

Clash of civilizations: Medical versus the World edition

During this Covid-19 pandemic, I have been staying home and putting a lot of trust in medicine assured that they will find a way to get us through this. What else can I do? I don't know about the domain, so I defer to the experts.

But, several months in to the quarantine, with the deluge of disheartening news about lack of progress on this problem, I am getting more and more worried, anxious, and restless. I am sensing I am not alone. There is a clash of civilizations brewing between medical and lay people, and maybe more relevant for my domain, between medical and IT people.

OK, this is how we will do this. I will first give you off-the-cuffs comments from the Cynical Murat. I know that Cynical Murat is wrong in many places, because I don't know anything about medicine and he is a caricaturized version of myself to voice my insecurities/worries about the situation. (Oh God, this is getting weird.) So I can't just leave you with his rant. I follow that up with a response from another IT researcher/practitioner, but someone who had collaborated with medical people a lot in the last 10+ years.

Rants from the Cynical Murat 

Let me start by saying that I applaud the ER doctors/nurses for their bravery and selfless sacrifices. I doubt that I would be able to raise to that level, if I were in their place. (I am cynical, but I am not a jerk.)

My impression about the medical people is that they are failing this test. They have been very slow to disseminate best practices and act on them. The invasive ventilators do not seem effective (very low survival rates) but they are being used, because it is the "procedure". The proning technique was still not practiced even after months. They are slow to act on re-purposing existing drugs. We still don't have a good understanding of the illness. Antibody and vaccine development has uncertain time horizons.

If the entire world, 7 billion people, the governments, the economies, were holding their breaths waiting for an "IT problem" to be resolved, regardless of how difficult that problem is, we would see much better progress in the IT domain. For example, the Y2K problem was resolved without any fuss. People came together, best practices were openly shared, and things got straightened out quickly.

Why don't we see a similar coordinated quick response from Medical side?

1. Is it because IT/tech people have better tools and networking? Then we should try to help by giving the medical people better tools and networking support for collaboration.

2. Is it because the IT domain problems are intrinsically easier? (If we had 1% crash rate in servers in a cluster, we wouldn't even care. We would just replace crashed servers periodically, as it is not a big deal. Nobody dies.)

3. Or is it because of the intrinsic "slow" culture in medicine? The culture in IT/tech is innovate and disrupt. Try many different attacks on a problem, including peripheral attacks from left field. When the prize is this big, IT/tech would have been crushing things. Maybe there is also an incentive problem in medicine? The scientists don't see enough of the prize, which prevents the hordes of scientists working on the problem.

I think we need some of that IT chutzpah to attack this problem. For example, take Paul Buchheit. He doesn't know shit about the problem domain, but he is trying and he is ambitious.

Well, that is my rant. I realize the domains are very different, and I don't know anything about medicine. But there seems to be a big mismatch between the effort/results out there compared to what this crisis requires. And this is not even a very potent bug, I am afraid of worse things to come.

Response from my wise friend 

First, realize this:
   “There is much more to health care than algorithms.”
(Got this quote, by a doctor at Iowa, in an MIT Tech Review article about Google's AI)

Second, the physician I work with, an infectious disease specialist, says we have known about and been using proning for twenty years.

Third, the common cold is a coronavirus but it is so mild it is not a catastrophic epidemic. Why hasn't medicine been able to cure or have a vaccine against common cold? Why don't we have vaccines for malaria, for AIDS, etc? Maybe some things are hard, maybe the coronavirus family (in general) is one of those, but now more money and intensive efforts are being thrown at it than ever before, because SARS-CoV-2 is a leading cause of death, for now.

Fourth, about being slow, it is true that medical research is deliberate. Two famous events (and there are more) have cause medicine to be deliberate. One was thalidomide which tempered the free-wheeling pharmacy attitudes; another was the Tuskegee study. Because of these and other examples, the formal system of ethics impedes progress in medicine. And in the US, we have this horrible public/private, insurance/profit bureaucracy. More recently, there is a call for even more careful and deliberate research due to the replication crisis, p-value hacking, and the like.

Fifth, you (like countless VC types) have observed that software people are super smart and could solve these things quickly if given the chance. So why don't you? You are much smarter than Elizabeth Holmes :)
(Murat's remark: I am still unclear if this was written to be sarcastic or super sarcastic.)

Also, you might want to take a look at this, which, I think, indicates some in the medical community think quite a  lot about their mission and methods.


Well, the medical people, they think differently, they work differently, but they are very smart people. So the Cynical Murat is jumping to conclusions prematurely.

We the IT people should first make sure we do all we can to help, before we start to criticize Medical research. For example, we still couldn't get our shit together solving the disinformation warfare problem, which is a critical problem during these trying and challenging times. (Yes, I used this in a sentence finally without being sarcastic.) It seems hypocritical to criticize the medical people about being slow, before we could do that.

There might also be several other important problems we can and should help with.

Ahmet's Unity project

Around the beginning of the quarantine, my son, Ahmet (age 12), has started working on the Unity framework. Unity is a popular game engine, like Unreal Engine 4, and Godot. It was launched in 2005, aiming to democratize game development. It is very versatile and beginner friendly. There are many YouTube tutorials about Unity. It is also powerful, as it includes 2D, 3D terrain engines, physics simulator, real-time dynamic shadows, graphics rendering, networked multi-player support, etc. Unity was used for building many amazing games, including Call of Duty: Mobile.

Ahmet's Unity journey 

I would have loved to say that I supported Ahmet in his quest to learn Unity. But I am just a professor, I am hopelessly disconnected with cool new programming environments/frameworks. I couldn't even help him install the thing, when he had difficulties in the beginning. This was a 5GB installation, and he would ran into problems in the last GB, and also had problems with package dependencies.

I was quick to give up. I tried to convince him that Python is the way to go. He had done some work using PyGame last year, so I insisted that he continue on that. I told him that real programmers work using a text editor, not with complicated integrated frameworks. What programming framework requires a 5GB download anyways? The first programming IDE I used, Turbo Pascal, only required 39Kbytes, smaller than many 1 page pdf sizes today. I told him that I won't (can't!) help him with installation of the Unity framework.

Ahmet didn't give up and found a way to install Unity. Using an archive copy, he first downloaded Unity, and then downloaded packages as add on.

He followed YouTube videos and tutorials and tried his hands on a couple games. The first was a simple box racing game. The second one was a multi-player first person shooter in an arena. He added things like fog and smoke. He said these were particles and relatively easy to add. These games looked very cool. I am an old school guy. I didn't even know it was possible to code such graphically rich physics.

For Unity he learned scripting in C#. He says he likes C# much better than Python. When he had to do a math homework with Python, he said he forgot all about the Python syntax, since he had been working with C# for some time. (Welcome to the club, buddy.)

A couple weeks ago, Ahmet started his real project: a parkour plus first-person-shooter/katana-swinger/grappler-gun-handler. He prepared a checklist on paper with approximate completion times for each task. According to his estimation this will take 2-3 months to finish.


Ahmet is one of the winners of the Covid19 crisis. He told me he is very happy that schools got canceled. He said he was looking forward to summertime to have undisturbed programming time, but with the school cancellation, he was able to get started on Unity much earlier. He told me he doesn't need school anymore (uh oh).

I am happy he is able to learn on his own. He has been working hard on Unity. We share the same study room with Ahmet, and he has been in deep concentration making steady progress with Unity in the last couple months. I haven't seen him this engaged with anything (except of course when playing computer games).

I like that there is a good community around Unity. Without a community it is hard to make significant progress. It looks like there is a scenius effect going on there. This effect was pretty obvious in Hackers at MIT, and Crypto communities, both captured perfectly in books from Steven Levy.

Inspired by many in the Unity community, Ahmet also started a YouTube channel to share the progress of his project. He told me this is called a Devlog. This keeps him motivated and on-track. He calls his channel "The Unity Noob". (I like that he knows being humble is a competitive advantage.) He didn't expect any one to watch his videos, but he is getting followers and many comments for his videos.

He asked me to give him a shout-out on my blog and on Twitter. Well, if the spirit moves you, and you want to show some encouragement and support for him, please subscribe and like his videos.

Going forward

I don't know if I should be concerned. He is learning advanced programming concepts on his own, by imitation and hacking. He will likely develop some bad habits, and these may do some damage when he takes his first proper programming course.

But I think that is a minor concern and it is totally compensated by the self-learning and interest going on thanks to this project. He is not only building things, he is also learning how to communicate and explain those things. That is amazing.

Probably the best thing I can do is to not get involved and get in his way. This is very easy to do, and being lazy, I will comply. If in the future he needs help with distributed systems, I can have "finding the right Paxos variant" conversation with him.

Mergeable Replicated Data Types

This paper is by Gowtham Kaki, Swarn Priya, KC Sivaramakrishnan, and Suresh Jagannathan, and it appeared in OOPSLA 19.

We all know and love CRDTs. Using CRDTs, any replica can accept updates without remote synchronization. The updates may reach to replicas and applied in each in different order, but provided that the updates commute each replica converges to the same state eventually. CRDTs provide out-of-the-box eventual consistency to replicated data types with commutative operations, examples include set data type with member-addition/removal and counter data type with addition/subtraction.

Unfortunately CRDTs are limited to commutative operations. If we also wanted to have multiplication for the counter data structure, since that won't commute with addition and subtraction, we would be in trouble. MRDTs extend CRDT idea to allow non-commutative operations for certain data structures.

The idea in MRDTs is to use a 3-way merge formula. If you have an extra bit of information, the least common ancestor (LCA), you can merge the two replica states v1 and v2 leveraging the LCA as the starting point context. The paper notes that this is similar to how Git operates when merging branches.

Let's consider this idea for the counter data structure (with both addition and multiplication operations). Consider two replicas starting from LCA l=5. One applied multiplication by 2 and ended up V1=10. The other applied addition with -1 and ended up with V2=4. The three way merge suggested for v1, v2, and l is as follows: merge= L + (V1-L) + (V2-L). This is applied in each replica unilaterally, without coordination, and each replica ends up with value 9 (i.e., 5 + 5 -1).

Note that this mergeable counter does not guarantee linearizability (for instance, if the concurrent operations in Fig. 2 are mult 2 and mult 3, then the merge result would be 25 and not 30). Nonetheless, it guarantees convergence, with a somewhat meaningful semantics.

Let's consider the three-way merge idea for sets. Given two replicas {1,2} and {2,3,4}, without additional context, we would say that the merge would be {1,2,3,4}. But if we knew that the LCA for these two replicas was {1,2,3}, we would be able to infer, there was a removal of 3 for the first replica/branch, and removal of 1 and addition of 4 on the second replica/branch, and thus the merge result would become {2,4}. This is indeed the case for applying the 3-way merge formula adopted for the set data structure:
merge = $(L \cap S1 \cap S2)  \cup (S1 \setminus L) \cup (S2 \setminus L) $

Invertible relational specifications

The neat thing in the paper is that this same 3-way merge idea and semantics  generalizes to the lists, queues, priority queues, binary trees, etc. The paper presents a general instantiable merging framework with the use of abstraction/concretization relation. The abstraction idea is useful because it is easier to define this mergeable property at the abstract level. Moreover, since several data structures share the same abstract data-type, the approach becomes simple to use.

A data structure can be uniquely characterized by the contents it holds, and the structural relationships defined among them. The paper identifies a member relationship and occurred before relationship and show that these two relationships are enough to model many data-structures at the concrete level in the abstract. And, best of all, the general formula is the same as the formula for the counter data structure, but with the member and occurred-before relationships changed to fit the data structure.

The set data structure discussed above is considered as the base abstract data structure all the other data structures can be mapped to. The paper shows that many data structures can be mapped losslessly to the rich domain of relations over sets, wherein relations so merged can be mapped back to the concrete domain to yield consistent and useful definitions of these aforementioned concepts. For these data structures, the merge semantics for arbitrary data types can be automatically derived, given a pair of abstraction ($\alpha$) and concretization ($\gamma$) functions for each data type that map the values of that type to and from the relational domain (the pair ($\alpha,\gamma$) is an *invertible relational specification* of the type).

The paper provides an OCaml library that lets you derive MRDTs from ordinary OCaml data types. The @@deriving merge annotation tells the compiler to automatically derive the merge function.

The video presentation of the paper by Gowtham is easy to follow, and explains  the motivation and the basic idea clearly.


MRDTs also have several limitations. They don't apply to arbitrary complex data structures or objects. To merge concurrent versions, the 3-way merge formula ignores the operations and instead focus on the difference between each version and the LCA.  This is why this idea may not work well for operations that require a precondition to execute. The precondition may be satisfied in one replica and not satisfied in the other replica.

The paper appeared in a programming languages conference, and several parts in the paper are hard to follow for people outside that domain. This paper has good applicability to distributed systems and especially distributed databases. A nice followup to this paper could be to implement this in a NoSQL database, such as  MongoDB or Cassandra, or even Voldemort (for simplicity). Without a Git based infrastructure, how feasible is it to implement MRDTs in these data stores?

Another interesting thing would to be show applications of MRDT idea in the context of coordination avoidance in database systems.

Thursday, May 7, 2020

Stable and Consistent Membership at Scale with Rapid

This paper is by Lalith Suresh, Dahlia Malkhi, Parikshit Gopalan, Ivan Porto Carreiro, and Zeeshan Lokhandwala, and it appeared in USENIX Annual Technical Conference 2018.

In datacenters complex network conditions may arise, such as one-way reachability problems, firewall misconfigurations, flip-flops in reachability, and high packet loss (where some-but-not-all packets being dropped). Services became even more prone to these gray failures as processes are typically hosted in VM and container abstractions, and networks are governed more and more with software-defined-networking (SDN). Furthermore, everyday several cloud-side service upgrades happen, which create potential failure risks.

Despite being able to cleanly detect crash faults, existing membership solutions struggle with these gray failure scenarios. When you introduce 80% packet-loss failures in 1% of nodes, the correct nodes start to accuse each other and the membership view becomes unstable. The experiments in the paper show that Akka Cluster becomes unstable as conflicting rumors about processes propagate in the cluster concurrently, even resulting in benign processes being removed from the membership. Memberlist and ZooKeeper resist removal of the faulty processes from the membership set but they are unstable for prolonged period of time.

This behavior is problematic because unstable and flapping membership views may cause applications to repeatedly trigger expensive recovery workflows, which may degrade performance and service availability. (Side remark: In practice, though, there is always grace periods before triggering expensive recovery operations. Applications are already designed to be tolerant of inaccurate detections, and expensive operations are delayed until enough evidence amounts.)

Rapid Membership Service Design

To address the stability and consistency problems with existing services, the paper presents the Rapid membership service, which consists of three components.

1. Expander-based monitoring edge overlay.
Rapid organizes a set of processes (a configuration) into an expander graph topology for failure detection. Each process (i.e. subject) is assigned K observers that monitor and disseminate reports about itself. This is achieved by overlaying K rings on the topology. Each process observes the successor in a ring, and is observed by its predecessor in a ring. This way every process p is monitored by K observer processes and it is responsible for observing K subjects itself. (Who watches the watchers? Well, other watchers! This is a self policing system.)

2. Multi-process cut detection.
If L-of-K correct observers cannot communicate with a subject, then the subject is considered observably unresponsive. For stability, processes in Rapid delay proposing a configuration change until there is at least one process in stable report mode and there is no process in unstable report mode. For this another parameter H is used such that, 1 ≤ L ≤ H ≤ K. A stable report mode indicates that p has received at least H distinct observer alerts about s, hence we consider it “high fidelity”. A process s is in an unstable report mode if tally(s) is in between L and H. It may be that tally did not reach H because the other observers for s have been down by themselves. If those observers are considered unresponsive by some processes, these reports are used to bring the tally above H. If that fails, there is a timeout to take s out of unstable report mode, so that a configuration change can be proposed.

3. Fast consensus.
The paper argues that the above cut detection suffices to drive unanimous detection almost everywhere. Rapid uses a leaderless Fast Paxos consensus protocol to achieve configuration changes quickly for this common case: every process simply validates consensus by counting the number of identical cut detections. If there is a quorum containing three-quarters of the membership set with the same cut (i.e., a supermajority quorum), then without a leader or further communication, this is adopted as a safe consensus decision. If this fails, the protocol falls back to a classical consensus protocol.


Rapid is implemented in Java with 2362 lines of code (excluding comments and blank lines). The code is opensource at

The paper reports on experiments on a shared internal cloud service with 200 cores and 800 GB of RAM (100 VMs), where the number of processes (N) in the cluster are varied from 1000 to 2000. (They run multiple processes per VM, because the workloads are not CPU bottlenecked.)

Here is a video of the presentation of the paper in our Zoom DistSys Reading group.


We had a productive discussion about the paper in our Zoom DistSys Reading Group. (You can join our Slack workspace to get involved with paper discussions and the Zoom meetings.) Comments below capture some of that discussion.

1. Problem definition
The paper says that the membership service needs to have stability (robustness against gray failures, flip-flops) and consistency (providing the same sequence of membership changes to processes). We also need to have quick reaction to the changes. But what is the right tradeoff of these conflicting features? If the speed is high and captures the membership changes quickly, the nodes getting information at different times will see different views/epochs of the group membership as even the supermajority group leaves behind a quarter of the nodes. This would lead to a loss of stability.

It looks like the right tradeoff between speed and stability would be application dependent, and this makes the problem specification fuzzy and fickle.

2. Cassandra membership service problems
The paper mentions this:
In Cassandra, the lack of consistent membership causes nodes to duplicate data re-balancing efforts when concurrently adding nodes to a cluster [11] and also affects correctness [12]. To work around the lack of consistent membership, Cassandra ensures that only a single node is joining the cluster at any given point in time, and operators are advised to wait at least two minutes between adding each new node to a cluster [11]. As a consequence, bootstrapping a 100 node Cassandra cluster takes three hours and twenty minutes, thereby significantly slowing down provisioning [11].
This is true for scaling an existing hash ring by adding new members. There the gossip-based eventual consistency membership service becomes a bottleneck and require a two minute wait time for safety between adding each node to the cluster. On the other hand, if you are bootstrapping 100 nodes from scratch with no hash ring, it is possible to get all those nodes running in parallel, and construct the hash ring after stabilization of the membership.

3. Using ZooKeeper as a membership service 
The paper has this to say about membership management with ZooKeeper:
 Group membership with ZooKeeper is done using watches. When the ith process joins the system, it triggers i-1 watch notifications, causing i-1 processes to re-read the full membership list and register a new watch each. In the interval between a watch having triggered and it being replaced, the client is not notified of updates, leading to clients observing different sequences of membership change events. This behavior with watches leads to the eventually consistent client behavior in Figure 7. Lastly, we emphasize that this is a 3-node ZooKeeper cluster being used exclusively to manage membership for a single cluster. Adding even one extra watch per client to the group node at N=2000 inflates bootstrap latencies to 400s on average.
In practice ZooKeeper is typically used for managing membership for small clusters.

4. What do large scale cloud systems use as membership services?
Microsoft uses service fabric (available as opensource). The service fabric paper is a good companion to read with this Rapid paper to compare and contrast approaches. Here is an overview summary of Service Fabric paper.

I had recently covered Physalia, which provides a membership service over Amazon Elastic Block Storage (EBS). Physalia had very nice ideas for minimizing blast radius of failures, and for relocating membership service close to the nodes it is monitoring to alleviate effects of network partitions.

Google may be using some membership services based on Chubby. But they must have other solutions for membership services for large scale clusters.

Saturday, May 2, 2020

Our Europe trip

Last summer I attended the Sigmod 19 conference to present our work on Dissecting performance bottlenecks of strongly-consistent replication protocols. We used the conference as an opportunity to have a short family trip in Europe. Now, it is nice to reminisce about the times when we could roam free without worries about a pandemic.

Our trip had three legs. Here is how the trip went.

Paris, France

My wife and I had been to Paris in 2006, and loved it. We wanted to revisit the place. We also thought it would be good for our three kids (11, 7, and 4) to see Paris.

We flew to Paris, and from the airport, we headed to take the metro to Paris city center. It wasn't hard to figure out which train to take after we purchased the tickets. When the train arrived my wife and the two kids get on it with the luggage, and I was trying to board my little daughter with the stroller on the train. The entrance to the train was crowded with several people trying to enter. And I felt someone reaching to my wallet in my cargo pocket. I lost it, I got really angry. There I had all my credit cards, ids, and money ($700 of emergency cash if in case we run into a problem with the credit cards). I turned around and saw two women. I think one of them was covering while the other was pickpocketing me. I lost it and started shouting. What are you doing? How dare you try to steal my wallet?  The women said "no, no". But then they turned away and ran. One of the American tourists on the train said, "Yes I saw them, they were following you and trying to steal your wallet". I thought, "Thanks buddy, why didn't you say anything?"

Well, that is how our Paris trip started. Our AirBnB was at a good location, close to Paris center. Unfortunately it did not have any AC, so it was hot and humid. This was during a record hot wave for Paris. Anyways, we were determined to do a lot of sightseeing, and we did OK.

First day, we went to Trocadero to get a view of Eiffel tower from across. It was very good. Our kids liked it a lot. We had a stroll along the river bank, and later in the evening, we visited Champs-Elysees.

We spared Tuesday for the museums. Who knew that the Louvre Museum is closed on Tuesdays? Why would you close a popular museum one day a week? What happened to high-availability? Well, the French don't care much about high availability. The two times before I had been to France, there were strikes, and I had to find alternative ways to get to the airport.

Anyways, we skipped the Louvre, braved the burning sun, and waited in queue for the Orsay museum. It was worth it. There were many impressionist paintings. Including Van Gogh and Gaugin. This was nice because, later in Netherlands, I would go to the Van Gogh museum to see more paintings from Van Gogh. The kids were not as enthusiastic about the museum as us. Toward the end, I had to carry my sleeping 4 year old daughter through the museum halls.

Our last day in France we went to the Versailles Palace. It was magnificent. It is hard to believe the quality and richness of the construction in 16th century. This must have been a huge show of luxury and probably sped the demise of the royalty. Even centuries later, it is still magnificent. The scale of the garden and the lake is simply unbelievable.

In Paris, there was a bakery almost every corner. We enjoyed great bakery and cheese for every breakfast. It was also nice being able to buy fruits from the neighborhood stop, instead of a supermarket. Food options were very good. One evening, we found a Turkish doner shop, and ate there.


My father in-law lives in greater Dortmund area, which is near the France and Netherlands borders. We bought train tickets to Germany to visit him for three days.

The night before the train, I checked the Gar De Est website to see if the trains have any delays. I was anxious because our train trip had two connections. To my surprise many trains were canceled, and the remaining trains had 2 hour delays. After playing some web detective with Google Translate, I learned that there was a fire, and it caused the disruption. When we arrived the train station the next morning, we found that the train before our train got delayed by 2 hours, but our train still appeared on time. Our train indeed departed on time, but it was crowded because many passengers from the previous train hopped on our train. We made our first connection on time, but for some reason the second train started to fall behind its schedule, eventually by upto 20 minutes, and caused us to miss our connection to the third train. These were German Bahn trains. I thought Germans were punctual! But it seems like things have been degrading in Germany. A lot of Germans feel very resentful about the unreliability of their trains. We waited for another hour for the next train.  We were surprised to see it very full. We hardly could get on that train, but after a couple stops, the situation improved and we were able to sit.

In Germany, the kids had a blast. They were tired of all the sightseeing we did in France, and they loved staying home and going out for ice-cream with their grandparent four times a day.

To get to Amsterdam, we used trains again. The first leg of the train got canceled due to technical problem. German trains, right? A friend drove us to Dortmund, and we were able to catch our connection there. The ICE train to Netherlands was pretty nice and comfortable.


I didn't think there would be a city that reeks weed worse than Seattle. But there we were. Exiting the Amsterdam Centraal station, the marijuana smell was so stiff for many meters (see I switched to metric system), we were scared the entire city smelled that way. It wasn't so bad, after that first encounter, but yes Amsterdam outcompetes Seattle on the weed smell.

Our hotel in Amsterdam was very nice and comfortable. It was on the south side of the city, in Amstelveen, but the tram system was very nice and punctual and it was very convenient to get to the Amsterdam central.

Sigmod 2019 was held at Beurs van Berlage. This was truly one of the most authentic conference venues I have ever been. It was for sure the most historic one. The building has its own wikipedia page. It is the first stock exchange in the world. This is natural because the stock based funding model originated in Amsterdam as part of building/establishing the city. You see, Amsterdam is built over the sea. Each block of the city was funded by selling stocks for the block. This funding model is credited for the rise of Scientific Revolution in the West. 

The Sigmod reception was held at the Van Gogh museum. It was a blast to visit this museum. Van Gogh started painting after 30, after failing to stick with many other professions. The book "Range: Why Generalists Triumph in a Specialized World" had this to say about Van Gogh:
 Van Gogh was an example of match quality optimization, Robert Miller’s multi-armed bandit process come to life. He tested options with maniacal intensity and got the maximum information signal about his fit as quickly as possible, and then moved to something else and repeated, until he had zigzagged his way to a place no one else had ever been, and where he alone excelled. Van Gogh’s Grit Scale score, according to Naifeh’s assessment, was flush with hard work but low on sticking with every goal or project. He landed in the 40th percentile.
He was told he sucked at painting, but he was undeterred. He kept on painting. He painted a picture every day, and didn't care what people thought of his paintings. That is very brave and very inspirational. But of course there is more to the story. He was part of a scene. He was part of a community of famous painters in Montmarte, Paris. But again, he loved painting, and he persisted in it without much validation. He only started to get recognized in the last two years of his life.

Sigmod also had a canal cruise. The conference banquet was held at Noorderlicht Café accessible via boat. It was a great time.

Overall impressions from the trip

The public transportation is so good in Europe that I am ashamed of the US system. With our limited transportation options, we deny many citizens the freedom to roam around.

The food is so much better in Europe, it is beyond comparison.

People are more laid back in Europe. Compared to Europe, we are all workholics in US. I guess in US we put it on a positive spin and claim we are more passionate and competitive here?

Out of France, Germany, and Netherlands, it was Netherlands that seemed most modern and prosperous. Initially Amsterdam seemed a bit chaotic to us, but the end of the fourth day, we were in love with the place.

Unifying Consensus and Atomic Commitment for Effective Cloud Data Management (VLDB19)

This paper is by Sujaya Maiyya, Faisal Nawab, Divyakant Agrawal, and Amr El Abbadi, and appeared in VLDB19. Here is a video of the presentation of the paper from our Zoom DistSys Reading Group.

Atomic commitment protocols (e.g., Two Phase Commit) provide ACID guarantees for transactional access to sharded data.

Consensus protocols (e.g., Paxos) replicate data across servers in a consistent and fault tolerant manner.

Many prior works have observed the similarities between the commitment and consensus problems. Both consensus and atomic commit protocols aim at ensuring that one outcome is agreed upon in a distributed environment. However, the conditions for achieving this agreement is different in the two cases. Paxos only needs a majority of nodes to be available for a decision to be made whereas 2PC needs votes from all the participants to decide on the final value.

This paper proposes a framework that unites and explains several commitment and single-leader-based consensus protocols under one roof, the Consensus and Commitment (C&C) framework, and then derive protocols to demonstrate the framework. Many of these derived protocols are generalizations of existing protocols, however, some of them are novel in their own right.
  • Paxos Atomic Commitment (PAC) is a distributed atomic commitment protocol managing sharded but non-replicated data. PAC is a variant of 3PC, and integrates crash recovery and normal operations seamlessly in a simple Paxos-like manner. Leader election phase requires a majority quorum response, and the value discovery phase requires all shards to respond unless one of the responses had Decision value true or its AcceptVal value set.
  • Replicated-PAC (R-PAC) is for fully replicated cloud data management, which is similar to Replicated Commit. It is a variant of Gray and Lamport's Paxos Commit. R-PAC is similar to PAC above, but since all nodes maintain identical data, the leader need to only wait for replies from a majority of replicas that have the same InitVal.
  • Finally G- PAC is a novel protocol for sharded replicated architectures, which is similar to other recently proposed hybrid protocols, Janus and TAPIR. G-PAC integrates transaction commitment with the replication of data and reduces transaction commit latencies by avoiding the unnecessary layering of commitment and consensus.
The paper compares the performance of G-PAC with a Spanner-like protocol, where 2PC is used at the logical data level and Paxos is used for consistent replication of logical data. The layering of Paxos and 2PC increases latency, but combining them into a flat hierarchy in G-PAC means that the leader needs to communicate with a large number of nodes, which incurs overhead. If a transaction, T, accesses n shards and each shard is replicated in r servers, there are a total of n ∗ r servers that are involved in the transaction T. Method V with the newly defined notions of super-set and super-majority, which decides the value for the transaction, is described in Algorithm 4. Super-set: majority of replicas (r /2 + 1) from each of the n shards. Super-majority: majority of replicas (r /2 + 1) from majority of shards (n/2 + 1).

The experimental results highlight the low-latency benefits of combining consensus along with commitment into a single integrated protocol. But it looks like the G-PAC evaluation is not done with 15 nodes (i.e., 3 replicas in each of the 5 regions), but only with 5 nodes (1 in each region) where each node uses 2 nearby nodes (within these 5 nodes in 5 different regions) to replicate.

Two-phase commit and beyond

In this post, we model and explore the two-phase commit protocol using TLA+. The two-phase commit protocol is practical and is used in man...