Sunday, April 26, 2020

Scalog: Seamless Reconfiguration and Total Order in a Scalable Shared Log

This paper appeared in NSDI'20 and is authored by Cong Ding, David Chu, Evan Zhao, Xiang Li, Lorenzo Alvisi, and Robbert van Renesse. The video presentation of the paper is really nice and gives a good overview. Here is a video presentation of our discussion of the paper, if that is more your learning style, or whatever. (If you would like to participate in our paper discussions, you can join our Slack channel.)


The problem considered is building a fault-tolerant scalable shared log. One way to do this is to employ a Paxos box for providing order and committing the replication to the corresponding shards. But as the number of clients increases, the single Paxos box becomes the bottleneck, and this does not scale. Corfu had the nice idea to divorce ordering from replication. The ordering is done by the Paxos box, i.e., the sequencer, which assigns unique sequence numbers to the data. The replication is then offloaded to the clients, which contact the storage servers with the sequence number to commit the data. This technique achieves scalability as the number of clients increases.

A limitation of Corfu is that any change in the set of storage servers makes Corfu unavailable until the new configuration has been committed to all storage servers and clients. Corfu requires all the clients and storage servers to have the same mapping function, which maps sequence numbers to specific shards. This paper provides a simple (almost trivial) idea for solving this problem and improving over Corfu to maintain globally ordered shared logs with seamless reconfiguration of the log shards.


Scalog turns the Corfu decoupling strategy on its head with a judo move. By first replicating the record and then assigning a sequence number to the record via a batch watermarking strategy, it solves the unavailability problem Corfu faces during reconfiguration. (This also takes care of the problem Corfu faces with a client that took a sequence number and crashed without replicating it, leaving a gap in the log.)

In Scalog, clients write records directly to storage servers, where they are (trivially) FIFO ordered without the mediation of a global sequencer. Records received by a storage server are then immediately replicated across the other storage servers in the same shard via FIFO channels. Periodically, each storage server reports the lengths of its log segments to an ordering layer. To produce a total order out of these locally/shard ordered log segments, the ordering layer in Scalog summarizes the *fully replicated prefix* of the primary log segment of each storage server in a cut, which it then shares with all storage servers.

The ordering is done at the Paxos box in the ordering layer by releasing version-vector-like watermarks across the sharded logs, based on the fully-replicated log-segment progress heard from each log. The ordering layer interleaves not only records but also reconfiguration events. As a result, all storage servers see the same update events in the same order. The storage servers use these cuts to deterministically assign a unique global sequence number to each durable record in their log segments, using the deterministic FIFO order inside each segment.
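To make the cut mechanism concrete, here is a toy Python sketch of how a storage server might deterministically map cuts to global sequence numbers. The function name and the cut representation (a map from server id to fully-replicated prefix length) are my own illustration, not Scalog's actual code.

```python
# Hypothetical sketch: a cut maps server_id -> fully replicated prefix length.
# Records in the delta between two successive cuts are numbered
# deterministically, e.g. by iterating servers in a fixed (sorted) order.
def assign_sequence_numbers(prev_cut, new_cut):
    seq = sum(prev_cut.values())  # records already numbered by earlier cuts
    assignments = {}              # (server_id, local_offset) -> global seq
    for server_id in sorted(new_cut):
        for offset in range(prev_cut.get(server_id, 0), new_cut[server_id]):
            assignments[(server_id, offset)] = seq
            seq += 1
    return assignments

prev = {"A": 2, "B": 1}   # 3 records were numbered by the previous cut
new = {"A": 4, "B": 3}    # A reports 2 new records, B reports 2 new records
seqs = assign_sequence_numbers(prev, new)
# records A[2], A[3], B[1], B[2] get global numbers 3, 4, 5, 6
```

Because every server applies the same deterministic rule to the same sequence of cuts, no further coordination is needed to agree on the global order.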

Remark: Kafka also provides horizontal scalability of the logs via sharding. However, thanks to the version-vector watermarking, Scalog can impose a global order on the logs, which is very useful for many applications, such as those that need to do multi-object transactions across shards (as in the Scalog-Store application shown in the paper). The global order is also very useful for debugging problems across shards/subsystems, which many deployed systems run into.

As we have seen, this batch-based total order imposition provides both seamless reconfiguration and scalability to Scalog. These two figures explain the Scalog architecture and operation very nicely. The aggregators serve as soft-state buffers that batch communication in front of the Paxos box, to alleviate the communication load the box needs to endure.

In a storage shard, f+1 storage servers are sufficient to tolerate up to f crash failures. Due to its loosely decoupled coordination, which can handle any number of shards you throw at it, Scalog also takes an interesting approach to fault-tolerance. It uses two servers in a shard to tolerate the crash of one server, and instead of trying to resuscitate the crashed server (which may take a long time), it prescribes that the client finalize this shard and start a new shard to continue operation. After all, adding a new shard is frictionless and no-effort in Scalog.


Scalog is evaluated extensively and is shown to give good results. To evaluate Scalog at scale, the authors use a combination of real experiments and emulation. They use a 10 Gbps infrastructure and SSDs, and consider 4KB records. With 17 shards, each with two storage servers, each processing 15K writes/sec, they show that Scalog achieves a total throughput of 255K totally ordered writes/sec. Further, through emulation, they demonstrate that, at a latency of about 1.6 ms, Scalog can handle about 3,500 shards, or about 52M writes/sec. This means that Scalog can deliver throughput almost two orders of magnitude higher than Corfu’s, with comparable latency. They also provide experiments to show how reconfigurations impact Scalog, and how well Scalog handles failures.


Embrace small-coordination and punt things to the client

The main contributions in Scalog are that:
  1. it allows applications to customize data placement 
  2. it supports reconfiguration with no loss in availability
  3. it recovers quickly from failures
The first two features are mostly the client's responsibility. The thing about Scalog is that it provides small-coordination (as in small government), and it gets out of the way of the client, so the client can customize its data placement and add new shards without causing a slowdown or loss of availability in the system. The third feature, recovering quickly from failures, is also an artifact of the small-coordination provided by Scalog.

How does Scalog compare with consensus protocols?

Scalog is not a consensus protocol. It assumes a Paxos/consensus box to provide strong consistency/linearizability in the face of failures. From the Paxos side, the closest idea to Scalog is SDPaxos. SDPaxos also separates ordering from replication: replication is done first, and ordering is imposed over the already replicated requests. On the other hand, SDPaxos does not provide horizontal scaling as Scalog does. There would need to be a separate SDPaxos instance overseeing each shard, and even then we would need a similar watermarking trick as in Scalog to impose an a posteriori global ordering on the logs.

Scalog's ordering is a posteriori ordering. There is no immediate read operation supported in Scalog. Reads are supported either through the subscribe operation (which provides streaming batch reads) or via the readRecord(l,s) operation, where l is the sequence number and s is the shard. In either case, it seems like the reads are reads from the past and not real-time reads. I wonder if there is a loss of functionality because of this. For example, does this cause problems for log-based global RSM maintenance? The paper cites analytics applications as uses of Scalog, and it is not very clear whether we could have responsive command servers maintained using the RSM approach over Scalog shared logs.

How does Scalog compare with AWS Physalia?

AWS Physalia presented large-scale sharded configuration boxes that oversee sharded chain-replication systems. I think Physalia is a better engineered system, since it takes into account partitioning problems and can reconfigure the coordination cells to relocate closer to the storage servers to handle some partitioning problems.

The batch-style asynchronous acknowledgements in Scalog are bad for usability. They make it hard for the clients to determine when there is a problem and where the problem is. The clients cannot differentiate between a problem in the shard and a problem in the ordering layer. Are the two servers in the shard partitioned from each other? Or is the entire shard partitioned from the ordering layer? Is the ordering layer functioning but delayed, or is it down or unreachable? These ambiguities limit the actions the clients can take to respond and adapt to problems.

Of course Physalia does not provide an ordering across shards, but maybe an ordering layer can also be added similar to the watermarking ideas presented in Scalog.

Friday, April 17, 2020

Elle: Inferring Isolation Anomalies from Experimental Observations

This paper is by Kyle Kingsbury (of Jepsen fame) and Peter Alvaro (of beach wandering fame) and is available on Arxiv.

Adya (2000) showed that transaction isolation anomalies can be defined in terms of cycles over a Direct Serialization Graph (DSG) that captures the dependencies between transactions. Unfortunately, it was hard to utilize this DSG technique for isolation anomaly checking in practice, because many database systems do not have any concept of a version order, or they do not expose that ordering information to clients. This paper shows that it is possible to use an encoding trick on the client side to emulate/maintain that ordering information and ensure that the results of database reads reveal information about their version history. The solution they find is the list data structure, which is supported by many databases. The paper also shows that lighter-weight data structures, such as sets, can also be useful for checking violations of weaker isolation properties.

Building on these, the paper presents Elle: a novel checker which infers a DSG using client-observed transactions. Elle not only detects anomalies, it can also discriminate between them and provide concise explanations of each. The paper gives evidence of Elle's effectiveness via a case study of four real databases.

The paper is valuable because it builds a bridge between research on dependency graphs and emerging techniques for black-box database testing. On the impact side, I believe Elle the checker will have enormous real-world impact as part of the Jepsen framework, as it extends checking to multi-object transactions (the serializability branch in the tree). Elle is released as an open-source project.

In my summary below, I use paragraphs and figures lifted from the arxiv version. I strongly recommend you read the paper, and give Elle a try.


Many databases do not provide the isolation guarantees they claim. Checkers help detect violation of isolation guarantees. By generating client workloads and injecting faults, checkers produce anomalies that witness a violation of a stated guarantee.

Many checkers use a particular pattern of transactions for checking. These checkers have several drawbacks: they find a small number of anomalies in a specific pattern of transactions and tell us nothing about the behavior of other patterns; they require hand-proven invariants; and each property may require a separate test.

More general checkers exist. But their use is limited by the NP-complete nature of linearizability checking, and the combinatorial explosion of states in a concurrent multi-register system. Serializability checking is also (in general) NP-complete and unlike linearizability, one cannot use real-time constraints to reduce the search space.

Instead of solving for a transaction order, Elle uses its knowledge of the transactions issued by the client, the objects written, and the values returned by reads to reason about the possible dependency graphs of the database in the language of Adya's Direct Serialization Graph (DSG) formalism.

DSG is a graph over transactions in some history H, whose edges are given by these dependencies. It provides a simple way to represent isolation violation anomalies, and we can check these properties in linear time: intermediate and aborted reads are straightforward to detect, and once we have constructed the dependency graph, cycle detection is solvable in O(vertices + edges) time, thanks to Tarjan's algorithm for strongly connected components.
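As a quick illustration of why checking is cheap once the graph is built, here is a minimal Python sketch of cycle detection over a dependency graph (an adjacency dict from a transaction to the transactions it depends on). This toy version uses plain DFS back-edge detection rather than Tarjan's full SCC algorithm, but it is also linear in vertices plus edges.

```python
# Minimal DFS-based cycle finder over a dependency graph:
# graph maps a transaction id to the transactions it depends on.
def find_cycle(graph):
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {v: WHITE for v in graph}
    stack = []

    def dfs(v):
        color[v] = GRAY
        stack.append(v)
        for w in graph.get(v, ()):
            if color.get(w, WHITE) == GRAY:       # back edge: found a cycle
                return stack[stack.index(w):] + [w]
            if color.get(w, WHITE) == WHITE:
                found = dfs(w)
                if found:
                    return found
        stack.pop()
        color[v] = BLACK
        return None

    for v in list(graph):
        if color[v] == WHITE:
            found = dfs(v)
            if found:
                return found
    return None

# T1 -> T2 -> T3 -> T1 is a dependency cycle, i.e., an isolation anomaly
cycle = find_cycle({"T1": ["T2"], "T2": ["T3"], "T3": ["T1"]})
# cycle == ["T1", "T2", "T3", "T1"]
```

Returning the actual cycle (rather than just a boolean) is what lets a checker like Elle explain the anomaly to a human.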

However, there is one significant obstacle to working with an Adya history: we don’t have it. Many database systems do not have any concept of a version order, or they do not expose that ordering information to clients. This paper shows that it is possible to use an encoding trick on the client side to emulate/maintain that ordering information and ensure that the results of database reads reveal information about their version history. The solution they find is the list data structure, which is supported by many databases.

Deducing dependencies

We can infer properties of every history compatible with a given client observation, by taking advantage of datatypes which allow us to trace the sequence of versions which gave rise to the current version of the object. If every value written to a given register is unique, then we can recover the transaction which gave rise to any observed version. We call this property recoverability: every version we observe can be mapped to a specific write.

Recoverability allows us to infer read dependencies. But blind writes to a register "destroy history". To circumvent this we consider richer datatypes, whose writes do preserve some information about previous versions.

For instance, we could take increment operations on counters, starting at 0. The problem here is that we can't tell which increment produced a particular version. This keeps us from inferring write-write, write-read, and read-write dependencies.

What if we let our objects be sets of elements, and had each write add a unique element to a given set? While we can recover some (but not all) write-write, write-read, and read-write dependencies, we cannot identify all write-write dependencies due to lack of ordering.

But we can add order to our values by letting each version be a list, to which a write appends a unique value. As with counters and sets, we can use traceability to reconstruct read-read, write-read, and read-write dependencies, but because we have the full version history, we can also precisely identify read-write and write-write dependencies for every transaction whose writes were observed by some read.
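A tiny Python sketch of the idea, with hypothetical names: if every appended value is unique and we know which transaction wrote it, a single observed read of the list yields the whole write-write dependency chain.

```python
# Hypothetical names: writer_of maps each unique appended value to the
# transaction that wrote it; an observed read of the list fixes the
# version order, so each writer ww-depends on the previous writer.
def infer_write_write_deps(observed_list, writer_of):
    deps = []
    for earlier, later in zip(observed_list, observed_list[1:]):
        deps.append((writer_of[later], writer_of[earlier]))
    return deps

writer_of = {1: "T1", 2: "T2", 3: "T3"}
deps = infer_write_write_deps([1, 2, 3], writer_of)
# deps == [("T2", "T1"), ("T3", "T2")]: T2 overwrote T1's version, and so on
```

With a set instead of a list, only membership (not order) is observable, which is why sets recover only some of these edges.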

While Elle can make limited inferences from read-write registers, it shines with richer datatypes, like append-only lists. The paper also shows that lighter-weight data structures, such as sets, can also be useful for checking violations of weaker isolation properties.

Implementation and evaluation 

Elle is straightforward to run against real-world databases. Most transactional databases offer some kind of list with append. The SQL standard’s CONCAT function and the TEXT datatype are a natural choice for encoding lists, e.g. as comma-separated strings. Some SQL databases, like Postgres, offer JSON collection types. Document stores typically offer native support for ordered collections.
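As a toy illustration (using Python's sqlite3 in memory; the schema and helper function are hypothetical, not from Elle's codebase), here is how an append-only list can be encoded as a comma-separated string with SQL concatenation:

```python
# Toy illustration of encoding append-only lists as comma-separated
# strings in a SQL store, Elle-workload style. Schema is made up.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT)")
db.execute("INSERT INTO kv VALUES ('x', '')")

def append(db, key, element):
    # SQLite uses || for string concatenation (CONCAT in standard SQL)
    db.execute(
        "UPDATE kv SET v = CASE WHEN v = '' THEN ? ELSE v || ',' || ? END "
        "WHERE k = ?",
        (str(element), str(element), key))

append(db, "x", 1)
append(db, "x", 2)
final = db.execute("SELECT v FROM kv WHERE k = 'x'").fetchone()[0]
# final == "1,2": the read reveals the full version history of key x
```

Splitting the string on commas on the client side recovers the ordered list of unique writes, which is exactly what the dependency inference needs.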

The authors implemented Elle as a checker in the open-source distributed systems testing framework Jepsen and applied it to four distributed systems spanning SQL, document, and graph databases: TiDB, YugaByte DB, FaunaDB, and Dgraph. Elle revealed anomalies in each of them.

Elle's performance on real-world workloads was excellent: where Knossos (Jepsen's main linearizability checker) often timed out or ran out of memory after a few hundred transactions, Elle did not exhibit Knossos' exponential runtimes and was able to check histories of hundreds of thousands of transactions in tens of seconds.

Another nice thing about Elle is that it is informative. It provides a human-readable explanation of why each witness must be an instance of the claimed anomaly.

What is with the name?

I am not hip or young, so I did some googling about Elle. I decided that the name refers to Elle Fanning, alluding to this checker being a new boost/cover for Jepsen.

But I was wrong.

My next theory was that it is about the evergreen lists that Elle magazine publishes. "Lists" because lists play a major role in Elle, the checker, as well as Elle the magazine. And I may not be too far away from the mark.

Thursday, April 16, 2020

Fine-Grained Replicated State Machines for a Cluster Storage System

This paper appeared in NSDI 2020 and was authored by Ming Liu and Arvind Krishnamurthy, University of Washington; Harsha V. Madhyastha, University of Michigan; Rishi Bhardwaj, Karan Gupta, Chinmay Kamat, Huapeng Yuan, Aditya Jaltade, Roger Liao, Pavan Konka, and Anoop Jawahar, Nutanix.

The paper presents the design and implementation of a consistent and fault-tolerant metadata index for a scalable block storage system via distributed key-value abstraction. The key idea is to use fine-grained replicated state machines (fRSM), where every key-value pair in the index is treated as a separate RSM to reduce tail-latency in key-value access and provide robustness to key access skews.


The problem arose from Nutanix's business of building private clouds for enterprises to enable them to instantiate VMs that run legacy applications. A cluster management software determines which node to run each VM on, migrating them as necessary. And Stargate provides a virtual disk abstraction to these VMs, translating virtual accesses to physical disk accesses. This work focuses on how Stargate stores the metadata index that maps virtual disk blocks to physical locations across the cluster.

To minimize the probability of data loss, any update to the metadata must be committed to stable storage on multiple nodes before Stargate acknowledges the write to the client. In the baseline/traditional design, Paxos would be employed for this. More specifically, the traditional design shards the metadata index across multiple nodes, and then uses Paxos for ordering operations on any given shard. The operations are then executed on a durable data structure such as a log-structured merge tree. For example, the Spanner and CockroachDB designs are similar to this.

But there are drawbacks to the baseline design. The use of a per-shard consensus operation log introduces inefficiencies, more specifically head-of-line (HOL) blocking, which arises due to bursty operations on the same shard. Even when the operations are for different keys, later operations have to wait for earlier ones to complete, as requests have to commit and execute in the sequence specified by the log order for the shard. This can increase latency and cause unpredictable tail latency. The paper reports a factor of 4 difference between loads on shards, which turns HOL blocking into a significant problem.

I think this is application dependent, and by choosing smaller shards (rather than abolishing shards altogether, as fRSM suggests) the problem can be remedied. For example, CockroachDB uses small shards of 64MB, and it looks like that granularity works well for them. In the discussion section at the end, I will revisit this shard-size discussion.

Side-quest: going beyond linearizable operations 

As a side-quest, the paper sets up this problem, and it later demonstrates how it is resolved by the fRSM read operation. Under linearizability, even when a read issued after a failure does not reflect a write issued before the failure, this does not mean that the write failed. The update could have been arbitrarily delayed and might get applied later, causing subsequent reads to observe the updated value. This is the ghost-writes problem in Cassandra and MongoDB, and it is annoying to deal with. The paper argues that they would need to provide stronger guarantees to client VMs so that the VMs can reason about operation failures.

They require that any subsequent read of the metadata after an operation timeout must confirm whether the prior operation succeeded or not. As a result, successive reads of a piece of metadata should return the same value as long as there are no concurrent updates initiated by other agents in the system.

Paxos can satisfy this if the leader keeps track of client submission order and uses read-in-log execution. But the problem is that you have to wait for the commit before you acknowledge. If you do speculative acks of dirty writes, you are prone to this problem again. Below we will see how fRSM read operations resolve this problem, by fate-sealing the writes that precede them.

Fine-grained replicated state machine (fRSMs)

Using fRSMs, each key-value pair is represented as a separate RSM and can operate independently. fRSM uses no operation logs and maintains only a small amount of consensus state along with the perceived value of a key.

Associated with each key is a clock attribute storing
  • an epoch number representing the generation for the key
  • a timestamp within the epoch, advanced whenever the key's value is updated; the epoch number and the timestamp together represent a Paxos instance number
  • the promised proposal number and accepted proposal number associated with the key's value
  • a chosen bit serving as the commit flag in Paxos
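As a rough Python sketch (field names are my paraphrase of the paper's description, not Nutanix's code), the per-key consensus state might look like:

```python
from dataclasses import dataclass

# Paraphrased per-key consensus state; field names are my own guesses.
@dataclass
class KeyClock:
    epoch: int = 0        # generation for the key
    timestamp: int = 0    # advanced on every update of the key's value
    promised: int = 0     # highest prepare proposal number promised
    accepted: int = 0     # proposal number of the accepted value
    chosen: bool = False  # commit flag, as in Paxos

    def instance(self):
        # (epoch, timestamp) together act as the Paxos instance number
        return (self.epoch, self.timestamp)
```

The point of the design is that this handful of fields is all that is kept per key, instead of an operation log.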

CAS-Update operation

CAS updates are built using the clock the client obtained via the key read. With each read, a client also receives the current epoch (e) and timestamp (t) for the value. The client specifies the new value for timestamp t+1 having read the value previously at timestamp t. The request is routed to the leader of the replica group responsible for the key.

Here are the phases of the CAS-update operation:
  • Retrieve key's consensus state: The leader reads its local state for key k and retrieves the key's local clock: pp for the promised proposal number, pa for the accepted proposal number.
  • Prepare request: If pp is for a prepare issued by a different node, then the leader generates a higher proposal number and sends prepare messages to the other nodes. (The leader skips this step if pp and pa are the same.)
  • Prepare handler: Same as in Paxos phase-1b. The replicas durably store the prepare proposal number as part of the key's clock attribute.
  • Accept request: The key's value and the corresponding clock are recorded in the commit log and Memtable at each node.
  • Accept response processing:
    • If a quorum of successful accept responses is received at the leader, the leader considers the operation to be completed, sets the chosen bit, and acks the client.
    • If the request is rejected because the (epoch, timestamp) tuple at a replica is greater than the client-supplied epoch and timestamp, then a CAS error is sent to the client. Further, accept messages are initiated to commit the newly learned value and timestamp at a quorum.
This is very much the classical Paxos protocol.
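Abstracting away replication and quorums, the leader-side CAS check can be sketched in a few lines of Python. This toy version keeps a single local map and only illustrates the (epoch, timestamp) comparison and the CAS-error path, not the full protocol.

```python
# Toy single-node stand-in: store maps key -> (epoch, timestamp, value).
# Only the CAS comparison is modeled, not replication or quorums.
def try_cas(store, key, client_epoch, client_ts, new_value):
    epoch, ts, value = store.get(key, (0, 0, None))
    if (epoch, ts) != (client_epoch, client_ts):
        # The replica has moved past the client's clock: reject the CAS
        # and return the newer state so the client can adopt it.
        return ("CAS_ERROR", (epoch, ts, value))
    store[key] = (epoch, ts + 1, new_value)
    return ("OK", (epoch, ts + 1, new_value))

store = {}
first = try_cas(store, "k", 0, 0, "a")   # ("OK", (0, 1, "a"))
stale = try_cas(store, "k", 0, 0, "b")   # ("CAS_ERROR", (0, 1, "a"))
```

Note how the rejection carries the newer (epoch, timestamp, value), mirroring how fRSM replicas return and adopt newer state on a CAS error.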

The video presentation of fRSM is misleading here, as it uses the term "fast path" in discussing the update operation. In the Paxos literature, fast path means reaching a supermajority to get a commit quickly, as in Fast Paxos or EPaxos. However, the fRSM authors use fast path to refer to the stable-leader (aka MultiPaxos) optimization of skipping the prepare phase and doing only the accept phase.

The big difference from the classical Paxos operation is that fRSM is logless: the nodes do not maintain per-key or per-shard operation logs. This required some subtle changes to the protocol. The nodes skip over missed operations and directly determine and apply the accepted value with the highest associated proposal number (with a possibly much higher timestamp). They still send a CAS error, but they also adopt the value. Say a replica got update t, missed t+1, and received t+2. Since it didn't see t+1, it will send a CAS error, but it will also accept t+2 and use it, so it is caught up for t+3. If the leader does not get a quorum of accepts due to CAS errors, the leader will not be able to commit the value.

The processing logic speculatively updates the LSM tree and relies on subsequent operations to fix speculation errors. The leader checks its chosen bit, but the replicas/followers don't wait for a phase-3 commit. They just go with what they accepted. The read operation does some fate-sealing if necessary to take care of finalization in the presence of logless operation.

Read operation 

What do I mean by fate-sealing? The read needs to satisfy two properties:
  • the value returned should be the most recent chosen value for a given key
  • other accepted values with higher <epoch, timestamp> than the returned value are not chosen
This second property is required to ensure that any other CAS operations that are in progress but aren't visible will not be committed in the future; this is similar to a view change in the Viewstamped Replication protocol, and it is achieved by increasing the epoch number, as discussed below as part of mutating quorum reads. This way, operations that are not deemed complete at the end of a view are prevented from committing in a subsequent view.

I still think a much cooler name for this would be "a fate-sealing read for Schrödinger's database". This read operation takes care of the ghost-writes problem we discussed in the side-quest as part of the background section.

The reads are processed in one of three modes
  1. leader-only reads
  2. quorum reads
  3. mutating quorum reads (read-repair mode)
When the operation is routed to the leader, the leader checks whether it is operating in the leader-only mode. If the check is successful, then the leader serves the request from its Memtable or one of the SSTables. I am assuming some leader lease is also involved here. The system employs ZooKeeper as a hint for the key-to-leader mapping, which is mentioned in one passing sentence in Section 3.1.

If the leader is not operating in the leader-only mode, then it has to poll the replica set for a quorum read and identify the most recent accepted value for the key. If this value is not available on a quorum of nodes, the leader has to propagate the value to a quorum of nodes (i.e., perform a mutating quorum read, as in the read-repair operation in Cassandra and the ABD protocol).

Further, if there is an unreachable replica that might have a more recent accepted value (i.e., the promise is made to a node that did not respond with a clock attribute), then the mutating quorum read performs an additional quorum-wide update to just the timestamp to prevent such a value from being chosen (i.e., fate sealing against ghost-writes).
  • Let pp be the promise associated with an unreachable node, and let v, e, and t be the value, epoch, and timestamp associated with the highest accepted proposal
  • The leader issues prepare commands to the replica nodes to obtain a promise greater than pp, and then sends accept commands to the replica nodes to update their value, epoch, and timestamp fields to v, e, and t+1, respectively. The higher timestamp value prevents older CAS operations from succeeding.
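A toy sketch of that timestamp-bump step (names and signature are mine): given the highest accepted (value, epoch, timestamp) and the promise held out to the unreachable node, the leader re-proposes the same value at t+1 under a higher proposal number, which forecloses the unseen write.

```python
# Illustrative only: re-propose the known value at a bumped timestamp so
# that any unseen accepted value at (epoch, t+1) can never be chosen.
def fate_seal(highest_accepted, unreachable_promise):
    value, epoch, ts = highest_accepted
    new_promise = unreachable_promise + 1   # outbid the unknown proposal
    sealed = (value, epoch, ts + 1)         # same value, higher timestamp
    return new_promise, sealed

result = fate_seal(("v", 3, 7), unreachable_promise=42)
# result == (43, ("v", 3, 8))
```

The returned promise and sealed tuple would then be pushed to a quorum via prepare and accept commands, as the bullet points above describe.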


The paper argues that fRSM allows for flexible and dynamic scheduling of operations on the metadata service and enables effective use of storage and compute resources, and it uses the evaluation to show the benefits of the fine-grained RSM approach over coarse-grained sharded RSMs.

The evaluations show that, compared with coarse-grained RSMs, fRSMs achieve 5.6× and 2.3× higher throughput for skewed and uniform scenarios, respectively, in controlled testbeds. The paper argues this is because fRSM (1) allows requests accessing different keys to be reordered and committed as soon as they complete; (2) eliminates the computation cost associated with scanning the RSM log to identify and retry uncommitted entries; (3) avoids unnecessary head-of-line blocking caused by other requests; and (4) achieves better load balance across SSDs and cores even in skewed workloads.


Here is the YouTube video of the discussion of the paper from our Zoom DistSys Reading Group. I include some of our discussion in the below section. Here is a link to the Google Docs for recording questions for the discussion.

Next week, we will discuss "Scalog: Seamless Reconfiguration and Total Order in a Scalable Shared Log" paper from NSDI 2020. To engage in the paper discussion and get the Zoom Link for the meeting, join our Slack channel.


1. Why not use smaller shards? Why go to per-key RSM?

Smaller shards (in the limit, one key per shard as in fRSM) help with hardware utilization, but they also have more overhead. The paper does not talk much about shard size selection, but I suspect the choice is application dependent. The paper has this to say about smaller shards, but it should have elaborated more on this topic.

Sub-dividing the shards into even smaller shards would mitigate the load imbalance issue. However, it suffers from three drawbacks. First, it doesn’t address the request head-of-line blocking issue. Requests still have to commit and execute in sequence as specified in the log order. Second, it further reduces batching efficiency for storage devices. Third, it doesn't provide the benefit of fast node recovery, as a recovering node cannot immediately participate in the protocol.

Our previous work WPaxos also provides per-key RSM, and in an efficient manner. Moreover, it allows efficient and safe key-stealing so that the system can self-optimize and adapt to access patterns and provide local yet strongly consistent updates even in WAN settings.

2. Why does fRSM not support blind writes? 

The paper says this on that: "We do not support blind writes, i.e., operations that merely update a key’s value without providing the current value. Since all of our operations are CAS-like, we can provide at-most-once execution semantics without requiring any explicit per-client state as in RIFL."

It would be nicer if this point were elaborated on; it is still not very clear. Hermes also had a logless design, but it allowed blind writes as well.

3. What does Mad Danny say about the paper?

I asked Mad Danny about his views on the paper. This is what he had to say.

This protocol is part of a system that has been running in deployment for 8 years. fRSM is a weird cross between CASPaxos and MultiPaxos. It is probable that they did not even have Paxos in mind when they first deployed this solution. Or maybe they started with MultiPaxos and evolved toward a logless CASPaxos-like solution. The protocol is so intertwined with the application that it is maybe hyper-optimized for the application.

This is what you get for writing a paper after so many years. The protocol takes on a life of its own and becomes huge with many details. It becomes hard for the authors to figure out how to abstract away and prioritize the ideas, having worked on the implementation and devops of the protocol for many years. It becomes hard to abstract out and tell a simple story. As a result, the paper is not well written, and it looks like a collection of half-finished thoughts in many places.

The nice thing about fRSM is that it has no logs, which gives it an agility advantage. It uses dirty writes (followers take an accept as the last word, speculatively) and offloads to reads the responsibility of fate-sealing the writes. Solving the ghost-write problem is a big boon. Having a similar thing for Cassandra and MongoDB could be of humongous help.

For utmost reliability, I would use a very well understood, battle-tested Paxos implementation, and deploy it as a configuration service fleet like AWS did in Physalia. This way you can shard to your heart's content as needed using cells, reconfigure them, and move them. The nice thing about cells is that you can still do transactions within a cell using Paxos, and this gives extra power for programmability of the service. And that solution would be more general, not tied to one application, and could be lifted and applied to other applications.

Sunday, April 12, 2020

DistSys Reading Group second meeting: Wormspace

We had our second Zoom DistSys Reading Group on Wednesday. The meeting is open to all who are working on distributed systems. Join our Slack channel for paper discussion and meeting links (password protected).

I had summarized Wormspace paper before the meeting. It is a great paper. This week I was the presenter, and we started the meeting with my presentation for 30 minutes. Here is a link to my slides.

In the presentation I made sure to emphasize the benefit provided by WormSpace. It is an abstraction that enables developers to use distributed consensus as a building block for applications. Developers don't need to understand how distributed consensus via Paxos works; the API hides the details and complexity of Paxos under a data-centric API: capture, write, and read. The API is at a low enough level to enable efficient designs on top (as demonstrated for WormTX) without the need to open the Paxos box. Bunching WORs into WOSes was also a very useful decision for improving the programmability of the WormSpace library. In short, WormSpace enables you to remix distributed consensus in your applications. While WormSpace is not groundbreaking in terms of novelty, it is a very important contribution, because providing the right abstractions enables an explosion of applications in a field, e.g., MapReduce, Spark, and TensorFlow.

After the presentation, we had general discussion about the paper, answering questions posed by participants at the linked Google Docs. Some interesting ones included:
  • Q: How does WormSpace's use of registers compare with Heidi Howard’s generalized solution to distributed consensus which builds on top of write-once register (WOR)? 
  • A: WormSpace implements WORs via Paxos, whereas Heidi's work assumes WORs are provided by a middleware on top of which Paxos protocols are built. WormSpace registers are distributed registers; in contrast, Heidi's registers are local implementations: the middleware provides many rounds of write-once registers per register, over which distributed registers are implemented via Paxos variants.
  • Q: What is the purpose of a WOS? Is WOS just at a coarser granularity than WOR?
  • A: Yes, WOS provides a batching opportunity over WORs, and helps for performance and programmability. As an example of programmability improvement, consider WormLog. A client (sequencer) can allocate a WOS, and batch capture all the WORs in it, and pass these as write tokens to tail in sequence to other clients talking to this sequencer.
  • [comment] Figs. 9 & 10 compare WormPaxos vs. CPaxos (classical Paxos). But CPaxos uses neither the MultiPaxos optimization of a stable leader nor the piggybacking of commits. In contrast, WormPaxos has a stable leader: the client who allocates the WOS. This makes the comparison unfair.

In the discussion, we decided on two questions to explore deeper for the breakout session.
  • WormSpace can be viewed as an easy to use paxos-as-a-library.  With the many variants of Paxos that exist, which can be implemented using WormSpace’s API?  What extensions to WormSpace would be needed to be able to support the others?  (E.g. Mencius, generalized paxos, flexible paxos, e-paxos, {cheap,vertical,stoppable} paxos?)
  • WormSpace proposes WORs/Paxos as the fault-tolerant, replicated base on which to build applications and services.  Other work (Tapir, Replicated Commit, ?) suggests that replication should be fused with higher level protocols/applications for maximum performance. How do these approaches compare?

Here is the YouTube video of our presentation and discussion. After the breakouts, there was another discussion, but we couldn't record that.

After I put people to the breakout rooms, our neighborhood had a power outage. It lasted for two hours. I tell you, it is not fun to have a blackout, when you are in quarantine and are worried about whether the world is falling apart.

Monday, April 6, 2020

What Pasha taught me about life

Pasha joined our family when he was barely 2 months old. Shortly after that the Covid-19 quarantine started, and we have spent our lives 24/7 with Pasha.
We are all big fans of Pasha, but I particularly admire him as I am in awe of his approach to life. I fired my life-coach and decided to follow Pasha's teachings instead.

Here are the things I learned from Pasha.

Play hard, rest easy

Pasha has a lot of energy in the mornings. He bounces off the walls, climbs to our curtains, and playfully harasses our wrists and ankles. If we try to snuggle with him in the morning, he runs away, to continue his parkour route around the house.

At 11am, he crashes. He sleeps wherever he deems fit. If it is sunny, he finds a sunny spot in front of a window. But he also doesn't mind sleeping on the carpet in a busy room, oblivious to the foot traffic in the area.

After his nap, Pasha wakes up, a new cat, refreshed, and does a rinse and repeat.

Resting is important. Some people say work hard, party hard. No, Pasha says "work is play, play hard, and then rest easy." Pasha's philosophy is to embrace the boredom and tranquility, so he can play hard again.

Treat work as play, and play as work

Pasha doesn't differentiate between work and play. He takes his play seriously and not-seriously at the same time. I know this sounds like a zen koan, but let me try to explain. Pasha knows he is playing but he also knows he is training as well. It looks like he is wired to do deliberate practice in his playing/training. He goes the extra length when jumping or stretching even when he knows this is for play, and this is not a real prey he is chasing.

By training hard through his play time (which he observes religiously), he is cultivating essential skills for when they could be needed.

Pasha has intense concentration when playing. I envy this, I wish I could summon my focus as quickly and as intensely as he can. I will be working on this taking inspiration from Pasha.

Pursue your curiosity

Pasha follows his curiosity to the end, undeterred by any obstacles. Scolding him or relocating him to another room doesn't help. Once he gets interested in finding out about something (e.g., the breakfast table, the laundry room), he keeps chasing it relentlessly. Yesterday he fell into the toilet bowl, and had to get another bath.

Curiosity kills the cat, but satisfaction brought him back. The thing is we can't get mad at Pasha for the troubles he causes when following his curiosity. It is very apparent that he is wired this way. There is no point in getting angry at him for something outside his control.

I am a curious person, but looking at Pasha I know there is a lot of room to improve in this department. I wish I could be as driven by curiosity as Pasha.

Dance to your own beat

Pasha determines his own schedule, and he doesn't let us impose our schedule on him. He always naps around the same times. He takes a nap from 11pm to a little past midnight in the chair next to me in my study. When I am ready to go to bed, he wakes up without fail. I guess he gets some more sleep during the night, because he is up early in the morning full of energy.

Pasha enjoys life. He looks fully present and content at every moment. He doesn't have high expectations from life, and he can entertain himself. He is definitely handling this quarantine better than us, and probably the life thing as well.

Treasure your family

Pasha is not needy for attention and affection. When we give him affection he welcomes it (only briefly if he is on the prowl), and he doesn't shy away from showing his affection. Pasha is very patient with our kids, especially our 5-year-old daughter. The reason we didn't get a cat earlier was that we waited for her to grow up more. My daughter can act like Elmyra Duff sometimes, but Pasha tolerates this well, and in fact he shows a lot of affection to her.

Lick Like yourself

Pasha takes a good chunk of time taking care of his hygiene. Even the quarantine could not cause an erosion of his personal care standards. Pasha surely knows how to treat and take care of himself.

Here is Pasha again. He grew up a lot in one month's time.


Hey, it's my blog, and I can post whatever I like. If you find this post nonsensical, you do you. I learned from Pasha to be indifferent to others' opinion/evaluation.

Saturday, April 4, 2020

WormSpace: A modular foundation for simple, verifiable distributed systems

This paper is by Ji-Yong Shin, Jieung Kim, Wolf Honore, HernĂ¡n Vanzetto, Srihari Radhakrishnan, Mahesh Balakrishnan, Zhong Shao, and it appeared at SOCC'19. 

The paper introduces the Write-Once Register (WOR) abstraction, and argues that the WOR should be a first-class system-building abstraction. By providing single-shot consensus via a simple data-centric API, the WOR acts as a building block for providing distributed systems durability, concurrency control, and failure atomicity.

Each WOR is implemented via a Paxos instance, and leveraging this, WormSpace (Write-Once-Read-Many Address Space) organizes the address space into contiguous write-once segments (WOSes) and provides developers with a shared address space of durable, highly available, and strongly consistent WORs to build on. For example, a sequence of WORs can be used to impose a total order, and a set of WORs can keep decisions taken by participants in distributed transaction protocols such as 2PC.

To demonstrate its utility, the paper implements three applications over WormSpace. These applications are built entirely over the WOR API, yet provide efficiency comparable to or better than handcrafted implementations.
  • WormPaxos, a Multi-Paxos implementation
  • WormLog, a distributed shared log (omitted in my summary for brevity)
  • WormTX, a distributed, fault tolerant transaction coordinator
The main benefit of WormSpace is that, compared to using Paxos directly, developers do not need to reason about or understand Paxos protocols to build applications on top; at the same time, the developer retains the flexibility to choose low-level implementation details. The WormSpace API enables more optimized designs (as demonstrated for WormTX) without the need to modify the system (to rewire the communication pattern).

If you would like to see how extensions of some of these ideas are put into action, watch this video of Mahesh Balakrishnan (one of the authors of WormSpace) and Jason Flinn (U Michigan) describing the Delos storage system for Facebook's control plane.

The paper is well-written, and I really enjoyed reading this paper. I think this paper should receive more attention. The paper has another track, which I will not discuss here: Formal verification of WormSpace and reuse of this proof for verifying systems built on top, via the Certified Concurrent Abstraction Layer (CCAL) approach. Sometimes, when a paper includes many ideas, it may dilute the focus and hurt its reception/recognition.

In my summary below, I use many sentences from the paper verbatim. I am saving my MAD questions and in-depth review to the Zoom paper discussion on Wednesday 15:30 EST. Our Zoom Distributed Systems Reading Group meeting is open to anyone who is passionate about distributed systems. Here is an invitation to the Slack workspace for the reading group, where we will conduct pre-meeting discussion and announce the password-protected link to the meeting. (UPDATE: Here is the link to our WormSpace paper discussion.)

The WormSpace system

The WOR abstraction hides the logic for distributed coordination under a data-centric API: a client can capture a WOR; write to a captured WOR; and read the WOR.

The address space is divided into write-once segments (WOSes) of fixed size. Segments are explicitly allocated via an alloc call that takes in a segment ID and succeeds if it is as yet unallocated. Once a client has allocated a WOS, any client in the system can operate on WORs within the segment. Specifically, it can capture a WOR; write to it; and read from it.

Clients must capture an address before writing to it, in order to coordinate the replicated servers and make the write atomic and immutable. The capture call is similar to a preemptible lock (e.g., the phase 1 prepare of Paxos): the lock must be acquired to write, but it can be stolen by others. A successful capture call returns a unique, non-zero capture ID; a subsequent write by the same thread is automatically parameterized with this ID, and succeeds if the WOR has not been captured by some other client in the meantime.

The WormSpace design is similar to a write-once distributed key-value store: WORs are associated with 64-bit IDs (consisting of segment IDs concatenated with offsets within the segment) and mapped to partitions, which in turn consist of replica sets of wormservers. Partitioning occurs at WOS granularity; to perform an operation on a WOR within a WOS, the client determines the partition storing the segment (via a modulo function) and issues the operation to the replica set.
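As a toy illustration of this addressing scheme, here is a Python sketch. The 16-bit offset width and the partition count are my assumptions for the sketch, not values from the paper; the point is only that partitioning happens at WOS granularity, so every WOR of a segment maps to the same partition.

```python
# Hypothetical sketch of WOR-ID-to-partition mapping (offset width and
# partition count are assumed, not taken from the paper).
NUM_PARTITIONS = 4
OFFSET_BITS = 16  # WORs per segment = 2**16 in this sketch

def wor_id(segment_id: int, offset: int) -> int:
    """Pack a 64-bit WOR ID: segment ID in the high bits, offset in the low bits."""
    assert 0 <= offset < (1 << OFFSET_BITS)
    return (segment_id << OFFSET_BITS) | offset

def partition_of(wid: int) -> int:
    """Partitioning is at WOS granularity: every WOR in a segment lands together."""
    return (wid >> OFFSET_BITS) % NUM_PARTITIONS
```

Because the modulo is computed over the segment ID alone, clients can locate the replica set for any WOR without any directory lookup.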

Each WOR is implemented via a single-shot Paxos consensus protocol, with the wormservers within a partition acting as a set of acceptors. In the context of a single WOR, the wormservers act identically to Paxos acceptors; a capture call translates to a phase 1a prepare message, whereas a write call is a phase 2a accept message. The read protocol mirrors a phase 1a message, but if it encounters a half-written quorum, it completes the write. Each wormserver maintains a map from WOR IDs to the acceptor state for that single-shot Paxos instance. If a map entry is not found, the WOR is treated as unwritten.
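To make the capture/write mapping concrete, here is a minimal Python sketch of a wormserver's per-WOR acceptor state, with capture acting as a phase 1a prepare and write as a phase 2a accept. The class and method names are my own; this is a sketch of the idea, not the paper's code.

```python
class WormServer:
    """Sketch of per-WOR single-shot Paxos acceptor state (names assumed)."""

    def __init__(self):
        # WOR ID -> (promised ballot, accepted ballot, value); absent = unwritten
        self.state = {}

    def capture(self, wid, ballot):
        """Phase 1a: promise to ignore lower ballots; report any accepted value."""
        promised, accepted, value = self.state.get(wid, (0, 0, None))
        if ballot > promised:
            self.state[wid] = (ballot, accepted, value)
            return True, accepted, value
        return False, accepted, value

    def write(self, wid, ballot, value):
        """Phase 2a: accept the value unless a higher ballot has been promised."""
        promised, _, _ = self.state.get(wid, (0, 0, None))
        if ballot >= promised:
            self.state[wid] = (ballot, ballot, value)
            return True
        return False
```

A later capture with a higher ballot "steals" the register and learns any half-written value, which is exactly how a read that encounters a half-written quorum can complete the write.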

The client-side library layers the logic for enforcing write-once segments. Each WOS segment is implemented via a set of data WORs, a single metadata WOR, and a single trim WOR. Allocating the WOS requires writing to the metadata WOR. If two clients race to allocate a WOS, the first one to capture and write the WOR wins.


WormPaxos is an implementation of Multi-Paxos over WormSpace, exposing a conventional state machine replication (SMR) API to applications. Implementing Multi-Paxos over WormSpace is simple: the sequence of commands is stored on the WormSpace address space. In WormPaxos, servers that wish to replicate state act as WormSpace clients, and are called WP-servers. They can propose new commands by preparing and writing to the next free address; and learn commands by reading the address space in sequential order.

In WormPaxos, a WP-server becomes a sticky leader simply by using a batch capture on a WOS; accordingly, leadership strategies such as sticky leader, rotating leader, etc. can be implemented simply as policies on who should call the batch capture and when. The leader's identity can be stored within the metadata for each segment, obviating the need for WormSpace to know about the notion of a leader or the leadership strategies involved. If the leader crashes, a new leader that allocates the next WOS can batch capture the WOS of the previous leader, complete partially finished operations, and fill in junk values to unwritten WORs to prevent holes in the SMR/Multi-Paxos log.


2PC is known to be a blocking protocol in the presence of crash failures. WormTX shows how it can be made non-blocking by leveraging WormSpace. A number of protocol variants are presented to show how efficiency can be gradually improved.

  • [Variant A8: 8 message delays] An obvious solution is to simply store the votes in a set of per-RM WORs. In the WOR-based 2PC protocol, an RM initiates the protocol by contacting the TM (message delay #1); the TM contacts the RMs (#2); they capture the WOR (#3 and #4), and then write to it (#5 and #6); send back their decision to the TM (#7), which sends back a commit message to all the RMs (#8). 
  • [Variant B6: 6 message delays] Each RM can allocate a dedicated WOS for its decisions and batch capture the WOS in advance. 
  • [Variant C5: 5 message delays] TM can directly observe the decision by listening for write notifications on the WOS.
  • [Variant D4: 4 message delays] Individual RMs can directly listen to each other’s WOSes; this brings us down to 4 message delays.
  • [Variant E3: 3 message delays] We do not need a TM, since the final decision is a deterministic function of the WORs, and any RM can time out on the commit protocol and write a no vote to a blocking RM's WOR to abort the transaction. The initiating RM can simply contact the other RMs on its own to start the protocol (combining #1 and #2 of variant A8), bringing the number of delays down to 3. This variant is not described in Gray and Lamport's Consensus on Transaction Commit paper.
  • [Variant F2: 2 message delays]: This works only if RMs can spontaneously start and vote. 
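The deterministic-decision insight behind variant E3 can be sketched in a few lines. This is a hedged illustration: the vote encoding ("yes"/"no", with None for an unwritten WOR) and the function name are my assumptions, not the paper's API.

```python
# Sketch: the 2PC outcome as a deterministic function of per-RM vote WORs.
# Encoding assumed: "yes", "no", or None (WOR still unwritten).
def decision(vote_wors):
    """Commit iff every RM's write-once vote register holds 'yes'."""
    if any(v == "no" for v in vote_wors):
        return "abort"
    if all(v == "yes" for v in vote_wors):
        return "commit"
    return None  # undecided: a timed-out RM may write "no" into a blocking WOR
```

Because votes live in write-once registers, any RM can evaluate this function and reach the same answer, so no TM is needed to announce the outcome.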


Thursday, April 2, 2020

Gryff: Unifying consensus and shared registers

This paper is by Matthew Burke, Audrey Cheng, and Wyatt Lloyd, and appeared in NSDI'20. Here is a link to the paper, slides, and video presentation.

Straight talk (from the middle of the book)

  • The model of the paper is a great contribution. Stable versus Unstable ordering is a good framework to think in. Carstamps (consensus after registers) logical clock timestamping is a good way to realize this ordering. I think carstamps will see good adoption, as it is clear, concrete, and useful.
  • Constructing a hybrid of EPaxos and ABD is a novel idea.
  • The performance of Gryff is not good. A straightforward distributed key-value sharded implementation of Paxos would do a better job. I think Hermes is a better choice than Gryff with read-write and read-modify-write operations.


Recently we have seen a lot of interest in unifying consensus and shared registers, the topic of this paper. I think this is because of the popularity of distributed key-value stores/systems. While consensus is often used for implementing distributed key-value stores, it is a bit of overkill. You don't need to know the previous value of a key if you are overwriting it with a write operation: for a write, the new value is not a function of the old value. As Gryff says in the abstract: "Consensus is inefficient for workloads composed mostly of reads and writes". ABD is good enough for that, and it even gives you linearizability in the face of concurrent asynchronous execution with crash faults. However, ABD for shared registers is too weak to implement stronger synchronization primitives. You would still need to bring out the consensus gun for the occasional read-modify-write operation.

So this led many people to think about providing read/write operations and read-modify-write operations separately for implementing distributed key-value systems. Hermes (ASPLOS'20), which I reviewed earlier, is an example of this. Fine-Grained Replicated State Machines (NSDI'20), which we will discuss in an upcoming Zoom DistSys reading group, also looks at a related problem.

Consensus vs. Shared Registers

A big contribution of the paper is the framework it introduces for thinking about read write operations and read-modify write operations.

Applying commands in the same order on all replicas requires an ordering mechanism that is stable, i.e., a replica knows when a command's position is fixed and it will never receive an earlier command. In asynchronous systems where processes can fail, consensus protocols are used to agree on this stable ordering.

Shared register protocols provide a linearizable ordering of operations. That ordering does not have to be stable, however, because each write operation fully defines the state of the object. Thus, a replica can safely apply a write w4 even if it does not know about earlier writes. If an earlier write w3 ever does arrive, the replica simply ignores that write because it already has the resulting state from applying w3 and then w4. Figure 1b shows shared register ordering where there is a total order of all writes (denoted by <) without stability.
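A minimal sketch of this unstable ordering, assuming writes are tagged with (timestamp, writer_id) pairs compared lexicographically (the standard multi-writer ABD tag; the class name is mine):

```python
# Sketch: a replica applies whichever write carries the highest tag and
# silently discards late-arriving earlier writes.
class Register:
    def __init__(self):
        self.tag = (0, 0)   # (timestamp, writer_id)
        self.value = None

    def apply_write(self, tag, value):
        if tag > self.tag:  # lexicographic compare yields the total order of writes
            self.tag, self.value = tag, value
        # else: an earlier write arrived late; state already reflects a later write
```

The replica never waits for the missing earlier write, which is precisely why no stable ordering is needed for registers.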

A shared object in Gryff exposes the following interface:
  • READ(): returns the value of the object
  • WRITE(v): updates the value of the object to v
  • RMW(f(·)): atomically reads the value v, updates the value to f(v), and returns v

Carstamps for correct ordering

The paper says that the AQS (active quorum system) protocol [2010], which tried to unify consensus and shared registers, has a subtle bug that allows rmws to be misplaced/misordered. The paper argues that, to simplify reasoning about correctness, it is best to enforce the interaction at a deeper level, in the ordering mechanism, by imposing a structural order on the timestamps.

To leverage this insight, the paper introduces consensus-after-register timestamps, or carstamps. Carstamps allow writes and rmws to concurrently modify the same state without serializing through a leader or incurring additional round trips. Reads use carstamps to determine consistent values without interposing on concurrent updates.
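Based on the paper's description (writes reset an rmwc field, per Algorithm 1), a carstamp can be sketched as the ABD tag (ts, writer_id) extended with an rmw counter, compared lexicographically. The exact field layout here is my assumption, but it shows the structural order: every rmw sorts after the base write it modified, and any later write supersedes both.

```python
# Hedged sketch of carstamp ordering: (ts, writer_id, rmwc), compared
# lexicographically. Writes reset rmwc to 0; each rmw bumps rmwc on its base write.
def carstamp(ts, writer_id, rmwc):
    return (ts, writer_id, rmwc)

w = carstamp(3, 1, 0)       # a write
r1 = carstamp(3, 1, 1)      # an rmw applied on top of that write
w_next = carstamp(4, 2, 0)  # a later write supersedes both
assert w < r1 < w_next
```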

Gryff Protocol

The name Gryff comes from griffin, a hybrid of lion and eagle, a nod to the protocol being a hybrid of EPaxos and ABD.

The only difference between read-write in Gryff and multi-writer ABD is that replicas maintain a carstamp associated with the current value of the shared object, instead of a tag, so that rmws are properly ordered with respect to reads and writes.
  • Write. The rmwc field is reset to 0 (Line 15 of Algorithm 1).
  • Reads. "We make the same observation as Georgiou et al. [26] that the second phase in the read protocol of multi-writer ABD is redundant when a quorum already store the value and associated carstamp chosen in the first phase."
I like the succinct description of EPaxos in the paper; it depicts EPaxos as a generalization of Paxos with its three phases:
 EPaxos is a consensus protocol that provides optimal commit latency in the wide-area. It has three phases in failure-free executions: PreAccept, Accept, and Commit. If a command commits on the fast path (i.e., If the coordinator receives a fast/supermajority quorum of responses that all contain the same dependencies), the coordinator returns to the client after the PreAccept phase and skips the Accept phase (where it builds the final dependencies for the command by taking the union of all the dependencies that it received in the PreAccept phase). Otherwise, the command commits on the slow path after the Accept phase. Commands that do not read state complete at the beginning of the Commit phase; commands that do read state complete after a single replica, typically the coordinator, executes the command to obtain the returned state. The purpose of the PreAccept and Accept phases is to establish the dependencies for a command, or the set of commands that must be executed before the current command. The purpose of the Commit phase is for the coordinator to notify the other replicas of the agreed-upon dependencies.
Gryff makes three high-level modifications to EPaxos to unify its stable ordering with the unstable ordering of the multiwriter ABD read and write protocol.
  1. A base update attribute, base, is decided by the replicas during the same process that establishes the dependencies and the approximate sequence number for a rmw.
  2. A rmw completes after a quorum execute it.
  3. When a rmw executes, it chooses its base update from between its base attribute and the result of the previously executed rmw prev. The result of the executed rmw is applied to the value and carstamp of the executing replica.
The first change adapts EPaxos to work with the unstable order of writes by fixing the write upon which it will operate. The second change adapts EPaxos to work with reads that bypass its execution protocol and directly read state. In other words, the second change makes the ABD protocol able to read the EPaxos outcome. This makes the commit phase a two-phase operation; the rmw coordinator should see the commit completed by a majority quorum.

In sum, Gryff makes reads more efficient by performing them with ABD (with the above-mentioned Georgiou optimization) instead of EPaxos, where a supermajority quorum would be needed. While Gryff uses two-round ABD writes, I think writes may also be reduced to one round with a trick: optimistically propose with a timestamp likely to be the highest (e.g., using HLC clock timestamps), and if that fails, learn the higher timestamp and complete in two rounds.

On the other hand, I am starting to like the invalidation idea in Hermes more. In contrast to ABD used in Gryff, Hermes allows linearizable reads while writes are ongoing, and local reads at that.
  A coordinator node issues a write to a key only if the key is in the Valid state; otherwise the write is stalled. This doesn't seem to be necessary for safety, because higher-timestamped writes preempt lower-timestamped writes. So why does Hermes do this? I think it is because this lets the replicas see writes conclude, even when there is a deluge of writes to the same key. This may in turn help alleviate read starvation due to a constant flood of writes to the same key. I found this explanation in the Slack channel for ASPLOS'20, from the first author:
  It is safe for a read that initially found the object invalidated with version timestamp 2 and then subsequently invalidated with a version timestamp 3 to get serialized and return the version 2 value. Intuitively this is partly safe because a write with version 3 could not have started unless the write with version 2 has been committed.

Proxying reads

The base Gryff read protocol provides reads with single round-trip time latency from the coordinator to the nearest quorum including itself (1 RTT) when there are no concurrent updates. Otherwise, reads have at most 2 RTT latency. The paper discusses how read latency can be further improved in deployments across wide area networks.

Because the round-trip time to the replica that is colocated with a client process is negligible relative to the interreplica latency, replicas can coordinate reads for their colocated clients and utilize their local state in the read coordinator protocol to terminate after 1 RTT more often. When using this optimization, we say that the coordinating replica is a proxy for the client process's read.

  • Propagating Extra Data in Read Phase 1. The proxy includes in the Read1 (i.e., read-phase1) messages its current value v and carstamp cs. Upon receiving a Read1 message with this additional information, a replica applies the value and carstamp before returning its current value and carstamp. This has the effect of ensuring every replica that receives the Read1 messages will have a carstamp (and associated value) at least as large as the carstamp at the proxy when the read was invoked.
  • Updating the Proxy’s Data. The proxy also applies the values and carstamps that it receives in Read1Reply messages as it receives them and before it makes the decision of whether or not to complete the read after the first phase. If every reply contains the same carstamp, then the read completes after 1 RTT even if the carstamp at the proxy when the read was invoked is smaller than the carstamp contained in every reply.
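The two proxy techniques can be sketched together. This is a hedged sketch with assumed names: the proxy seeds its Read1 messages with its own (carstamp, value) so replicas catch up before replying, and it adopts newer state from each reply before deciding whether the read can finish in 1 RTT.

```python
# Sketch of the read proxy (names assumed, not the paper's code).
class Proxy:
    def __init__(self, cs, value):
        self.cs, self.value = cs, value   # proxy's current carstamp and value

    def read1_message(self):
        # Piggyback current state so every recipient ends up at least this fresh.
        return (self.cs, self.value)

    def on_reply(self, cs, value):
        # Adopt newer state from any Read1Reply before deciding on a 2nd phase.
        if cs > self.cs:
            self.cs, self.value = cs, value
```

If every reply then carries the same carstamp, the read completes in 1 RTT even though the proxy started behind.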


Gryff is implemented in the same framework as EPaxos and MultiPaxos and its performance is evaluated in a geo-replicated setting. The evaluation shows that, for workloads with moderate contention, Gryff reduces p99 read latency to ∼56% of EPaxos, but has ∼2x higher write latency. This tradeoff allows Gryff to reduce service-level p50 latency to ∼60% of EPaxos for large-scale web applications whose requests fan-out into many storage-level requests.

My big problem with the evaluation is that it doesn't use leader-leases optimization in MultiPaxos to allow serving of reads locally at the leader. This standard optimization would likely lead to MultiPaxos giving the best result for read latencies in the evaluations.

Another thing missing in the evaluation is a comparison with Cassandra. Cassandra implements read/write registers via ABD-like algorithm and can give linearizability if you configure the quorums accordingly. Cassandra also has CASPaxos for compare-and-set for conditional write, which can be used to implement read-modify-write.

The evaluation shows less blocking for rmws in Gryff. Gryff achieves 2 RTT rmws when there are no conflicts and 3 RTT when there are. While Gryff must still block the execution of rmws until all dependencies have been received and executed, Gryff experiences significantly less blocking than EPaxos. This is because EPaxos needs to have dependencies on writes, but Gryff’s rmw protocol does not keep track of dependencies on writes.

We see that Gryff and EPaxos each achieve a slightly higher maximum throughput than MultiPaxos due to their leaderless structure. (This is of course at low conflict rates; at high conflict rates, EPaxos and Gryff pay a stiff price.) It is easy to extend MultiPaxos to per-key sharded Paxos, and that would compete with and outdo Gryff and EPaxos. For achieving the best performance (both throughput and latency) in a WAN key-value deployment, however, I suggest using our WPaxos protocol, as it can also adapt to locality of access.

Our first Zoom DistSys reading group meeting

We did our first Zoom DistSys reading group meeting on April 1st, Wednesday 15:30 EST. We discussed the Gryff paper.

I didn't have much Zoom experience, and this was very experimental: reaching out to the world at large to run a reading group with whoever is interested.

20 people attended. As I was introducing the format, one person started writing chat messages, saying "this is so boring", etc. He had connected with a phone, and his video showed him walking, probably in a market. This should have been a red flag. The meeting participants asked me to remove him, because he was pinging and bothering them as well. That was our troll.

I had taken measures to stop zoom-bombing, since I had heard this was an issue.
  • Only the hosts and cohosts could share screen. 
  • I made two co-hosts to help with moderation. 
  • I had selected the option to disallow joining after removal.
I removed the troll, and there were no incidents after that.

The meeting took 90 minutes. The presentation from the slides took 30 minutes. The general discussion was 25 minutes, and we used 25 minutes for breakout rooms for focused, deeper discussion of selected questions. We had 10 minutes of wrap-up at the end, where we summarized the discussion from the breakout sessions.

This meeting was the first time I used breakout rooms. I was not sure how well they would work, but they turned out great. I used automatic assignment of participants to rooms and mentioned that I would switch users between rooms if requested. We had 3 breakout rooms, with around 5 people per room. This let us get to know each other, and we had a more relaxed and productive conversation. One participant in my breakout room had joined from India (2:30 am local time) and another from Germany (11:30 pm local time). It was nice meeting and discussing with people passionate about distributed systems from around the globe.

In the wrap-up after the breakouts, I ran a poll and learned that 90% of the participants had read the paper before the meeting. This improved the quality of the discussion. It is not easy to understand a paper just by watching a 25-minute presentation.

Here are the documents for offline viewing. But you had to be there; there is a big advantage for live participation.

On the whole, the Zoom meeting went better than I expected. After we get more experience with the mechanics of Zoom, I think things will run better.

I am now hopeful that we can make this meeting sustainable. I hope we will be able to get enough volunteers for presenting papers. The paper presentation is only for 20-30 minutes, because we assume participants read the paper before the presentation. Volunteering to be a presenter is a good commitment to make for learning more about a specific paper/topic. We ask the presenter to give us a summary 4-5 days before the presentation. This gives enough time for others to get a head start for preparing for the discussion.

Here are the next papers we will discuss in the upcoming meetings:

Two-phase commit and beyond
