Thursday, April 2, 2020

Gryff: Unifying consensus and shared registers

This paper is by Matthew Burke, Audrey Cheng, and Wyatt Lloyd, and appeared in NSDI'20. Here is a link to the paper, slides, and video presentation.

Straight talk (from the middle of the book)

  • The model of the paper is a great contribution. Stable versus Unstable ordering is a good framework to think in. Carstamps (consensus after registers) logical clock timestamping is a good way to realize this ordering. I think carstamps will see good adoption, as it is clear, concrete, and useful.
  • Constructing a hybrid of EPaxos and ABD is a novel idea.
  • The performance of Gryff is not good. A straightforward distributed key-value sharded implementation of Paxos would do a better job. I think Hermes is a better choice than Gryff with read-write and read-modify-write operations.


Recently we see a lot of interest in unifying consensus and shared registers, the topic of the paper. I think this is because of the popularity of distributed key-value stores/systems. While consensus is often used for implementing distributed key-value stores, this is a bit of an overkill. You don't need to know the previous value of the key, if you are overwriting it with a write operation to the key. For the write operation, the new value is not a function of the old value.  As Gryff says in the abstract: "Consensus is inefficient for workloads composed mostly of reads and writes". ABD is good enough for that, and it even gives you linearizability to the face of concurrent asynchronous execution with crash faults. However, ABD for shared registers is too weak to implement stronger synchronization primitives. You would still need to use the consensus gun for the occasional read-modify-write operation.

So this led many people to think about providing read/write operations and read-modify-write operations separately for implementing distributed key-value systems. Hermes (ASPLOS'20), which I reviewed earlier, is an example of this. Fine-Grained Replicated State Machines (NSDI'20), which we will discuss in an upcoming Zoom DistSys reading group, also looks at a related problem.

Consensus vs. Shared Registers

A big contribution of the paper is the framework it introduces for thinking about read write operations and read-modify write operations.

Applying commands in the same order on all replicas requires an ordering mechanism that is stable, i.e., a replica knows when a command's position is fixed and it will never receive an earlier command. In asynchronous systems where processes can fail, consensus protocols are used to agree on this stable ordering.

Shared register protocols provide a linearizable ordering of operations. That ordering does not have to be stable, however, because each write operation fully defines the state of the object. Thus, a replica can safely apply a write w4 even if it does not know about earlier writes. If an earlier write w3 ever does arrive, the replica simply ignores that write because it already has the resulting state from applying w3 and then w4. Figure 1b shows shared register ordering where there is a total order of all writes (denoted by <) without stability.

A shared object in Gryff exposes the following interface:
  • READ(): returns the value of the object
  • WRITE(v): updates the value of the object to v
  • RMW( f (·)): atomically reads the value v, updates value to f (v), and returns v

Carstamps for correct ordering

The paper says that AQS (active quorum system) protocol [2010] which tried to unify consensus and shared registers, has a subtle bug that allows rmw to be misplaced/misordered. The paper says that, to simplify reasoning about correctness, it is best to enforce the interaction at a deeper level, in the ordering mechanism, by imposing a structural order in the timestamps.

To leverage this insight, the paper introduces consensus-after-register timestamps, or carstamps. Carstamps allow writes and rmws to concurrently modify the same state without serializing through a leader or incurring additional round trips. Reads use carstamps to determine consistent values without interposing on concurrent updates.

Gryff Protocol

The name Gryff stands for Griffin, a hybrid between lion and eagle, as an attribution for the protocol being a hybrid between EPaxos and ABD.

The only difference in read-write in Gryff from multi-writer ABD is that replicas maintain a carstamp associated with the current value of the shared object instead of a tag so that rmws are properly ordered with respect to reads and writes.
  • Write. The rmwc field is reset to 0 (Line 15 of Algorithm 1).
  • Reads. "We make the same observation as Georgiou et al. [26] that the second phase in the read protocol of multi-writer ABD is redundant when a quorum already store the value and associated carstamp chosen in the first phase."
I like the succinct description of EPaxos in the paper; it depicts EPaxos as a generalization of Paxos with its three phases:
 EPaxos is a consensus protocol that provides optimal commit latency in the wide-area. It has three phases in failure-free executions: PreAccept, Accept, and Commit. If a command commits on the fast path (i.e., If the coordinator receives a fast/supermajority quorum of responses that all contain the same dependencies), the coordinator returns to the client after the PreAccept phase and skips the Accept phase (where it builds the final dependencies for the command by taking the union of all the dependencies that it received in the PreAccept phase). Otherwise, the command commits on the slow path after the Accept phase. Commands that do not read state complete at the beginning of the Commit phase; commands that do read state complete after a single replica, typically the coordinator, executes the command to obtain the returned state. The purpose of the PreAccept and Accept phases is to establish the dependencies for a command, or the set of commands that must be executed before the current command. The purpose of the Commit phase is for the coordinator to notify the other replicas of the agreed-upon dependencies.
Gryff makes three high-level modifications to EPaxos to unify its stable ordering with the unstable ordering of the multiwriter ABD read and write protocol.
  1. A base update attribute, base, is decided by the replicas during the same process that establishes the dependencies and the approximate sequence number for a rmw.
  2. A rmw completes after a quorum execute it.
  3. When a rmw executes, it chooses its base update from between its base attribute and the result of the previously executed rmw prev. The result of the executed rmw is applied to the value and carstamp of the executing replica.
The first change adapts EPaxos to work with the unstable order of writes by fixing the write upon which it will operate. The second change adapts EPaxos to work with reads that bypass its execution protocol and directly read state. In other words, the second change makes the ABD protocol able to read the EPaxos outcome. This makes the commit phase a two-phase operation; the rmw coordinator should see the commit completed by a majority quorum.

In sum, Gryff makes reads more efficient by performing them with ABD (with the above-mentioned Georgiou optimization) instead of EPaxos where a supermajority quorum would be needed. While Gryff uses two-round ABD writes, I think that may also be reduced to one round write with a trick for inferring the higher TS value to propose time with (like using HLC clock timestamps) in an optimistic way and learn the higher timestamp if that fails, and complete in two rounds.

On the other hand, I am starting to like the invalidation idea in Hermes more. In contrast to ABD used in Gryff, Hermes allows linearizable reads while writes are ongoing, and local reads at that.
  A coordinator node issues a write to a key only if it is in the Valid state; otherwise the write is stalled. This doesn't seem to be necessary for safety, because the higher timestamped writes will preempt the lower timestamped writes. So why does Hermes do this? I think they do this, because it get replicas see the writes concluded, even when there is a deluge of writes to the same key. This may in turn help alleviate the read starvation due to constant flood of writes to the same key. I found this in the slack channel for ASPLOS'20 from the first author:
  It is safe for a read that initially found the object invalidated with version timestamp 2 and then subsequently invalidated with a version timestamp 3 to get serialized and return the version 2 value. Intuitively this is partly safe because a write with version 3 could not have started unless the write with version 2 has been committed.

Proxying reads

The base Gryff read protocol provides reads with single round-trip time latency from the coordinator to the nearest quorum including itself (1 RTT) when there are no concurrent updates. Otherwise, reads have at most 2 RTT latency. The paper discusses how read latency can be further improved in deployments across wide area networks.

Because the round-trip time to the replica that is colocated with a client process is negligible relative to the interreplica latency, replicas can coordinate reads for their colocated clients and utilize their local state in the read coordinator protocol to terminate after 1 RTT more often. When using this optimization, we say that the coordinating replica is a proxy for the client process's read.

  • Propagating Extra Data in Read Phase 1. The proxy includes in the Read1 (i.e., read-phase1) messages its current value v and carstamp cs. Upon receiving a Read1 message with this additional information, a replica applies the value and carstamp before returning its current value and carstamp. This has the effect of ensuring every replica that receives the Read1 messages will have a carstamp (and associated value) at least as large as the carstamp at the proxy when the read was invoked.
  • Updating the Proxy’s Data. The proxy also applies the values and carstamps that it receives in Read1Reply messages as it receives them and before it makes the decision of whether or not to complete the read after the first phase. If every reply contains the same carstamp, then the read completes after 1 RTT even if the carstamp at the proxy when the read was invoked is smaller than the carstamp contained in every reply.


Gryff is implemented in the same framework as EPaxos and MultiPaxos and its performance is evaluated in a geo-replicated setting. The evaluation shows that, for workloads with moderate contention, Gryff reduces p99 read latency to ∼56% of EPaxos, but has ∼2x higher write latency. This tradeoff allows Gryff to reduce service-level p50 latency to ∼60% of EPaxos for large-scale web applications whose requests fan-out into many storage-level requests.

My big problem with the evaluation is that it doesn't use leader-leases optimization in MultiPaxos to allow serving of reads locally at the leader. This standard optimization would likely lead to MultiPaxos giving the best result for read latencies in the evaluations.

Another thing missing in the evaluation is a comparison with Cassandra. Cassandra implements read/write registers via ABD-like algorithm and can give linearizability if you configure the quorums accordingly. Cassandra also has CASPaxos for compare-and-set for conditional write, which can be used to implement read-modify-write.

The evaluation shows less blocking for rmws in Gryff. Gryff achieves 2 RTT rmws when there are no conflicts and 3 RTT when there are. While Gryff must still block the execution of rmws until all dependencies have been received and executed, Gryff experiences significantly less blocking than EPaxos. This is because EPaxos needs to have dependencies on writes, but Gryff’s rmw protocol does not keep track of dependencies on writes.

We see that Gryff and EPaxos each achieve a slightly higher maximum throughput than MultiPaxos due to their leaderless structure. (This is of course at low conflict rates, because at high conflict rates EPaxos and Gryff pay a stiff price). It is easy to access MultiPaxos to per-key sharded Paxos, and that would compete and out-do Gryff and EPaxos.  For achieving best performance (for both throughput and latency) for a WAN key-value deployment, however, I suggest using our WPaxos protocol, as it can adapt to locality of access as well.

Our first Zoom DistSys reading group meeting

We did our first Zoom DistSys reading group meeting on April 1st, Wednesday 15:30 EST. We discussed the Gryff paper.

I didn't have much Zoom  experience, and this was very experimental reaching out to the world at large to run a reading group with whomever is interested.

20 people attended.  As I was introducing the format, one person starting writing chat messages, saying "this is so boring", etc.  He had connected with a phone, and the video was showing him walking probably in a market. This should have been a red flag. The meeting participants asked me to remove him, because he was pinging them and bothering them as well. That was our troll.

I had taken measures to stop zoom-bombing, since I had heard this was an issue.
  • Only the hosts and cohosts could share screen. 
  • I made two co-hosts to help with moderation. 
  • I had selected the option to disallow joining after removal.
I removed the troll, and there was no incidents after that.

The meeting took 90 minutes. The presentation from the slides was for 30 minutes. The general discussion was 25 minutes, and we used 25 minutes for the breakout rooms for focused deeper discussion on selected questions. And we had a 10 minutes of wrap up at the end, where we summarized discussion from the breakout sessions.

This meeting was the first time I used the breakout room session. I was not sure how well the breakout rooms would work, but it turned out great. I used automatic assigning of participants to the room and mentioned that I would switch users between rooms if requested. We had 3 breakout rooms, and around 5 people per room. This enabled us to meet each other in the breakout rooms, and we had a more relaxed and productive conversation. One participant in my breakout room had joined from India (2:30 am local time) and another had joined from Germany (11:30 pm local time). It was nice meeting and discussing with people passionate about distributed systems around the globe.

In the wrap up after the breakouts, I performed a poll and learned that 90% of the participants has read the paper before the meeting. This improved the quality of the discussion. It is not easy to understand a paper just by watching a 25 minute presentation.

Here are the documents for offline viewing. But you had to be there; there is a big advantage for live participation.

On the whole, the Zoom meeting went better than I expected. After we get more experience with the mechanics of Zoom, I think things will run better.

I am now hopeful that we can make this meeting sustainable. I hope we will be able to get enough volunteers for presenting papers. The paper presentation is only for 20-30 minutes, because we assume participants read the paper before the presentation. Volunteering to be a presenter is a good commitment to make for learning more about a specific paper/topic. We ask the presenter to give us a summary 4-5 days before the presentation. This gives enough time for others to get a head start for preparing for the discussion.

Here are the next papers, we will discuss in the upcoming meetings:

Friday, March 27, 2020

Zoom Distributed Systems Reading Group

A couple days ago, I tweeted this out, and was surprised how well this was received.
The involvement level distribution looks healthy. One third wants to follow offline, one third likes to attend the discussion session live, and the remaining will get more involved in the reading group by doing their readings and some volunteering to present papers.

We will start the Zoom DistSys Reading Group on Wednesday April 1st (no joke!). We will meet at my/ muratbuffalo at 15:30 EST. (remove the spaces.)

Following a very brief presentation of the paper, we will start discussing the paper together. After we identify 4-5 interesting questions/directions to dig deeper, we will go into breakout sessions, so that each group can work discuss on the question. At the end, we will gather again, and get summaries from breakout sessions. (I have been using this reading group format for my seminars and was very happy about it.)

We will use Google docs to collect comments, and to help the breakout sessions work on their questions together.

I will be recording the Zoom session and upload it on YouTube. After the meeting, I will write a blog post summarizing our understanding and findings of the paper, providing links to the video recording and Google documents.

Preparation before the Zoom meeting

The paper we will discuss is: Gryff: Unifying Consensus and Shared Registers. You can download the paper freely here. Here is a link to USENIX NSDI'20 presentation of the paper, and a link to the presentation slides.

This paper got on my radar after writing a summary of a related paper, Hermes: A fast fault-tolerant and linearizable replication protocol.

Aleksey will present the paper. Here is a link to his one page (for some definition of a page) summary for Gryff.

We like to collect questions about the paper before the Zoom Discussion. Please use this Google Document to write questions you like to get resolved about the paper. We will be using this document also as the discussion/collaboration document during the Zoom session.

Aleksey created a Google/YouTube account for distsysreading, and if you have suggestions/comments you can email to the distsysreading gmail address (just add to the name).

Tuesday, March 24, 2020

Hermes: A fast fault-tolerant and linearizable replication protocol

This paper (from ASPLOS'20 which was held remotely) is by Antonios Katsarakis, Vasilis Gavrielatos, M. R. Siavash Katebzadeh, Arpit Joshi, Aleksandar Dragojevic, Boris Grot, Vijay Nagarajan. The paper has its own website, where you can get to the video presentation, slides, and code.


Hermes is a replication protocol that guarantees linearizability. It enables local reads: a client can execute a read locally on any of the replicas. Hermes enables any replica to coordinate a write to a key, and supports concurrent writes to different keys quickly.

Too good to be true? You have to read the protocol section below to see how these are achieved.

But if you are a distributed systems expert, here is my shortcut explanation of the protocol. Hermes is simply chain replication (with CRAQ optimization) deployed with the following "chain" topology:
  1. The head and tail node of the chain is colocated in one node, called the coordinator
  2. The intermediate nodes of the "chain" are all parallel-connected (rather than serial) to the coordinator
In chain replication (or CRAQ) the replication goes in a serial manner, but  the chain replication paper mentioned that parallel wiring is possible as long as you have one head and one tail. And when you colocate the head and the tail in the same node, Voila!, you have Hermes. To clean the replicas for reads from any replica, the tail (i.e., the coordinator) sends an ACK message as in CRAQ. (see Figure 2 below.)

In addition to solving the latency problem in chain replication by using parallel-wiring, Hermes also allows multiple coordinators to help balance the coordinating load across nodes. Any node can be a coordinator for any key. Thanks to the logical clock timestamping (using node-ids as tie-breakers), the writes are total-ordered in Hermes. Since higher timestamped writes invalidate lower timestamped writes, the result of concurrent writes will be the same at each node, and linearizability is achieved even with local reads.

I summarize Hermes below, and then discuss how Hermes compare with Paxos-based protocols.


Writes can be initiated by any replica:
  1. the replica initiating the write (called coordinator) broadcasts an Invalidation (INV) message to the rest of the replicas (called followers) and waits on acknowledgments (ACKs)
  2. once all ACKs have been received; the write completes via a Validation (VAL) message broadcast by the coordinator replica
As the first step, the coordinator invalidates replicas for this key, so that linearizability is not violated if a client reads the key from a node that has an old value.

A read request can be served locally on any operational replica (i.e., one with a lease from the membership service). The replica returns the local value of the requested key only if it is in the Valid state. When an INV message for a key is received, the replica is placed in an Invalid state for that key, meaning that reads to the key cannot be served by the replica.

Membership service

In the write operation described above, the coordinator waits to hear an ACK from each replica. If a replica crashes, this results in the write to get stuck forever, right? To address this issue, there is a need for detecting crashed nodes.

Instead of making each node have a failure detector ---which is hard to keep consistent---, Hermes (similar to chain replication) employs an external Paxos-backed configuration/membership service that decides on the health of the nodes. This service acts as a single consistent (but not necessarily perfect) failure detector for the replicas. It becomes the sole source of "truth/perspective": While it can be mistaken in its judgment, it keeps every replica in Hermes consistent with respect to their view of which nodes are healthy and part of the protocol.

This Paxos-powered membership/configuration service changes configuration/view when needed, and at each view-change it increases the epoch number. This keeps Hermes safe (and eventually live) in a partially synchronous environment ---with bouts of asynchrony.

Well, there is still the problem with lease safety at replication nodes. Each replica need a lease from the membership service for this to work (again as in chain replication). See the fault-tolerance section below for how this is handled.

Concurrent writes

Hermes allows writes to different keys to proceed in parallel for impoving the throughput.

As for concurrent writes to the same key, invalidations plus logical timestamps impose a total order on these writes. This prevents conflicts and aborts, and ensures that those are correctly linearized at the replicas.

A coordinator node issues a write to a key only if it is in the Valid state; otherwise the write is stalled. This doesn't seem to be necessary for safety, because the higher timestamped writes will preempt the lower timestamped writes. So why does Hermes do this? I think they do this, because it get replicas see the writes concluded, even when there is a deluge of writes to the same key. This may in turn help alleviate the read starvation due to constant flood of writes to the same key. I found this in the slack channel for ASPLOS'20 from the first author:
  It is safe for a read that initially found the object invalidated with version timestamp 2 and then subsequently invalidated with a version timestamp 3 to get serialized and return the version 2 value. Intuitively this is partly safe because a write with version 3 could not have started unless the write with version 2 has been committed. 
This assumes no epoch change, I presume. A couple sections below, I will discuss about our Paxos-Quorum-Reads technique which does a similar thing, but without blocking the writes to wait for earlier writes to finish, and without requiring leases or a configuration/membership service.

Read-modify-write updates

Linearizability is not the whole story. You can get linearizability in Cassandra, using the ABD algorithm, which is not even subject to FLP. But the problem is ABD is not consensus, and it is not good alone for maintaining state machine replication. 

Hermes is trying to do more and achieve state machine replication. It enforces the replicas to have the same log in the same order (for the same key). The paper also shows how Hermes can support read-modify-write (RMW) updates, an atomic execution of a read followed by a write to a key (e.g., a compare-and- swap to acquire a lock).
  An RMW update in Hermes is executed similarly to a write, but it is conflicting. An RMW which is concurrently executed with another update operation to the same key may get aborted. Hermes commits an RMW if and only if the RMW has the highest timestamp amongst any concurrent updates to that key. Moreover, it purposefully assigns higher timestamps to writes compared to their concurrent RMWs. As a result, any write racing with an RMW to a given key is guaranteed to have a higher timestamp, thus safely aborting the RMW. Meanwhile, if only RMW updates are racing, the RMW with the highest node id will commit, and the rest will abort.
A recent paper, Gryff in NSDI20, also investigates this problem. It uses the ABD algorithm for read-write registers, and EPaxos in conjunction with consensus-after-register timestamps (carstamps) for the RMW updates. In Gryff, the RMW operations do not get aborted, they just get ordered correctly by EPaxos even after a conflict.

While we are on the topic of related work, I wonder how Hermes compares with RIFL:
Implementing Linearizability at Large Scale and Low Latency (SOSP'15). The paper does not cite RIFL, but it would be nice to compare and contrast the two protocols.


Hermes seamlessly recovers from a range of node and network faults thanks to its write replays, enabled by early value propagation.
  Node and network faults during a write to a key may leave the key in a permanently Invalid state in some or all of the nodes. To prevent this, Hermes allows any invalidated operational replica to replay the write to completion without violating linearizability. This is accomplished using two mechanisms. First, the new value for a key is propagated to the replicas in INV messages (see Figure 2). Such early value propagation guarantees that every invalidated node is aware of the new value. Secondly, logical timestamps enable a precise global ordering of writes in each of the replicas. By combining these ideas, a node that finds a key in an Invalid state for an extended period can safely replay a write by taking on a coordinator role and retransmitting INV messages to the replica ensemble with the original timestamp (i.e., original version number and cid), hence preserving the global write order.
For fault-tolerance, the membership service and leases at replicas play a central role. If one replica is partitioned out, the coordinator cannot make progress unless the membership service updates the membership to remove that replica. The membership service waits until the lease it granted to the partitioned replica expires. The lease expiration makes the replica invalid. The membership service then increases epoch number, and disseminates the new membership information to the replicas, and the coordinator (or any other replica node via the early value propagation technique) can make progress.

Protocol-level comparison to Paxos based solutions, and PQR

The round trip and a half protocol in Hermes has similarities (at least in terms of performance bottleneck characteristics) to the Phase-2 "accept" and Phase-3 "commit" the Paxos leader (via MultiPaxos optimization) performs with the followers.

The nice thing about Paxos based solutions is that there is no outside membership/reconfiguration box needed in that solution. Below let's discuss how well Paxos-based solutions can hold up to Hermes's features.

Hermes distributes the coordination load across replicas. EPaxos has the same feature due to opportunistic leaders approach. It is also possible to deploy Paxos with per-key sharding to the leaders (this was mentioned and compared with in EPaxos paper I think). In our WPaxos protocol, we improved over the per-key sharding to the leaders approach, and showed how WPaxos can outperform it by stealing keys and assigning it to the closest leaders to improve performance based on the access pattern in the workload.

Hermes does local reads from one replica. Megastore from Google also allows local reads from one replica with support from coordinators.

In our recent work, we introduced Paxos Quorum Reads (PQR) and showed how to perform linearizable reads from Paxos protocols without involving the leader and without using any leases. Since PQR does not require leases, it works in an asynchronous environment. PQR requires reading from majority of the nodes, to catch if there has been a newer pending update to the key. If there is a pending update, the clearing of the read can be done by just barriering on one replica. It is possible to relax the initial majority-quorum read and instead use fewer number of nodes by using a larger write quorum. While reading from multiple nodes in parallel requires more messages, it does not increase the latency. See this brief summary of Paxos Quorum Reads to learn more.


Hermes is evaluated over an RDMA-enabled reliable datastore with five replicas. The evaluation compares Hermes with ZAB and CRAQ. At 5% writes, the tail latency of Hermes is 3.6× lower than that of CRAQ and ZAB.

The performance improvement in Hermes, I think comes from using multiple coordinators, which was not made available to ZAB or CRAQ.

The figures show that "fewer writes=more reads" is better for Hermes because of the local reads in Hermes. On the other hand, observe that for uniform key distribution workloads, CRAQ is as good as Hermes for throughput even though the replicas there are serially wired instead of parallel-wired.

For latency, the improvement due to parallel-wiring replicas is significant.

Sunday, March 22, 2020

My Emacs setup

In January, I switched to the new Meh-Book Pro (typo intended).

Setting up a new laptop is hard because there are too many things to keep track of. Even so, rather than copying the state of my previous laptop to the new one, I decided to do the setup from scratch as this would help for software rejuvenation and getting rid of all the junk that accumulated on my laptop for five years.

(Side remark: A couple years ago, I heard from a top cloud computing expert say that datacenters also gather junk over time. He told me that once during a renovation of a datacenter they discovered servers that are sitting there without any use --that no one knows of--, and throw them out. He had likened the experience to getting rid of defunct stuff when you migrate to a new laptop.)

Anyways, I was stressed about the task, but then I remembered that past Murat has left me notes on how to setup the laptop. Fantastic!

I found out a couple things past-Murat had missed in the instructions. I am noting them here out of courtesy for future-Murat. I am paying it forward.

Install auctex for latex stuff and ispell for the spell checker Emacs uses. brew install ispell --with-lang-en

Don't forget to "brew cask install" the following: slack, filezilla, adblock, tla-plus-toolbox

Add the "home directory" to finder. (Stupid MacOS.)

Disable bluetooth from waking up the laptop to prevent the phone headphone to connect to the sleeping laptop and wake it up. Duh!

Emacs setup

Most of my new notes is about the Emacs setup, where I spent the bulk of my time.

I installed emacs using cask rather than from binary so I get to enjoy the ELPA package-manager support by typing "M-x package-show-package-list".

I had a problem using the previous starter-kit I had and couldn't get it to work. I searched and found kjhealy's kit highly recommended, so I installed that one.

This starter-kit takes care of everything and is easy to install. It sits in my .emacs.d directory. I also create myusername.el in that folder to add my customization.
(global-set-key (kbd "<f1>") 'org-timer-set-timer)
(global-set-key (kbd "<f2>") 'flyspell-auto-correct-word)
;; F3 store macro
;; F4 apply macro
(global-set-key (kbd "<f5>") 'org-schedule)
(global-set-key (kbd "<f6>") 'org-deadline)
;; (global-set-key (kbd "<f7>") 'org-agenda-do-date-earlier) ;; instead use  M -
;; (global-set-key (kbd "<f8>") 'org-agenda-do-date-later) ;; instead use  M +
(global-set-key (kbd "<f7>") 'org-store-link)
(global-set-key (kbd "<f8>") 'org-insert-link)

(global-set-key (kbd "<f12>") 'org-agenda)

I install visual-fill-column, and add this to myusername.el
;; set line width to 80 characters
;; first do this!! M-x package-install visual-fill-column
(setq-default fill-column 80)
(global-visual-line-mode 1)
(global-visual-fill-column-mode 1)
(auto-fill-mode -1)
(remove-hook 'text-mode-hook #'turn-on-auto-fill)
;;shortcut for auto-fill mode on/off
(global-set-key (kbd "C-c w") 'auto-fill-mode)
;; visual-line-mode for handling long lines
(add-hook 'text-mode-hook 'turn-on-visual-line-mode)
(add-hook 'text-mode-hook 'turn-on-visual-fill-column-mode)

Yes, I have a lot of junk DNA in that file.

I found that this starter kit uses the zenburn theme. It is nice pastel colors, but I prefer a light background. To be able to load other themes, I found that I need to remove the background and foreground colors in custom.el. Then  "M-x load-theme" works for loading other themes. I use leuven and sometimes anti-zenburn as my light-background Emacs theme.

I know I can open dozens of buffers in one Emacs windows, but I like to keep three Emacs windows open for better organization:
  • one window is for org-agenda and related files to organize my life (for this I still use the zenburn theme as that looks cool in agenda mode)
  • the other two windows are on research projects (for these I use the leuven theme) 

I simplified my pomodoro setup as well. I use <f1> to invoke org-timer-set-timer, and enter how many minutes I want the pomodoro for. I hooked a function to the org-timer-done, and call applescript to vocalize a celebratory message when the timer expires. This follows from the practices recommended in TinyHabits.
;;(defun my-org-timer-hook ()
;;  (sound-wav-play "~/.emacs.d/Bell.wav"))
(defun my-org-timer-hook ()
  (do-applescript "say \"Well done!\""))

(add-hook 'org-timer-done-hook 'my-org-timer-hook)

;; (setq visible-bell nil)
(setq org-clock-sound t) ;; Standard Emacs beep; visible-bell if above commented

I set the timer to 10 minutes most of the time, because that gets me started.  I can keep going with longer durations.

Related post: My Emacs Journey

Wednesday, March 18, 2020

PigPaxos: Devouring the communication bottlenecks in distributed consensus

This is our most recent work, started and led by Aleksey Charapko. (This is a joint post with him.) You can get the paper at The paper is currently under submission to a journal.

The story

One day I challenged Aleksey to give me a ballpark number on how much he thinks we can scale Paxos vertically. While sharding --as in CockroachDB and Spanner-- helps for scaling Paxos deployments horizontally, vertical scaling is about how many nodes you can cram in a single Paxos cluster, with a single conflict domain.

Aleksey, who is not known for being an optimist, said that we can scale Paxos to several hundreds of nodes! He said this may be possible by employing intermediate proxy nodes to relay the communication between the leader and followers, as this would relieve the communication bottleneck at the leader.

I thought "yeah, it is a neat trick, but maybe not that impressive, because it is very simple". Surely others must have tried this, and there must be a catch/drawback. We checked but couldn't find any previous work on this. At the Sigmod'19 reception at Van Gogh museum, I mentioned this idea to Miguel Castro (of PBFT fame among others). He liked the idea and said he couldn't think of anyone that studied this before.

The central idea in PigPaxos is to decouple the communication from the decision-making at the leader. PigPaxos revises the communication flow to replace the direct communication between the leader and followers in Paxos with a relay based communication flow. PigPaxos chooses relays randomly from follower clusters at each communication round to reduce contention and improve scalability of throughput. 

When Aleksey started evaluating PigPaxos, we were surprised by the  effectiveness of this simple technique. This shouldn't have been too much of a surprise because in our recent Sigmod paper we showed that leader bottleneck is the culprit behind the scalability problems of Paxos family of protocols and quantified on this bottleneck with back-of-the-envelope formulas. Still, the results from PigPaxos were beyond our expectations. We repeated our experiments many times, and double-checked everything before we could allow ourselves to believe these results. Employing relay nodes for relieving the leader, and randomly rotating the relay nodes for communication bottleneck shedding did wonders for the performance. We found that PigPaxos improved the throughput limit more than 3 folds over Paxos with negligible latency deterioration at 25 nodes. Even for as low as 9 nodes, we were able to see 1.5 folds throughput improvement with the same latency as Paxos.

This was a very fun paper to write. The writing went easy and quick. The analytical results section at the end of the paper added a lot to the paper. We put that section after the evaluation section to show that a simple back-of-the-envelope analysis formula can explain the evaluation results nicely.

What is up with the name?

When we started working on the idea, we were calling it BigPaxos, because our ultimate goal was to scale to hundreds of nodes. One day while shooting the shit about BigPaxos on Slack, Aleksey made a typo and wrote "PigPaxos". He didn't realize he made the typo (even though it is very hard to confuse b and p on the keyboard). I teased Aleksey for a couple of minutes by putting various pig emojis in my responses to his comments. He still didn't get the hint, and then I pointed the typo to him. We got a good laugh out of it.

I kept referring to the protocol as PigPaxos in later conversations. With the submission deadline looming, we were unable to prepare for 100 node experiments and add the optimizations we had in mind. I told Aleksey that we should call this protocol PigPaxos officially, and reserve the name BigPaxos for the large scale protocol we can show in the next paper.

Aleksey was not very receptive to the idea, thinking PigPaxos would look too informal in a research paper. He also argued that we don't have a good way to connect the name to the algorithm, all we had was a silly typo. The connection occurred to me on Slack again. In the protocol, the relay nodes wait for the followers’ responses from its cluster and piggyback them together into a single message. Well, this was a close enough association, so we went with the name PigPaxos.

Where can we use PigPaxos?

Paxos protocols are most commonly deployed with 3 and 5 nodes. But there are also several applications that require vertically scaling Paxos to run on a large number of nodes, all within the same conflict domain. One example is consistent cloud configuration management. Configuration management is required for gating new product features, conducting experiments (A/B tests), performing application-level traffic control, performing topology setup and load balancing, monitoring and remediation, updating machine learning models (which vary from KBs to GBs), controlling applications’ behaviors (related to caching, batching, prefetching etc), and controlling chain replication topologies as in Physalia.

Another example is geo-replicated databases. A consensus group in a geo-replicated database may consist of dozens of nodes across many regions around the globe. As we show in our evaluation, PigPaxos increases throughput scalability significantly across WAN deployments with a large number of nodes. Even for Paxos clusters with a small number of (say 5) nodes, large messages (such as database replication messages as in CockroachDB and Spanner) trigger a communication bottleneck at the leader. PigPaxos's randomized relaying technique can help with those bottlenecks as well.

While the idea in PigPaxos is simple and similar aggregation-based approaches have been employed in the context of weak-consistency replication protocols, PigPaxos is novel because it shows how these aggregation-based approaches can be effectively and safely integrated into the strong consistency distributed consensus protocols. The PigPaxos technique, being a simple general technique, is applicable to many Paxos-variant protocols, including Raft, Zab, WPaxos, etc.

Summary of the paper

OK, here is where the more technical content starts. Ready? I promise this will be good. You need a break from reading Covid-19 news anyways.

There are many optimizations possible over the basic scheme we outline below, but we relegate that discussion to the paper.

Background and related work

Paxos family of protocols are employed by many cloud computing services and distributed databases due to their excellent fault-tolerance properties. Unfortunately, current Paxos deployments do not scale for more than a dozen nodes due to the communication bottleneck at the leader.

In the most generic form, Paxos runs in 3 phases. Phase-1 establishes some node as the leader, phase-2 lets the leader to  impose its will onto the followers by telling what command to accept, and phase-3 finalizes the commitment by informing the followers that consensus has been reached.

The basic protocol is rather inefficient with these three communication phases, and Multi-Paxos optimization is often adopted to cut down the unnecessary pleasantries. Multi-Paxos elects one node as a stable leader for some prolonged time, and repeats the phase-2 however many times possible under the same leader, without needing to perform another phase-1. Phase-3 also gets piggybacked to some future phase-2 to reduce communication even more.

It is evident that the leader bears the brunt of the load in Paxos and MultiPaxos. In previous work, Mencius relieved leaders' workload by rotating them.  Recent blockchain consensus protocol LibraBFT from Facebook Libra also used pipelining to improve throughput (the original reason for pipelining was to reduce the effects of a Byzantine leader on the protocol). In contrast, the pipelining in PigPaxos employs random rotation of relay nodes, rather than leader rotation, and improves the throughput scalability significantly without any side effects. Since this is a very simple technique, it is more easily applicable and implementable.

PigPaxos communication flow

As shown in Figure 1&2, the communication in Paxos is direct between the leader and the followers with a fan-out to send messages and fan-in to collect the replies. PigPaxos observes that it is possible to employ intermediate nodes/proxies to help relay and aggregate the messages in this communication pattern. Instead of sending the fan-out messages to all of the followers, the leader transmits these to a small set of relay nodes, which propagate the messages to the remaining followers. The relay nodes also act as aggregators for the fan-in communication of the followers’ responses, and pass the combined results to the leader.

For simplicity sake, PigPaxos divides the entire cluster into a small static number of  relay groups, and a single relay node is chosen from each relay group randomly for every round trip communication round.  We use PigPaxos with the MultiPaxos optimization so only the phase-2 communication is performed in the normal case.

The randomized rotation of the relay nodes provide a big relief from communication bottlenecks. True, a relay node has its work cut out if it needs to aggregate responses from 10 nodes in its cluster. But since the relay nodes randomly rotate, a particular relay node will be off the hook for the several consequent rounds, and will be able to process these messages and send to the leader without getting overwhelmed.

The randomized rotation of the relay nodes also help for improving liveness when a relay crash occurs. Moreover, to guard against crashed or sluggish follower nodes, a timeout is used for setting a time threshold for followers in the group to reply. When the timeout occurs, the relay node acks to the leader with a partial aggregation.

Paxos to PigPaxos mapping

PigPaxos generalizes the communication implementation of the Paxos protocol. Paxos has $N-1$ groups, where each group has one element and the groups do not intersect with each other. In contrast in PigPaxos there are $p$ groups where $p \in \{ 1..N-1\}$.

We note that the safety and liveness proofs of Paxos do not depend on the communication implementation between the leader and follower nodes. In Paxos, maintaining correctness in spite of failures is guaranteed by quorum size and the information exchanged within the quorums, and the proofs are oblivious to how communication of this information is achieved. Therefore, PigPaxos preserves the safety and liveness properties of Paxos, as it only modifies the communication implementation. For reasoning about liveness, the message flow pattern and the use of relay nodes requires special attention, as failure of a relay node has disproportionate impact compared to the failure of a regular follower. Since PigPaxos uses random selection of relay/aggregator nodes at each round, it circumvents this problem and retains liveness.


We implemented PigPaxos in our Paxi framework with almost no changes to the core Paxos code, as we focused only on the message passing layer and relay group orchestration. The entire protocol was implemented in just 1200 lines of code. For evaluation we deployed the protocols on a cluster of up to 25 AWS EC2 m5a nodes with 2 vCPUs and 8 GB of RAM.

The PigPaxos leader communicates with only a handful of nodes for each consensus instance. Naturally, we wanted to see if relay groups get overwhelmed by the extra communication burden. To our surprise, PigPaxos with just 2 relay groups performed the best in a 25 node deployment. This suggests that the leader still remains a bottleneck and the relay groups still have resources to spare. As shown in Figure 7, for a cluster of “25 nodes divided into 2 relay groups”, PigPaxos had nearly ~25% advantage in maximum throughput compared to a 5-relay group setup.

The performance relative to Multi-Paxos on a 25-node cluster is astonishing (Figure 8). PigPaxos shows nearly 3.5 fold improvement in throughput compared to Paxos and 10 folds improvement over EPaxos.

The benefits of PigPaxos do not stop at the large clusters. To our surprise, we observed better throughput from PigPaxos on just a 5 node cluster. Yes, PigPaxos has slightly higher latency due to the additional network hop, but it is still able to edge out  some throughput improvements.


We back our empirical results with simple back-of-the-envelope load formulas. Based on our work in the SIGMOD paper, we use a simple count of messages processed by each node as a heuristic for the node’s relative load.  To start with the leader, its communication is no longer dependent on N, or the number of nodes in the cluster, and instead it is a linear function of r, the number of relay groups, plus 2 messages (incoming and outgoing) talking to the client:

Computing the relative message load on the follower is more involved, as we need to account the roles the follower can take on and the probability of each role:

Plugging our experimental configuration into these simple formulas shows us that the relay nodes are never a bottleneck (regardless of both the number of nodes N and number of relay groups r), and keeping the number of relay nodes small can move the entire system closer to the load parity between the leader and followers. The reason the relay nodes don’t become a bottleneck is because the random alternation of the relay nodes shields them from becoming hotspots: the extra traffic load a relay node incurs in one round is offset in consecutive rounds when the node no longer serves as relay.

The drawback with using a very small number of relay nodes (say only 1 relay group in the extreme) is that the performance becomes too fragile: few sluggish or crushed nodes may force the system to wait the timeouts at the relay groups. Using more relay groups mean the leader will receive majority confirmation even in the presence of sluggish/crashed nodes holding back the response from some relay groups. Finally the load that randomly rotated relays can manage is likely to be limited in practice due to hw/sw limitations at the node,  and that is also a factor to be explored.

Future work

There are several optimizations to help the Pig go Big. In our implementation, the relay group partitioning was static to keep things simple. Using dynamic relay groups we may add even more randomness to the communication process, and can deal with failures and sluggish nodes more smoothly. Another optimization is to make the relay nodes reply to the leader before collecting all the responses from peers: collecting one fewer response is not likely to affect the ability to reach majority quorum at the leader, but it prevents incurring a timeout due to one sluggish node in the relay group. More experiments are underway to see how this Pig squeals.

Here is a link to our paper again, if you are looking for some quarantine reading.

Sunday, March 15, 2020

Millions of tiny databases

This paper is by Marc Brooker, Tao Chen, and Fan Ping from Amazon Web Services. The paper appeared at USENIX NSDI 2020 at the end of February, which was held  on-site at Santa Clara. Right after that, all conferences got canceled due to the COVID-19 outbreak. Let's hope things stabilize for NSDI 2021.

What is this paper about?

This paper is about improving the availability of Amazon Elastic Block Storage (EBS).

EBS allows users to create block devices on demand and attach them to their AWS EC2 instances. EBS is maintained using chain replication, one of my favorite distributed algorithms. Chain replication takes consensus off the data path so it does not constitute a bottleneck for throughput. Data is replicated at one server after the other in the chain, without needing a leader ---when there is a leader, there is an incast bottleneck problem. Consensus is only needed when a fault occurs (or is presumed to occur) and the chain needs to be reconfigured by means of the configuration box, which implements fault-tolerant consensus via Paxos.

This paper is about that configuration box, Physalia, which oversees the chain replication systems. The specific problem considered in this paper is this: *How do you design and operate Physalia so that the availability of the EBS is maximized?*

In other words this paper is about the second order effects of the EBS replication system. But these second order effects still become very important at the AWS scale. If you have millions of nodes in EBS that need configuration boxes, you cannot rely on a single configuration box. Secondly, yes, the configuration box should not see much traffic normally, but when it does see traffic, it is bursty traffic because things went wrong. And if the configuration box layer also caves in, things will get much much worse. The paper gives an account of "21 April 2011 cascading failure and loss of availability" as an example of this.

Rather than describing new consensus protocols, in the spirit of Paxos Made Live, this paper describes the details, choices and tradeoffs that are required to put the Physalia consensus system into production.  The higher order message from the paper is that "infrastructure aware placement and careful system design can significantly reduce the effect of network partitions, infrastructure failures, and even software bugs".

Physalia architecture

It is especially important for Physalia to be available during partitions, because that is when the chain replication will require a configuration change. Physalia offers both consistency and high availability, even in the presence of network partitions, as well as minimized blast radius of failures.

Yeah, yeah, CAP impossibility result and all that. But CAP forbids the consistency-availability combination only at the very margins. There are many ways to circumvent CAP, and Physalia's idea is to not require all keys to be available to all clients. Each key needs to be available at only three points in the network: the AWS EC2 instance that is the client of the volume, the primary copy, and the replica copy. (I had made a similar point in September at this blog post.)

Each EBS volume is assigned a unique partition key at creation time, and all operations for that volume occur within that partition key. Within each partition key, Physalia offers a transactional store with a typed key-value schema, supporting strict serializable reads, writes and conditional writes over any combination of keys.

To realize this idea for reducing the blast radius for the configuration box implementation, Physalia divides a colony into a large number of cells. Each node is only used by a small subset of cells, and each cell is only used by a small subset of clients. This is why the paper is titled "millions of tiny databases". *The Physalia configuration store for chain replication of EBS is implemented as key-value stores maintained over a large number of these cells.*

In the EBS installation of Physalia, the cell performs Paxos over seven nodes. Seven was chosen to balance several concerns: durability, tail latency, availability, resource usage.

When a new cell is created, Physalia uses its knowledge of the power and network topology of the datacenter to choose a set of nodes for the cell. The choice of nodes balances two competing priorities. Nodes should be placed close to the clients to ensure that failures far away from their clients do not cause the cell to fail. They must also be placed with sufficient diversity to ensure that small-scale failures do not cause the cell to fail. Physalia tries to ensure that each node contains a different mix of cells, which reduces the probability of correlated failure due to load or poison pill transitions.

Physalia---the reconfiguration box for chain replication in EBS--- also reconfigures its cells. For this Physalia uses the Paxos reconfiguration approach presented in Lampson's 1996 paper. (I think there is a need for more research on reconfiguration in Paxos systems to make progress on realizing more adaptive and dynamic Paxos deployments.) A significant factor in the complexity of reconfiguration is the interaction with pipelining: configuration changes accepted at log position $i$ must not take effect logically until position $i + \alpha$, where $\alpha$ is the maximum allowed pipeline length.

Physalia employs reconfiguration frequently to move cells closer to their clients. It does this by replacing far-away nodes with close nodes using reconfiguration. The small data sizes in Physalia make cell reconfiguration an insignificant portion of overall datacenter traffic. Figure 7 illustrates this process of movement by iterative reconfiguration, which complete quickly typically within a minute.

When nodes join or re-join a cell, they are brought up to speed by teaching, implemented in three modes outside the core consensus protocol.
"In the bulk mode, most suitable for new nodes, the teacher (any existing node in the cell) transfers a bulk snapshot of its state machine to the learner. In the log-based mode, most suitable for nodes re-joining after a partition or pause, the teacher ships a segment of its log to the learner. We have found that this mode is triggered rather frequently in production, due to nodes temporarily falling behind during Java garbage collection pauses. Log-based learning is chosen when the size of the missing log segment is significantly smaller than the size of the entire dataset."
This is funny. In classes, I always give the Java garbage collection example for how  synchrony assumptions may be violated.


The authors used three different methods for testing.

  1. They built a test harness, called SimWorld, which abstracts networking, performance, and other systems concepts. The goal of this approach is to allow developers to write distributed systems tests, including tests that simulate packet loss, server failures, corruption, and other failure cases, as unit tests in the same language as the system itself. In this case, these unit tests run inside the developer’s IDE (or with junit at build time), with no need for test clusters or other infrastructure. A typical test which tests correctness under packet loss can be implemented in less than 10 lines of Java code, and executes in less than 100ms.
  2. As another approach they used a suite of automatically-generated tests which run the Paxos implementation through every combination of packet loss and reordering that a node can experience. This testing approach was inspired by the TLC model checker (the model checker for TLA+), and helped them build confidence that our implementation matched the formal specification. They also used the open source Jepsen tool to test the system, and make sure that the API responses are linearizable under network failure cases. This testing, which happens at the infrastructure level, was a good complement to the lower-level tests as it could exercise some under-load cases that are hard to run in the SimWorld.
  3. The team used TLA+ in three ways: writing specifications of the protocols to check that they understand them deeply, model checking specifications against correctness and liveness properties using the TLC model checker, and writing extensively commented TLA+ code to serve as the documentation of the distributed protocols. While all three of these uses added value, TLA+’s role as a sort of automatically tested (via TLC), and extremely precise, format for protocol documentation was perhaps the most useful. The code reviews, SimWorld tests, and design meetings frequently referred back to the TLA+ models of our protocols to resolve ambiguities in Java code or written communication. 
Yay, for TLA+. Here are some motivation and examples of TLA+ use from my blog.


The paper provides these graphs from production for evaluating the performance Physalia.

MAD commentary

1. I am more of a protocols/algorithms guy. This paper investigates realization and application of protocols in production rather that introducing new protocols. But it was still a good read for me, and I enjoyed it. I think another very good work relevant to this is Facebook's Delos.

2. This is from the beginning of Section 2: The design of Physalia. If I was just given this paragraph, I could easily tell this paper is coming from industry rather than academia. Academia cares about novelty and intellectual merits. It is hard to find concerns for "easy and cheap to operate", "easy to use correctly" as part of priorities of academic work.
Physalia’s goals of blast radius reduction and partition tolerance  required careful attention in the design of the data model, replication mechanism, cluster management and even operational and deployment procedures. In addition to these top-level design goals, we wanted Physalia to be easy and cheap to operate, contributing negligibly to the cost of our dataplane. We wanted its data model to be flexible enough to meet future uses in similar problem spaces, and to be easy to use correctly. This goal was inspired by the concept of misuse resistance from cryptography (GCM-SIV, for example), which aims to make primitives that are safer under misuse. Finally, we wanted Physalia to be highly scalable, able to support an entire EBS availability zone in a single installation.
3. The paper provides the following discussion about why they implemented Physalia via independent cells, rather than cells coupling in a peer-to-peer manner like Scatter. Although they don't elaborate much on this, I agree on this point. I think a Scatter-like approach may still be [made] tolerant to partitions, but I very much agree on the complexity point.
We could have avoided implementing a separate control-plane and repair workflow for Physalia, by following the example of elastic replication or Scatter. We evaluated these approaches, but decided that the additional complexity, and additional communication and dependencies between shards, were at odds with our focus on blast radius. We chose to keep our cells completely independent, and implement the control plane as a separate system.
4. An opportunity here is that the cells are distributed to nodes, and it is possible to balance the load on each node by controlling how many leader/proposer versus followers are placed on that node. I think Physalia might already be doing this. To relieve the stress on the leader/proposer of the cell, our work on linearizable Paxos quorum reads may be applicable.

Two-phase commit and beyond

In this post, we model and explore the two-phase commit protocol using TLA+. The two-phase commit protocol is practical and is used in man...