Friday, March 27, 2020

Zoom Distributed Systems Reading Group

A couple days ago, I tweeted this out, and was surprised how well this was received.
The involvement level distribution looks healthy. One third wants to follow offline, one third likes to attend the discussion session live, and the remaining will get more involved in the reading group by doing their readings and some volunteering to present papers.

We will start the Zoom DistSys Reading Group on Wednesday April 1st (no joke!) at 15:30 EST and meet regularly every week. We announce our meeting links (password protected) at

Following a very brief presentation of the paper, we will start discussing the paper together. After we identify 4-5 interesting questions/directions to dig deeper, we will go into breakout sessions, so that each group can work discuss on the question. At the end, we will gather again, and get summaries from breakout sessions. (I have been using this reading group format for my seminars and was very happy about it.)

We will use Google docs to collect comments, and to help the breakout sessions work on their questions together.

I will be recording the Zoom session and upload it on YouTube. After the meeting, I will write a blog post summarizing our understanding and findings of the paper, providing links to the video recording and Google documents.

Preparation before the Zoom meeting

The paper we will discuss is: Gryff: Unifying Consensus and Shared Registers. You can download the paper freely here. Here is a link to USENIX NSDI'20 presentation of the paper, and a link to the presentation slides.

This paper got on my radar after writing a summary of a related paper, Hermes: A fast fault-tolerant and linearizable replication protocol.

Aleksey will present the paper. Here is a link to his one page (for some definition of a page) summary for Gryff.

We like to collect questions about the paper before the Zoom Discussion. Please use this Google Document to write questions you like to get resolved about the paper. We will be using this document also as the discussion/collaboration document during the Zoom session.

Aleksey created a Google/YouTube account for distsysreading, and if you have suggestions/comments you can email to the distsysreading gmail address (just add to the name).

Tuesday, March 24, 2020

Hermes: A fast fault-tolerant and linearizable replication protocol

This paper (from ASPLOS'20 which was held remotely) is by Antonios Katsarakis, Vasilis Gavrielatos, M. R. Siavash Katebzadeh, Arpit Joshi, Aleksandar Dragojevic, Boris Grot, Vijay Nagarajan. The paper has its own website, where you can get to the video presentation, slides, and code.


Hermes is a replication protocol that guarantees linearizability. It enables local reads: a client can execute a read locally on any of the replicas. Hermes enables any replica to coordinate a write to a key, and supports concurrent writes to different keys quickly.

Too good to be true? You have to read the protocol section below to see how these are achieved.

But if you are a distributed systems expert, here is my shortcut explanation of the protocol. Hermes is simply chain replication (with CRAQ optimization) deployed with the following "chain" topology:
  1. The head and tail node of the chain is colocated in one node, called the coordinator
  2. The intermediate nodes of the "chain" are all parallel-connected (rather than serial) to the coordinator
In chain replication (or CRAQ) the replication goes in a serial manner, but  the chain replication paper mentioned that parallel wiring is possible as long as you have one head and one tail. And when you colocate the head and the tail in the same node, Voila!, you have Hermes. To clean the replicas for reads from any replica, the tail (i.e., the coordinator) sends an ACK message as in CRAQ. (see Figure 2 below.)

In addition to solving the latency problem in chain replication by using parallel-wiring, Hermes also allows multiple coordinators to help balance the coordinating load across nodes. Any node can be a coordinator for any key. Thanks to the logical clock timestamping (using node-ids as tie-breakers), the writes are total-ordered in Hermes. Since higher timestamped writes invalidate lower timestamped writes, the result of concurrent writes will be the same at each node, and linearizability is achieved even with local reads.

I summarize Hermes below, and then discuss how Hermes compare with Paxos-based protocols.


Writes can be initiated by any replica:
  1. the replica initiating the write (called coordinator) broadcasts an Invalidation (INV) message to the rest of the replicas (called followers) and waits on acknowledgments (ACKs)
  2. once all ACKs have been received; the write completes via a Validation (VAL) message broadcast by the coordinator replica
As the first step, the coordinator invalidates replicas for this key, so that linearizability is not violated if a client reads the key from a node that has an old value.

A read request can be served locally on any operational replica (i.e., one with a lease from the membership service). The replica returns the local value of the requested key only if it is in the Valid state. When an INV message for a key is received, the replica is placed in an Invalid state for that key, meaning that reads to the key cannot be served by the replica.

Membership service

In the write operation described above, the coordinator waits to hear an ACK from each replica. If a replica crashes, this results in the write to get stuck forever, right? To address this issue, there is a need for detecting crashed nodes.

Instead of making each node have a failure detector ---which is hard to keep consistent---, Hermes (similar to chain replication) employs an external Paxos-backed configuration/membership service that decides on the health of the nodes. This service acts as a single consistent (but not necessarily perfect) failure detector for the replicas. It becomes the sole source of "truth/perspective": While it can be mistaken in its judgment, it keeps every replica in Hermes consistent with respect to their view of which nodes are healthy and part of the protocol.

This Paxos-powered membership/configuration service changes configuration/view when needed, and at each view-change it increases the epoch number. This keeps Hermes safe (and eventually live) in a partially synchronous environment ---with bouts of asynchrony.

Well, there is still the problem with lease safety at replication nodes. Each replica need a lease from the membership service for this to work (again as in chain replication). See the fault-tolerance section below for how this is handled.

Concurrent writes

Hermes allows writes to different keys to proceed in parallel for impoving the throughput.

As for concurrent writes to the same key, invalidations plus logical timestamps impose a total order on these writes. This prevents conflicts and aborts, and ensures that those are correctly linearized at the replicas.

A coordinator node issues a write to a key only if it is in the Valid state; otherwise the write is stalled. This doesn't seem to be necessary for safety, because the higher timestamped writes will preempt the lower timestamped writes. So why does Hermes do this? I think they do this, because it get replicas see the writes concluded, even when there is a deluge of writes to the same key. This may in turn help alleviate the read starvation due to constant flood of writes to the same key. I found this in the slack channel for ASPLOS'20 from the first author:
  It is safe for a read that initially found the object invalidated with version timestamp 2 and then subsequently invalidated with a version timestamp 3 to get serialized and return the version 2 value. Intuitively this is partly safe because a write with version 3 could not have started unless the write with version 2 has been committed. 
This assumes no epoch change, I presume. A couple sections below, I will discuss about our Paxos-Quorum-Reads technique which does a similar thing, but without blocking the writes to wait for earlier writes to finish, and without requiring leases or a configuration/membership service.

Read-modify-write updates

Linearizability is not the whole story. You can get linearizability in Cassandra, using the ABD algorithm, which is not even subject to FLP. But the problem is ABD is not consensus, and it is not good alone for maintaining state machine replication. 

Hermes is trying to do more and achieve state machine replication. It enforces the replicas to have the same log in the same order (for the same key). The paper also shows how Hermes can support read-modify-write (RMW) updates, an atomic execution of a read followed by a write to a key (e.g., a compare-and- swap to acquire a lock).
  An RMW update in Hermes is executed similarly to a write, but it is conflicting. An RMW which is concurrently executed with another update operation to the same key may get aborted. Hermes commits an RMW if and only if the RMW has the highest timestamp amongst any concurrent updates to that key. Moreover, it purposefully assigns higher timestamps to writes compared to their concurrent RMWs. As a result, any write racing with an RMW to a given key is guaranteed to have a higher timestamp, thus safely aborting the RMW. Meanwhile, if only RMW updates are racing, the RMW with the highest node id will commit, and the rest will abort.
A recent paper, Gryff in NSDI20, also investigates this problem. It uses the ABD algorithm for read-write registers, and EPaxos in conjunction with consensus-after-register timestamps (carstamps) for the RMW updates. In Gryff, the RMW operations do not get aborted, they just get ordered correctly by EPaxos even after a conflict.

While we are on the topic of related work, I wonder how Hermes compares with RIFL:
Implementing Linearizability at Large Scale and Low Latency (SOSP'15). The paper does not cite RIFL, but it would be nice to compare and contrast the two protocols.


Hermes seamlessly recovers from a range of node and network faults thanks to its write replays, enabled by early value propagation.
  Node and network faults during a write to a key may leave the key in a permanently Invalid state in some or all of the nodes. To prevent this, Hermes allows any invalidated operational replica to replay the write to completion without violating linearizability. This is accomplished using two mechanisms. First, the new value for a key is propagated to the replicas in INV messages (see Figure 2). Such early value propagation guarantees that every invalidated node is aware of the new value. Secondly, logical timestamps enable a precise global ordering of writes in each of the replicas. By combining these ideas, a node that finds a key in an Invalid state for an extended period can safely replay a write by taking on a coordinator role and retransmitting INV messages to the replica ensemble with the original timestamp (i.e., original version number and cid), hence preserving the global write order.
For fault-tolerance, the membership service and leases at replicas play a central role. If one replica is partitioned out, the coordinator cannot make progress unless the membership service updates the membership to remove that replica. The membership service waits until the lease it granted to the partitioned replica expires. The lease expiration makes the replica invalid. The membership service then increases epoch number, and disseminates the new membership information to the replicas, and the coordinator (or any other replica node via the early value propagation technique) can make progress.

Protocol-level comparison to Paxos based solutions, and PQR

The round trip and a half protocol in Hermes has similarities (at least in terms of performance bottleneck characteristics) to the Phase-2 "accept" and Phase-3 "commit" the Paxos leader (via MultiPaxos optimization) performs with the followers.

The nice thing about Paxos based solutions is that there is no outside membership/reconfiguration box needed in that solution. Below let's discuss how well Paxos-based solutions can hold up to Hermes's features.

Hermes distributes the coordination load across replicas. EPaxos has the same feature due to opportunistic leaders approach. It is also possible to deploy Paxos with per-key sharding to the leaders (this was mentioned and compared with in EPaxos paper I think). In our WPaxos protocol, we improved over the per-key sharding to the leaders approach, and showed how WPaxos can outperform it by stealing keys and assigning it to the closest leaders to improve performance based on the access pattern in the workload.

Hermes does local reads from one replica. Megastore from Google also allows local reads from one replica with support from coordinators.

In our recent work, we introduced Paxos Quorum Reads (PQR) and showed how to perform linearizable reads from Paxos protocols without involving the leader and without using any leases. Since PQR does not require leases, it works in an asynchronous environment. PQR requires reading from majority of the nodes, to catch if there has been a newer pending update to the key. If there is a pending update, the clearing of the read can be done by just barriering on one replica. It is possible to relax the initial majority-quorum read and instead use fewer number of nodes by using a larger write quorum. While reading from multiple nodes in parallel requires more messages, it does not increase the latency. See this brief summary of Paxos Quorum Reads to learn more.


Hermes is evaluated over an RDMA-enabled reliable datastore with five replicas. The evaluation compares Hermes with ZAB and CRAQ. At 5% writes, the tail latency of Hermes is 3.6× lower than that of CRAQ and ZAB.

The performance improvement in Hermes, I think comes from using multiple coordinators, which was not made available to ZAB or CRAQ.

The figures show that "fewer writes=more reads" is better for Hermes because of the local reads in Hermes. On the other hand, observe that for uniform key distribution workloads, CRAQ is as good as Hermes for throughput even though the replicas there are serially wired instead of parallel-wired.

For latency, the improvement due to parallel-wiring replicas is significant.

Sunday, March 22, 2020

My Emacs setup

In January, I switched to the new Meh-Book Pro (typo intended).

Setting up a new laptop is hard because there are too many things to keep track of. Even so, rather than copying the state of my previous laptop to the new one, I decided to do the setup from scratch as this would help for software rejuvenation and getting rid of all the junk that accumulated on my laptop for five years.

(Side remark: A couple years ago, I heard from a top cloud computing expert say that datacenters also gather junk over time. He told me that once during a renovation of a datacenter they discovered servers that are sitting there without any use --that no one knows of--, and throw them out. He had likened the experience to getting rid of defunct stuff when you migrate to a new laptop.)

Anyways, I was stressed about the task, but then I remembered that past Murat has left me notes on how to setup the laptop. Fantastic!

I found out a couple things past-Murat had missed in the instructions. I am noting them here out of courtesy for future-Murat. I am paying it forward.

Install auctex for latex stuff and ispell for the spell checker Emacs uses. brew install ispell --with-lang-en

Don't forget to "brew cask install" the following: slack, filezilla, adblock, tla-plus-toolbox

Add the "home directory" to finder. (Stupid MacOS.)

Disable bluetooth from waking up the laptop to prevent the phone headphone to connect to the sleeping laptop and wake it up. Duh!

Emacs setup

Most of my new notes is about the Emacs setup, where I spent the bulk of my time.

I installed emacs using cask rather than from binary so I get to enjoy the ELPA package-manager support by typing "M-x package-show-package-list".

I had a problem using the previous starter-kit I had and couldn't get it to work. I searched and found kjhealy's kit highly recommended, so I installed that one.

This starter-kit takes care of everything and is easy to install. It sits in my .emacs.d directory. I also create myusername.el in that folder to add my customization.
(global-set-key (kbd "<f1>") 'org-timer-set-timer)
(global-set-key (kbd "<f2>") 'flyspell-auto-correct-word)
;; F3 store macro
;; F4 apply macro
(global-set-key (kbd "<f5>") 'org-schedule)
(global-set-key (kbd "<f6>") 'org-deadline)
;; (global-set-key (kbd "<f7>") 'org-agenda-do-date-earlier) ;; instead use  M -
;; (global-set-key (kbd "<f8>") 'org-agenda-do-date-later) ;; instead use  M +
(global-set-key (kbd "<f7>") 'org-store-link)
(global-set-key (kbd "<f8>") 'org-insert-link)

(global-set-key (kbd "<f12>") 'org-agenda)

I install visual-fill-column, and add this to myusername.el
;; set line width to 80 characters
;; first do this!! M-x package-install visual-fill-column
(setq-default fill-column 80)
(global-visual-line-mode 1)
(global-visual-fill-column-mode 1)
(auto-fill-mode -1)
(remove-hook 'text-mode-hook #'turn-on-auto-fill)
;;shortcut for auto-fill mode on/off
(global-set-key (kbd "C-c w") 'auto-fill-mode)
;; visual-line-mode for handling long lines
(add-hook 'text-mode-hook 'turn-on-visual-line-mode)
(add-hook 'text-mode-hook 'turn-on-visual-fill-column-mode)

Yes, I have a lot of junk DNA in that file.

I found that this starter kit uses the zenburn theme. It is nice pastel colors, but I prefer a light background. To be able to load other themes, I found that I need to remove the background and foreground colors in custom.el. Then  "M-x load-theme" works for loading other themes. I use leuven and sometimes anti-zenburn as my light-background Emacs theme.

I know I can open dozens of buffers in one Emacs windows, but I like to keep three Emacs windows open for better organization:
  • one window is for org-agenda and related files to organize my life (for this I still use the zenburn theme as that looks cool in agenda mode)
  • the other two windows are on research projects (for these I use the leuven theme) 

I simplified my pomodoro setup as well. I use <f1> to invoke org-timer-set-timer, and enter how many minutes I want the pomodoro for. I hooked a function to the org-timer-done, and call applescript to vocalize a celebratory message when the timer expires. This follows from the practices recommended in TinyHabits.
;;(defun my-org-timer-hook ()
;;  (sound-wav-play "~/.emacs.d/Bell.wav"))
(defun my-org-timer-hook ()
  (do-applescript "say \"Well done!\""))

(add-hook 'org-timer-done-hook 'my-org-timer-hook)

;; (setq visible-bell nil)
(setq org-clock-sound t) ;; Standard Emacs beep; visible-bell if above commented

I set the timer to 10 minutes most of the time, because that gets me started.  I can keep going with longer durations.

Related post: My Emacs Journey

Wednesday, March 18, 2020

PigPaxos: Devouring the communication bottlenecks in distributed consensus

This is our most recent work, started and led by Aleksey Charapko. (This is a joint post with him.) You can get the paper at The paper is currently under submission to a journal.

The story

One day I challenged Aleksey to give me a ballpark number on how much he thinks we can scale Paxos vertically. While sharding --as in CockroachDB and Spanner-- helps for scaling Paxos deployments horizontally, vertical scaling is about how many nodes you can cram in a single Paxos cluster, with a single conflict domain.

Aleksey, who is not known for being an optimist, said that we can scale Paxos to several hundreds of nodes! He said this may be possible by employing intermediate proxy nodes to relay the communication between the leader and followers, as this would relieve the communication bottleneck at the leader.

I thought "yeah, it is a neat trick, but maybe not that impressive, because it is very simple". Surely others must have tried this, and there must be a catch/drawback. We checked but couldn't find any previous work on this. At the Sigmod'19 reception at Van Gogh museum, I mentioned this idea to Miguel Castro (of PBFT fame among others). He liked the idea and said he couldn't think of anyone that studied this before.

The central idea in PigPaxos is to decouple the communication from the decision-making at the leader. PigPaxos revises the communication flow to replace the direct communication between the leader and followers in Paxos with a relay based communication flow. PigPaxos chooses relays randomly from follower clusters at each communication round to reduce contention and improve scalability of throughput. 

When Aleksey started evaluating PigPaxos, we were surprised by the  effectiveness of this simple technique. This shouldn't have been too much of a surprise because in our recent Sigmod paper we showed that leader bottleneck is the culprit behind the scalability problems of Paxos family of protocols and quantified on this bottleneck with back-of-the-envelope formulas. Still, the results from PigPaxos were beyond our expectations. We repeated our experiments many times, and double-checked everything before we could allow ourselves to believe these results. Employing relay nodes for relieving the leader, and randomly rotating the relay nodes for communication bottleneck shedding did wonders for the performance. We found that PigPaxos improved the throughput limit more than 3 folds over Paxos with negligible latency deterioration at 25 nodes. Even for as low as 9 nodes, we were able to see 1.5 folds throughput improvement with the same latency as Paxos.

This was a very fun paper to write. The writing went easy and quick. The analytical results section at the end of the paper added a lot to the paper. We put that section after the evaluation section to show that a simple back-of-the-envelope analysis formula can explain the evaluation results nicely.

What is up with the name?

When we started working on the idea, we were calling it BigPaxos, because our ultimate goal was to scale to hundreds of nodes. One day while shooting the shit about BigPaxos on Slack, Aleksey made a typo and wrote "PigPaxos". He didn't realize he made the typo (even though it is very hard to confuse b and p on the keyboard). I teased Aleksey for a couple of minutes by putting various pig emojis in my responses to his comments. He still didn't get the hint, and then I pointed the typo to him. We got a good laugh out of it.

I kept referring to the protocol as PigPaxos in later conversations. With the submission deadline looming, we were unable to prepare for 100 node experiments and add the optimizations we had in mind. I told Aleksey that we should call this protocol PigPaxos officially, and reserve the name BigPaxos for the large scale protocol we can show in the next paper.

Aleksey was not very receptive to the idea, thinking PigPaxos would look too informal in a research paper. He also argued that we don't have a good way to connect the name to the algorithm, all we had was a silly typo. The connection occurred to me on Slack again. In the protocol, the relay nodes wait for the followers’ responses from its cluster and piggyback them together into a single message. Well, this was a close enough association, so we went with the name PigPaxos.

Where can we use PigPaxos?

Paxos protocols are most commonly deployed with 3 and 5 nodes. But there are also several applications that require vertically scaling Paxos to run on a large number of nodes, all within the same conflict domain. One example is consistent cloud configuration management. Configuration management is required for gating new product features, conducting experiments (A/B tests), performing application-level traffic control, performing topology setup and load balancing, monitoring and remediation, updating machine learning models (which vary from KBs to GBs), controlling applications’ behaviors (related to caching, batching, prefetching etc), and controlling chain replication topologies as in Physalia.

Another example is geo-replicated databases. A consensus group in a geo-replicated database may consist of dozens of nodes across many regions around the globe. As we show in our evaluation, PigPaxos increases throughput scalability significantly across WAN deployments with a large number of nodes. Even for Paxos clusters with a small number of (say 5) nodes, large messages (such as database replication messages as in CockroachDB and Spanner) trigger a communication bottleneck at the leader. PigPaxos's randomized relaying technique can help with those bottlenecks as well.

While the idea in PigPaxos is simple and similar aggregation-based approaches have been employed in the context of weak-consistency replication protocols, PigPaxos is novel because it shows how these aggregation-based approaches can be effectively and safely integrated into the strong consistency distributed consensus protocols. The PigPaxos technique, being a simple general technique, is applicable to many Paxos-variant protocols, including Raft, Zab, WPaxos, etc.

Summary of the paper

OK, here is where the more technical content starts. Ready? I promise this will be good. You need a break from reading Covid-19 news anyways.

There are many optimizations possible over the basic scheme we outline below, but we relegate that discussion to the paper.

Background and related work

Paxos family of protocols are employed by many cloud computing services and distributed databases due to their excellent fault-tolerance properties. Unfortunately, current Paxos deployments do not scale for more than a dozen nodes due to the communication bottleneck at the leader.

In the most generic form, Paxos runs in 3 phases. Phase-1 establishes some node as the leader, phase-2 lets the leader to  impose its will onto the followers by telling what command to accept, and phase-3 finalizes the commitment by informing the followers that consensus has been reached.

The basic protocol is rather inefficient with these three communication phases, and Multi-Paxos optimization is often adopted to cut down the unnecessary pleasantries. Multi-Paxos elects one node as a stable leader for some prolonged time, and repeats the phase-2 however many times possible under the same leader, without needing to perform another phase-1. Phase-3 also gets piggybacked to some future phase-2 to reduce communication even more.

It is evident that the leader bears the brunt of the load in Paxos and MultiPaxos. In previous work, Mencius relieved leaders' workload by rotating them.  Recent blockchain consensus protocol LibraBFT from Facebook Libra also used pipelining to improve throughput (the original reason for pipelining was to reduce the effects of a Byzantine leader on the protocol). In contrast, the pipelining in PigPaxos employs random rotation of relay nodes, rather than leader rotation, and improves the throughput scalability significantly without any side effects. Since this is a very simple technique, it is more easily applicable and implementable.

PigPaxos communication flow

As shown in Figure 1&2, the communication in Paxos is direct between the leader and the followers with a fan-out to send messages and fan-in to collect the replies. PigPaxos observes that it is possible to employ intermediate nodes/proxies to help relay and aggregate the messages in this communication pattern. Instead of sending the fan-out messages to all of the followers, the leader transmits these to a small set of relay nodes, which propagate the messages to the remaining followers. The relay nodes also act as aggregators for the fan-in communication of the followers’ responses, and pass the combined results to the leader.

For simplicity sake, PigPaxos divides the entire cluster into a small static number of  relay groups, and a single relay node is chosen from each relay group randomly for every round trip communication round.  We use PigPaxos with the MultiPaxos optimization so only the phase-2 communication is performed in the normal case.

The randomized rotation of the relay nodes provide a big relief from communication bottlenecks. True, a relay node has its work cut out if it needs to aggregate responses from 10 nodes in its cluster. But since the relay nodes randomly rotate, a particular relay node will be off the hook for the several consequent rounds, and will be able to process these messages and send to the leader without getting overwhelmed.

The randomized rotation of the relay nodes also help for improving liveness when a relay crash occurs. Moreover, to guard against crashed or sluggish follower nodes, a timeout is used for setting a time threshold for followers in the group to reply. When the timeout occurs, the relay node acks to the leader with a partial aggregation.

Paxos to PigPaxos mapping

PigPaxos generalizes the communication implementation of the Paxos protocol. Paxos has $N-1$ groups, where each group has one element and the groups do not intersect with each other. In contrast in PigPaxos there are $p$ groups where $p \in \{ 1..N-1\}$.

We note that the safety and liveness proofs of Paxos do not depend on the communication implementation between the leader and follower nodes. In Paxos, maintaining correctness in spite of failures is guaranteed by quorum size and the information exchanged within the quorums, and the proofs are oblivious to how communication of this information is achieved. Therefore, PigPaxos preserves the safety and liveness properties of Paxos, as it only modifies the communication implementation. For reasoning about liveness, the message flow pattern and the use of relay nodes requires special attention, as failure of a relay node has disproportionate impact compared to the failure of a regular follower. Since PigPaxos uses random selection of relay/aggregator nodes at each round, it circumvents this problem and retains liveness.


We implemented PigPaxos in our Paxi framework with almost no changes to the core Paxos code, as we focused only on the message passing layer and relay group orchestration. The entire protocol was implemented in just 1200 lines of code. For evaluation we deployed the protocols on a cluster of up to 25 AWS EC2 m5a nodes with 2 vCPUs and 8 GB of RAM.

The PigPaxos leader communicates with only a handful of nodes for each consensus instance. Naturally, we wanted to see if relay groups get overwhelmed by the extra communication burden. To our surprise, PigPaxos with just 2 relay groups performed the best in a 25 node deployment. This suggests that the leader still remains a bottleneck and the relay groups still have resources to spare. As shown in Figure 7, for a cluster of “25 nodes divided into 2 relay groups”, PigPaxos had nearly ~25% advantage in maximum throughput compared to a 5-relay group setup.

The performance relative to Multi-Paxos on a 25-node cluster is astonishing (Figure 8). PigPaxos shows nearly 3.5 fold improvement in throughput compared to Paxos and 10 folds improvement over EPaxos.

The benefits of PigPaxos do not stop at the large clusters. To our surprise, we observed better throughput from PigPaxos on just a 5 node cluster. Yes, PigPaxos has slightly higher latency due to the additional network hop, but it is still able to edge out  some throughput improvements.


We back our empirical results with simple back-of-the-envelope load formulas. Based on our work in the SIGMOD paper, we use a simple count of messages processed by each node as a heuristic for the node’s relative load.  To start with the leader, its communication is no longer dependent on N, or the number of nodes in the cluster, and instead it is a linear function of r, the number of relay groups, plus 2 messages (incoming and outgoing) talking to the client:

Computing the relative message load on the follower is more involved, as we need to account the roles the follower can take on and the probability of each role:

Plugging our experimental configuration into these simple formulas shows us that the relay nodes are never a bottleneck (regardless of both the number of nodes N and number of relay groups r), and keeping the number of relay nodes small can move the entire system closer to the load parity between the leader and followers. The reason the relay nodes don’t become a bottleneck is because the random alternation of the relay nodes shields them from becoming hotspots: the extra traffic load a relay node incurs in one round is offset in consecutive rounds when the node no longer serves as relay.

The drawback with using a very small number of relay nodes (say only 1 relay group in the extreme) is that the performance becomes too fragile: few sluggish or crushed nodes may force the system to wait the timeouts at the relay groups. Using more relay groups mean the leader will receive majority confirmation even in the presence of sluggish/crashed nodes holding back the response from some relay groups. Finally the load that randomly rotated relays can manage is likely to be limited in practice due to hw/sw limitations at the node,  and that is also a factor to be explored.

Future work

There are several optimizations to help the Pig go Big. In our implementation, the relay group partitioning was static to keep things simple. Using dynamic relay groups we may add even more randomness to the communication process, and can deal with failures and sluggish nodes more smoothly. Another optimization is to make the relay nodes reply to the leader before collecting all the responses from peers: collecting one fewer response is not likely to affect the ability to reach majority quorum at the leader, but it prevents incurring a timeout due to one sluggish node in the relay group. More experiments are underway to see how this Pig squeals.

Here is a link to our paper again, if you are looking for some quarantine reading.

Sunday, March 15, 2020

Millions of tiny databases

This paper is by Marc Brooker, Tao Chen, and Fan Ping from Amazon Web Services. The paper appeared at USENIX NSDI 2020 at the end of February, which was held  on-site at Santa Clara. Right after that, all conferences got canceled due to the COVID-19 outbreak. Let's hope things stabilize for NSDI 2021.

What is this paper about?

This paper is about improving the availability of Amazon Elastic Block Storage (EBS).

EBS allows users to create block devices on demand and attach them to their AWS EC2 instances. EBS is maintained using chain replication, one of my favorite distributed algorithms. Chain replication takes consensus off the data path so it does not constitute a bottleneck for throughput. Data is replicated at one server after the other in the chain, without needing a leader ---when there is a leader, there is an incast bottleneck problem. Consensus is only needed when a fault occurs (or is presumed to occur) and the chain needs to be reconfigured by means of the configuration box, which implements fault-tolerant consensus via Paxos.

This paper is about that configuration box, Physalia, which oversees the chain replication systems. The specific problem considered in this paper is this: *How do you design and operate Physalia so that the availability of the EBS is maximized?*

In other words this paper is about the second order effects of the EBS replication system. But these second order effects still become very important at the AWS scale. If you have millions of nodes in EBS that need configuration boxes, you cannot rely on a single configuration box. Secondly, yes, the configuration box should not see much traffic normally, but when it does see traffic, it is bursty traffic because things went wrong. And if the configuration box layer also caves in, things will get much much worse. The paper gives an account of "21 April 2011 cascading failure and loss of availability" as an example of this.

Rather than describing new consensus protocols, in the spirit of Paxos Made Live, this paper describes the details, choices and tradeoffs that are required to put the Physalia consensus system into production.  The higher order message from the paper is that "infrastructure aware placement and careful system design can significantly reduce the effect of network partitions, infrastructure failures, and even software bugs".

Physalia architecture

It is especially important for Physalia to be available during partitions, because that is when the chain replication will require a configuration change. Physalia offers both consistency and high availability, even in the presence of network partitions, as well as minimized blast radius of failures.

Yeah, yeah, CAP impossibility result and all that. But CAP forbids the consistency-availability combination only at the very margins. There are many ways to circumvent CAP, and Physalia's idea is to not require all keys to be available to all clients. Each key needs to be available at only three points in the network: the AWS EC2 instance that is the client of the volume, the primary copy, and the replica copy. (I had made a similar point in September at this blog post.)

Each EBS volume is assigned a unique partition key at creation time, and all operations for that volume occur within that partition key. Within each partition key, Physalia offers a transactional store with a typed key-value schema, supporting strict serializable reads, writes and conditional writes over any combination of keys.

To realize this idea for reducing the blast radius for the configuration box implementation, Physalia divides a colony into a large number of cells. Each node is only used by a small subset of cells, and each cell is only used by a small subset of clients. This is why the paper is titled "millions of tiny databases". *The Physalia configuration store for chain replication of EBS is implemented as key-value stores maintained over a large number of these cells.*

In the EBS installation of Physalia, the cell performs Paxos over seven nodes. Seven was chosen to balance several concerns: durability, tail latency, availability, resource usage.

When a new cell is created, Physalia uses its knowledge of the power and network topology of the datacenter to choose a set of nodes for the cell. The choice of nodes balances two competing priorities. Nodes should be placed close to the clients to ensure that failures far away from their clients do not cause the cell to fail. They must also be placed with sufficient diversity to ensure that small-scale failures do not cause the cell to fail. Physalia tries to ensure that each node contains a different mix of cells, which reduces the probability of correlated failure due to load or poison pill transitions.

Physalia---the reconfiguration box for chain replication in EBS--- also reconfigures its cells. For this Physalia uses the Paxos reconfiguration approach presented in Lampson's 1996 paper. (I think there is a need for more research on reconfiguration in Paxos systems to make progress on realizing more adaptive and dynamic Paxos deployments.) A significant factor in the complexity of reconfiguration is the interaction with pipelining: configuration changes accepted at log position $i$ must not take effect logically until position $i + \alpha$, where $\alpha$ is the maximum allowed pipeline length.

Physalia employs reconfiguration frequently to move cells closer to their clients. It does this by replacing far-away nodes with close nodes using reconfiguration. The small data sizes in Physalia make cell reconfiguration an insignificant portion of overall datacenter traffic. Figure 7 illustrates this process of movement by iterative reconfiguration, which complete quickly typically within a minute.

When nodes join or re-join a cell, they are brought up to speed by teaching, implemented in three modes outside the core consensus protocol.
"In the bulk mode, most suitable for new nodes, the teacher (any existing node in the cell) transfers a bulk snapshot of its state machine to the learner. In the log-based mode, most suitable for nodes re-joining after a partition or pause, the teacher ships a segment of its log to the learner. We have found that this mode is triggered rather frequently in production, due to nodes temporarily falling behind during Java garbage collection pauses. Log-based learning is chosen when the size of the missing log segment is significantly smaller than the size of the entire dataset."
This is funny. In classes, I always give the Java garbage collection example for how  synchrony assumptions may be violated.


The authors used three different methods for testing.

  1. They built a test harness, called SimWorld, which abstracts networking, performance, and other systems concepts. The goal of this approach is to allow developers to write distributed systems tests, including tests that simulate packet loss, server failures, corruption, and other failure cases, as unit tests in the same language as the system itself. In this case, these unit tests run inside the developer’s IDE (or with junit at build time), with no need for test clusters or other infrastructure. A typical test which tests correctness under packet loss can be implemented in less than 10 lines of Java code, and executes in less than 100ms.
  2. As another approach they used a suite of automatically-generated tests which run the Paxos implementation through every combination of packet loss and reordering that a node can experience. This testing approach was inspired by the TLC model checker (the model checker for TLA+), and helped them build confidence that our implementation matched the formal specification. They also used the open source Jepsen tool to test the system, and make sure that the API responses are linearizable under network failure cases. This testing, which happens at the infrastructure level, was a good complement to the lower-level tests as it could exercise some under-load cases that are hard to run in the SimWorld.
  3. The team used TLA+ in three ways: writing specifications of the protocols to check that they understand them deeply, model checking specifications against correctness and liveness properties using the TLC model checker, and writing extensively commented TLA+ code to serve as the documentation of the distributed protocols. While all three of these uses added value, TLA+’s role as a sort of automatically tested (via TLC), and extremely precise, format for protocol documentation was perhaps the most useful. The code reviews, SimWorld tests, and design meetings frequently referred back to the TLA+ models of our protocols to resolve ambiguities in Java code or written communication. 
Yay, for TLA+. Here are some motivation and examples of TLA+ use from my blog.


The paper provides these graphs from production for evaluating the performance Physalia.

MAD commentary

1. I am more of a protocols/algorithms guy. This paper investigates realization and application of protocols in production rather that introducing new protocols. But it was still a good read for me, and I enjoyed it. I think another very good work relevant to this is Facebook's Delos.

2. This is from the beginning of Section 2: The design of Physalia. If I was just given this paragraph, I could easily tell this paper is coming from industry rather than academia. Academia cares about novelty and intellectual merits. It is hard to find concerns for "easy and cheap to operate", "easy to use correctly" as part of priorities of academic work.
Physalia’s goals of blast radius reduction and partition tolerance  required careful attention in the design of the data model, replication mechanism, cluster management and even operational and deployment procedures. In addition to these top-level design goals, we wanted Physalia to be easy and cheap to operate, contributing negligibly to the cost of our dataplane. We wanted its data model to be flexible enough to meet future uses in similar problem spaces, and to be easy to use correctly. This goal was inspired by the concept of misuse resistance from cryptography (GCM-SIV, for example), which aims to make primitives that are safer under misuse. Finally, we wanted Physalia to be highly scalable, able to support an entire EBS availability zone in a single installation.
3. The paper provides the following discussion about why they implemented Physalia via independent cells, rather than cells coupling in a peer-to-peer manner like Scatter. Although they don't elaborate much on this, I agree on this point. I think a Scatter-like approach may still be [made] tolerant to partitions, but I very much agree on the complexity point.
We could have avoided implementing a separate control-plane and repair workflow for Physalia, by following the example of elastic replication or Scatter. We evaluated these approaches, but decided that the additional complexity, and additional communication and dependencies between shards, were at odds with our focus on blast radius. We chose to keep our cells completely independent, and implement the control plane as a separate system.
4. An opportunity here is that the cells are distributed to nodes, and it is possible to balance the load on each node by controlling how many leader/proposer versus followers are placed on that node. I think Physalia might already be doing this. To relieve the stress on the leader/proposer of the cell, our work on linearizable Paxos quorum reads may be applicable.

Tuesday, March 10, 2020

Recent Media diet: Breakfast of the champions, and The Messiah

Breakfast of the champions

I like  Kurt Vonnegut's writing a lot. I had read Cat's Craddle and TimeQuake when I was a graduate student interested in reading mostly science fiction. Of course these books are nominally science fiction. Cat's Cradle was written in the presence of the threat of nuclear destruction at cold war and Cuban crisis. The underlying theme in both Cat's Cradle and TimeQuake is free will, or rather the illusion of free will. (After Kurt Vonnegut, Ted Chiang picked up this free will and fate theme and does beautiful work on this.)

Somehow this book, the Breakfast of Champions, was more depressing than any other book I ever read. I didn't get this depressed even with the Cixin Liu's final book, Death's End, of the Three Body Problem trilogy.

Kurt Vonnegut deals with very dark content in this book. An underlying theme is again free will, but this time woven around the "human condition" and misery. His light style makes this darker... He weaponizes satire and sarcasm very effectively. He drops a couple references to how meaningless is life causally and moves on with his sarcastic freestyle writing. He drops a rape reference causally, and moves on. That is cruel. That is many times more cruel than discussing these longer and acknowledging their real weight. The way he makes passing remarks of injustice, bad, pain... is very depressing.

The writing in Breakfast of the Champions got too heavy for me very quickly.  I couldn't finish the audiobook. I gave it up, but again picked up after a week, because his writing is really good. He is a master of words. But maybe his books should come with trigger warnings for people who are prone to depression and suicidal thoughts.

The Messiah (with a spoiler at the very end)

I don't watch very little TV or Netflix. Last thing I watched was The Good Place (highly recommend it!). I think I came across The Messiah in a Netflix preview, and had time to kill, so watched the first episode. But then I kept watching and binged through the show in two days. Crazy...

I liked it a lot. It is a very engaging show. The main lesson from the show is that it is easy to manipulate people. We are very gullible, and the system we built, called the modern world, is actually quite fragile, and things can take a nose dive quickly.

The way things are going these days, with interesting/upset elections, tendency to move towards totalitarianism in many countries, increased friction in Middle East, viruses, market crashes, all it takes to make things unstable is some information warfare and psychological operations. If you can convince people of an apocalyptic future, you can manipulate them.

I am especially concerned about this with the prevalence of social media manipulation. Disinformation operations are real. Many social media strategists and bots manipulate news and perceptions skillfully, and not-so-skillfully. people get radicalized on the web, aided by the click rate maximizing algorithms feedback.
Lerner also divides psychological warfare operations into three categories: 
  1. White propaganda (Omissions and Emphasis): Truthful and not strongly biased, where the source of information is acknowledged.
  2. Grey propaganda (Omissions, Emphasis and Racial/Ethnic/Religious Bias): Largely truthful, containing no information that can be proven wrong; the source is not identified.
  3. Black propaganda (Commissions of falsification): Inherently deceitful, information given in the product is attributed to a source that was not responsible for its creation.
Lerner points out that grey and black operations ultimately have a heavy cost, in that the target population sooner or later recognizes them as propaganda and discredits the source. 
The last sentence made me snort. This sounds so naive now. In my mind, I heard the algorithmic machine say "hold my beer, Lerner".

SPOILER about the show:

Here is my guess about the show.

The first season is done. I guess that in the next season, we will see Cibril emerging as the real Messiah. That will make a season that can be as interesting as the first season.

Sunday, March 8, 2020

Blockchain podcasts

I came across this list of Crypto podcasts recently. The list is so long, it is practically useless.

I googled to see what would be 5-10 high quality podcasts on blockchains. I worked with those shorter lists.

I found that many of the recommended podcasts are coming from the finance angle, and some of them are mostly for speculating on coins for investing. I hated most of those.

Another type of podcasts were from people who got involved in the area with Bitcoin (through the Bitcoin magazine and conferences). I didn't like those podcasts either as these guys tend to be fanatical, and also not very technical. But since they are in the field and trenches, I did not write them off completely and listened to some select episodes.

Overall I am disappointed with the quality of the podcasts in this domain. Please recommend me good podcasts you are aware of. There should be 3-4 gems among this huge list of podcasts. (Epicenter and Unchained was sort of OK.)

"Block Zero" podcast by Kevin Rose (of Digg fame) stood out in the list. I knew this was going to be good, because Kevin Rose is a good communicator, smart guy, and a newcomer to the scene. I listened to 5 episodes from Block Zero,  here is what I make of it.

  1. Blockchain people have a problem with buying into their own propaganda. I think the reason why blockchain people are so self-unaware is because money distorts their judgment. They have all the incentive to see only the good things about their platform and only bad things about other platforms.
  2. And yes, blockchain people have money (or potential money). They are able to raise millions of dollars either via VC or token sales, but I don't think what they build matches what they raised.
  3. All the blockchain people in the podcasts said "We are a platform, all other services/applications can plug to us."  Everyone wants everyone else to integrate into their platform.
  4. As a result there is an abundance of platforms but no clear killer application. Previously, people would say "but this is trustless" when asked about what is the application. Now the word thrown around is DeFi (which is not even a word). The value proposition of blockchains in finance or decentralized finance is not clear. It is interesting.. One application I liked (from episode 1) was the use of Bitcoin for donations. If you give Bitcoin as donation, you can track how it is actually used as it is spend. Yes, this suggest using the total transparency in the chain works, and lack of privacy as a feature rather than treating it as a bug. Haha, trackable money.  

Again, my stance on blockchain work is that I am very happy that there is activity here. I think something good will come out eventually from this activity. This is already a net gain for distributed consensus field.

Friday, March 6, 2020

I have seen things

Yesterday Twitter spoke and told me that I am an old man, with overwhelming decisiveness. If this has been with any more votes, I would be declared part of the vulnerable population for Corona virus, one of the expendables as people seem to refer to them.

Twitter followers... I thought we were friends!

But, Ok, I get the point. I have seen things.

I was born in 1976. I understand that in the eyes of millennials 1976 is around the same time period as 1796. My son sometimes asks me if TV was invented when I was a child.

Yes, we had a black and white TV when I was growing up. And a dial phone, that was tethered to the wall.

At 7th grade, my dad got us a Commodore 64, and I played Boulderdash, Load Lode runner, Falcon Patrol etc.

At 9th grade, I saw the TV broadcasting the first Gulf War. I didn't follow it, because at that time it didn't feel like Turkey was part of Middle East. Back then we thought Turkey was part of Europe. We used to call it the bridge to Europe.

In 1993, when I started college, I used the Mosaic browser. The Netscape browser arrived couple years later. My university had Sun workstations, and also black and green terminals.

My university was named Middle East Technical University. (Yeah, somehow, I still didn't connect the dots then...) We learned Pascal, C, Lisp, ML, Prolog. C++ was new back then, the department started a C++ course when I was in my 3rd year.

I started an MS+PhD at The Ohio State University in 1998. My flight from Istanbul to JFK had a lot of cigarette smokers -- smoking was permitted on the planes then. My flight from JFK to Columbus was a slow propeller plane.

In 2000, I was among the first to use 802.11b cards and WAPs at home, because one of my roommates was working at a company developing WAPs and writing drivers for 802.11b cards.

I have seen peer-to-peer networks become a thing, and then not become a thing.

I got my first cell phone in 2004. It was a Motorola Razr. I didn't like it... I didn't have much use for it.

In 2007 iPhone was released, and I bought one. I liked the iPhone a lot.

Yeah, I have seen things, but that doesn't make me old. It was interesting times, and I feel lucky I got to see computers and information technologies taking off.

Not having a Facebook account doesn't make me old.  I am a Twitter person, I never saw much point in Facebook.

I also want to come clean and say it. I never watched the Game of Thrones, not a single episode. I just know that winter is involved (learned this from Twitter memes), and there is a red wedding where a lot of people died (learned this from this CS paper).

When you miss a lot of things in popular culture, you may start to become irrelevant. But I think missing a couple things is OK. I don't want to be the same as everyone else. "If you only read the books that everyone else is reading, you can only think what everyone else is thinking." -- Haruki Murakami, Norwegian Wood

In order to delay old age, I try to start new things and get involved in new experiences. Entropy is hard to beat, I know. But, not today old age, not today.

Two-phase commit and beyond

In this post, we model and explore the two-phase commit protocol using TLA+. The two-phase commit protocol is practical and is used in man...