Monday, April 6, 2020

What Pasha taught me about life

Pasha joined our family when he was barely two months old. Shortly after that, the Covid-19 quarantine started, and we have been spending our lives 24/7 with Pasha.
We are all big fans of Pasha, but I particularly admire him as I am in awe of his approach to life. I fired my life-coach and decided to follow Pasha's teachings instead.

Here are the things I learned from Pasha.

Play hard, rest easy

Pasha has a lot of energy in the mornings. He bounces off the walls, climbs our curtains, and playfully harasses our wrists and ankles. If we try to snuggle with him in the morning, he runs away to continue his parkour route around the house.

At 11am, he crashes. He sleeps wherever he deems fit. If it is sunny, he finds a sunny spot in front of a window. But he also doesn't mind sleeping on the carpet in a busy room, oblivious to the foot traffic around him.

After his nap, Pasha wakes up a new cat, refreshed, and rinses and repeats.

Resting is important. Some people say work hard, party hard. No, Pasha says "work is play, play hard, and then rest easy." Pasha's philosophy is to embrace boredom and tranquility, so he can play hard again.

Treat work as play, and play as work

Pasha doesn't differentiate between work and play. He takes his play seriously and not-seriously at the same time. I know this sounds like a zen koan, but let me try to explain. Pasha knows he is playing, but he also knows he is training. He seems wired to do deliberate practice in his playing/training. He goes the extra mile when jumping or stretching, even though he knows this is play and there is no real prey he is chasing.

By training hard through his play time (which he observes religiously), he is cultivating essential skills for when they might be needed.

Pasha has intense concentration when playing. I envy this; I wish I could summon my focus as quickly and as intensely as he can. I will work on this, taking inspiration from Pasha.

Pursue your curiosity

Pasha follows his curiosity to the end, undeterred by any obstacles. Scolding him or relocating him to another room doesn't help. If he gets interested in something (e.g., the breakfast table, the laundry room), he keeps chasing it relentlessly. Yesterday he fell into the toilet bowl and had to get another bath.

Curiosity killed the cat, but satisfaction brought him back. The thing is, we can't get mad at Pasha for the trouble he causes when following his curiosity. It is very apparent that he is wired this way. There is no point in getting angry at him for something outside his control.

I am a curious person, but looking at Pasha I know there is a lot of room for improvement in this department. I wish I could be as driven by curiosity as Pasha.

Dance to your own beat

Pasha determines his own schedule, and he doesn't let us dictate ours onto him. He always naps around the same times. He takes a nap from 11pm to a little past midnight in the chair next to me in my study. When I am ready to go to bed, he wakes up without fail. I guess he gets some more sleep during the night, because he is up early in the morning, full of energy.

Pasha enjoys life. He looks fully present and content at every moment. He doesn't have high expectations from life, and he can entertain himself. He is definitely handling this quarantine better than us, and probably the life thing as well.

Treasure your family

Pasha is not needy for attention and affection. When we give him affection, he welcomes it (only briefly if he is on the prowl), and he doesn't shy away from showing his affection either. Pasha is very patient with our kids, especially our 5-year-old daughter. The reason we didn't get a cat earlier was that we were waiting for her to grow up more. My daughter can act like Elmyra Duff sometimes, but Pasha tolerates this well, and in fact he shows a lot of affection toward her.

Lick Like yourself

Pasha takes a good chunk of time taking care of his hygiene. Even the quarantine could not cause an erosion of his personal care standards. Pasha surely knows how to treat and take care of himself.

Here is Pasha again. He grew up a lot in one month's time.

Closing

Hey, it's my blog, and I can post whatever I like. If you find this post nonsensical, you do you. I learned from Pasha to be indifferent to others' opinions/evaluations.

Saturday, April 4, 2020

WormSpace: A modular foundation for simple, verifiable distributed systems

This paper is by Ji-Yong Shin, Jieung Kim, Wolf Honore, Hernán Vanzetto, Srihari Radhakrishnan, Mahesh Balakrishnan, and Zhong Shao, and it appeared at SOCC'19.

The paper introduces the Write-Once Register (WOR) abstraction and argues that the WOR should be a first-class system-building abstraction. By providing single-shot consensus via a simple data-centric API, the WOR acts as a building block for providing durability, concurrency control, and failure atomicity in distributed systems.

Each WOR is implemented via a Paxos instance, and leveraging this, WormSpace (Write-Once-Read-Many Address Space) organizes the address space into contiguous write-once segments (WOSes) and provides developers with a shared address space of durable, highly available, and strongly consistent WORs to build on. For example, a sequence of WORs can be used to impose a total order, and a set of WORs can keep decisions taken by participants in distributed transaction protocols such as 2PC.


To demonstrate its utility, the paper implements three applications over WormSpace. These applications are built entirely over the WOR API, yet provide efficiency comparable to or better than handcrafted implementations.
  • WormPaxos, a Multi-Paxos implementation
  • WormLog, a distributed shared log (omitted in my summary for brevity)
  • WormTX, a distributed, fault tolerant transaction coordinator
The main benefit of WormSpace is that developers do not need to reason about or understand Paxos to build applications on top of it, yet they retain the flexibility to choose low-level implementation details. The WormSpace API enables a more optimized design (as demonstrated for WormTX) without the need to modify the system (to rewire the communication pattern).

If you would like to see how extensions of some of these ideas are put into action, watch this video by Mahesh Balakrishnan (one of the authors of WormSpace) and Jason Flinn (U Michigan) describing the Delos storage system for Facebook's control plane.

The paper is well-written, and I really enjoyed reading this paper. I think this paper should receive more attention. The paper has another track, which I will not discuss here: Formal verification of WormSpace and reuse of this proof for verifying systems built on top, via the Certified Concurrent Abstraction Layer (CCAL) approach. Sometimes, when a paper includes many ideas, it may dilute the focus and hurt its reception/recognition.

In my summary below, I use many sentences from the paper verbatim. I am saving my MAD questions and in-depth review to the Zoom paper discussion on Wednesday 15:30 EST. Our Zoom Distributed Systems Reading Group meeting is open to anyone who is passionate about distributed systems. Here is an invitation to the Slack workspace for the reading group, where we will conduct pre-meeting discussion and announce the password-protected link to the meeting.

The WormSpace system

The WOR abstraction hides the logic for distributed coordination under a data-centric API: a client can capture a WOR; write to a captured WOR; and read the WOR.

The address space is divided into write-once segments (WOSes) of fixed size. Segments are explicitly allocated via an alloc call that takes in a segment ID and succeeds if it is as yet unallocated. Once a client has allocated a WOS, any client in the system can operate on WORs within the segment. Specifically, it can capture a WOR; write to it; and read from it.

Clients must capture an address before writing to it; the capture coordinates the replicated servers to make the write atomic and immutable. The capture call is similar to a preemptible lock (e.g., the phase 1 prepare of Paxos): the lock must be acquired to write, but it can be stolen by others. A successful capture call returns a unique, non-zero capture ID; a subsequent write by the same thread is automatically parameterized with this ID, and succeeds if the WOR has not been captured by some other client in the meantime.
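To make these semantics concrete, here is a minimal single-process toy of a WOR in Python. It models only the client-visible behavior (write-once plus preemptible capture); the real system replicates each WOR via single-shot Paxos, and the real client library parameterizes the write with the capture ID automatically. The class and its method signatures are my own, for illustration.

import itertools

class ToyWOR:
    """Single-process toy of a write-once register with preemptible capture."""
    _ids = itertools.count(1)

    def __init__(self):
        self.value = None      # written at most once
        self.holder = 0        # ID of the current capture; 0 = uncaptured

    def capture(self):
        # Like a Paxos phase 1a prepare: always succeeds, stealing any prior capture.
        self.holder = next(ToyWOR._ids)
        return self.holder

    def write(self, capture_id, value):
        # Like a Paxos phase 2a accept: succeeds only if our capture still
        # holds and the register has not been written yet.
        if self.value is None and self.holder == capture_id:
            self.value = value
            return True
        return False

    def read(self):
        return self.value      # the real protocol also completes half-written quorums

wor = ToyWOR()
cid = wor.capture()
assert wor.write(cid, "cmd-1")       # first write succeeds
assert not wor.write(cid, "cmd-2")   # write-once: a second write fails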

The WormSpace design is similar to a write-once distributed key-value store: WORs are associated with 64-bit IDs (consisting of segment IDs concatenated with offsets within the segment) and mapped to partitions, which in turn consist of replica sets of wormservers. Partitioning occurs at WOS granularity; to perform an operation on a WOR within a WOS, the client determines the partition storing the segment (via a modulo function) and issues the operation to the replica set.

Each WOR is implemented via a single-shot Paxos consensus protocol, with the wormservers within a partition acting as a set of acceptors. In the context of a single WOR, the wormservers act identically to Paxos acceptors; a capture call translates to a phase 1a prepare message, whereas a write call is a phase 2a accept message. The read protocol mirrors a phase 1a message, but if it encounters a half-written quorum, it completes the write. Each wormserver maintains a map from WOR IDs to the acceptor state for that single-shot Paxos instance. If a map entry is not found, the WOR is treated as unwritten.
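On the server side, the per-WOR acceptor state described above might look like the following sketch; this is standard single-shot Paxos message handling keyed by WOR ID, not the paper's actual code.

acceptor_state = {}   # wor_id -> {"promised": ballot, "accepted": (ballot, value)}
                      # a missing entry means the WOR is unwritten

def on_capture(wor_id, ballot):                   # phase 1a prepare
    st = acceptor_state.setdefault(wor_id, {"promised": 0, "accepted": None})
    if ballot > st["promised"]:
        st["promised"] = ballot
        return ("promise", st["accepted"])        # report any half-written value
    return ("nack", st["promised"])

def on_write(wor_id, ballot, value):              # phase 2a accept
    st = acceptor_state.setdefault(wor_id, {"promised": 0, "accepted": None})
    if ballot >= st["promised"]:
        st["promised"] = ballot
        st["accepted"] = (ballot, value)
        return "accepted"
    return "nack"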

The client-side library layers the logic for enforcing write-once segments. Each WOS segment is implemented via a set of data WORs, a single metadata WOR, and a single trim WOR. Allocating the WOS requires writing to the metadata WOR. If two clients race to allocate a WOS, the first one to capture and write the WOR wins.

WormPaxos


WormPaxos is an implementation of Multi-Paxos over WormSpace, exposing a conventional state machine replication (SMR) API to applications. Implementing Multi-Paxos over WormSpace is simple: the sequence of commands is stored in the WormSpace address space. In WormPaxos, servers that wish to replicate state act as WormSpace clients, and are called WP-servers. They can propose new commands by preparing and writing to the next free address, and learn commands by reading the address space in sequential order.

In WormPaxos, a WP-server becomes a sticky leader simply by using a batch capture on a WOS; accordingly, leadership strategies such as sticky leader, rotating leader, etc. can be implemented simply as policies on who should call the batch capture and when. The leader's identity can be stored within the metadata for each segment, obviating the need for WormSpace to know about the notion of a leader or the leadership strategies involved. If the leader crashes, a new leader that allocates the next WOS can batch capture the WOS of the previous leader, complete partially finished operations, and fill in junk values to unwritten WORs to prevent holes in the SMR/Multi-Paxos log.
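The propose/learn loop implied by this description is short enough to sketch, reusing the ToyWOR toy from the WormSpace section above; the helper names are mine, not the paper's API.

def propose(wormspace, command):
    # Propose by capturing and writing the next free WOR address.
    for addr, wor in enumerate(wormspace):
        cid = wor.capture()
        if wor.write(cid, command):
            return addr          # command decided at this log position
        # slot already written (or our capture was stolen); try the next one

def learn(wormspace, apply_fn):
    # Learn by reading addresses in sequential order until the first gap.
    for addr, wor in enumerate(wormspace):
        value = wor.read()
        if value is None:
            return addr          # caught up to the first unwritten slot
        apply_fn(value)          # commands are applied in total order

wormspace = [ToyWOR() for _ in range(1024)]   # a toy write-once segment
propose(wormspace, "set x=1")
learn(wormspace, print)                       # -> set x=1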

WormTX

2PC is known to be a blocking protocol in the presence of crash failures. WormTX shows how it can be made non-blocking by leveraging WormSpace. A number of variant protocols are presented to show how efficiency can be gradually improved.


  • [Variant A8: 8 message delays] An obvious solution is to simply store the votes in a set of per-RM (resource manager) WORs. In the WOR-based 2PC protocol, an RM initiates the protocol by contacting the TM (transaction manager) (message delay #1); the TM contacts the RMs (#2); they capture the WOR (#3 and #4), and then write to it (#5 and #6); they send their decision back to the TM (#7), which sends a commit message back to all the RMs (#8).
  • [Variant B6: 6 message delays] Each RM can allocate a dedicated WOS for its decisions and batch capture the WOS in advance. 
  • [Variant C5: 5 message delays] TM can directly observe the decision by listening for write notifications on the WOS.
  • [Variant D4: 4 message delays] Individual RMs can directly listen to each other’s WOSes; this brings us down to 4 message delays.
  • [Variant E3: 3 message delays] We do not need a TM, since the final decision is a deterministic function of the WORs (see the sketch after this list), and any RM can time out on the commit protocol and write a no vote to a blocking RM's WOR to abort the transaction. The initiating RM can simply contact the other RMs on its own to start the protocol (combining #1 and #2 of variant A8), bringing the number of delays down to 3. This variant is not described in Gray and Lamport's Consensus on Transaction Commit paper.
  • [Variant F2: 2 message delays] This works only if RMs can spontaneously start and vote.
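To ground variant E3's key idea: the commit decision is a deterministic function of the per-RM vote WORs, so any participant can compute it, and the write-once property is exactly what makes the timeout path safe. A hedged sketch, again reusing ToyWOR (the function names are mine):

def decide(vote_wors):
    # The decision is a pure function of the vote registers: commit iff
    # every RM voted yes; abort as soon as any register holds a no.
    votes = [w.read() for w in vote_wors]
    if any(v == "no" for v in votes):
        return "abort"
    if all(v == "yes" for v in votes):
        return "commit"
    return None                     # undecided: some RM hasn't voted yet

def on_timeout(blocking_rm_wor):
    # Any RM can abort the transaction by writing "no" into the blocked
    # RM's vote WOR. If the blocked RM actually voted in the meantime,
    # this write simply fails -- write-once prevents overwriting a vote.
    cid = blocking_rm_wor.capture()
    blocking_rm_wor.write(cid, "no")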

Thursday, April 2, 2020

Gryff: Unifying consensus and shared registers

This paper is by Matthew Burke, Audrey Cheng, and Wyatt Lloyd, and appeared in NSDI'20. Here is a link to the paper, slides, and video presentation.

Straight talk (from the middle of the book)

  • The model of the paper is a great contribution. Stable versus unstable ordering is a good framework to think in. Carstamp (consensus-after-register) logical clock timestamping is a good way to realize this ordering. I think carstamps will see good adoption, as they are clear, concrete, and useful.
  • Constructing a hybrid of EPaxos and ABD is a novel idea.
  • The performance of Gryff is not good. A straightforward sharded key-value implementation of Paxos would do a better job. I think Hermes is a better choice than Gryff for read-write and read-modify-write operations.

Introduction

Recently we have seen a lot of interest in unifying consensus and shared registers, the topic of this paper. I think this is because of the popularity of distributed key-value stores/systems. While consensus is often used for implementing distributed key-value stores, it is a bit of an overkill. You don't need to know the previous value of a key if you are overwriting it with a write operation: the new value is not a function of the old value. As Gryff says in the abstract: "Consensus is inefficient for workloads composed mostly of reads and writes". ABD is good enough for that, and it even gives you linearizability in the face of concurrent asynchronous execution with crash faults. However, ABD for shared registers is too weak for implementing stronger synchronization primitives. You would still need to pull out the consensus gun for the occasional read-modify-write operation.

So this led many people to think about providing read/write operations and read-modify-write operations separately for implementing distributed key-value systems. Hermes (ASPLOS'20), which I reviewed earlier, is an example of this. Fine-Grained Replicated State Machines (NSDI'20), which we will discuss in an upcoming Zoom DistSys reading group, also looks at a related problem.

Consensus vs. Shared Registers

A big contribution of the paper is the framework it introduces for thinking about read/write operations and read-modify-write operations.


Applying commands in the same order on all replicas requires an ordering mechanism that is stable, i.e., a replica knows when a command's position is fixed and it will never receive an earlier command. In asynchronous systems where processes can fail, consensus protocols are used to agree on this stable ordering.

Shared register protocols provide a linearizable ordering of operations. That ordering does not have to be stable, however, because each write operation fully defines the state of the object. Thus, a replica can safely apply a write w4 even if it does not know about earlier writes. If an earlier write w3 ever does arrive, the replica simply ignores that write because it already has the resulting state from applying w3 and then w4. Figure 1b shows shared register ordering where there is a total order of all writes (denoted by <) without stability.

A shared object in Gryff exposes the following interface:
  • READ(): returns the value of the object
  • WRITE(v): updates the value of the object to v
  • RMW(f(·)): atomically reads the value v, updates the value to f(v), and returns v
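In code terms the interface is just the following (a toy, single-process rendering of the three operations; Gryff's contribution is implementing this over replicas with the right ordering guarantees):

class SharedObject:
    def __init__(self, value=None):
        self.value = value

    def read(self):
        return self.value

    def write(self, v):
        self.value = v          # the new state is fully defined by v alone

    def rmw(self, f):
        v = self.value          # atomic read-modify-write: the new state
        self.value = f(v)       # depends on the old state
        return v

counter = SharedObject(0)
counter.write(5)
print(counter.rmw(lambda v: v + 1))   # prints 5; counter now holds 6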

Carstamps for correct ordering

The paper says that the AQS (active quorum system) protocol [2010], which tried to unify consensus and shared registers, has a subtle bug that allows rmws to be misordered. The paper says that, to simplify reasoning about correctness, it is best to enforce the interaction at a deeper level, in the ordering mechanism, by imposing a structural order on the timestamps.

To leverage this insight, the paper introduces consensus-after-register timestamps, or carstamps. Carstamps allow writes and rmws to concurrently modify the same state without serializing through a leader or incurring additional round trips. Reads use carstamps to determine consistent values without interposing on concurrent updates.
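Here is my reading of the carstamp structure as a sketch (the exact field layout is from my understanding of the paper, so treat it as illustrative): carstamps compare lexicographically; writes advance the (ts, id) part and reset the rmw counter, while rmws keep their base write's (ts, id) and bump the counter, so an rmw orders after the write it builds on without the write ever waiting on consensus.

from typing import NamedTuple

class Carstamp(NamedTuple):
    ts: int      # logical time, advanced by writes
    id: int      # writer id, breaks ties between concurrent writes
    rmwc: int    # rmw counter; 0 for plain writes

def write_cs(cs, my_id):
    return Carstamp(cs.ts + 1, my_id, 0)          # rmwc reset to 0 on writes

def rmw_cs(cs):
    return Carstamp(cs.ts, cs.id, cs.rmwc + 1)    # nests after its base update

base = Carstamp(1, 1, 0)
r = rmw_cs(base)           # (1, 1, 1): ordered after base
w = write_cs(base, 2)      # (2, 2, 0): ordered after both
assert base < r < w        # tuples compare lexicographically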



Gryff Protocol

The name Gryff stands for griffin, a hybrid between a lion and an eagle, a nod to the protocol being a hybrid between EPaxos and ABD.

The only difference between the read-write path in Gryff and multi-writer ABD is that replicas maintain a carstamp associated with the current value of the shared object, instead of a tag, so that rmws are properly ordered with respect to reads and writes.
  • Write. The rmwc field is reset to 0 (Line 15 of Algorithm 1).
  • Reads. "We make the same observation as Georgiou et al. [26] that the second phase in the read protocol of multi-writer ABD is redundant when a quorum already store the value and associated carstamp chosen in the first phase."
I like the succinct description of EPaxos in the paper; it depicts EPaxos as a generalization of Paxos with its three phases:
 EPaxos is a consensus protocol that provides optimal commit latency in the wide-area. It has three phases in failure-free executions: PreAccept, Accept, and Commit. If a command commits on the fast path (i.e., if the coordinator receives a fast/supermajority quorum of responses that all contain the same dependencies), the coordinator returns to the client after the PreAccept phase and skips the Accept phase (where it builds the final dependencies for the command by taking the union of all the dependencies that it received in the PreAccept phase). Otherwise, the command commits on the slow path after the Accept phase. Commands that do not read state complete at the beginning of the Commit phase; commands that do read state complete after a single replica, typically the coordinator, executes the command to obtain the returned state. The purpose of the PreAccept and Accept phases is to establish the dependencies for a command, or the set of commands that must be executed before the current command. The purpose of the Commit phase is for the coordinator to notify the other replicas of the agreed-upon dependencies.
Gryff makes three high-level modifications to EPaxos to unify its stable ordering with the unstable ordering of the multiwriter ABD read and write protocol.
  1. A base update attribute, base, is decided by the replicas during the same process that establishes the dependencies and the approximate sequence number for a rmw.
  2. An rmw completes after a quorum executes it.
  3. When an rmw executes, it chooses its base update from between its base attribute and the result of the previously executed rmw, prev. The result of the executed rmw is applied to the value and carstamp of the executing replica.
The first change adapts EPaxos to work with the unstable order of writes by fixing the write upon which the rmw will operate. The second change adapts EPaxos to work with reads that bypass its execution protocol and directly read state; in other words, it makes the ABD protocol able to read the EPaxos outcome. This makes the commit phase a two-phase operation: the rmw coordinator should see the commit completed by a majority quorum.

In sum, Gryff makes reads more efficient by performing them with ABD (with the above-mentioned Georgiou optimization) instead of EPaxos, where a supermajority quorum would be needed. While Gryff uses two-round ABD writes, I think that may also be reduced to a one-round write with a trick: optimistically propose with a timestamp inferred to be higher (e.g., using HLC timestamps), and if that fails, learn the higher timestamp and complete in two rounds.

On the other hand, I am starting to like the invalidation idea in Hermes more. In contrast to ABD used in Gryff, Hermes allows linearizable reads while writes are ongoing, and local reads at that.
  A coordinator node issues a write to a key only if the key is in the Valid state; otherwise the write is stalled. This doesn't seem to be necessary for safety, because the higher-timestamped writes will preempt the lower-timestamped writes. So why does Hermes do this? I think they do this because it lets replicas see writes concluded, even when there is a deluge of writes to the same key. This may in turn help alleviate read starvation due to a constant flood of writes to the same key. I found this in the Slack channel for ASPLOS'20, from the first author:
  It is safe for a read that initially found the object invalidated with version timestamp 2 and then subsequently invalidated with a version timestamp 3 to get serialized and return the version 2 value. Intuitively this is partly safe because a write with version 3 could not have started unless the write with version 2 has been committed.

Proxying reads

The base Gryff read protocol provides reads with single round-trip time latency from the coordinator to the nearest quorum including itself (1 RTT) when there are no concurrent updates. Otherwise, reads have at most 2 RTT latency. The paper discusses how read latency can be further improved in deployments across wide area networks.

Because the round-trip time to the replica that is colocated with a client process is negligible relative to the inter-replica latency, replicas can coordinate reads for their colocated clients and utilize their local state in the read coordinator protocol to terminate after 1 RTT more often. When using this optimization, we say that the coordinating replica is a proxy for the client process's read.

  • Propagating Extra Data in Read Phase 1. The proxy includes in the Read1 (i.e., read-phase1) messages its current value v and carstamp cs. Upon receiving a Read1 message with this additional information, a replica applies the value and carstamp before returning its current value and carstamp. This has the effect of ensuring every replica that receives the Read1 messages will have a carstamp (and associated value) at least as large as the carstamp at the proxy when the read was invoked.
  • Updating the Proxy’s Data. The proxy also applies the values and carstamps that it receives in Read1Reply messages as it receives them and before it makes the decision of whether or not to complete the read after the first phase. If every reply contains the same carstamp, then the read completes after 1 RTT even if the carstamp at the proxy when the read was invoked is smaller than the carstamp contained in every reply.
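Putting the base read protocol and these two proxy tweaks together, the read path looks roughly like this sketch (a single-process toy reusing the Carstamp class above; the replica interface is illustrative, not Gryff's actual code):

class GReplica:
    def __init__(self):
        self.value, self.cs = None, Carstamp(0, 0, 0)

    def apply(self, value, cs):
        if cs > self.cs:                    # larger carstamp wins
            self.value, self.cs = value, cs

    def read1(self, proxy_value, proxy_cs):
        self.apply(proxy_value, proxy_cs)   # tweak 1: apply the proxy's state first
        return self.value, self.cs

def proxy_read(proxy, others, quorum):
    replies = [(proxy.value, proxy.cs)]     # the proxy's local state counts
    for r in others[:quorum - 1]:
        v, cs = r.read1(proxy.value, proxy.cs)
        proxy.apply(v, cs)                  # tweak 2: update the proxy as replies arrive
        replies.append((v, cs))
    value, cs = max(replies, key=lambda p: p[1])
    if all(c == cs for _, c in replies):
        return value                        # all carstamps agree: done in 1 RTT
    for r in others:                        # otherwise write back (second phase)
        r.apply(value, cs)
    return value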

Evaluation

Gryff is implemented in the same framework as EPaxos and MultiPaxos and its performance is evaluated in a geo-replicated setting. The evaluation shows that, for workloads with moderate contention, Gryff reduces p99 read latency to ∼56% of EPaxos, but has ∼2x higher write latency. This tradeoff allows Gryff to reduce service-level p50 latency to ∼60% of EPaxos for large-scale web applications whose requests fan-out into many storage-level requests.

My big problem with the evaluation is that it doesn't use the leader-leases optimization in MultiPaxos, which allows serving reads locally at the leader. This standard optimization would likely make MultiPaxos give the best read latencies in the evaluation.

Another thing missing in the evaluation is a comparison with Cassandra. Cassandra implements read/write registers via an ABD-like algorithm and can give linearizability if you configure the quorums accordingly. Cassandra also supports compare-and-set for conditional writes via its Paxos-based lightweight transactions, which can be used to implement read-modify-write.



The evaluation shows less blocking for rmws in Gryff. Gryff achieves 2 RTT rmws when there are no conflicts and 3 RTT when there are. While Gryff must still block the execution of rmws until all dependencies have been received and executed, it experiences significantly less blocking than EPaxos. This is because EPaxos needs to track dependencies on writes, but Gryff's rmw protocol does not.


We see that Gryff and EPaxos each achieve a slightly higher maximum throughput than MultiPaxos due to their leaderless structure. (This is of course at low conflict rates; at high conflict rates EPaxos and Gryff pay a stiff price.) It is easy to extend MultiPaxos to per-key sharded Paxos, and that would compete with and outdo Gryff and EPaxos. For achieving the best performance (both throughput and latency) in a WAN key-value deployment, however, I suggest using our WPaxos protocol, as it can adapt to locality of access as well.

Our first Zoom DistSys reading group meeting

We did our first Zoom DistSys reading group meeting on April 1st, Wednesday 15:30 EST. We discussed the Gryff paper.

I didn't have much Zoom experience, and this was a very experimental attempt at reaching out to the world at large to run a reading group with whomever is interested.

20 people attended. As I was introducing the format, one person started writing chat messages, saying "this is so boring", etc. He had connected with a phone, and his video showed him walking, probably in a market. This should have been a red flag. The meeting participants asked me to remove him, because he was pinging and bothering them as well. That was our troll.

I had taken measures to stop Zoom-bombing, since I had heard this was an issue.
  • Only the hosts and cohosts could share screen. 
  • I made two co-hosts to help with moderation. 
  • I had selected the option to disallow joining after removal.
I removed the troll, and there were no incidents after that.

The meeting took 90 minutes. The presentation from the slides took 30 minutes. The general discussion took 25 minutes, and we used 25 minutes for breakout rooms for focused, deeper discussion of selected questions. We had 10 minutes of wrap-up at the end, where we summarized the discussion from the breakout sessions.

This meeting was the first time I used breakout rooms. I was not sure how well they would work, but they turned out great. I used automatic assignment of participants to rooms and mentioned that I would switch users between rooms if requested. We had 3 breakout rooms, with around 5 people per room. This let us meet each other in the breakout rooms, and we had a more relaxed and productive conversation. One participant in my breakout room had joined from India (2:30am local time) and another from Germany (11:30pm local time). It was nice meeting and discussing with people passionate about distributed systems from around the globe.

In the wrap-up after the breakouts, I ran a poll and learned that 90% of the participants had read the paper before the meeting. This improved the quality of the discussion. It is not easy to understand a paper just by watching a 25-minute presentation.

Here are the documents for offline viewing. But you had to be there; there is a big advantage to live participation.




On the whole, the Zoom meeting went better than I expected. After we get more experience with the mechanics of Zoom, I think things will run better.

I am now hopeful that we can make this meeting sustainable. I hope we will be able to get enough volunteers to present papers. The paper presentation is only 20-30 minutes, because we assume participants have read the paper before the presentation. Volunteering to be a presenter is a good commitment to make for learning more about a specific paper/topic. We ask the presenter to send us a summary 4-5 days before the presentation. This gives others enough time to get a head start on preparing for the discussion.

Here are the next papers we will discuss in the upcoming meetings:

Friday, March 27, 2020

Zoom Distributed Systems Reading Group

A couple of days ago, I tweeted this out and was surprised by how well it was received.
The involvement-level distribution looks healthy. One third want to follow offline, one third would like to attend the discussion session live, and the remaining third will get more involved in the reading group by doing the readings, with some volunteering to present papers.

We will start the Zoom DistSys Reading Group on Wednesday April 1st (no joke!) at 15:30 EST and meet regularly every week. We announce our meeting links (password protected) at https://join.slack.com/t/distsysreadinggroup/shared_invite/zt-dc1swfth-QSv0sDBBjS05jbz84sIRYA.

Following a very brief presentation of the paper, we will start discussing it together. After we identify 4-5 interesting questions/directions to dig deeper into, we will go into breakout sessions, so that each group can discuss its question. At the end, we will gather again and get summaries from the breakout sessions. (I have been using this reading-group format for my seminars and have been very happy with it.)

We will use Google Docs to collect comments, and to help the breakout sessions work on their questions together.

I will be recording the Zoom session and uploading it to YouTube. After the meeting, I will write a blog post summarizing our understanding and findings on the paper, providing links to the video recording and the Google documents.

Preparation before the Zoom meeting

The paper we will discuss is: Gryff: Unifying Consensus and Shared Registers. You can download the paper freely here. Here is a link to USENIX NSDI'20 presentation of the paper, and a link to the presentation slides.

This paper got on my radar after writing a summary of a related paper, Hermes: A fast fault-tolerant and linearizable replication protocol.

Aleksey will present the paper. Here is a link to his one-page (for some definition of a page) summary of Gryff.

We would like to collect questions about the paper before the Zoom discussion. Please use this Google Document to write the questions you would like to get resolved about the paper. We will also be using this document as the discussion/collaboration document during the Zoom session.

Aleksey created a Google/YouTube account for distsysreading, and if you have suggestions/comments, you can email the distsysreading Gmail address (just add gmail.com to the name).

Tuesday, March 24, 2020

Hermes: A fast fault-tolerant and linearizable replication protocol


This paper (from ASPLOS'20 which was held remotely) is by Antonios Katsarakis, Vasilis Gavrielatos, M. R. Siavash Katebzadeh, Arpit Joshi, Aleksandar Dragojevic, Boris Grot, Vijay Nagarajan. The paper has its own website, where you can get to the video presentation, slides, and code.

Introduction

Hermes is a replication protocol that guarantees linearizability. It enables local reads: a client can execute a read locally on any of the replicas. Hermes enables any replica to coordinate a write to a key, and supports concurrent writes to different keys quickly.

Too good to be true? You have to read the protocol section below to see how these are achieved.

But if you are a distributed systems expert, here is my shortcut explanation of the protocol. Hermes is simply chain replication (with the CRAQ optimization) deployed with the following "chain" topology:
  1. The head and tail nodes of the chain are colocated in one node, called the coordinator
  2. The intermediate nodes of the "chain" are all parallel-connected (rather than serial) to the coordinator
In chain replication (or CRAQ), replication proceeds in a serial manner, but the chain replication paper mentioned that parallel wiring is possible as long as you have one head and one tail. And when you colocate the head and the tail in the same node, voila!, you have Hermes. To mark the replicas clean for reads from any replica, the tail (i.e., the coordinator) sends an ACK message as in CRAQ. (See Figure 2 below.)

In addition to solving the latency problem in chain replication by using parallel-wiring, Hermes also allows multiple coordinators to help balance the coordinating load across nodes. Any node can be a coordinator for any key. Thanks to the logical clock timestamping (using node-ids as tie-breakers), the writes are total-ordered in Hermes. Since higher timestamped writes invalidate lower timestamped writes, the result of concurrent writes will be the same at each node, and linearizability is achieved even with local reads.

I summarize Hermes below, and then discuss how Hermes compare with Paxos-based protocols.

Protocol



Writes can be initiated by any replica:
  1. the replica initiating the write (called the coordinator) broadcasts an Invalidation (INV) message to the rest of the replicas (called followers) and waits on acknowledgments (ACKs)
  2. once all ACKs have been received, the write completes via a Validation (VAL) message broadcast by the coordinator replica
As the first step, the coordinator invalidates the replicas for this key, so that linearizability is not violated if a client reads the key from a node that has an old value.

A read request can be served locally on any operational replica (i.e., one with a lease from the membership service). The replica returns the local value of the requested key only if it is in the Valid state. When an INV message for a key is received, the replica is placed in an Invalid state for that key, meaning that reads to the key cannot be served by the replica.
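Here is a toy sketch of the per-key replica state and the INV/ACK/VAL exchange described above (single process, no RDMA, no membership service; the message handling is paraphrased from the paper's description, and the encoding is mine):

VALID, INVALID = "valid", "invalid"

class Replica:
    def __init__(self, node_id):
        self.node_id = node_id
        self.kv = {}                  # key -> [value, (version, cid), state]

    def on_inv(self, key, value, ts):
        entry = self.kv.setdefault(key, [None, (0, 0), VALID])
        if ts > entry[1]:                     # higher (version, cid) wins the total order
            entry[0], entry[1] = value, ts    # early value propagation
        entry[2] = INVALID                    # reads blocked until validated
        return "ack"

    def on_val(self, key, ts):
        entry = self.kv.get(key)
        if entry and entry[1] == ts:
            entry[2] = VALID                  # key is readable again

    def local_read(self, key):
        entry = self.kv.get(key)
        if entry and entry[2] == VALID:
            return entry[0]                   # served locally, still linearizable
        return None                           # invalidated: wait and retry

def write(coordinator, followers, key, value):
    version = coordinator.kv.get(key, [None, (0, 0), VALID])[1][0] + 1
    ts = (version, coordinator.node_id)       # node id breaks ties
    coordinator.on_inv(key, value, ts)
    acks = [f.on_inv(key, value, ts) for f in followers]
    if len(acks) == len(followers):           # needs ACKs from *all* replicas
        for n in [coordinator] + followers:
            n.on_val(key, ts)

nodes = [Replica(i) for i in range(3)]
write(nodes[0], nodes[1:], "x", "v1")
print(nodes[2].local_read("x"))               # -> v1, read locally at a follower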

Membership service

In the write operation described above, the coordinator waits to hear an ACK from each replica. If a replica crashes, this results in the write getting stuck forever, right? To address this issue, there is a need for detecting crashed nodes.

Instead of making each node run a failure detector ---which is hard to keep consistent--- Hermes (similar to chain replication) employs an external Paxos-backed configuration/membership service that decides on the health of the nodes. This service acts as a single consistent (but not necessarily perfect) failure detector for the replicas. It becomes the sole source of "truth/perspective": while it can be mistaken in its judgment, it keeps every replica in Hermes consistent with respect to its view of which nodes are healthy and part of the protocol.

This Paxos-powered membership/configuration service changes the configuration/view when needed, and at each view change it increases the epoch number. This keeps Hermes safe (and eventually live) in a partially synchronous environment ---with bouts of asynchrony.

Well, there is still the problem of lease safety at the replication nodes. Each replica needs a lease from the membership service for this to work (again, as in chain replication). See the fault-tolerance section below for how this is handled.

Concurrent writes

Hermes allows writes to different keys to proceed in parallel, improving throughput.

As for concurrent writes to the same key, invalidations plus logical timestamps impose a total order on these writes. This prevents conflicts and aborts, and ensures that the writes are correctly linearized at the replicas.

A coordinator node issues a write to a key only if the key is in the Valid state; otherwise the write is stalled. This doesn't seem to be necessary for safety, because the higher-timestamped writes will preempt the lower-timestamped writes. So why does Hermes do this? I think they do this because it lets replicas see writes concluded, even when there is a deluge of writes to the same key. This may in turn help alleviate read starvation due to a constant flood of writes to the same key. I found this in the Slack channel for ASPLOS'20, from the first author:
  It is safe for a read that initially found the object invalidated with version timestamp 2 and then subsequently invalidated with a version timestamp 3 to get serialized and return the version 2 value. Intuitively this is partly safe because a write with version 3 could not have started unless the write with version 2 has been committed. 
This assumes no epoch change, I presume. A couple of sections below, I discuss our Paxos Quorum Reads technique, which does a similar thing, but without blocking writes to wait for earlier writes to finish, and without requiring leases or a configuration/membership service.

Read-modify-write updates

Linearizability is not the whole story. You can get linearizability in Cassandra using the ABD algorithm, which is not even subject to FLP. But the problem is that ABD is not consensus, and it alone is not good enough for maintaining state machine replication.

Hermes is trying to do more and achieve state machine replication. It enforces that the replicas have the same log in the same order (for the same key). The paper also shows how Hermes can support read-modify-write (RMW) updates, an atomic execution of a read followed by a write to a key (e.g., a compare-and-swap to acquire a lock).
  An RMW update in Hermes is executed similarly to a write, but it is conflicting. An RMW which is concurrently executed with another update operation to the same key may get aborted. Hermes commits an RMW if and only if the RMW has the highest timestamp amongst any concurrent updates to that key. Moreover, it purposefully assigns higher timestamps to writes compared to their concurrent RMWs. As a result, any write racing with an RMW to a given key is guaranteed to have a higher timestamp, thus safely aborting the RMW. Meanwhile, if only RMW updates are racing, the RMW with the highest node id will commit, and the rest will abort.
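The abort rule in the quoted paragraph reduces to a timestamp comparison. Here is a hedged sketch; encoding "writes outrank concurrent rmws" as an extra timestamp component is my own guess at a mechanism, not the paper's actual encoding:

WRITE, RMW = 1, 0     # hypothetical: writes outrank rmws at the same version

def update_ts(version, kind, node_id):
    return (version, kind, node_id)     # compared lexicographically

def rmw_outcome(my_ts, concurrent_tss):
    # An rmw commits iff it has the highest timestamp among concurrent
    # updates to the key; otherwise it aborts.
    return "commit" if all(my_ts > t for t in concurrent_tss) else "abort"

rmw_ts = update_ts(5, RMW, 1)
write_ts = update_ts(5, WRITE, 2)            # a racing write gets a higher timestamp
print(rmw_outcome(rmw_ts, [write_ts]))       # -> abort, the write wins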
A recent paper, Gryff (NSDI'20), also investigates this problem. It uses the ABD algorithm for read-write registers, and EPaxos in conjunction with consensus-after-register timestamps (carstamps) for the RMW updates. In Gryff, the RMW operations do not get aborted; they just get ordered correctly by EPaxos, even after a conflict.

While we are on the topic of related work, I wonder how Hermes compares with RIFL: Implementing Linearizability at Large Scale and Low Latency (SOSP'15). The paper does not cite RIFL, but it would be nice to compare and contrast the two protocols.

Fault-tolerance

Hermes seamlessly recovers from a range of node and network faults thanks to its write replays, enabled by early value propagation.
  Node and network faults during a write to a key may leave the key in a permanently Invalid state in some or all of the nodes. To prevent this, Hermes allows any invalidated operational replica to replay the write to completion without violating linearizability. This is accomplished using two mechanisms. First, the new value for a key is propagated to the replicas in INV messages (see Figure 2). Such early value propagation guarantees that every invalidated node is aware of the new value. Secondly, logical timestamps enable a precise global ordering of writes in each of the replicas. By combining these ideas, a node that finds a key in an Invalid state for an extended period can safely replay a write by taking on a coordinator role and retransmitting INV messages to the replica ensemble with the original timestamp (i.e., original version number and cid), hence preserving the global write order.
For fault-tolerance, the membership service and leases at replicas play a central role. If one replica is partitioned out, the coordinator cannot make progress unless the membership service updates the membership to remove that replica. The membership service waits until the lease it granted to the partitioned replica expires. The lease expiration makes the replica invalid. The membership service then increases epoch number, and disseminates the new membership information to the replicas, and the coordinator (or any other replica node via the early value propagation technique) can make progress.
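Continuing the toy Replica sketch from the protocol section, the write replay is then just any invalidated node re-running the coordinator's steps with the value and timestamp it already holds from the INV message:

def replay_write(node, followers, key):
    # Any invalidated replica can take on the coordinator role and replay
    # the write with the original timestamp it received in the INV message
    # (early value propagation), preserving the global write order.
    value, ts, _ = node.kv[key]
    node.on_inv(key, value, ts)
    if all(f.on_inv(key, value, ts) == "ack" for f in followers):
        for n in [node] + followers:
            n.on_val(key, ts)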

Protocol-level comparison to Paxos-based solutions, and PQR

The round-trip-and-a-half protocol in Hermes has similarities (at least in terms of performance-bottleneck characteristics) to the Phase-2 "accept" and Phase-3 "commit" that the Paxos leader performs with the followers (via the MultiPaxos optimization).

The nice thing about Paxos-based solutions is that no outside membership/reconfiguration box is needed. Below, let's discuss how well Paxos-based solutions can hold up to Hermes's features.

Hermes distributes the coordination load across replicas. EPaxos has the same feature due to its opportunistic-leaders approach. It is also possible to deploy Paxos with per-key sharding across leaders (this was mentioned and compared with in the EPaxos paper, I think). In our WPaxos protocol, we improved over the per-key sharding approach and showed how WPaxos can outperform it by stealing keys and assigning them to the closest leaders, improving performance based on the access pattern in the workload.

Hermes does local reads from one replica. Megastore from Google also allows local reads from one replica with support from coordinators.

In our recent work, we introduced Paxos Quorum Reads (PQR) and showed how to perform linearizable reads from Paxos protocols without involving the leader and without using any leases. Since PQR does not require leases, it works in an asynchronous environment. PQR requires reading from a majority of the nodes to catch whether there is a newer pending update to the key. If there is a pending update, the clearing of the read can be done by just barriering on one replica. It is possible to relax the initial majority-quorum read and instead read from fewer nodes by using a larger write quorum. While reading from multiple nodes in parallel requires more messages, it does not increase latency. See this brief summary of Paxos Quorum Reads to learn more.
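A hedged sketch of the PQR read path as described above (the replica interface and field names are illustrative, not a real API):

def quorum_read(replicas, key):
    majority = len(replicas) // 2 + 1
    # Phase 1: read from a majority to catch any pending (accepted but
    # not yet committed) update to the key.
    replies = [r.get_state(key) for r in replicas[:majority]]
    newest = max(replies, key=lambda rep: rep["slot"])
    if not any(rep["pending"] for rep in replies):
        return newest["value"]        # clean read: no pending update observed
    # "Rinse" phase: barrier on a single replica until the pending update
    # commits, then return the committed value.
    return replicas[0].wait_committed(key, newest["slot"])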

Evaluation

Hermes is evaluated over an RDMA-enabled reliable datastore with five replicas. The evaluation compares Hermes with ZAB and CRAQ. At 5% writes, the tail latency of Hermes is 3.6× lower than that of CRAQ and ZAB.

The performance improvement in Hermes, I think, comes from using multiple coordinators, which was not made available to ZAB or CRAQ.

The figures show that "fewer writes = more reads" is better for Hermes because of its local reads. On the other hand, observe that for uniform key-distribution workloads, CRAQ is as good as Hermes in throughput, even though the replicas there are serially wired instead of parallel-wired.

For latency, the improvement due to parallel-wiring replicas is significant.

Sunday, March 22, 2020

My Emacs setup

In January, I switched to the new Meh-Book Pro (typo intended).

Setting up a new laptop is hard because there are too many things to keep track of. Even so, rather than copying the state of my previous laptop to the new one, I decided to do the setup from scratch, as this would help with software rejuvenation and getting rid of all the junk that had accumulated on my laptop over five years.

(Side remark: A couple of years ago, I heard a top cloud computing expert say that datacenters also gather junk over time. He told me that once, during a renovation of a datacenter, they discovered servers that were sitting there without any use --that no one knew of-- and threw them out. He likened the experience to getting rid of defunct stuff when you migrate to a new laptop.)

Anyways, I was stressed about the task, but then I remembered that past-Murat had left me notes on how to set up the laptop. Fantastic!

I found out a couple of things past-Murat had missed in the instructions. I am noting them here out of courtesy for future-Murat. I am paying it forward.

Install auctex for LaTeX stuff, and ispell for the spell checker Emacs uses: brew install ispell --with-lang-en

Don't forget to "brew cask install" the following: slack, filezilla, adblock, tla-plus-toolbox

Add the "home directory" to finder. (Stupid MacOS.)

Disable Bluetooth from waking up the laptop, to prevent the phone headphones from connecting to the sleeping laptop and waking it up. Duh!

Emacs setup

Most of my new notes are about the Emacs setup, where I spent the bulk of my time.

I installed Emacs using cask rather than from a binary, so I get to enjoy ELPA package-manager support by typing "M-x package-show-package-list".

I had a problem with the previous starter kit I used and couldn't get it to work. I searched and found kjhealy's kit highly recommended, so I installed that one.

This starter kit takes care of everything and is easy to install. It sits in my .emacs.d directory. I also created myusername.el in that folder to add my customizations.
(global-set-key (kbd "<f1>") 'org-timer-set-timer)
(global-set-key (kbd "<f2>") 'flyspell-auto-correct-word)
;; F3 store macro
;; F4 apply macro
(global-set-key (kbd "<f5>") 'org-schedule)
(global-set-key (kbd "<f6>") 'org-deadline)
;; (global-set-key (kbd "<f7>") 'org-agenda-do-date-earlier) ;; instead use  M -
;; (global-set-key (kbd "<f8>") 'org-agenda-do-date-later) ;; instead use  M +
(global-set-key (kbd "<f7>") 'org-store-link)
(global-set-key (kbd "<f8>") 'org-insert-link)

(global-set-key (kbd "<f12>") 'org-agenda)


I installed visual-fill-column, and added this to myusername.el:
;; set line width to 80 characters
;; first do this!! M-x package-install visual-fill-column
(setq-default fill-column 80)
(global-visual-line-mode 1)
(global-visual-fill-column-mode 1)
(auto-fill-mode -1)
(remove-hook 'text-mode-hook #'turn-on-auto-fill)
;;shortcut for auto-fill mode on/off
(global-set-key (kbd "C-c w") 'auto-fill-mode)
;; visual-line-mode for handling long lines
(add-hook 'text-mode-hook 'turn-on-visual-line-mode)
(add-hook 'text-mode-hook 'turn-on-visual-fill-column-mode)

Yes, I have a lot of junk DNA in that file.


I found that this starter kit uses the zenburn theme. It has nice pastel colors, but I prefer a light background. To be able to load other themes, I found that I needed to remove the background and foreground colors in custom.el. Then "M-x load-theme" works for loading other themes. I use leuven, and sometimes anti-zenburn, as my light-background Emacs theme.

I know I can open dozens of buffers in one Emacs window, but I like to keep three Emacs windows open for better organization:
  • one window is for org-agenda and related files to organize my life (for this I still use the zenburn theme as that looks cool in agenda mode)
  • the other two windows are on research projects (for these I use the leuven theme) 


I simplified my pomodoro setup as well. I use <f1> to invoke org-timer-set-timer and enter how many minutes I want the pomodoro to last. I hooked a function to org-timer-done-hook that calls AppleScript to vocalize a celebratory message when the timer expires. This follows the practices recommended in Tiny Habits.
;;(defun my-org-timer-hook ()
;;  (sound-wav-play "~/.emacs.d/Bell.wav"))
(defun my-org-timer-hook ()
  (do-applescript "say \"Well done!\""))

(add-hook 'org-timer-done-hook 'my-org-timer-hook)

;; (setq visible-bell nil)
(setq org-clock-sound t) ;; Standard Emacs beep; visible-bell if above commented

I set the timer to 10 minutes most of the time, because that gets me started. I can keep going with longer durations.

Related post: My Emacs Journey
