### The Impact of RDMA on Agreement

This paper appeared in PODC 2019. PODC stands for Principles of Distributed Computing; it is a theoretical distributed systems conference, first established in 1982. I had published my first ever paper, "Resettable Vector Clocks", at PODC 2000. The conference was held in Portland, and as a fresh graduate student attending it, I was awe-struck. On day one of the conference, I saw Leslie Lamport interrupt a talk to ask a question and protest loudly. Then Keith Marzullo in the audience (I think) pointed out the misunderstanding, and Leslie Lamport said "Nevermind" and calmed down. I also noticed that the person sitting next to me was, oh my God, Nancy Lynch! I couldn't believe my luck seeing all this distributed systems royalty in person. Also at this conference, on day three, Eric Brewer gave his CAP theorem talk.

Good times! Anyways, back to the paper.

## Contributions of the paper

Under the message-passing model, BFT consensus requires $n \geq 3f_P + 1$ (where $f_P$ is the maximum number of faulty processes) even for a synchronous system, and CFT consensus requires $n \geq 2f_P + 1$ for a partially synchronous system. This paper considers BFT and CFT consensus under the RDMA model instead, and argues that this new model opens the way to improve resilience (in terms of the fraction of node failures tolerated) compared to the classical message-passing model.

• With Byzantine failures, the paper gives an algorithm that only requires $n \geq 2f_P + 1$ processes and decides in two (network) delays in nonfaulty executions
• With crash failures, it gives an algorithm that only requires $n \geq f_P + 1$ processes and decides in two delays

The paper argues that by employing RDMA it can tap into two features not available in message-passing systems:

RDMA enables a remote process to access local memory directly through the network interface card (NIC), without involving the host CPU. To provide RDMA access to a remote process p, the CPU has to register that memory region for access by p (via a queue pair). The CPU must also specify what access level (r, w, rw) is allowed to the memory region in each protection domain for queue pairs. The permissions are dynamic: they can be changed over time. The paper assumes that Byzantine processes cannot change permissions illegally, i.e., the kernel is trusted.
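As a toy illustration of this permission model, here is a minimal Python sketch of a memory region with per-process, dynamically changeable access levels. All class and method names here are my own invention for illustration; real RDMA uses verbs like memory registration and queue pairs, not Python objects.

```python
# Toy model of an RDMA-style memory region with dynamic permissions.
# Names are illustrative only, not actual RDMA APIs.

class MemoryRegion:
    def __init__(self):
        self.data = {}
        self.perms = {}  # process id -> set of allowed ops, e.g. {"r", "w"}

    def grant(self, pid, ops):
        """Register the region for process pid with the given access level."""
        self.perms[pid] = set(ops)

    def revoke(self, pid):
        """Dynamically remove pid's access (permissions can change over time)."""
        self.perms.pop(pid, None)

    def write(self, pid, key, value):
        if "w" not in self.perms.get(pid, set()):
            raise PermissionError(f"{pid} lacks write access")
        self.data[key] = value

    def read(self, pid, key):
        if "r" not in self.perms.get(pid, set()):
            raise PermissionError(f"{pid} lacks read access")
        return self.data.get(key)
```

The point of the toy is the `revoke` call: access control lives with the memory, not with the (possibly Byzantine) remote process, which is the lever the paper's algorithms pull.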

Here is the thing though. In this new RDMA (i.e., memory-and-messages, M&M) model there are two distinct entities: processes and memories. That is, the paper adds m memories to the system in addition to the existing n processes, and then claims that it reduces the n required to tolerate f failures. Additionally, it constrains the model so that only up to a minority of the memories can fail, and the memories can only fail by crashing, not in a Byzantine way. I wish the paper were more upfront about the fact that the reduction in n comes only after m non-Byzantine memories are added to the system.

## BFT protocol

The paper presents a 2-deciding algorithm for weak Byzantine agreement with $n \geq 2f_P + 1$ processes and $m \geq 2f_M + 1$ memories, by composing two sub-algorithms:

1. a slow one (Robust Backup) that always works, and
2. a fast one (Cheap Quorum) that gives up under hard conditions

The Robust Backup algorithm is developed in two steps:

1. Developing non-equivocating broadcast to prevent Byzantine processes from sending different values to different processes
2. Using the framework of Clement et al. (combined with non-equivocating broadcast primitive) to convert a message-passing CFT consensus algorithm into a BFT algorithm

The Clement et al. transformation is able to restrict Byzantine behavior to crash failures by

• Employing trusted message-passing primitives, T-send and T-receive, using non-equivocation and signature verification on every message
• Including the full history of processes with each message, and then locally verifying whether a received message is consistent with the protocol
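A rough sketch of what these two bullets might look like in code, with signatures modeled by HMAC and the protocol-consistency check left as a caller-supplied predicate. All names and details here are my simplifications for illustration, not the actual construction from Clement et al.

```python
# Hedged sketch of T-send/T-receive: every message carries the sender's
# full signed history, and the receiver checks that history against the
# protocol before delivering. HMAC stands in for real signatures.

import hmac, hashlib, json

KEYS = {"p1": b"key-p1", "p2": b"key-p2"}  # toy per-process signing keys

def t_send(sender, history, payload):
    history = history + [payload]  # include full history with the message
    blob = json.dumps({"sender": sender, "history": history}).encode()
    sig = hmac.new(KEYS[sender], blob, hashlib.sha256).hexdigest()
    return {"sender": sender, "history": history, "sig": sig}, history

def t_receive(msg, is_consistent):
    blob = json.dumps({"sender": msg["sender"],
                       "history": msg["history"]}).encode()
    expected = hmac.new(KEYS[msg["sender"]], blob,
                        hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, msg["sig"]):
        return None  # bad signature: drop the message
    if not is_consistent(msg["history"]):
        return None  # history deviates from the protocol: treat sender as crashed
    return msg["history"][-1]
```

The key idea survives the simplification: a Byzantine sender that deviates from the protocol either fails signature verification or fails the local history check, so to the receiver it looks merely crashed.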

To apply the Clement et al. construction, the paper first shows that shared memory with SWMR registers (and no memory failures) can implement these primitives, and then shows how the M&M model can implement shared memory with SWMR registers. I avoid describing these two steps, but the main idea in both is that before delivering a message (k, m) from q, each process p checks that no other process saw a different value from q.
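That check can be sketched as follows, modeling SWMR registers as per-process dictionaries that only their owner writes. This is an illustrative simplification of the paper's construction, not its actual register implementation.

```python
# Sketch of the non-equivocation check: before delivering the k-th
# message m from sender q, a process verifies that no other process
# has recorded a *different* k-th value from q.

class NonEquivocatingBroadcast:
    def __init__(self, processes):
        # swmr[p] maps (sender, seq) -> value; only p writes swmr[p]
        self.swmr = {p: {} for p in processes}
        self.processes = processes

    def record(self, p, sender, k, value):
        """p records in its own register the k-th value it saw from sender."""
        self.swmr[p][(sender, k)] = value

    def can_deliver(self, p, sender, k, value):
        """p may deliver (k, value) from sender only if no process's
        register holds a conflicting value for the same (sender, k)."""
        for q in self.processes:
            seen = self.swmr[q].get((sender, k))
            if seen is not None and seen != value:
                return False  # sender equivocated: refuse delivery
        return True
```

A Byzantine sender that tells different processes different k-th values gets caught by the cross-check, which is exactly what prevents equivocation.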

In contrast to Robust Backup, which uses only static permissions, the Cheap Quorum algorithm uses dynamic permissions to decide in two delays in executions in which the system is synchronous and there are no failures. Cheap Quorum is not in itself a complete consensus algorithm, as it may panic/abort in executions with a fault. If Cheap Quorum aborts, it outputs an abort value, which is used to initialize Robust Backup so that their composition preserves weak Byzantine agreement. (This composition is inspired by the framework in "The Next 700 BFT Protocols" paper.)

The Cheap Quorum algorithm has a special process L, which serves both as a leader and a follower. Other processes act only as followers.

• The memory is partitioned into n + 1 regions: Region[p] for each process p, plus Region[L] in which L proposes a value
• Some of the permissions are dynamic; a panicking process may remove L's write permission to Region[L]
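Putting the two bullets together, here is a toy sketch of the follower side: a follower echoes L's proposal from Region[L] into its own region, and a timed-out follower panics by revoking L's write permission before aborting. The `Region` class, method names, and return values are my simplifications, not the paper's pseudocode.

```python
# Hedged sketch of Cheap Quorum followers and the panic/permission trick.

class Region:
    """Single-writer region whose writer set can be changed dynamically."""
    def __init__(self, writer):
        self.value = None
        self.writers = {writer}

    def write(self, pid, value):
        if pid not in self.writers:
            raise PermissionError(f"{pid} may not write this region")
        self.value = value

class CheapQuorumFollower:
    def __init__(self, pid, region_L, my_region):
        self.pid, self.region_L, self.my_region = pid, region_L, my_region

    def step(self, timed_out):
        if timed_out:
            # panic: dynamically revoke L's write permission, then abort;
            # the abort value is handed to Robust Backup as its input
            self.region_L.writers.discard("L")
            return ("abort", self.region_L.value)
        if self.region_L.value is not None:
            # echo L's proposal into this follower's own region
            self.my_region.value = self.region_L.value
        return ("ok", self.my_region.value)
```

The revocation is what makes the composition safe: after a panic, a slow leader can no longer sneak in a late write to Region[L] that the backup path would miss.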

To put the Cheap Quorum and Robust Backup algorithms together safely, we must ensure that Robust Backup decides on a value v if Cheap Quorum decided v previously. To do that, Robust Backup decides on a preferred value if at least $f_P + 1$ processes have this value as input. For this, it uses the classic crash-tolerant Paxos algorithm (run under the Robust Backup framework to ensure Byzantine tolerance), but with an initial set-up phase that ensures this safe decision. In the set-up phase, all processes send each other their input values. Each process p waits to receive n - f such messages, and adopts the value with the highest priority that it sees. This becomes the value that p uses as its input to Paxos.
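A minimal sketch of this set-up phase, with the priority scheme abstracted into (priority, value) pairs. This abstraction is my assumption for illustration; the paper's actual priority rules are more detailed.

```python
# Sketch of the set-up phase that makes the composition safe: gather
# n - f input values and adopt the highest-priority one, so a value
# that Cheap Quorum already decided (held by >= f+1 processes, hence
# carrying high priority) wins.

def setup_phase(received, n, f):
    """received: list of (priority, value) pairs from other processes.
    Waits for n - f messages (modeled here as a length check), then
    adopts the value with the highest priority seen."""
    assert len(received) >= n - f, "need n - f messages before adopting"
    priority, value = max(received, key=lambda pv: pv[0])
    return value
```

With $n \geq 2f_P + 1$, any two sets of $n - f$ messages intersect in at least one correct process, which is why a previously decided value cannot be missed by every adopter.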

## CFT Protocol

For CFT, the paper proposes Protected Memory Paxos, which is similar to Disk Paxos but uses permissions to remove two delays from it. Initially some fixed leader L has exclusive write permission to all memories; if another process becomes leader, it takes over that exclusive permission. Having exclusive permission lets a leader L optimize execution, because L can do two things simultaneously:

1. write its consensus proposal and
2. determine whether another leader took over

If L succeeds in (1), it knows no leader M took over, because M would have taken the permission away. Thus L avoids the last read in Disk Paxos, saving two delays. Protected Memory Paxos tolerates the crash of all but one process ($n \geq f_P + 1$) and of a minority of memories ($m \geq 2f_M + 1$).
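Here is a toy sketch of that optimization: a write to a memory succeeds only for the current permission holder, so a majority of successful writes doubles as evidence that no new leader took over. Class and function names are illustrative, not the paper's algorithm verbatim.

```python
# Toy sketch of the Protected Memory Paxos write-as-leadership-check.

class Memory:
    def __init__(self):
        self.writer = None   # which leader currently holds write permission
        self.slot = None

    def take_permission(self, leader):
        """A new leader grabs exclusive write access to this memory."""
        self.writer = leader

    def try_write(self, leader, value):
        if self.writer != leader:
            return False      # write fails => another leader took over
        self.slot = value
        return True

def propose(leader, memories, value, f_M):
    """Write the proposal to all memories; a majority of successful
    writes both stores the value and proves leadership, with no
    final read round (unlike Disk Paxos)."""
    acks = sum(m.try_write(leader, value) for m in memories)
    # with m >= 2*f_M + 1 memories, f_M + 1 acks is a majority
    return value if acks >= len(memories) - f_M else None
```

The single `try_write` round replaces Disk Paxos's write-then-read pattern, which is exactly where the two saved delays come from.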

The paper also proposes Aligned Paxos, which aligns Paxos and Protected Memory Paxos so that their decisions are coordinated. This way, Aligned Paxos improves the resilience of Protected Memory Paxos further, giving a 2-deciding algorithm that tolerates crashes of a minority of the combined set of memories and processes.

## Discussion

While the paper emphasizes that these algorithms are for RDMA, there is no implementation of the algorithms on RDMA, and there is no evaluation. Technically, the results of the paper are for the M&M model (which is a more general theoretical model), but of course claiming the results for the RDMA model is more attractive, because RDMA has received a lot of attention recently.

I don't know much about RDMA, but I hear it is a bit fickle and may buckle under load. Is the minority-memory-crash assumption in these algorithms sufficient to take care of this? Or, under certain loads, is it possible to have some silent failure (perhaps of more than a majority of nodes) that violates safety in any of these algorithms?

Another point raised in our Zoom reading group discussion is that the Protected Memory Paxos algorithm is actually not as fast as Fast Paxos. In Fast Paxos, the two network delays for consensus are measured from the clients to the replicas. In Protected Memory Paxos, the two network delays are measured from the leader to the memories, not from the client to the memories.