Design and Analysis of a Logless Dynamic Reconfiguration Protocol

This paper appeared in OPODIS'21 and describes dynamic reconfiguration in MongoDB.

So, what is dynamic reconfiguration? The core Raft protocol implements state machine replication (SMR) using a static set of servers. (Please read this to learn about how MongoDB adopted Raft for a pull-based SMR.) To ensure availability in the presence of faults, SMR systems must be able to dynamically (and safely) replace failed nodes with healthy ones. This is known as dynamic reconfiguration.

MongoDB logless reconfiguration

Since its inception, the MongoDB replication system has provided a custom, ad hoc, legacy protocol for dynamic reconfiguration of replicas. This legacy protocol managed configurations in a logless fashion, i.e., each server only stored its latest configuration. It decoupled reconfiguration processing from the main database operation log. The legacy protocol, however, was known to be unsafe in certain cases. 

Revising that legacy protocol, this paper presents a redesigned safe reconfiguration protocol, MongoRaftReconfig, with rigorous safety guarantees. A primary goal of the protocol was to keep design and implementation complexity low.

Why didn't MongoDB use Raft re-configuration protocol? 

The Raft consensus protocol (2014) provided a dynamic reconfiguration algorithm (a critical safety bug was found later, showing that reconfiguration protocols are tricky). Raft uses the main operation log (oplog) for both normal operations and reconfiguration operations. This coupling imposes fundamental restrictions on the operation of the two logs.

MongoRaftReconfig avoids this by separating the oplog and "config state machine" (CSM), allowing reconfigurations to bypass the oplog SMR. We will revisit this in the evaluation section. Decoupling the CSM from the main operation log SMR also  allows for a logless optimization: it is sufficient to store only the latest version of the config state. This allows the CSM to avoid complexities related to garbage collection of old log entries and simplifies the mechanism for state propagation between servers.

Below, we present the MongoRaftReconfig protocol and discuss its correctness. The paper includes TLA+ model checking and a manual proof, which are used for verifying MongoRaftReconfig’s key safety properties.


MongoRaftReconfig protocol

Raft reconfiguration consists of two alternate algorithms: single server membership change and joint consensus. This paper focuses exclusively on the single server membership change protocol. The single server change approach aims to simplify reconfiguration by allowing only reconfigurations that add or remove a single server.

As in Raft-reconfig, by restricting to a single server change at a time, MongoRaftReconfig ensures that all quorums of two adjacent configurations (C to C') overlap with each other. MongoRaftReconfig also imposes additional restrictions to ensure

  • deactivation of old configurations (to prevent them from executing disruptive operations --e.g. electing a primary or committing a write), and
  • state transfer from the old configuration to the new configuration before the new one becomes active.

Let's formalize this a bit. 

A configuration is defined as a tuple (m,v,t), where m is a member set, v is a numeric configuration version, and t is the numeric term of the configuration. The v and t together tie the reconfiguration protocol, CSM, to the replicated state machine protocol for the oplog as we discuss below in the algorithm description.  This is achieved by totally ordering configurations by their (version, term) pair, where term is compared first, followed by version.

Reconfigurations can only be executed on primary servers, and they update the primary's current local configuration C to the specified configuration C'. As in RaftReconfig, in MongoRaftReconfig any reconfiguration that moves from C to C' is required to satisfy the quorum overlap condition i.e. QuorumsOverlap(C.m,C'.m).  The below conditions must also be satisfied before a primary server in term T can execute a reconfiguration out of its current configuration C.

  • Q1. Config Quorum Check: There must be a quorum of servers in C.m that are currently in configuration C.
  • Q2. Term Quorum Check: There must be a quorum of servers in C.m that are currently in term T.
  • P1. Oplog Commitment: All oplog entries committed in terms =< T must be committed on some quorum of servers in C.m.

Q1, when coupled with the election restrictions  (as we discuss below in elections section), achieves deactivation by ensuring that configurations earlier than C can no longer elect a primary.

Q2 ensures that term information from older configurations is correctly propagated to newer configurations, while P1 ensures that previously committed oplog entries are properly transferred to the current configuration, ensuring that any primary in a current or later configuration will contain these entries.

A big insight in the algorithm was to realize that the CSM and oplog protocols need to be combined/pinned together for safety, and this is done by using the (v,t) as a pair, and requiring that CSM commits the config on the most recent term, and the RSM follows the committed configs in sequential order.  

After a reconfiguration has occurred on a primary, the updated configuration needs to be communicated to secondaries. In MongoRaftReconfig, config state propagation is implemented by the SendConfig action, which transfers configuration state from one server to another. Secondaries receive information about the configurations of other servers via periodic heartbeats. They determine whether one configuration is newer than another using the total lexicographical ordering on the (version, term) pair. A secondary can update its configuration to any that is newer than its current configuration.

When a node runs for election in MongoStaticRaft, it must ensure its log is appropriately up to date and that it can garner a quorum of votes in its term. In MongoRaftReconfig, there is an additional restriction on voting behavior that depends on configuration ordering. If a replica set server is a candidate for election in configuration Ci, then a prospective voter in configuration Cj may only cast a vote for the candidate if Cj is less than or equal to Ci .

Furthermore, when a node wins an election, it must update its current configuration with its new term before it is allowed to execute subsequent reconfigurations. That is, if a node with current configuration (m, v, t) wins election in term t', it will update its configuration to (m, v, t') before allowing any reconfigurations to be processed. This behavior is necessary to deactivate concurrent reconfigurations that may occur on primaries in a different term.


Correctness

LeaderCompleteness property states that if a log entry has been committed in term T, then it must be present in the logs of all primary servers in terms > T.

ElectionSafety is a key, auxiliary lemma that is required in order to show LeaderCompleteness. Election safety state that: For all s, t \in Server such that s ̸= t, it is not the case that both s and t are primary and have the same term.

In order not to violate the property that all quorums of any two configurations overlap (which MongoStaticRaft relies on for safety), MongoRaftReconfig must appropriately deactivate past configurations before creating new configurations. Deactivated configurations cannot elect a new leader or execute a reconfiguration. Otherwise, the old primary (which still thinks it is primary) can institute a reconfiguration, and pull the rug (one crucial node needed for majority intersection) under the new primary, and this causes the violation the property that all quorums of any two configurations overlap, which MongoStaticRaft relies on for safety.

In addition to deactivation of configurations, MongoRaftReconfig must also ensure that term information from one configuration is properly transferred to subsequent configurations, so that later configurations know about elections that occurred in earlier configurations. For example, if an election occurred in term T in configuration C, even if C is deactivated by the time C' is created, the protocol must also ensure that C' is aware of the fact that an election in T occurred in C.

Moreover, MongoRaftReconfig also ensures that newer configurations appropriately disable commitment of log entries in older terms. CSM only moves ahead through committed configs sequentially: the CSM can choose the next config and commit it only if its current one is committed. The primary must write the current config again with its latest term and wait for it to be propagated to a majority.

The paper is accompanied with TLA+ models, which seem really nice. I will start playing with them.  I think the people who worked on the TLA+ models had a deep understanding of the protocol. This reminds me of this quote from Byron Cook's recent talk (recommended watch).

You (the formal methods person) become the only one who actually understands the system right. They don't understand it...  There's fantasy, the documentation, the code, and the individuals. And no one agrees on what's going on for any complex system.


Evaluation

As we mentioned in the introduction, in standard Raft, the main operation log is used for both normal operations and reconfiguration operations. This coupling imposes fundamental restrictions on the operation of the two logs.

This behavior is stronger than necessary for safety: it is not strictly necessary to commit these log entries before executing a reconfiguration. The only fundamental requirements are that previously committed log entries are committed by the rules of the current configuration, and that the current configuration has satisfied the necessary safety preconditions. Raft achieves this goal implicitly, but more conservatively than necessary, by committing the entry Cj and all entries behind it. This ensures that all previously committed log entries, in addition to the uncommitted operations U , are now committed in Cj , but it is not strictly necessary to pipeline a reconfiguration behind commitment of U.

MongoRaftReconfig avoids this by separating the oplog and config state machine and their rules for commitment and reconfiguration, allowing reconfigurations to bypass the oplog if necessary. Note that Oplog Commitment (P1) is easier to satisfy (if not already satisfied) than Raft's insistence on committing all the entries that happened to fall before Cj in the oplog.  

P1. Oplog Commitment: All oplog entries committed in terms =< T must be committed on some quorum of servers in C.m.


The evaluation section simulates a degraded disk scenario to highlight the benefit of the decoupled CSM execution. It argues that decoupling CSM execution allows MongoRaftReconfig to successfully reconfigure the system in such a degraded state, restoring oplog write availability by removing the failed nodes and adding in new, healthy nodes.

The paper examines the degraded disk scenario to mention a caveat, and to argue that even under that caveat MongoRaftReconfig provides an advantage. "Note that if a replica set server experiences a period of degradation (e.g. a slow disk), both the oplog and reconfiguration channels will be affected, which would seem to nullify the benefits of decoupling the reconfiguration and oplog replication channels. In practice, however, the operations handled by the oplog are likely orders of magnitude more resource intensive than reconfigurations, which typically involve writing a negligible amount of data. So, even on a degraded server, reconfigurations should be able to complete successfully when more intensive oplog operations become prohibitively slow, since the resource requirements of reconfigurations are extremely lightweight."


Related work

When presenting Paxos, for reconfiguration, Lamport proposed limiting the length of the command pipeline window to $\alpha > 0$ and only activating the new config chosen at slot i after slot $i + \alpha$. Depending on the value of $\alpha$, this approach either limits throughput or latency of the system. 

In contrast, in MongoDB, the wait on command commit is only done on-demand when reconfiguration is happening.



Matchmaker Paxos is a realization/implementation of Vertical Paxos with a more integrated deployment to the Paxos protocol. Vertical Paxos requires an external master, which is itself implemented using state machine replication. The matchmakers in MM are analogous to that external master/Paxos-box and show that such a reconfiguration does not require a nested invocation of state machine replication. Matchmaker Paxos uses and generalizes the approach in Vertical Paxos for reconfiguration and is OK with lazy/concurrent state transfer.

I had also written about reconfiguration for atomic storage, but that is an easier problem than reconfiguration on a state machine replication system.

Comments

Popular posts from this blog

The end of a myth: Distributed transactions can scale

Foundational distributed systems papers

Hints for Distributed Systems Design

Learning about distributed systems: where to start?

Metastable failures in the wild

Scalable OLTP in the Cloud: What’s the BIG DEAL?

SIGMOD panel: Future of Database System Architectures

Amazon Aurora: Design Considerations + On Avoiding Distributed Consensus for I/Os, Commits, and Membership Changes

Dude, where's my Emacs?

There is plenty of room at the bottom