ParallelRaft: Out-of-Order Executions in PolarFS

This paper (2021) dives deeper in the Parallel Raft protocol introduced with PolarFS.

PolarFS (VLDB'18) is a distributed file system with ultra-low latency and high availability, developed by Alibaba. Its variant of the Raft consensus protocol, ParallelRaft,  relaxes Raft's strict serialization constraints. ParallelRaft allows state machines to commit and execute log entries out of order.

This study provides a formal specification of ParallelRaft in TLA+, proves its correctness, and identifies consistency issues caused by ghost log entries. To address these issues, the refined ParallelRaft-CE protocol limits parallelism during leader transitions.


Introduction

Raft is a tight-ass. It enforces sequential commitment and execution of log entries. No funny business. Multi-Paxos is a bit rebellious. It allows out-of-order commitment, but it still insists on ordered execution. In order not to leave any performance on the table, PolarFS's ParallelRaft rebels completely: it not only allows out-of-order commitment, but out-of-order execution as well. 

How does ParallelRaft keep consistency of state machine replication across replicas? The trick is to permit out-of-order execution if commands are conflict-free. It is the oldest trick in the book, the generalized Paxos idea. Each command contains a Logical Block Address (LBA). Commands accessing non-overlapping LBAs are conflict-free and can execute in parallel (i.e., out of order). Overlapping commands must execute in log order.

Ok, still this is a big claim. Concurrency is a wild beast! Did you expect this to be simple? What about leader turnovers? There is a lot to consider there. This paper complains that the original PolarFS paper was too quick to call it a day without specifying the protocol completely or providing an opensource implementation. They suspect that, in the presence of leader turnovers, ParallelRaft protocol is vulnerable to inconsistency induced by ghost log entries.

To investigate, they introduce an intermediate protocol, ParallelRaft-SE (Sequential Execution), which allows out-of-order commitment but mandates sequential execution. The point of ParallelRaft-SE is to establish a refinement mapping to Multi-Paxos. They then relax it to get ParallelRaft-CE (Concurrent Execution), which supports both out-of-order commitment and execution. They show that ParallelRaft-CE is prone to state inconsistency due to ghost log entries, and they propose to mitigate this issue using stricter leader election rules and log recovery mechanisms, forbidding out-of-order execution during leader takeover.


Out-of-Order Executions and Ghost Log Entries

A key challenge of out-of-order execution is determining conflicts when log entries are missing. How do you track conflicts with holes in the log? ParallelRaft introduces a Look Behind Buffer (LBF) that tracks the LBAs of up to K preceding log entries. If no holes exceeding K exist, conflict detection is feasible.

However, the paper argues ghost log entries can cause inconsistency as in Figure 1.

In phase 1, leader s1 creates entries 1-4, with only entries 1-2 being committed before s1 fails. When s3 becomes leader in phase 2, it doesn't receive s1's uncommitted entries 3-4 during recovery. S3 then adds its own entries 3-6, commits entry 6 (which sets y←2), and executes this operation before failing.

In phase 3, new leader s2 unexpectedly receives entry 4 from the now-recovered s1 during recovery. After committing this entry, s2 must execute it (setting y←3). This operation should have executed before entry 6 according to ordering rules, but s3 has already executed y←2.

This "ghost log entries" phenomenon--where s1's uncommitted entries disappear during s3's leadership only to reappear during s2's leadership-- creates inconsistent execution order. The actual execution conflicts with what should have happened based on log position, leading to incorrect conflict determination and system inconsistency.


The proposed ParallelRaft-CE patch

ParallelRaft-CE addresses ghost entries by refining ParallelRaft-SE. The key idea is to track the generation moment of log entries by the sync number, and barrier entries from old terms.

The protocol creates a clear boundary between terms by allowing concurrent operations only for log entries with the same term, while enforcing sequential processing for entries from different terms. It tracks entry generation through the sync number and adds a "date" variable to each log entry (similar to Paxos proposal numbers) to determine recency.

During recovery, the new leader collects log entries from a majority of nodes and selects the latest entries based on their "date" values. This ensures that already committed entries remain in the log, previously uncommitted entries are either properly committed or discarded, and no ghost entries can persist undetected across term changes. By introducing this controlled concurrency model where the sync number is always less than or equal to a node's term, ParallelRaft-CE effectively uses leader election as a sequentialization bottleneck to clean up potential inconsistencies.

ParallelRaft-CE has three key components:

1. Log Synchronization Mechanism: Followers accept only log entries matching their sync number (equivalent to the current term). Entries from different terms are rejected.

2. Leader Election Mechanism: Similar to Raft, but with an additional check to ensure no ghost entries exist. Leaders are responsible for reconciling logs and eliminating inconsistencies.

3. Log Recovery Mechanism: During recovery, the new leader collects logs from a majority of nodes and applies a conflict resolution process. Uncommitted entries from the previous term are either committed or discarded. 

In sequential Raft, leader transition is simple, as the selected leader is already caught up: Logs of nodes in Raft do not have holes, and the log matching property between nodes is guaranteed. But in ParallelRaft, since there are holes, leader election needs to be augmented with recovery as in Paxos leader transitions. This is the price to pay for being relaxed in out-of-order commit and execution.


The ParallelRaft-CE protocol limits out-of-order execution to entries within the same term. Log entries from different terms follow a sequential recovery and commitment process to  eliminate ghost entries and guarantee state consistency.

Comments

Popular posts from this blog

Hints for Distributed Systems Design

Learning about distributed systems: where to start?

My Time at MIT

Making database systems usable

Looming Liability Machines (LLMs)

Advice to the young

Scalable OLTP in the Cloud: What’s the BIG DEAL?

Foundational distributed systems papers

Distributed Transactions at Scale in Amazon DynamoDB

Linearizability: A Correctness Condition for Concurrent Objects