Canopus: A scalable and massively parallel consensus protocol

This paper is by Sajjad Rizvi, Bernard Wong, and Srinivasan Keshav, and appeared in CoNext17.

The goal in Canopus is to achieve high throughput and scalability with respect to the number of participants. It achieves high throughput mainly by batching, and achieves scalability by parallelizing communication along a virtual overlay leaf-only tree (LOT).

Canopus trades off both latency and fault-tolerance for throughput.

The protocol


Canopus divides the nodes into a number of super-leaves. In the figure there are 9 super-leaves, each super-leaf consisting of 3 physical nodes (pnodes). A LOT is overlaid on these 27 pnodes so that a pnode N in super-leaf 1.1.1 emulates all of its ancestor vnodes: 1.1.1, 1.1, and 1. The root vnode 1 is emulated by all of the pnodes in the tree.
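
As a minimal sketch (not from the paper), the LOT labeling can be captured by mapping each pnode to the vnodes on the path from its super-leaf up to the root. The tree shape and labels below follow the figure's example of 9 super-leaves with 3 pnodes each; the node naming is an illustrative assumption.

# Sketch: which vnodes a pnode emulates, given its super-leaf label.
def ancestor_vnodes(superleaf_label):
    """Return the vnode labels a pnode in the given super-leaf emulates,
    e.g. "1.1.1" maps to ["1.1.1", "1.1", "1"]."""
    parts = superleaf_label.split(".")
    return [".".join(parts[:i]) for i in range(len(parts), 0, -1)]

# Example: the 9 super-leaves from the figure, 3 pnodes each (27 pnodes total).
superleaves = [f"1.{i}.{j}" for i in range(1, 4) for j in range(1, 4)]
pnodes = {f"{sl}/n{k}": ancestor_vnodes(sl) for sl in superleaves for k in range(1, 4)}

print(pnodes["1.1.1/n1"])   # ['1.1.1', '1.1', '1']; note every pnode emulates the root vnode 1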

Canopus divides execution into a sequence of consensus cycles. In the first cycle, each node within a super-leaf exchanges its list of proposed commands with its super-leaf peers. Every node then orders these commands in the same deterministic order, coming to a decentralized consensus (in a manner very similar to this synchronous consensus algorithm) within the super-leaf. This creates a virtual node that combines the information of the entire super-leaf. In consecutive cycles, the virtual nodes exchange their commands with each other. At the top level, in the emulation of the root of the LOT, every physical node has all commands in the same order and consensus has been reached.
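
Here is a minimal sketch, not the authors' implementation, of the ordering step within a super-leaf: each pnode batches its proposed commands, the batches are exchanged via the assumed reliable broadcast, and every node applies the same deterministic ordering rule. The names and the specific rule (sort by node id) are illustrative assumptions.

# Sketch: one super-leaf round producing the same merged command list at every pnode.
def superleaf_round(batches_by_node):
    """batches_by_node: {node_id: [commands proposed by that node]}.
    Every live node in the super-leaf is assumed to see the same set of batches."""
    merged = []
    # Deterministic order: sort by node id, then keep each batch's internal order.
    for node_id in sorted(batches_by_node):
        merged.extend(batches_by_node[node_id])
    return merged  # becomes the super-leaf vnode's state for the next cycle

# Example: three pnodes in super-leaf 1.1.1 each propose some commands.
state_1_1_1 = superleaf_round({"n3": ["w(x,5)"], "n1": ["w(y,2)"], "n2": []})
print(state_1_1_1)  # ['w(y,2)', 'w(x,5)'], identical at every pnode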

Instead of directly broadcasting requests to every node in the group, Canopus uses the LOT overlay for message dissemination, and this helps reduce network traffic across oversubscribed links. The downside is that consensus takes multiple cycles, which hurts latency in wide area network (WAN) deployments. Moreover, even if some super-leaves have no requests to propose, they still need to participate in all the cycles of the consensus. There is some consolation that the lower-level cycles are done with nearby datacenters first, and only at the top level do nodes talk to nodes across the entire WAN.

Fault-tolerance

Canopus assumes that an entire rack of servers (where each super-leaf resides) never fails, and that the network is never partitioned. If either of these happens, the entire system loses progress. For example, two node failures in the same super-leaf make all twenty-seven nodes in the system unavailable.

The reason distributed consensus is hard is that the parties involved don't have access to the same knowledge (the same point of view) of the system state. To solve that problem at the base level, Canopus assumes that a reliable broadcast functionality is available within a super-leaf (thanks to the ToR switches). This reliable broadcast ensures that all the live nodes in a super-leaf receive the same set of messages. (In 2015, I had suggested a more general way of implementing all-or-nothing broadcasts without assuming the reliable broadcast functionality assumed here.)

This atomic reliable broadcast assumption takes care of the information-asymmetry problem where a physical node A in a super-leaf crashes right after it sends a message to node B but before it can send it to node C. However, I think it is still possible to run into that problem due to a slight misalignment in the timeouts of nodes B and C. Let's say A's broadcast is delayed significantly, maybe because A was suffering from a long garbage-collection stall. Node B times out and moves to the next cycle, declaring A crashed. Node C receives A's message just before its timeout and adds A's proposals to its consensus cycle. Even with closely synchronized timeouts at B and C, and with a reliable broadcast by A, this problem is bound to occur.

The higher-level cycles exploit the reliability of the super-leaf level, so consensus is implemented simply by reaching out to one node from each super-leaf to fetch its state. The node fetching the states from other super-leaves then shares them with the other physical nodes in its super-leaf. The paper does not discuss this, but if the node fetching state dies, another node from the super-leaf should take over to complete the task. While the paper claims that the design of Canopus is simple, I really don't like all these corner cases that creep up.
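
Under my reading of the paper, a higher-level cycle looks roughly like the sketch below. The fetch_from and share_with helpers are hypothetical stand-ins for the actual RPCs, and the merge rule mirrors the deterministic ordering used within a super-leaf.

# Sketch: emulating a parent vnode by fetching sibling vnode states.
def emulate_parent_vnode(local_vnode, local_state, sibling_vnodes, fetch_from, share_with):
    """Combine the local super-leaf's agreed state with its siblings' states
    to obtain the parent vnode's state, identically at every pnode."""
    states = {local_vnode: local_state}
    for vnode in sibling_vnodes:
        states[vnode] = fetch_from(vnode)   # one fetch per sibling vnode (from any pnode emulating it)
    share_with(states)                      # disseminate the fetched states within the super-leaf
    merged = []
    for vnode in sorted(states):            # deterministic merge by vnode label
        merged.extend(states[vnode])
    return merged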

Canopus is said to operate in loosely-synchronized cycles, but the synchronization protocol is not explained well in the paper, so I am unsure about how well it would work in practice. The paper also mentions pipelining of consensus rounds, and it is unclear whether there could be other synchronization problems in maintaining this pipelining. The evaluation section does not provide any experiments where faults are present. The code is not available, so it is unclear how much fault-tolerance is implemented.

Local reads

To provide linearizability for read requests, Canopus simply delays answering a read until the next consensus round, to make sure that all concurrently received update requests are ordered through the consensus process. This allows Canopus to always read from the local replica. While delaying reads achieves linearizability, it also hurts the SLAs and is not very practical/useful.
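
A minimal sketch of this read path, under my understanding: a read is held until the next consensus cycle commits at the local replica, and then served locally. The replica structure and cycle-completion signaling below are illustrative assumptions, not the paper's API.

import threading

class LocalReplica:
    """Sketch of a replica that serves linearizable reads locally by delaying them."""
    def __init__(self):
        self.store = {}
        self.cycle_done = threading.Condition()
        self.cycle = 0

    def apply_cycle(self, ordered_commands):
        """Called when a consensus cycle commits; applies the agreed updates in order."""
        with self.cycle_done:
            for key, value in ordered_commands:
                self.store[key] = value
            self.cycle += 1
            self.cycle_done.notify_all()

    def linearizable_read(self, key):
        """Delay the read until the next cycle completes, then answer from the local store."""
        with self.cycle_done:
            target = self.cycle + 1
            while self.cycle < target:
                self.cycle_done.wait()
            return self.store.get(key)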
