Tuesday, October 19, 2010

Mencius: building efficient replicated state machines for Wide-Area-Networks (WANs)

I will write my short summary of the Mencius paper from OSDI08 here. Writing this helped me understand a couple subtle points better, so I hope it will also help others.

Replicated state machines is a common approach for achieving fault-tolerance. State machine replication works by replicating the state and function across multiple servers (replicas) and using consensus to agree on the sequence of the commands the replicas execute. Assuming deterministic service, since all replicas execute the same sequence of commands, they all keep the same state, and strong consistency guarantees are provided to the face of server failures.

Paxos is a fault-tolerant implementation of replicated state machines. Paxos has become hugely popular because it is the only formally proven protocol that works in the face of asynchronous model. Paxos preserves safety (no incorrect decisions are taken by replicas) to the face of violations of timing assumptions on the channels and servers. Paxos satisfies progress (a decision is taken by replicas) when timing assumptions are satisfied, channels deliver messages, and majority of replicas are alive. I will not attempt to explain Paxos here, as that would take at least another post. Although the protocol is simple, significant time and effort is needed to internalize how and why Paxos works.

WANs are becoming a focus area in cloud computing, as services and data need to be replicated among multiple datacenters in different regions of the country and the world. Paxos is unsuitable for WANs due to the single leader requirement. In Paxos during normal operation only one server act as the leader: All client requests should be forwarded to that leader, and that leader then clears this with the other replicas (via a one-to-all followed by all-to-one traffic).

This single entry-point requirement leads to three bottlenecks. First is the bandwidth bottleneck due to the unbalanced communication pattern (in one-to-all and all-to-one, only the links adjacent to the leader are used). Second is the computational bottleneck at leader, as the leader needs to process more messages. The paper calls a deployment/load that reaches the bandwidth bottleneck first as network-bound and that reaches the CPU bottleneck first as CPU-bound. (Somehow I doubt whether CPU-bound load/deployment is possible in practice.) Finally, the third problem is the extra latency imposed for clients that are far away from the leader. These clients need to send requests to all the way to a far away server (even though a nearby replica may exist) and wait for the reply from that server.

Mencius is a multileader version of Paxos and tries to eliminate the single entry-point requirement in Paxos. Mencius achieves load balancing by partitioning consensus sequence numbers (consensus requests/instances) among multiple servers. e.g., if we have 3 servers, server 0 is responsible for acting as a leader for consensus instances numbered 0,3,6..., server 1 for 1,4,7..., and server 2 for 2,5,8...

This load balancing helps distribute the network bandwidth and CPU load better. In contrast to Paxos where only the channels adjacent to the leader server gets utilized, now since all servers may get to act as leader all channels get utilized. Network-bound limitation is alleviated this way. Similarly, CPU-bound limitation is alleviated as all servers pitch in for the leader role. Finally, the clients can send their requests to a nearby server to avoid the extra latency for sending the requests to a faraway server. (The paper assumes that each site has a server and a group of clients, so the clients know the nearest server to themselves is the server in the same site.)

Mencius adapts to client and network load variance by using simple consensus. In simple consensus only one special server per consensus instance (let's call this coordinator) can propose any command (including a no-op), the others can only propose no-op for that instance. A benefit of using simple consensus is that servers can learn a skipped no-op without having to have a majority of servers to agree on it first. This enables, servers with low client load to ckip their turns without having to have a majority of the servers agree on it first. The servers can piggyback SKIP messages to other messages, e.g., ack messages to other servers.

By using multiple leaders, however, Mencius loses out from the "serve reads locally at the leader" optimization possible in Paxos. Since there is only one leader in Paxos, that leader may use leases with other replicas. While the lease holds the replicas promise they will not choose a new leader, so the leader is guaranteed that it has the latest information and can serve reads locally. In Mencius, since there are several leaders, this leasing cannot be based on leader, so this serve-reads locally optimization is not applicable. It may still be possible to use a more complicated lease based on partitioning of the data and serve reads locally.

Of course, Mencius is not addressing the core question of can you slash down latency by serving read and even the write requests locally? This is a hard question. There are inherent tradeoffs between consistency and availability and low-latency that would rule out Paxos completely. So, I need to touch the CAP theorem a bit to explain this.

In his blog, Dan Abadi gives the most lucid explanation of the CAP theorem I have ever read. Dan proposes that the acronym is not CAP, it should actually be PACELC. If there is currently a partition, the tradeoff is between the availability and consistency (PAC corresponds to this part). Else, if no partition, then the tradeoff is between low-latency and consistency (ELC corresponds to this part).

Paxos is, of course, PCEC since it always chooses consistency, and hence the checking required with other replicas. A truly low-latency protocol should probably be PAEL or PCEL, but then it should also provide some consistency in the E part as well to be useful. Yahoo's PNUTS system, which is PCEL, attempts to answer this question (putting some consistency in the E part as well) by partitioning the data with respect to the owner of the data and providing “timeline consistency” where replicas may not be consistent with each other but updates are applied in the same order at all replicas via asynchronous replication. For example for a social network application, the server in Buffalo is assigned as the leader for Murat's status page. So Murat can access and update his data with little latency in Buffalo, and this data is gradually replicated to other sites in the world. But, PNUTS is just one point in the design space, there should be more choices to explore.

By the way, Yahoo is opening a green datacenter in Buffalo, check it out.

3 comments:

Daniel Abadi said...

Hi Murat,

Nice post, good summary, and thanks for the kind words.

Ivan said...

Very interesting post, I didn't know about a multiple coordinators implementation of Paxos and this is very interesting.

However, I think that there is at least one other consensus algorithm that exists from Chandra and Toueg which is known as the rotating coordinator algorithm and which also resolves the consensus problem in an asynchronous world. Nevertheless I'am convinced that Paxos performs well due to the small number of messages to achieve a consensus in steady-state, i.e., 0(2n) messages. With the chandra-toueg consensus i think it's O(3n) plus an additional cost for reliable broadcast for the last (decide) message.
Here is a link to the algorithm original paper: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.113.498

Anyway, I really appreciate your digest!! (every new extension of Paxos necessitates a digest like the original Paxos paper :D )

Murat said...

Ivan,

The rotating coordinator algorithm is not a multileader (multiple entry points) approach to consensus. The rotating coordinator algorithm is used for solving only one instance of consensus. If sufficient number of replicas do not think the coordinator is alive, the next replica acts as the coordinator and tries to get the system to decide on that particular instance.

So in one sense, the rotating coordinator algorithm embeds an eventually correct leader election service that Paxos relies on in the consensus service implementation.

Actually, 8 years before the rotating coordinators algorithm a similar rotating coordinators algorithm has been presented in "Consensus in the presence of partial synchrony" by Dwork, Lynch, Stockmeyer. I think that 1996 algorithm provided pretty much the same guarantees as the Paxos algorithm. But as you mentioned Paxos is more efficient and simpler.

Thank you for the kind words. Good to know these posts are providing a service.