Scalog: Seamless Reconfiguration and Total Order in a Scalable Shared Log

This paper appeared in NSDI'20 and is authored by Cong Ding, David Chu, Evan Zhao, Xiang Li, Lorenzo Alvisi, and Robbert van Renesse. The video presentation of the paper is really nice and gives a good overview of the paper. Here is a video presentation of our discussion of the paper, if that is your learning style. (If you would like to participate in our paper discussions, you can join our Slack channel.)


Background

The problem considered is building a fault-tolerant scalable shared log. One way to do this is to employ a Paxos box for providing order and committing the replication to the corresponding shards. But as the number of clients increases, the single Paxos box becomes a bottleneck, and this does not scale. Corfu had the nice idea to divorce ordering from replication. The ordering is done by the Paxos box, i.e., the sequencer, which assigns unique sequence numbers to the data. Replication is then offloaded to the clients, which contact the storage servers with the sequence number to commit the data. This technique achieves scalability as the number of clients increases.
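As a rough illustration (my sketch, not from the paper), here is the Corfu-style split in minimal Python: a sequencer hands out unique positions, and a fixed position-to-shard mapping, which every client and storage server must agree on, tells the client where to replicate. The names and the shard count are hypothetical.

```python
import itertools

class Sequencer:
    """Stand-in for Corfu's Paxos-backed sequencer: hands out unique log positions."""
    def __init__(self):
        self._next = itertools.count()

    def next_position(self):
        return next(self._next)

NUM_SHARDS = 4  # hypothetical deployment size

def shard_for(position, num_shards=NUM_SHARDS):
    # Every client and storage server must use this exact mapping;
    # changing num_shards requires committing a new configuration everywhere.
    return position % num_shards

# Clients fetch a position, then replicate the record to the mapped shard.
seq = Sequencer()
writes = {}  # shard -> list of (position, record) committed there
for record in ["a", "b", "c", "d", "e"]:
    pos = seq.next_position()
    writes.setdefault(shard_for(pos), []).append((pos, record))
```

Because the mapping function is baked into every client and server, any change to the set of shards forces a coordinated reconfiguration, which is exactly the availability gap Scalog targets.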


A limitation of Corfu is that any change in the set of storage servers makes Corfu unavailable until the new configuration has been committed to all storage servers and clients. Corfu requires all clients and storage servers to share the same mapping function, which maps sequence numbers to specific shards. This paper provides a simple (almost trivial) idea for solving this problem and improving over Corfu: maintaining globally ordered shared logs with seamless reconfiguration of the log shards.

Scalog  

Scalog turns the Corfu decoupling strategy on its head with a judo move. By first replicating the record and then assigning it a sequence number via a batch watermarking strategy, it solves the unavailability problem Corfu faces during reconfiguration. (This also takes care of the problem Corfu faces with a client that takes a sequence number and then crashes without replicating the record, leaving a gap in the log.)


In Scalog, clients write records directly to storage servers, where they are (trivially) FIFO ordered without the mediation of a global sequencer. Records received by a storage server are then immediately replicated across the other storage servers in the same shard via FIFO channels. Periodically, each storage server reports the lengths of its log segments to an ordering layer. To produce a total order out of these local/shard-ordered log segments, the ordering layer in Scalog summarizes the *fully replicated prefix* of the primary log segment of each storage server in a cut, which it then shares with all storage servers.
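A cut can be sketched as follows. This is a simplified model (the names are mine, not the paper's): each storage server's primary log segment is replicated to its shard peers, and a record only counts as durable once every replica in the shard holds it, so the fully replicated prefix of a segment is the minimum length its replicas report.

```python
def compute_cut(segment_lengths):
    """segment_lengths[seg][replica] = number of records of primary log
    segment `seg` that the replica has durably stored. The fully replicated
    prefix of a segment is the minimum across its shard's replicas."""
    return {seg: min(lengths.values())
            for seg, lengths in segment_lengths.items()}

# Hypothetical reports from two shards, each with two storage servers.
reports = {
    "shard0-primary": {"s0a": 5, "s0b": 3},  # only 3 records fully replicated yet
    "shard1-primary": {"s1a": 7, "s1b": 7},
}
cut = compute_cut(reports)
```

The ordering layer periodically publishes such cuts, and the lagging replica (here `s0b`) simply delays when records become globally ordered, rather than blocking writes.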



The ordering is done at the Paxos box in the ordering layer by releasing version-vector-like watermarks across the sharded logs, based on the fully replicated log-segment progress heard from each shard. The ordering layer interleaves not only records but also reconfiguration events. As a result, all storage servers see the same update events in the same order. The storage servers use these cuts to deterministically assign a unique global sequence number to each durable record in their log segments, using the deterministic order within each cut.
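To make the deterministic assignment concrete, here is a minimal sketch (my own, with hypothetical names): given two consecutive cuts, every storage server walks the segments in the same fixed order and numbers the records that became durable in between, so all servers arrive at identical global sequence numbers without further coordination.

```python
def assign_global_seq(prev_cut, new_cut, start=0):
    """Deterministically map newly durable records to global sequence numbers.
    prev_cut/new_cut map segment id -> fully replicated prefix length.
    Every server iterates segments in the same fixed order (sorted id here),
    so the assignment is identical everywhere."""
    assignments = {}  # (segment, local_index) -> global sequence number
    g = start
    for seg in sorted(new_cut):
        for i in range(prev_cut.get(seg, 0), new_cut[seg]):
            assignments[(seg, i)] = g
            g += 1
    return assignments, g  # also return where the next cut starts numbering

prev = {0: 3, 1: 7}
new = {0: 5, 1: 9}
asg, nxt = assign_global_seq(prev, new, start=10)
# segment 0's records 3,4 get numbers 10,11; segment 1's records 7,8 get 12,13
```

Since each cut's assignment starts where the previous one ended, the per-cut orderings concatenate into one total order over the whole shared log.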

Remark: Kafka also provides horizontal scalability of logs via sharding. However, thanks to the version-vector watermarking, Scalog can impose a global order on the logs, which is very useful for many applications, such as those that need multi-object transactions across shards (as in the Scalog-Store application shown in the paper). The global order is also very useful for debugging problems across shards/subsystems, which many deployed systems run into.

As we have seen, this batch-based total order imposition provides both seamless reconfiguration and scalability to Scalog. These two figures explain the Scalog architecture and operation very nicely. The aggregators serve as soft-state buffers in front of the Paxos box, batching communication to reduce the load the box needs to endure.


In a storage shard, f+1 storage servers are sufficient to tolerate up to f crash failures. Because its loosely coupled coordination can handle any number of shards you throw at it, Scalog also takes an interesting approach to fault tolerance. It uses two servers in a shard to tolerate the crash of one server, and instead of trying to resuscitate the crashed server (which may take a long time), it has the client finalize the shard and start a new shard to continue operation. After all, adding a new shard is frictionless and no-effort in Scalog.

Evaluation

Scalog is evaluated extensively and is shown to give good results. To evaluate Scalog at scale, the authors use a combination of real experiments and emulation. They use a 10 Gbps infrastructure and SSDs, and consider 4KB records. With 17 shards, each with two storage servers, each processing 15K writes/sec, they show that Scalog achieves a total throughput of 255K totally ordered writes/sec. Further, through emulation, they demonstrate that, at a latency of about 1.6 ms, Scalog can handle about 3,500 shards, or about 52M writes/sec. This means that Scalog can deliver throughput almost two orders of magnitude higher than Corfu’s, with comparable latency. They also provide experiments to show how reconfigurations impact Scalog, and how well Scalog handles failures.




Discussion

Embrace small-coordination and punt things to the client

The main contributions in Scalog are that:
  1. it allows applications to customize data placement 
  2. it supports reconfiguration with no loss in availability
  3. it recovers quickly from failures
The first two features are mostly the client's responsibility. Scalog provides small coordination (as in small government): it gets out of the way of the client, so the client can customize its data placement and add new shards without causing a slowdown or loss of availability in the system. The third feature, recovering quickly from failures, is also an artifact of the small coordination Scalog provides.

How does Scalog compare with consensus protocols?

Scalog is not a consensus protocol. It relies on a Paxos/consensus box to provide strong consistency/linearizability in the face of failures. From the Paxos side, the closest idea to Scalog is SDPaxos. SDPaxos also separates ordering from replication: replication is done first, and ordering is done over the already replicated requests. On the other hand, SDPaxos does not provide horizontal scaling as Scalog does. There must be a separate SDPaxos overseeing each shard, and even then we would need a watermarking trick similar to Scalog's to impose an a posteriori global ordering on the logs.

Scalog's ordering is an a posteriori ordering. There is no immediate read operation in Scalog. Reads are supported either through the subscribe operation (which provides streaming batch reads) or via the readRecord(l, s) operation, where l is the sequence number and s is the shard. In either case, it seems like the reads are reads from the past and not real-time reads. I wonder if there is a loss of functionality because of this. For example, does this cause problems for log-based global RSM maintenance? The paper cites analytics applications as use cases for Scalog, and it is not very clear if we could maintain responsive command servers using the RSM approach over Scalog shared logs.

How does Scalog compare with AWS Physalia?

AWS Physalia presented large-scale sharded configuration boxes which oversee sharded chain-replication systems. I think Physalia is a better engineered system, since it takes into account partitioning problems and can reconfigure its coordination cells to relocate closer to the storage servers to handle some partitioning problems.

The batch-style asynchronous acknowledgements in Scalog are bad for usability. They make it hard for the clients to determine when there is a problem and where the problem is. The clients cannot differentiate between a problem in the shard and a problem in the ordering layer. Are the two servers in the shard partitioned from each other? Or is the entire shard partitioned from the ordering layer? Is the ordering layer functioning but delayed, or is it down or unreachable? These limit the actions the clients can take to respond and adapt to problems.

Of course, Physalia does not provide an ordering across shards, but maybe an ordering layer could be added, similar to the watermarking ideas presented in Scalog.

