Implementation of Cluster-wide Logical Clock and Causal Consistency in MongoDB

This paper (SIGMOD 2019) discusses the design of causal consistency and cluster-wide logical clock for MongoDB. 

Causal consistency preserves Lamport's happened-before (transitive and partial) order of events in a distributed system. If an event A causes another event B, causal consistency ensures that every process in the system observes A before observing B. Causal consistency guarantees read-your-writes, write-follows-reads, monotonic reads, and monotonic writes. Causal consistency is the strictest consistency level where the system can still make progress (always-available one-way convergence) in the presence of partitions as discussed in the "Consistency, Availability, and Convergence" paper. This makes causal-consistency a very attractive model for modern distributed applications.

The paper provides a simple and balanced design for causal-consistency in MongoDB. The ideas are simple, yet the implementation and operationalizing (backwards compatibility, security assurance, operator-mistake tolerance, efficiency, simplicity, performance) make this an interesting read. This paper describes a large scale, production-grade implementation of causal consistency in a multisharded, replicated, distributed database using hybrid logical clocks (HLCs) and gossiping, adding the signing of logical time ranges, and introducing performance optimizations necessary for systems at scale.

Design decisions

Type of clock

For the type of clock, the paper considers and discards Lamport's logical clock (due to lack of physical clock affinity), vector clock (due to O(N) space used in messages), and wall clocks (due to difficulty/unavailability of precise synchronization).

Hybrid Logical Clocks (HLCs) are chosen because they avoided all of those pitfalls. The paper mentions that, even prior to causal consistency, MongoDB used ordering similar to hybrid logical time for maintaining order in the operation log used for replication of a single shard, and that they already had some of the foundation needed to synchronize the events across the cluster. So this approach was best aligned with the product requirements.

Dependency tracking

The paper considers and discards full dependency tracking (due to the performance overhead caused by the need to attach and process the dependency graph on all events), bolt-on layer tracking (due to its complexity, overhead, and the addition of another failure point), and dependency tracking on the client/driver (due to its requirement of explicit application support across all client).

They take the explicit dependency tracking approach as it can be presented in a form of causality segments where all data that a customer reads happened-after the latest event observed by the client. This approach leads to a simple implementation when using HLCs (or logical time really). See the algorithm section below.

Clock synchronization

The paper considers and discards the use of heartbeats in favor of forced advance of Stable Cluster Time (SCT) --the logical time persisted in the oplog. To advance SCT MongoDB performs a no-op write as this is the only way to increase Stable Cluster Time in a strictly increasing sequence. Advancing SCT allows providing low latency for all types of workloads limited only by the node's write throughput.

The algorithm

The logical clock of HLC is incremented only when there are new operation log entries, representing system state change, rather than for any arbitrary event. Sending and receiving messages are not state-updating events and don't tick the clock. The ClusterTime is incremented only when there is a write to a primary node's replication operation log (oplog). Cluster nodes (mongod, mongos, config server, clients) always track and include the greatest known ClusterTime when sending a message.

Each message from each data node in the causally consistent session (i.e. a thread of execution) receives the operationTime - the last known logical time persisted in the oplog on that node. Follow-up read or write operations originated in this session attach the highest known operationTime to the request metadata. The data node that receives the request waits for its operation log to catch up to the requested operationTime.

And that's all folks.

See the below pseudocode for realizing this idea. Note that ClusterTime is represented by a <Time><Increment> pair: where <Time> is a 32 bit count of seconds since the Unix epoch (physical time) and <Increment> is a 32 bit integer that allows us to distinguish writes that occurred within the same second. Notice how the wall-clock factors into ClusterTime in the following algorithm.

MongoDB enables causal consistency in client sessions. Each session where causal consistency is enabled tracks signed ClusterTime. MongoDB provides an API to pass signed ClusterTime between sessions to enable clients to extend causally consistent chain of operations across multiple sessions, or even clients. This also allows making causally consistent reads from secondary nodes that could be useful in some scenarios including geo-replicated low-latency reads.

Protecting against malicious attacks

As discussed above nodes advance their logical clocks to the maximum ClusterTime that they receive in the client messages, but a malicious client could modify their maximum ClusterTime sent in a message. For example, a malicious client could send the max clock value, which once written to the oplogs of replica set nodes, will not be incrementable and the nodes will be unable to accept any further writes against the database. The only way to recover from this situation would be to unload the data, clean it, and reload back with the correct OpTime. This malicious attack would take the affected shard offline, affecting the availability of the entire system. To mitigate this risk, MongoDB added a HMAC-SHA1 signature that is used to verify the value of the ClusterTime on the server. ClusterTime values can be read by any node, but only MongoDB processes can sign new values. The signature cannot be generated by clients.

Every time the mongod or mongos (MongoDB server data node or query router) receives a message that includes a ClusterTime that is greater than the value of its logical clock, they will validate it by generating the signature using the key with the keyId from the message. If the signature does not match, the message will be rejected.

As a performance optimization signing is done as a range of time, so not every increment needs to be verified. As the cluster time grows linearly, it is possible to mask the least significant bits in the representation to ‘1’, sign it, and cache the signature. Chances are, the next message will have the ClusterTime value incremented by 1, resulting in the same masked value. Then, mongod can reuse the cached signature.

Related reading

Implementation of the cluster-wide logical clock provides the basis for many features that require cluster-wide data ordering such as change streams resumability and multi-document cross-cluster ACID transactions which require building cluster-wide snapshots. 

We had recently reviewed "Checking Causal Consistency of MongoDB", and it is a good read to get an in depth understanding of causal consistency. 

Finally, the MongoDB documentation page on causal-consistency is useful to see the guarantees you can get under various configurations of write-concern/read-concern parameters. In short, in the presence of node failures or network partitions, causally consistent sessions guarantee causal consistency only for reads with majority readConcern and writes with majority writeConcern. In the absence of node failures or network partitions causal-consistency is guaranteed by all configurations including read concern local and write concern w1.


Popular posts from this blog

Learning about distributed systems: where to start?

Hints for Distributed Systems Design

Foundational distributed systems papers

Metastable failures in the wild

The demise of coding is greatly exaggerated

Scalable OLTP in the Cloud: What’s the BIG DEAL?

The end of a myth: Distributed transactions can scale

SIGMOD panel: Future of Database System Architectures

Why I blog

There is plenty of room at the bottom