Amazon MemoryDB: A Fast and Durable Memory-First Cloud Database

Key-value stores have simple semantics which make it cumbersome to perform complex operations involving multiple keys. Redis addresses this by providing low latency access to complex data structures like lists, sets, and hashes. Unfortunately, Redis lacks strong consistency guarantees and durability in the face of node failures, which limits its applicability beyond caching use cases.

Amazon MemoryDB is a fully managed in-memory database service that leverages Redis's performance strengths while overcoming its durability limitations. It uses Redis as the in-memory data processing engine but offloads persistence to an AWS-internal transaction log service (internally known as the journal). This decoupled architecture provides in-memory performance with microsecond reads and single-digit millisecond writes, while ensuring cross-availability-zone (AZ) durability, 99.99% availability, and strong consistency in the face of failures.

This SIGMOD 2024 paper describes MemoryDB's architecture. Unfortunately, it doesn't provide any description of the transaction log/journal service. We will have to wait for another paper to be published on the journal service.


Redis/Valkey

Redis (or now Valkey) supports over 200 commands on 10 data structures like hash tables, sorted sets, and hyperloglogs. It allows combining commands into all-or-nothing transactions, and also supports executing Lua scripts atomically on the server.

For horizontal scaling, Redis splits its key space into 16,384 slots across shards, each with a primary node for writes and optional read-only replicas. Clients discover the slot-to-shard mapping and directly interact with the appropriate shard. Multi-key transactions work as long as all keys belong to the same slot.
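As a concrete illustration, here is a minimal Python sketch of that slot mapping. Redis computes CRC16 (XMODEM variant) of the key modulo 16,384, and `{hash tags}` let related keys share a slot so they can participate in the same transaction:

```python
# Sketch of Redis Cluster key-to-slot mapping: CRC16-XMODEM mod 16384,
# honoring {hash tags} so related keys can land in the same slot.

def crc16_xmodem(data: bytes) -> int:
    """CRC16-CCITT (XMODEM: polynomial 0x1021, initial value 0)."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc

def key_slot(key: str) -> int:
    """Map a key to one of 16,384 slots; if the key contains a non-empty
    {hash tag}, only the tag is hashed."""
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end != -1 and end != start + 1:  # non-empty tag
            key = key[start + 1:end]
    return crc16_xmodem(key.encode()) % 16384

# Keys sharing a hash tag map to the same slot, so a multi-key
# transaction over them can be served by one shard:
assert key_slot("{user42}.cart") == key_slot("{user42}.profile")
```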

Redis uses asynchronous logical replication. Writes are executed on the primary and then asynchronously replicated to secondaries. Replicas offer a consistent view of past data, with no bound on how far they lag the primary. This approach favors availability but does not guarantee durability when the primary fails. Redis uses a quorum-based approach for leader election; however, due to asynchronous replication, a failover permanently loses any writes that had not yet been replicated to the new leader at the time of failure.
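A toy simulation (not Redis code) makes the failure mode concrete: the primary acknowledges a write before it reaches the replica, so a crash-then-failover drops the acknowledged write.

```python
# Toy model of asynchronous replication: the client is acked before the
# write propagates, so a failover can lose acknowledged writes.

class Node:
    def __init__(self):
        self.data = {}

def run_scenario():
    primary, replica = Node(), Node()
    acked = []

    def write(key, val):
        primary.data[key] = val
        acked.append(key)          # client sees success immediately...
        # ...replication to the replica happens later, asynchronously

    def replicate():
        replica.data.update(primary.data)

    write("a", 1)
    replicate()                    # "a" reaches the replica in time
    write("b", 2)                  # acked, but primary crashes before replicating
    # failover: the replica is promoted; "b" never arrived
    lost = [k for k in acked if k not in replica.data]
    return lost

print(run_scenario())
```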


MemoryDB

MemoryDB decouples durability from the in-memory execution engine by offloading it to the transaction log/journal service. This transaction log provides strongly consistent commits across multiple AZs and is responsible for propagating writes to replicas. This allows offering the full Redis API without invasive modifications, because, as far as the Redis replicas are concerned, Redis's original asynchronous replication strategy is still in use. Behind the curtain, however, the internal AWS transaction log service provides multi-AZ durability by acknowledging writes only after they are durably committed across multiple AZs, providing 11 nines of durability.
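The decoupled write path can be sketched as follows. All class and method names here are my own invention; the point is only that the primary executes in memory but acks the client only after the log commit, while replicas consume the log as if it were a replication stream.

```python
# Minimal sketch of MemoryDB's decoupled write path (names are invented).

class TransactionLog:
    """Stand-in for the AWS-internal journal: durable, ordered appends."""
    def __init__(self):
        self.entries = []
    def append(self, entry):
        # In the real service this returns only once the entry is
        # durably committed across multiple availability zones.
        self.entries.append(entry)

class Primary:
    def __init__(self, log):
        self.store, self.log = {}, log
    def set(self, key, val):
        self.store[key] = val               # execute in memory first
        self.log.append(("SET", key, val))  # then make the effect durable
        return "OK"                         # ack only after the log commit

class Replica:
    """Consumes the log as if it were Redis's replication stream."""
    def __init__(self, log):
        self.store, self.log, self.pos = {}, log, 0
    def catch_up(self):
        while self.pos < len(self.log.entries):
            op, key, val = self.log.entries[self.pos]
            if op == "SET":
                self.store[key] = val
            self.pos += 1

log = TransactionLog()
p, r = Primary(log), Replica(log)
p.set("x", 1)
r.catch_up()
```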


The paper does not explain this, but I infer from the architecture that the transaction log is per shard, and is used for connecting the shard primary to its secondaries. The transaction log does not seem to be shared across different shards/primaries.


Availability, Recovery and Resilience

MemoryDB also addresses several challenges with Redis leader election and recovery processes. Unlike Redis, MemoryDB's leader election mechanism ensures that only fully caught-up replicas can be promoted as the new primary, maintaining data consistency across failures. This is achieved by leveraging the transaction log's conditional append API. In the transaction log, each entry is assigned a unique identifier, and each conditional append request must specify, as a precondition, the identifier of the entry it intends to follow. A replica acquires leadership by appending a specific log entry to the transaction log; the conditional append provides the mutual exclusion and serialization that make this safe.
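A minimal model of this conditional-append election (the log API and entry ids here are my own sketch, not the journal's actual interface): two replicas race to append a leadership entry naming the same observed tail, and the compare-and-append guarantees exactly one succeeds.

```python
# Sketch of leader election via conditional append: an append names the
# id of the entry it intends to follow, and fails if the tail has moved.
import itertools

class ConditionalLog:
    def __init__(self):
        self._ids = itertools.count(1)
        self.entries = []          # list of (entry_id, payload)
    def last_id(self):
        return self.entries[-1][0] if self.entries else 0
    def append(self, after_id, payload):
        """Append only if `after_id` is still the id of the last entry."""
        if after_id != self.last_id():
            return None            # precondition failed: someone appended first
        eid = next(self._ids)
        self.entries.append((eid, payload))
        return eid

log = ConditionalLog()
log.append(log.last_id(), ("WRITE", "x", 1))   # a normal durable write

# Two replicas campaign after observing the same tail:
tail = log.last_id()
first = log.append(tail, ("LEADER", "replica-A"))
second = log.append(tail, ("LEADER", "replica-B"))
assert first is not None and second is None    # exactly one wins
```

Note how the caught-up requirement falls out for free: a lagging replica does not know the latest entry id, so its leadership append fails its own precondition.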

I initially thought: why would they require replicas to be fully caught up before they can run for leader election? Since the transaction log has all the acknowledged writes, a replica elected leader could catch up from the log afterwards. However, after reading their leader election procedure and seeing how simple it is, I appreciated how this requirement keeps the overall recovery simple (corner-case-free) as well: "Ensuring consistent failover becomes simpler leveraging the append API. Only replicas that have observed the latest write’s unique identifier will succeed at appending the leadership entry to the log." If they allowed any replica to become leader, recovery might still have edge cases. For instance, the transaction log service node in the new leader's AZ might be down, which would delay catch-up or require intervention.

A leader, once elected, is also granted a predetermined lease period. The leader can append entries to the log to extend the lease before it expires. To enforce leader singularity, secondaries are prevented from campaigning during a backoff period after observing a renewal. The lease is granted at the shard level.
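A toy lease model (durations, API, and logical time are invented for illustration) shows the two rules at work: renew before expiry, and back off from campaigning while a lease you observed is still valid.

```python
# Toy lease model: holder renews before expiry; others may only take
# over once the lease has lapsed. Times are logical units (assumed).

LEASE = 10  # lease duration, in logical time units

class Lease:
    def __init__(self):
        self.holder, self.expires = None, 0
    def acquire(self, node, now):
        if now < self.expires and self.holder != node:
            return False           # someone else still holds a valid lease
        self.holder, self.expires = node, now + LEASE
        return True
    def renew(self, node, now):
        if self.holder == node and now < self.expires:
            self.expires = now + LEASE   # extend before expiry
            return True
        return False

lease = Lease()
assert lease.acquire("primary", now=0)       # lease valid until t=10
assert lease.renew("primary", now=8)         # extended to t=18
assert not lease.acquire("replica", now=12)  # must back off: lease valid
assert lease.acquire("replica", now=20)      # lease lapsed, takeover allowed
```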

The paper includes this cryptic sentence without elaboration: "This [lease or leadership singularity] optimization improves read throughput and latency, but also improves write performance by reducing the total number of operations that must be handled through consensus [30]." The reasoning for read throughput is clear: the leaseholding primary can serve linearizable/strong-consistency reads locally, since the lease guarantees that no other leader could have been elected and updated the data. But what should we make of "but also improves write performance"? Is this due to avoiding a conditional put operation to the transaction log when you are the leader? This is unclear. While the writing in the paper is generally OK, it has this problem of assuming the reader has more context/information about the system than they do, and the lack of any description of the transaction log service makes the problem worse. (Update: see the postscript at the end of the post.)

For efficient recovery, MemoryDB relies on its durable snapshots stored in S3 rather than requiring a primary node. Recovering replicas fetch the latest snapshot and replay the log up to a recorded position to rebuild a recent data view. By making recovery snapshot-dominant and bounding the log replay, MemoryDB optimizes for faster restarts. Its monitoring service continuously tracks cluster data and snapshot freshness to trigger new snapshots when staleness exceeds thresholds.
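The recovery path can be sketched like this (function names and log format are assumed): restore the snapshot, then replay only the log suffix past the position the snapshot records, so the replay work stays bounded.

```python
# Sketch of snapshot-dominant recovery: restore the latest snapshot,
# then replay only the log entries past its recorded position.

def take_snapshot(store, log_position):
    """Capture the store plus the position of the last entry covered."""
    return {"data": dict(store), "position": log_position}

def recover(snapshot, log):
    store = dict(snapshot["data"])
    # Replay is bounded: only entries after the snapshot's position.
    for op, key, val in log[snapshot["position"]:]:
        if op == "SET":
            store[key] = val
        elif op == "DEL":
            store.pop(key, None)
    return store

log = [("SET", "a", 1), ("SET", "b", 2), ("SET", "a", 3), ("DEL", "b", None)]
snap = take_snapshot({"a": 1, "b": 2}, log_position=2)  # covers first 2 entries
restored = recover(snap, log)
```

The fresher the snapshot, the shorter the replay, which is why the monitoring service triggers new snapshots when staleness exceeds thresholds.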


Evaluation

The performance of MemoryDB and open-source Redis (version 7.0.7) was evaluated across several Amazon EC2 Graviton3 instance types (r7g.large to r7g.16xlarge) using read-only, write-only, and mixed read-write workloads. The workloads were generated using redis-benchmark with 10 EC2 instances sending traffic to the databases, pre-filled with 1 million keys.

For read-only workloads, MemoryDB outperformed Redis across all instance types, leveraging its Enhanced IO multiplexing feature to aggregate multiple client connections into a single connection, which improves processing efficiency.

However, for write-only workloads, Redis outperformed MemoryDB as expected, achieving a maximum throughput close to 300K ops/sec compared to MemoryDB's 185K ops/sec. Since MemoryDB commits every write to the transaction log to ensure multi-AZ durability, sequential blocking write requests see higher latency and lower throughput. Pushing a single shard under ideal conditions (higher client counts, pipelining, or larger payloads), the authors say they can achieve up to 100MB/s write throughput.

In terms of latency, Redis delivered sub-millisecond median latencies and up to 3ms at p99 for write-only workloads, while MemoryDB had 3ms median and up to 6ms p99 latencies due to the transaction log commitment. For mixed read-write workloads, both offered sub-millisecond median latencies, with Redis up to 2ms and MemoryDB up to 4ms at p99.


Validating and Maintaining Consistency at Scale

MemoryDB employs several mechanisms to ensure safety:

  • During upgrades, replicas are upgraded first, and the leader node is upgraded last. This preserves read throughput capacity from replicas while the leader is still running the old version.
  • An upgrade protection mechanism safeguards the replication stream by indicating the engine version that produced it. If a replica with an older engine version observes a replication stream from a newer version, it stops consuming the transaction log to prevent data corruption.
  • MemoryDB maintains a running checksum of the entire transaction log and periodically injects the current checksum into the log itself. Snapshots store the checksum value and a positional identifier for the last log entry they capture, as well as a checksum covering the snapshot data itself. Snapshot correctness is verified by rehearsing the restoration process on an off-box cluster before applying the snapshot to the primary cluster.
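The running-checksum idea can be sketched as follows (CRC32 here is just a stand-in; the paper does not specify the checksum algorithm or the injection interval):

```python
# Sketch of a running log checksum with periodic injection: each entry is
# folded into the checksum, and the current value is itself written to the
# log so any later reader can verify the prefix.
import zlib

INJECT_EVERY = 3  # assumed injection interval

def build_log(entries):
    log, crc = [], 0
    for i, e in enumerate(entries, 1):
        crc = zlib.crc32(repr(e).encode(), crc)
        log.append(("DATA", e))
        if i % INJECT_EVERY == 0:
            log.append(("CHECKSUM", crc))   # inject current running value
    return log

def verify(log):
    """Recompute the running checksum, comparing at each injected point."""
    crc = 0
    for kind, payload in log:
        if kind == "DATA":
            crc = zlib.crc32(repr(payload).encode(), crc)
        elif kind == "CHECKSUM" and payload != crc:
            return False
    return True

log = build_log([("SET", "a", 1), ("SET", "b", 2), ("SET", "c", 3)])
assert verify(log)
log[1] = ("DATA", ("SET", "b", 99))   # corrupt an entry
assert not verify(log)
```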

As far as protocol and software safety is concerned, the internal transaction log replication protocol is modeled and verified using TLA+ to reason about the system's behavior and catch bugs early in the development process. MemoryDB uses P as well for new feature development, which proved helpful in catching bugs early. Finally, MemoryDB uses Porcupine, a linearizability checker, to test the consistency of concurrent client command histories, ensuring linearizability guarantees are met.


PS: I missed this earlier, but Brooker's post on MemoryDB (which emphasizes the compositionality aspect of the architecture) has a Bonus material on fencing that goes into detail about how interacting with the journal works. https://brooker.co.za/blog/2024/04/25/memorydb.html
