Building a Database on S3

Hold your horses, though. I'm not unveiling a new S3-native database. This paper is from 2008. Many of its protocols feel clunky today. Yet it nails the core idea that defines modern cloud-native databases: separate storage from compute. The authors propose a shared-disk design over Amazon S3, with stateless clients executing transactions. The paper provides a blueprint for serverless before the term existed.


SQS as WAL and S3 as Pagestore

The 2008 S3 was painfully slow, and 100 ms reads weren't unusual. To hide that latency, the database separates "commit" from "apply". Clients write small, idempotent redo logs to Amazon Simple Queue Service (SQS) instead of touching S3 directly. An asynchronous checkpoint by a client applies those logs to B-tree pages on S3 later.

This design shows strong parallels to modern disaggregated architectures. SQS becomes the write-ahead log (WAL) and logstore. S3 becomes the pagestore. Modern Aurora follows a similar logic: the log is replicated, and storage materializes pages independently. Ok, in Aurora the primary write acknowledgment is synchronous after storage quorum replication, and of course Aurora does not rely on clients to pull logs manually and apply like this 2008 system, but what I am trying to say is the philosophy is identical.


Surviving SQS and building B-link Trees on S3

As mentioned above, to bypass the severe latency of writing full data pages directly to S3, clients commit transactions by shipping small redo log records to SQS queues. Subsequently, clients act as checkpointers, asynchronously pulling these queued logs and applying the updates to their local copies before writing the newly materialized B-tree pages back to S3. This asynchronous log-shipping model means B-tree pages on S3 can be arbitrarily out-of-date compared to the real-time logs in SQS. Working on such stale state seems impossible, but the authors bound the staleness: writers (and probabilistically readers) run asynchronous checkpoints that pull batches of logs from SQS and apply them to S3, keeping the database consistent despite delays.

SQS, however, throws a wrench in the works. I was initially very surprised by the paper’s description of SQS (the 2008 version). It said that a queue might hold 200 messages, but a client requesting 100 could randomly receive only 20. This is because, to provide low latency, SQS does a best-effort poll of a subset of its distributed servers and immediately returns whatever it finds. But don’t worry, the other messages aren’t lost. They sit on servers not checked in that round. But the price of this low-latency is that FIFO ordering isn’t guaranteed. The database handles this mess by making log records idempotent, and ensures that out-of-order or duplicate processing never corrupts data.

The commit protocol in the paper actually starts simple: clients send log records straight to Pending Update (PU) queues. But the problem with this naive direct-write approach is that if the client crashes mid-commit, only some records might make it to the queue, and this breaks atomicity. To fix this issue, the paper proposes an Atomicity protocol: clients first dump all logs plus a final “commit” token into a private ATOMIC queue, then push everything to the public PU queues. This guarantees all-or-nothing transactions, but it’s pricey, since every extra SQS message adds up. At $2.90 per 1,000 transactions, it’s almost twenty times the $0.15 of the naive direct-write approach. So here, consistency comes at a literal monetary cost!

The big picture here is about how brutally complex it is to build a real database on dumb cloud primitives. They had to implement Record Managers, Page Managers, and buffer pools entirely on the client side, in order to cluster tiny records into pages. For distributed coordination, they hack SQS into a locking system with dedicated LOCK queues and carefully timed tokens. On top of that, they have to handle SQS quirks, with idempotent log records as we discussed above. The engineering effort is massive.

Finally, to address the slow and weakly consistent S3 reads, the database leans on lock-free B-link trees. That lets readers keep moving while background checkpoints/updates by clients split or reorganize index pages. In B-link trees, each node points to its right sibling. If a checkpoint splits a page, readers just follow the pointer without blocking. Since update corruption is still a risk, a LOCK queue token ensures only one thread checkpoints a specific PU queue at a time. (I told you this is complicated.) The paper admits this is a serious bottleneck: hot-spot objects updated thousands of times per second simply can’t scale under this design.


Isolation guarantees

In order to prioritize extreme availability, the system throws traditional isolation guarantees out the window. The paper says ANSI SQL-style isolation and strict consistency cannot survive at scale in this architecture. The atomicity protocol prevents dirty reads by ensuring only fully committed logs leave a client’s private queue, but commit-time read-write and write-write conflicts are ignored entirely! If two clients hit the same record, the last-writer wins. So lost updates are common. To make this usable, the authors push consistency up to the client. For ensuring monotonic reads, each client tracks the highest commit timestamp it has seen, and if it sees any older version from S3 it rejects it and rereads. For monotonic writes, the client stamps version counters on log records and page headers. Checkpoints sort logs and defer any out-of-order SQS messages so each client’s writes stay in order.

I was also surprised by the discussion of stronger isolation in the paper. The paper claims snapshot isolation hasn’t been implemented in distributed systems yet, because it strictly requires a centralized global counter to serialize transactions. This is flagged as a fatal bottleneck and single point of failure.

Looking back, we find this claim outdated. Global counters aren’t a bottleneck for Snapshot Isolation: Amazon Aurora stamps transactions with a Global Log Sequence Number (GLSN) via a primary writer, but still scales (vertically) cleanly without slowing disaggregated storage. More importantly, modern distributed database systems implement snapshot isolation, using loosely synchronized physical clocks (and hybrid logical clocks) to give global ordering with no centralized counter at all. Thank God for synchronized clocks!


Conclusion

While the paper had to work around the messy 2008 cloud environment, it remains impressive for showing how to build a serverless database architecture on dumb object storage. In recent years, S3 has become faster, and in 2020 it gained strong read-after-write consistency for all PUTs and DELETEs. This made it much easier to build databases (especially for analytical workloads) over S3 directly, and this led to the modern data lake and lakehouse paradigms. We can say this paper laid some groundwork for systems like Databricks (Delta Lake), Apache Iceberg, and Snowflake.

Comments

Popular posts from this blog

Hints for Distributed Systems Design

The F word

The Agentic Self: Parallels Between AI and Self-Improvement

Learning about distributed systems: where to start?

Foundational distributed systems papers

Cloudspecs: Cloud Hardware Evolution Through the Looking Glass

Advice to the young

Agentic AI and The Mythical Agent-Month

Are We Becoming Architects or Butlers to LLMs?

Welcome to Town Al-Gasr