PNUTS: Yahoo!'s hosted data serving platform

Given that web users are scattered across the globe, it is critical to have data replicas on multiple continents for low-latency access and high-availability. Supporting general transactions is typically unnecessary, since web applications tend to manipulate only one record at a time. Moreover, supporting general transactions (with serializability) over a globally-replicated distributed system is expensive and impractical. On the other hand, it is still useful to have a stronger consistency guarantee than the very weak "eventual consistency". It is often acceptable to read slightly stale data but having replicas use the same order of updates is important because out of order updates may expose undesirable (outside the invariant, unsafe) state at some replicas.

The most important contribution of the PNUTS paper is an asynchronous replication architecture that achieves serializability of writes to each record. This feat is achieved by using record-level masters.

Record-level masters
One way to serialize writes is to forward all the writes to a master replica, and the master replica serializes the writes according to the order it receives them, and the other replicas adopt this ordering determined by the master. The problems with this approach are that 1) writes are subject to WAN latency till the master, and 2) if the master goes down, writes become unavailable until another master takes over.

The truly decentralized approach, which provides low-latency and high-availability for writes, is to let any replica process any write request locally, but that of course means concurrent conflicting updates.

PNUTS proposes a middle ground. Instead of a single master to serialize all the write requests, PNUTS designates per-record masters to serialize the write requests for that record. So, at any time, each replica is serving as a master for a portion of the records in the system, and for each record there is a designated master. The designated master can be adaptively changed to suit the workload, the replica receiving the majority of recent write requests for a particular record becomes the master for that record. So, PNUTS achieve low-latency writes. PNUTS also achieve high-availability because in the event of a master failure, the records which are assigned to other masters are unaffected by this failure.

More clarifications about PNUTS architecture are as follows. PNUTS uses full replication; all the records are fully replicated at all replicas. Read requests can be satisfied by a local replica, while write requests must be satisfied by the master record. Banking on write locality, if a record is written by the same site three consecutive times, the new master for that record is designated to be the replica at that site. The consistency (serializability and timeline-consistent updates) guarantees provided by PNUTS is at the level of single record transactions. PNUTS makes no guarantees as to consistency for multi-record transactions.

Yahoo! Message Broker (YMB)
YMB is a publish subscribe system that is very crucial for the operation of PNUTS. (PNUTS + YMB is known as the Sherpa data services platform.) YMB takes care of asynchronous replication of updates in PNUTS. For asynchronous replication, YMB guarantees "timeline consistency" where replicas may not be consistent with each other but updates are guaranteed to be applied in the same order at all replicas. (Some replicas may be slow and lag behind other replicas in updates.)

Surprisingly, YMB also handles master handovers and failure/recovery in PNUTS. So, the presentation of PNUTS is very intertwined with the YMB, and unfortunately details of YMB (a propriatery technology of Yahoo!) operation are missing in the paper. YMB is described as this infallible service, but the question is what happens when YMB fails. Neither YMB and PNUTS are open-source as of 2011.

Discussion
I was worried that it would be hard to handle failure/recovery of a replica. Then I realized that replicas are per record-level. Records are always modified in an atomic manner; they are atomic objects. So we can use the last write wins rule (by copying the latest version of the record), and recovery is done. We don't have multiple objects tainted, we don't have to worry about that kind of a messy inconsistency.

The presentation of the paper is unfortunately not the best organization of the material possible. I would have liked it better if the paper covered the core contribution, record-level masters concept, in more detail, including detailed explanation of master handovers and replica failure/recovery operations. Instead in the paper there was a separate thread (and a lot of coverage) on table-scans, scatter-gather operations, bulk loading. I guess some users need such operations, but their motivations were not clear to me. And in any case these are provided just as best-effort operations.

Additional links

Comments

Digitalbit said…
This comment has been removed by a blog administrator.

Popular posts from this blog

The end of a myth: Distributed transactions can scale

Hints for Distributed Systems Design

Foundational distributed systems papers

Learning about distributed systems: where to start?

Metastable failures in the wild

Scalable OLTP in the Cloud: What’s the BIG DEAL?

The demise of coding is greatly exaggerated

SIGMOD panel: Future of Database System Architectures

Dude, where's my Emacs?

There is plenty of room at the bottom