Posts

Showing posts from October, 2014

Clock-SI: Snapshot Isolation for Partitioned Data Stores Using Loosely Synchronized Clocks

Image
This paper appeared in SRDS 2013, and is concerned with the snapshot isolation problem for distributed databases/data stores.

What is snapshot isolation (SI)? (I took these definitions almost verbatim from the paper.)
SI is a multiversion concurrency control scheme with 3 properties:
1) Each transaction reads from a consistent snapshot, taken at the start of the transaction and identified by a snapshot timestamp. A snapshot is consistent if it includes all writes of transactions committed before the snapshot timestamp, and if it does not include any writes of aborted transactions or transactions committed after the snapshot timestamp.
2) Update transactions commit in a total order. Every commit produces a new database snapshot, identified by the commit timestamp.
3) An update transaction aborts if it introduces a write-write conflict with a concurrent committed transaction. Transaction T1 is concurrent with committed update transaction T2, if T1 took its snapshot before T2 committed a…

Facebook's software architecture

Image
I had summarized/discussed a couple papers (Haystack, Memcache caching) about Facebook's architecture before.

Facebook uses simple architecture that gets things done. Papers from Facebook are refreshingly simple, and I like reading these papers.

Two more Facebook papers appeared recently, and I briefly summarize them below.

TAO: Facebook's distributed data store for the social graph (ATC'13) A single Facebook page may aggregate and filter 100s of items from the social graph. Since Facebook presents each user with customized content (which needs to be filtered with privacy checks) an efficient, highly available, and scalable graph data store is needed to serve this dynamic read-heavy workload.

Before Tao, Facebook's web servers directly accessed MySql to read or write the social graph, aggressively using memcache as a look aside cache (as it was explained in this paper).

The Tao data store implements a graph abstraction directly. This allows Tao to avoid some of the fund…

Paper Summary: A self-configurable geo-replicated cloud storage system

Image
This paper is a followup work to the Pileus work, which I had covered here. Pileus aimed to help developers find a suitable consistency/latency combination for their application deployment. In Pileus, the configuration of primary and secondary nodes is assumed to be fixed (some storage nodes are designated as primary nodes, which hold the master data, while others are secondary nodes). The developer uses an SLA to state ranked preferences for latency and consistency of the reads that would make most sense for the application, and using this SLA, Pileus provides dynamic tuning of the performance of the application by deciding which read to forward to which replica/master.

This paper introduces a followup system, called Tuba, where the configuration is not fixed, and can be changed on the fly. Tuba extends Pileus to address the problem of finding an optimal configuration of primary and secondary replicas that maximizes the overall utility and minimizes the cost for the application.

Tuba…

Consistent snapshot analogies

Image
Last week I taught distributed snapshot in my CSE 586: Distributed Systems class. While I teach snapshot, I invariably find myself longing for analogies to provide some intuition about this concept. The global state captured by a distributed snapshot (say using Lamport/Chandy marker algorithm) does not correspond to the "state of the system at initiation of the snapshot". Furthermore, it also may not correspond to a "state of the system from initiation to current state during this computation". This is because while the snapshot taking is progressing in the system, the underlying system computation is also proceeding and changing the state of the system progressively. (Distributed snapshot is not allowed to stop/freeze underlying system computation as that reduces availability.)

For those curious about the question, "what good is a snapshot then?": The snapshot captures a reachable state from initiation state, and from the snapshot state the current state…

Popular posts from this blog

I have seen things

SOSP19 File Systems Unfit as Distributed Storage Backends: Lessons from 10 Years of Ceph Evolution

PigPaxos: Devouring the communication bottlenecks in distributed consensus

Frugal computing

Learning about distributed systems: where to start?

Fine-Grained Replicated State Machines for a Cluster Storage System

My Distributed Systems Seminar's reading list for Spring 2020

Cross-chain Deals and Adversarial Commerce

Book review. Tiny Habits (2020)

Zoom Distributed Systems Reading Group