Showing posts from April, 2020

Scalog: Seamless Reconfiguration and Total Order in a Scalable Shared Log

This paper appeared in NSDI'20 and is authored by Cong Ding, David Chu, Evan Zhao, Xiang Li, Lorenzo Alvisi, and Robbert van Renesse. The video presentation of the paper is really nice and gives a good overview. Here is a video of our discussion of the paper, if that is your learning style. (If you would like to participate in our paper discussions, you can join our Slack channel.)


Background

The problem considered is building a fault-tolerant scalable shared log. One way to do this is to employ a Paxos box for providing order and committing the replication to the corresponding shards. But as the number of clients increases, the single Paxos box becomes the bottleneck, and this does not scale. Corfu had the nice idea to divorce ordering and replication. The ordering is done by the Paxos box, i.e., the sequencer, which assigns unique sequence numbers to the data. Then the replication is offloaded to the clients, which contact the storage servers with…

Elle: Inferring Isolation Anomalies from Experimental Observations

This paper is by Kyle Kingsbury (of Jepsen fame) and Peter Alvaro (of beach wandering fame) and is available on arXiv.

Adya et al. (2000) showed that transaction isolation anomalies can be defined in terms of cycles over a Direct Serialization Graph (DSG) that captures the dependencies between transactions. Unfortunately, it was hard to utilize this DSG technique for isolation anomaly checking in practice because many database systems do not have any concept of a version order, or they do not expose that ordering information to clients. This paper shows that it is possible to use an encoding trick on the client side to emulate/maintain that ordering information and ensure that the results of database reads reveal information about their version history. The solution they find is the list data structure, which is supported by many databases. The paper also shows that lighter-weight data structures, such as sets, can be useful for checking violations of weaker isolation properties.
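To make the list trick a bit more concrete, here is a minimal sketch of how list reads can reveal a version order and feed a dependency graph. This is not Elle's actual implementation or API. The history format and the names `infer_edges` and `has_cycle` are made up for illustration, and the sketch assumes that every appended value is unique per key, that every observed value was appended by some transaction in the history, and that every read is a prefix of the longest read of that key.

```python
from collections import defaultdict

# Hypothetical history format: a list of (txn_id, ops), where each op is
# ("append", key, value) or ("read", key, observed_list).
def infer_edges(history):
    writer = {}                    # (key, value) -> txn that appended it
    longest = defaultdict(list)    # key -> longest list any read observed
    for txn, ops in history:
        for kind, key, arg in ops:
            if kind == "append":
                writer[(key, arg)] = txn
            elif kind == "read" and len(arg) > len(longest[key]):
                longest[key] = list(arg)

    edges = set()
    # Write-write: consecutive elements of a list give the version order.
    for key, order in longest.items():
        for a, b in zip(order, order[1:]):
            edges.add((writer[(key, a)], writer[(key, b)], "ww"))
    for txn, ops in history:
        for kind, key, arg in ops:
            if kind != "read":
                continue
            order = longest[key]
            if arg:
                # Write-read: the reader saw the version its last element created.
                edges.add((writer[(key, arg[-1])], txn, "wr"))
            if len(arg) < len(order):
                # Read-write: some txn installed the next version after this read.
                edges.add((txn, writer[(key, order[len(arg)])], "rw"))
    # A transaction depending on itself is not an anomaly, so drop self-edges.
    return {(a, b, kind) for a, b, kind in edges if a != b}

def has_cycle(edges):
    graph = defaultdict(set)
    for a, b, _ in edges:
        graph[a].add(b)
    WHITE, GREY, BLACK = 0, 1, 2
    color = defaultdict(int)

    def visit(node):
        color[node] = GREY
        for nxt in graph[node]:
            if color[nxt] == GREY or (color[nxt] == WHITE and visit(nxt)):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in list(graph))

# T1 and T2 each read a key the other one appends to, and neither sees the other.
skewed = [
    ("T1", [("read", "x", []), ("append", "y", 1)]),
    ("T2", [("read", "y", []), ("append", "x", 1)]),
    ("T3", [("read", "x", [1]), ("read", "y", [1])]),
]
serial = [
    ("T1", [("append", "x", 1)]),
    ("T2", [("read", "x", [1]), ("append", "x", 2)]),
    ("T3", [("read", "x", [1, 2])]),
]
print(has_cycle(infer_edges(skewed)))  # True
print(has_cycle(infer_edges(serial)))  # False
```

The point of the sketch is only that a read of `[1, 2]` tells the client that version `[1]` preceded version `[1, 2]`, which a plain register read would not reveal. Classifying which anomaly a cycle corresponds to is left out here.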

Fine-Grained Replicated State Machines for a Cluster Storage System

This paper appeared in NSDI 2020 and was authored by Ming Liu and Arvind Krishnamurthy, University of Washington; Harsha V. Madhyastha, University of Michigan; Rishi Bhardwaj, Karan Gupta, Chinmay Kamat, Huapeng Yuan, Aditya Jaltade, Roger Liao, Pavan Konka, and Anoop Jawahar, Nutanix.

The paper presents the design and implementation of a consistent and fault-tolerant metadata index for a scalable block storage system via a distributed key-value abstraction. The key idea is to use fine-grained replicated state machines (fRSM), where every key-value pair in the index is treated as a separate RSM to reduce tail latency in key-value access and provide robustness to key access skews.

Motivation

The problem arose from Nutanix's business in building private clouds for enterprises to enable them to instantiate VMs that run legacy applications. Cluster management software determines which node to run each VM on, migrating them as necessary. And Stargate provides a virtual disk abstractio…

DistSys Reading Group second meeting: Wormspace

We had our second Zoom DistSys Reading Group on Wednesday. The meeting is open to all who are working on distributed systems. Join our Slack channel for paper discussion and meeting links (password protected).

I had summarized the WormSpace paper before the meeting. It is a great paper. This week I was the presenter, and we started the meeting with my presentation for 30 minutes. Here is a link to my slides.

In the presentation I made sure to emphasize the benefit provided by WormSpace. It is an abstraction that enables developers to use distributed consensus as a building block for applications. Developers don't need to understand how distributed consensus via Paxos works. The API hides the details and complexity of Paxos under a data-centric API: capture, write, and read. The API is at a low enough level to enable efficient designs on top (as demonstrated for WormTX) without the need to open the Paxos box. Bunching WORs in WOS was also a very useful decision for improving the program…

What Pasha taught me about life

Pasha joined our family when he was barely 2 months old. Shortly after that the Covid-19 quarantine started, and we have spent our lives 24/7 with Pasha.
Pasha adopted our family on Thursday night. We follow his command now. pic.twitter.com/JnFrPC50ug (Murat Demirbas, @muratdemirbas, March 10, 2020)

We are all big fans of Pasha, but I particularly admire him as I am in awe of his approach to life. I fired my life-coach and decided to follow Pasha's teachings instead.

Here are the things I learned from Pasha.

Play hard, rest easy

Pasha has a lot of energy in the mornings. He bounces off the walls, climbs our curtains, and playfully harasses our wrists and ankles. If we try to snuggle with him in the morning, he runs away to continue his parkour route around the house.

At 11am, he crashes. He sleeps where he deems fit. If it is sunny, he finds a sunny spot in front of a window. But he also doesn't mind sleeping on the carpet in a busy room, oblivious to the foot traffic in …

WormSpace: A modular foundation for simple, verifiable distributed systems

This paper is by Ji-Yong Shin, Jieung Kim, Wolf Honore, Hernán Vanzetto, Srihari Radhakrishnan, Mahesh Balakrishnan, and Zhong Shao, and it appeared at SoCC'19.

The paper introduces the Write-Once Register (WOR) abstraction, and argues that the WOR should be a first-class system-building abstraction. By providing single-shot consensus via a simple data-centric API, the WOR acts as a building block for providing distributed systems durability, concurrency control, and failure atomicity.

Each WOR is implemented via a Paxos instance, and leveraging this, WormSpace (Write-Once-Read-Many Address Space) organizes the address space into contiguous write-once segments (WOSes) and provides developers with a shared address space of durable, highly available, and strongly consistent WORs to build on. For example, a sequence of WORs can be used to impose a total order, and a set of WORs can keep decisions taken by participants in distributed transaction protocols such as 2PC.
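To get a feel for the abstraction, here is a toy, single-process sketch of a write-once register and of a log built from a sequence of them. It is not the WormSpace API: there is no Paxos, replication, or durability here, and the class names, method signatures, and the token returned by `capture` are my assumptions for illustration. The sketch also assumes that `None` is never written as data.

```python
import itertools

class WriteOnceRegister:
    """In-memory stand-in for a WOR (a real one would run Paxos over replicas)."""

    def __init__(self):
        self._tokens = itertools.count(1)
        self._latest = 0        # token of the most recent capture
        self._written = False
        self._value = None

    def capture(self):
        # Hypothetical: take ownership of the register and get a token for it.
        self._latest = next(self._tokens)
        return self._latest

    def write(self, token, data):
        # Succeeds only for the latest capturer, and only once.
        if self._written or token != self._latest:
            return False
        self._value, self._written = data, True
        return True

    def read(self):
        return self._value if self._written else None

class WorLog:
    """A total order imposed by a sequence of WORs: slot i holds entry i."""

    def __init__(self):
        self._slots = []

    def append(self, data):
        index = 0
        while True:
            if index == len(self._slots):
                self._slots.append(WriteOnceRegister())
            reg = self._slots[index]
            # Skip slots that are already decided; claim the first free one.
            if reg.read() is None and reg.write(reg.capture(), data):
                return index
            index += 1

    def read(self, index):
        return self._slots[index].read() if index < len(self._slots) else None

reg = WriteOnceRegister()
stale = reg.capture()
fresh = reg.capture()             # a later capture invalidates the earlier token
print(reg.write(stale, "a"))      # False
print(reg.write(fresh, "b"))      # True
print(reg.read())                 # b

log = WorLog()
print(log.append("put k1"), log.append("put k2"))   # 0 1
```

Once a slot is written it never changes, so every reader of the log sees the same entry at the same position, which is what gives the total order.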


To demonstrate its …

Gryff: Unifying consensus and shared registers

This paper is by Matthew Burke, Audrey Cheng, and Wyatt Lloyd, and appeared in NSDI'20. Here is a link to the paper, slides, and video presentation.

Straight talk (from the middle of the book)

The model of the paper is a great contribution. Stable versus unstable ordering is a good framework to think in.
Logical clock timestamping with carstamps (consensus after registers) is a good way to realize this ordering. I think carstamps will see good adoption, as the idea is clear, concrete, and useful.
Constructing a hybrid of EPaxos and ABD is a novel idea.
The performance of Gryff is not good. A straightforward distributed key-value sharded implementation of Paxos would do a better job. I think Hermes is a better choice than Gryff with read-write and read-modify-write operations.

Introduction

Recently we have seen a lot of interest in unifying consensus and shared registers, the topic of the paper. I think this is because of the popularity of distributed key-value stores/systems. While consensus is often…

Our first Zoom DistSys reading group meeting

We held our first Zoom DistSys reading group meeting on Wednesday, April 1st, at 15:30 EST. We discussed the Gryff paper.

I didn't have much Zoom experience, and reaching out to the world at large to run a reading group with whoever is interested was very experimental.

20 people attended. As I was introducing the format, one person started writing chat messages, saying "this is so boring", etc. He had connected with a phone, and the video was showing him walking, probably in a market. This should have been a red flag. The meeting participants asked me to remove him, because he was pinging them and bothering them as well. That was our troll.

I had taken measures to stop Zoom-bombing, since I had heard this was an issue. Only the hosts and co-hosts could share screen. I made two co-hosts to help with moderation. I had selected the option to disallow joining after removal. I removed the troll, and there were no incidents after that.

The meeting took 90 minutes. The present…
