SRDS Day 2

Ok, continuing from the SRDS day 1 post, I bring you SRDS day 2.

Here are my summaries from the keynote, and from the talks for which I took some notes. 


Mahesh Balakrishnan's Keynote

Mahesh's keynote was titled "An Hourglass Architecture for Distributed Systems". His talk focused on the evolution of his perspective on distributed systems research and the importance of abstraction in managing complexity. He began by reflecting on how, in his PhD days, he believed the goal was to develop better protocols with nice properties. He said he later realized that the true challenge in distributed systems research lies in creating new abstractions that simplify these complex systems. Complexity creeps into distributed systems through failures, asynchrony, and change. Mahesh also confessed that he didn't appreciate how important managing change is until his days in industry.

While other fields in computer science have successfully built robust abstractions (such as layered protocols in networking, and the block storage, process, and address space abstractions in operating systems), distributed systems have lagged behind in this respect. We did have foundational papers on state machine replication, group communication, and shared registers when the field was in its theory phase, but when the field moved into practical deployment we somehow ended up with lots and lots of noncomposable protocols. Even within a single company you would have parallel silos and teams working on similar but incompatible stacks, for example, MySQL over Raft, ZooKeeper over Zab, Scribe over LogDevice, and ZippyDB over MultiPaxos.

We need a narrow waist for the distributed systems stack to unite, reconcile, and manage these different silos. Mahesh proposes the "shared log abstraction" as a solution. The shared log abstraction aims to disaggregate consensus from the rest of the system, providing a narrow waist for distributed systems similar to how the block device API works for operating systems. The insight is to realize that a log is but an address space and a counter: you can RAID the address space, and you can use a passive key-value store for storing the log entries. Mahesh started on this idea with the Corfu project in 2012, and continued through various iterations and implementations over the last decade and a half.
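To make the "address space plus counter" view concrete, here is a minimal sketch in Python (my own illustration, not Corfu or Delos code; all class and method names are made up): a sequencer hands out log positions, and positions are striped across passive key-value shards, which is the sense in which the address space can be RAIDed.

```python
import threading

class Sequencer:
    """Hands out the next free log position (the 'counter' half of the insight)."""
    def __init__(self):
        self._next = 0
        self._lock = threading.Lock()

    def next_position(self):
        with self._lock:
            pos = self._next
            self._next += 1
            return pos

class SharedLog:
    """The 'address space' half: positions are striped across passive KV shards."""
    def __init__(self, shards):
        self.sequencer = Sequencer()
        self.shards = shards                    # each shard is just a dict-like KV store

    def _shard_for(self, pos):
        return self.shards[pos % len(self.shards)]   # simple striping ("RAIDing")

    def append(self, payload):
        pos = self.sequencer.next_position()
        self._shard_for(pos)[pos] = payload     # write to the owning stripe
        return pos

    def read(self, pos):
        return self._shard_for(pos).get(pos)

# Usage: three in-memory "shards" standing in for passive storage servers.
log = SharedLog([{}, {}, {}])
p = log.append(b"state-machine command")
assert log.read(p) == b"state-machine command"
```

The point of the sketch is only that clients see one append/read API, while the data plane underneath is dumb, replaceable storage.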

Mahesh then talked about his production experience at Facebook starting from 2017, where he and his team developed Delos, a production-ready storage system built on this shared log concept. He described the challenges they faced and the solutions they developed.

They first built a proof of concept, Delos v1, which used ZooKeeper as the shared log and RocksDB as the storage engine. This was a hacky prototype, more of an exercise in "how to build a production-ready storage system in 8 months". The second step made the project more interesting: virtualizing consensus via the VirtualLog, which made it possible to RAID the log. The key to doing that was the MetaStore, which maintained a mapping of which sections of the log's address space are mapped to where. This enabled upgrading the system on the fly, and Mahesh decided to stay on at Facebook, extending his sabbatical, to keep working on this. Mahesh had written about this Delos work before, and I had provided a summary of it here.
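Here is a toy sketch (again my own, with invented names, assuming nothing about the real Delos implementation) of the MetaStore/VirtualLog idea: the MetaStore records which range of the virtual address space is served by which underlying loglet, so a reconfiguration seals the address space at the tail and routes later positions to a new loglet, which is what makes on-the-fly upgrades possible.

```python
import bisect

class DictLoglet:
    """Stand-in for a real loglet (e.g. a ZooKeeper-backed or native log implementation)."""
    def __init__(self):
        self.entries = {}
    def write(self, pos, payload):
        self.entries[pos] = payload
    def read(self, pos):
        return self.entries.get(pos)

class MetaStore:
    """Maps [start, ...) ranges of the virtual address space to loglets."""
    def __init__(self, initial_loglet):
        self.starts = [0]                 # sorted range start positions
        self.loglets = [initial_loglet]

    def loglet_for(self, pos):
        i = bisect.bisect_right(self.starts, pos) - 1
        return self.loglets[i]

    def reconfigure(self, seal_pos, new_loglet):
        """Seal the address space at seal_pos; later positions go to new_loglet.
        (A real system also has to handle sealing races; this sketch ignores them.)"""
        self.starts.append(seal_pos)
        self.loglets.append(new_loglet)

class VirtualLog:
    def __init__(self, metastore):
        self.meta = metastore
        self.tail = 0

    def append(self, payload):
        pos = self.tail
        self.meta.loglet_for(pos).write(pos, payload)
        self.tail += 1
        return pos

    def read(self, pos):
        return self.meta.loglet_for(pos).read(pos)

# Usage: start on one loglet, then swap in a new loglet at position 2 without
# clients noticing; old entries remain readable from the old loglet.
meta = MetaStore(DictLoglet())
vlog = VirtualLog(meta)
vlog.append("a"); vlog.append("b")
meta.reconfigure(seal_pos=2, new_loglet=DictLoglet())
vlog.append("c")
assert vlog.read(0) == "a" and vlog.read(2) == "c"
```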

Mahesh then discussed the evolution of this work, highlighting the importance of addressing the "code velocity problem": the challenge of making changes to production systems safely and quickly. This is a big problem because it keeps hundreds of engineers busy just managing on-the-fly system upgrades and migrations, keeping systems aligned, upgraded, and patched in flight.

This led to the development of a stacked state machines abstraction, allowing for easier updates and modifications to the system. The idea is to have an app layer on top of an engine layer on top of a base layer, and, like TCP/IP headers, each layer adds its own header state. Mahesh wrote about this in SOSP'21, which I summarized here. He said that the layered abstraction also led to disagreements among engineers about what belongs at which layer, and that, given the overhead of header appending, better architectures may be possible.
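A rough sketch of the stacking idea (my own illustration, not the SOSP'21 code; names are placeholders): each layer wraps the entry it proposes with its own small header, TCP/IP-style, and strips that header again when applying entries on the way back up.

```python
class BaseLayer:
    """Bottom of the stack; stands in for the shared log."""
    def __init__(self):
        self.log = []

    def propose(self, entry):
        self.log.append(entry)
        return entry

class Layer:
    """One state machine in the stack; adds/strips its own header."""
    def __init__(self, name, lower):
        self.name = name
        self.lower = lower

    def propose(self, entry):
        wrapped = {"hdr": self.name, "body": entry}   # add this layer's header
        return self.lower.propose(wrapped)

    def apply(self, entry):
        assert entry["hdr"] == self.name              # strip own header, pass body up
        return entry["body"]

# Usage: app on top of engine on top of base.
base = BaseLayer()
engine = Layer("engine", base)
app = Layer("app", engine)
app.propose({"op": "put", "key": "x"})

# The logged entry carries one header per layer, innermost last:
# {'hdr': 'engine', 'body': {'hdr': 'app', 'body': {'op': 'put', 'key': 'x'}}}
logged = base.log[0]
assert app.apply(engine.apply(logged)) == {"op": "put", "key": "x"}
```

The header-per-layer overhead visible here is exactly the cost Mahesh mentioned when suggesting better architectures may be possible.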

Mahesh concluded by reflecting on the success of this approach in production: it now handles over 40 billion transactions per day at Meta as the control plane for metadata applications. The abstraction has its cost, but since the control plane does not require super-low-latency transactions, this is not an issue, and the benefits (simplifying operations, allowing rapid change/adaptation) are huge. Recently Mahesh wrote a short note on these benefits, which I summarized here. Mahesh's work demonstrates a successful bridge between theoretical research and practical implementation in large-scale distributed systems.


Simpler is Better: Revisiting Mencius State Machine Replication

The authors are B. Cui, A. Charapko.

Aleksey has called for designing for practical fault tolerance. He said to avoid algorithm-only fault tolerance, which doesn't take the cost of fault tolerance into account. For practical and successful deployments, you need to reduce the performance impact of fault tolerance. So fault tolerance/recovery should be a first-class citizen, and it is important to maintain a performance gradient and predictability in your recovery. Aleksey has written a summary of this paper on his blog, so I will not explain more here.


ARES II: Tracing the Flaws of a (Storage) God

The authors are C. Georgiou, N. Nicolaou, A. Trigeorgi.

I had talked about atomic storage and concurrent reconfiguration on my blog a couple of times before. It turns out that ARES (Adaptive, Reconfigurable, Erasure-coded, Atomic Storage) is one of the latest works on this topic. This paper built on ARES, added a monitoring infrastructure to find where the bottlenecks are, and refactored a couple of things to make ARES even more performant.


Pre-LogMGAE: Identification of Log Anomalies Using a Pre-trained Masked Graph Autoencoder

The authors are A. Wu, Y. Kwon.

This paper introduces a new and generalized approach to log-based anomaly detection in software systems. It uses a masked graph autoencoder with contrastive learning for self-supervised pretraining, addressing limitations of current methods that treat logs as simple sequences. The approach incorporates Graph Attention Networks with Gated Recurrent Units to capture both long-term and short-term dependencies in log events. I like this work because it could help improve anomaly detection over multi-source system logs. It would be a nice service to have in your company, one you could point at the logs of different services as your offerings grow.
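As a rough illustration of the masked-graph-autoencoder pretraining idea (my own heavily simplified sketch, not the paper's architecture; it omits the GRU and contrastive pieces, and all names and dimensions are made up): mask a fraction of log-event node features, encode the graph with a small attention layer, and train a decoder to reconstruct the masked nodes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyGraphAttention(nn.Module):
    """Single-head graph attention over a dense adjacency (self-loops assumed)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.a = nn.Linear(2 * out_dim, 1, bias=False)

    def forward(self, x, adj):
        h = self.W(x)                                             # (N, out_dim)
        n = h.size(0)
        pairs = torch.cat([h.unsqueeze(1).expand(n, n, -1),
                           h.unsqueeze(0).expand(n, n, -1)], dim=-1)
        scores = F.leaky_relu(self.a(pairs)).squeeze(-1)          # (N, N)
        scores = scores.masked_fill(adj == 0, float("-inf"))
        return torch.softmax(scores, dim=-1) @ h

class MaskedGraphAutoencoder(nn.Module):
    def __init__(self, feat_dim, hidden_dim):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(feat_dim))
        self.encoder = TinyGraphAttention(feat_dim, hidden_dim)
        self.decoder = nn.Linear(hidden_dim, feat_dim)

    def forward(self, x, adj, mask_ratio=0.3):
        n = x.size(0)
        masked = torch.rand(n) < mask_ratio       # which log-event nodes to hide
        if not masked.any():
            masked[0] = True                      # mask at least one node
        x_in = x.clone()
        x_in[masked] = self.mask_token
        z = self.encoder(x_in, adj)
        recon = self.decoder(z)
        # Self-supervised pretraining: reconstruction loss only on masked nodes.
        return F.mse_loss(recon[masked], x[masked])

# Usage: 6 log-event nodes with 16-dim embeddings and a dense adjacency with self-loops.
x = torch.randn(6, 16)
adj = torch.ones(6, 6)
model = MaskedGraphAutoencoder(16, 32)
loss = model(x, adj)
loss.backward()
```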


