Posts

Showing posts with the label reading-group

Deep reading

How many hours does it take you to read a ~10 page research paper? — Murat Demirbas (Distributolog) (@muratdemirbas) February 8, 2022 Twenty years ago, a well-known professor in the computer networking field told me that he reviews any paper in 30 minutes. Not just read the paper, but also write the review, mind you. All in 30 minutes! I said, "I am slow; it takes me 4 hours to read a paper." I lied. It actually took me 8+ hours to read papers, because I was a graduate student and didn't have much background or paper-reading experience. Things have improved, but it still takes me 4-8 hours to read a paper and understand everything so I can write an informative conference review or blog post about the paper. (Of course, I am talking about good research papers, not content-free, publication-for-the-sake-of-publication papers.) Maybe because of that early encounter with that flamboyant networking professor, I always felt I was very slow in reading a paper. I checked with fellow ...

Sundial: Fault-tolerant Clock Synchronization for Datacenters

This paper appeared recently in OSDI 2020. This paper is about clock synchronization in the data center. I presented this paper for our distributed systems Zoom meeting group. I took a wider view of the problem by explaining time-synchronization challenges and the fundamental techniques for achieving precise time synchronization. I will take the same path in this post as well. It is a bit of a circuitous road, but it makes for a scenic, pleasurable journey. So let's get going. The benefits of better time synchronization: For any distributed system, timestamping and ordering of events is very important. Processes in a distributed system run concurrently without knowing what the other processes are doing at the moment. Processes learn about each other's states only by sending and receiving messages, and this information by definition comes from the past states of the nodes. A process needs to compose a coherent view of the system from these messages, all the while the system is movi...
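Sundial itself synchronizes physical clocks in hardware, which a short snippet cannot capture. As a contrast to that approach, here is a minimal Lamport logical-clock sketch (my illustration, not anything from the paper) showing the baseline way processes order events using only message passing:

```python
# Minimal Lamport logical clock sketch (illustrative only; Sundial
# synchronizes physical clocks in hardware, which this does not model).
class Process:
    def __init__(self):
        self.clock = 0

    def local_event(self):
        self.clock += 1
        return self.clock

    def send(self):
        # Tick, then attach the timestamp to the outgoing message.
        self.clock += 1
        return self.clock

    def receive(self, msg_ts):
        # Advance past the sender's timestamp so causality is respected.
        self.clock = max(self.clock, msg_ts) + 1
        return self.clock

p, q = Process(), Process()
p.local_event()   # p's clock: 1
ts = p.send()     # p's clock: 2, message carries timestamp 2
q.receive(ts)     # q's clock: max(0, 2) + 1 = 3
```

The point of the sketch is that logical clocks only give causal ordering; precise physical time synchronization, which Sundial targets, enables much stronger ordering guarantees without message exchanges.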

Read papers, Not too much, Mostly foundational ones

Here is my advice to people who want to develop competence and expertise in their fields. Read papers. By papers, I mean technical research papers, not white papers or blog posts. By read, I mean read rigorously and critically. Not too much. If you read rigorously and critically, you cannot read too many papers. Moreover, learning by doing is the only way to internalize and grok a concept. If you read papers all day, you don't have time to try things yourself. If you are a PhD student, maybe read two or three papers a week (but, remember, rigorously and actively). If you are not in academia, maybe read one paper every week or two. Mostly foundational ones. While there are exceptions, it is better to prioritize: seminal work over incremental work, general principled work over point solutions, and work introducing techniques/tools over work applying techniques. A big exception is good expository papers. Unfortunately, academia treats them as something the cat...

High availability in cheap distributed key value storage

This paper is authored by Thomas Kim, Daniel Lin Kit, Gregory Ganger, Michael Kaminsky, and David Andersen. It appeared in SOCC 2020. The paper talks about using NVMM for building a distributed key-value store. NVMMs are a new technology. They are slowly rolling into the datacenters, but there are still questions about their performance and how many writes they can handle before write wear sets in. NVMMs have very low latency, comparable to DRAM, yet they are 10 times cheaper. Awesome, right? Unfortunately, they don't have the same high bandwidth as DRAM or SSDs. Also, they are still not anywhere near as cheap as SSDs, and it may not be affordable to build an all-NVMM key-value store. Before reading this paper, it is important to understand that this is all about money, money, money. The cost of NVMM and SSDs influences the design decisions, and leads to this non-straightforward heterogeneous design. If cost were not an issue, we could even have DRAMs for all replicas in K-V ...

Ocean Vista: Gossip-Based Visibility Control for Speedy Geo-Distributed Transactions

This paper appeared in VLDB'19 and is authored by Hua Fan and Wojciech Golab. The paper is about providing strict serializability in geo-replicated databases. The technique it uses can be summarized as "geo-replicate-ahead transactions". First the transaction T is replicated to all the parties across the datacenters. The parties then check that the watermark rises above T, ensuring that all transactions including and preceding T have been replicated successfully to all the parties. Then, the execution of T can be done asynchronously at each party. This should remind you of the SLOG paper: SLOG uses a deterministic architecture to move most of this communication outside of conflict boundaries, enabling these transactions to be processed at high throughput. More specifically, SLOG relies on deterministic processing to avoid two-phase commit. Once all parties agree to the plan, processing occurs (mostly) independently on each node, with the system relying on the pl...
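The watermark idea above can be sketched in a few lines. This is my toy simplification, not the Ocean Vista protocol itself; all names are illustrative, and it assumes integer transaction timestamps assigned densely from 1:

```python
# Toy sketch of watermark-gated visibility (names and structure are my
# illustration, not the paper's). Each party tracks, per transaction
# timestamp, which parties have replicated it; a transaction T becomes
# executable only once the watermark rises to cover T's timestamp.
class Party:
    def __init__(self, n_parties):
        self.n_parties = n_parties
        self.acks = {}  # txn timestamp -> set of party ids that replicated it

    def record_ack(self, ts, party_id):
        self.acks.setdefault(ts, set()).add(party_id)

    def watermark(self):
        # Longest prefix of timestamps 1, 2, ... fully replicated everywhere.
        wm = 0
        while len(self.acks.get(wm + 1, ())) == self.n_parties:
            wm += 1
        return wm

    def visible(self, ts):
        # T may execute only when all transactions up to T are replicated.
        return ts <= self.watermark()

party = Party(n_parties=3)
for pid in range(3):
    party.record_ack(1, pid)   # txn 1 replicated at all three parties
party.record_ack(2, 0)         # txn 2 replicated at only one party

party.visible(1)   # True: watermark has risen past txn 1
party.visible(2)   # False: watermark stays at 1, txn 2 must wait
```

Once the watermark covers T, each party can execute T asynchronously, which is what makes the asynchronous-execution step of the protocol safe.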

My Distributed Systems Seminar's reading list for Fall 2020

For the Fall semester distributed systems seminar, we will discuss these papers:
Bipartisan Paxos: A Family of Fast, Leaderless, Modular State Machine Replication Protocols
eXtreme Modelling in Practice
Starling: A Scalable Query Engine on Cloud Function Services
Lambada: Interactive Data Analytics on Cold Data using Serverless Cloud Infrastructure
Tiered Replication: A Cost-effective Alternative to Full Cluster Geo-replication
Scalable State-Machine Replication
Designing Distributed Systems Using Approximate Synchrony in Data Center Networks
Armada: Low-Effort Verification of High-Performance Concurrent Programs
Ocean Vista: Gossip-Based Visibility Control for Speedy Geo-Distributed Transactions
Consolidating Concurrency Control and Consensus for Commits under Conflicts
Tales of the Tail: Hardware, OS, and Application-level Sources of Tail Latency
Near-Optimal Latency Versus Cost Tradeoffs in Geo-Distributed St...

The Impact of RDMA on Agreement

This paper appeared in PODC 2019. PODC stands for Principles of Distributed Computing. It is a theoretical distributed systems conference, first established in 1982. I published my first-ever paper, "Resettable Vector Clocks", at PODC 2000. That conference was held in Portland and, as a fresh graduate student attending it, I was awe-struck. On day one of the conference, I saw Leslie Lamport interrupt a talk to ask a question, protesting loudly. Then Keith Marzullo in the audience (I think) pointed out the misunderstanding, and Leslie Lamport said "Nevermind" and calmed down. I also noticed that the person sitting next to me was, oh my God, Nancy Lynch! I couldn't believe my luck, seeing all this distributed systems royalty in person. Also at this conference, on day three, Eric Brewer gave his CAP theorem talk. Good times! Anyways, back to the paper. Contributions of the paper: Under the message-passing model, BFT consensus requires ...

Streamlet: textbook streamlined blockchains

In this 2020 technical report, the authors claim to present a very simple BFT consensus protocol. I take it that "textbook" in the title is used as praise for the simplicity of the algorithm. Nakamoto consensus is definitely simpler than Streamlet, but in contrast to Nakamoto consensus, Streamlet provides deterministic finalization and does not suffer from costly proof-of-work (POW) and the low throughput POW induces. And, among the protocols that provide deterministic finalization, including PBFT, Tendermint, and HotStuff, yes, Streamlet is the simplest protocol. So, yes, Streamlet deserves your attention. The Streamlet protocol: Streamlet is an amalgamation of existing techniques. It borrows from Casper, HotStuff, and Elaine Shi's earlier work. But that's OK. The goal is to show the simplest protocol, not necessarily the most novel or the most efficient protocol. Here is the protocol, verbatim from the introduction of the paper. In Streamlet, each epoch has a de...

The FuzzyLog: A partially ordered shared log (OSDI'18)

This paper appeared in OSDI'18, and is authored by Joshua Lockerman, Jose M. Faleiro, Juno Kim, Soham Sankaran, Daniel J. Abadi, James Aspnes, Siddhartha Sen, and Mahesh Balakrishnan. The paper does not suggest many novel ideas or algorithms, but rather shows a nice amalgamation of existing ideas and algorithms for providing a more scalable version of a shared log across multiple datacenters. Having a shared log and funneling all updates through it provides simplicity for building fault-tolerant, consistent distributed systems. The Tango paper from SOSP'13, on which this work builds, argued for the simplicity benefits of such a system. Recently Jay Kreps wrote a nice and accessible explanation of the benefits of using globally shared logs for building distributed systems. Despite the simplicity it provides, maintaining a shared log with a system-wide total order is impractical, because it is expensive: at large scale, the single sequencer will be a bottleneck for the...
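To make the bottleneck concrete, here is a toy sketch (my illustration, not the paper's design) of a totally ordered shared log with a single sequencer; FuzzyLog's whole point is to relax exactly this single serialization point:

```python
# Toy totally-ordered shared log (illustrative only; FuzzyLog
# replaces this total order with a partial order to avoid the
# single-sequencer bottleneck described in the post).
import itertools

class SharedLog:
    def __init__(self):
        self._seq = itertools.count()  # the single global sequencer
        self._entries = {}

    def append(self, payload):
        # Every append from every client serializes through this one
        # counter: simple to reason about, but a throughput bottleneck.
        pos = next(self._seq)
        self._entries[pos] = payload
        return pos

    def read(self, pos):
        return self._entries[pos]

log = SharedLog()
log.append("put x=1")   # assigned position 0
log.append("put y=2")   # assigned position 1
```

Because every update, regardless of which shard or datacenter it touches, must obtain a position from the one sequencer, throughput is capped by that component; this is the expense the paper's partially ordered design avoids.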

State-Machine Replication for Planet-Scale Systems (Eurosys 2020)

This paper appeared at Eurosys 2020, and is authored by Vitor Enes, Carlos Baquero, Tuanir França Rezende, Alexey Gotsman, Matthieu Perrin, and Pierre Sutra. The paper introduces Atlas, a consensus protocol similar to EPaxos that builds on the Fast Paxos idea. Most people call these protocols leaderless, but I think that is a misnomer. I call these protocols opportunistic/any-leader protocols. In these protocols, any node can propose a command by becoming an opportunistic leader and using a fast path. If there is no other simultaneously proposed command by another opportunistic leader, or if this command commutes with all simultaneously proposed commands, the fast path is enough for achieving consensus. In the presence of concurrent non-commuting commands, the protocol has to take a slow path, the purpose of which is to establish (teach the other nodes) the dependencies learned for this command in the fast-path attempt. This opportunistic/any-leader family of protocols...
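The fast-path condition above can be sketched as a commutativity check. This is my simplification for illustration, not the Atlas protocol itself; it assumes commands are single-key writes, where two writes commute iff they touch different keys:

```python
# Illustrative sketch of the fast-path condition (my simplification,
# not the Atlas protocol): a command can commit on the fast path only
# if it commutes with every concurrently proposed command.
def commutes(cmd_a, cmd_b):
    # Single-key writes commute iff they touch different keys.
    return cmd_a["key"] != cmd_b["key"]

def fast_path_ok(cmd, concurrent_cmds):
    return all(commutes(cmd, other) for other in concurrent_cmds)

w_x  = {"op": "write", "key": "x"}
w_y  = {"op": "write", "key": "y"}
w_x2 = {"op": "write", "key": "x"}

fast_path_ok(w_x, [w_y])         # True: disjoint keys, fast path suffices
fast_path_ok(w_x, [w_y, w_x2])   # False: conflicting write to x, slow path
```

When the check fails, the slow path must run to agree on the dependency ordering between the conflicting commands before they execute.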
