Foundational distributed systems papers

I talked about the importance of reading foundational papers last week. To followup, here is my compilation of foundational papers in the distributed systems area. (I focused on the core distributed systems area, and did not cover networking, security, distributed ledgers, verification work etc. I even left out distributed transactions, I hope to cover them at a later date.) 

I classified the papers by subject, and listed them in chronological order. I also listed expository papers and blog posts at the end of each section.


Time and State in Distributed Systems

Time, Clocks, and the Ordering of Events in a Distributed System. Leslie Lamport, Commn. of the ACM,  1978.

Distributed Snapshots: Determining Global States of a Distributed System. K. Mani Chandy Leslie Lamport, ACM Transactions on Computer Systems, 1985.

Virtual Time and Global States of Distributed Systems. Mattern, F. 1988.

Practical uses of synchronized clocks in distributed systems. B. Liskov, 1991.



Expository papers and blog posts

There is No Now. Justin Sheehy, ACM Queue 2015

Why Logical Clocks are Easy. Carlos Baquero and Nuno Preguiça, ACM Queue 2016.

Hybrid logical clocks

Logical clocks and Vector clocks modeling in TLA+/PlusCal


Impossibility results 

Coordinated Attack or Two Generals problem is a fundamental impossibility in distributed systems. It started more of folks theorem, so instead of pointing to a paper, I provide a link to the wikipedia page. 

Impossibility of Distributed Consensus with One Faulty Process, Fischer, Lynch and Patterson, JACM, 1985

Unreliable Failure Detectors for Reliable Distributed Systems, Tushar Deepak Chandra and Sam Toueg, Journal of the ACM, 1996.

Harvest, Yield, and Scalable Tolerant Systems, Armando Fox, Eric A. Brewer, 1999

CAP Twelve years later: How the rules have changed, Eric Brewer, 2012.

Expository papers and blog posts

A Brief Tour of FLP Impossibility

Paper summary: Perspectives on the CAP theorem


Consensus and state machine replication

Viewstamped Replication: A New Primary Copy Method to Support Highly-Available Distributed Systems. B. Oki and B. Liskov. SOSP 1988

Implementing Fault-Tolerant Services Using the State Machine Approach: a Tutorial, Fred Schneider, 1990

How to build a highly available system with consensus, Butler Lampson, 1996

The Part-Time Parliament (Paxos) Leslie Lamport, 1998. (See context

Practical Byzantine Fault Tolerance. Miguel Castro, Barbara Liskov. OSDI 1999.

Chain Replication for Supporting High Throughput and Availability. Robbert van Renesse and Fred B. Schneider, OSDI 2004. 

ZooKeeper: Wait-free coordination for Internet-scale systems. Patrick Hunt, Mahadev Konar, Flavio P. Junqueira, Benjamin Reed, Usenix ATC 2010.

Tango: Distributed Data Structures over a Shared Log, Mahesh Balakrishnan, Dahlia Malkhi, Ted Wobber, Ming Wu, Vijayan Prabhakaran, Michael Wei, John D. Davis, Sriram Rao, Tao Zou, Aviad Zuckk. SOSP 2013.

There is more consensus in Egalitarian parliaments. Iulian Moraru, David G. Andersen, Michael Kaminsky, SOSP 2013.

Flexible Paxos: Quorum intersection revisited. Heidi Howard, Dahlia Malkhi, Alexander Spiegelman. 2016.

WormSpace: A modular foundation for simple, verifiable distributed systems. Ji-Yong Shin, Jieung Kim, Wolf Honore, Hernán Vanzetto, Srihari Radhakrishnan, Mahesh Balakrishnan, Zhong Shao, SOCC'19. 

Expository papers and blog posts

In Search of an Understandable Consensus Algorithm. Diego Ongaro, John Ousterhout, Usenix ATC, 2014. 

Paxos Made Moderately Complex. Robbert Van Renesse and Deniz Altinbuken, ACM Computing Surveys, 2015.

Consensus in the Cloud: Paxos Systems Demystified. Ailidani Ailijiang, Aleksey Charapko, Murat Demirbas, 2016.

Modeling Paxos and Flexible Paxos in Pluscal and TLA+

Dissecting performance bottlenecks of Paxos protocols.


Distributed algorithms

Self-stabilizing systems in spite of distributed control, Edsgar W. Dijkstra, CACM 1974.

The Drinking Philosophers Problem, K. M. Chandy, J. Misra, ACM TOPLAS 1984

Sparse partitions, Baruch Awerbuch, David Peleg, FOCS 1990. 

Distributed reset, Anish Arora, Mohamed Gouda, 1994

The Arrow Distributed Directory Protocol, Michael J. Demmer, M. Herlihy, DISC 1998. 

Expository papers and blog posts

Dijkstra's stabilizing token ring algorithm

Modeling the hygienic dining philosophers algorithm in TLA+


Miscellaneous

Hints for computer system design, Butler Lampson, 1983

The role of distributed state, John Ousterhout, 1990

SEDA: An Architecture for Well-Conditioned, Scalable Internet Services. Matt Welsh, David Culler, and Eric Brewer, SOSP 2001

Crash only software, George Candea, Armando Fox, HotOS 2003

Expository papers and blog posts

Learning about distributed systems: where to start?


Cloud computing, big data storage/processing 

Lessons from Giant-Scale Services. Eric A. Brewer, IEEE Internet Computing, 2001.

MapReduce: Simplified Data Processing on Large Clusters. Jeffrey Dean and Sanjay Ghemawat, OSDI 2004.

Optimistic Replication, Yasushi Saito and Marc Shapiro, 2005.

Dynamo: Amazon’s Highly Available Key-Value Store. Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall and Werner Vogels, ACM SIGOPS 2007.

On designing and deploying Internet scale services, James Hamilton, LISA 2007 

Life beyond Distributed Transactions: an Apostate's Opinion, Pat Helland, CIDR 2007.

Conflict-free Replicated Data Types. Marc Shapiro, Nuno Preguiça, Carlos Baquero, Marek Zawirski, 2011.

Consistency Analysis in Bloom: a CALM and Collected Approach, Peter Alvaro, Neil Conway, Joseph M. Hellerstein, William R. Marczak, CIDR 2011. 

Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica. NSDI 2012. 

Tail at scale. Jeff Dean, Luiz Andre Barroso, Commn of the ACM, 2013. 

Spanner: Google’s Globally Distributed Database, ACM, 2013.

TensorFlow: A System for Large-Scale Machine Learning, OSDI 2016.

Expository papers

Above the Clouds: A Berkeley View of Cloud Computing. Michael Armbrust, Armando Fox, Rean Griffith, Anthony D. Joseph, Randy H. Katz, Andrew Konwinski, Gunho Lee, David A. Patterson, Ariel Rabkin, Ion Stoica, Matei Zaharia, 2009.

Cloud Programming Simplified: A Berkeley View on Serverless Computing, 2019. 

Comments

Popular posts from this blog

Hints for Distributed Systems Design

Learning about distributed systems: where to start?

Making database systems usable

Looming Liability Machines (LLMs)

Advice to the young

Linearizability: A Correctness Condition for Concurrent Objects

Scalable OLTP in the Cloud: What’s the BIG DEAL?

Understanding the Performance Implications of Storage-Disaggregated Databases

Designing Data Intensive Applications (DDIA) Book