Reading list for our distributed systems class

- September 03, 2019

Here is our reading list. It follows our distributed systems course schedule provided in the previous post. I tried to choose papers that are foundational, comprehensive, and readable by a first-year graduate student. In some cases, I omitted very long or hard-to-follow papers, even though they may be foundational, and instead included follow-up papers that summarize the concept better.

I assign these papers as review material for the corresponding topic. Each student then chooses one paper from the batch and reviews it critically, though I hope the students can read all of them if possible.

These papers should all be available through a Google Scholar search. Some of them appear in open access venues. When that is not the case, the authors often make their papers freely available on their websites, as they have the right to do. "Authors retain the right to use the accepted author manuscript for personal use, internal institutional use and for permitted scholarly posting provided that these are not for purposes of commercial use or systematic distribution".

Consensus and Paxos

Paxos Made Moderately Complex. Robbert van Renesse and Deniz Altinbuken, ACM Computing Surveys, 2015.
Paxos Made Live - An Engineering Perspective. Tushar D. Chandra, Robert Griesemer, Joshua Redstone. ACM PODC, pages 398-407, 2007.
ZooKeeper: Wait-free coordination for internet-scale systems. P. Hunt, M. Konar, F. P. Junqueira, and B. Reed, USENIX ATC 2010.
The Chubby Lock Service for Loosely-Coupled Distributed Systems. Mike Burrows, OSDI 2006.
In Search of an Understandable Consensus Algorithm. Diego Ongaro, John Ousterhout, USENIX ATC, 2014.
WPaxos: Wide Area Network Flexible Consensus. Ailidani Ailijiang, Aleksey Charapko, Murat Demirbas, Tevfik Kosar, IEEE TPDS, 2019.
Dissecting the Performance of Strongly-Consistent Replication Protocols. Ailidani Ailijiang, Aleksey Charapko, Murat Demirbas, Sigmod 2019.
Viewstamped replication revisited. Barbara Liskov and James Cowling. MIT-CSAIL-TR-2012-021, 2012.
Chain Replication for Supporting High Throughput and Availability. Robbert van Renesse and Fred B. Schneider, OSDI 2004.
FAWN: A Fast Array of Wimpy Nodes. David G. Andersen, Jason Franklin, Michael Kaminsky, Amar Phanishayee, Lawrence Tan, and Vijay Vasudevan. SOSP 2009.
CORFU: A shared log design for Flash clusters. Mahesh Balakrishnan, Dahlia Malkhi, Vijayan Prabhakaran, Ted Wobber, Michael Wei, John D. Davis, NSDI'2012.

Failure detectors and fault-tolerance

Unreliable Failure Detectors for Reliable Distributed Systems, Tushar Deepak Chandra and Sam Toueg, Journal of the ACM, 1996.
Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems. Ding Yuan, Yu Luo, Xin Zhuang, Guilherme Renna Rodrigues, Xu Zhao, Yongle Zhang, Pranay U. Jain, and Michael Stumm, OSDI 2014.
Why Does the Cloud Stop Computing? Lessons from Hundreds of Service Outages, Haryadi S. Gunawi, Mingzhe Hao, Riza O. Suminto, Agung Laksono, Anang D. Satria, Jeffry Adityatama, and Kurnia J. Eliazar, SoCC 2016.
Does The Cloud Need Stabilizing? Murat Demirbas, Aleksey Charapko, Ailidani Ailijiang, 2018.
TaxDC: A Taxonomy of nondeterministic concurrency bugs in datacenter distributed systems, Tanakorn Leesatapornwongsa, Jeffrey F. Lukman, Shan Lu, Haryadi S. Gunawi, ASPLOS 2016.

Time and snapshots

Time, Clocks, and the Ordering of Events in a Distributed System. Leslie Lamport, Communications of the ACM, 1978.
Logical Physical Clocks and Consistent Snapshots in Globally Distributed Databases. Sandeep Kulkarni, Murat Demirbas, Deepak Madeppa, Bharadwaj Avva, and Marcelo Leone, 2014. https://cse.buffalo.edu/tech-reports/2014-04.pdf
Distributed Snapshots: Determining Global States of a Distributed System. K. Mani Chandy and Leslie Lamport, ACM Transactions on Computer Systems, 1985.

Cloud computing

The Tail at Scale. Jeff Dean, Luiz Andre Barroso, Communications of the ACM, 2013.
Lessons from Giant-Scale Services. Eric A. Brewer, IEEE Internet Computing, 2001.
Above the Clouds: A Berkeley View of Cloud Computing. Michael Armbrust, Armando Fox, Rean Griffith, Anthony D. Joseph, Randy H. Katz, Andrew Konwinski, Gunho Lee, David A. Patterson, Ariel Rabkin, Ion Stoica and Matei Zaharia. EECS Department University of California, Berkeley Technical Report No. UCB/EECS-2009-28 February 10, 2009.
Serverless computing: One step forward, two steps back, UC Berkeley, CIDR 2019.
Cloud Programming Simplified: A Berkeley View on Serverless Computing, 2019. https://arxiv.org/abs/1902.03383
On designing and deploying Internet scale services, James Hamilton, LISA 2007.

NoSQL and distributed databases

Life beyond Distributed Transactions: an Apostate’s Opinion. Pat Helland, CIDR 2007.
Optimistic Replication. Yasushi Saito and Marc Shapiro, ACM Computing Surveys, 2005.
CAP Twelve Years Later: How the "Rules" Have Changed. Eric Brewer, IEEE Computer, 2012.
PNUTS: Yahoo!'s Hosted Data Serving Platform. Brian F. Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam Silberstein, Philip Bohannon, Hans-Arno Jacobsen, Nick Puz, Daniel Weaver and Ramana Yerneni, VLDB 2008.
Dynamo: Amazon’s Highly Available Key-Value Store. Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall and Werner Vogels, ACM SIGOPS 2007.
Bigtable: A Distributed Storage System for Structured Data. Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber, ACM Transactions on Computer Systems, 2008.
Spanner: Google’s Globally-Distributed Database. James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak, Christopher Taylor, Ruth Wang, Dale Woodford, ACM Transactions on Computer Systems, 2013.

Big data processing

MapReduce: Simplified Data Processing on Large Clusters. Jeffrey Dean and Sanjay Ghemawat, Communications of the ACM, 2008.
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica. NSDI 2012. April 2012.
TUX2: Distributed Graph Computation for Machine Learning, Wencong Xiao, Jilong Xue, Youshan Miao, Zhen Li, Cheng Chen, Ming Wu, Wei Li, and Lidong Zhou, NSDI 2017.
Proteus: agile ML elasticity through tiered reliability in dynamic resource markets. Aaron Harlap, Alexey Tumanov, Andrew Chung, Gregory R. Ganger, Phillip B. Gibbons. EuroSys, 2017.
TensorFlow: A system for large-scale machine learning, Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. OSDI 2016.

Decentralized ledgers

Practical Byzantine Fault Tolerance, Miguel Castro and Barbara Liskov, OSDI'99.
Bitcoin: A Peer-to-Peer Electronic Cash System, Satoshi Nakamoto, 2008.
Scalable and Probabilistic Leaderless BFT Consensus through Metastability, Team Rocket, Maofan Yin, Kevin Sekniqi, Robbert van Renesse, and Emin Gun Sirer, 2019.
Untangling Blockchain: A Data Processing View of Blockchain Systems. Tien Tuan Anh Dinh, Rui Liu, Meihui Zhang, Gang Chen, Beng Chin Ooi, IEEE Transactions on Knowledge and Data Engineering, 2017.
Bridging Paxos and Blockchain Consensus. A. Charapko, A. Ailijiang, M. Demirbas, IEEE Blockchain, 2018. http://www.cse.buffalo.edu/~demirbas/publications/bridging.pdf
Blockchains from a distributed computing perspective, Maurice Herlihy, Communications of the ACM, 2017.
