SOSP21 conference (Day 1)

SOSP, along with OSDI, is the premier conference in computer systems. SOSP has been held biennially; I had attended SOSP19 in person and shared notes and paper summaries from that.

This year SOSP was virtual, which made it a lot easier to travel to. It is nice to attend the conference from the convenience of your home. Once you take that convenience into account, the experience is not too inferior to attending a physical conference.

Here are some notes from SOSP21. If you find an interesting paper here, you can dig in further.


The conference opened with announcements from the program committee (PC) chairs. SOSP21 had 348 submissions from 2078 authors, averaging 6 authors per paper. 54 of the 348 papers were accepted. Reviews were conducted by 64 PC members, who produced 1500+ reviews.

The trend is clear: the number of submissions is growing steeply. Last year OSDI announced it is going annual. On the second day of SOSP, it was announced that SOSP is also going annual, starting in 2023.

To provide consistency across PCs, the reviews of papers rejected from OSDI were submitted to the conference along with those papers. The SOSP reviewers first reviewed the papers themselves, and only then saw the OSDI reviews. This provided more consistency/continuity across submissions. The feedback was to make this optional rather than mandatory in the future.

Another experiment was to allow appendices in submissions. The appendices are included in the camera-ready, marked as not peer-reviewed. This will help those who want to dive deeper into the papers.

Best paper awards went to:

SOSP put serious effort into artifact evaluation, which involved 91 reviewers producing 171 reviews, 4.3 reviews per artifact. This resulted in 38 "artifacts evaluated" certificates and 27 "results reproduced" certificates.

The distinguished artifact award went to Rudra: Finding Memory Safety Bugs in Rust at the Ecosystem Scale. You can find the results of the artifact evaluation, including links to all of the artifacts and summaries of the work done by the AEC, at:


Zoom was used for the conference. Audio/video was off by default for the audience. I didn't know this until the announcement was made; I was confused as I searched in vain for the audio controls to mute myself. But I agree that this is a good setup for avoiding accidental Zoom-bombing. When someone needs to ask a question, they are promoted to get access to video and audio.

The poster session was done through Gather.town, and it was awesome. Gather.town simulates a physical space where you run into people and look at posters. This felt like going to the conference and running into people in the hallways. I enjoyed it.

BFT session

Basil: Breaking up BFT with ACID transactions

BIDL: A High-throughput, Low-latency Permissioned Blockchain Framework for Datacenter Networks

Kauri: Scalable BFT Consensus with Pipelined Tree-Based Dissemination and Aggregation

Bug finding session

iGUARD: In-GPU Advanced Race Detection

The title may be misleading: this is not a GPU application for computer-vision-based race detection from photos/videos. GPU applications are growing, and every GPU generation grows in size and complexity. GPUs introduce advanced synchronization primitives: you can synchronize within only a subset of threads, at warp (32 threads), threadblock (~1000 threads), or grid (even more threads) scope. With independent thread scheduling (ITS), races arise in CUDA code due to insufficient synchronization scopes. iGUARD caught 12 such races in popular GPU software. Anyways, it was great to see race issues being discussed at SOSP.

Snowboard: Finding Kernel Concurrency Bugs through Systematic Inter-thread Communication Analysis

Finding concurrent inputs is challenging. Snowboard is a tool for finding kernel concurrency bugs. It analyzes potential memory communications (PMCs): it identifies PMCs, prioritizes them, and tests them to find bugs.
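To make the PMC idea concrete, here is a toy sketch (my own simplification, not Snowboard's actual implementation): treat a write in one thread and a read of the same address in another thread as a potential memory communication pair, since scheduling the write before the read makes the threads communicate through memory.

```rust
#[derive(Clone, Copy, PartialEq)]
enum Op {
    Read,
    Write,
}

#[derive(Clone, Copy)]
struct Access {
    thread: usize, // which kernel thread performed the access
    op: Op,
    addr: u64, // memory address touched
}

// A PMC is a (writer, reader) pair on the same address from different
// threads; the returned values are indices into the trace.
fn find_pmcs(trace: &[Access]) -> Vec<(usize, usize)> {
    let mut pmcs = Vec::new();
    for (i, w) in trace.iter().enumerate() {
        if w.op != Op::Write {
            continue;
        }
        for (j, r) in trace.iter().enumerate() {
            if r.op == Op::Read && r.addr == w.addr && r.thread != w.thread {
                pmcs.push((i, j));
            }
        }
    }
    pmcs
}
```

The real tool's contribution is in what comes after this enumeration: clustering and prioritizing the (huge number of) candidate PMCs before actually testing them.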

Rudra: Finding Memory Safety Bugs in Rust at the Ecosystem Scale (Distinguished Artifact Award)

This work identified 3 bug patterns in "unsafe" Rust: panic safety bugs, higher-order invariant bugs, and send/sync variance bugs.
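To illustrate the first pattern: a panic safety bug arises when code temporarily breaks an invariant and then calls user code that may panic before the invariant is restored. Below is a toy safe-Rust analogue (my own illustration, not code from the paper) where the broken state is just a flag; in real unsafe code, e.g. calling `Vec::set_len` before the elements are written, the same shape leads to uninitialized reads or double drops.

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};

struct Counter {
    value: i64,
    consistent: bool, // invariant: must be true between public calls
}

impl Counter {
    fn new() -> Self {
        Counter { value: 0, consistent: true }
    }

    // BUGGY: if `f` panics, we unwind with `consistent == false`.
    fn update_buggy(&mut self, f: impl Fn(i64) -> i64) {
        self.consistent = false;    // invariant broken...
        self.value = f(self.value); // ...user code may panic here!
        self.consistent = true;     // never reached on panic
    }
}

// Returns true if the invariant is observably broken after a panic.
fn invariant_broken_after_panic() -> bool {
    std::panic::set_hook(Box::new(|_| {})); // silence the panic message
    let mut c = Counter::new();
    let _ = catch_unwind(AssertUnwindSafe(|| {
        c.update_buggy(|_| panic!("user closure panicked"));
    }));
    !c.consistent
}
```

The usual fix is a drop guard that restores the invariant during unwinding, which is one of the patterns Rudra's static analysis checks for.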


Witcher: Systematic Crash Consistency Testing for Non-Volatile Memory Key-Value Stores

NVM is here to stay. NVM programming uses two main operations: the flush operation writes back a cache line to memory, and the fence operation provides ordering guarantees between flushes. They did an NVM software survey and categorized bugs as persistence ordering bugs, persistence atomicity bugs, and persistence performance bugs. They use rules for inferring likely correctness conditions specific to NVM programming, and then use output equivalence checking to validate a crashed state (this part is not NVM specific). The tool takes a program and a test case as input, does an LLVM compiler pass, and runs the instrumented bytecode.
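To see why ordering matters, here is a toy crash-state enumerator under a simplified flush/fence model (my own sketch, not Witcher's machinery): a store is guaranteed durable only if its line was flushed and then fenced before the crash; every other store may or may not have reached NVM, so we branch on each.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy)]
enum Event {
    Store(&'static str, u64), // write a value to a named cache line
    Flush(&'static str),      // write that line back toward NVM
    Fence,                    // earlier flushes are complete past this point
}

// Enumerate possible NVM contents at a crash after the whole sequence ran.
// Simplification: we track only the latest value per line.
fn crash_states(events: &[Event]) -> Vec<HashMap<&'static str, u64>> {
    let mut value: HashMap<&'static str, u64> = HashMap::new();
    let mut flushed: HashMap<&'static str, bool> = HashMap::new();
    let mut durable: HashMap<&'static str, bool> = HashMap::new();
    for e in events {
        match *e {
            Event::Store(a, v) => {
                value.insert(a, v);
                flushed.insert(a, false);
                durable.insert(a, false);
            }
            Event::Flush(a) => {
                flushed.insert(a, true);
            }
            Event::Fence => {
                for (a, f) in flushed.iter() {
                    if *f {
                        durable.insert(a, true);
                    }
                }
            }
        }
    }
    // Lines stored but never flushed+fenced might or might not be durable.
    let maybe: Vec<&'static str> =
        value.keys().filter(|a| !durable[**a]).copied().collect();
    let mut states = Vec::new();
    for mask in 0..(1u32 << maybe.len()) {
        let mut s: HashMap<&'static str, u64> = HashMap::new();
        for (a, d) in durable.iter() {
            if *d {
                s.insert(*a, value[*a]);
            }
        }
        for (i, a) in maybe.iter().enumerate() {
            if mask & (1 << i) != 0 {
                s.insert(*a, value[*a]);
            }
        }
        states.push(s);
    }
    states
}
```

With `Store(data); Flush(data); Fence; Store(valid)` no crash state can have the valid flag without the data. Drop the fence and just flush both lines, and a crash state with `valid` set but no `data` becomes reachable, which is exactly a persistence ordering bug.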

Understanding and Detecting Software Upgrade Failures in Distributed Systems

There is a tradeoff between safe and fast upgrades. Developers use canary deployment to achieve some safety, but 20% of those add 1000+ minutes to deployment. They did a study focusing on upgrade failures: 2/3 of upgrade failures were only caught after software release. In 2/3 of upgrade failures, the cause was data format and data semantic incompatibilities. 20% of syntax incompatibilities are due to formats defined by serialization libraries or enums.

Crash Consistent Non-Volatile Memory Express


Cuckoo Trie: Exploiting Memory-Level Parallelism for Efficient DRAM Indexing

The Cuckoo Trie is an ordered index that exploits memory-level parallelism (MLP). Start with a trie, store its nodes in a hash table so that no pointers are needed, and then the nodes along a lookup path can be read in parallel.
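Here is a toy sketch of the pointer-free layout (my own simplification; the real Cuckoo Trie uses cuckoo hashing and many more tricks). Each node is stored in a hash table keyed by its prefix, so the buckets for all prefixes of a key are computable from the key alone and can be fetched or prefetched in parallel, instead of being chased pointer by pointer.

```rust
use std::collections::HashMap;

#[derive(Default)]
struct Node {
    is_terminal: bool, // a stored key ends at this node
}

struct PrefixTrie {
    // prefix -> node; no child pointers anywhere
    nodes: HashMap<Vec<u8>, Node>,
}

impl PrefixTrie {
    fn new() -> Self {
        let mut nodes = HashMap::new();
        nodes.insert(Vec::new(), Node::default()); // root = empty prefix
        PrefixTrie { nodes }
    }

    fn insert(&mut self, key: &[u8]) {
        // Materialize a node for every prefix of the key.
        for end in 1..=key.len() {
            self.nodes.entry(key[..end].to_vec()).or_default();
        }
        self.nodes.get_mut(key).unwrap().is_terminal = true;
    }

    fn contains(&self, key: &[u8]) -> bool {
        // All prefix lookups are independent of one another: a real
        // implementation would issue/prefetch every bucket read before
        // inspecting any node, extracting memory-level parallelism.
        (1..=key.len()).all(|end| self.nodes.contains_key(&key[..end]))
            && self.nodes.get(key).map_or(false, |n| n.is_terminal)
    }
}
```

Because the trie structure is implicit in the prefixes, ordered operations (range scans, successor queries) need extra machinery in the real design; this sketch only shows the lookup-side idea.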

Regular Sequential Serializability and Regular Sequential Consistency 

Strict serializability guarantees a total order of transactions: the order respects causality, and it also respects real time, so reads must return up-to-date values. They introduce regular sequential serializability (RSS): informally, any application invariant that holds with a strictly serializable service also holds with an RSS service.

Caracal: Contention Management with Deterministic Concurrency Control 

