ATC/OSDI’25 Technical Sessions
ATC and OSDI ran in parallel. As is tradition, OSDI was single-track; ATC had two parallel tracks. The schedules and papers are online as linked above.
USENIX is awesome: it has made its conference proceedings open access since 2008. So you can access all the paper PDFs through the links above right now. I believe the presentation videos will be made available soon as well. Kudos to USENIX!
I attended the OSDI opening remarks delivered by the PC chairs, Lidong Zhou (Microsoft) and Yuan Yuan Zhou (UCSD). OSDI saw 339 submissions this year, which is up 20% from last year. Of those, 53 were accepted, for an acceptance rate of 16%. The TPC worked through Christmas to keep the publication machine running. We really are a bunch of workaholics. Who needs family time when you have rebuttals to respond to?
OSDI gave two best paper awards:
- Basilisk: Using Provenance Invariants to Automate Proofs of Undecidable Protocols. Tony Nuda Zhang and Keshav Singh, University of Michigan; Tej Chajed, University of Wisconsin-Madison; Manos Kapritsos, University of Michigan; Bryan Parno, Carnegie Mellon University
- Building Bridges: Safe Interactions with Foreign (programming) Languages through OmniGlot. Leon Schuermann and Jack Toubes, Princeton University; Tyler Potyondy and Pat Pannuto, University of California San Diego; Mae Milano and Amit Levy, Princeton University
The Distinguished Artifact award went to: PoWER Never Corrupts: Tool-Agnostic Verification of Crash Consistency and Corruption Detection. Hayley LeBlanc, University of Texas at Austin; Jacob R. Lorch and Chris Hawblitzel, Microsoft Research; Cheng Huang and Yiheng Tao, Microsoft; Nickolai Zeldovich, MIT CSAIL and Microsoft Research; Vijay Chidambaram, University of Texas at Austin
OSDI '26 will be in Seattle, organized by Eddie Kohler (Harvard) and Amar Phanishayee (Meta).
Emery Berger’s Keynote: Accelerating software development, the LLM revolution
Deniz Altinbuken (ATC co-chair) introduced the joint OSDI/ATC keynote speaker: Professor Emery Berger of UMass Amherst, who is also an Amazon Scholar. Emery's Plasma lab is the birthplace of many interesting projects: Scalene, Hoard, DieHard. Emery is sigma. I mean, he has an ACM SIGPLAN Distinguished Service Award, and he is an ACM Fellow. He is also the creator of CSrankings, for which he still gets occasional hate mail.
Emery gives great talks. I think this was the best presentation across all the OSDI and ATC sessions. What is Emery's secret? Like other good speakers: rehearsing a crazy number of times and preparing really well.
His talk was about the coming Cambrian explosion of LLM-augmented developer tools. Here’s the gist: "Traditional software development is the dinosaur. LLMs are the asteroid. The Cambrian explosion is coming for tooling, not just apps. So let’s evolve in order not to go extinct."
He first introduced the framework he used repeatedly for the new generation of LLM-enhanced tools coming out of his lab.
- Evolve.
- Exploit a niche.
- Ensure fitness.
He argued throughout the talk that AI-assisted tools can deliver the best of both worlds: precision from PL, and flexibility from LLMs.
Scalene: Evolving the Profiler. Emery's group LLM-augmented Scalene, their Python profiler, so that it not only pinpoints inefficiencies in your code but also explains why the code is slow and suggests how to fix it. Common suggestions include: replace interpreted loops with native libraries, vectorize, use the GPU. LLMs help generate optimization suggestions with surprising performance boosts (sometimes 90x). That was evolve (adopt LLMs) and exploit a niche (the profiler knows where the code is inefficient and why). Ensure fitness comes from running the optimized code against the original.
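To make the pattern concrete, here is a minimal example of the kind of rewrite these suggestions tend to propose (my own illustration, not actual Scalene output): an interpreted loop replaced by a vectorized NumPy call, with the fitness check done by comparing results.

```python
import math
import numpy as np

def slow_variance(values):
    # Interpreted loop: every iteration pays Python bytecode overhead.
    mean = sum(values) / len(values)
    total = 0.0
    for v in values:
        total += (v - mean) ** 2
    return total / len(values)

def fast_variance(values):
    # The kind of rewrite an LLM suggestion proposes: one vectorized native call.
    return float(np.var(np.asarray(values, dtype=np.float64)))

# "Ensure fitness": run the optimized code against the original.
data = list(range(100_000))
assert math.isclose(slow_variance(data), fast_variance(data), rel_tol=1e-9)
```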
chatDBG: Evolving a Debugger. Most developers don’t use debuggers. Print statements are more popular. Why? Because debuggers are clunky and stagnant. chatDBG changes this. It turns the debugger into a conversational assistant. You can type "why" at the prompt, and it gives you the root cause and a proposed fix. You can query code slices, symbol declarations, even ask for documentation. The LLM has access to the dynamic program state and source context provided by the debugger, plus the external knowledge baked in by its vast real-world training. Success rates improved with more context: 25% (default stack), 50% (with "why"), 75–80% (with targeted queries). Safety is enforced using containers and command whitelists. Again: evolve the interface, exploit the niche (source + state + knowledge), ensure fitness (bounded functionality and testable fixes).
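This is not chatDBG's actual implementation, but a minimal sketch of the niche it exploits: the debugger already holds the crashed program's state and source context, so the tool mostly needs to package that into an LLM prompt. The `ask_llm` call is a hypothetical placeholder.

```python
import traceback

def build_why_prompt(exc: BaseException) -> str:
    # Package the dynamic state the debugger already has: the full
    # traceback, exception type, and the offending source lines.
    context = "".join(
        traceback.format_exception(type(exc), exc, exc.__traceback__)
    )
    return (
        "The following Python program crashed. Explain the root cause "
        "and propose a fix.\n\n" + context
    )

def main():
    inventory = {"apples": 3}
    print(inventory["pears"])   # KeyError: the bug under investigation

try:
    main()
except KeyError as exc:
    prompt = build_why_prompt(exc)
    print(prompt)               # chatDBG would send this to the LLM
    # answer = ask_llm(prompt)  # hypothetical LLM call
```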
cwhy: Evolving a Compiler. C++ compiler errors are notoriously incomprehensible. cwhy wraps around Clang++ and gives human-readable explanations with concrete fix suggestions. It uses git diffs to localize the cause and context of the error. The assumption: if it compiled yesterday and not today, something broke in the diff. cwhy figures out what and suggests ways to fix it. It even handles things like regex errors. This tool impressed a library author so much that they adopted it after cwhy fixed a bug in their freshly published code.
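Here is a rough sketch of the wrapper idea as I understand it, with `ask_llm` again a hypothetical placeholder: capture the compiler's stderr, pull the working-tree `git diff` to localize what changed, and hand both to the LLM.

```python
import subprocess
import sys

def compile_and_explain(args):
    # Run the real compiler and capture its (often cryptic) diagnostics.
    result = subprocess.run(["clang++", *args], capture_output=True, text=True)
    if result.returncode == 0:
        return 0
    # "If it compiled yesterday and not today, the diff broke it":
    # use the recent changes to localize the likely cause.
    diff = subprocess.run(["git", "diff"], capture_output=True, text=True).stdout
    prompt = (
        "Explain this C++ compiler error in plain language and suggest a fix.\n"
        f"--- compiler output ---\n{result.stderr}\n"
        f"--- recent changes ---\n{diff}\n"
    )
    print(prompt)              # cwhy would send this to the LLM instead
    # print(ask_llm(prompt))   # hypothetical LLM call
    return result.returncode

if __name__ == "__main__":
    sys.exit(compile_and_explain(sys.argv[1:]))
```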
CoverUp: Evolving a Coverage Tool. Writing good tests is hard and thankless. Coverage reports only cause guilt trips. LLMs can write tests, but cannot methodically increase coverage. CoverUp is a next-gen testing assistant. It builds on SlipCover (Plasma's coverage analysis tool), then uses LLMs to generate new test cases that methodically increase branch coverage.
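A sketch of the loop I imagine CoverUp running (assumed flow, not the paper's exact algorithm), using the real coverage.py API plus a hypothetical `ask_llm_for_test` placeholder: measure branch coverage, ask for a new test, and keep it only if coverage verifiably goes up.

```python
import coverage

def measure(test_fns):
    # Run the suite under branch coverage; report() returns the percentage.
    cov = coverage.Coverage(branch=True)
    cov.start()
    for test in test_fns:
        test()
    cov.stop()
    return cov.report()

def ask_llm_for_test(existing_tests):
    # Placeholder for the LLM call: CoverUp would prompt with the source
    # under test plus its still-uncovered lines and branches.
    return lambda: None

def improve_suite(test_fns, rounds=3):
    best = measure(test_fns)
    for _ in range(rounds):
        candidate = ask_llm_for_test(test_fns)
        trial = measure(test_fns + [candidate])
        if trial > best:                 # ensure fitness: keep only tests
            test_fns.append(candidate)   # that verifiably raise coverage
            best = trial
    return test_fns
```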
flowco: Rethinking Notebooks. Jupyter notebooks are the lingua franca of data science. But they're also a mess: brittle, unstructured, and difficult to maintain. Flowco reimagines notebooks as dataflow graphs. Each step (load, clean, wrangle, visualize) is a node in the graph/pipeline. LLMs guide code generation at each step. Combined with pre/postcondition checks, you get a smart notebook that requires no hand-written code but ensures correct workflows. The metaphor shift from script to graph is what enables notebooks to evolve.
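A minimal sketch of the graph metaphor with hypothetical names (this is my illustration, not flowco's API): each node carries a generated step plus pre/postcondition checks, and the runner validates the contracts at every edge.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Node:
    name: str
    run: Callable[[Any], Any]                  # LLM-generated step
    pre: Callable[[Any], bool] = lambda x: True
    post: Callable[[Any], bool] = lambda x: True

def execute(pipeline, data=None):
    # Walk the dataflow graph (a simple chain here), checking the
    # contracts before and after every node.
    for node in pipeline:
        assert node.pre(data), f"{node.name}: precondition failed"
        data = node.run(data)
        assert node.post(data), f"{node.name}: postcondition failed"
    return data

pipeline = [
    Node("load", run=lambda _: [3, 1, None, 2]),
    Node("clean",
         run=lambda xs: [x for x in xs if x is not None],
         post=lambda xs: None not in xs),
    Node("summarize",
         run=lambda xs: sum(xs) / len(xs),
         pre=lambda xs: len(xs) > 0),
]
print(execute(pipeline))   # 2.0
```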
The talk was a tour-de-force across many tools, showing how LLMs can help them evolve and improve significantly. Emery is doing a big service building these tools to help all developers. Based on my interactions with talented coders, I have come to the conclusion that LLMs boost performance far more for experts than for beginners. Still, I think these tools will help all kinds of developers, not just experts.
Using Provenance Invariants to Automate Proofs
This paper was one of the standout papers from OSDI'25, and winner of a Best Paper Award. The lead author, Tony Zhang, now an engineer at Databricks, was a PhD student at the University of Michigan and a MongoDB PhD Fellow. He delivered a sharp, polished talk. Here is my reconstruction from my notes.
Distributed protocols are notoriously hard to get right. Testing can show the presence of bugs but never their absence. Formal verification can give stronger correctness guarantees, but only if you can manage to prove your protocol satisfies a desired safety property. This involves crafting an inductive invariant: one that holds initially, implies the safety property, and is closed under protocol transitions.
Coming up with a suitable inductive invariant is hard. It's an iterative process: guess, check, fail, refine, repeat. You often start with a property you care about (say, agreement), only to find it isn’t inductive. Then you strengthen it with auxiliary lemmas and conditions. If you are skilled and somewhat lucky, you eventually get there. I wrote a bit about this loop back in 2019, after SOSP’19, discussing an earlier effort on this problem.
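For concreteness, here is a toy brute-force check (my own illustration, not Basilisk) of what "inductive" demands: the candidate invariant must hold initially, imply safety, and be closed under every transition.

```python
from itertools import product

# Toy protocol: two nodes may hold a lock; a node acquires it only
# when nobody holds it, and may release its own lock.
States = list(product([False, True], repeat=2))

def init(s):
    return s == (False, False)

def safety(s):
    return not (s[0] and s[1])          # mutual exclusion

def transitions(s):
    a, b = s
    nexts = []
    if not a and not b:
        nexts += [(True, b), (a, True)]  # acquire
    if a: nexts.append((False, b))       # release
    if b: nexts.append((a, False))
    return nexts

def is_inductive(inv):
    holds_initially = all(inv(s) for s in States if init(s))
    implies_safety  = all(safety(s) for s in States if inv(s))
    closed = all(inv(t) for s in States if inv(s) for t in transitions(s))
    return holds_initially and implies_safety and closed

# "At most one node holds the lock" is inductive for this toy protocol:
print(is_inductive(lambda s: not (s[0] and s[1])))   # True
```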
Basilisk’s goal is to short-circuit this painful invariant discovery loop using provenance invariants. Provenance invariants relate a local variable at a node to its provenance: the causal step or message that caused the variable to have its current value. For example, instead of guessing an invariant like "If replica A has voted yes, then replica B must also have voted yes," Basilisk works backwards to discover why A voted yes and then connects that cause to B’s state. Sort of history variables on steroids. By tracing data dependencies across steps and messages, Basilisk derives inter-host invariants that explain not just what the state is, but how it came to be.
Basilisk builds on the authors' prior work Kondo, which had two types of invariants:
- Protocol invariants (expressive but hard to generate automatically)
- Regular invariants (mechanically derivable but low-level)
Basilisk generalizes this by introducing host provenance (HP) and network provenance (NP). HP traces a variable's value to a local decision; NP traces it to a message received. Together, these form a causal chain, or witness, which Basilisk uses to justify why a state is safe.
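Here is my own toy rendering of the HP/NP idea, not Basilisk's actual encoding: every assignment records its provenance (a local step, or the message that delivered the value), so a claim about a remote variable can be justified by walking the witness chain back to its cause.

```python
from dataclasses import dataclass

@dataclass
class Provenance:
    kind: str        # "host" (HP: local decision) or "network" (NP: message)
    source: str      # the step or sender that produced the value
    cause: "Provenance | None" = None   # previous link in the witness chain

@dataclass
class Var:
    value: object
    prov: Provenance

# Node A decides to vote yes (host provenance), sends it to B,
# and B's copy carries network provenance pointing back to A's step.
a_vote = Var(True, Provenance("host", "A.decide_vote"))
b_view = Var(a_vote.value,
             Provenance("network", "msg from A", cause=a_vote.prov))

def witness(var):
    # Walk the causal chain explaining how the value came to be.
    chain, p = [], var.prov
    while p is not None:
        chain.append(f"{p.kind}:{p.source}")
        p = p.cause
    return chain

print(witness(b_view))   # ['network:msg from A', 'host:A.decide_vote']
```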
The provenance invariants replace the original inductive invariant. Then Basilisk proves that these provenance invariants imply the desired property. All of this is implemented in a toolchain that extends the Dafny language and verifier. Protocols are modeled as async state machines in Dafny, Basilisk analyzes them, and outputs inductive invariants along with machine-checkable proofs. Basilisk was evaluated on 16 distributed protocols, including heavyweights like Paxos, MultiPaxos, and various consensus and replication variants.
Tigon: A Distributed Database for a CXL Pod
Traditional distributed databases synchronize over the network. This means network overhead, message exchange complexity, and coordination cost. Tigon replaces that network with a CXL memory pod. Instead of using sockets and RPCs, nodes coordinate using inter-host atomic instructions and hardware cache coherence. The downside is that this is a single-rack-scale database design, which also brings availability limitations.
There are still challenges remaining with this architecture. CXL memory has higher latency (250–400ns) and lower bandwidth than local DRAM. It also provides limited hardware cache coherence capacity. Tigon addresses this with a hybrid architecture: partly partitioned, partly shared. All control-plane synchronization is done via CXL-coherent memory, and messages are exchanged only for data movement.
To minimize coherence pressure, they compact metadata into 8-byte words and piggyback coordination info. They also show how to eliminate two-phase commit by taking advantage of reconstructability during crash recovery. The system is single-rack and does not replicate. It's built on a fail-stop assumption, and scaling (adding/removing nodes) requires a restart. Evaluated on TPC-C and a YCSB variant using simulation experiments (since the hardware is not yet available), Tigon outperforms existing CXL-based shared-nothing DBs by up to 2.5x, and RDMA-based systems by up to 18.5x.
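A sketch of the metadata compaction idea under my assumptions (the field widths are made up): pack lock state, owner, and a version counter into one 8-byte word so a single inter-host atomic compare-and-swap can update all three together.

```python
MASK48 = (1 << 48) - 1

def pack(locked, owner, version):
    # locked: 1 bit, owner: 15 bits, version: 48 bits -> one 8-byte word.
    return (locked << 63) | (owner << 48) | (version & MASK48)

def unpack(word):
    return (word >> 63) & 1, (word >> 48) & 0x7FFF, word & MASK48

class Word:
    """Stand-in for an 8-byte word in CXL shared memory."""
    def __init__(self, value=0):
        self.value = value
    def cas(self, expected, new):
        # On real hardware this would be one inter-host atomic instruction.
        if self.value == expected:
            self.value = new
            return True
        return False

meta = Word(pack(locked=0, owner=0, version=7))

def try_lock(word, me):
    old = word.value
    locked, owner, version = unpack(old)
    if locked:
        return False
    return word.cas(old, pack(1, me, version + 1))

print(try_lock(meta, me=3), unpack(meta.value))  # True (1, 3, 8)
```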
Mako: Speculative Distributed Transactions with Geo-Replication
Geo-replicated transactions are hard. The cost of consensus across data centers kills latency and throughput. Mako sidesteps this by decoupling execution from replication. Transactions execute speculatively using two-phase commit (2PC, Tupac) locally, without waiting for cross-region consensus. Replication happens in the background to achieve fault-tolerance.
The core idea is to allow transactions to commit optimistically and only roll back if replication later fails. This opens the door to cascading aborts, and Mako tries to alleviate the unbounded rollback problem by tracking transactional dependencies using vector clocks.
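A small sketch of how vector clocks can bound the cascade (my illustration, not necessarily Mako's exact scheme): each transaction's clock records which speculative transactions it has read from, so an abort only cascades to true dependents.

```python
# Each transaction carries a vector clock recording which speculative
# transactions it has (transitively) read from.
def merge(vc_a, vc_b):
    keys = vc_a.keys() | vc_b.keys()
    return {k: max(vc_a.get(k, 0), vc_b.get(k, 0)) for k in keys}

class Txn:
    def __init__(self, tid):
        self.tid, self.clock = tid, {tid: 1}
    def read_from(self, other):
        # Reading another txn's speculative write inherits its dependencies.
        self.clock = merge(self.clock, other.clock)

t1, t2, t3 = Txn("t1"), Txn("t2"), Txn("t3")
t2.read_from(t1)      # t2 depends on t1; t3 is independent

def must_abort(txn, failed_tid):
    # An abort cascades only to txns whose clock records a dependency.
    return txn.clock.get(failed_tid, 0) > 0

# If t1's replication fails, t2 rolls back but t3 survives:
print(must_abort(t2, "t1"), must_abort(t3, "t1"))   # True False
```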
Ok, but there is another problem: how do you get consensus among nodes on rolling back when the speculative/optimistic execution fails? This was not explained during the talk, so I brought it up during Q&A. The answer seems to be using epochs and sealing, and doing this in a delayed manner. That could open up more problems. I haven't read the paper yet to understand how this works.
Skybridge: Bounded Staleness in Distributed Caches
From Meta (I guess we are not calling it Facebook anymore), this paper addresses the pain of eventual consistency in global cache layers. Eventual is fine for most cases, until it's not. Products like Facebook and Instagram rely on caches being mostly up-to-date. Inconsistencies may cause occasional weird bugs, user confusion, and degraded experience.
Skybridge is an out-of-band replication stream layered on top of the main async replication pipeline. It adds redundancy without relying on the same code paths, avoiding correlated failures. Skybridge focuses only on timely delivery of updates, not durability.
Skybridge itself is also eventually consistent, but being lightweight, it gives you bounded staleness almost always. By leveraging Bloom-filter-based synchronization, Skybridge provides a 2-second bounded staleness window for 99.99998% of writes (vs. 99.993% with the baseline). All that at just 0.54% of the cache deployment size, thanks to the reduced scope of this superposed/layered system.
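A minimal sketch of Bloom-filter-based synchronization under my assumptions about the mechanism (not necessarily Skybridge's actual protocol): the replica advertises a Bloom filter of the writes it has applied, and the stream resends whatever the filter says is probably missing. Bloom false positives occasionally suppress a resend, which is consistent with a staleness bound that holds "almost always" rather than always.

```python
import hashlib

class BloomFilter:
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, 0
    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size
    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos
    def probably_contains(self, key):
        return all(self.bits >> pos & 1 for pos in self._positions(key))

# Replica side: advertise a compact summary of the writes it has applied.
applied_filter = BloomFilter()
for key in ["user:1", "post:42"]:
    applied_filter.add(key)

# Stream side: resend any recent write the replica probably lacks.
recent_writes = ["user:1", "post:42", "photo:7"]
to_resend = [k for k in recent_writes
             if not applied_filter.probably_contains(k)]
print(to_resend)   # ['photo:7']; a false positive would skip a resend,
                   # leaving that write to the slower baseline pipeline
```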
SpecLog: Low End-to-End Latency atop a Speculative Shared Log
Shared logs are foundational in modern distributed systems (e.g., Corfu, Scalog, Boki). They are deployed at Meta (formerly Facebook), as I discussed here earlier. But their latency is often too high for real-time workloads because coordination delays application progress. SpecLog proposes speculative delivery: let applications start processing records before the final global order is fixed.
To make this safe, they introduce Fix-Ante Ordering: a mechanism that deterministically assigns quotas to each shard ahead of time. If each shard sticks to its quota, the global cut is predictable. If not, speculation may fail and need rollback. Their implementation, Belfast, shows 3.5x faster record delivery and a 1.6x improvement in end-to-end latency over existing shared logs.
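A toy sketch of Fix-Ante Ordering as I understand it (names and numbers hypothetical): with per-shard quotas fixed ahead of time, any node can compute a record's global position before the cut is actually assembled, and speculation fails only when a shard misses its quota.

```python
QUOTA = 4            # records per shard per epoch, fixed ahead of time
SHARDS = ["s0", "s1", "s2"]

def global_position(epoch, shard, local_seq):
    # Because quotas are deterministic, the global order of a record is
    # computable before the epoch's cut arrives.
    assert local_seq < QUOTA
    epoch_base = epoch * QUOTA * len(SHARDS)
    return epoch_base + SHARDS.index(shard) * QUOTA + local_seq

# An application can speculatively deliver (epoch 2, shard s1, 3rd record)
# at its predicted slot; if s1 later fails to fill its quota this epoch,
# the speculation is rolled back.
print(global_position(epoch=2, shard="s1", local_seq=2))   # 30
```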
This is conceptually similar to moving Scalog’s ordering step upfront and letting applications run optimistically. As I noted back in my Scalog post, removing coordination from the critical path is the holy grail. SpecLog pushes this further by betting on speculative execution + predictable quotas. Again, I haven't read the papers to analyze disadvantages.