ACM CAIS: Conference on AI and Agentic Systems

- June 02, 2026

Last week, I traveled to San Jose to attend the ACM CAIS conference. On Day 0, I gave a short talk at the Supporting our AI Overlords (SAO) workshop. And yes, I promise to write a summary of our paper, "A Case for Simulation-Driven Resilience in Agent-First Data Systems" soon!

To start with an overall impression of the conference: much of the work presented felt exploratory and anecdotal. Since the compound AI space is still so new, many papers seemed to share on-the-ground best practices that worked for their authors rather than principled results. Some talks really leaned into the "agent, act like a senior engineer and don't make mistakes" vibe. This was especially apparent in the "Agent Skills Workshop". I am not saying this is a bad thing; I learned some valuable lessons from that workshop, which I'll share below.

CAIS defines the conference's scope broadly as "research on compound AI architectures, optimization, and deployment". Unfortunately, this broadness seemed to work against the main track. Attendance at the primary conference talks was low, and it often felt like attendees were there solely to present their own work rather than engage with others, likely because the subject matter was spread too thin.

In contrast, the workshops were highly focused, which led to much better engagement and active listening from the audience. Moving forward, I think CAIS would benefit greatly from narrowing its scope, perhaps toward data systems and infrastructure in support of AI.

On that note, what happened to our collective attention span? CAIS limited paper presentations to 7 minutes with just 2 minutes for questions. This year, SIGMOD also shifted to 9-minute talks. Our own paper, "LeaseGuard: Raft Leases Done Right!", got a mere 9 minutes in the spotlight after Jesse traveled all the way to Bangalore to present it. I've even heard that USENIX Security is down to 3-minute talks now. Should we maybe consider slowing down? After all, isn't attention all we need?

Agent Skills workshop (Day 0)

In the first talk, Graham Neubig discussed OpenHands, their open-source AI developer agent platform that's getting a lot of traction in highly regulated fields like finance and healthcare to speed up software development. A big theme of his talk was skill induction: "the process of inducing/verifying programmatic or prompt-based capabilities through testing/evaluation to enable single-agent systems complete complex long-horizon tasks". By leveraging offline human-annotated examples or online user feedback, skill induction kicks off a rapid learning phase. In web navigation tests like WebArena, an agent's success rate can ramp up dramatically over a small number of trials before settling into a robust, repeatable skill set.

Later on, Kanav Garg (co-founder of Core Automation) walked us through the lifecycle of a Reinforcement Learning (RL) environment. He defined it as a continuous loop made up of an actionable prompt, a starting state, a runtime environment (like a Docker container), configuration, and a reward system. The main takeaway here was that successful RL needs careful difficulty calibration and precise reward shaping to keep agents from hacking the reward system. To get this right, engineers have to actually look at the agent's traces instead of blindly trusting that the numbers on a chart are going up. Kanav also said that data environments are living projects with a shelf life of at most two months, which means that continuous learning from task failures and automated data pipelines are far more effective than relying on static, expensive human data.
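To make the five components concrete, here is a minimal Python sketch of how I picture such an environment. This is my own illustration, not code from the talk: the class name `RLEnvironment`, the field names, the toy task, and the `outcome_reward` function are all hypothetical. The reward checks the end state and also inspects the trace, so an agent that edits the tests to make them pass gets nothing.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class RLEnvironment:
    """Hypothetical container for the five pieces named in the talk."""
    prompt: str                      # actionable task description
    initial_state: Dict[str, str]    # starting state of the world
    runtime_image: str               # e.g. a Docker image name (assumed)
    config: Dict[str, int]           # knobs such as a step budget
    reward_fn: Callable[[Dict[str, str], List[str]], float]


def outcome_reward(state: Dict[str, str], trace: List[str]) -> float:
    """Reward the end state, but reject traces that tamper with the tests."""
    tests_pass = state.get("tests") == "pass"
    tampered = any("edit tests/" in step for step in trace)
    return 1.0 if tests_pass and not tampered else 0.0


env = RLEnvironment(
    prompt="Fix the failing unit test in utils.py",
    initial_state={"tests": "fail"},
    runtime_image="python:3.12-slim",
    config={"max_steps": 50},
    reward_fn=outcome_reward,
)

# Two traces that both end with green tests; only one earned it.
honest = env.reward_fn({"tests": "pass"}, ["edit utils.py", "run pytest"])
hacked = env.reward_fn({"tests": "pass"}, ["edit tests/test_utils.py", "run pytest"])
print(honest, hacked)
```

A chart of average reward would look identical for both runs if the reward only checked the test status, which is why reading the traces matters.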

SAO: Supporting our AI overlords (Day 0)

The core theme of the SAO workshop was that agents are rapidly becoming both the primary users and the builders of data systems, and this shift is creating a vast new design space demanding entirely new abstractions. The workshop featured three keynote speakers and was incredibly well attended. With no seating left, people were standing in the doorway just to watch the talks.

Aaron Katz from Clickhouse gave a nice talk on this transition from human to agent users. He said that agents aren't just querying data anymore; they are actively provisioning services, so they need their own identities and budgets. Because agents drive such massive concurrency and require interactive latencies, traditional per-query pricing models are becoming way too punitive. To keep up, platforms have to adapt to headless API-first experiences. He said that Clickhouse is launching the ability to build agents directly inside the database for lower latency and for blending structured and full-text search.

Next, Andy Pavlo talked about databases being the "final boss" for agents. Check out his slides. It is classic Andy humor and style, though a Wu-Tang Clan reference was sorely missing this time. Andy focused heavily on automated database tuning and development, comparing their Proto-X tuning agent with ChatGPT's Lambda-tune. While Proto-X gets better optimization results, it takes 12 hours to train per database, whereas ChatGPT is fast (14 minutes) but performs terribly. To bridge this gap, they adopted LLMs to boost automatic tuning algorithms by leveraging prior history. In the second (and shorter) part of the talk, Andy also noted that while coding agents are making progress, they still completely fail at building complex database components like query optimizers, which require much more support. Although they have successfully "vibed" and manually verified a couple of optimization passes (like DPHyp and Unnesting v2), blindly accepting an LLM's output is fraught with problems because verifying query plan equivalence is notoriously difficult (despite solver efforts from UW, Berkeley, and Microsoft). It seems like coding agents love to add special-case code when an optimizer actually needs to be as general as possible.

Finally, Nikita Shamgunov (ex-Neon, now Databricks) discussed the infrastructure needed for agents, breaking it down into three pillars: state, compute, and middleware. He argued that because agents speed up the dev loop by 1000x, true serverless architectures are now absolutely critical to avoid the insane costs of overprovisioning. He also described Neon's architecture, highlighting how separating compute from storage allows for instant database branching for agents using microVMs. While he shared some genuinely interesting technical points, the presentation itself was pretty dry and felt like it was missing a clear focus.

The workshop ended with a panel. The organizers tried to spark a debate, but there wasn't much disagreement. The general consensus was that while traditional OLTP and OLAP boundaries will remain, agents will increasingly be the ones conducting tasks that span seamlessly across them. After the panel, we headed to a MongoDB-sponsored happy hour, where the workshop audience had more time to have relaxed conversations about the breakneck hellscape transition our industry is currently going through.

Wednesday (Day 1)

Since CAIS is an AI systems conference, the focus is more on building the systems that surround AI models rather than training individual AI models. The dominant theme this year seemed to be [multi-]agent architectures, coordination protocols, and workflow design. The emphasis is on composition: how do we organize these agents, manage their contexts, coordinate their interactions, and handle their lifecycles?

A second major trend is the growing focus on day-to-day operations. Entire sessions were dedicated to evaluation, trace analysis, failure detection, routing, and cost optimization. With so many papers covering efficiency, scheduling, and economic tradeoffs, it is clear that the industry is shifting from just maximizing capability to maximizing capability per dollar.

My shortlist from Day 1:

Here is a brief overview of the TraceFix paper, which is the most relevant to my interests.

TraceFix: Repairing Agent Coordination Protocols with TLA+ Counterexamples

TraceFix tackles the coordination failures that happen when multiple LLM agents try to work together concurrently on shared tasks. It is crucial to understand that this paper is not about generating classical distributed computing algorithms; rather, it focuses on scaffolding the formal "rules of engagement" for multi-agent LLM systems. When agents collaborate on domain tasks requiring fine-grained mutual exclusion over shared mutable resources (such as editing the same codebase or scheduling access to a simulated lab instrument), they naturally run into interleaving-sensitive bugs like deadlocks, missed handshakes, and race conditions. TraceFix solves this by isolating the coordination layer, formally verifying the protocol for how agents use shared locks and message channels, and then allowing the agents to remain completely autonomous in executing their actual domain-specific work.

To achieve this, TraceFix introduces a verification-first pipeline where an orchestration agent first synthesizes a declarative protocol topology and writes the behavioral logic in PlusCal. Before any agent takes action, the TLA+ model checker (TLC) exhaustively searches this proposed protocol for safety violations and feeds concrete counterexample traces back to the agent for iterative repair until the code is fully verified. At runtime, these verified process bodies are compiled into agent prompts, and a monitor strictly enforces the approved topology by rejecting any invalid or out-of-bounds coordination attempts. TraceFix is evaluated across a benchmark of 48 complex tasks. The system achieved a 100% verification success rate within four repair iterations and significantly improved runtime task completion, proving that formal model-checker feedback can effectively eliminate the deadlocks and resource clashes in multi-agent workflows.
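To give a feel for the check-then-repair loop, here is a toy Python sketch. To be clear about what is assumed: this is not TLA+, PlusCal, or TLC, and it is not the paper's code. `find_deadlock` is a tiny exhaustive search over lock interleavings that I wrote for illustration, and `repair` is a hard-coded stand-in for the LLM repair step (it simply imposes a global lock order). The lock names are made up.

```python
from collections import deque


def find_deadlock(programs):
    """Explore every interleaving; return a trace reaching a deadlock, or None."""
    n = len(programs)
    start = (tuple([0] * n), frozenset())  # (program counters, (lock, owner) pairs)
    queue, seen = deque([(start, [])]), {start}
    while queue:
        (pcs, held), trace = queue.popleft()
        owners = dict(held)
        successors = []
        for p in range(n):
            if pcs[p] == len(programs[p]):
                continue  # this process has finished
            op, lock = programs[p][pcs[p]]
            if op == "acquire":
                if lock in owners:
                    continue  # blocked: someone holds the lock
                new_owners = {**owners, lock: p}
            else:  # "release" (assumes the process holds the lock)
                new_owners = {k: v for k, v in owners.items() if k != lock}
            new_pcs = pcs[:p] + (pcs[p] + 1,) + pcs[p + 1:]
            state = (new_pcs, frozenset(new_owners.items()))
            successors.append((state, trace + [(p, op, lock)]))
        finished = all(pcs[p] == len(programs[p]) for p in range(n))
        if not successors and not finished:
            return trace  # nobody can move, yet work remains: deadlock
        for state, t in successors:
            if state not in seen:
                seen.add(state)
                queue.append((state, t))
    return None


def worker(first, second):
    return [("acquire", first), ("acquire", second),
            ("release", second), ("release", first)]


def repair(programs):
    """Stand-in for the LLM repair step: acquire locks in one global order."""
    fixed = []
    for prog in programs:
        locks = sorted({lock for _, lock in prog})
        fixed.append([("acquire", l) for l in locks] +
                     [("release", l) for l in reversed(locks)])
    return fixed


# Two agents grab the same two locks in opposite order.
buggy = [worker("repo", "instrument"), worker("instrument", "repo")]
first_counterexample = find_deadlock(buggy)

protocol, iterations = buggy, 0
while find_deadlock(protocol) is not None and iterations < 4:
    protocol = repair(protocol)
    iterations += 1

print(first_counterexample)
print(iterations, find_deadlock(protocol))
```

The useful part is the counterexample: instead of "there is a deadlock somewhere", the repair step gets the exact interleaving that leads to it, which is the kind of feedback the paper feeds back to the agent.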

Counterintuitively, this formal coordination scaffolding did not bog the system down with overhead; it significantly accelerated execution. By structurally preventing agents from colliding over resources (a problem that caused a massive 61.2% contention rate and endless retry loops in unstructured chat-only setups), TraceFix eliminated wasted trial-and-error steps. As a result, the verified topology-monitored protocol executed in an average of just 93 seconds using 62 tool-call steps. This vastly outperformed both the chaotic chat-only baseline (229 seconds and 203 steps) and a sequential single-agent baseline (~304 seconds).
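For a sense of scale, here is the back-of-the-envelope arithmetic on those reported averages. The variable names are mine, and the single-agent figure is approximate in the source, so treat that ratio as rough.

```python
# Averages as reported above (single-agent time is approximate).
tracefix_seconds, tracefix_steps = 93, 62
chat_seconds, chat_steps = 229, 203
single_agent_seconds = 304

speedup_vs_chat = round(chat_seconds / tracefix_seconds, 2)
step_reduction_vs_chat = round(chat_steps / tracefix_steps, 2)
speedup_vs_single = round(single_agent_seconds / tracefix_seconds, 2)

print(speedup_vs_chat, step_reduction_vs_chat, speedup_vs_single)
# → 2.46 3.27 3.27
```

So roughly 2.5x faster than chat-only with about a third of the tool calls, and a bit over 3x faster than running a single agent sequentially.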

Thursday (Day 2)

On Thursday, the papers were more about making agent systems manageable. Many of these papers are about memory, planning, governance, verification, security, and control. Researchers are increasingly treating agents as long-running software systems that require architecture, interfaces, observability, safety mechanisms, and operational discipline. This resembles the evolution of distributed systems from clever protocols toward operational concerns such as consistency models, monitoring, fault tolerance, and standards.

A second trend was the emergence of agents as compound systems rather than monolithic models. Instead of expecting a single model to solve everything, people are now building ecosystems of interacting components with explicit roles. The vibe here is more like systems engineering for AI.

My shortlist:

Securing Agents With Tracked Capabilities (putting the agents in a programming-language-based safety harness, Scala 3 with capability types)
Dossier: Deep Research via Ledger-Driven Branching Search and Query Encoding Learning
Open Agent Specification: Enabling Cross-Framework Comparison of AI Agents
Do Agents Need to Plan Step-by-Step? Rethinking Planning Horizon in Data-Centric Tool Calling
The Verifier Tax: Horizon Dependent Safety–Success Tradeoffs in Tool Using LLM Agents
Supervisory Control Theory for LLM Revision
FedMECA: Scalable Federated Learning via Memory-Efficient and Concurrent Aggregation

Friday (Day 3)

In the Friday program, multiple papers study agent societies, debate, persuasion, socialization, consensus formation, and safety in multi-agent environments. Also, several papers focus on practical concerns such as routing requests across models, optimizing energy consumption, serving multi-agent systems, evaluating agent frameworks, and integrating AI into production workflows.

My shortlist:
