SysMoBench: Evaluating AI on Formally Modeling Complex Real-World Systems

This paper presents SysMoBench, a benchmark designed to evaluate generative AI's ability to formally model complex concurrent and distributed systems. Although the paper was published in January 2026, the AI landscape moves so fast that the models evaluated (like Claude-Sonnet-4 and GPT-5) already feel dated after the release of heavy hitters like Claude 3.5 Opus or OpenAI's Codex.

The paper draws a distinction between algorithms/protocols and system modeling. As the paper (somewhat circularly) defines it, "system models enable verification of system code via comprehensive testing and model checking". Modeling these systems, the paper says, requires the AI to deeply understand the system design, reason about safety and liveness under unexpected faults, and abstract system behavior into an executable program.


This paper is a great follow-up to my post yesterday on TLA+ Mental Models. It is instructive to see how the paper corroborates my claims in that post. My post identified deciding what to abstract as the hardest skill in TLA+. Rather than letting the AI decide what to ignore, the benchmark tasks in Section 3.1 give the AI detailed instructions about which core actions to model and which implementation details to explicitly exclude.

The paper also validates my observation that invariants are the most signal-heavy part of a spec and, hence, the most difficult part for both AI and people to write. The authors abandoned having the AI write invariants from scratch; instead, they provide invariant templates that contain both a natural-language description and a formal TLA+ example. The AI is only asked to concretize these templates by mapping them to its own variable names.
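To make the template idea concrete, here is a hypothetical example in the spirit of what the paper describes. The names OneLeaderPerTerm, Servers, state, and currentTerm are my own illustration, not taken from the benchmark:

```tla
\* Natural-language description: "No two servers are leader in the same term."
\* Formal TLA+ template; the AI's only job is to remap Servers, state,
\* and currentTerm to whatever variable names its generated spec uses.
OneLeaderPerTerm ==
    \A i, j \in Servers :
        (state[i] = "Leader" /\ state[j] = "Leader"
            /\ currentTerm[i] = currentTerm[j]) => (i = j)
```

The template fixes the logical shape of the property, so the AI only performs a renaming, not invariant design.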

I'll be criticizing the paper in several places in my writeup below, so just to make it clear upfront, I really like the paper. This paper is a strong accept from me for any conference.

SysMoBench uses the following automated quality metrics for evaluating TLA+ models:

  • Syntax correctness: Statically checked using the SANY Syntactic Analyzer.
  • Runtime correctness: Checked by running the TLC model checker as a proxy for logical self-consistency.
  • Conformance: Measured via trace validation to see if the model conforms to the actual system implementation.
  • Invariant correctness: Model-checked against system-specific safety and liveness properties.

Here "conformance" looks contentious to me because of the uncertainty around how refinement mappings are handled and how step-size impedance mismatches are resolved. The paper circumvents this by instrumenting the system code to generate trace logs, and then uses an LLM to automatically map the TLA+ variables and actions to the code traces. If AI-generated models inherently lean towards line-by-line code translation, are we risking losing the cognitive and design benefits of abstraction? Does the conformance metric inherently bias the benchmark against good abstraction? How could a benchmark be designed to automatically evaluate and reward an AI for successfully cutting away irrelevant mechanics rather than matching them?
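For readers unfamiliar with the mechanics of trace validation, the usual shape is roughly the following sketch. The Trace constant, the Send/Recv actions, and the event record fields are assumptions of mine, not the paper's actual harness:

```tla
\* Sketch: constrain the spec's next-state relation so each step must
\* match the next logged event, advancing a cursor i through the trace.
VARIABLE i  \* cursor into the logged trace (a sequence of event records)
TraceNext ==
    /\ i <= Len(Trace)
    /\ LET e == Trace[i] IN
          \/ e.op = "Send" /\ Send(e.args)
          \/ e.op = "Recv" /\ Recv(e.args)
    /\ i' = i + 1
```

If TLC finds no behavior that consumes the whole trace, the model cannot explain what the implementation actually did, and conformance fails.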


Looking at Table 3, the LLMs fail miserably on complex systems, with only Claude-Sonnet-4 managing to piece together anything functional (and even then, scoring very low on conformance for Etcd Raft). The paper's analysis of the LLMs highlights fundamental weaknesses. LLMs constantly introduce syntax errors (as shown in Figure 4a). For instance, DeepSeek-R1 frequently hallucinates Unicode mathematical symbols (like ∩ and ∀) instead of the standard ASCII TLA+ operators (\cap, \forall). GPT-5 and Gemini-2.5-Pro frequently mix TLA+ syntax with Python. Regarding runtime errors (Figure 4b), LLMs frequently generate inconsistent TLC configurations or fundamentally misunderstand TLA+ data structures (e.g., trying to apply set operations to records). I think things have improved significantly with newer models, so an update from the team would be really great!
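As an illustration of these two failure modes (Unicode symbols and Python-isms), here is a hypothetical broken snippet next to its ASCII TLA+ fix; the Quorum, Servers, and msgs names are my own, not from the paper:

```tla
\* Hallucinated: Unicode symbol and Python-style list append (both invalid):
\*   Quorum == {Q ∈ SUBSET Servers : 2 * Cardinality(Q) > Cardinality(Servers)}
\*   msgs = msgs + [m]

\* Standard ASCII TLA+:
Quorum == {Q \in SUBSET Servers : 2 * Cardinality(Q) > Cardinality(Servers)}
Send(m) == msgs' = Append(msgs, m)
```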

The paper notes that LLMs violated 41.9% of liveness properties but only 8.3% of safety properties (as seen in Figure 4c). This indicates a severe limitation in temporal reasoning. The violations are largely due to missing or incorrectly specified fairness assumptions, though logical structural errors often block the model from making progress before the fairness issues even manifest.
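The usual shape of this fairness failure is easy to show. Without any fairness assumption, TLC admits behaviors that stutter forever, so every eventuality property fails; weak fairness on the next-state action is the standard minimal fix. A generic sketch (Init, Next, vars are placeholder names):

```tla
\* Liveness like <>Elected fails here: infinite stuttering is allowed.
Spec     == Init /\ [][Next]_vars

\* Weak fairness rules out stuttering forever when Next stays enabled.
FairSpec == Init /\ [][Next]_vars /\ WF_vars(Next)
```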

The Trace Learning Agent, which attempts to infer the TLA+ model purely from system execution traces rather than source code, also failed miserably. And I think this is another indication of the temporal reasoning problem LLMs have for modeling. The trace learning agent underperformed compared to the basic modeling and code translation agents, failing to pass runtime checks entirely. Even Claude-Sonnet-4 achieved extremely low syntax scores when using the Trace Learning Agent.

The paper is a great start for an LLM system-modeling benchmark. But it currently lacks diversity: the dataset leans heavily on consensus protocols. Out of the 7 distributed systems artifacts, a large portion (Etcd Raft, Redis Raft, Xline CURP, PGo raftkvs, ZooKeeper FLE) are essentially variations of distributed consensus/leader election. The authors mention that adding new systems to SysMoBench requires significant effort to instrument the system to collect execution logs for trace validation. I think they need a better sell: why should an engineer go through the effort of instrumenting their system for SysMoBench if the AI is going to fail at modeling it? A clearer value proposition is needed to drive community contributions here.
