Measuring Agents in Production
When you are in the TPOT echo chamber, you would think fully autonomous AI agents are running the world. But this December 2025 paper, "Measuring Agents in Production", cuts through the hype. It surveys 306 practitioners and conducts 20 in-depth case studies across 26 domains to document what is actually running in live environments. The reality is far more basic, constrained, and human-dependent than TPOT suggests.
The Most Surprising Findings
Simplicity and Bounded Autonomy: 80% of case studies use predefined structured workflows rather than open-ended autonomous planning, and 68% execute fewer than 10 steps before requiring human intervention. Frankly, these systems sound to me less like "autonomous agents" than glorified state machines or multi-step RAG pipelines.
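To make the distinction concrete, here is a minimal sketch (my own hypothetical example, not code from the paper) of what such a "predefined structured workflow" typically looks like: a fixed classify-retrieve-draft pipeline with an explicit human checkpoint, rather than an open-ended planning loop. The function names and the tiny knowledge base are placeholders standing in for LLM and retrieval calls.

```python
def classify(ticket: str) -> str:
    # Placeholder for an LLM call that labels the ticket.
    return "billing" if "invoice" in ticket.lower() else "general"

def retrieve(category: str) -> list[str]:
    # Placeholder for a retrieval step over internal docs.
    kb = {"billing": ["refund policy", "invoice FAQ"], "general": ["help index"]}
    return kb.get(category, [])

def draft_reply(ticket: str, docs: list[str]) -> str:
    # Placeholder for an LLM call that drafts a reply grounded in the docs.
    return f"Draft reply using {len(docs)} doc(s)."

def run_workflow(ticket: str) -> dict:
    # Fixed pipeline: classify -> retrieve -> draft -> hand off to a human.
    # The agent never decides its own next step; the graph is hard-coded.
    category = classify(ticket)
    docs = retrieve(category)
    draft = draft_reply(ticket, docs)
    return {"category": category, "draft": draft, "needs_human_review": True}

result = run_workflow("Question about my invoice")
print(result["category"])            # billing
print(result["needs_human_review"])  # True
```

Note that every transition is hard-coded and the final state is a human handoff, which matches the paper's "fewer than 10 steps before human intervention" pattern.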
Prompting Beats Fine-Tuning: Despite the academic obsession with reinforcement learning and fine-tuning, 70% of teams building production agents simply prompt off-the-shelf proprietary models. Custom-tuned models are often too brittle, and they break when foundation model providers update their models.
Tolerance for Latency: While in database systems and distributed systems we focus on shaving milliseconds and microseconds off response times, in the agent world 66% of deployed systems take minutes or even longer to respond. I am not comparing or criticizing, given the intrinsically different nature of the work; I am just noting how vastly different the latency expectations are.
Custom Infrastructure Over Heavy Frameworks: Though many developers experiment with frameworks like LangChain, 85% of the detailed production case studies ended up building their systems completely in-house using direct API calls. Teams actively migrate away from heavy frameworks to reduce dependency bloat and maintain the flexibility to integrate with their own proprietary enterprise infrastructure.
Benchmarks are Abandoned: 75% of production teams skip formal benchmarking entirely. Because real-world tasks are incredibly messy and domain-specific, teams rely instead on A/B testing, production monitoring, and human-in-the-loop evaluation (which a massive 74% of systems use as their primary check for correctness).
Reliability (consistent correct behavior over time) remains the primary bottleneck and challenge. OK, this one was not really a surprising finding.
The paper says agents in production deliver tangible value: 80% of practitioners explicitly deploy them for productivity gains, and 72% use them to drastically reduce human task-hours. This would have been a great place to be concrete and dive deeper into a couple of these cases, because companies have a strong incentive to exaggerate their agent use. The closest the paper comes to this is Table 3 in the appendix.
Discussion
So the data says that the state of multi-agent systems in production is exaggerated. Everyone says they are doing it, but only a few actually are. And those who are doing it are keeping it basic.
This feels familiar.
Remember 2018? IBM published a whitepaper stating that "7 in 10 consumer industry executives expect to have a blockchain production network in 3 years". They famously claimed blockchains would cure almost every business ailment, reducing 9 distinct frictions including "inaccessible marketplaces", "restrictive regulations", "institutional inertia", "invisible threats", and "imperfect information". Ha, "invisible threats", it cracks me up every time!
Of course, Pepperidge Farm remembers the massive 2018 hype about Walmart tracking lettuce on the blockchain to pinpoint E. coli contamination events. We were promised a decentralized revolution, but we only got shitcoins.
For blockchains, whenever I asked people about the killer application, they would mumble something like, "It is trustless", to which I would respond, "That is not an application", which would make them respond with "But it is decentralized". Today the AI agent narrative occasionally gives off a similar vibe. When pressed on the ultimate value of these systems, the default response is often a claim of "productivity gains". Unfortunately, there hasn't been much deep elaboration on what this actually entails at scale.
But comparing AI agents to blockchains is unfair. Agents already have a couple of killer applications, particularly in coding, data analysis, and customer care. They have successfully made it into deployment, albeit in a very basic and constrained manner. It is just that they aren't the fully autonomous, hyper-intelligent multi-agent swarms that people claim they are. They remain human-supervised, highly constrained tools.
This connects perfectly to the arguments I made in my Agentic AI and The Mythical Agent-Month post regarding the mathematical laws of scaling coordination. Throwing thousands of AI agents at a project does not magically bypass Brooks' Law. Agents can dramatically scale the volume of code generated, but they do not scale insight. In fact, due to their vast operational speed, agent coordination complexity will likely far exceed the $O(N^2)$ coordination complexity that Fred Brooks originally postulated for human teams. Until we solve the fundamental epistemic gap of distributed knowledge, adding more agents to a system will simply produce a faster, more expensive way to generate merge conflicts.
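The quadratic cost is easy to see with Brooks' own arithmetic: $N$ workers (human or agent) create $N(N-1)/2$ potential pairwise communication channels. A quick sketch:

```python
def channels(n: int) -> int:
    # Brooks' pairwise communication channels for n workers:
    # each of the n*(n-1) ordered pairs counted once -> n*(n-1)/2.
    return n * (n - 1) // 2

for n in (5, 50, 500):
    print(n, channels(n))
# 5 -> 10, 50 -> 1225, 500 -> 124750
```

A 100x increase in team size yields roughly a 10,000x increase in potential coordination links, which is why "just add more agents" does not scale insight.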