Measuring AI Ability to Complete Long Software Tasks

This paper from METR (Model Evaluation & Threat Research) introduces a new metric for tracking AI progress: the "50%-task-completion time horizon". This denotes the length of a software engineering task (measured by how long a skilled human developer takes to complete it) that an AI model can finish with a 50% success rate. The researchers evaluated 12 frontier AI models on 170 tasks across three benchmarks: HCAST (97 diverse software tasks ranging from 1 minute to 30 hours), RE-Bench (7 difficult ML research engineering tasks, each 8 hours), and SWAA (66 short software actions taking 1-30 seconds). To calibrate task difficulty, they collected over 800 human baselines from professional developers, totaling 2,529 hours of work.

The headline finding is that this time horizon has been doubling every 7 months since 2019. GPT-2 could handle tasks that take humans about 2 seconds. The o3 model reached 110 minutes. Extrapolating this trend, AI reaches a one-month time horizon (167 working hours) between mid-2028 and mid-2031, with a central estimate of mid-2029. To be precise about what this means: the "167 working hours" refers to task complexity as measured by human effort, not AI execution time. An AI agent with a one-month time horizon could complete a task that takes a human developer a month of work within hours; it works continuously, doesn't context-switch, and executes steps faster. In practical terms, a one-month time horizon means AI would be able to autonomously build a SaaS MVP, migrate a 200k-line legacy codebase to microservices, or implement complex features end-to-end.
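The extrapolation is simple enough to sanity-check by hand. A minimal sketch, using the numbers quoted above (the 7-month doubling time, o3's ~110-minute horizon) and an assumed decimal-year date for o3, not the paper's actual regression fit:

```python
from math import log2

# Back-of-envelope version of the trend extrapolation.
# Assumptions (not from the paper's fit): o3's horizon is 110 minutes
# as of roughly early-to-mid 2025, and the doubling time stays 7 months.
doubling_months = 7
o3_horizon_minutes = 110
o3_date = 2025.3                 # assumed decimal year for o3

target_minutes = 167 * 60        # one month of work = 167 working hours

doublings = log2(target_minutes / o3_horizon_minutes)
projected_year = o3_date + doublings * doubling_months / 12

print(f"{doublings:.1f} doublings -> ~{projected_year:.1f}")
```

About 6.5 more doublings are needed, which at 7 months each lands in early-to-mid 2029, consistent with the paper's central estimate.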

That said, "50% success rate" is doing the heavy lifting in this claim. There is a significant reliability gap: the 80% success-rate time horizon is 4-6x shorter than the 50% horizon. A model that can sometimes complete a month-long task can only reliably complete week-long ones. Models also struggle in messy environments that lack clear feedback loops (like automated tests) and require proactive information gathering. This means well-structured, well-tested codebases will benefit far more than legacy systems with poor documentation. That is, greenfield work will see more gains than deep maintenance of complex existing systems.
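The 4-6x gap falls out naturally if you model success probability as a logistic function of log task length, which is the shape of model the paper fits. An illustrative sketch, with a hypothetical slope `beta` chosen so the gap comes out near 5x (the paper's fitted slopes differ per model):

```python
from math import exp, log, log2

def p_success(task_minutes, h50_minutes, beta=0.6):
    """Success probability as a logistic in log2(task length).

    Equals 0.5 when task length == h50_minutes and falls off for
    longer tasks. beta=0.6 is a hypothetical slope, not the paper's.
    """
    x = log2(h50_minutes / task_minutes)   # positive for tasks shorter than h50
    return 1 / (1 + exp(-beta * x))

h50 = 110                                  # o3's 50% horizon, in minutes
# Solving p_success(t, h50) = 0.8 for t: sigmoid hits 0.8 at ln(4).
h80 = h50 / 2 ** (log(4) / 0.6)

print(f"80% horizon ~= {h80:.0f} min ({h50 / h80:.1f}x shorter)")
```

With this slope the 80% horizon is about 22 minutes versus a 110-minute 50% horizon; a steeper slope shrinks the gap, a shallower one widens it, which is why the observed 4-6x ratio is itself a measure of how sharply reliability degrades with task length.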

Another caveat is that experiments with pull requests showed that AI performance aligns more closely with low-context contractors (who take 5-18x longer than expert maintainers) than with experienced developers who know the codebase. This matters because the extrapolation is calibrated against general-purpose developers, not domain experts. When the paper says "one-month time horizon", it means one month of contractor-level work, not one month of work by someone who built the system. For tasks that require deep institutional knowledge (debugging subtle production issues, evolving tightly coupled legacy systems), the effective time horizon is much shorter.

I think the external validity checks were a strong part of the paper. The paper shows that the trend holds on SWE-bench (with an even faster 70-day doubling time). Again, messy tasks show lower absolute success but the same rate of improvement. The methodology is clean and reproducible. The main weakness is the task distribution. These 170 tasks are mostly software tasks with high inter-model task correlation (r=0.73), and they largely represent isolated coding challenges rather than the full breadth of real-world software engineering. True full-stack ownership (gathering specifications, architectural decision-making, deployment, and live operations) remains untested here. Furthermore, while AI excels at isolated prototyping, engineering end-to-end production-grade distributed systems requires a high level of fault tolerance/reliability that remains incredibly challenging to automate.


Discussion: What does a one-month AI time horizon world look like?

If the extrapolation holds and AI systems can automate month-long software tasks by ~2029, what are the implications?

This changes the unit economics of software fundamentally. Tasks that currently require a team of developers working for weeks (building integrations, standing up new services, migrating databases, writing and testing features) become something you can delegate to an AI agent for a fraction of the cost and time. The AI agent operates at the level of a competent junior-to-mid engineer working on a well-scoped project. Software goes from expensive and slow to cheap and fast.

But something tells me we will find ways to squander this. When hardware kept getting faster through Moore's law, we produced bloated software that felt just as slow. Wirth's law ate Moore's law. When cloud computing made infrastructure cheap, we built sprawling microservice architectures that ended up like a tangled ball of yarn, hard to unravel. Every time we reduce one cost, we expand scope until the new bottleneck hurts just as much as the old one. I expect AI-generated software to follow the same pattern: more software will mean more complexity, and new categories of problems we don't yet have names for. 50 shades of metastability?


What does the month-long task horizon mean for big tech, enterprise software, and startups? How the hell should I know? But let me speculate anyway, since you are still reading (or at least your LLMs are). 

Big tech companies employ tens of thousands of engineers specifically for the month-long projects this extrapolation targets: internal tools, service migrations, API implementations, data pipelines. AI will accelerate all of this. But these companies are structured around large engineering orgs. Promotion ladders, team hierarchies, and planning processes all assume that software requires massive human headcount. If one engineer with AI agents does the work of five, the traditional organizational model breaks. Middle management's role becomes muddled. We may see flattening hierarchies and small high-leverage teams. Whether the behemoth organizations can handle such a rapid restructuring is unclear.

For infrastructure and database companies, the Jevons Paradox applies. Cheaper software means more of it: more applications, more databases, more demand for managed services. The total addressable market could grow substantially. But the flip side is that AI agents that can build month-long projects can also evaluate and switch infrastructure. Vendor lock-in weakens when migration is cheap, and customers become more price-sensitive. There's also the question of whether AI agents will develop strong default preferences shaped by their training data or their creator companies. This is a brand new market force, which we didn't have to contend with before.

I think the most dramatic impact is on software startups. Today, the primary cost of a startup is engineering talent. If AI handles month-long implementation tasks, that cost drops by an order of magnitude. The barrier to entry collapses. Many more products get built. That may mean differentiation shifts to domain expertise, data moats, and taste. I forecasted a similar trend earlier for academic papers.

The reliability gap tells us the world of 2029-2031 is probably not "AI replaces developers" but "developers who use AI effectively are 5-10x more productive". The scarce resource shifts from the ability to write code to the ability to specify what to build, evaluate whether it is correct, and manage the complexity of systems that grow much faster than before. What worries me more is whether organizations can restructure fast enough to keep up.
