Can AI Agents Actually Do Enterprise IT Work? What ITBench-AA's Sub-50% Scores Reveal

May 31, 2026 · 8 min read

Every frontier model tested on real enterprise IT operations work scored below 50%. That's the headline from ITBench-AA, a benchmark launched on May 27, 2026 by Artificial Analysis and IBM to measure how well today's best AI agents handle the kind of incident-response work that keeps production systems alive. The current leader, Claude Opus 4.7, managed 47%. GPT-5.5 reached 46%. After that, the numbers fall off.

If you are deciding whether to put agents into your IT operations, this is the most useful kind of result: a credible, hard number that cuts through demo-day optimism. Here's what ITBench-AA actually measures, why the scores are so low, and what the data should — and shouldn't — change about your plans.

What is ITBench-AA?

ITBench-AA is, by its authors' description, the first benchmark for evaluating frontier models on agentic enterprise IT tasks. It is a collaboration between Artificial Analysis (the independent model-evaluation group) and IBM's Software Innovation Lab, built on top of IBM's existing ITBench dataset and its deep bench of enterprise IT operations expertise.

Crucially, it doesn't test trivia or single-turn Q&A. It tests whether an agent can work through a realistic operations problem end to end. The benchmark currently covers Site Reliability Engineering (SRE), with Financial Operations (FinOps) and Chief Information Security Officer (CISO) tracks planned, so it is meant to be a growing yardstick rather than a one-off snapshot.

How does the benchmark actually work?

The SRE track contains 59 tasks (40 public, 19 held out to resist gaming). Each task hands the model a snapshot of a live Kubernetes incident: alerts, events, traces, metrics, logs, and application topology, plus shell access to a sandboxed filesystem. The model has to do what an on-call engineer does — investigate and pinpoint the minimal set of root-cause Kubernetes entities behind the incident.

The scoring is deliberately unforgiving. ITBench-AA uses average precision at full recall: if the model misses any ground-truth root cause, the task scores 0.0. Only when it has found all of them does it earn credit, scored as precision — true positives over true positives plus false positives. In plain terms: you must find everything that's broken, and you're penalized for crying wolf about things that aren't.
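
The scoring rule above can be sketched in a few lines. This is a minimal illustration of recall-gated precision for a single task, not the benchmark's own code; the function name `score_task` and the Kubernetes entity names are hypothetical.

```python
def score_task(predicted: set[str], ground_truth: set[str]) -> float:
    """Recall-gated precision for one task, following the rule described above."""
    if not ground_truth:
        raise ValueError("ground_truth must contain at least one root cause")
    # Gate: missing any ground-truth root cause zeroes the task.
    if not ground_truth <= predicted:
        return 0.0
    true_positives = len(predicted & ground_truth)
    false_positives = len(predicted - ground_truth)
    # Full recall reached, so the score is plain precision.
    return true_positives / (true_positives + false_positives)


# Hypothetical entities, for illustration only.
truth = {"deployment/checkout", "configmap/checkout-env"}

print(score_task({"deployment/checkout"}, truth))                     # 0.0
print(score_task(truth, truth))                                       # 1.0
print(score_task(truth | {"service/frontend", "pod/cart-0"}, truth))  # 0.5
```

The third call finds both real root causes but adds two false alarms, so it earns half credit. The first finds one real cause and nothing wrong, yet scores zero.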

To keep comparisons fair, every model runs through the same open-source Stirrup reference harness, with a 100-turn cap per task and three repeats per task to smooth out variance.
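
To see why repeats smooth out variance, here is a small sketch of one plausible aggregation: average the three repeats within each task, then average across tasks. The article does not say exactly how ITBench-AA combines repeats, so the averaging order, the function name `benchmark_score`, and the sample scores are all assumptions.

```python
from statistics import mean

REPEATS_PER_TASK = 3  # repeat count stated in the article


def benchmark_score(runs: dict[str, list[float]]) -> float:
    """Mean of per-task means. The aggregation order is an assumption."""
    for task_id, scores in runs.items():
        if len(scores) != REPEATS_PER_TASK:
            raise ValueError(f"{task_id}: expected {REPEATS_PER_TASK} repeats")
    return mean(mean(scores) for scores in runs.values())


# Made-up per-repeat task scores, for illustration only.
sample_runs = {
    "task-a": [1.0, 0.0, 1.0],  # two clean solves, one miss
    "task-b": [0.5, 0.5, 0.5],  # consistent half credit
    "task-c": [0.0, 0.0, 0.0],  # never reaches full recall
}

print(f"{benchmark_score(sample_runs):.3f}")
```

With a single run, "task-a" would score either 1.0 or 0.0 depending on luck. Three repeats pull it toward a steadier estimate.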

Which models lead — and how close are they?

Among frontier models:

Model | Score
Claude Opus 4.7 (Adaptive Reasoning, Max Effort) | 47%
GPT-5.5 (xhigh) | 46%
Qwen 3.7 Max | 42%

In the next tier, the open-weights GLM-5.1 (Reasoning) and the proprietary Gemini 3.5 Flash (high) tie at 40%, ahead of the open-weights DeepSeek V4 Pro (38%) and Gemma 4 31B (37%); Gemini 3.1 Pro Preview trails at 30%.

The striking part isn't the ranking — it's the ceiling. The gap between first and second place is a single point, and the entire field is clustered below the halfway mark. No model is close to reliably resolving real incidents on its own.

Why do the best models still fail more than half the time?

Two findings explain the wall.

First, more effort isn't more accuracy. ITBench-AA found that trajectory length varies almost threefold without tracking performance. GPT-5.5 at "xhigh" averaged 31 turns per task at 46%; Gemini 3.1 Pro Preview averaged 83 turns at just 30%. Models that over-investigate tend to surface false positives — flagging upstream fault-injection mechanisms or co-occurring symptoms — and the recall-gated precision scoring punishes exactly that. Knowing when to stop looking is part of the skill, and current models are bad at it.
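
The two data points above make the mismatch easy to quantify. The sketch below uses only the figures quoted in this article; "score points per turn" is an illustrative ratio introduced here, not a metric ITBench-AA reports.

```python
# Average turns and scores as quoted in the article.
RUNS = {
    "GPT-5.5 (xhigh)": {"avg_turns": 31, "score_pct": 46},
    "Gemini 3.1 Pro Preview": {"avg_turns": 83, "score_pct": 30},
}


def points_per_turn(score_pct: float, avg_turns: float) -> float:
    """Illustrative ratio only; not a benchmark-defined metric."""
    return score_pct / avg_turns


turn_ratio = (
    RUNS["Gemini 3.1 Pro Preview"]["avg_turns"] / RUNS["GPT-5.5 (xhigh)"]["avg_turns"]
)
print(f"turn ratio: {turn_ratio:.2f}x")  # turn ratio: 2.68x

for name, run in RUNS.items():
    ratio = points_per_turn(run["score_pct"], run["avg_turns"])
    print(f"{name}: {ratio:.2f} score points per turn")
```

The longer trajectory uses roughly 2.7 times as many turns and still lands 16 points lower.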

Second, the task is genuinely hard. Root-cause analysis in a distributed Kubernetes system requires holding a causal model of the whole stack, distinguishing symptom from cause, and committing to a complete-but-minimal answer. That is a different competency from writing code or summarizing a document, and it is precisely where agents — as we've argued before — hit an execution bottleneck rather than an intelligence one.

Does spending more money buy better results?

Not reliably — and this is one of the most actionable parts of the data. ITBench-AA paired each score with a cost-per-task:

  • Gemma 4 31B (Reasoning): $0.14/task at 37% — outperforming Gemini 3.1 Pro Preview, which cost $2.23/task for 30%.
  • GLM-5.1 (Reasoning): $1.23/task at 40%.
  • Gemini 3.5 Flash (high): $1.70/task at 40%.
  • Claude Opus 4.7: $5.38/task at 47% — the top score, but also the most expensive by a wide margin.

The lesson for buyers: the priciest model buys you a few accuracy points, while a cheap open-weights model can match mid-tier proprietary options at a fraction of the cost. For a workload you'd run thousands of times, that spread compounds fast.
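
To make the spread concrete, the sketch below projects the quoted per-task costs onto a hypothetical volume of 10,000 tasks and computes score points per dollar. The volume and the points-per-dollar ratio are assumptions introduced for illustration; the costs and scores are the ones listed above.

```python
# (cost per task in USD, score in percent), as quoted in the article.
RESULTS = {
    "Gemma 4 31B (Reasoning)": (0.14, 37),
    "GLM-5.1 (Reasoning)": (1.23, 40),
    "Gemini 3.5 Flash (high)": (1.70, 40),
    "Gemini 3.1 Pro Preview": (2.23, 30),
    "Claude Opus 4.7": (5.38, 47),
}

ASSUMED_TASKS = 10_000  # hypothetical workload size, not from the benchmark


def total_cost(cost_per_task: float, n_tasks: int) -> float:
    """Projected spend for n_tasks at a flat per-task cost."""
    return cost_per_task * n_tasks


def points_per_dollar(score_pct: float, cost_per_task: float) -> float:
    """Illustrative efficiency ratio; not a benchmark-defined metric."""
    return score_pct / cost_per_task


# Print cheapest first.
for name, (cost, score) in sorted(RESULTS.items(), key=lambda kv: kv[1][0]):
    spend = total_cost(cost, ASSUMED_TASKS)
    ppd = points_per_dollar(score, cost)
    print(f"{name:<26} {score:>3}%  ${spend:>9,.0f}  {ppd:>7.2f} pts/$")
```

At that assumed volume, the cheapest model listed costs about $1,400 and the top scorer about $53,800, for a 10-point difference in score.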

What does this mean for deploying agents in enterprise IT?

Read pessimistically, "every model under 50%" sounds like agents aren't ready. Read carefully, it's more nuanced — and more useful.

  • Don't hand agents autonomous incident resolution yet. A tool that scores below 50% on root-cause identification cannot be trusted to act unsupervised on production systems.
  • Do use agents as assistive copilots. Even a model that scores 47% can accelerate a human engineer who reviews and corrects its output: triaging signals, drafting hypotheses, and gathering context faster than a person alone.
  • Benchmark on your own cost curve. The cheapest competent model may be the right default; reserve the expensive frontier model for the hardest incidents, not every page.
  • Watch the trajectory. Because effort doesn't equal accuracy, an agent that investigates forever is burning tokens and adding false positives. Cap turns and measure precision, not activity.
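
The last recommendation, capping turns, can be sketched as a simple bounded loop. This is a generic illustration, not the Stirrup harness: `run_with_turn_cap` and the `step` callable are hypothetical, and the default of 100 simply mirrors the benchmark's per-task cap.

```python
from typing import Callable, Optional


def run_with_turn_cap(
    step: Callable[[int], Optional[set[str]]],
    max_turns: int = 100,
) -> tuple[Optional[set[str]], int]:
    """Call step(turn) until it returns an answer or the cap is reached.

    `step` stands in for one agent turn (hypothetical); it returns None
    while still investigating and a set of root-cause entities when done.
    """
    for turn in range(1, max_turns + 1):
        answer = step(turn)
        if answer is not None:
            return answer, turn
    # Cap reached without an answer: stop spending tokens.
    return None, max_turns


# Toy agents for illustration: one commits at turn 5, one never commits.
def decisive(turn: int) -> Optional[set[str]]:
    return {"deployment/checkout"} if turn == 5 else None


def endless(turn: int) -> Optional[set[str]]:
    return None


print(run_with_turn_cap(decisive))               # ({'deployment/checkout'}, 5)
print(run_with_turn_cap(endless, max_turns=20))  # (None, 20)
```

Logging the turn count alongside the precision score for each incident makes it visible when an agent is spending effort without improving its answer.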

Is ITBench-AA the final word?

No benchmark is, and this one is honest about its scope. It currently covers SRE on Kubernetes incidents — a meaningful slice of enterprise IT, but not all of it; FinOps and CISO tracks are still to come. Scores will also move as labs optimize against it, which is part of why 19 tasks are held out. Treat ITBench-AA as a rigorous, evolving yardstick for one hard domain, not a verdict on every agentic use case.

Practical takeaways for Clawvard readers

The most important thing ITBench-AA does is replace vibes with a number. Frontier agents can reason impressively and still fail real operations work most of the time — and paying more doesn't reliably fix it. If you're evaluating agents for enterprise IT, anchor your decision to task-grounded benchmarks like this one, keep a human in the loop for anything irreversible, and pick your model on the accuracy-per-dollar curve rather than the leaderboard alone.

For more on where agent capability actually breaks down, read our analysis of why the bottleneck is execution, not intelligence, and our head-to-head comparison of leading open-source agent frameworks. And if you want to evaluate agents against your own tasks before trusting them in production, that's exactly what Clawvard is built to do.
