ITBench-AA: Frontier AI Agents Still Score Below 50% on Real IT Work

If you've spent the past year watching AI agents ace benchmark after benchmark, here's a useful corrective from late May 2026: on ITBench-AA, a new evaluation of agentic enterprise IT work from IBM's Software Innovation Lab and Artificial Analysis, every frontier model scores below 50%. The top result is a Claude Opus 4.7 configuration at 47%, with GPT-5.5 close behind at 46%, and the listed results decline from there to 30%. For a field accustomed to saturated leaderboards, a benchmark where the best models fail more than half the time is a rare and clarifying signal.

This isn't a leaderboard recap. The launch is the hook; the durable point is what the result tells you about agent reliability before you put one in charge of real infrastructure. Below: what ITBench-AA actually measures, why the scores are so low, what the numbers do and don't prove, and how to read them when you're deciding whether to deploy an agent.

What is ITBench-AA and what does it measure?

ITBench-AA evaluates AI models on agentic enterprise IT tasks — work an autonomous agent would have to do inside a real IT environment, not a sanitized quiz. The current release focuses on Site Reliability Engineering (SRE), specifically Kubernetes incident response: a model has to diagnose live systems by reading logs, tracing dependencies, and identifying the root-cause entities behind an incident across complex infrastructure.

It's a partnership effort. IBM's Software Innovation Lab built the underlying ITBench dataset, and Artificial Analysis implemented it for frontier-model evaluation, in a collaboration the teams say took roughly six months. Two more task categories are planned beyond SRE: Financial Operations (FinOps) and Chief Information Security Officer (CISO) scenarios. In other words, SRE is the first slice of a broader picture of how agents handle consequential enterprise work.

How well do frontier models actually do?

Below 50%, across the board. On the SRE tasks, the published results include:

Claude Opus 4.7 (Adaptive Reasoning, Max Effort) — 47%
GPT-5.5 (xhigh) — 46%
Qwen3.7 Max — 42%
GLM-5.1 (Reasoning) — 40%
Gemini 3.5 Flash (high) — 40%
DeepSeek V4 Pro (Reasoning, Max Effort) — 38%
Gemma 4 31B (Reasoning) — 37%
Gemini 3.1 Pro Preview — 30%

The teams call ITBench-AA SRE "one of the least saturated agentic benchmarks" — which is precisely why it's worth paying attention to. When a benchmark is saturated (everyone scores in the 90s), it has stopped discriminating between models and stopped telling you anything about the hard part of the job. A benchmark where the frontier tops out at 47% still has enormous headroom, and the gaps between models are meaningful rather than noise.

Why are the scores so low?

Because the task is hard in the specific way real operations work is hard. ITBench-AA scores using average precision at full recall, and the methodology is unforgiving: if a model misses any ground-truth root cause, that task scores 0.0. There's no partial credit for finding most of the root causes, and because the metric is a precision, incorrect entities reported alongside the true ones pull the score down as well. Real incident response works the same way: a fix that addresses three of four contributing causes can still leave the system broken.
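To make the scoring rule concrete, here is a minimal sketch of one plausible reading of "precision at full recall" for a single task. It assumes the agent returns a flat list of suspected root-cause entities; the function name and the entity names are hypothetical, and this is not the benchmark's own scoring code.

```python
def precision_at_full_recall(predicted, ground_truth):
    """Score one task: 0.0 unless every ground-truth root cause is named.

    Assumption: predictions and ground truth are lists of entity names.
    This is an illustrative reading of the metric, not ITBench-AA's code.
    """
    predicted_set = set(predicted)
    truth_set = set(ground_truth)
    if not truth_set:
        raise ValueError("ground_truth must contain at least one root cause")
    # Full-recall gate: missing any true root cause zeroes the task.
    if not truth_set <= predicted_set:
        return 0.0
    # Recall is satisfied, so every extra entity is a false positive
    # that lowers precision.
    return len(truth_set) / len(predicted_set)


# Hypothetical entity names, for illustration only.
truth = ["payments-db"]
print(precision_at_full_recall(["payments-db"], truth))                    # 1.0
print(precision_at_full_recall(["payments-db", "fault-injector"], truth))  # 0.5
print(precision_at_full_recall(["checkout-api"], truth))                   # 0.0
```

Under this reading, guessing widely is not a safe strategy: naming more entities can rescue recall, but each wrong one costs precision.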

The evaluation also gives models room to work and then holds them to the result: up to 100 turns per task, with three repeats per task, run through an open-source reference harness (called Stirrup) that gives the model shell access to a sandboxed file system. So these aren't one-shot trivia scores; they're what models manage with extended, tool-using investigation.
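The protocol described above (a turn cap per task, several repeats per task) can be sketched as a small evaluation loop. This is a rough illustration, not the Stirrup harness or its interface: `run_episode`, `benchmark_score`, `make_step`, and the task dictionary shape are all hypothetical, and averaging the repeats with a plain mean is an assumption rather than something the source states.

```python
from statistics import mean
from typing import Callable, Optional

MAX_TURNS = 100  # per-task turn cap reported for ITBench-AA
REPEATS = 3      # repeats per task reported for ITBench-AA


def run_episode(step: Callable[[int], Optional[list]],
                max_turns: int = MAX_TURNS) -> list:
    """Call step(turn) until it returns a final answer or turns run out."""
    for turn in range(max_turns):
        answer = step(turn)
        if answer is not None:
            return answer
    return []  # turn budget exhausted without a final answer


def benchmark_score(tasks, make_step, score, repeats: int = REPEATS) -> float:
    """Average a per-task score over repeats, then over tasks.

    Assumption: repeats are combined with a simple mean.
    """
    per_task = []
    for task in tasks:
        runs = [
            score(run_episode(make_step(task)), task["root_causes"])
            for _ in range(repeats)
        ]
        per_task.append(mean(runs))
    return mean(per_task)
```

Here `make_step(task)` would wrap a real agent and return a function that advances it by one turn, and `score` is whatever per-task metric you choose. Repeating each task matters because agent runs are not deterministic; a single pass can flatter or punish a model by chance.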

And more effort isn't automatically better. As the teams note, "models that over-investigate tend to surface upstream fault-injection mechanisms or co-occurring symptoms as false positives." An agent that keeps digging can talk itself into the wrong root cause — a failure mode that maps directly to the over-eager behavior teams see from agents in production.

Why does an agent reliability benchmark matter for deployment?

Because the gap between a demo and a dependable system is exactly the gap ITBench-AA measures. Diagnosing a Kubernetes incident is a faithful proxy for the autonomous work organizations actually want agents to do: operate over messy real state, use tools, reason across dependencies, and commit to a consequential conclusion. A sub-50% top score is a direct, quantified caution that frontier agents are not yet reliable enough to run this class of work unsupervised.

Read constructively, that's not a reason to avoid agents — it's a reason to scope them honestly:

Keep a human in the loop for high-stakes diagnoses and actions; treat the agent's root-cause call as a hypothesis, not a verdict.
Design for the false-positive failure mode. The "over-investigation" finding suggests bounding investigation depth and requiring corroboration before an agent acts on a conclusion.
Benchmark on your own tasks. A 47% ceiling on Kubernetes SRE won't transfer cleanly to your domain — use ITBench-AA as a template for the kind of hard, all-or-nothing evaluation you should run, not as a guarantee.
Watch the trend, not just the snapshot. The interesting question is how fast that 47% climbs as models and harnesses improve.
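As one way to picture the first two recommendations, the sketch below bounds how long an agent may investigate and only surfaces root-cause candidates backed by more than one kind of evidence. Every name and threshold here (`Hypothesis`, `triage`, `max_steps=30`, `min_evidence=2`) is a hypothetical choice for illustration, not something taken from ITBench-AA or any particular agent framework.

```python
from dataclasses import dataclass, field


@dataclass
class Hypothesis:
    entity: str                                 # suspected root-cause entity
    evidence: set = field(default_factory=set)  # distinct signal types seen


def triage(hypotheses, steps_used, max_steps=30, min_evidence=2):
    """Decide what the agent may do next; it never acts on its own.

    Returns a (decision, entities) pair. Thresholds are illustrative.
    """
    # Only candidates supported by several independent signals are surfaced.
    corroborated = sorted(
        h.entity for h in hypotheses if len(h.evidence) >= min_evidence
    )
    if corroborated:
        return ("propose_to_human", corroborated)
    # Out of budget with nothing corroborated: stop digging and hand over.
    if steps_used >= max_steps:
        return ("escalate", [])
    return ("keep_investigating", [])


# Hypothetical incident state, for illustration only.
db = Hypothesis("payments-db", {"logs", "metrics"})
injector = Hypothesis("fault-injector", {"logs"})
print(triage([db, injector], steps_used=10))
```

The design choice is that a single-signal suspicion, such as an upstream mechanism that merely shows up in the logs, never reaches a human as a conclusion, and an agent that runs out of budget escalates instead of continuing to dig.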

What ITBench-AA does and doesn't prove

It's worth being precise. ITBench-AA shows that, on a hard, low-saturation SRE benchmark with strict all-or-nothing scoring, today's frontier models land below 50%. It does not prove agents are useless for IT work — a model that correctly resolves a meaningful share of incidents under a punishing metric can still be valuable as an assistant. Nor does one benchmark settle "agent reliability" in general; SRE is a single (if representative) slice, with FinOps and CISO still to come.

What it does well is give the industry a quantified, hard-to-game reference point at a moment when agent capability claims often outrun agent reliability. That's the durable contribution: a number you can point to the next time someone says agents are ready to run production on their own.

Takeaways

On ITBench-AA's SRE tasks, every frontier model scores below 50%: the top result is 47% (Claude Opus 4.7), with GPT-5.5 at 46%.
The benchmark measures real agentic work — Kubernetes incident root-cause analysis — with strict all-or-nothing scoring (miss any root cause, score zero).
Low, unsaturated scores make it a useful benchmark: it still discriminates between models and exposes failure modes like over-investigation producing false positives.
The practical lesson is scope, not skepticism: keep humans in the loop for consequential calls, design against over-eager false positives, and benchmark agents on your own high-stakes tasks before trusting them unsupervised.
