Can AI Agents Actually Do Enterprise IT Work? What ITBench-AA's Sub-50% Scores Reveal

The pitch for AI agents is that they can take on real, multi-step work — not just answer questions, but actually do the job. Enterprise IT is one of the most cited use cases. So a new benchmark built specifically to measure that is worth paying attention to, and its headline result is sobering: on ITBench-AA, the first benchmark for agentic enterprise IT tasks, frontier models score below 50%.

That number reframes a lot of agent marketing. It does not mean agents are useless — it means the gap between "demos well" and "reliably completes real enterprise IT tasks" is still wide. This piece explains what ITBench-AA measures, where agents fall down, why agentic IT is genuinely hard, and — most useful for buyers and builders — how to read agent benchmarks so you are not fooled by a flattering chart.

What is ITBench-AA?

ITBench-AA is, per its announcement, the first benchmark for agentic enterprise IT tasks, built by IBM Research together with Artificial Analysis. The framing is the important part: rather than testing isolated reasoning or coding puzzles, it targets the kind of multi-step IT work an agent would have to carry out in an enterprise setting. Its launch result — frontier models scoring below 50% — is the data point everyone is quoting, and it is worth taking literally: even the strongest models available fail more than half of these tasks. (Hugging Face, 2026-05-27)

A benchmark aimed at agentic work is different from a traditional model benchmark. It is not asking "does the model know the answer?" It is asking "can the agent string together the steps, use tools correctly, recover from errors, and actually finish?" That shift in what is being measured is exactly why the scores look so different from the near-saturated numbers we are used to seeing on knowledge or reasoning tests.

The results: frontier models under 50%

A sub-50% score on a task suite designed to look like real work deserves to be read plainly: taken as a completion rate, it means the best agents still fail the majority of these enterprise IT tasks.

Where they fail

The benchmark's value is in what the failures expose. Agentic tasks are long-horizon — they require planning over many steps, using tools, observing results, and adapting. A model that produces a perfect single answer can still fail an agentic task by losing the thread halfway through, calling a tool incorrectly, or failing to recover when a step goes wrong. The sub-50% result is the aggregate symptom; the underlying causes are about execution over time, not about whether the model "knows" IT.

How it compares to coding and reasoning benchmarks

Many coding and reasoning benchmarks now show frontier models scoring very high — which is part of why agent hype is so loud. ITBench-AA is a useful corrective: when you change the question from "can it answer?" to "can it complete a realistic multi-step job?", the scores drop sharply. The discrepancy is the lesson. High marks on a reasoning leaderboard do not transfer automatically to reliable agentic execution, and a benchmark built around real tasks makes that gap visible.

Why agentic IT is hard

If these are frontier models, why do they struggle? Three structural reasons.

Long-horizon planning

Enterprise IT work is rarely one step. It is investigate, decide, act, verify, and repeat — often across many actions where each depends on the last. Errors compound: a small mistake early can derail everything downstream. Models optimized to produce a good answer are not automatically good at maintaining a coherent plan across a long sequence of dependent actions.
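To make the compounding concrete, here is a minimal sketch in Python. It assumes, purely for illustration, that every step succeeds independently with the same probability. Real tasks are messier than that, and none of these numbers come from ITBench-AA; the function name and the 95% figure are hypothetical.

```python
def task_success_probability(per_step_success: float, steps: int) -> float:
    """Chance that every one of `steps` dependent steps succeeds.

    Illustrative assumption: steps succeed independently with equal probability.
    """
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_success ** steps


if __name__ == "__main__":
    # Hypothetical 95% per-step reliability across tasks of growing length.
    for steps in (1, 5, 10, 20, 50):
        p = task_success_probability(0.95, steps)
        print(f"{steps:>3} steps at 95% per step -> {p:.1%} end-to-end")
```

Under these assumptions, a step that works 95% of the time gives roughly a 36% chance of finishing a 20-step task, and the end-to-end rate drops below 50% at 14 steps.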

Tool reliability and error recovery

An agent is only as good as its ability to use tools correctly and to notice when something has gone wrong. Real tasks are full of partial failures: a command errors, a result is unexpected, or the system is not in the state the agent assumed. Strong agentic performance requires recognizing the failure and recovering, not blindly proceeding. This is one of the hardest things to get right, and plausibly a major contributor to the sub-50% result.
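One way to picture "recognize the failure and recover" is a step wrapper that retries on errors and also checks that the intended effect actually happened. The sketch below is an assumption about how a harness might do this, not a description of ITBench-AA or of any particular agent framework; `StepOutcome`, `run_step`, and the simulated `restart_service` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class StepOutcome:
    ok: bool
    attempts: int
    detail: str


def run_step(
    action: Callable[[], object],
    verify: Callable[[], bool],
    max_attempts: int = 3,
) -> StepOutcome:
    """Run `action`, then confirm its effect with `verify`; retry on failure."""
    detail = "not attempted"
    for attempt in range(1, max_attempts + 1):
        try:
            action()
        except Exception as exc:
            # The tool call itself failed; record why and try again.
            detail = f"action raised {type(exc).__name__}: {exc}"
            continue
        # Returning without an error is not proof of success: check the state.
        if verify():
            return StepOutcome(True, attempt, "verified")
        detail = "action returned but verification failed"
    # Out of attempts: report failure so a caller can escalate to a human.
    return StepOutcome(False, max_attempts, detail)


if __name__ == "__main__":
    state = {"calls": 0, "service_up": False}

    def restart_service() -> None:
        # Simulated tool: fails once, then succeeds.
        state["calls"] += 1
        if state["calls"] == 1:
            raise TimeoutError("simulated partial failure")
        state["service_up"] = True

    print(run_step(restart_service, verify=lambda: state["service_up"]))
```

The design choice worth noting is the separate `verify` callback: it keeps "the command ran" distinct from "the system is now in the state I wanted," which is the distinction that blind proceeding ignores.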

Grounding and false-belief failures

There is also a reliability problem underneath. Research this week found that LLMs can continue to believe false statements even after explicit warnings that they are false. (Ars Technica, 2026-05-28) For an agent operating over many steps, that is acute: if it locks onto a wrong assumption about the system state, a correction may not dislodge it, and every subsequent action builds on the false premise. Grounding — staying anchored to what is actually true — is exactly what long, autonomous task chains stress the most.
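A harness can reduce the damage from a stale assumption by re-observing the system before acting, and letting the observation override whatever the agent believed. The following is a minimal sketch of that idea under stated assumptions: `reconcile`, the belief dictionary, and the example keys are hypothetical, and this is one possible mitigation rather than a technique drawn from the cited research.

```python
from typing import Callable, Dict, List, Tuple


def reconcile(
    beliefs: Dict[str, str],
    observe: Callable[[str], str],
) -> Tuple[Dict[str, str], List[str]]:
    """Re-observe every believed fact; the observation always wins.

    Returns the corrected beliefs and the keys whose old value was stale.
    """
    corrected: Dict[str, str] = {}
    stale: List[str] = []
    for key, believed in beliefs.items():
        actual = observe(key)
        if actual != believed:
            stale.append(key)
        corrected[key] = actual
    return corrected, stale


if __name__ == "__main__":
    # Hypothetical system state versus what the agent assumed earlier.
    actual_state = {"db": "stopped", "disk": "ok"}
    agent_beliefs = {"db": "running", "disk": "ok"}

    corrected, stale = reconcile(agent_beliefs, actual_state.__getitem__)
    print("stale beliefs:", stale)
    print("corrected:", corrected)
```

Surfacing the list of stale keys, instead of silently overwriting them, gives an overseer a signal that the agent had drifted from reality.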

How to read agent benchmarks

The practical takeaway is not "agents are bad." It is "measure the thing you actually care about." A few principles for reading agent evaluations without being fooled.

Task realism vs. toy tasks

Ask what the benchmark actually tests. A score on toy or single-step tasks tells you little about whether an agent can do your multi-step, messy, real work. ITBench-AA's premise — realistic enterprise IT tasks — is what makes its sub-50% result meaningful; a high score on something easier would not be comparable. When you see an impressive number, the first question is always: how close is the task to the work I need done?
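The same agent can look very different depending on what is counted. The sketch below compares two hypothetical metrics on made-up run data: the share of individual steps that passed, and the share of tasks completed end to end. The task names and results are invented for illustration, and nothing here describes how ITBench-AA computes its scores.

```python
from typing import Dict, List


def step_accuracy(runs: Dict[str, List[bool]]) -> float:
    """Fraction of individual steps that passed, across all tasks."""
    total = sum(len(steps) for steps in runs.values())
    passed = sum(sum(steps) for steps in runs.values())
    return passed / total if total else 0.0


def task_completion_rate(runs: Dict[str, List[bool]]) -> float:
    """Fraction of tasks in which every step passed."""
    if not runs:
        return 0.0
    done = sum(1 for steps in runs.values() if steps and all(steps))
    return done / len(runs)


if __name__ == "__main__":
    # Invented results: True means the step passed.
    runs = {
        "reset-password": [True, True, True, True, False],
        "rotate-cert": [True, True, True, True, True],
        "patch-host": [True, True, False, True, True],
        "restore-backup": [True, True, True, True, False],
    }
    print(f"step accuracy:        {step_accuracy(runs):.0%}")
    print(f"task completion rate: {task_completion_rate(runs):.0%}")
```

On this invented data, 17 of 20 steps pass (85%) while only 1 of 4 tasks completes (25%). When a chart shows an impressive number, it is worth checking which of these two kinds of figure it is.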

What "agentic" should mean

"Agentic" should mean the model plans, uses tools, observes outcomes, and recovers from errors over a horizon — not that it produced one good response. Borrowing precise vocabulary helps here: terms like harness and scaffold distinguish the model from the surrounding system that lets it act, and getting those terms right keeps capability claims honest. (See Hugging Face's agent glossary for the terminology.) When a vendor says "agentic," check whether they mean genuine multi-step execution or a single clever answer dressed up.

FAQ

What's the best AI agent benchmark?

There is no single "best" — the right benchmark is the one whose tasks resemble the work you need done. ITBench-AA is notable as the first benchmark built specifically for agentic enterprise IT tasks, which makes it a strong reference point if that is your use case. For other domains, look for evaluations that test realistic, multi-step execution rather than isolated answers, and treat any leaderboard as evidence about a specific kind of task, not general capability.

Are AI agents ready for enterprise IT?

Based on ITBench-AA's launch results, not for unsupervised end-to-end work: frontier models score below 50% on its enterprise IT tasks, meaning even the strongest agents fail most of them. That argues for scoping agents to well-defined sub-tasks with human oversight rather than handing them open-ended IT operations, and for measuring readiness against realistic tasks before you deploy.

Why do agents score low on real tasks?

Because real tasks are long-horizon: they require planning across many dependent steps, using tools correctly, and recovering from errors — and mistakes compound. Reliability problems make it worse; research shows LLMs can hold onto false beliefs even after being told they are false, so an agent that adopts a wrong assumption early can build a whole chain of actions on top of it. Producing a good single answer and reliably completing a multi-step job are different skills.

Takeaways for Clawvard readers

A sub-50% score from frontier models on realistic enterprise IT tasks is the gap between "agent demos" and "agent reliability." Plan around it.
Judge agent benchmarks by task realism and by whether "agentic" means genuine multi-step execution — not a single strong answer.
The failure modes are about execution over time — planning, tool use, error recovery, and grounding — which is where to focus your own evaluation and oversight.

These execution-over-time weaknesses are also what make agents exploitable: an agent that mishandles untrusted input or loses track of state is the same agent an attacker can hijack. See our companion piece, How to Secure AI Coding Agents, for the defensive side.

Evaluating agents for real work? Clawvard's model-evaluation tooling is built to test agents against realistic, multi-step tasks — try Clawvard and follow our updates for more agent benchmarking analysis.