Can AI Agents Actually Do Enterprise Work? What the Benchmarks Show

May 31, 2026 · 8 min read

The demos are dazzling, but the agentic AI benchmark that matters most this spring delivers a sobering headline: on ITBench-AA, the first benchmark built specifically for agentic enterprise IT tasks, frontier models score below 50%. That number reframes the deployment question for every engineering leader watching agents. The question is no longer whether agents can replace real work, but which narrow tasks they can handle, and with what guardrails.

This piece walks through what the benchmark measures, why agents stall on real enterprise work, and how to translate a sub-50% score into a deployment decision you can defend.

How good are AI agents at real enterprise work, really?

If you've only seen agents in curated demos, the gap between demo and duty is the whole story. A benchmark exists precisely to measure what a sales reel can't.

The headline number: frontier models below 50%

ITBench-AA, a new benchmark for agentic enterprise IT tasks, reports that even the best frontier models score below 50%. Read that as a reality check, not a verdict. It doesn't say agents are useless; it says that on realistic, multi-step enterprise IT work, today's top models fail more than half the time. For a ranked view of how individual models stack up across capabilities, our 2026 AI Agent Capability Leaderboard puts that headline number in context.

What ITBench-AA measures

A benchmark is only as useful as the tasks it scores, so it's worth understanding what "below 50%" is being measured against.

Why enterprise IT tasks are a harder bar than chat benchmarks

Chat benchmarks reward a good single answer. Enterprise IT tasks demand something else: multi-step execution, correct tool use, and follow-through across a sequence where an early mistake compounds. That's a fundamentally harder bar — and it's the bar that matters if you're putting an agent anywhere near production operations.

Who built it and why the source matters

ITBench-AA comes from IBM and Artificial Analysis — an enterprise software vendor and an independent model-analysis group. The provenance matters: the benchmark is grounded in the kind of real IT work enterprises actually run, rather than synthetic puzzles, which is what makes the sub-50% result worth taking seriously.

Why agents stall on real work

A low score on multi-step work isn't a mystery once you look at where reliability decays.

Multi-step tasks, tool reliability, and where accuracy decays

Agents can usually reason about what to do. Reliably doing it — across many steps, with real tools, without an early error cascading — is the hard part. We've called this the gap between thinking and doing in Why AI Agents Can Think But Can't Do, and ITBench-AA is the measured version of that gap: each additional step is another chance to fail, and enterprise tasks have many steps.
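
To make the compounding concrete, here is a minimal sketch of how per-step reliability turns into end-to-end reliability. It assumes every step succeeds or fails independently with the same probability, which is a simplification, and the 95% figure is a hypothetical chosen for illustration, not a number reported by ITBench-AA. The function name `end_to_end_success` is ours.

```python
def end_to_end_success(per_step_success: float, steps: int) -> float:
    """Chance that all steps succeed, assuming each step fails independently."""
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    # Independent steps multiply: p * p * ... * p, once per step.
    return per_step_success ** steps


# Hypothetical 95% per-step reliability; not a figure from ITBench-AA.
for steps in (1, 5, 10, 20):
    print(f"{steps:>2} steps -> {end_to_end_success(0.95, steps):.0%} end-to-end")
```

Under these assumed numbers, a step that works 95% of the time yields roughly 60% end-to-end success over 10 steps and about 36% over 20. Real agents will not follow this model exactly, since errors can be correlated or recoverable, but it shows why a long task can fall below 50% even when each individual action looks reliable.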

What a sub-50% score does and doesn't mean for deployment

It does not mean agents have no enterprise value. It does mean unsupervised, end-to-end autonomy on complex IT tasks is not a safe default today. The practical reading is to deploy agents where partial reliability is acceptable — drafting, triage, assisted execution with review — rather than where a single uncaught failure is costly.

Rethinking how work is organized around agents

If agents are reliable for some tasks and not others, the question stops being purely technical and becomes organizational.

The org-design implications of partial-reliability agents

MIT Technology Review argues that the agentic era requires rethinking organizational design. The benchmark and the org-design argument fit together: if agents clear the bar on some tasks and miss on others, you redesign workflows around that reality — assigning agents the work they can do reliably and keeping humans on the steps where the failure rate is still too high.

Choosing tasks where today's agents clear the bar

Match the task to the measured capability. Favor narrow, bounded, reversible tasks where a wrong answer is cheap to catch and fix, and keep humans in the loop on the high-stakes, many-step work where sub-50% reliability is unacceptable.

How to evaluate agents for your own enterprise

A public benchmark is a starting point, not a substitute for measuring agents on your work. Our complete guide to AI agent evaluation goes deeper, but here's the short version.

Map the benchmark categories to your real workflows

Treat ITBench-AA as a template. Identify the categories of work you'd actually hand an agent, then build a small, representative test set from your own tasks rather than trusting a generic score to predict your results.
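
One way to structure such a test set is sketched below. Everything here is an assumption for illustration: the `EvalCase` type, the category names, the string-matching checks, and the `stub_agent` stand-in are all hypothetical, and none of it reflects how ITBench-AA itself is scored. In practice the agent callable would wrap your real agent, and the checks would verify actual outcomes rather than substrings.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class EvalCase:
    category: str                  # a workflow category you would hand to an agent
    prompt: str                    # the task as the agent receives it
    check: Callable[[str], bool]   # returns True if the agent's output is acceptable


def success_rate_by_category(
    agent: Callable[[str], str], cases: Iterable[EvalCase]
) -> dict[str, float]:
    """Run every case through the agent and report the pass rate per category."""
    totals: dict[str, int] = {}
    passes: dict[str, int] = {}
    for case in cases:
        totals[case.category] = totals.get(case.category, 0) + 1
        try:
            ok = case.check(agent(case.prompt))
        except Exception:
            ok = False  # a crash counts as a failed task, not a skipped one
        passes[case.category] = passes.get(case.category, 0) + int(ok)
    return {category: passes[category] / totals[category] for category in totals}


def stub_agent(prompt: str) -> str:
    """Hypothetical stand-in for a real agent call."""
    return "restart nginx" if "nginx" in prompt else "unknown"


cases = [
    EvalCase("incident-triage", "nginx is returning 502s", lambda out: "nginx" in out),
    EvalCase("incident-triage", "disk is full on db-1", lambda out: "disk" in out),
    EvalCase("access-request", "grant read access to a repo", lambda out: "grant" in out),
]
rates = success_rate_by_category(stub_agent, cases)
print(rates)
```

Reporting a rate per category rather than one overall number is the point of the design: a single blended score can hide a category where the agent almost never succeeds.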

Set a pass bar and a human-in-the-loop threshold

Decide, per task type, what success rate is good enough to run with. Where the measured success rate falls below that bar, gate the action behind human review. The goal is a deliberate line between "agent runs this" and "human signs off," informed by measured reliability rather than optimism.
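
The rule above can be written down as a small routing policy. This is a sketch under stated assumptions: the task types, pass bars, and measured rates are hypothetical placeholders, and the names `Route`, `route_task`, and `build_policy` are ours. The one deliberate design choice is that a task type with no measurement defaults to human review.

```python
from enum import Enum
from typing import Mapping


class Route(Enum):
    AGENT_RUNS = "agent runs this"
    HUMAN_SIGNS_OFF = "human signs off"


def route_task(measured_success: float, pass_bar: float) -> Route:
    """Let the agent run only when measured reliability meets the pass bar."""
    if measured_success >= pass_bar:
        return Route.AGENT_RUNS
    return Route.HUMAN_SIGNS_OFF


def build_policy(
    measured: Mapping[str, float], pass_bars: Mapping[str, float]
) -> dict[str, Route]:
    """Decide a route for every task type that has a pass bar defined."""
    policy: dict[str, Route] = {}
    for task_type, bar in pass_bars.items():
        rate = measured.get(task_type)
        # No measurement means no evidence, so a human signs off by default.
        policy[task_type] = (
            Route.HUMAN_SIGNS_OFF if rate is None else route_task(rate, bar)
        )
    return policy


# Hypothetical bars and measured rates, for illustration only.
PASS_BARS = {"ticket-triage": 0.80, "config-change": 0.99, "db-failover": 0.999}
MEASURED = {"ticket-triage": 0.86, "config-change": 0.91}

policy = build_policy(MEASURED, PASS_BARS)
for task_type, route in policy.items():
    print(f"{task_type}: {route.value}")
```

Setting a higher bar for costlier actions keeps the decision tied to the consequence of a failure, not only to how often the agent succeeds.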

Frequently asked questions

What is an agentic AI benchmark?

An agentic AI benchmark measures how well AI agents complete multi-step, tool-using tasks end to end — not just whether a model gives a good single answer. It's designed to test execution and follow-through, which is what matters for real-world deployment.

What is ITBench-AA?

ITBench-AA is a benchmark for agentic enterprise IT tasks, built by IBM and Artificial Analysis. It reports that even frontier models score below 50%, making it a notable reality check on how capable agents are at real operational work.

Why do frontier models score below 50% on enterprise tasks?

Because enterprise IT tasks are multi-step and tool-dependent, and accuracy decays across a long sequence — an early error compounds. Models that look strong on single-answer chat benchmarks struggle when success requires many correct actions in a row.

Can AI agents replace IT staff today?

Not for complex, end-to-end IT work — a sub-50% score on realistic tasks means unsupervised autonomy isn't a safe default. Agents add value on narrower, bounded, reviewable tasks, with humans kept in the loop on high-stakes steps.

How should I benchmark agents for my own use case?

Build a small, representative test set from your real workflows, set a per-task pass bar, and define a human-in-the-loop threshold for actions where the measured success rate falls short. Treat public benchmarks as a template, not a substitute for measuring on your own work.

Takeaways for builders

The most useful number on agentic AI this spring is below 50% — the score frontier models earn on ITBench-AA's enterprise IT tasks. It reframes deployment from "can agents do the work" to "which tasks, with what guardrails." Match agents to bounded, reversible tasks, keep humans on the high-stakes steps, redesign workflows around partial reliability, and measure on your own work before you trust a generic score. That's how you turn a sobering benchmark into a deployment plan that holds up.
