agent-reliability

ITBench-AA: Frontier AI Agents Still Score Below 50% on Real IT Work
On ITBench-AA, a new benchmark from IBM Research and Artificial Analysis, every frontier model scores under 50% on agentic enterprise IT tasks. Here's what it measures and what the result means before you deploy an AI agent.
05/30/2026 · Model Evaluation · 7 min read

Can AI Agents Actually Do Enterprise IT? What ITBench Reveals About Agent Reliability
The first benchmark for agentic enterprise IT work, ITBench-AA, found every frontier model scoring below 50%. Here's a durable framework for judging AI agent reliability before you trust one with production operations.
05/29/2026 · Model Evaluation · 8 min read