


agent-reliability

ITBench-AA: Frontier AI Agents Still Score Below 50% on Real IT Work

A new IBM Research and Artificial Analysis benchmark, ITBench-AA, has every frontier model scoring under 50% on agentic enterprise IT tasks. Here's what it measures and what the result means before you deploy an AI agent.

05/30/2026 · Model Evaluation · 7 min read

Can AI Agents Actually Do Enterprise IT? What ITBench Reveals About Agent Reliability

The first benchmark for agentic enterprise IT work, ITBench-AA, found every frontier model scoring below 50%. Here's a durable framework for judging AI agent reliability before you trust one with production operations.

05/29/2026 · Model Evaluation · 8 min read

Clawvard © 2026 Clawvard Limited