Model Evaluation

Can AI Agents Actually Do Enterprise IT Work? What ITBench-AA's Sub-50% Scores Reveal
Every frontier model scored below 50% on ITBench-AA, a new IBM × Artificial Analysis benchmark for agentic enterprise IT work. Here's what it measures, why scores are so low, and what it means for deploying agents.
05/31/2026 · Model Evaluation · 8 min read

Can AI Agents Actually Do Enterprise Work? What the Benchmarks Show
On ITBench-AA, the first benchmark for agentic enterprise IT tasks, frontier models score below 50%. Here's what that number means before you deploy agents.
05/31/2026 · Model Evaluation · 8 min read

ITBench-AA: Frontier AI Agents Still Score Below 50% on Real IT Work
A new IBM Research and Artificial Analysis benchmark, ITBench-AA, has every frontier model scoring under 50% on agentic enterprise IT tasks. Here's what it measures and what the result means before you deploy an AI agent.
05/30/2026 · Model Evaluation · 7 min read

Can AI Agents Actually Do Enterprise IT Work? What ITBench-AA's Sub-50% Scores Reveal
A new IBM × Artificial Analysis benchmark puts frontier models below 50% on real agentic enterprise IT tasks. Here is what it measures, why the gap exists, and how to read agent benchmarks without being fooled.
05/29/2026 · Model Evaluation · 8 min read

Can AI Agents Actually Do Enterprise IT? What ITBench Reveals About Agent Reliability
The first benchmark for agentic enterprise IT work, ITBench-AA, found every frontier model scoring below 50%. Here's a durable framework for judging AI agent reliability before you trust one with production operations.
05/29/2026 · Model Evaluation · 8 min read

Claude Opus 4.8 vs 4.7: What Actually Changed for Practitioners
Anthropic calls Opus 4.8 "a modest but tangible improvement" over 4.7 — but the real story is a behavior change: the model is more honest about its own mistakes and uncertainty. Here's what that means for your upgrade decision.
05/29/2026 · Model Evaluation · 7 min read

Claude Opus 4.8: What's New, the Dynamic Workflow Tool, and How It Compares to 4.7
Anthropic shipped Claude Opus 4.8 with a new "dynamic workflow" orchestration tool and a notable honesty-and-effort behavior change. Here's a practitioner's breakdown of what actually matters for agent builders.
05/29/2026 · Model Evaluation · 7 min read

LLM API Pricing in 2026: Inside the Frontier Model Price War
DeepSeek made a 75% discount permanent, Opus 4.8 held prices flat, and GPT-5.5 surfaced — all in one week. A durable cost-vs-value framework for choosing a frontier LLM API in 2026 without overpaying.
05/28/2026 · Model Evaluation · 8 min read

Why Frontier AI Agents Still Fail Enterprise IT — Lessons From ITBench-AA
ITBench-AA is the first public benchmark to grade AI agents on real enterprise IT tasks — and every frontier model scores under 50%. Here's what the result actually says, the four failure modes it exposes, and how to rebuild your eval harness around it.
05/28/2026 · Model Evaluation · 9 min read

Gemini 3.5 Flash for Agents: Has the Latency Finally Crossed the Line?
Google pitches Gemini 3.5 Flash as "agent-optimized" and Ars Technica says it might finally be fast enough for gen AI to make sense. Here's how to decide whether it belongs in your agent loop today, and where it almost certainly doesn't.
05/27/2026 · Model Evaluation · 7 min read

Claude Opus vs GPT-5.4: An 8-Dimension Deep Comparison
Based on Clawvard's evaluation of 693 GPT-5.4 and 200+ Claude Opus Agent exams, we compare the two top models across all 8 capability dimensions.
04/13/2026 · Model Evaluation · 8 min read

2026 AI Agent Capability Leaderboard: 18 Models Ranked
The definitive ranking of AI models by Agent capability, based on 20,070 valid evaluations across 8 dimensions. Updated April 2026.
04/12/2026 · Model Evaluation · 6 min read