Model Evaluation

Can AI Agents Actually Do Enterprise IT Work? What ITBench-AA's Sub-50% Scores Reveal

Every frontier model scored below 50% on ITBench-AA, a new IBM × Artificial Analysis benchmark for agentic enterprise IT work. Here's what it measures, why scores are so low, and what it means for deploying agents.

05/31/2026 · Model Evaluation · 8 min read

Can AI Agents Actually Do Enterprise Work? What the Benchmarks Show

On ITBench-AA, the first benchmark for agentic enterprise IT tasks, frontier models score below 50%. Here's what that number means before you deploy agents.

05/31/2026 · Model Evaluation · 8 min read

ITBench-AA: Frontier AI Agents Still Score Below 50% on Real IT Work

A new IBM Research and Artificial Analysis benchmark, ITBench-AA, has every frontier model scoring under 50% on agentic enterprise IT tasks. Here's what it measures and what the result means before you deploy an AI agent.

05/30/2026 · Model Evaluation · 7 min read

Can AI Agents Actually Do Enterprise IT Work? What ITBench-AA's Sub-50% Scores Reveal

A new IBM × Artificial Analysis benchmark puts frontier models below 50% on real agentic enterprise IT tasks. Here is what it measures, why the gap exists, and how to read agent benchmarks without being fooled.

05/29/2026 · Model Evaluation · 8 min read

Can AI Agents Actually Do Enterprise IT? What ITBench Reveals About Agent Reliability

The first benchmark for agentic enterprise IT work, ITBench-AA, found every frontier model scoring below 50%. Here's a durable framework for judging AI agent reliability before you trust one with production operations.

05/29/2026 · Model Evaluation · 8 min read

Claude Opus 4.8 vs 4.7: What Actually Changed for Practitioners

Anthropic calls Opus 4.8 "a modest but tangible improvement" over 4.7 — but the real story is a behavior change: the model is more honest about its own mistakes and uncertainty. Here's what that means for your upgrade decision.

05/29/2026 · Model Evaluation · 7 min read

Claude Opus 4.8: What's New, the Dynamic Workflow Tool, and How It Compares to 4.7

Anthropic shipped Claude Opus 4.8 with a new "dynamic workflow" orchestration tool and a notable honesty-and-effort behavior change. Here's a practitioner's breakdown of what actually matters for agent builders.

05/29/2026 · Model Evaluation · 7 min read

LLM API Pricing in 2026: Inside the Frontier Model Price War

DeepSeek made a 75% discount permanent, Opus 4.8 held prices flat, and GPT-5.5 surfaced, all in one week. Here's a durable cost-vs-value framework for choosing a frontier LLM API in 2026 without overpaying.

05/28/2026 · Model Evaluation · 8 min read

Why Frontier AI Agents Still Fail Enterprise IT — Lessons From ITBench-AA

ITBench-AA is the first public benchmark to grade AI agents on real enterprise IT tasks — and every frontier model scores under 50%. Here's what the result actually says, the four failure modes it exposes, and how to rebuild your eval harness around it.

05/28/2026 · Model Evaluation · 9 min read

Gemini 3.5 Flash for Agents: Has the Latency Finally Crossed the Line?

Google pitches Gemini 3.5 Flash as "agent-optimized" and Ars Technica says it might finally be fast enough for gen AI to make sense. Here's how to decide whether it belongs in your agent loop today, and where it almost certainly doesn't.

05/27/2026 · Model Evaluation · 7 min read

Claude Opus vs GPT-5.4: An 8-Dimension Deep Comparison

Drawing on Clawvard's evaluation of 693 GPT-5.4 agent exams and 200+ Claude Opus agent exams, we compare the two top models across all 8 capability dimensions.

04/13/2026 · Model Evaluation · 8 min read

2026 AI Agent Capability Leaderboard: 18 Models Ranked

The definitive ranking of AI models by agent capability, based on 20,070 valid evaluations across 8 dimensions. Updated April 2026.

04/12/2026 · Model Evaluation · 6 min read