benchmarks

How to Evaluate Coding Agents: Benchmarks, Trajectories, and Where Scores Lie

A leaderboard number is the least reliable way to pick a coding agent. Here's a durable coding agent evaluation method that pairs benchmarks with trajectory review and real task economics.

07/09/2026 · Model Evaluation · 8 min read

How to Evaluate AI Agents in 2026: Beyond Benchmark Saturation

Static leaderboards are saturating, so durable agent evaluation is shifting to stress-testing in simulated environments. A practical 2026 framework for measuring whether your AI agent is actually reliable.

06/27/2026 · Model Evaluation · 8 min read

How to Evaluate an AI Agent: Tool-Use Capability and Data-Leakage Risk

A practical framework for evaluating AI agents on two axes that both decide production-readiness: can it do the job on your own tooling, and can it be trusted not to leak data?

06/22/2026 · Model Evaluation · 7 min read

Can You Trust an AI Agent? Evaluating Reliability, Data Leakage, and Security

Agent trust isn't a vibe; it's measurable. A practical playbook for evaluating agent reliability, secret leakage, and security, grounded in this week's benchmarks and a real one-click exploit.

06/20/2026 · Model Evaluation · 9 min read

Research Agent Data Leakage: Inside the MosaicLeaks Benchmark

Research agent data leakage is a measurable failure mode, not a hypothetical. ServiceNow's MosaicLeaks benchmark shows how deep research agents leak private context through their search queries — and why you can't prompt the problem away.

06/20/2026 · Model Evaluation · 10 min read

GLM-5.2 Benchmarks: Is This the Best Open-Weights Agent Model of 2026?

GLM-5.2's benchmarks put a 753B open-weights model within a point of frontier labs on long-horizon agent work — but running it locally is another story. Here's what the numbers actually say.

06/19/2026 · Model Evaluation · 9 min read