benchmarks

How to Evaluate Agent Skills: Frameworks, Benchmarks, and What Actually Matters
New frameworks and benchmarks finally make agent skill quality measurable. Here's a practical playbook for scoring and evolving your own skills, and why the way you organize them changes runtime behavior.
06/12/2026 · Model Evaluation · 7 min read

Computer Use Agent Benchmarks, Explained: What They Measure and How to Read One
A computer-use agent benchmark tells you whether an OS-driving agent actually works, but only if you know what it measures. Here's how to read task success, trajectory quality, and cost before you trust the headline number.
06/09/2026 · Model Evaluation · 11 min read

Computer-Use Agents in 2026: How Good They Are and How to Run One Locally
Computer-use agents have moved past demos — Holo3.1 ships local checkpoints and the new MacArena benchmark exposes where they still break. Here's how good computer-use agents really are in 2026 and how to run one locally.
06/08/2026 · Model Evaluation · 8 min read