GLM-5.2 Benchmarks: Is This the Best Open-Weights Agent Model of 2026?

On June 16, 2026, Z.ai released GLM-5.2 under an MIT license, and the GLM-5.2 benchmarks landed hard: an open-weights model scoring within roughly a point of frontier-lab models on the kind of multi-hour engineering work that actually defines agent capability. Simon Willison called it "probably the most powerful text-only open weights LLM" available. That matters because, for the first time, the question "should we build our agent on a closed frontier model or an open one?" has an answer that isn't obviously "closed." This post walks through what the benchmarks really show, how GLM-5.2 compares to Claude and GPT, why it's positioned as a long-horizon agent model, and the brutal hardware reality that comes with a 753-billion-parameter model.
What is GLM-5.2 and what actually changed?
GLM-5.2 is a 753B-parameter Mixture-of-Experts model with roughly 40B parameters active per token, released by Z.ai with open weights under the MIT license. The headline change from the previous generation is context: the window jumps to 1 million tokens, up from 200,000 in GLM-5.1, per Willison's writeup. The full BF16 weights weigh in at 1.51 TB.
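As a quick sanity check, the 1.51 TB figure follows directly from the parameter count, since BF16 stores each weight in two bytes. The sketch below only redoes that arithmetic from the numbers quoted above; the function names are illustrative, and it assumes decimal terabytes (1 TB = 10^12 bytes).

```python
# Back-of-the-envelope check of the size figures quoted above.
TOTAL_PARAMS = 753e9    # total parameters (from the article)
ACTIVE_PARAMS = 40e9    # roughly active per token (from the article)
BYTES_PER_BF16 = 2      # BF16 uses 16 bits, i.e. 2 bytes, per weight


def weights_size_tb(params: float, bytes_per_param: float) -> float:
    """Raw weight storage in decimal terabytes (1 TB = 1e12 bytes)."""
    return params * bytes_per_param / 1e12


def active_fraction(active: float, total: float) -> float:
    """Share of the parameters used for any single token in an MoE model."""
    return active / total


size_tb = weights_size_tb(TOTAL_PARAMS, BYTES_PER_BF16)
frac = active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS)

print(f"BF16 weights: {size_tb:.2f} TB")   # BF16 weights: 1.51 TB
print(f"Active per token: {frac:.1%}")     # Active per token: 5.3%
```

The second number is the reason a Mixture-of-Experts model of this size is usable at all: every token touches only about one-twentieth of the weights, but all of them still have to be held in memory.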
The architecture is built around long context efficiency. Z.ai's lab blog describes a sparse-attention scheme ("IndexShare") that reuses the same indexer across every four sparse-attention layers, which it says reduces per-token FLOPs by 2.9× at 1M context. The point of all of this is sustained, tool-using work — not chat.
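To make the sharing pattern concrete, here is a toy sketch of one reading of that description: within each group of four sparse-attention layers, the first layer runs the indexer and the next three reuse its result. This is an assumption about how the grouping works, not Z.ai's implementation; the layer count is hypothetical, and the sketch does not attempt to reproduce the 2.9× FLOPs figure.

```python
# Toy illustration of indexer sharing across groups of sparse-attention layers.
# Assumption: the first layer of each group computes the indexer result and
# the remaining layers in that group reuse it.
SHARE_GROUP = 4  # group size, per the description above


def indexer_schedule(num_sparse_layers: int, group: int = SHARE_GROUP) -> list[str]:
    """Label each sparse-attention layer as computing or reusing the indexer."""
    return ["compute" if i % group == 0 else "reuse" for i in range(num_sparse_layers)]


def indexer_calls(num_sparse_layers: int, group: int = SHARE_GROUP) -> int:
    """Number of indexer evaluations needed with sharing (ceiling division)."""
    return -(-num_sparse_layers // group)


layers = 8  # hypothetical layer count, chosen only for the example
print(indexer_schedule(layers))
print(f"{indexer_calls(layers)} indexer calls instead of {layers}")
```

Under this reading, sharing cuts the indexer work by roughly the group size; the overall per-token saving depends on how much of the total compute the indexer accounts for, which the sketch does not model.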
What do the GLM-5.2 benchmarks actually show?
Two vantage points are worth separating: independent aggregate index scores and the lab's own task-level coding/agent benchmarks.
On the aggregate side, Willison reports that GLM-5.2 scores 51 on the Artificial Analysis Intelligence Index v4.1, ahead of the other leading open-weights models: MiniMax-M3 (44), DeepSeek V4 Pro (44), and Kimi K2.6 (43). He also notes it ranks 2nd on the Code Arena WebDev leaderboard, behind only Claude Fable 5.
On the task side, Z.ai's lab blog publishes a battery of coding and agent benchmark scores, including:
- Terminal Bench 2.1: 81.0
- SWE-bench Pro: 62.1
- HLE: 40.5 (54.7 with tools)
- GPQA-Diamond: 91.2
- AIME 2026: 99.2
- MCP-Atlas (public set): 76.8
As always with vendor-published benchmarks, treat the lab's own numbers as a starting point and weight the independent index (Artificial Analysis) and arena results more heavily. The signal across all of them points the same direction: this is a serious coding-and-agent model, not just a strong chat model.
Is GLM-5.2 better than Claude?
Not quite, but the gap to the strongest closed model is now about one percent, which is the real story. On Z.ai's FrontierSWE long-horizon benchmark, the lab blog reports GLM-5.2 trails Opus 4.8 by about 1%, edges out GPT-5.5 by about 1%, and exceeds Opus 4.7 by roughly 11%. On the WebDev arena, Willison places it second behind Claude Fable 5.
Pricing tells the other half of the comparison. Via OpenRouter, Willison lists GLM-5.2 at $1.40 per million input tokens and $4.40 per million output, versus GPT-5.5 at $5/$30 and Claude Opus (4.5–4.8) at $5/$25. So even hosted through an API rather than self-run, GLM-5.2 is dramatically cheaper per token than the frontier models it's chasing on benchmarks.
One caveat that affects real cost: Willison notes GLM-5.2 uses more output tokens per task than other leading open-weights models — about 43k output tokens per Intelligence Index task. A lower per-token price doesn't fully translate to a lower per-task price if the model is more verbose, so model your costs on tasks, not tokens.
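A rough way to reason about this is to price the output side of a single task and ask how terse a rival model would have to be to match it. The sketch below uses only the output prices and the ~43k-token figure quoted above; it ignores input tokens (the article gives no per-task input count), and it assumes, purely for illustration, that the Intelligence Index figure is representative of your workload.

```python
# Output prices in USD per million tokens, as quoted above via OpenRouter.
OUTPUT_PRICE = {"GLM-5.2": 4.40, "GPT-5.5": 30.00, "Claude Opus": 25.00}

# Approximate output tokens GLM-5.2 spends per Intelligence Index task.
GLM_OUTPUT_TOKENS = 43_000


def output_cost(model: str, output_tokens: float) -> float:
    """Output-side cost in USD for one task (input tokens are ignored)."""
    return output_tokens * OUTPUT_PRICE[model] / 1_000_000


def break_even_tokens(rival: str) -> float:
    """Output tokens at which the rival's output cost equals GLM-5.2's."""
    glm_cost = output_cost("GLM-5.2", GLM_OUTPUT_TOKENS)
    return glm_cost * 1_000_000 / OUTPUT_PRICE[rival]


glm_cost = output_cost("GLM-5.2", GLM_OUTPUT_TOKENS)
print(f"GLM-5.2 output cost per task: ${glm_cost:.4f}")
for rival in ("GPT-5.5", "Claude Opus"):
    print(f"{rival} matches that cost at ~{break_even_tokens(rival):,.0f} output tokens")
```

On these numbers a GLM-5.2 task costs about $0.19 in output tokens, and GPT-5.5 or Claude Opus would need to finish the same task in roughly 6,300 or 7,600 output tokens respectively to cost the same. Whether they do is an empirical question about your own tasks, which is why per-task measurement matters more than the price sheet.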
What makes GLM-5.2 a "long-horizon" agent model?
Z.ai frames "long-horizon tasks" as complex engineering projects that span hours to tens of hours — building compilers, optimizing kernels, developing production-grade services, large-scale code construction, and applied ML research, often executed as multi-step coding-agent trajectories. That framing is what distinguishes GLM-5.2 from a model tuned for single-turn answers: it's optimized for staying coherent across a long chain of tool calls and environment feedback.
The lab also reports compatibility with existing coding-agent harnesses including Claude Code and OpenCode, plus configurable "effort levels" to trade capability against latency and cost. For teams building agents, that harness compatibility lowers the switching cost considerably.
Can you run GLM-5.2 locally?
This is where the excitement meets reality. A hands-on review that reached the Hacker News front page lays out the memory math for the 1.51 TB model:
- 4-bit (Q4_K_M): ~476 GB — multi-GPU datacenter territory only.
- 2-bit dynamic quant: ~241 GB — fits a Mac Studio M3/M4 Ultra with 256 GB+ unified memory.
- 1-bit dynamic quant: ~176 GB — still needs a 256 GB machine.
Even when it fits, throughput is the catch: the review measured 3–9 tokens per second at a 2-bit quant on a Mac Studio M3 Ultra (a ~$9,500 machine). Its blunt conclusion: "Unless you own a 256GB+ Mac Studio, and can live with single-digit tokens per second at a 2-bit quant, this is a model you'll most sensibly rent or hit via API, not host at home." A 128 GB box or a 24 GB GPU is simply out — "the weights don't fit at any usable quant."
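The memory math above can be reproduced in a few lines. The sketch below checks which quantized sizes fit which machines and estimates how long one agent task would take at the measured throughput. It counts weights only (no KV cache, OS, or runtime overhead, so it is optimistic), and it borrows the ~43k output tokens per task that Willison reports as a stand-in task length, which is an assumption about your workload.

```python
TOTAL_PARAMS = 753e9  # from the article

# Quantized weight sizes in GB, as quoted in the review.
QUANT_SIZES_GB = {"4-bit (Q4_K_M)": 476, "2-bit dynamic": 241, "1-bit dynamic": 176}
# Memory capacities in GB for the machines mentioned above.
MACHINES_GB = {"24 GB GPU": 24, "128 GB box": 128, "256 GB Mac Studio": 256}


def effective_bits_per_weight(size_gb: float) -> float:
    """Average bits stored per parameter for a given file size."""
    return size_gb * 1e9 * 8 / TOTAL_PARAMS


def fits(size_gb: float, memory_gb: float) -> bool:
    """Weights-only check; real usage also needs room for KV cache and the OS."""
    return size_gb < memory_gb


def minutes_to_generate(tokens: int, tokens_per_second: float) -> float:
    """Wall-clock minutes to emit `tokens` at a steady decode rate."""
    return tokens / tokens_per_second / 60


for quant, size in QUANT_SIZES_GB.items():
    homes = [name for name, mem in MACHINES_GB.items() if fits(size, mem)]
    bits = effective_bits_per_weight(size)
    print(f"{quant}: {size} GB, ~{bits:.2f} bits/weight, fits: {homes or 'none listed'}")

for rate in (3, 9):
    print(f"43,000 tokens at {rate} tok/s: ~{minutes_to_generate(43_000, rate):.0f} min")
```

Two things stand out. The nominal "2-bit" and "1-bit" labels average out to about 2.6 and 1.9 bits per weight, presumably because dynamic quants keep some tensors at higher precision. And at 3 to 9 tokens per second, a single 43k-token task would take somewhere between about 80 minutes and 4 hours of pure generation time, before counting prompt processing.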
So "open weights" here means "auditable and rentable," not "runs on your laptop." For most teams the practical deployment path is an API or a rented GPU cluster, with the open license mattering for portability, fine-tuning, and avoiding vendor lock-in rather than for local hosting.
How does GLM-5.2 stack up against other open-weights models?
On the Artificial Analysis Intelligence Index v4.1, Willison's numbers put GLM-5.2 (51) clearly ahead of the open-weights pack — MiniMax-M3, DeepSeek V4 Pro, and Kimi K2.6 all cluster in the 43–44 range. Combined with the FrontierSWE result sitting within ~1% of frontier closed models, the takeaway is that the open-weights frontier has compressed the gap to the closed frontier from a chasm to a margin. For more on how those closed frontier models compare to each other, see our Claude vs GPT comparison for 2026, and for where agent-specific benchmarks are heading, our AI agent leaderboard.
Takeaways for Clawvard readers
- The open-weights gap is now a margin, not a chasm. GLM-5.2 lands within ~1% of frontier models on Z.ai's long-horizon coding benchmark and tops the open-weights field on the independent Artificial Analysis index.
- Cheaper per token, but model per task. At $1.40/$4.40 via OpenRouter it's far cheaper than GPT-5.5 or Claude Opus, but its higher output-token usage (~43k per index task) means you should benchmark cost on your tasks.
- "Open" ≠ "local." At 1.51 TB, realistic deployment is API or rented GPU; true local hosting needs a 256 GB+ machine and you'll live with single-digit tokens/sec.
- Harness compatibility lowers the switch cost. Reported support for Claude Code and OpenCode means you can trial GLM-5.2 inside an agent stack you already run.
If you're evaluating which model to put behind a production agent, the most useful next step is to test long-horizon behavior on your own tooling rather than trusting any single leaderboard. Want the framework for that? Try Clawvard to benchmark agent models on your real workflows, and follow our model-evaluation updates as the open-weights frontier keeps moving.