GLM-5.2: The Open-Weights Model Built for Long-Horizon Agents

June 20, 2026·8 min read

In mid-June 2026, Chinese AI lab Z.ai released GLM-5.2 and — a few days after the initial rollout to its coding-plan subscribers — published the weights under an MIT license. That combination matters: an openly licensed model you can download, inspect, and self-host, pitched explicitly at the hardest thing agents do, which is staying coherent across long, multi-step tasks. If you build or evaluate AI agents and you've been waiting for an open-weights option that can credibly run real agent loops instead of just answering one-shot prompts, GLM-5.2 is the release to look at.

This post breaks down what changed in GLM-5.2, how it benchmarks against other open-weights models, and whether it actually holds up for long-horizon agent workloads.

What's new in GLM-5.2?

GLM-5.2 is a large mixture-of-experts model. Independent reviewer Simon Willison, who tested it hands-on the week it shipped, reports a total of 753B parameters with roughly 40B active per token — the standard MoE pattern where only a fraction of the network fires for any given token. The headline architectural change is context length: GLM-5.2 supports a 1 million token context window, up from 200,000 in GLM-5.1. For agent work that has to carry a long task history, tool outputs, and a growing scratchpad, that jump is the difference between truncating state and keeping it.

Z.ai's own release frames the model around sustained engineering work — large-scale implementation, automated research, performance optimization, and complex debugging — rather than chat. The release also describes agent-oriented infrastructure shipped alongside the model, including a framework for agentic reinforcement learning and an "anti-hack" module that uses a rule-based filter plus an LLM judge to discourage reward hacking in coding agents. The practical signal is clear: this model was trained with agent loops in mind, not retrofitted to them.

The long-horizon-task pitch, explained

"Long-horizon" is the part that's easy to gloss over. A model can be brilliant at a single coding question and still fall apart over a task that spans dozens of steps, where each decision depends on the last and small errors compound. Z.ai positions GLM-5.2 for exactly that regime, and reports gains on long-horizon coding evaluations — for example, its SWE-bench Pro score rises to 62.1 from GLM-5.1's 58.4. It also reports an "effort level" control that lets you trade capability against speed and cost, which is a useful knob when you're deciding how much compute to spend on a given agent step.

How it stacks up against other open-weights models

On the Artificial Analysis Intelligence Index (v4.1), GLM-5.2 scores 51, ahead of other recent open-weights models including MiniMax-M3 (44), DeepSeek V4 Pro (44), and Kimi K2.6 (43), according to Willison's write-up. On the Code Arena WebDev leaderboard it ranks second, behind only a closed frontier model. Z.ai's release also reports competitive numbers on agentic and tool-use benchmarks such as MCP-Atlas and Tool-Decathlon, plus strong reasoning scores.

One caveat worth internalizing before you budget for it: GLM-5.2 is verbose. Willison notes it uses roughly 43k output tokens per Intelligence Index task, versus 24k–37k for competing open-weights models. More thinking tokens can help quality, but in an agent loop — where every step generates output you pay for and feed back in — that verbosity is a real cost and latency factor.
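To see how feeding output back in adds up, here is a minimal Python sketch. It assumes a naive loop that resends the entire history on every step, with no prompt caching or summarization. The function name and the per-step token counts are hypothetical, chosen only for illustration and not measured from GLM-5.2.

```python
def cumulative_tokens(steps: int, output_per_step: int, initial_prompt: int = 0) -> tuple[int, int]:
    """Return (total_input_tokens, total_output_tokens) for a naive agent loop.

    Assumption: each step's prompt is the initial prompt plus every previous
    step's output, resent in full (no caching, no summarization).
    """
    total_input = 0
    history = initial_prompt
    for _ in range(steps):
        total_input += history       # the whole history is read again this step
        history += output_per_step   # this step's output joins the history
    return total_input, steps * output_per_step


# Hypothetical workloads: (steps, output tokens per step).
for steps, per_step in [(50, 1_000), (50, 2_000), (100, 1_000)]:
    total_in, total_out = cumulative_tokens(steps, per_step)
    print(f"{steps} steps x {per_step} tok/step: input={total_in:,} output={total_out:,}")
```

Under these assumptions, doubling the per-step output doubles both totals, while doubling the number of steps roughly quadruples the input total. That is why verbosity matters more as tasks get longer.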

Is GLM-5.2 really the most powerful open-weights LLM?

The strongest single endorsement comes from Willison, whose hands-on assessment is that GLM-5.2 is "probably the most powerful text-only open weights LLM." Note the careful wording: probably, and text-only. This is one experienced reviewer's judgment the week of release, corroborated by the third-party Intelligence Index ranking above — not a settled, peer-reviewed verdict.

What Simon Willison's hands-on testing found

Willison's testing lines up with Z.ai's positioning on the headline points: top-tier open-weights performance, a genuinely large context window, and benchmark results that put it at or near the front of the open pack. He also surfaces the practical texture the marketing won't: the high token consumption per task, and the fact that the model is available through hosted providers such as OpenRouter at roughly $1.40 per million input tokens and $4.40 per million output tokens. That gives a useful price anchor even though the weights are free to self-host.

That dual availability — download the weights or hit a hosted endpoint — is exactly what you want when evaluating: prototype against an API, then move to self-hosting if the economics or data-control requirements demand it.
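As a rough price anchor for prototyping against a hosted endpoint, the sketch below turns the OpenRouter prices quoted above into a per-task figure. The function name `hosted_cost_usd` is hypothetical, and the input token count is left at zero so that only the cost of the roughly 43k output tokens is shown. Real tasks will add input cost on top, and provider prices may change.

```python
# Prices quoted in this article for OpenRouter, in USD per million tokens.
INPUT_PRICE_PER_M = 1.40
OUTPUT_PRICE_PER_M = 4.40


def hosted_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the hosted cost of one call or task at the prices above."""
    return (input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000


# Output-only cost of one task at the ~43k output tokens reported per task.
output_only = hosted_cost_usd(0, 43_000)
print(f"output cost per task: ${output_only:.4f}")

# The same task repeated 1,000 times, e.g. for an evaluation run (hypothetical size).
print(f"output cost for 1,000 tasks: ${1_000 * output_only:.2f}")
```

At these prices the output side alone comes to a little under 19 cents per task, so an evaluation run of a thousand tasks lands in the low hundreds of dollars before input tokens are counted.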

Can GLM-5.2 actually run agents?

Benchmarks tell you a model can do something; they don't tell you it will hold together over a 50-step task on your stack. The honest answer is that GLM-5.2's design and reported scores make it a credible candidate for agent workloads — and that you should verify it on your own tasks before trusting it in production.
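A back-of-the-envelope model shows why a 50-step task is so much harder than a single prompt. The sketch below assumes every step succeeds independently with the same probability and that the agent never recovers from a mistake. That is a deliberate simplification, since real agents can retry and self-correct. The per-step rates are hypothetical and are not measurements of GLM-5.2.

```python
def task_success_probability(per_step_success: float, steps: int) -> float:
    """Probability that all `steps` independent steps succeed (no recovery)."""
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_success ** steps


# Hypothetical per-step reliabilities over a 50-step task.
for p in (0.999, 0.99, 0.95):
    print(f"per-step {p:.3f} -> 50-step {task_success_probability(p, 50):.3f}")
```

Under these assumptions, a step that succeeds 99% of the time yields a 50-step task that finishes cleanly only about 60% of the time, and a 95% step rate drops that below 10%. Single-step benchmark scores say little about this regime.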

Long-horizon evaluation — framing it with CEO-Bench

To see why long-horizon evaluation is so hard, it helps to look at how researchers are now measuring it. CEO-Bench, a recent long-horizon agent benchmark, has agents run a simulated startup for 500 days through a programmable interface: setting pricing, managing budgets, and coordinating decisions across interconnected data over time. The finding is sobering. Most state-of-the-art models struggle, and in the paper's reporting only the strongest closed frontier models managed even to preserve their starting balance, with none consistently turning a profit. That's the bar long-horizon agents are being held to, and it's a reminder that headline coding scores don't automatically translate into sustained, adaptive performance. GLM-5.2's long-context design helps with the "remember the whole task" half of the problem; it does not, by itself, guarantee the "make good sequential decisions" half.

Where it still falls short

Three honest caveats. First, verbosity: roughly 43k output tokens per Intelligence Index task points to a tax on every agent step, in both cost and latency. Second, hardware: a 753B-parameter model is not something you casually self-host. Z.ai's materials discuss GPU memory as the primary bottleneck for 1M-context inference, so running the full context window locally is a serious infrastructure commitment, even though MoE sparsity reduces active compute. Third, the benchmark-to-production gap: the reported scores are encouraging, but long-horizon reliability is precisely the dimension where models flatter themselves on benchmarks and disappoint in the field.
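To put the hardware caveat in numbers, the sketch below estimates the memory needed just to hold the weights. All experts have to be resident even though only about 40B parameters are active per token, so the total count is what matters here. The precisions and the 80 GB per-GPU figure are illustrative assumptions, not a statement of how Z.ai ships or serves the model. The estimate ignores the KV cache and activations, which grow with context length.

```python
import math

TOTAL_PARAMS = 753e9  # total parameter count reported in this article


def weight_memory_gb(total_params: float, bytes_per_param: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return total_params * bytes_per_param / 1e9


def min_gpus_for_weights(total_params: float, bytes_per_param: float, gpu_memory_gb: float) -> int:
    """Lower bound on GPU count: weights only, no KV cache or activations."""
    return math.ceil(weight_memory_gb(total_params, bytes_per_param) / gpu_memory_gb)


# Assumed precisions and an assumed 80 GB accelerator, for illustration only.
for name, bytes_per_param in [("16-bit", 2.0), ("8-bit", 1.0), ("4-bit", 0.5)]:
    gb = weight_memory_gb(TOTAL_PARAMS, bytes_per_param)
    gpus = min_gpus_for_weights(TOTAL_PARAMS, bytes_per_param, 80.0)
    print(f"{name}: {gb:,.1f} GB of weights, at least {gpus} x 80 GB GPUs")
```

These are floors, not sizing recommendations. Long-context inference adds KV-cache memory on top, which is the bottleneck the caveat above refers to.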

Should you switch to GLM-5.2 for agent workloads?

A practical way to decide:

  • You want open weights and data control. GLM-5.2 is one of the strongest openly licensed options available right now, and the MIT license is genuinely permissive. If avoiding API lock-in is a hard requirement, it belongs on your shortlist.
  • You need a long context for stateful agents. The 1M-token window is a real differentiator for agents that accumulate history. Few open models match it.
  • You're cost- or latency-sensitive per step. Budget for the verbosity. Measure tokens-per-task on your workload before committing.
  • You can't self-host 753B parameters. Start on a hosted endpoint to evaluate, and treat self-hosting as a separate infrastructure decision.

The right move isn't to switch on a benchmark — it's to run GLM-5.2 through your own agent evaluation harness and compare it head-to-head on the tasks you actually ship.
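A head-to-head comparison does not need much machinery to get started. The sketch below is one minimal shape for such a harness, not a reference implementation. The `ModelFn` interface, the `Task` and `Report` classes, and the two stub models are all hypothetical stand-ins. In practice each model function would wrap a real agent loop against a hosted or self-hosted endpoint.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable

# Hypothetical interface: a model is any callable that takes a task prompt
# and returns (final_answer, output_tokens_used).
ModelFn = Callable[[str], tuple[str, int]]


@dataclass
class Task:
    prompt: str
    check: Callable[[str], bool]  # True if the answer is acceptable


@dataclass
class Report:
    pass_rate: float
    mean_output_tokens: float


def evaluate(model: ModelFn, tasks: list[Task]) -> Report:
    """Run every task once and record pass rate and mean output tokens."""
    if not tasks:
        raise ValueError("need at least one task")
    passed, tokens = [], []
    for task in tasks:
        answer, used = model(task.prompt)
        passed.append(task.check(answer))
        tokens.append(used)
    return Report(pass_rate=mean(passed), mean_output_tokens=mean(tokens))


# Stub models standing in for real endpoints (illustration only).
def terse_stub(prompt: str) -> tuple[str, int]:
    return ("4", 10)


def verbose_stub(prompt: str) -> tuple[str, int]:
    return ("The answer is 4 and the capital is Paris", 50)


tasks = [
    Task("What is 2 + 2?", lambda a: "4" in a),
    Task("Capital of France?", lambda a: "Paris" in a),
]

for name, model in [("terse_stub", terse_stub), ("verbose_stub", verbose_stub)]:
    report = evaluate(model, tasks)
    print(f"{name}: pass_rate={report.pass_rate:.2f} mean_tokens={report.mean_output_tokens:.0f}")
```

Reporting tokens alongside pass rate keeps the verbosity trade-off visible: a model that passes more tasks but spends several times the tokens may or may not be the better choice for your budget.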

FAQ

Is GLM-5.2 free and open source?

The weights are released under an MIT license and are publicly downloadable, so you can self-host and use it commercially under those terms. "Free" applies to the weights; running a 753B-parameter model still costs real compute, and hosted providers charge per token (around $1.40 in / $4.40 out per million tokens via OpenRouter).

GLM-5.2 vs closed frontier models — when does open win?

Open wins when you need data control, the ability to self-host, no per-call API dependency, or freedom to fine-tune. On raw leaderboard position, GLM-5.2 sits at or near the top of the open-weights field and second on at least one coding leaderboard behind a closed model — close enough that for many agent workloads the open-vs-closed decision comes down to control and economics rather than capability.

What hardware do you need to run GLM-5.2?

Z.ai supports inference through frameworks including vLLM, SGLang, and Transformers, but a 753B-parameter model with a 1M-token context is a heavy lift — GPU memory is the main bottleneck, especially at full context. Plan for serious multi-GPU infrastructure to self-host, or use a hosted endpoint while you evaluate.

Takeaways for Clawvard readers

  • GLM-5.2 is an MIT-licensed, ~753B-parameter MoE open-weights model with a 1M-token context, built and benchmarked with long-horizon agent work in mind.
  • It leads or nearly leads the open-weights field on third-party indices and gains on long-horizon coding evals, but it's notably verbose and heavy to self-host.
  • Benchmarks are a starting point, not a verdict: long-horizon reliability has to be measured on your own tasks before you trust it in production. For a practical method, see our companion guide on evaluating AI agent reliability and security.

Want to put open-weights models through a real agent loop instead of a benchmark? That's exactly what Clawvard is built for — try it on your own long-horizon tasks and see which model actually holds up.
