Claude Sonnet 5: What's New, How It Benchmarks, and Where Claude Science Fits

July 2, 2026 · 8 min read

In one 48-hour window at the end of June 2026, Anthropic shipped three things at once: Claude Sonnet 5, a new frontier-class model; Claude Science, a vertical product for research; and a policy clearance that let restricted models get a global release. For builders, the launch-day threads blurred those together. This explainer separates them — benchmarks and cost first, the new product second, and the policy story last — so you can decide whether Sonnet 5 belongs in your stack.

What is Claude Sonnet 5?

Claude Sonnet 5 is the newest model in Anthropic's mid-tier Sonnet line. The headline positioning, per Anthropic's own materials as summarized by Simon Willison, is that Sonnet 5's "performance is close to that of Opus 4.8, but at lower prices." In other words, the pitch is Opus-class quality at Sonnet-class cost — a familiar move, and the thing worth verifying before you migrate.

If you want context on where Opus sits in Anthropic's lineup, our Claude Opus 4.8 vs 4.7 comparison is a useful reference point for the quality bar Sonnet 5 is being measured against.

What's new in Claude Sonnet 5?

Several concrete, source-confirmed changes matter more than the marketing line:

  • A 1 million-token context window with up to 128,000 output tokens — a large working context for long-horizon tasks.
  • Adaptive thinking is on by default (and can be disabled). The model decides how much to "think" per request rather than requiring you to toggle it.
  • Sampling parameters temperature, top_p, and top_k are no longer supported. If your code sets these, it will need updating — this is the kind of change that quietly breaks integrations.
  • Same tools and platform features as Claude Sonnet 4.6, so the surrounding API surface is familiar.
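
One way to handle the sampling-parameter change is to strip those keys from request arguments before they are sent. The sketch below is a minimal illustration, assuming your integration builds its request arguments as a plain dictionary. The helper name `strip_sampling_params` and the example payload (including the placeholder model name) are hypothetical; only the three parameter names come from the list above.

```python
# Hypothetical helper: assumes request arguments live in a plain dict.
# Only the three parameter names are taken from the article.
UNSUPPORTED_SAMPLING_PARAMS = ("temperature", "top_p", "top_k")


def strip_sampling_params(request_kwargs):
    """Return (cleaned_kwargs, removed_keys) without mutating the input."""
    removed = [k for k in UNSUPPORTED_SAMPLING_PARAMS if k in request_kwargs]
    cleaned = {k: v for k, v in request_kwargs.items() if k not in removed}
    return cleaned, removed


if __name__ == "__main__":
    legacy = {
        "model": "your-model-id",  # placeholder, not a real model identifier
        "max_tokens": 1024,
        "temperature": 0.2,
        "top_p": 0.9,
    }
    cleaned, removed = strip_sampling_params(legacy)
    print(cleaned)
    print("removed:", removed)
```

Returning the removed keys makes it easy to log which call sites still set sampling parameters, so they can be refactored rather than silently patched.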

The tokenizer change you can't ignore

The most consequential detail is easy to miss. Sonnet 5 uses a new tokenizer that produces roughly 30% more tokens than Sonnet 4.6 for the same input, per Willison's analysis. The effect varies by language and content:

  • English text: about 1.42x as many tokens
  • Spanish text: about 1.33x as many tokens
  • Python code: about 1.27x as many tokens
  • Simplified Mandarin: comparable token usage

Because you're billed per token, more tokens for the same text means higher effective cost even when the per-token price is unchanged. This is exactly why headline pricing can mislead — which brings us to benchmarks and cost.
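
To get a rough feel for what this means on a mixed workload, the multipliers above can be applied to your existing Sonnet 4.6 token counts. This is a back-of-the-envelope sketch, not a substitute for measuring real traffic: the function names are hypothetical, the content-type buckets are a simplification, and treating "comparable" Mandarin usage as exactly 1.0 is an assumption.

```python
# Multipliers quoted above (Sonnet 5 tokens relative to Sonnet 4.6).
# "Comparable" usage for Simplified Mandarin is modeled as 1.0 (an assumption).
TOKEN_MULTIPLIER = {
    "english": 1.42,
    "spanish": 1.33,
    "python": 1.27,
    "mandarin": 1.00,
}


def estimate_new_tokens(old_tokens_by_type):
    """Estimate Sonnet 5 token totals from Sonnet 4.6 counts per content type."""
    return sum(
        count * TOKEN_MULTIPLIER[content_type]
        for content_type, count in old_tokens_by_type.items()
    )


def inflation_ratio(old_tokens_by_type):
    """Blended ratio of estimated new tokens to old tokens for a workload."""
    old_total = sum(old_tokens_by_type.values())
    if old_total == 0:
        raise ValueError("workload has no tokens")
    return estimate_new_tokens(old_tokens_by_type) / old_total


if __name__ == "__main__":
    workload = {"english": 1000, "python": 1000}  # illustrative counts
    print(round(estimate_new_tokens(workload)))
    print(round(inflation_ratio(workload), 3))
```

The blended ratio depends entirely on your content mix, which is why a workload that is mostly English prose can land well above the headline figure.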

How does Claude Sonnet 5 benchmark and price out?

Here's the honest state of things. Anthropic's positioning is that Sonnet 5 performs close to Opus 4.8 at lower prices, but the widely circulated launch coverage did not publish a full independent benchmark table. So rather than repeat numbers that aren't in the sources, treat the quality claim as a hypothesis to test on your own workload.

On price, the confirmed figures are:

  • Standard rates: $3 per million input tokens and $15 per million output tokens.
  • Introductory discount: $2 / $10 until August 31, 2026.

But recall the tokenizer: a ~30% token inflation versus Sonnet 4.6 means the effective cost of a given task can be meaningfully higher than the sticker price suggests. When you compare Sonnet 5 to another model, compare cost per completed task, not cost per token — the token count is now part of the price. For the broader pricing backdrop, our 2026 LLM API pricing breakdown puts these rates in context.
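
The cost-per-completed-task comparison can be sketched as a small calculation over evaluation runs. The rates below are the standard and introductory figures listed above; everything else (the function names, the `(input_tokens, output_tokens, succeeded)` record shape, and the sample numbers) is a hypothetical illustration, and the token counts are assumed to be measured on the model being priced.

```python
# Rates in USD per million tokens, as listed above.
STANDARD = {"input": 3.00, "output": 15.00}
INTRO = {"input": 2.00, "output": 10.00}  # through August 31, 2026


def request_cost_usd(input_tokens, output_tokens, rates):
    """Cost of one request given measured token counts."""
    return (
        input_tokens * rates["input"] + output_tokens * rates["output"]
    ) / 1_000_000


def cost_per_completed_task(runs, rates):
    """Total spend divided by the number of successful runs.

    runs: iterable of (input_tokens, output_tokens, succeeded) tuples.
    Failed runs still cost money, so they count toward the total.
    Returns None when nothing succeeded.
    """
    total = 0.0
    completed = 0
    for input_tokens, output_tokens, succeeded in runs:
        total += request_cost_usd(input_tokens, output_tokens, rates)
        if succeeded:
            completed += 1
    return total / completed if completed else None


if __name__ == "__main__":
    # Illustrative records only, not measured data.
    runs = [(100_000, 20_000, True), (100_000, 20_000, False)]
    print(cost_per_completed_task(runs, STANDARD))
    print(cost_per_completed_task(runs, INTRO))
```

Charging failed runs against the successful ones is the point of the metric: a cheaper model that fails more often can cost more per finished task than a pricier one that succeeds.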

This is also a reminder that a single benchmark number rarely settles a model decision. We've written about why headline leaderboards can mislead in "Can you trust an AI model leaderboard?", and the practical fix is always the same: benchmark on your own tasks. Our guides on how to benchmark agents on your own tools and the broader AI agent evaluation guide lay out how. If your use case is specifically agentic coding, we also have a focused Claude Sonnet 5 for coding agents review that digs into the cost-per-task question.

What is Claude Science, and who is it for?

Launched the same day, Claude Science is a separate, flagship product — positioned by Anthropic alongside Claude Code and Claude Cowork, per MIT Technology Review. It's aimed at scientific research, with a stated focus on drug development and computational biology, and an audience of pharmaceutical executives, biotech founders, and researchers.

What it does, per the reporting:

  • Autonomously carries out meaningful work from concise, high-level instructions.
  • Writes code and manages execution on powerful compute clusters.
  • Interfaces with tools for genetics, chemistry, and protein biology.
  • Prioritizes reproducibility so scientists can verify results.

In the launch demo, it reportedly identified new drug candidates for phenylketonuria, a rare genetic disease. One clarification worth making: the MIT Tech Review piece does not state which underlying model powers Claude Science, and it does not tie the product to Sonnet 5 specifically. So treat Claude Science as a product announcement that arrived in the same window — not as "Sonnet 5 for science."

Why were Anthropic's models gated, and what changed?

The third strand is the policy story, and it's the one to keep brief. Some of Anthropic's models had faced restrictions tied to safety and cyber-capability concerns. Reporting from TechCrunch and Ars Technica covered the lifting of those restrictions on models including Mythos and Fable, which cleared the way for a global release.

Where this touches Sonnet 5 directly: its system card notes that Sonnet 5 is significantly less capable at cyber tasks than Mythos 5, and that lower cyber capability is what supported regulatory clearance. For most builders the practical takeaway is simple — Sonnet 5 shipped broadly, and the gating that affected earlier models is not a barrier here.

Should you migrate to Claude Sonnet 5?

A practical checklist:

  • Audit your sampling parameters. If you set temperature, top_p, or top_k, remove or refactor them before switching — they're no longer supported.
  • Re-estimate cost on real traffic. Because of the ~30% token inflation, don't assume the sticker price maps to your old bill. Run a sample of real requests and measure tokens.
  • Benchmark on your tasks, not the launch demo. The "close to Opus 4.8" claim is promising but unverified in public benchmarks; confirm it on your workload.
  • Use the intro discount window intentionally. The $2/$10 rate runs until August 31, 2026 — a good window to run evaluations before committing.

For a sense of how other 2026 frontier launches have played out — including restricted rollouts — our GPT-5.6 "Sol" explainer is a useful companion read.

Frequently asked questions

Is Claude Sonnet 5 better than Sonnet 4.6?

Anthropic positions Sonnet 5 as close to Opus 4.8 in performance — a step up from the Sonnet 4.x line — but note the new tokenizer produces roughly 30% more tokens than 4.6, which raises effective cost. Whether it's "better" for you depends on measuring quality and cost per completed task on your own workload.

What can Claude Science do?

Per MIT Technology Review, Claude Science autonomously carries out research work from high-level instructions: writing code, running it on compute clusters, and interfacing with genetics, chemistry, and protein-biology tools, with an emphasis on reproducibility. It's aimed at drug development and computational biology.

Is Claude Sonnet 5 available globally?

Yes. After earlier restrictions on some Anthropic models were lifted, Sonnet 5 shipped with a global release. Its system card notes it is significantly less capable at cyber tasks than Mythos 5, which supported that clearance.

How do Claude Sonnet 5 benchmarks compare?

Anthropic's public claim is "close to Opus 4.8 at lower prices," but the launch coverage did not publish a full independent benchmark suite. The responsible way to compare is to run it against your own tasks and measure cost per completed task — see our benchmarking guide.

Takeaways

  • Sonnet 5's pitch is Opus-4.8-class quality at lower prices — treat that as a hypothesis to verify, not a settled benchmark result.
  • The new tokenizer's ~30% token inflation means effective cost can exceed the sticker price; compare cost per task, not per token.
  • Breaking change: temperature, top_p, and top_k are no longer supported.
  • Claude Science is a separate research product from the same window — don't conflate it with Sonnet 5.
  • Use the intro pricing window (through Aug 31, 2026) to run your own evaluation.

Want to go deeper on the "close to Opus" claim? Compare the reference point in our Opus 4.8 vs 4.7 breakdown, and if you're evaluating Sonnet 5 for real work, use Clawvard to benchmark it against tasks that look like yours rather than trusting launch-day numbers. Follow our model-evaluation coverage for the independent results as they land.
