Claude Sonnet 4.6: Near-Opus Performance at a Fraction of the Cost
Anthropic's Claude Sonnet 4.6 matches flagship-class performance on coding, computer use, and agent tasks at one-fifth the cost of Opus. Here's what the benchmarks say, what developers think, and what it means for your workflow.
Two weeks after releasing Opus 4.6, Anthropic is back with Claude Sonnet 4.6. It’s the kind of release that makes you question whether flagship models are worth the premium anymore. Sonnet 4.6 doesn’t just improve on its predecessor. It approaches or matches Opus-class performance at one-fifth the price.
This isn’t marketing fluff backed by cherry-picked evals. The numbers are consistent across coding, computer use, reasoning, and real-world agent tasks. Developer reactions have been unusually unified: this is a real upgrade.
The Numbers That Matter
Let’s start with what Sonnet 4.6 actually scores, because the story is in the specifics.
Coding (SWE-bench Verified): 79.6%, within 1.2 points of Opus 4.6’s 80.8%. For a model that costs $3/$15 per million tokens versus Opus’s $15/$75, that gap is almost negligible.
Computer Use (OSWorld-Verified): 72.5%, essentially tied with Opus 4.6 at 72.7%. This is a nearly 5x improvement from where Sonnet was just 16 months ago (14.9%). Anthropic says Sonnet 4.6 can now handle tasks like navigating complex spreadsheets and multi-step web forms at human level.
Reasoning (ARC-AGI-2): 58.3%, up from Sonnet 4.5’s 13.6%. That’s a 4.3x jump in a single generation. Benchmark maintainers called it the largest single-generation reasoning leap in the benchmark’s history.
Math (MATH-500): 97.8%.
Here’s the part that’s hard to ignore. On several real-world agent benchmarks, Sonnet 4.6 doesn’t just approach Opus. It beats every model tested:
| Benchmark | Sonnet 4.6 | Opus 4.6 |
|---|---|---|
| GDPval (Office) | 1633 Elo | 1606 Elo |
| Finance Agent | 63.3% | 60.1% |
| MCP-Atlas (Tool Use) | 61.3% | 60.3% |
The model that costs one-fifth as much is outperforming the flagship on office productivity, financial analysis, and scaled tool use. The mid-tier is eating the flagship.
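How big is a 27-point Elo gap in practice? If GDPval's ratings follow the standard logistic Elo formula (an assumption on my part; the benchmark may scale its ratings differently), the gap converts to a modest head-to-head edge:

```python
def elo_win_prob(rating_a: float, rating_b: float) -> float:
    """Expected probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

# Sonnet 4.6 (1633) vs Opus 4.6 (1606) on GDPval
p = elo_win_prob(1633, 1606)
print(f"{p:.3f}")  # ≈ 0.539
```

About a 54% expected win rate: a real advantage, but a narrow one. The striking part is the direction, not the size.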
What Developers Are Actually Saying
In Claude Code testing, users preferred Sonnet 4.6 over Sonnet 4.5 roughly 70% of the time. Even more telling: they preferred it over Opus 4.5, Anthropic's previous top-of-the-line model from November, 59% of the time.
The qualitative feedback tracks with the numbers. Developers reported:
- Less over-engineering and “laziness”
- Better instruction following
- Fewer false claims of task completion
- Fewer hallucinations
- More consistent follow-through on multi-step tasks
Joe Binder, VP of Product at GitHub, noted: “Out of the gate, Claude Sonnet 4.6 is already excelling at complex code fixes, especially when searching across large codebases is essential.”
Jamie Cuffe, CEO of insurance tech company Pace, said Sonnet 4.6 hit 94% on their complex insurance computer use benchmark, the highest of any Claude model tested. “It reasons through failures and self-corrects in ways we haven’t seen before.”
Simon Willison tested it immediately and noted that it performs comparably to Opus 4.5 at lower pricing, and that its August 2025 knowledge cutoff is actually more recent than Opus 4.6's May 2025 cutoff.
The Caveat: Token Hunger
There’s one important detail the headline benchmarks don’t capture. According to Latent Space’s analysis, Sonnet 4.6 used 280 million tokens to run GDPval-AA versus Sonnet 4.5’s 58 million, a 4.8x increase. The model is thinking harder, and that costs tokens.
This means the per-token pricing advantage can erode depending on the task. For long-running agentic work, Sonnet 4.6’s total cost can approach or exceed Opus in some scenarios. Cursor’s team noted it’s “better on longer tasks but below Opus 4.6 for intelligence,” which tracks with the benchmark data.
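The arithmetic behind that caveat is worth making explicit. A rough sketch, assuming output-token pricing dominates and applying the 4.8x multiplier Latent Space observed on GDPval-AA (real workloads will vary):

```python
# Per-million-token output prices from the article
SONNET_OUT = 15.0  # $/M output tokens
OPUS_OUT = 75.0    # $/M output tokens

def total_cost(price_per_m: float, tokens_m: float) -> float:
    """Dollar cost for a job emitting tokens_m million output tokens."""
    return price_per_m * tokens_m

# Hypothetical job: Opus needs 10M output tokens; Sonnet, thinking harder,
# burns 4.8x as many (the multiplier Latent Space measured on GDPval-AA).
opus = total_cost(OPUS_OUT, 10)            # $750
sonnet = total_cost(SONNET_OUT, 10 * 4.8)  # $720
print(opus, sonnet, sonnet / opus)
```

At a 4.8x token multiplier, the 5x per-token discount almost fully erodes: Sonnet's bill lands at roughly 96% of Opus's. That is the "approach or exceed" scenario in concrete numbers.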
The takeaway: Sonnet 4.6 is a long-horizon workhorse, not a drop-in replacement for Opus on tasks that need maximum raw intelligence with minimal token overhead.
What’s New Under the Hood
Beyond raw performance, Sonnet 4.6 picks up several architectural features from the Opus 4.6 release:
1M Token Context Window (Beta): Like Opus, Sonnet 4.6 can now operate with a million tokens of context, double the largest window previously available for any Sonnet model. For codebases, that's roughly 30,000 lines of code in a single conversation.
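That lines-of-code figure is just tokens divided by an assumed density. A back-of-the-envelope check, assuming roughly 30 tokens per line once comments and surrounding conversation are counted (actual token density varies a lot by language and tokenizer):

```python
CONTEXT_TOKENS = 1_000_000
TOKENS_PER_LINE = 30  # assumed average; real code is often denser or sparser

lines = CONTEXT_TOKENS // TOKENS_PER_LINE
print(lines)  # → 33333
```

Tighter code at ~10 tokens per line would fit closer to 100,000 lines, so treat the article's figure as a conservative floor.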
Adaptive Thinking: The model dynamically adjusts reasoning depth based on task complexity, with configurable effort levels. Light tasks get fast responses; complex problems get deeper analysis.
Context Compaction: Automatic summarization of older context during long conversations, keeping the window useful without stacking overhead.
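Anthropic hasn't published the compaction algorithm, but the general idea — fold the oldest turns into a summary once the window fills past a budget — can be sketched in a few lines. The threshold, fold size, and summarizer below are all placeholders, not the real implementation:

```python
def compact(messages: list[str], max_tokens: int,
            count_tokens, summarize) -> list[str]:
    """Replace the oldest messages with a summary once the budget is exceeded.

    count_tokens: callable mapping a message to its token count.
    summarize: callable collapsing a list of messages into one summary string.
    """
    while sum(count_tokens(m) for m in messages) > max_tokens and len(messages) > 2:
        # Fold the two oldest messages into a single summary turn.
        head, rest = messages[:2], messages[2:]
        messages = [summarize(head)] + rest
    return messages

# Toy usage: tokens = words, summarizer keeps the first 3 words of each message.
count = lambda m: len(m.split())
summ = lambda ms: " ".join(" ".join(m.split()[:3]) for m in ms)
history = [("alpha " * 50).strip(), ("beta " * 50).strip(),
           ("gamma " * 50).strip(), "recent question here"]
compacted = compact(history, max_tokens=80, count_tokens=count, summarize=summ)
print(len(compacted))  # → 3 (two old turns collapsed into one summary)
```

The key property is that recent turns survive verbatim while the oldest material degrades gracefully into summaries, which is what keeps long agent sessions coherent.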
Improved Safety: Researchers assessed it as having “a broadly warm, honest, prosocial character, very strong safety behaviors” with improved prompt injection resistance compared to Sonnet 4.5.
Pricing: The Actual Story
Sonnet 4.6 holds at $3 / $15 per million tokens (input/output), identical to Sonnet 4.5. For comparison:
| Model | Input | Output | Relative Cost |
|---|---|---|---|
| Sonnet 4.6 | $3/M | $15/M | 1x |
| Opus 4.6 | $15/M | $75/M | 5x |
It’s now the default model on Free and Pro plans in claude.ai and Claude Cowork, which means millions of users got an automatic upgrade yesterday. It’s also available via API, AWS Bedrock, Google Vertex AI, and Microsoft Azure.
What This Means for Vibe Coders
If you’re using Claude Code, this is your new default. You don’t need to do anything. Sonnet 4.6 is already running. The practical difference you’ll notice: better code quality, fewer iterations to get production-ready results, and more reliable follow-through on complex multi-file tasks.
If you’ve been defaulting to Opus for coding tasks, it’s worth testing whether Sonnet 4.6 handles your workflow at 80% of the quality for 20% of the cost. For agentic workflows, code fixes across large codebases, and office productivity, the benchmarks suggest you won’t notice a difference. For problems requiring maximum reasoning depth, Opus still has the edge.
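One concrete way to run that test is to compare cost per successfully completed task rather than raw per-token price. A hedged sketch, using the SWE-bench Verified scores above as a stand-in for success rate and an assumed 0.1M output tokens per attempt (your own workload's numbers are the ones that matter):

```python
def cost_per_success(price_out: float, tokens_m: float, success_rate: float) -> float:
    """Expected dollars per solved task: spend per attempt / success probability."""
    return (price_out * tokens_m) / success_rate

# Assumed: 0.1M output tokens per attempt; SWE-bench Verified as a quality proxy.
sonnet = cost_per_success(15.0, 0.1, 0.796)  # ≈ $1.88 per solve
opus = cost_per_success(75.0, 0.1, 0.808)    # ≈ $9.28 per solve
print(f"Sonnet ${sonnet:.2f}/solve, Opus ${opus:.2f}/solve")
```

Unless Opus's success rate on your tasks is dramatically higher than Sonnet's, the per-solve economics favor Sonnet, which is exactly what the benchmark gap implies.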
The performance gap between flagship and mid-tier models is compressing fast. What required Opus-class intelligence six months ago now runs on Sonnet. The interesting question is how quickly the next Sonnet makes the current Opus feel overpriced.
Sources:
- Anthropic: Introducing Claude Sonnet 4.6
- CNBC: Anthropic releases Claude Sonnet 4.6
- VentureBeat: Sonnet 4.6 matches flagship performance
- Natural 20: Sonnet 4.6 benchmarks and analysis
- Simon Willison: Introducing Claude Sonnet 4.6
- Latent Space: Clean upgrade with caveats
- 9to5Mac: Much-improved coding skills
- TechCrunch: Anthropic releases Sonnet 4.6
- AWS: Sonnet 4.6 on Bedrock