2026-04-09

Claude Mythos Preview vs GPT-5.4 — What the Benchmarks Actually Show


Anthropic dropped Claude Mythos Preview on April 7. Two days later, the AI community is still processing the numbers.


I've been tracking this at [benchmark.space](https://benchmark.space), which indexes benchmark results from official sources hourly. Here's what the data says.


Mythos is a generational leap



The headline numbers (full table at the end of this post) put Mythos ahead of both Claude Opus 4.6 and GPT-5.4 on all seven benchmarks tracked, from SWE-bench Verified (93.9% vs 81.4% and 80.0%) down to Humanity's Last Exam (56.8% vs 34.4% and 36.2%).

[Full Mythos benchmark profile](https://benchmark.space/model/claude-mythos-preview)


The cybersecurity angle


According to Anthropic's Project Glasswing, Mythos found thousands of high-severity vulnerabilities in "every major operating system and web browser." Ethan Mollick's reaction: "In different hands, Mythos would be an unprecedented cyberweapon."


The [CyberGym benchmark](https://benchmark.space/benchmark/cybergym) score of 83.1% is 16.5 points ahead of Opus 4.6.


But GPT-5.4 still wins at some things


On browser agent tasks, GPT-5.4 leads:


- **Mind2Web**: 92.8% — [leaderboard](https://benchmark.space/benchmark/mind2web)
- **WebArena**: 67.3% — [leaderboard](https://benchmark.space/benchmark/webArena)

And on the [Aider Polyglot coding benchmark](https://benchmark.space/benchmark/aider-polyglot), GPT-5 (88%) still leads Claude Opus 4.6 (72%) by 16 points.


What to actually use


If you're choosing between them today:


- **Coding/reasoning**: Mythos (if you have access) or Opus 4.6
- **Browser agents**: GPT-5.4
- **Math**: Tie — both hit near-perfect on AIME
- **Cost-sensitive**: Look at Qwen 3.6 Plus or Gemma 4 31B

[Compare any two models head-to-head](https://benchmark.space/vs)


Data from [benchmark.space](https://benchmark.space) — updated hourly with verified sources.

| Benchmark | Claude Mythos Preview | Claude Opus 4.6 | GPT-5.4 |
|---|---|---|---|
| SWE-bench Verified | **93.9%** | 81.4% | 80.0% |
| GPQA Diamond | **94.6%** | 91.3% | 92.8% |
| OSWorld | **79.6%** | 72.7% | 75.0% |
| USAMO 2026 | **97.6%** | 42.3% | 95.2% |
| SWE-bench Pro | **77.8%** | 53.4% | 57.7% |
| Terminal-Bench 2.0 | **82.0%** | 65.4% | 81.8% |
| Humanity's Last Exam | **56.8%** | 34.4% | 36.2% |
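For readers who want the margins at a glance, here's a minimal Python sketch. The scores are transcribed from the table above; the snippet just computes Mythos's lead over the stronger of the other two models on each benchmark.

```python
# Scores transcribed from the comparison table above.
# Tuple order: (Mythos, Opus 4.6, GPT-5.4)
scores = {
    "SWE-bench Verified":   (93.9, 81.4, 80.0),
    "GPQA Diamond":         (94.6, 91.3, 92.8),
    "OSWorld":              (79.6, 72.7, 75.0),
    "USAMO 2026":           (97.6, 42.3, 95.2),
    "SWE-bench Pro":        (77.8, 53.4, 57.7),
    "Terminal-Bench 2.0":   (82.0, 65.4, 81.8),
    "Humanity's Last Exam": (56.8, 34.4, 36.2),
}

# Lead = Mythos score minus the better of Opus 4.6 and GPT-5.4,
# rounded to one decimal to match the table's precision.
leads = {
    name: round(mythos - max(opus, gpt), 1)
    for name, (mythos, opus, gpt) in scores.items()
}

# Print benchmarks sorted by margin, largest first.
for name, lead in sorted(leads.items(), key=lambda kv: -kv[1]):
    print(f"{name}: +{lead}")
```

Running it shows Mythos ahead on every row, with margins ranging from +0.2 (Terminal-Bench 2.0) to +20.6 (Humanity's Last Exam).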