2026-04-09

Claude Mythos Preview vs GPT-5.4 — What the Benchmarks Actually Show


Anthropic dropped Claude Mythos Preview on April 7. Two days later, the AI community is still processing the numbers.


I've been tracking this at [benchmark.space](https://benchmark.space), which indexes benchmark results from official sources hourly. Here's what the data says.


Mythos is a generational leap



The headline numbers (full table at the end of this post) put Mythos ahead of both Claude Opus 4.6 and GPT-5.4 on all seven benchmarks tracked, from SWE-bench Verified (93.9% vs 81.4% and 80.0%) down to Humanity's Last Exam (56.8% vs 34.4% and 36.2%).

[Full Mythos benchmark profile](https://benchmark.space/model/claude-mythos-preview)


The cybersecurity angle


According to Anthropic's Project Glasswing, Mythos found thousands of high-severity vulnerabilities in "every major operating system and web browser." Ethan Mollick's reaction: "In different hands, Mythos would be an unprecedented cyberweapon."


The [CyberGym benchmark](https://benchmark.space/benchmark/cybergym) score of 83.1% is 16.5 points ahead of Opus 4.6.


But GPT-5.4 still wins at some things


On browser agent tasks, GPT-5.4 leads:


- **Mind2Web**: 92.8% — [leaderboard](https://benchmark.space/benchmark/mind2web)
- **WebArena**: 67.3% — [leaderboard](https://benchmark.space/benchmark/webArena)

And on the [Aider Polyglot coding benchmark](https://benchmark.space/benchmark/aider-polyglot), GPT-5 (88%) still leads Claude Opus 4.6 (72%) by 16 points.


What to actually use


If you're choosing between them today:


- **Coding/reasoning**: Mythos (if you have access) or Opus 4.6
- **Browser agents**: GPT-5.4
- **Math**: Tie — both hit near-perfect on AIME
- **Cost-sensitive**: Look at Qwen 3.6 Plus or Gemma 4 31B

[Compare any two models head-to-head](https://benchmark.space/vs)


Data from [benchmark.space](https://benchmark.space) — updated hourly with verified sources.

| Benchmark | Claude Mythos Preview | Claude Opus 4.6 | GPT-5.4 |
|---|---|---|---|
| SWE-bench Verified | **93.9%** | 81.4% | 80.0% |
| GPQA Diamond | **94.6%** | 91.3% | 92.8% |
| OSWorld | **79.6%** | 72.7% | 75.0% |
| USAMO 2026 | **97.6%** | 42.3% | 95.2% |
| SWE-bench Pro | **77.8%** | 53.4% | 57.7% |
| Terminal-Bench 2.0 | **82.0%** | 65.4% | 81.8% |
| Humanity's Last Exam | **56.8%** | 34.4% | 36.2% |
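For readers who want the margins at a glance, here's a minimal Python sketch. The scores are transcribed from the table above; the snippet just computes Mythos's lead over the stronger of the other two models on each benchmark.

```python
# Scores transcribed from the comparison table above.
# Tuple order: (Mythos, Opus 4.6, GPT-5.4)
scores = {
    "SWE-bench Verified":   (93.9, 81.4, 80.0),
    "GPQA Diamond":         (94.6, 91.3, 92.8),
    "OSWorld":              (79.6, 72.7, 75.0),
    "USAMO 2026":           (97.6, 42.3, 95.2),
    "SWE-bench Pro":        (77.8, 53.4, 57.7),
    "Terminal-Bench 2.0":   (82.0, 65.4, 81.8),
    "Humanity's Last Exam": (56.8, 34.4, 36.2),
}

# Lead = Mythos score minus the better of Opus 4.6 and GPT-5.4,
# rounded to one decimal to match the table's precision.
leads = {
    name: round(mythos - max(opus, gpt), 1)
    for name, (mythos, opus, gpt) in scores.items()
}

# Print benchmarks sorted by margin, largest first.
for name, lead in sorted(leads.items(), key=lambda kv: -kv[1]):
    print(f"{name}: +{lead}")
```

Running it shows Mythos ahead on every row, with margins ranging from +0.2 (Terminal-Bench 2.0) to +20.6 (Humanity's Last Exam).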