Anthropic dropped Claude Mythos Preview on April 7. Two days later, the AI community is still processing the numbers.
I've been tracking this at [benchmark.space](https://benchmark.space), which indexes benchmark results from official sources hourly. Here's what the data says.
| Benchmark | Claude Mythos Preview | Claude Opus 4.6 | GPT-5.4 |
| --- | --- | --- | --- |
| SWE-bench Verified | **93.9%** | 81.4% | 80.0% |
| GPQA Diamond | **94.6%** | 91.3% | 92.8% |
| OSWorld | **79.6%** | 72.7% | 75.0% |
| USAMO 2026 | **97.6%** | 42.3% | 95.2% |
| SWE-bench Pro | **77.8%** | 53.4% | 57.7% |
| Terminal-Bench 2.0 | **82.0%** | 65.4% | 81.8% |
| Humanity's Last Exam | **56.8%** | 34.4% | 36.2% |
[Full Mythos benchmark profile](https://benchmark.space/model/claude-mythos-preview)
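To make the gaps easier to eyeball, here's a minimal Python sketch that hard-codes the table above and prints Mythos's margin over whichever of the other two models scored higher on each benchmark. It doesn't call benchmark.space; the numbers are copied straight from the table.

```python
# Margin of Claude Mythos Preview over the stronger of the two
# competitors, per benchmark. Scores hard-coded from the table above.
SCORES = {
    # benchmark: (mythos, opus_4_6, gpt_5_4)
    "SWE-bench Verified":   (93.9, 81.4, 80.0),
    "GPQA Diamond":         (94.6, 91.3, 92.8),
    "OSWorld":              (79.6, 72.7, 75.0),
    "USAMO 2026":           (97.6, 42.3, 95.2),
    "SWE-bench Pro":        (77.8, 53.4, 57.7),
    "Terminal-Bench 2.0":   (82.0, 65.4, 81.8),
    "Humanity's Last Exam": (56.8, 34.4, 36.2),
}

for name, (mythos, opus, gpt) in SCORES.items():
    runner_up = max(opus, gpt)  # best non-Mythos score
    print(f"{name}: +{mythos - runner_up:.1f} pts over runner-up")
```

The spread is striking: the lead ranges from 0.2 points on Terminal-Bench 2.0 to over 20 on SWE-bench Pro and Humanity's Last Exam.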
## The cybersecurity angle
Anthropic's Project Glasswing reveals that Mythos found thousands of high-severity vulnerabilities in "every major operating system and web browser." Ethan Mollick's reaction: "In different hands, Mythos would be an unprecedented cyberweapon."
The [CyberGym benchmark](https://benchmark.space/benchmark/cybergym) score of 83.1% puts Mythos 16.5 points ahead of Opus 4.6 (66.6%).
## But GPT-5.4 still wins at some things
On browser agent tasks, GPT-5.4 leads:
- **Mind2Web**: 92.8% ([leaderboard](https://benchmark.space/benchmark/mind2web))
- **WebArena**: 67.3% ([leaderboard](https://benchmark.space/benchmark/webArena))
And on the [Aider Polyglot coding benchmark](https://benchmark.space/benchmark/aider-polyglot), GPT-5 (88%) still leads Claude Opus 4.6 (72%) by 16 points.
## What to actually use
If you're choosing between them today, here's the short version (sketched as code after this list):
- **Coding/reasoning**: Mythos (if you have access) or Opus 4.6
- **Browser agents**: GPT-5.4
- **Math**: effectively a tie; both hit near-perfect scores on AIME
- **Cost-sensitive**: look at Qwen 3.6 Plus or Gemma 4 31B
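Since "which model for which task" is basically a lookup table, here's an illustrative sketch encoding the list above. To be clear, the task categories and the `pick_model` function are invented for this post; nothing here comes from any vendor API.

```python
# Illustrative task-to-model lookup encoding the recommendations above.
# Categories and function name are made up for this post; adjust to taste.
RECOMMENDATIONS = {
    "coding":         ["Claude Mythos Preview", "Claude Opus 4.6"],
    "reasoning":      ["Claude Mythos Preview", "Claude Opus 4.6"],
    "browser_agents": ["GPT-5.4"],
    "math":           ["Claude Mythos Preview", "GPT-5.4"],  # effectively a tie on AIME
    "cost_sensitive": ["Qwen 3.6 Plus", "Gemma 4 31B"],
}

def pick_model(task: str, have_mythos_access: bool = False) -> str:
    """Return the first recommended model you can actually use for a task."""
    for model in RECOMMENDATIONS[task]:
        if model == "Claude Mythos Preview" and not have_mythos_access:
            continue  # Mythos is preview-only; fall through to the next pick
        return model
    raise ValueError(f"no available model for task {task!r}")

print(pick_model("coding"))                           # Claude Opus 4.6
print(pick_model("coding", have_mythos_access=True))  # Claude Mythos Preview
```

The `have_mythos_access` flag exists because the gating on a preview model is the real decision point: without access, the "best" column of every benchmark above is moot.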
[Compare any two models head-to-head](https://benchmark.space/vs)
Data from [benchmark.space](https://benchmark.space) — updated hourly with verified sources.