2026-04-09

I Tracked 83 AI Models Across 37 Benchmarks. Here's Who Actually Leads in 2026.


Everyone claims their model is the best. I built a system that checks.


[benchmark.space](https://benchmark.space) indexes AI benchmark results hourly from official sources — papers, blog posts, model cards. No self-reported claims without verification. No marketing slides. Just real scores.
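To make the "no unverified claims" policy concrete, here is a minimal sketch of what a verified result record could look like. All names here (`BenchmarkResult`, `Verification`, the example URLs) are illustrative assumptions, not benchmark.space's actual schema:

```python
from dataclasses import dataclass
from enum import Enum

class Verification(Enum):
    VERIFIED = "verified"            # score confirmed against a primary source
    SELF_REPORTED = "self-reported"  # vendor claim, no independent check yet

@dataclass(frozen=True)
class BenchmarkResult:
    model: str
    benchmark: str
    score: float       # percentage, 0-100
    source_url: str    # primary source: paper, blog post, or model card
    status: Verification

def keep_verified(results):
    """Drop any result that isn't backed by a checked primary source."""
    return [r for r in results if r.status is Verification.VERIFIED]
```

The key design point is that the source URL and verification status travel with every score, so a leaderboard can always be traced back to where each number came from.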


After tracking 352 results across 37 benchmarks, some findings surprised me.


The SOTA Map (April 2026)


The frontier is razor-thin. Here are the current leaders:


**SWE-bench Verified** (coding): Claude Mythos Preview — 93.9%. [Full leaderboard](https://benchmark.space/benchmark/swe-bench-verified)
**GPQA Diamond** (PhD-level reasoning): Claude Mythos Preview — 94.6%. [Full leaderboard](https://benchmark.space/benchmark/gpqa-diamond)
**AIME** (math): GPT-5.2 and Claude Opus 4.6 — both 100%. [Full leaderboard](https://benchmark.space/benchmark/aime)
**OSWorld** (desktop agents): Holo3-122B — 78.85%. [Full leaderboard](https://benchmark.space/benchmark/osworld)
**USAMO 2026** (proof-based math): Claude Mythos Preview — 97.6%


Claude vs GPT — It's 4–4


The most searched comparison in AI. [Claude Opus 4.6 vs GPT-5.4](https://benchmark.space/vs/claude-opus-4.6/gpt-5.4) across 9 shared benchmarks: Claude wins 4, GPT wins 4, 1 tie.


Claude dominates reasoning (GPQA, HLE, ARC-AGI). GPT dominates agents (Mind2Web, WebArena). Neither is "better" — it depends entirely on what you need.


The Small Model Story Nobody's Talking About


Qwen 3.5 9B scores 92.5% on AIME. That's a 9-billion-parameter model matching what GPT-5 could do 8 months ago. Gemma 4 31B hits 89.2%. These are open-weight models you can run locally.


[See all model rankings](https://benchmark.space/rankings)


What Researchers Are Actually Saying


François Chollet (ARC-AGI creator) on Meta's latest: "Already looking like a disappointment: overoptimized for public benchmark numbers at the detriment of everything else."


Fuli Luo (Xiaomi MiMo, ex-DeepSeek): "Anthropic cut off third-party harnesses from using Claude subscriptions — not surprising."


Jensen Huang: "I think we've achieved AGI."


[All researcher takes with sources](https://benchmark.space/voices)


The Trust Problem


MIT Technology Review published "AI Benchmarks Are Broken" in March 2026, reporting 45% data overlap on QA benchmarks. Only 9 of the 30 models examined report train-test contamination checks at all. MMLU and GSM8K are effectively saturated.
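"Data overlap" here means evaluation questions leaking into training data. One crude but common signal is word-level n-gram overlap between an eval item and a training corpus. A hedged sketch of that idea (this is my illustration of the general technique, not the method used in the article or by benchmark.space):

```python
def ngram_overlap(train_text, eval_text, n=8):
    """Fraction of the eval text's word-level n-grams that also appear
    in the training text. High values suggest possible contamination."""
    def ngrams(text):
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    eval_grams = ngrams(eval_text)
    if not eval_grams:
        return 0.0  # eval text shorter than n words: nothing to check
    return len(eval_grams & ngrams(train_text)) / len(eval_grams)
```

Real contamination audits are more involved (normalization, fuzzy matching, scale), but even this simple check is more than most model cards currently disclose.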


That's why [benchmark.space](https://benchmark.space) links every score to its primary source and shows verification status. The data should be checkable.


352 results, 83 models, 37 benchmarks. Updated hourly.