Everyone claims their model is the best. I built a system that checks.
[benchmark.space](https://benchmark.space) indexes AI benchmark results hourly from official sources — papers, blog posts, model cards. No self-reported claims without verification. No marketing slides. Just real scores.
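What does "verified" mean here? Conceptually, every stored score carries a pointer to its primary source and a status from comparing the claimed number against that source. A minimal sketch in Python (the names, schema, and tolerance are illustrative, not the production code):

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum

class Status(Enum):
    VERIFIED = "verified"      # claim matches the primary source
    UNVERIFIED = "unverified"  # no primary source located yet
    MISMATCH = "mismatch"      # claim contradicts the primary source

@dataclass
class Result:
    model: str
    benchmark: str
    claimed_score: float
    source_url: str            # primary source: paper, blog post, or model card
    status: Status
    indexed_at: datetime

def verify(claimed: float, source_score: float | None, tol: float = 0.05) -> Status:
    """Compare a claimed score against the number extracted from the source."""
    if source_score is None:
        return Status.UNVERIFIED
    return Status.VERIFIED if abs(claimed - source_score) <= tol else Status.MISMATCH
```

An hourly pass re-runs this over newly indexed papers, posts, and model cards; anything that can't be traced to a source stays unverified.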
After tracking 352 results across 37 benchmarks, some findings surprised me.
The frontier is razor-thin.
Start with the most searched comparison in AI: [Claude Opus 4.6 vs GPT-5.4](https://benchmark.space/vs/claude-opus-4.6/gpt-5.4). Across their 9 shared benchmarks, Claude wins 4, GPT wins 4, and one is a tie.
Claude dominates reasoning (GPQA, HLE, ARC-AGI). GPT dominates agentic benchmarks (Mind2Web, WebArena). Neither is "better"; it depends entirely on what you need.
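Mechanically, a head-to-head like that is just a tally over the benchmarks both models report. A minimal sketch, assuming higher is always better (a real comparison would also need to handle metrics where lower is better):

```python
def head_to_head(a: dict[str, float], b: dict[str, float]) -> tuple[int, int, int]:
    """Return (wins_a, wins_b, ties) over the benchmarks both models report."""
    shared = a.keys() & b.keys()
    wins_a = sum(a[k] > b[k] for k in shared)
    wins_b = sum(b[k] > a[k] for k in shared)
    return wins_a, wins_b, len(shared) - wins_a - wins_b
```

Run it over the nine shared benchmarks and you get the 4-4-1 split above; change the benchmark set and the "winner" can flip, which is the whole point.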
The other surprise: Qwen 3.5 9B scores 92.5% on AIME. That's a 9-billion-parameter model matching what GPT-5 could do 8 months ago. Gemma 4 31B hits 89.2%. Both are open-weight models you can run locally.
[See all model rankings](https://benchmark.space/rankings)
François Chollet (ARC-AGI creator) on Meta's latest: "Already looking like a disappointment: overoptimized for public benchmark numbers at the detriment of everything else."
Fuli Luo (Xiaomi MiMo, ex-DeepSeek): "Anthropic cut off third-party harnesses from using Claude subscriptions — not surprising."
Jensen Huang: "I think we've achieved AGI."
[All researcher takes with sources](https://benchmark.space/voices)
MIT Technology Review published "AI Benchmarks Are Broken" in March 2026. 45% data overlap on QA benchmarks. Only 9 out of 30 models report train-test contamination. MMLU and GSM8K are effectively saturated.
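"Data overlap" here means n-gram collisions between benchmark items and a model's training corpus. A toy version of the standard heuristic (13-grams, the convention popularized by the GPT-3 paper; illustrative, not the report's methodology):

```python
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """All n-token windows of a whitespace-tokenized, lowercased string."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(item: str, corpus_ngrams: set[tuple[str, ...]], n: int = 13) -> bool:
    """Flag a benchmark item if any of its n-grams also appears in the training corpus."""
    return not ngrams(item, n).isdisjoint(corpus_ngrams)
```

A score inflated by that kind of overlap looks identical to genuine capability unless you can trace the number back to how it was produced.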
That's why [benchmark.space](https://benchmark.space) links every score to its primary source and shows verification status. The data should be checkable.
352 results, 83 models, 37 benchmarks. Updated hourly.