Everyone claims their model is the best. I built a system that checks.
[benchmark.space](https://benchmark.space) indexes AI benchmark results hourly from official sources — papers, blog posts, model cards. No self-reported claims without verification. No marketing slides. Just real scores.
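What does "verified" mean here? Conceptually, every stored score carries a pointer to its primary source and a status from comparing the claimed number against that source. A minimal sketch in Python (the names, schema, and tolerance are illustrative, not the production code):

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum

class Status(Enum):
    VERIFIED = "verified"      # claim matches the primary source
    UNVERIFIED = "unverified"  # no primary source located yet
    MISMATCH = "mismatch"      # claim contradicts the primary source

@dataclass
class Result:
    model: str
    benchmark: str
    claimed_score: float
    source_url: str            # primary source: paper, blog post, or model card
    status: Status
    indexed_at: datetime

def verify(claimed: float, source_score: float | None, tol: float = 0.05) -> Status:
    """Compare a claimed score against the number extracted from the source."""
    if source_score is None:
        return Status.UNVERIFIED
    return Status.VERIFIED if abs(claimed - source_score) <= tol else Status.MISMATCH
```

An hourly pass re-runs this over newly indexed papers, posts, and model cards; anything that can't be traced to a source stays unverified.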
After tracking 352 results across 37 benchmarks, some findings surprised me.
The frontier is razor-thin.
Start with the most searched comparison in AI: [Claude Opus 4.6 vs GPT-5.4](https://benchmark.space/vs/claude-opus-4.6/gpt-5.4). Across their 9 shared benchmarks, Claude wins 4, GPT wins 4, and one is a tie.
Claude dominates reasoning (GPQA, HLE, ARC-AGI). GPT dominates agentic benchmarks (Mind2Web, WebArena). Neither is "better"; it depends entirely on what you need.
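Mechanically, a head-to-head like that is just a tally over the benchmarks both models report. A minimal sketch, assuming higher is always better (a real comparison would also need to handle metrics where lower is better):

```python
def head_to_head(a: dict[str, float], b: dict[str, float]) -> tuple[int, int, int]:
    """Return (wins_a, wins_b, ties) over the benchmarks both models report."""
    shared = a.keys() & b.keys()
    wins_a = sum(a[k] > b[k] for k in shared)
    wins_b = sum(b[k] > a[k] for k in shared)
    return wins_a, wins_b, len(shared) - wins_a - wins_b
```

Run it over the nine shared benchmarks and you get the 4-4-1 split above; change the benchmark set and the "winner" can flip, which is the whole point.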
The other surprise: Qwen 3.5 9B scores 92.5% on AIME. That's a 9-billion-parameter model matching what GPT-5 could do 8 months ago. Gemma 4 31B hits 89.2%. Both are open-weight models you can run locally.
[See all model rankings](https://benchmark.space/rankings)
François Chollet (ARC-AGI creator) on Meta's latest: "Already looking like a disappointment: overoptimized for public benchmark numbers at the detriment of everything else."
Fuli Luo (Xiaomi MiMo, ex-DeepSeek): "Anthropic cut off third-party harnesses from using Claude subscriptions — not surprising."
Jensen Huang: "I think we've achieved AGI."
[All researcher takes with sources](https://benchmark.space/voices)
MIT Technology Review published "AI Benchmarks Are Broken" in March 2026. 45% data overlap on QA benchmarks. Only 9 out of 30 models report train-test contamination. MMLU and GSM8K are effectively saturated.
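"Data overlap" here means n-gram collisions between benchmark items and a model's training corpus. A toy version of the standard heuristic (13-grams, the convention popularized by the GPT-3 paper; illustrative, not the report's methodology):

```python
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """All n-token windows of a whitespace-tokenized, lowercased string."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(item: str, corpus_ngrams: set[tuple[str, ...]], n: int = 13) -> bool:
    """Flag a benchmark item if any of its n-grams also appears in the training corpus."""
    return not ngrams(item, n).isdisjoint(corpus_ngrams)
```

A score inflated by that kind of overlap looks identical to genuine capability unless you can trace the number back to how it was produced.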
That's why [benchmark.space](https://benchmark.space) links every score to its primary source and shows verification status. The data should be checkable.
352 results, 83 models, 37 benchmarks. Updated hourly.