AIME leaderboard
AIME
39 models tested · Updated 2025-12-11 · Verified sources only
GPT-5.2 leads at 100.0%
1
GPT-5.2
OpenAI ·
OpenAI — GPT-5.2 for Science and Math
· 2025-12-11
Perfect score on the math competition; no tools enabled.
100.0%
2
Claude Opus 4.6
Anthropic ·
Blog/Anthropic
· 2026-02-05
Perfect score on AIME 2025. Matches GPT-5.2 ceiling — AIME may be saturated for frontier models.
100.0%
3
o4-mini
OpenAI ·
Blog/OpenAI
· 2025-04-16
AIME 2025 pass@1 with Python interpreter access. 100% consensus@8. Retired Feb 2026. Tool-assisted score — not directly comparable to models without tool access.
99.5%
4
gpt-oss-20b
OpenAI ·
Blog/OpenAI
· 2025-08-05
AIME 2025 with tool use. Edges out gpt-oss-120b (97.9%). Remarkable for a 3.6B active param model.
98.7%
5
Seed 2.0 Pro
ByteDance ·
Blog/ByteDance
· 2026-02-14
Near-perfect AIME score. ByteDance reports gold medals on major math olympiads.
98.3%
6
Gemini 3.1 Pro
Google ·
Leaderboard/MathArena
· 2026-02-28
AIME 2024+2025 combined (60 questions). New SOTA on MathArena's combined AIME leaderboard.
98.13%
7
gpt-oss-120b
OpenAI ·
Blog/OpenAI
· 2025-08-05
AIME 2025 with tool use. Also scores 96.6% on AIME 2024 (tools). Apache 2.0 open-weight.
97.9%
8
Step-3.5-Flash
StepFun ·
Blog/StepFun
· 2026-02-12
Open-weight MoE (196B total, 11B active). Reaches 99.9 with PaCoRe parallel thinking. Best open model on AIME.
97.3%
9
Gemini 3 Pro
Google ·
Vals.ai
· 2026-03-17
MathArena AIME 2026. 95% without tools, 100% with code execution. Top math performance.
96.7%
10
Arcee Trinity
Arcee AI ·
HuggingFace/arcee-ai/Trinity-Large-Thinking
· 2026-04-01
AIME 2025. Strong math reasoning for a 13B-active MoE model.
96.3%
11
Kimi K2.5
Moonshot AI ·
HuggingFace/moonshotai
· 2026-01-27
1T MoE, 32B active. Open-source. Averaged over 32 runs on AIME 2025.
96.1%
12
DeepSeek V3.2 Speciale
DeepSeek ·
Blog/DeepSeek
· 2025-12-01
AIME 2025. Deep reasoning variant. SOTA-level math, surpasses GPT-5 (94.6).
96.0%
13
DeepSeek-V3.2-Speciale
DeepSeek ·
Paper/DeepSeek-V3.2
· 2025-12-01
AIME 2025, 23k thinking tokens. Extended reasoning variant.
96.0%
14
GLM-5
Zhipu AI ·
MathArena
· 2026-04-08
AIME 2026 competition. MathArena independent evaluation, 3rd place behind Step-3.5-Flash (96.67).
95.83%
15
GLM-4.7
Zhipu AI ·
NVIDIA/Zhipu Official
· 2025-12-22
Open-weight 400B model. Among the highest AIME scores for open models, competitive with GPT-5.4.
95.7%
16
Qwen 3.6 Plus
Alibaba ·
Blog/Qwen
· 2026-04-02
Highest AIME score among Qwen models. Strong math reasoning, competitive with Gemini 3 Pro (96.7) and Kimi K2.5 (96.1).
95.3%
17
GLM-5.1
Zhipu AI ·
MathArena
· 2026-04-08
AIME 2026 competition. MathArena independent evaluation, 4th place.
95.3%
18
GPT-5
OpenAI ·
Blog/OpenAI
· 2025-08-07
AIME 2025 without tools. Strong but not perfect — Claude Opus 4.6 and GPT-5.2 both hit 100%.
94.6%
19
DeepSeek V3.2
DeepSeek ·
Blog/DeepSeek
· 2025-12-01
AIME 2026. 685B MoE params with DeepSeek Sparse Attention. Top-tier math reasoning.
94.17%
20
Claude Sonnet 4.6
Anthropic ·
Blog/Anthropic
· 2026-02-17
Advanced math reasoning. Anthropic official benchmarks page.
94.0%
21
Grok 4
xAI ·
Artificial Analysis
· 2025-07-09
Joint highest AIME 2024 at time of release. Third-party verified.
94.0%
22
DeepSeek-V3.2
DeepSeek ·
Paper/DeepSeek-V3.2
· 2025-12-01
AIME 2025, Pass@1. Open-weight, 685B MoE. Matches GPT-5 on reasoning.
93.1%
23
Seed 2.0 Lite
ByteDance ·
Blog/ByteDance
· 2026-03-10
Mid-tier model in Seed 2.0 family. Handles 95% of enterprise workloads at half the Pro cost.
93.0%
24
GLM-5
Zhipu AI ·
Paper/Zhipu AI (arxiv:2602.15763)
· 2026-02-11
AIME 2026 I. On par with Kimi K2.5 (92.5) and DeepSeek V3.2 (92.7).
92.7%
25
o4-mini
OpenAI ·
Blog/OpenAI
· 2025-04-16
Without tools (closed-book). With tools achieves 99.5%. Best-performing small reasoning model on AIME.
92.7%
26
Qwen 3.5 9B
Alibaba ·
HuggingFace/Qwen
· 2026-03-02
MathArena AIME 2026. A 9B model matching frontier math scores. Remarkable reasoning-per-parameter.
92.5%
27
GLM-4.7-Flash
Zhipu AI ·
HuggingFace/Zhipu
· 2026-01-15
30B-A3B MoE with only 3.6B active params. Strong math reasoning for a lightweight model.
91.6%
28
Qwen 3.5 397B
Alibaba ·
HuggingFace/Qwen
· 2026-02-16
AIME 2026. Strong math reasoning from open-weight MoE model.
91.3%
29
Qwen3.5 27B
Alibaba ·
HuggingFace/Alibaba
· 2026-02-20
AIME 2026. Exceptional math reasoning for a 27B model — near frontier-class.
90.83%
30
Gemma 4 31B
Google ·
X/@o_mega___
· 2026-04-02
Independent X user corroborates Google's reported AIME score; the post had zero engagement at time of sourcing.
89.2%
31
Nemotron 3 Nano 30B
NVIDIA ·
HuggingFace/NVIDIA
· 2025-12-15
Without tools. Hybrid Mamba-Transformer MoE, 3.3x throughput vs Qwen3-30B on H200.
89.1%
32
Gemma 4 26B A4B
Google ·
Model Card/Google
· 2026-04-02
MoE with only 3.8B active params outperforms many larger models.
88.3%
33
Sarvam 105B
Sarvam AI ·
HuggingFace/sarvamai
· 2026-03-06
India's first domestically-trained 105B model. MoE with 10.3B active params. Apache 2.0.
88.3%
34
MiniMax M2.5
MiniMax ·
Blog/MiniMax
· 2026-02-12
Competitive math reasoning. Trails Claude Opus 4.5 (91.0) and Claude Sonnet 4.5 (88.0).
86.3%
35
Phi-4-reasoning-plus
Microsoft ·
HuggingFace/Microsoft
· 2025-04-30
SFT+RL variant of Phi-4-reasoning. 14B model hitting 81%+ on AIME 2024 — remarkable for size.
81.3%
36
Claude Haiku 4.5
Anthropic ·
Blog/Anthropic
· 2025-10-15
Strong math reasoning for a Haiku-tier model. Extended thinking with 128K budget.
80.7%
37
Phi-4-reasoning
Microsoft ·
HuggingFace/Microsoft
· 2025-04-30
14B dense model. Matches DeepSeek-R1 (671B) on AIME despite 47x fewer params.
75.3%
38
Gemma 4 4B
Google ·
Google Model Card
· 2026-04-02
Impressive for a 4B model running on T4 GPUs. MoE architecture.
42.5%
39
Gemma 4 E2B
Google ·
Model Card/Google
· 2026-04-02
Matches Gemma 3 27B despite being a fraction of the size. Built-in reasoning capability.
37.5%