AIME
39 models tested · Updated 2025-12-11 · Verified sources only
GPT-5.2 leads at 100.0%
1
Perfect score on the AIME math competition. No tools enabled.
100.0%
2
Anthropic · Blog/Anthropic · 2026-02-05
Perfect score on AIME 2025. Matches GPT-5.2 ceiling — AIME may be saturated for frontier models.
100.0%
3
OpenAI · Blog/OpenAI · 2025-04-16
AIME 2025 pass@1 with Python interpreter access. 100% consensus@8. Retired Feb 2026. Tool-assisted score — not directly comparable to models without tool access.
99.5%
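A note on the metrics above: pass@1 averages single-sample correctness, while consensus@k takes a majority vote over k sampled answers, so a model can score below 100% pass@1 yet still reach 100% consensus@8. A minimal sketch of the distinction (the sampled answers are hypothetical):

```python
from collections import Counter

def pass_at_1(samples, answer):
    """Fraction of independent samples whose final answer is correct."""
    return sum(s == answer for s in samples) / len(samples)

def consensus_at_k(samples, answer):
    """Majority vote over k samples: correct iff the most common answer matches."""
    vote, _ = Counter(samples).most_common(1)[0]
    return vote == answer

# Hypothetical: 8 sampled final answers to one AIME problem (correct answer: 204)
samples = [204, 204, 17, 204, 204, 204, 113, 204]
print(pass_at_1(samples, 204))       # 0.75
print(consensus_at_k(samples, 204))  # True
```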
4
OpenAI · Blog/OpenAI · 2025-08-05
AIME 2025 with tool use. Edges out gpt-oss-120b (97.9%). Remarkable for a 3.6B active param model.
98.7%
5
ByteDance · Blog/ByteDance · 2026-02-14
Near-perfect AIME score. ByteDance reports gold-medal-level performance on major math olympiads.
98.3%
6
Google · Leaderboard/MathArena · 2026-02-28
AIME 2024+2025 combined (60 questions). New SOTA on MathArena's combined AIME leaderboard at time of evaluation.
98.13%
7
OpenAI · Blog/OpenAI · 2025-08-05
AIME 2025 with tool use. Also scores 96.6% on AIME 2024 (tools). Apache 2.0 open-weight.
97.9%
8
StepFun · Blog/StepFun · 2026-02-12
Open-weight MoE (196B total, 11B active). Reaches 99.9% with PaCoRe parallel thinking. Best open model on AIME.
97.3%
9
Google · Vals.ai · 2026-03-17
MathArena AIME 2026. 95% without tools, 100% with code execution. Top math performance.
96.7%
10
AIME 2025. Strong math reasoning for a 13B-active MoE model.
96.3%
11
Moonshot AI · HuggingFace/moonshotai · 2026-01-27
1T MoE, 32B active. Open-source. Averaged over 32 runs on AIME 2025.
96.1%
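Averaging over 32 runs (often written avg@32) smooths the sampling noise of a 30-question exam, where a single run can swing a score by several points. A toy simulation under an assumed 96% per-question accuracy (both numbers are illustrative):

```python
import random
import statistics

random.seed(0)
p, n_questions, n_runs = 0.96, 30, 32  # assumed accuracy; AIME I+II size; runs averaged

def one_run():
    # One run's score: each question answered correctly with probability p
    return sum(random.random() < p for _ in range(n_questions)) / n_questions

scores = [one_run() for _ in range(n_runs)]
print(f"single-run range: {min(scores):.3f} to {max(scores):.3f}")
print(f"avg@{n_runs}: {statistics.mean(scores):.3f}")
```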
12
DeepSeek · Blog/DeepSeek · 2025-12-01
AIME 2025. Deep reasoning variant. SOTA-level math; surpasses GPT-5 (94.6%).
96.0%
13
DeepSeek · Paper/DeepSeek-V3.2 · 2025-12-01
AIME 2025, 23k thinking tokens. Extended reasoning variant.
96.0%
14
Zhipu AI · MathArena · 2026-04-08
AIME 2026 competition. MathArena independent evaluation, 3rd place behind Step-3.5-Flash (96.67).
95.83%
15
Zhipu AI · NVIDIA/Zhipu Official · 2025-12-22
Open-weight 400B model. Highest AIME score among open models, competitive with GPT-5.4.
95.7%
16
Alibaba · Blog/Qwen · 2026-04-02
Highest AIME score among Qwen models. Strong math reasoning, competitive with Gemini 3 Pro (96.7) and Kimi K2.5 (96.1).
95.3%
17
Zhipu AI · MathArena · 2026-04-08
AIME 2026 competition. MathArena independent evaluation, 4th place.
95.3%
18
OpenAI · Blog/OpenAI · 2025-08-07
AIME 2025 without tools. Strong but not perfect — Claude Opus 4.6 and GPT-5.2 both hit 100%.
94.6%
19
DeepSeek · Blog/DeepSeek · 2025-12-01
AIME 2026. 685B MoE params with DeepSeek Sparse Attention. Top-tier math reasoning.
94.17%
20
Anthropic · Blog/Anthropic · 2026-02-17
Advanced math reasoning. Anthropic official benchmarks page.
94.0%
21
xAI · Artificial Analysis · 2025-07-09
Joint-highest AIME 2024 score at time of release. Third-party verified.
94.0%
22
DeepSeek · Paper/DeepSeek-V3.2 · 2025-12-01
AIME 2025, Pass@1. Open-weight, 685B MoE. Matches GPT-5 on reasoning.
93.1%
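Pass@1 as reported here is single-attempt accuracy. The general pass@k metric is commonly computed with the unbiased estimator from Chen et al.'s HumanEval paper, pass@k = 1 - C(n-c, k)/C(n, k), where c of n generated samples are correct. A sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: chance that at least one of k samples drawn
    without replacement from n generations (c of them correct) is correct."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples, so every draw of k hits a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=8, c=6, k=1))  # 0.75
print(pass_at_k(n=8, c=6, k=2))  # ≈ 0.964
```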
23
ByteDance · Blog/ByteDance · 2026-03-10
Mid-tier model in the Seed 2.0 family; ByteDance claims it handles 95% of enterprise workloads at half the Pro cost.
93.0%
24
Zhipu AI · Paper/Zhipu AI (arxiv:2602.15763) · 2026-02-11
AIME 2026 I. On par with Kimi K2.5 (92.5) and DeepSeek V3.2 (92.7).
92.7%
25
OpenAI · Blog/OpenAI · 2025-04-16
Without tools (closed-book). With tools achieves 99.5%. Best-performing small reasoning model on AIME.
92.7%
26
Alibaba · HuggingFace/Qwen · 2026-03-02
MathArena AIME 2026. A 9B model matching frontier math scores. Remarkable reasoning-per-parameter.
92.5%
27
Zhipu AI · HuggingFace/Zhipu · 2026-01-15
30B-A3B MoE with only 3.6B active params. Strong math reasoning for a lightweight model.
91.6%
28
Alibaba · HuggingFace/Qwen · 2026-02-16
AIME 2026. Strong math reasoning from open-weight MoE model.
91.3%
29
Alibaba · HuggingFace/Alibaba · 2026-02-20
AIME 2026. Exceptional math reasoning for a 27B model — near frontier-class.
90.83%
30
Google · X/@o_mega___ · 2026-04-02
Independent user post corroborating Google's AIME score; zero engagement.
89.2%
31
NVIDIA · HuggingFace/NVIDIA · 2025-12-15
Without tools. Hybrid Mamba-Transformer MoE, 3.3x throughput vs Qwen3-30B on H200.
89.1%
32
Google · Model Card/Google · 2026-04-02
MoE with only 3.8B active params; outperforms many larger models.
88.3%
33
Sarvam AI · HuggingFace/sarvamai · 2026-03-06
India's first domestically-trained 105B model. MoE with 10.3B active params. Apache 2.0.
88.3%
34
MiniMax · Blog/MiniMax · 2026-02-12
Competitive math reasoning. Behind Claude Opus 4.5 (91.0) but ahead of Sonnet 4.5 (88.0).
86.3%
35
Microsoft · HuggingFace/Microsoft · 2025-04-30
SFT+RL variant of Phi-4-reasoning. 14B model hitting 81%+ on AIME 2024 — remarkable for size.
81.3%
36
Anthropic · Blog/Anthropic · 2025-10-15
Strong math reasoning for a Haiku-tier model. Extended thinking with 128K budget.
80.7%
37
Microsoft · HuggingFace/Microsoft · 2025-04-30
14B dense model. Matches DeepSeek-R1 (671B) on AIME despite 47x fewer params.
75.3%
38
Google · Model Card/Google · 2026-04-02
Impressive for a 4B model running on T4 GPUs. MoE architecture.
42.5%
39
Google · Model Card/Google · 2026-04-02
Matches Gemma 3 27B despite being a fraction of the size. Built-in reasoning capability.
37.5%