AIME leaderboard
AIME
39 models tested · Updated 2025-12-11 · Verified sources only
GPT-5.2 leads at 100.0%
1
GPT-5.2
OpenAI ·
OpenAI — GPT-5.2 for Science and Math
· 2025-12-11
Perfect score on the math competition; no tools enabled.
100.0%
2
Claude Opus 4.6
Anthropic ·
Blog/Anthropic
· 2026-02-05
Perfect score on AIME 2025. Matches GPT-5.2 ceiling — AIME may be saturated for frontier models.
100.0%
3
o4-mini
OpenAI ·
Blog/OpenAI
· 2025-04-16
AIME 2025 pass@1 with Python interpreter access. 100% consensus@8. Retired Feb 2026. Tool-assisted score — not directly comparable to models without tool access.
99.5%
4
gpt-oss-20b
OpenAI ·
Blog/OpenAI
· 2025-08-05
AIME 2025 with tool use. Edges out gpt-oss-120b (97.9%). Remarkable for a 3.6B active param model.
98.7%
5
Seed 2.0 Pro
ByteDance ·
Blog/ByteDance
· 2026-02-14
Near-perfect AIME score. ByteDance reports gold medals on major math olympiads.
98.3%
6
Gemini 3.1 Pro
Google ·
Leaderboard/MathArena
· 2026-02-28
AIME 2024+2025 combined (60 questions). New SOTA on MathArena's combined AIME leaderboard.
98.13%
7
gpt-oss-120b
OpenAI ·
Blog/OpenAI
· 2025-08-05
AIME 2025 with tool use. Also scores 96.6% on AIME 2024 (tools). Apache 2.0 open-weight.
97.9%
8
Step-3.5-Flash
StepFun ·
Blog/StepFun
· 2026-02-12
Open-weight MoE (196B total, 11B active). Reaches 99.9 with PaCoRe parallel thinking. Best open model on AIME.
97.3%
9
Gemini 3 Pro
Google ·
Vals.ai
· 2026-03-17
MathArena AIME 2026. 95% without tools, 100% with code execution. Top math performance.
96.7%
10
Arcee Trinity
Arcee AI ·
HuggingFace/arcee-ai/Trinity-Large-Thinking
· 2026-04-01
AIME 2025. Strong math reasoning for a 13B-active MoE model.
96.3%
11
Kimi K2.5
Moonshot AI ·
HuggingFace/moonshotai
· 2026-01-27
1T MoE, 32B active. Open-source. Averaged over 32 runs on AIME 2025.
96.1%
12
DeepSeek V3.2 Speciale
DeepSeek ·
Blog/DeepSeek
· 2025-12-01
AIME 2025. Deep reasoning variant. SOTA-level math, surpasses GPT-5 (94.6).
96.0%
13
DeepSeek-V3.2-Speciale
DeepSeek ·
Paper/DeepSeek-V3.2
· 2025-12-01
AIME 2025, 23k thinking tokens. Extended reasoning variant.
96.0%
14
GLM-5
Zhipu AI ·
MathArena
· 2026-04-08
AIME 2026 competition. MathArena independent evaluation, 3rd place behind Step-3.5-Flash (96.67).
95.83%
15
GLM-4.7
Zhipu AI ·
NVIDIA/Zhipu Official
· 2025-12-22
Open-weight 400B model. Among the highest AIME scores for open models, competitive with GPT-5.4.
95.7%
16
Qwen 3.6 Plus
Alibaba ·
Blog/Qwen
· 2026-04-02
Highest AIME score among Qwen models. Strong math reasoning, competitive with Gemini 3 Pro (96.7) and Kimi K2.5 (96.1).
95.3%
17
GLM-5.1
Zhipu AI ·
MathArena
· 2026-04-08
AIME 2026 competition. MathArena independent evaluation, 4th place.
95.3%
18
GPT-5
OpenAI ·
Blog/OpenAI
· 2025-08-07
AIME 2025 without tools. Strong but not perfect — Claude Opus 4.6 and GPT-5.2 both hit 100%.
94.6%
19
DeepSeek V3.2
DeepSeek ·
Blog/DeepSeek
· 2025-12-01
AIME 2026. 685B MoE params with DeepSeek Sparse Attention. Top-tier math reasoning.
94.17%
20
Claude Sonnet 4.6
Anthropic ·
Blog/Anthropic
· 2026-02-17
Advanced math reasoning. Anthropic official benchmarks page.
94.0%
21
Grok 4
xAI ·
Artificial Analysis
· 2025-07-09
Joint highest AIME 2024 at time of release. Third-party verified.
94.0%
22
DeepSeek-V3.2
DeepSeek ·
Paper/DeepSeek-V3.2
· 2025-12-01
AIME 2025, Pass@1. Open-weight, 685B MoE. Matches GPT-5 on reasoning.
93.1%
23
Seed 2.0 Lite
ByteDance ·
Blog/ByteDance
· 2026-03-10
Mid-tier model in Seed 2.0 family. Handles 95% of enterprise workloads at half the Pro cost.
93.0%
24
GLM-5
Zhipu AI ·
Paper/Zhipu AI (arxiv:2602.15763)
· 2026-02-11
AIME 2026 I. On par with Kimi K2.5 (92.5) and DeepSeek V3.2 (92.7).
92.7%
25
o4-mini
OpenAI ·
Blog/OpenAI
· 2025-04-16
Without tools (closed-book). With tools achieves 99.5%. Best-performing small reasoning model on AIME.
92.7%
26
Qwen 3.5 9B
Alibaba ·
HuggingFace/Qwen
· 2026-03-02
MathArena AIME 2026. A 9B model matching frontier math scores. Remarkable reasoning-per-parameter.
92.5%
27
GLM-4.7-Flash
Zhipu AI ·
HuggingFace/Zhipu
· 2026-01-15
30B-A3B MoE with only 3.6B active params. Strong math reasoning for a lightweight model.
91.6%
28
Qwen 3.5 397B
Alibaba ·
HuggingFace/Qwen
· 2026-02-16
AIME 2026. Strong math reasoning from open-weight MoE model.
91.3%
29
Qwen3.5 27B
Alibaba ·
HuggingFace/Alibaba
· 2026-02-20
AIME 2026. Exceptional math reasoning for a 27B model — near frontier-class.
90.83%
30
Gemma 4 31B
Google ·
X/@o_mega___
· 2026-04-02
Independent X user corroborates Google's reported AIME score; the post had zero engagement at time of sourcing.
89.2%
31
Nemotron 3 Nano 30B
NVIDIA ·
HuggingFace/NVIDIA
· 2025-12-15
Without tools. Hybrid Mamba-Transformer MoE, 3.3x throughput vs Qwen3-30B on H200.
89.1%
32
Gemma 4 26B A4B
Google ·
Model Card/Google
· 2026-04-02
MoE with only 3.8B active params outperforms many larger models.
88.3%
33
Sarvam 105B
Sarvam AI ·
HuggingFace/sarvamai
· 2026-03-06
India's first domestically-trained 105B model. MoE with 10.3B active params. Apache 2.0.
88.3%
34
MiniMax M2.5
MiniMax ·
Blog/MiniMax
· 2026-02-12
Competitive math reasoning. Trails Claude Opus 4.5 (91.0) and Claude Sonnet 4.5 (88.0).
86.3%
35
Phi-4-reasoning-plus
Microsoft ·
HuggingFace/Microsoft
· 2025-04-30
SFT+RL variant of Phi-4-reasoning. 14B model hitting 81%+ on AIME 2024 — remarkable for size.
81.3%
36
Claude Haiku 4.5
Anthropic ·
Blog/Anthropic
· 2025-10-15
Strong math reasoning for a Haiku-tier model. Extended thinking with 128K budget.
80.7%
37
Phi-4-reasoning
Microsoft ·
HuggingFace/Microsoft
· 2025-04-30
14B dense model. Matches DeepSeek-R1 (671B) on AIME despite 47x fewer params.
75.3%
38
Gemma 4 4B
Google ·
Google Model Card
· 2026-04-02
Impressive for a 4B model running on T4 GPUs. MoE architecture.
42.5%
39
Gemma 4 E2B
Google ·
Model Card/Google
· 2026-04-02
Matches Gemma 3 27B despite being a fraction of the size. Built-in reasoning capability.
37.5%