GPQA Diamond leaderboard
GPQA Diamond
49 models tested · Updated 2026-04-07 · Verified sources only
Claude Mythos Preview
leads at
94.6%
1
Claude Mythos Preview
Anthropic ·
Blog/Anthropic
· 2026-04-07
New SOTA, edging out Gemini 3.1 Pro (94.3%). PhD-level science reasoning.
94.6%
2
Gemini 3.1 Pro
Google ·
PCMag/Google DeepMind
· 2026-02-19
Highest GPQA Diamond score at release; since edged out by Claude Mythos Preview (94.6%). Leads 13 of 16 major benchmarks per Google DeepMind.
94.3%
3
GPT-5.2 Pro
OpenAI ·
Blog/OpenAI
· 2025-12-11
OpenAI Pro-tier model. Surpasses PhD experts (69.7%) by a large margin. Just below Gemini 3.1 Pro (94.3%).
93.2%
4
GPT-5.4
OpenAI ·
Blog/OpenAI
· 2026-03-05
Marginal improvement over GPT-5.2 (92.4%).
92.8%
5
GPT-5.2
OpenAI ·
OpenAI — GPT-5.2 for Science and Math
· 2025-12-11
GPT-5.2 Thinking variant. +4.3% over GPT-5.1. No tools, max reasoning effort.
92.4%
6
Gemini 3 Pro
Google ·
Blog/Google
· 2025-11-18
Announced with Gemini 3 launch. PhD-level reasoning, top of LMArena at 1501 Elo.
91.9%
7
GPT-5.3 Codex
OpenAI ·
Artificial Analysis
· 2026-02-05
Third-highest GPQA Diamond score at release. Codex-native agent pairing frontier coding with general reasoning.
91.5%
8
Claude Opus 4.6
Anthropic ·
Anthropic — Introducing Claude Opus 4.6
· 2026-02-05
Graduate-level science reasoning. Within 1.1 points of GPT-5.2.
91.3%
9
Qwen 3.6 Plus
Alibaba ·
Blog/Qwen
· 2026-04-02
Highest Qwen-family score on GPQA Diamond to date.
90.4%
10
Gemini 3 Flash
Google ·
Blog/Google
· 2025-12-17
Frontier PhD-level reasoning at Flash-tier pricing and latency. Matches larger models.
90.4%
11
Claude Sonnet 4.6
Anthropic ·
Blog/Anthropic
· 2026-02-17
Averaged over 10 trials with adaptive thinking. Within 2.5 pts of GPT-5.2.
89.9%
12
Muse Spark
Meta ·
Blog/Meta
· 2026-04-08
Strong PhD-level reasoning. Ahead of GLM-5.1 (86.2%), behind Gemini 3.1 Pro (94.3%).
89.5%
13
Seed 2.0 Pro
ByteDance ·
Blog/ByteDance
· 2026-02-14
ByteDance frontier model. Gold medals on ICPC, IMO, CMO. Competes with GPT-5.2.
88.9%
14
Qwen 3.5 397B
Alibaba ·
HuggingFace/Qwen
· 2026-02-16
397B total, 17B active MoE. Native multimodal with vision. Top open-weight model on reasoning.
88.4%
15
Grok 4
xAI ·
Artificial Analysis
· 2025-07-09
All-time high GPQA Diamond at time of release. Verified by Artificial Analysis.
88.0%
16
GPT-5.4 Mini
OpenAI ·
Blog/OpenAI
· 2026-03-17
Small/cheap GPT-5.4 variant approaching full GPT-5.4 (92.8%) on reasoning. 2x faster than GPT-5 mini.
88.0%
17
o3
OpenAI ·
Blog/OpenAI
· 2025-04-16
OpenAI reasoning model. Strong scientific knowledge but below Gemini 3.1 Pro (94.3%) and Claude Opus 4.6 (91.3%).
87.7%
18
Kimi K2.5
Moonshot AI ·
HuggingFace/moonshotai
· 2026-01-27
1T MoE, 32B active. Open-source. Averaged over 8 runs. Strong but trails Gemini 3.1 Pro (94.3%) and Claude Opus 4.6 (91.3%).
87.6%
19
Gemini 3.1 Flash-Lite
Google ·
Model Card/Google
· 2026-03-03
Lightweight model rivaling larger Gemini 3 series on reasoning. 2.5x faster than 2.5 Flash.
86.9%
20
MiniMax M2.7
MiniMax ·
Artificial Analysis
· 2026-03-18
Self-improving model that ran 100+ autonomous optimization cycles during training. Competitive with Kimi K2.5 on science reasoning.
86.2%
21
GLM-5.1
Zhipu AI ·
Blog/Z.AI
· 2026-04-07
Strong reasoning for a 744B MoE open-weight model. Behind Gemini 3.1 Pro (94.3%) and GPT-5.4 (92.8%).
86.2%
22
GLM-5
Zhipu AI ·
Paper/Zhipu AI (arxiv:2602.15763)
· 2026-02-11
Open-weight 744B MoE (40B active). Competitive with proprietary models, trailing frontier systems such as Gemini 3 Pro and GPT-5.2.
86.0%
23
Qwen3.5 27B (Reasoning)
Alibaba ·
X/@rohanpaul_ai
· 2026-02-24
Highest recorded result on GPQA Diamond among open-weight models under 40B parameters.
85.8%
24
Gemma 4 31B (Reasoning)
Google ·
X/@rohanpaul_ai
· 2026-04-02
Second-highest result on GPQA Diamond among open-weight models under 40B parameters.
85.7%
25
DeepSeek V3.2 Speciale
DeepSeek ·
Blog/DeepSeek
· 2025-12-01
Deep reasoning variant. DeepSeek claims parity with Gemini 3 Pro on scientific reasoning.
85.7%
26
GLM-4.7
Zhipu AI ·
NVIDIA/Zhipu Official
· 2025-12-22
Open-weight, 400B params. Strong science reasoning for an open model.
85.7%
27
Qwen3.5 27B
Alibaba ·
HuggingFace/Alibaba
· 2026-02-20
Dense 27B model, natively multimodal. Competitive with much larger frontier models on graduate-level reasoning.
85.5%
28
MiniMax M2.5
MiniMax ·
Blog/MiniMax
· 2026-02-12
Strong graduate-level reasoning. Between Gemini 3 Pro (91.9%) and Claude Sonnet 4.5 (83.0%).
85.2%
29
Seed 2.0 Lite
ByteDance ·
Blog/ByteDance
· 2026-03-10
Strong for a mid-tier model. 262K context window, multimodal.
85.1%
30
Gemma 4 31B
Google ·
Model Card/Google
· 2026-04-02
From Google model card. Dense 31B instruction-tuned.
84.3%
31
Step-3.5-Flash
StepFun ·
HuggingFace/stepfun-ai
· 2026-02-02
Open-weight MoE (196B total, 11B active). From official model card.
83.5%
32
GPT-5.4 Nano
OpenAI ·
Blog/OpenAI
· 2026-03-17
Smallest GPT-5.4 variant. Strong GPQA for its cost tier ($0.20/1M input tokens).
82.8%
33
DeepSeek V3.2
DeepSeek ·
Blog/DeepSeek
· 2025-12-01
Graduate-level science reasoning. Solid mid-frontier performance.
82.4%
34
Gemma 4 26B A4B
Google ·
Google — Gemma 4 Blog
· 2026-04-02
MoE with 3.8B active params; runs on 24GB hardware. Google's blog claims it outperforms Sonnet 4.6 (74% in that comparison).
82.3%
35
Qwen 3.5 9B
Alibaba ·
HuggingFace/Qwen
· 2026-03-02
9B params. Outperforms gpt-oss-120b (80.1%). Best reasoning-per-parameter ratio among open models.
81.7%
36
o4-mini
OpenAI ·
Blog/OpenAI
· 2025-04-16
Strong scientific reasoning from OpenAI's cost-efficient reasoning model. Competitive with much larger models.
81.4%
37
Sarvam 105B
Sarvam AI ·
HuggingFace/sarvamai
· 2026-03-06
India's first domestically trained 105B model. MoE with 10.3B active params. Apache 2.0.
78.7%
38
Arcee Trinity
Arcee AI ·
HuggingFace/arcee-ai/Trinity-Large-Thinking
· 2026-04-01
Trinity-Large-Thinking: 398B MoE, only 13B active params. Open-weight, Apache 2.0.
76.3%
39
Qwen3.5-4B
Alibaba ·
HuggingFace/Alibaba
· 2026-03-02
Remarkable for a 4B model. Matches larger Qwen3-80B-A3B on reasoning.
76.2%
40
GLM-4.7-Flash
Zhipu AI ·
HuggingFace/Zhipu
· 2026-01-15
30B-A3B MoE, 3.6B active params. Competitive with much larger models on graduate-level reasoning.
75.2%
41
gpt-oss-120b
OpenAI ·
HuggingFace/openai/gpt-oss-120b
· 2025-08-05
Without tool augmentation; up to 80.9% reported with tools. Primary sources: HuggingFace model card and arXiv paper.
73.5%
42
Mistral Small 4
Mistral ·
Blog/Mistral
· 2026-03-16
Strong GPQA Diamond result with efficient output length.
71.2%
43
Llama 4 Maverick
Meta ·
HuggingFace/Meta
· 2026-04-05
17B active params, 128 experts, 400B total. Released alongside Scout on April 5. MoE architecture, natively multimodal.
69.8%
44
Phi-4-reasoning-plus
Microsoft ·
HuggingFace/Microsoft
· 2025-04-30
14B SFT+RL model. Outperforms DeepSeek-R1-Distill-70B on most benchmarks.
68.9%
45
gpt-oss-20b
OpenAI ·
HuggingFace/openai/gpt-oss-20b
· 2025-08-05
Without tool augmentation. HuggingFace model card primary source. 20.9B total params, 3.6B active.
67.1%
46
Phi-4-reasoning
Microsoft ·
HuggingFace/Microsoft
· 2025-04-30
14B model beating o1-mini (60.0%) and QwQ-32B (59.5%) on graduate-level science.
65.8%
47
Gemma 4 4B
Google ·
Google Model Card
· 2026-04-02
Strong graduate-level reasoning for a 4B edge model.
58.6%
48
Llama 4 Scout
Meta ·
HuggingFace/Meta
· 2026-04-05
17B active params, 16 experts. 10M context window. Smaller expert count than Maverick.
57.2%
49
Gemma 4 E2B
Google ·
Model Card/Google
· 2026-04-02
Ultra-compact model with Per-Layer Embeddings. Competitive with models 10x its size on science reasoning.
43.4%