GPQA Diamond
49 models tested · Updated 2026-04-07 · Verified sources only
Claude Mythos Preview leads at 94.6%
1
Anthropic · Blog/Anthropic · 2026-04-07
New SOTA, edging out Gemini 3.1 Pro (94.3%). PhD-level science reasoning.
94.6%
2
Google · PCMag/Google DeepMind · 2026-02-19
Highest GPQA Diamond score recorded at time of release. Leads 13 of 16 major benchmarks per Google DeepMind.
94.3%
3
OpenAI · Blog/OpenAI · 2025-12-11
OpenAI Pro-tier model. Surpasses PhD experts (69.7%) by a large margin. Just below Gemini 3.1 Pro (94.3%).
93.2%
4
OpenAI · Blog/OpenAI · 2026-03-05
GPT-5.4. Marginal improvement over GPT-5.2 (92.4%).
92.8%
5
GPT-5.2 Thinking variant. +4.3 points over GPT-5.1. No tools, max reasoning effort.
92.4%
6
Google · Blog/Google · 2025-11-18
Announced with Gemini 3 launch. PhD-level reasoning, top of LMArena at 1501 Elo.
91.9%
7
OpenAI · Artificial Analysis · 2026-02-05
Third-highest GPQA Diamond score at time of release. Codex-native agent pairing frontier coding with general reasoning.
91.5%
8
Graduate-level science reasoning. Within 1.1 points of GPT-5.2.
91.3%
9
Alibaba · Blog/Qwen · 2026-04-02
Qwen 3.6 Plus scores 90.4% on GPQA Diamond, the highest Qwen result on the board.
90.4%
10
Google · Blog/Google · 2025-12-17
Frontier PhD-level reasoning at Flash-tier pricing and latency. Matches larger models.
90.4%
11
Anthropic · Blog/Anthropic · 2026-02-17
Averaged over 10 trials with adaptive thinking. Within 2.5 points of GPT-5.2.
89.9%
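Several entries on this board are reported as a mean over repeated runs ("averaged over 10 trials", "averaged over 8 runs"). That computation can be sketched directly; the per-trial counts below are invented for illustration, but the question count is real — GPQA Diamond has 198 questions, each scored correct/incorrect.

```python
# Minimal sketch of multi-trial GPQA Diamond scoring as described in
# some leaderboard entries. Trial counts are hypothetical.
N_QUESTIONS = 198  # size of the GPQA Diamond set


def trial_accuracy(correct: int, total: int = N_QUESTIONS) -> float:
    """Accuracy of a single trial, as a percentage."""
    return 100.0 * correct / total


def averaged_score(correct_counts: list[int]) -> float:
    """Mean accuracy across independent trials (the figure boards report)."""
    return sum(trial_accuracy(c) for c in correct_counts) / len(correct_counts)


# Hypothetical counts for 10 trials:
counts = [178, 177, 179, 178, 178, 177, 178, 179, 178, 178]
print(round(averaged_score(counts), 1))  # → 89.9
```

Averaging accuracies over trials with equal question counts is equivalent to pooling all trials' correct answers, but the per-trial form matches how labs describe the protocol.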
12
Meta · Blog/Meta · 2026-04-08
Strong PhD-level reasoning. Behind Gemini 3.1 Pro (94.3%) but ahead of GLM-5.1 (86.2%).
89.5%
13
ByteDance · Blog/ByteDance · 2026-02-14
ByteDance frontier model. Gold medals on ICPC, IMO, CMO. Competes with GPT-5.2.
88.9%
14
Alibaba · HuggingFace/Qwen · 2026-02-16
397B total, 17B active MoE. Native multimodal with vision. Top open-weight model on reasoning.
88.4%
15
xAI · Artificial Analysis · 2025-07-09
Highest GPQA Diamond score at time of release. Verified by Artificial Analysis.
88.0%
16
OpenAI · Blog/OpenAI · 2026-03-17
Small/cheap GPT-5.4 variant approaching full GPT-5.4 (92.8%) on reasoning. 2x faster than GPT-5 mini.
88.0%
17
OpenAI · Blog/OpenAI · 2025-04-16
OpenAI reasoning model. Strong scientific knowledge but below Gemini 3.1 Pro (94.3%) and Claude Opus 4.6 (91.3%).
87.7%
18
Moonshot AI · HuggingFace/moonshotai · 2026-01-27
1T MoE, 32B active. Open-source. Averaged over 8 runs. Strong but trails Gemini 3.1 Pro (94.3%) and Claude Opus 4.6 (91.3%).
87.6%
19
Google · Model Card/Google · 2026-03-03
Lightweight model rivaling larger Gemini 3 series on reasoning. 2.5x faster than 2.5 Flash.
86.9%
20
MiniMax · Artificial Analysis · 2026-03-18
Self-improving model that ran 100+ autonomous optimization cycles during training. Competitive with Kimi K2.5 on science reasoning.
86.2%
21
Zhipu AI · Blog/Z.AI · 2026-04-07
Strong reasoning for a 744B MoE open-weight model. Behind Gemini 3.1 Pro (94.3%) and GPT-5.4 (92.8%).
86.2%
22
Zhipu AI · Paper/Zhipu AI (arxiv:2602.15763) · 2026-02-11
Open-weight 744B MoE (40B active). Competitive with proprietary models, trails only Gemini 3 Pro and GPT-5.2.
86.0%
23
Alibaba · X/@rohanpaul_ai · 2026-02-24
Highest recorded result among open-weight models under 40B parameters on GPQA Diamond.
85.8%
24
Google · X/@rohanpaul_ai · 2026-04-02
Second-highest among open-weight models under 40B parameters on GPQA Diamond.
85.7%
25
DeepSeek · Blog/DeepSeek · 2025-12-01
Deep reasoning variant. On par with Gemini 3 Pro on scientific reasoning.
85.7%
26
Zhipu AI · NVIDIA/Zhipu Official · 2025-12-22
Open-weight, 400B params. Strong science reasoning for an open model.
85.7%
27
Alibaba · HuggingFace/Alibaba · 2026-02-20
Dense 27B model, natively multimodal. Competitive with much larger frontier models on graduate-level reasoning.
85.5%
28
MiniMax · Blog/MiniMax · 2026-02-12
Strong graduate-level reasoning. Between Gemini 3 Pro (91.0%) and Claude Sonnet 4.5 (83.0%).
85.2%
29
ByteDance · Blog/ByteDance · 2026-03-10
Strong for a mid-tier model. 262K context window, multimodal.
85.1%
30
Google · Model Card/Google · 2026-04-02
From Google model card. Dense 31B instruction-tuned.
84.3%
31
StepFun · HuggingFace/stepfun-ai · 2026-02-02
Open-weight MoE (196B total, 11B active). From official model card.
83.5%
32
OpenAI · Blog/OpenAI · 2026-03-17
Smallest GPT-5.4 variant. Strong GPQA for its cost tier ($0.20/1M input tokens).
82.8%
33
DeepSeek · Blog/DeepSeek · 2025-12-01
Graduate-level science reasoning. Solid mid-frontier performance.
82.4%
34
Google · Google — Gemma 4 Blog · 2026-04-02
MoE with 3.8B active params. Outperforms Sonnet 4.6 (74%). Runs on 24GB hardware.
82.3%
35
Alibaba · HuggingFace/Qwen · 2026-03-02
9B params. Outperforms GPT-OSS-120B without tools (73.5%). Best reasoning-per-parameter ratio among open models.
81.7%
36
OpenAI · Blog/OpenAI · 2025-04-16
Strong scientific reasoning from OpenAI cost-efficient reasoning model. Competitive with much larger models.
81.4%
37
Sarvam AI · HuggingFace/sarvamai · 2026-03-06
India's first domestically trained 105B model. MoE with 10.3B active params. Apache 2.0.
78.7%
38
Trinity-Large-Thinking: 398B MoE, only 13B active params. Open-weight, Apache 2.0.
76.3%
39
Alibaba · HuggingFace/Alibaba · 2026-03-02
Remarkable for a 4B model. Matches larger Qwen3-80B-A3B on reasoning.
76.2%
40
Zhipu AI · HuggingFace/Zhipu · 2026-01-15
30B-A3B MoE, 3.6B active params. Competitive with much larger models on graduate-level reasoning.
75.2%
41
OpenAI · HuggingFace/openai/gpt-oss-120b · 2025-08-05
Score from the HuggingFace model card and arXiv paper (primary sources). Up to 80.9% reported with tool augmentation.
73.5%
42
Mistral · Blog/Mistral · 2026-03-16
Achieves its score with efficient output length. Released March 16, 2026.
71.2%
43
Meta · HuggingFace/Meta · 2026-04-05
17B active params, 128 experts, 400B total. Released alongside Scout on April 5. MoE architecture, natively multimodal.
69.8%
44
Microsoft · HuggingFace/Microsoft · 2025-04-30
14B SFT+RL model. Outperforms DeepSeek-R1-Distill-70B on most benchmarks.
68.9%
45
OpenAI · HuggingFace/openai/gpt-oss-20b · 2025-08-05
Without tool augmentation. HuggingFace model card primary source. 20.9B total params, 3.6B active.
67.1%
46
Microsoft · HuggingFace/Microsoft · 2025-04-30
14B model beating o1-mini (60.0%) and QwQ-32B (59.5%) on graduate-level science.
65.8%
47
Google · Google Model Card · 2026-04-02
Strong graduate-level reasoning for a 4B edge model.
58.6%
48
Meta · HuggingFace/Meta · 2026-04-05
17B active params, 16 experts. 10M context window. Smaller expert count than Maverick.
57.2%
49
Google · Model Card/Google · 2026-04-02
Ultra-compact model with Per-Layer Embeddings. Competitive with models 10x its size on science reasoning.
43.4%