GPQA Diamond leaderboard
GPQA Diamond
49 models tested · Updated 2026-04-07 · Verified sources only
Claude Mythos Preview
leads at
94.6%
1
Claude Mythos Preview
Anthropic ·
Blog/Anthropic
· 2026-04-07
New SOTA, edging out Gemini 3.1 Pro (94.3%). PhD-level science reasoning.
94.6%
2
Gemini 3.1 Pro
Google ·
PCMag/Google DeepMind
· 2026-02-19
Highest GPQA Diamond score at release; since edged out by Claude Mythos Preview (94.6%). Leads 13 of 16 major benchmarks per Google DeepMind.
94.3%
3
GPT-5.2 Pro
OpenAI ·
Blog/OpenAI
· 2025-12-11
OpenAI Pro-tier model. Surpasses PhD experts (69.7%) by a large margin. Just below Gemini 3.1 Pro (94.3%).
93.2%
4
GPT-5.4
OpenAI ·
Blog/OpenAI
· 2026-03-05
Marginal improvement over GPT-5.2 (92.4%).
92.8%
5
GPT-5.2
OpenAI ·
OpenAI — GPT-5.2 for Science and Math
· 2025-12-11
GPT-5.2 Thinking variant. +4.3% over GPT-5.1. No tools, max reasoning effort.
92.4%
6
Gemini 3 Pro
Google ·
Blog/Google
· 2025-11-18
Announced with Gemini 3 launch. PhD-level reasoning, top of LMArena at 1501 Elo.
91.9%
7
GPT-5.3 Codex
OpenAI ·
Artificial Analysis
· 2026-02-05
Third-highest GPQA Diamond score at release. Codex-native agent pairing frontier coding with general reasoning.
91.5%
8
Claude Opus 4.6
Anthropic ·
Anthropic — Introducing Claude Opus 4.6
· 2026-02-05
Graduate-level science reasoning. Within 1.1 points of GPT-5.2.
91.3%
9
Qwen 3.6 Plus
Alibaba ·
Blog/Qwen
· 2026-04-02
Highest Qwen-family score on GPQA Diamond to date.
90.4%
10
Gemini 3 Flash
Google ·
Blog/Google
· 2025-12-17
Frontier PhD-level reasoning at Flash-tier pricing and latency. Matches larger models.
90.4%
11
Claude Sonnet 4.6
Anthropic ·
Blog/Anthropic
· 2026-02-17
Averaged over 10 trials with adaptive thinking. Within 2.5 pts of GPT-5.2.
89.9%
12
Muse Spark
Meta ·
Blog/Meta
· 2026-04-08
Strong PhD-level reasoning. Ahead of GLM-5.1 (86.2%), behind Gemini 3.1 Pro (94.3%).
89.5%
13
Seed 2.0 Pro
ByteDance ·
Blog/ByteDance
· 2026-02-14
ByteDance frontier model. Gold medals on ICPC, IMO, CMO. Competes with GPT-5.2.
88.9%
14
Qwen 3.5 397B
Alibaba ·
HuggingFace/Qwen
· 2026-02-16
397B total, 17B active MoE. Native multimodal with vision. Top open-weight model on reasoning.
88.4%
15
Grok 4
xAI ·
Artificial Analysis
· 2025-07-09
All-time high GPQA Diamond at time of release. Verified by Artificial Analysis.
88.0%
16
GPT-5.4 Mini
OpenAI ·
Blog/OpenAI
· 2026-03-17
Small/cheap GPT-5.4 variant approaching full GPT-5.4 (92.8%) on reasoning. 2x faster than GPT-5 mini.
88.0%
17
o3
OpenAI ·
Blog/OpenAI
· 2025-04-16
OpenAI reasoning model. Strong scientific knowledge but below Gemini 3.1 Pro (94.3%) and Claude Opus 4.6 (91.3%).
87.7%
18
Kimi K2.5
Moonshot AI ·
HuggingFace/moonshotai
· 2026-01-27
1T MoE, 32B active. Open-source. Averaged over 8 runs. Strong but trails Gemini 3.1 Pro (94.3%) and Claude Opus 4.6 (91.3%).
87.6%
19
Gemini 3.1 Flash-Lite
Google ·
Model Card/Google
· 2026-03-03
Lightweight model rivaling larger Gemini 3 series on reasoning. 2.5x faster than 2.5 Flash.
86.9%
20
MiniMax M2.7
MiniMax ·
Artificial Analysis
· 2026-03-18
Self-improving model that ran 100+ autonomous optimization cycles during training. Competitive with Kimi K2.5 on science reasoning.
86.2%
21
GLM-5.1
Zhipu AI ·
Blog/Z.AI
· 2026-04-07
Strong reasoning for a 744B MoE open-weight model. Behind Gemini 3.1 Pro (94.3%) and GPT-5.4 (92.8%).
86.2%
22
GLM-5
Zhipu AI ·
Paper/Zhipu AI (arxiv:2602.15763)
· 2026-02-11
Open-weight 744B MoE (40B active). Competitive with proprietary models, trailing frontier systems such as Gemini 3 Pro and GPT-5.2.
86.0%
23
Qwen3.5 27B (Reasoning)
Alibaba ·
X/@rohanpaul_ai
· 2026-02-24
Highest recorded result on GPQA Diamond among open-weight models under 40B parameters.
85.8%
24
Gemma 4 31B (Reasoning)
Google ·
X/@rohanpaul_ai
· 2026-04-02
Second-highest result on GPQA Diamond among open-weight models under 40B parameters.
85.7%
25
DeepSeek V3.2 Speciale
DeepSeek ·
Blog/DeepSeek
· 2025-12-01
Deep reasoning variant. DeepSeek claims parity with Gemini 3 Pro on scientific reasoning.
85.7%
26
GLM-4.7
Zhipu AI ·
NVIDIA/Zhipu Official
· 2025-12-22
Open-weight, 400B params. Strong science reasoning for an open model.
85.7%
27
Qwen3.5 27B
Alibaba ·
HuggingFace/Alibaba
· 2026-02-20
Dense 27B model, natively multimodal. Competitive with much larger frontier models on graduate-level reasoning.
85.5%
28
MiniMax M2.5
MiniMax ·
Blog/MiniMax
· 2026-02-12
Strong graduate-level reasoning. Between Gemini 3 Pro (91.9%) and Claude Sonnet 4.5 (83.0%).
85.2%
29
Seed 2.0 Lite
ByteDance ·
Blog/ByteDance
· 2026-03-10
Strong for a mid-tier model. 262K context window, multimodal.
85.1%
30
Gemma 4 31B
Google ·
Model Card/Google
· 2026-04-02
From Google model card. Dense 31B instruction-tuned.
84.3%
31
Step-3.5-Flash
StepFun ·
HuggingFace/stepfun-ai
· 2026-02-02
Open-weight MoE (196B total, 11B active). From official model card.
83.5%
32
GPT-5.4 Nano
OpenAI ·
Blog/OpenAI
· 2026-03-17
Smallest GPT-5.4 variant. Strong GPQA for its cost tier ($0.20/1M input tokens).
82.8%
33
DeepSeek V3.2
DeepSeek ·
Blog/DeepSeek
· 2025-12-01
Graduate-level science reasoning. Solid mid-frontier performance.
82.4%
34
Gemma 4 26B A4B
Google ·
Google — Gemma 4 Blog
· 2026-04-02
MoE with 3.8B active params; runs on 24GB hardware. Google's blog claims it outperforms Sonnet 4.6 (74% in that comparison).
82.3%
35
Qwen 3.5 9B
Alibaba ·
HuggingFace/Qwen
· 2026-03-02
9B params. Outperforms gpt-oss-120b (80.1%). Best reasoning-per-parameter ratio among open models.
81.7%
36
o4-mini
OpenAI ·
Blog/OpenAI
· 2025-04-16
Strong scientific reasoning from OpenAI's cost-efficient reasoning model. Competitive with much larger models.
81.4%
37
Sarvam 105B
Sarvam AI ·
HuggingFace/sarvamai
· 2026-03-06
India's first domestically trained 105B model. MoE with 10.3B active params. Apache 2.0.
78.7%
38
Arcee Trinity
Arcee AI ·
HuggingFace/arcee-ai/Trinity-Large-Thinking
· 2026-04-01
Trinity-Large-Thinking: 398B MoE, only 13B active params. Open-weight, Apache 2.0.
76.3%
39
Qwen3.5-4B
Alibaba ·
HuggingFace/Alibaba
· 2026-03-02
Remarkable for a 4B model. Matches larger Qwen3-80B-A3B on reasoning.
76.2%
40
GLM-4.7-Flash
Zhipu AI ·
HuggingFace/Zhipu
· 2026-01-15
30B-A3B MoE, 3.6B active params. Competitive with much larger models on graduate-level reasoning.
75.2%
41
gpt-oss-120b
OpenAI ·
HuggingFace/openai/gpt-oss-120b
· 2025-08-05
Without tool augmentation; up to 80.9% reported with tools. Primary sources: HuggingFace model card and arXiv paper.
73.5%
42
Mistral Small 4
Mistral ·
Blog/Mistral
· 2026-03-16
Strong GPQA Diamond result with efficient output length.
71.2%
43
Llama 4 Maverick
Meta ·
HuggingFace/Meta
· 2026-04-05
17B active params, 128 experts, 400B total. Released alongside Scout on April 5. MoE architecture, natively multimodal.
69.8%
44
Phi-4-reasoning-plus
Microsoft ·
HuggingFace/Microsoft
· 2025-04-30
14B SFT+RL model. Outperforms DeepSeek-R1-Distill-70B on most benchmarks.
68.9%
45
gpt-oss-20b
OpenAI ·
HuggingFace/openai/gpt-oss-20b
· 2025-08-05
Without tool augmentation. HuggingFace model card primary source. 20.9B total params, 3.6B active.
67.1%
46
Phi-4-reasoning
Microsoft ·
HuggingFace/Microsoft
· 2025-04-30
14B model beating o1-mini (60.0%) and QwQ-32B (59.5%) on graduate-level science.
65.8%
47
Gemma 4 4B
Google ·
Google Model Card
· 2026-04-02
Strong graduate-level reasoning for a 4B edge model.
58.6%
48
Llama 4 Scout
Meta ·
HuggingFace/Meta
· 2026-04-05
17B active params, 16 experts. 10M context window. Smaller expert count than Maverick.
57.2%
49
Gemma 4 E2B
Google ·
Model Card/Google
· 2026-04-02
Ultra-compact model with Per-Layer Embeddings. Competitive with models 10x its size on science reasoning.
43.4%