MMLU-Pro
29 models tested · Updated 2026-04-05 · Verified sources only
Gemini 3.1 Pro leads at 90.99%
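Scores below are self-reported by each source, and evaluation setups (prompting, chain-of-thought, sampling) vary between them. As a reference point, here is a minimal sketch of how an MMLU-Pro accuracy is typically computed, using the public TIGER-Lab/MMLU-Pro dataset on Hugging Face; query_model is a hypothetical stand-in for whatever inference call you use, so reproduced numbers may differ from the table:

```python
import re
from datasets import load_dataset

# MMLU-Pro: ~12k multiple-choice questions across 14 subjects, up to 10 options each.
ds = load_dataset("TIGER-Lab/MMLU-Pro", split="test")

LETTERS = "ABCDEFGHIJ"

def format_prompt(row):
    options = "\n".join(f"{LETTERS[i]}. {opt}" for i, opt in enumerate(row["options"]))
    return (f"Question: {row['question']}\n{options}\n"
            "Reply with the letter of the correct option, e.g. 'Answer: C'.")

def extract_choice(text):
    # Take the model's last committed answer; an unparseable reply counts as wrong.
    hits = re.findall(r"Answer:\s*([A-J])", text)
    return hits[-1] if hits else None

correct = 0
for row in ds:
    reply = query_model(format_prompt(row))  # hypothetical: substitute your model call
    correct += extract_choice(reply) == row["answer"]  # "answer" holds the gold letter

print(f"MMLU-Pro accuracy: {correct / len(ds):.2%}")
```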
1. 90.99% · Google · Blog/Google · 2026-02-19
   Highest MMLU-Pro score reported; leads all models on this benchmark.
2. 90.0% · OpenAI · Blog/OpenAI · 2025-08-05
   Matches o4-mini on general knowledge. Runs on a single 80GB GPU despite 116.8B total params (5.1B active via MoE); see the footprint sketch after the list.
3. 88.5% · Alibaba · Blog/Qwen · 2026-04-02
   Source describes it as leading the MMLU-Pro leaderboard among available models as of early April 2026.
4. 87.8% · Alibaba · HuggingFace/Qwen · 2026-02-16
   Broad knowledge and reasoning. 397B total params, 17B active.
5. 87.7% · ByteDance · Blog/ByteDance · 2026-03-10
   The Seed 2.0 Lite variant, yet it scores higher than Seed 2.0 Pro (87.0) on this benchmark.
6. 87.1% · Moonshot AI · HuggingFace/moonshotai · 2026-01-27
   Strong knowledge reasoning from this 1T-parameter open-source MoE. Thinking mode enabled.
7. 87.0% · xAI · Artificial Analysis · 2025-07-09
   Joint-highest MMLU-Pro score at time of release. Verified by Artificial Analysis.
8. 87.0% · ByteDance · Blog/ByteDance · 2026-02-14
   Strong knowledge breadth. The Seed 2.0 Lite variant (rank 5) scores slightly higher at 87.7.
9. 86.1% · Alibaba · Model Card/Qwen · 2026-02-24
   Dense multimodal model with 262k native context.
10. 85.3% · OpenAI · Blog/OpenAI · 2025-08-05
   Only 3.6B active params. Matches o3-mini on general knowledge; best-in-class among sub-30B models.
11. 85.2% · Google · Model Card/Google · 2026-04-02
   Score updated from the official model card. Dense 31B, instruction-tuned.
12. 85.0% · DeepSeek · Blog/DeepSeek · 2025-12-01
   Strong result on this knowledge benchmark; competitive with top models.
13. 84.4% · StepFun · HuggingFace/stepfun-ai · 2026-02-02
   Open-weight MoE with 11B active params. Strong general knowledge for its size.
14. 83.4%
   Solid general reasoning for a 13B-active MoE. Competitive with much larger models.
15. 82.6% · Google · Model Card/Google · 2026-04-02
   MoE with 3.8B active params. Strong efficiency.
16. 82.5% · Alibaba · HuggingFace/Qwen · 2026-02-28
   A 9B-parameter model that beats GPT-OSS-120B (80.8); best-in-class among small open models.
17. 82.0% · Anthropic · Blog/Anthropic · 2026-02-05
   Strong general knowledge, but trails Gemini 3.1 Pro and GPT-5 on this benchmark.
18. 81.7% · Sarvam AI · HuggingFace/sarvamai · 2026-03-06
   India's first domestically trained 105B model. MoE with 10.3B active params. Apache 2.0.
19. 81.6% · JD.com · Paper/JD.com (arXiv) · 2026-04-03
   48B MoE with 2.7B active params. Competitive with 9B-class models at a fraction of the compute.
20. 80.5% · Meta · HuggingFace/Meta · 2026-04-05
   17B active params, 128 experts, 400B total (see the compute sketch after the list). Instruction-tuned score from the official model card.
21. 79.2% · Anthropic · Blog/Anthropic · 2026-02-17
   Broad knowledge and reasoning. Score from Anthropic's official benchmarks page.
22. 79.1% · Alibaba · HuggingFace/Alibaba · 2026-03-02
   Strong general knowledge for a 4B open-weight model.
23. 78.3% · NVIDIA · Blog/NVIDIA · 2025-12-15
   Hybrid MoE, 30B total params with 3B active. Mamba-Transformer architecture with 1M-token context. Open-weight under the NVIDIA license.
24. 78.0% · Mistral · Blog/Mistral · 2026-03-16
   Efficient small model; competitive with much larger models at lower compute.
25. 76.0% · Microsoft · HuggingFace/Microsoft · 2025-04-30
   SFT+RL variant. 14B params, MIT license, 32k context.
26. 74.3% · Meta · HuggingFace/Meta · 2026-04-05
   Instruction-tuned score from the official model card.
27. 74.3% · Microsoft · HuggingFace/Microsoft · 2025-04-30
   14B dense model, MIT license. Strong for its size class.
28. 69.4% · Google · Model Card/Google · 2026-04-02
   Official model card score, up from an earlier 60.0% reported via tweet.
29. 60.0% · Google · Model Card/Google · 2026-04-02
   Ultra-compact: 2.3B active params (5.1B total) with PLE (Per-Layer Embeddings). Fits in 1.5GB quantized (see the footprint sketch below). Strong for its size class.
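Two of the footprint claims above (ranks 2 and 29) can be sanity-checked with back-of-envelope arithmetic. The ~4-bit weight assumption below is illustrative, not taken from either source:

```python
# Weight memory ~ params * bits_per_weight / 8 bytes; ignores KV cache and activations.
def weight_gb(params, bits):
    return params * bits / 8 / 1e9

# Rank 2: 116.8B total params on a single 80GB GPU.
for bits in (16, 8, 4):
    print(f"116.8B at {bits:>2}-bit: ~{weight_gb(116.8e9, bits):.0f} GB")
# -> ~234 GB, ~117 GB, ~58 GB: only ~4-bit weights fit a single 80GB GPU with headroom.

# Rank 29: 5.1B total params, ~2.3B resident if PLE keeps per-layer embeddings off-accelerator.
print(f"5.1B at 4-bit: ~{weight_gb(5.1e9, 4):.2f} GB; 2.3B resident: ~{weight_gb(2.3e9, 4):.2f} GB")
# -> ~2.55 GB vs ~1.15 GB, consistent with the reported ~1.5GB quantized footprint.
```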
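Several entries quote both total and active parameter counts. For MoE models, per-token compute tracks active params while weight memory tracks total params, which is why rank 20 can be served economically despite its 400B size. A rough illustration using rank 20's figures (the 2N-FLOPs-per-token rule is a standard dense-transformer approximation, not a number from the source):

```python
total, active = 400e9, 17e9  # rank 20: 400B total, 17B active

# A forward pass costs roughly 2 FLOPs per active parameter per token.
print(f"compute/token: ~{2 * active / 1e9:.0f} GFLOPs (a dense 400B would need ~{2 * total / 1e9:.0f})")
# All experts must still be held in memory, however many fire per token.
print(f"16-bit weights: ~{total * 2 / 1e9:.0f} GB resident either way")
# -> compute/token: ~34 GFLOPs (a dense 400B would need ~800); weights: ~800 GB.
```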