MMLU-Pro
29 models tested · Updated 2026-04-05 · Verified sources only
Gemini 3.1 Pro leads at 90.99%
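Scores below are self-reported by each source, and evaluation setups (prompting, chain-of-thought, sampling) vary between them. As a reference point, here is a minimal sketch of how an MMLU-Pro accuracy is typically computed, using the public TIGER-Lab/MMLU-Pro dataset on Hugging Face; query_model is a hypothetical stand-in for whatever inference call you use, so reproduced numbers may differ from the table:

```python
import re
from datasets import load_dataset

# MMLU-Pro: ~12k multiple-choice questions across 14 subjects, up to 10 options each.
ds = load_dataset("TIGER-Lab/MMLU-Pro", split="test")

LETTERS = "ABCDEFGHIJ"

def format_prompt(row):
    options = "\n".join(f"{LETTERS[i]}. {opt}" for i, opt in enumerate(row["options"]))
    return (f"Question: {row['question']}\n{options}\n"
            "Reply with the letter of the correct option, e.g. 'Answer: C'.")

def extract_choice(text):
    # Take the model's last committed answer; an unparseable reply counts as wrong.
    hits = re.findall(r"Answer:\s*([A-J])", text)
    return hits[-1] if hits else None

correct = 0
for row in ds:
    reply = query_model(format_prompt(row))  # hypothetical: substitute your model call
    correct += extract_choice(reply) == row["answer"]  # "answer" holds the gold letter

print(f"MMLU-Pro accuracy: {correct / len(ds):.2%}")
```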
1. 90.99% · Google · Blog/Google · 2026-02-19
   Highest MMLU-Pro score reported; leads all models on this benchmark.
2. 90.0% · OpenAI · Blog/OpenAI · 2025-08-05
   Matches o4-mini on general knowledge. Runs on a single 80GB GPU despite 116.8B total params (5.1B active via MoE); see the footprint sketch after the list.
3. 88.5% · Alibaba · Blog/Qwen · 2026-04-02
   Source describes it as leading the MMLU-Pro leaderboard among available models as of early April 2026.
4. 87.8% · Alibaba · HuggingFace/Qwen · 2026-02-16
   Broad knowledge and reasoning. 397B total params, 17B active.
5. 87.7% · ByteDance · Blog/ByteDance · 2026-03-10
   The Seed 2.0 Lite variant, yet it scores higher than Seed 2.0 Pro (87.0) on this benchmark.
6. 87.1% · Moonshot AI · HuggingFace/moonshotai · 2026-01-27
   Strong knowledge reasoning from this 1T-parameter open-source MoE. Thinking mode enabled.
7. 87.0% · xAI · Artificial Analysis · 2025-07-09
   Joint-highest MMLU-Pro score at time of release. Verified by Artificial Analysis.
8. 87.0% · ByteDance · Blog/ByteDance · 2026-02-14
   Strong knowledge breadth. The Seed 2.0 Lite variant (rank 5) scores slightly higher at 87.7.
9. 86.1% · Alibaba · Model Card/Qwen · 2026-02-24
   Dense multimodal model with 262k native context.
10. 85.3% · OpenAI · Blog/OpenAI · 2025-08-05
   Only 3.6B active params. Matches o3-mini on general knowledge; best-in-class among sub-30B models.
11. 85.2% · Google · Model Card/Google · 2026-04-02
   Score updated from the official model card. Dense 31B, instruction-tuned.
12. 85.0% · DeepSeek · Blog/DeepSeek · 2025-12-01
   Strong result on this knowledge benchmark; competitive with top models.
13. 84.4% · StepFun · HuggingFace/stepfun-ai · 2026-02-02
   Open-weight MoE with 11B active params. Strong general knowledge for its size.
14. 83.4%
   Solid general reasoning for a 13B-active MoE. Competitive with much larger models.
15. 82.6% · Google · Model Card/Google · 2026-04-02
   MoE with 3.8B active params. Strong efficiency.
16. 82.5% · Alibaba · HuggingFace/Qwen · 2026-02-28
   A 9B-parameter model that beats GPT-OSS-120B (80.8); best-in-class among small open models.
17. 82.0% · Anthropic · Blog/Anthropic · 2026-02-05
   Strong general knowledge, but trails Gemini 3.1 Pro and GPT-5 on this benchmark.
18. 81.7% · Sarvam AI · HuggingFace/sarvamai · 2026-03-06
   India's first domestically trained 105B model. MoE with 10.3B active params. Apache 2.0.
19. 81.6% · JD.com · Paper/JD.com (arXiv) · 2026-04-03
   48B MoE with 2.7B active params. Competitive with 9B-class models at a fraction of the compute.
20. 80.5% · Meta · HuggingFace/Meta · 2026-04-05
   17B active params, 128 experts, 400B total (see the compute sketch after the list). Instruction-tuned score from the official model card.
21. 79.2% · Anthropic · Blog/Anthropic · 2026-02-17
   Broad knowledge and reasoning. Score from Anthropic's official benchmarks page.
22. 79.1% · Alibaba · HuggingFace/Alibaba · 2026-03-02
   Strong general knowledge for a 4B open-weight model.
23. 78.3% · NVIDIA · Blog/NVIDIA · 2025-12-15
   Hybrid MoE, 30B total params with 3B active. Mamba-Transformer architecture with 1M-token context. Open-weight under the NVIDIA license.
24. 78.0% · Mistral · Blog/Mistral · 2026-03-16
   Efficient small model; competitive with much larger models at lower compute.
25. 76.0% · Microsoft · HuggingFace/Microsoft · 2025-04-30
   SFT+RL variant. 14B params, MIT license, 32k context.
26. 74.3% · Meta · HuggingFace/Meta · 2026-04-05
   Instruction-tuned score from the official model card.
27. 74.3% · Microsoft · HuggingFace/Microsoft · 2025-04-30
   14B dense model, MIT license. Strong for its size class.
28. 69.4% · Google · Model Card/Google · 2026-04-02
   Official model card score, up from an earlier 60.0% reported via tweet.
29. 60.0% · Google · Model Card/Google · 2026-04-02
   Ultra-compact: 2.3B active params (5.1B total) with PLE (Per-Layer Embeddings). Fits in 1.5GB quantized (see the footprint sketch below). Strong for its size class.
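Two of the footprint claims above (ranks 2 and 29) can be sanity-checked with back-of-envelope arithmetic. The ~4-bit weight assumption below is illustrative, not taken from either source:

```python
# Weight memory ~ params * bits_per_weight / 8 bytes; ignores KV cache and activations.
def weight_gb(params, bits):
    return params * bits / 8 / 1e9

# Rank 2: 116.8B total params on a single 80GB GPU.
for bits in (16, 8, 4):
    print(f"116.8B at {bits:>2}-bit: ~{weight_gb(116.8e9, bits):.0f} GB")
# -> ~234 GB, ~117 GB, ~58 GB: only ~4-bit weights fit a single 80GB GPU with headroom.

# Rank 29: 5.1B total params, ~2.3B resident if PLE keeps per-layer embeddings off-accelerator.
print(f"5.1B at 4-bit: ~{weight_gb(5.1e9, 4):.2f} GB; 2.3B resident: ~{weight_gb(2.3e9, 4):.2f} GB")
# -> ~2.55 GB vs ~1.15 GB, consistent with the reported ~1.5GB quantized footprint.
```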
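Several entries quote both total and active parameter counts. For MoE models, per-token compute tracks active params while weight memory tracks total params, which is why rank 20 can be served economically despite its 400B size. A rough illustration using rank 20's figures (the 2N-FLOPs-per-token rule is a standard dense-transformer approximation, not a number from the source):

```python
total, active = 400e9, 17e9  # rank 20: 400B total, 17B active

# A forward pass costs roughly 2 FLOPs per active parameter per token.
print(f"compute/token: ~{2 * active / 1e9:.0f} GFLOPs (a dense 400B would need ~{2 * total / 1e9:.0f})")
# All experts must still be held in memory, however many fire per token.
print(f"16-bit weights: ~{total * 2 / 1e9:.0f} GB resident either way")
# -> compute/token: ~34 GFLOPs (a dense 400B would need ~800); weights: ~800 GB.
```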