MMMU Pro leaderboard
15 models tested · Updated 2026-03-05 · Verified sources only
GPT-5.4 leads at 81.2%
| # | Model | Score | Org | Source | Date | Notes |
|---|-------|-------|-----|--------|------|-------|
| 1 | GPT-5.4 | 81.2% | OpenAI | Blog/OpenAI | 2026-03-05 | Visual understanding and reasoning. Without tool use; reasoning effort xhigh. |
| 2 | Gemini 3 Flash | 81.0% | Google | Blog/Google | 2025-12-17 | Multimodal reasoning at Flash-tier cost. Competitive with Gemini 3 Pro. |
| 3 | Gemini 3.1 Pro | 80.5% | Google | Official/Google DeepMind | 2026-02-19 | Multimodal understanding without tools. Strong visual reasoning. |
| 4 | Muse Spark | 80.5% | Meta | Blog/Meta AI | 2026-04-08 | Strong multimodal showing for Meta Superintelligence Labs debut model. |
| 5 | Qwen 3.5 397B | 79.0% | Alibaba | HuggingFace/Qwen | 2026-02-16 | Native vision-language model. Strong multimodal reasoning. |
| 6 | Kimi K2.5 | 78.5% | Moonshot AI | Blog/Kimi | 2026-01-27 | Strong multimodal reasoning for an open-weight model. |
| 7 | Gemma 4 31B | 76.9% | Google | Model Card/Google | 2026-04-02 | Multimodal vision benchmark. Up from 49.7% on Gemma 3 27B. |
| 8 | Gemini 3.1 Flash-Lite | 76.8% | Google | Model Card/Google | 2026-03-03 | Budget tier ($0.25/1M input) yet competitive on multimodal benchmarks. |
| 9 | Qwen 3.5 27B | 75.0% | Alibaba | HuggingFace/Qwen | 2026-02-16 | Multimodal understanding. Strong for a small model; competitive with some late-2025 frontier scores. |
| 10 | Gemma 4 26B A4B | 73.8% | Google | Model Card/Google | 2026-04-02 | MoE multimodal. The 31B dense model reaches 76.9%. |
| 11 | Qwen 3.5 9B | 70.1% | Alibaba | HuggingFace/Qwen | 2026-03-02 | 9B params. Outperforms Gemini 2.5 Flash-Lite (59.7%) on visual reasoning. Strong for its size class. |
| 12 | Llama 4 Maverick | 59.6% | Meta | HuggingFace/Meta | 2026-04-05 | Official model card score. 17B active params, 128 experts. |
| 13 | Gemma 4 4B | 52.6% | Google | Model Card/Google | 2026-04-02 | Multimodal understanding from a 4B model. Strong for edge deployment. |
| 14 | Llama 4 Scout | 52.2% | Meta | HuggingFace/Meta | 2026-04-05 | Official model card score. 17B active params, 16 experts, 10M context. |
| 15 | Gemma 4 E2B | 44.2% | Google | Model Card/Google | 2026-04-02 | Multimodal reasoning at 2.3B active params. Natively multimodal from pretraining. |
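For readers who want to work with these rankings programmatically, here is a minimal sketch of how the leaderboard data could be modeled and re-ranked. The `Entry` dataclass and the subset of rows shown are illustrative assumptions, not an official data format from this site; only the model names, orgs, and scores come from the table above.

```python
from dataclasses import dataclass

@dataclass
class Entry:
    model: str
    org: str
    score: float  # MMMU Pro accuracy, in percent

# A few rows from the leaderboard above (scores as published).
entries = [
    Entry("GPT-5.4", "OpenAI", 81.2),
    Entry("Gemini 3 Flash", "Google", 81.0),
    Entry("Muse Spark", "Meta", 80.5),
    Entry("Gemini 3.1 Pro", "Google", 80.5),
]

# Rank by score, descending. Python's sort is stable, so models that
# tie on score (e.g. the two 80.5% entries) keep their input order.
ranked = sorted(entries, key=lambda e: e.score, reverse=True)
for rank, e in enumerate(ranked, start=1):
    print(f"{rank}. {e.model} ({e.org}): {e.score}%")
```

Sorting by score alone reproduces the published order; a tiebreaker (date, model name) could be added to the key if deterministic ordering across re-shuffled inputs matters.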