MathVista Leaderboard 2026 — Results Across 14 Real AI Models

14 models tested · Updated 2026-04-14 · Verified sources only

EXAONE 4.5 33B leads at 85.0%

EXAONE 4.5 33B

LG AI Research · HuggingFace/LGAI-EXAONE · 2026-04-14

Beats GPT-5 mini (79.1%). Strong visual math reasoning for a 33B open-weight VLM.

85.0%

Gemini 2.5 Pro

arxiv · arxiv/2604.08539 · 2026-04-09

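Score listed under arxiv/2604.08539, the source also cited for OpenVLThinkerV2. The description attached to that source (an 8B open-weight multimodal model trained with GRPO+GDPO, competitive with Gemini 2.5 Pro on DocVQA and chart understanding, new SOTA for open-weight VLMs on MMMU) appears to refer to the source paper's own model, not to Gemini 2.5 Pro.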

82.7%

Qwen 3 VL 32B Instruct

arxiv · arxiv/2603.03975 · 2026-03-04

Score listed under arxiv/2603.03975, the source also cited for Phi-4-reasoning-vision-15B. The description attached to that source (a compact 15B open-weight multimodal reasoning model from Microsoft) appears to refer to the source paper's own model, not to Qwen 3 VL 32B Instruct.

81.8%

OpenVLThinkerV2 8B

UCLA · arxiv/2604.08539 · 2026-04-09

Open-weight 8B VLM surpassing GPT-4o (63.8%) by 15.7 points on MathVista. Demonstrates strong visual math reasoning.

79.5%

OpenVLThinkerV2

arxiv · arxiv/2604.08539 · 2026-04-09

Introduces Gaussian GRPO (G2RPO), replacing standard linear scaling in GRPO with non-linear distributional matching. OpenVLThinkerV2-7B achieves new SOTA for open-source 7B models on MMMU (71.6%), Mat

79.5%

Vero Q3T-8B

Vero Team · arxiv/2604.04917 · 2026-04-06

Slightly below thinking base (81.4%) but more consistent across all 30 benchmarks.

79.2%

Vero Q3I-8B

Vero Team · arxiv/2604.04917 · 2026-04-06

+1.5 points over the base model. Consistent improvement across visual reasoning tasks via task-routed rewards.

78.7%

RLSD (Qwen3-VL-8B)

arxiv · arxiv/2604.03128 · 2026-04-06

Proposes RLSD combining self-distillation magnitude with RLVR direction. On Qwen3-VL-8B, achieves best avg accuracy across 5 multimodal reasoning benchmarks, outperforming GRPO by 2.32% on average.

78.1%

Qwen 3 VL 8B Instruct

arxiv · arxiv/2603.03975 · 2026-03-04

Score listed under arxiv/2603.03975, the source also cited for Phi-4-reasoning-vision-15B. The description attached to that source (a compact 15B open-weight multimodal reasoning model from Microsoft) appears to refer to the source paper's own model, not to Qwen 3 VL 8B Instruct.

77.1%

Phi-4-reasoning-vision-15B

arxiv · arxiv/2603.03975 · 2026-03-04

Compact 15B open-weight multimodal reasoning model from Microsoft. Achieves competitive VLM performance with much less compute via careful data curation and dynamic-resolution encoders.

75.2%

Qwen 3 VL 8B Instruct

arxiv · arxiv/2604.08539 · 2026-04-09

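Score listed under arxiv/2604.08539, the source also cited for OpenVLThinkerV2. The description attached to that source (an 8B open-weight multimodal model trained with GRPO+GDPO, competitive with Gemini 2.5 Pro on DocVQA and chart understanding, new SOTA for open-weight VLMs on MMMU) appears to refer to the source paper's own model, not to Qwen 3 VL 8B Instruct.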

74.2%

Kimi-VL-A3B-Instruct

arxiv · arxiv/2603.03975 · 2026-03-04

Score listed under arxiv/2603.03975, the source also cited for Phi-4-reasoning-vision-15B. The description attached to that source (a compact 15B open-weight multimodal reasoning model from Microsoft) appears to refer to the source paper's own model, not to Kimi-VL-A3B-Instruct.

67.1%

GPT-4o

arxiv · arxiv/2604.08539 · 2026-04-09

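Baseline score listed under arxiv/2604.08539, the source also cited for OpenVLThinkerV2, which reports surpassing this 63.8%. The description attached to that source (an 8B open-weight multimodal model trained with GRPO+GDPO, competitive with Gemini 2.5 Pro on DocVQA and chart understanding, new SOTA for open-weight VLMs on MMMU) appears to refer to the source paper's own model, not to GPT-4o.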

63.8%

Gemma 3 12B IT

arxiv · arxiv/2603.03975 · 2026-03-04

Score listed under arxiv/2603.03975, the source also cited for Phi-4-reasoning-vision-15B. The description attached to that source (a compact 15B open-weight multimodal reasoning model from Microsoft) appears to refer to the source paper's own model, not to Gemma 3 12B IT.

57.4%
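
The point gaps quoted in the notes above (for example, OpenVLThinkerV2 8B over GPT-4o) can be rechecked from the listed scores. The following is a minimal sketch, not an official tool: the scores are copied from the entries on this page, while the names `ENTRIES` and `point_gap` are illustrative and do not come from any source.

```python
# (model, source, MathVista score in %) copied from the entries on this page.
# ENTRIES and point_gap are hypothetical names used only for this sketch.
ENTRIES = [
    ("EXAONE 4.5 33B", "HuggingFace/LGAI-EXAONE", 85.0),
    ("Gemini 2.5 Pro", "arxiv/2604.08539", 82.7),
    ("Qwen 3 VL 32B Instruct", "arxiv/2603.03975", 81.8),
    ("OpenVLThinkerV2 8B", "arxiv/2604.08539", 79.5),
    ("OpenVLThinkerV2", "arxiv/2604.08539", 79.5),
    ("Vero Q3T-8B", "arxiv/2604.04917", 79.2),
    ("Vero Q3I-8B", "arxiv/2604.04917", 78.7),
    ("RLSD (Qwen3-VL-8B)", "arxiv/2604.03128", 78.1),
    ("Qwen 3 VL 8B Instruct", "arxiv/2603.03975", 77.1),
    ("Phi-4-reasoning-vision-15B", "arxiv/2603.03975", 75.2),
    ("Qwen 3 VL 8B Instruct", "arxiv/2604.08539", 74.2),
    ("Kimi-VL-A3B-Instruct", "arxiv/2603.03975", 67.1),
    ("GPT-4o", "arxiv/2604.08539", 63.8),
    ("Gemma 3 12B IT", "arxiv/2603.03975", 57.4),
]


def point_gap(entries, model, baseline):
    """Return best listed score of `model` minus best listed score of `baseline`.

    A model can appear more than once (different sources report different
    scores), so the highest listed score per model name is used.
    """
    best = {}
    for name, _source, score in entries:
        best[name] = max(score, best.get(name, score))
    # Round to one decimal to match the precision of the listed scores.
    return round(best[model] - best[baseline], 1)


# Rank entries from highest to lowest score (stable for ties).
ranked = sorted(ENTRIES, key=lambda entry: entry[2], reverse=True)

if __name__ == "__main__":
    for rank, (name, source, score) in enumerate(ranked, start=1):
        print(f"{rank:2d}. {name} ({source}): {score:.1f}%")
    gap = point_gap(ENTRIES, "OpenVLThinkerV2 8B", "GPT-4o")
    print(f"OpenVLThinkerV2 8B vs GPT-4o: +{gap} points")
```

Because scores come from different papers, gaps computed this way compare numbers that may have been measured under different evaluation settings.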