CharXiv Leaderboard 2026 — Results Across 23 Real AI Models

23 models tested · Updated 2026-04-16 · Verified sources only

Claude Opus 4.7 leads at 89.0%

Claude Opus 4.7

Anthropic · Blog/Anthropic · 2026-04-16

Strong chart understanding. +10.1 points over Opus 4.6 (78.9). 3x higher vision resolution helps on document and chart tasks.

89.0%

Claude Fable 5

Anthropic · Blog/Anthropic · 2026-06-09

Score as cited in Kimi K3 blog (source: Anthropic official). Fable 5 hit fallbacks on 35% of tasks.

88.9%

CharTool-7B

ByteDance · arxiv/2604.02794 · 2026-04-03

Tool-integrated chart reasoning: +2.04% over Qwen2.5-VL-7B on CharXiv Avg via image cropping and code-based computation with agentic RL.

86.76%

Muse Spark

Meta · Blog/Meta · 2026-04-08

Chart reasoning SOTA. Beats GPT-5.4 (82.8), Gemini 3.1 Pro (80.2), Claude Opus 4.6 (65.3) per Meta announcement.

86.4%

Kimi K3

Moonshot AI · Blog/Moonshot AI · 2026-07-16

CharXiv Reasoning Questions (RQ).

84.8%

GPT-5.6 Sol

OpenAI · Blog/OpenAI · 2026-07-09

Score as cited in Kimi K3 blog (source: OpenAI official).

84.6%

Gemini 3.5 Flash

Google DeepMind · Blog/Google DeepMind · 2026-06-10

Highest among the models compared in the same Google DeepMind post. Up from Gemini 3 Flash (80.3).

84.2%

GPT-5.5

OpenAI · Blog/OpenAI · 2026-06-26

Score as cited in Kimi K3 blog (source: OpenAI official).

84.1%

Gemini 3.1 Pro

Google DeepMind · Blog/Google DeepMind · 2026-06-10

Below Gemini 3.5 Flash (84.2). Above most peers.

83.3%

Claude Opus 4.7

Anthropic · Blog/Google DeepMind · 2026-06-10

Below Gemini 3.5 Flash (84.2). Above Sonnet 4.6 (72.4).

82.1%

CharTool-3B

ByteDance · arxiv/2604.02794 · 2026-04-03

3B tool-integrated chart model: +5.04% over Qwen2.5-VL-3B on CharXiv Avg, using DuoChart data and agentic RL with cropping + code tools.

81.68%

Qwen 3.5 397B

Alibaba · HuggingFace/Qwen · 2026-02-16

Chart understanding and reasoning. Between Claude Opus 4.6 (78.9) and Muse Spark (86.4).

80.8%

Claude Opus 4.8

Anthropic · Blog/Anthropic · 2026-06-17

Score as cited in Kimi K3 blog (source: Anthropic official).

80.5%

Kimi K2.6

Moonshot AI · HuggingFace/moonshotai · 2026-04-20

Score without the Python tool; 86.7 with Python. Native multimodal chart understanding.

80.4%

Gemini 3 Flash

Google DeepMind · Blog/Google DeepMind · 2026-06-10

Below Gemini 3.5 Flash (84.2).

80.3%

Claude Opus 4.6

Anthropic · arxiv/Mythos-System-Card · 2026-04-07

With tools. Chart reasoning on arXiv figures.

78.9%

Qwen 3.6 27B

Alibaba · HuggingFace/Qwen · 2026-04-22

Near Qwen 3.5 397B MoE on document understanding. Beats Claude Opus 4.5.

78.4%

Inkling

Thinking Machines Lab · Blog/Thinking Machines Lab · 2026-07-20

Chart reasoning; below Fable 5 (86.5) but above GLM-5.2. Encoder-free vision architecture.

78.1%

Qwen 3.6 35B-A3B

Alibaba · HuggingFace/Qwen · 2026-04-18

MoE with 3B active of 35B total parameters. Matches Sonnet 4.5 on document chart understanding.

78.0%

Inkling-Small

Thinking Machines Lab · Blog/Thinking Machines Lab · 2026-07-15

CharXiv Reasoning Questions (RQ).

76.7%

Claude Sonnet 4.6

Anthropic · Blog/Google DeepMind · 2026-06-10

Below all frontier models. Gemini 3.5 Flash leads (84.2).

72.4%

EXAONE 4.5 33B

LG AI Research · HuggingFace/LGAI-EXAONE · 2026-04-14

Strong document understanding. Beats GPT-5 mini (68.6) but trails Qwen 3.5 27B (79.5).

71.7%

Claude Opus 4.6

Anthropic · arxiv/Mythos-System-Card · 2026-04-07

No tools. Chart reasoning on arXiv figures.

61.5%
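
Some models appear more than once in the list above, with different sources or tool settings (for example, Claude Opus 4.7 at 89.0% and 82.1%, and Claude Opus 4.6 at 78.9% with tools and 61.5% without). The sketch below shows one way a reader might collapse such rows to a best score per model and measure the spread between them. It is a minimal illustration, not how this leaderboard is built: the `ENTRIES` list holds only a handful of rows copied from above, and the function names `score_spread` and `best_per_model` are hypothetical.

```python
from collections import defaultdict

# A few (model, score in %, setting/source) rows copied from the list above.
# This is a sample for illustration, not the full leaderboard.
ENTRIES = [
    ("Claude Opus 4.7", 89.0, "Blog/Anthropic"),
    ("Claude Opus 4.7", 82.1, "Blog/Google DeepMind"),
    ("Claude Opus 4.6", 78.9, "with tools"),
    ("Claude Opus 4.6", 61.5, "no tools"),
    ("Kimi K2.6", 80.4, "without python tool"),
]


def score_spread(entries):
    """Return {model: max - min score} for models listed more than once."""
    grouped = defaultdict(list)
    for model, score, _note in entries:
        grouped[model].append(score)
    return {
        model: round(max(scores) - min(scores), 1)
        for model, scores in grouped.items()
        if len(scores) > 1
    }


def best_per_model(entries):
    """Keep the highest-scoring row per model, sorted by score descending."""
    best = {}
    for model, score, note in entries:
        if model not in best or score > best[model][0]:
            best[model] = (score, note)
    return sorted(best.items(), key=lambda item: item[1][0], reverse=True)


if __name__ == "__main__":
    for model, (score, note) in best_per_model(ENTRIES):
        print(f"{model}: {score}% ({note})")
    print(score_spread(ENTRIES))
```

Taking the maximum per model is only one possible choice. Because the rows differ in source and tool use, comparing like with like (for example, only no-tool scores) may be more appropriate depending on the question being asked.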