WebVoyager Leaderboard 2026 — Results Across 18 Real AI Models

18 models tested · Updated 2026-03-26 · Verified sources only

Alumnium MCP + Claude Code leads at 98.5%

Alumnium MCP + Claude Code

Alumnium · Blog/Alumnium · 2026-03-26

New WebVoyager SOTA. MCP server + Claude Code general-purpose agent. ~$5 total cost for 610 tasks. Reproducible code available.

98.5%

Surfer 2

H Company · Blog/H Company · 2025-10-17

Cross-platform computer-use agent. SOTA on WebVoyager at time of release. Separates planning from execution with orchestrator + sub-agents.

97.1%

Gemini 3.1 Pro

Google DeepMind · arxiv/2604.08516 · 2026-04-09

Google computer-use-preview mode on WebVoyager. Tested in the MolmoWeb paper.

88.6%

Gemini computer-use-preview

arxiv · arxiv/2604.08516 · 2026-04-09

Fully open web agents (4B/8B) that outperform similar-scale open models and even SoM agents built on GPT-4o. MolmoWeb-8B achieves 78.2% on WebVoyager and 35.3% on Online-Mind2Web pass@1.

88.6%

GLM-5V-Turbo

Zhipu AI · X/@VaibhavSisinty + Z.ai launch · 2026-04-01

Vision-language model with native multimodal input. Marginal lead over Claude Opus 4.6 (88.0%) on browser navigation tasks.

88.5%

GUI-Owl-1.5-32B-Thinking

Alibaba · arxiv/2602.16855 · 2026-02-15

Strong web navigation. 32B thinking model for multi-platform GUI tasks.

82.1%

SoM Agent + o3

arxiv · arxiv/2604.08516 · 2026-04-09

79.3%

MolmoWeb 8B

Allen AI · Blog/Allen AI · 2026-03-24

Open-weight 8B browser agent. With test-time scaling (pass@4): 94.7%. Outperforms Fara-7B and GPT-4o-based agents.

78.2%

MolmoWeb-8B

arxiv · arxiv/2604.08516 · 2026-04-09

Fully open 8B web agent. Achieves 78.2% on WebVoyager and 35.3% on Online-Mind2Web pass@1.

78.2%

GUI-Owl-1.5-8B-Thinking

Alibaba · arxiv/2602.16855 · 2026-02-15

Strong web navigation score for an 8B model.

78.1%

MolmoWeb 4B

Allen AI · arxiv/2604.08516 · 2026-04-09

4B open-weight visual web agent. Outperforms all other open-weight models on WebVoyager including 7B models (Fara, Holo1, UI-Tars).

75.2%

MolmoWeb-4B

arxiv · arxiv/2604.08516 · 2026-04-09

Fully open 4B web agent. Outperforms similar-scale open models.

75.2%

Fara-7B

arxiv · arxiv/2604.08516 · 2026-04-09

73.5%

GPT-5.4

OpenAI · arxiv/2604.08516 · 2026-04-09

OpenAI computer-use-preview mode on WebVoyager. Scores below MolmoWeb-4B (75.2%).

70.9%

OpenAI computer-use-preview

arxiv · arxiv/2604.08516 · 2026-04-09

70.9%

GLM-4.1V-9B-Thinking

arxiv · arxiv/2604.08516 · 2026-04-09

66.8%

UI-TARS-1.5-7B

arxiv · arxiv/2604.08516 · 2026-04-09

66.4%

SoM Agent + GPT-4o

arxiv · arxiv/2604.08516 · 2026-04-09

65.1%
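
The entries above mix single-attempt scores with one pass@4 figure (MolmoWeb 8B, 94.7%) and one cost figure (about $5 for 610 tasks). As a rough illustration of how such numbers relate, the Python sketch below computes an average cost per task and a pass@k success rate from per-task attempt outcomes. The function names and the toy outcome data are hypothetical and do not come from any of the cited sources. It also assumes that pass@k means a task counts as solved if any of its first k attempts succeeds; individual papers may define or estimate it differently.

```python
from typing import Sequence


def cost_per_task(total_cost_usd: float, num_tasks: int) -> float:
    """Average cost per task in US dollars."""
    if num_tasks <= 0:
        raise ValueError("num_tasks must be positive")
    return total_cost_usd / num_tasks


def pass_at_k(outcomes: Sequence[Sequence[bool]], k: int) -> float:
    """Fraction of tasks solved by at least one of the first k attempts.

    outcomes[i][j] is True if attempt j on task i succeeded.
    Assumed definition for illustration only.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if not outcomes:
        raise ValueError("outcomes must not be empty")
    solved = sum(1 for attempts in outcomes if any(attempts[:k]))
    return solved / len(outcomes)


# Figures quoted in the Alumnium entry: about $5 for 610 tasks.
per_task = cost_per_task(5.0, 610)
print(f"approx. cost per task: ${per_task:.4f}")

# Hypothetical toy data: 4 tasks, 4 attempts each (not real benchmark results).
toy_outcomes = [
    [True, False, True, True],
    [False, False, True, False],
    [False, False, False, False],
    [False, True, False, False],
]
print(f"pass@1: {pass_at_k(toy_outcomes, 1):.2f}")
print(f"pass@4: {pass_at_k(toy_outcomes, 4):.2f}")
```

On the toy data, only one task is solved on the first attempt while three are solved within four attempts, which is why a pass@4 figure should not be compared directly with the single-attempt scores in the ranking.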