VisualWebArena Leaderboard 2026 — Results Across 9 Real AI Models

9 models tested · Updated 2026-04-09 · Verified sources only

Gemini 3 Pro leads at 49.0%

Gemini 3 Pro

Google DeepMind · arxiv/2604.07776 · 2026-04-09

Top score on VisualWebArena. Tested as a web agent.

49.0%

Gemini 3.1 Pro

Google DeepMind · arxiv/2604.07776 · 2026-04-09

Tested as a web agent. Second-highest score on VisualWebArena (VWA), after Gemini 3 Pro.

47.9%

Qwen3.5 27B

Qwen · arxiv/2604.07776 · 2026-04-09

Tested as a web agent in the structured distillation paper.

37.4%

Qwen 3.5 27B

Qwen · arxiv/2604.07776 · 2026-04-09

The Agent-as-Annotators framework distills web agent capabilities from Gemini 3 Pro into a 9B-parameter student. A3-Qwen3.5-9B matches Qwen3.5-27B on WebArena (41.5% success rate), beating GPT-4o (31.5%).

37.4%

A3-Qwen3.5-9B

Ai2 · arxiv/2604.07776 · 2026-04-09

9B open-weight model trained via structured distillation from Gemini 3 Pro. First strong open-weight VWA result.

33.9%

A3-Qwen3.5-4B

Ai2 · arxiv/2604.07776 · 2026-04-09

4B model trained via structured distillation. Strong generalization to visual web tasks.

30.1%

Qwen3.5-9B (base)

Qwen · arxiv/2604.07776 · 2026-04-09

The Agent-as-Annotators framework distills web agent capabilities from Gemini 3 Pro into a 9B-parameter student. A3-Qwen3.5-9B matches Qwen3.5-27B on WebArena (41.5% success rate), beating GPT-4o (31.5%).

28.5%

GPT-4o

OpenAI · arxiv/2604.07776 · 2026-04-09

The Agent-as-Annotators framework distills web agent capabilities from Gemini 3 Pro into a 9B-parameter student. A3-Qwen3.5-9B matches Qwen3.5-27B on WebArena (41.5% success rate), beating GPT-4o (31.5%).

26.3%

Claude 3.5 Sonnet

Anthropic · arxiv/2604.07776 · 2026-04-09

The Agent-as-Annotators framework distills web agent capabilities from Gemini 3 Pro into a 9B-parameter student. A3-Qwen3.5-9B matches Qwen3.5-27B on WebArena (41.5% success rate), beating GPT-4o (31.5%).

22.0%