WebArena leaderboard
3 models tested · Updated 2025-10-17 · Verified sources only
Surfer 2 leads at 69.6%
1. Surfer 2: 69.6%
H Company · Blog/H Company · 2025-10-17
Agent architecture with separated planning and execution. Higher than raw model scores (GPT-5.4 at 67.3%) due to agentic orchestration.

2. GPT-5.4: 67.3%
OpenAI · Blog/OpenAI · 2026-03-05
WebArena-Verified. Uses both DOM- and screenshot-driven interaction. Improves over GPT-5.2 (65.4%).

3. Kimi K2.5: 58.9%
Moonshot AI · Blog/Kimi · 2026-01-27
Web browsing agent benchmark.