WebArena
3 models tested · Updated 2025-10-17 · Verified sources only
Surfer 2 leads at 69.6%
1
H Company · Blog/H Company · 2025-10-17
Agent architecture that separates planning from execution. Scores higher than the raw GPT-5.4 model (67.3%), which is attributed to agentic orchestration.
69.6%
2
OpenAI · Blog/OpenAI · 2026-03-05
Evaluated on WebArena-Verified. Uses both DOM- and screenshot-driven interaction. Improves over GPT-5.2 (65.4%).
67.3%
3
Moonshot AI · Blog/Kimi · 2026-01-27
Web browsing agent benchmark.
58.9%
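The gaps between these entries can be worked out with a few lines of Python. This is only a rough sketch: the scores are copied from the entries above, while the labels and the helper name `gaps_to_leader` are illustrative and not taken from any source. The OpenAI entry is reported on WebArena-Verified, so the differences may not be a strict like-for-like comparison.

```python
# Scores copied from the leaderboard entries above.
# Labels are illustrative; the third entry's model name is not given in the listing.
entries = [
    ("Surfer 2 (H Company)", 69.6),
    ("GPT-5.4 (OpenAI)", 67.3),
    ("Moonshot AI entry (Kimi blog)", 58.9),
]


def gaps_to_leader(rows):
    """Sort by score (descending) and attach each entry's gap to the
    top score, in percentage points, rounded to one decimal place."""
    ranked = sorted(rows, key=lambda row: row[1], reverse=True)
    top_score = ranked[0][1]
    return [(label, score, round(top_score - score, 1)) for label, score in ranked]


table = gaps_to_leader(entries)
for rank, (label, score, gap) in enumerate(table, start=1):
    print(f"{rank}. {label}: {score:.1f}%, {gap:.1f} points behind the leader")

# Improvement of the OpenAI entry over the GPT-5.2 figure quoted in its note.
gpt52_score = 65.4
gain_over_gpt52 = round(67.3 - gpt52_score, 1)
print(f"Gain over GPT-5.2: {gain_over_gpt52:.1f} points")
```

Differences are expressed in percentage points rather than relative percent, which matches how the notes above compare scores.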