OSWorld leaderboard
16 models tested · Updated 2026-04-07 · Verified sources only
Claude Mythos Preview leads at 79.6%
1
Claude Mythos Preview
Anthropic · Blog/Anthropic · 2026-04-07
New state of the art on desktop agent tasks, 6.9 percentage points above Claude Opus 4.6 (72.7%).
79.6%
2
Holo3-122B-A10B
H Company · Blog/H Company · 2026-03-31
Ranked #1 on OSWorld-Verified at release, surpassing GPT-5.4 (75.0%). Achieved with only 10B active parameters (122B total, MoE). H Company is a Paris-based startup focused on autonomous desktop agents.
78.85%
3
Holo3-35B-A3B
H Company · Blog/H Company · 2026-03-31
Smaller Holo3 variant with only 3B active params. Still surpasses GPT-5.4 on OSWorld at a fraction of the cost.
77.8%
4
GPT-5.4
OpenAI · Blog/OpenAI · 2026-03-05
Achieves 75% on the OSWorld benchmark for desktop tasks, surpassing human-level performance.
75.0%
5
Claude Opus 4.6
Anthropic · Blog/Anthropic · 2026-02-05
Matches the human baseline (~72%). Anthropic's best computer-using model at the time of release.
72.7%
6
Claude Sonnet 4.6
Anthropic · Blog/Anthropic · 2026-02-17
A Sonnet-class model that matches human-level performance on desktop tasks (human baseline ~72.4%). Previously, only Opus-class models had achieved this.
72.5%
7
GPT-5.4 Mini
OpenAI · Blog/OpenAI · 2026-03-17
Just below human baseline (72.4%) on desktop automation. Remarkable for a mini-tier model.
72.1%
8
GPT-5.3 Codex
OpenAI · Blog/OpenAI · 2026-02-05
Computer-use agent benchmark. Trails GPT-5.4 (75.0%) but competitive for a coding-specialized model. Human baseline ~72%.
64.0%
9
Kimi K2.5
Moonshot AI · Blog/Kimi · 2026-01-27
Desktop automation score, trailing GPT-5.4 (75.0%) and Claude Opus 4.6 (72.7%).
63.3%
10
Claude Sonnet 4.5
Anthropic · Blog/Anthropic · 2025-09-29
Frontier model performance on real-world computer tasks.
61.4%
11
CoACT-1
USC/Salesforce/UW · Paper/arXiv · 2025-08-01
Research system combining GUI operator (CUA 4o) with coding-as-actions programmer (o4-mini). SOTA on OSWorld at time of publication.
60.76%
12
Surfer 2
H Company · Blog/H Company · 2025-11-25
Agent built on third-party frontier models. Its pass@10 score reaches 77%, surpassing the human baseline (72.4%).
60.1%
13
Qwen3.5 27B
Alibaba · HuggingFace/Qwen · 2026-02-16
Desktop-use agent benchmark. Decent for a 27B open-weight model.
56.2%
14
Claude Haiku 4.5
Anthropic · Blog/Anthropic · 2025-10-15
Notable computer-use capability for a small/fast tier model.
50.7%
15
AutoGLM-OS-9B
Zhipu AI · arXiv/Zhipu AI · 2025-08-19
9B open model beating OpenAI CUA o3 (42.9%) and Claude 4.0 (30.7%) on desktop tasks. Uses ComputerRL framework with API-GUI paradigm.
48.9%
16
GPT-5.4 Nano
OpenAI · Blog/OpenAI · 2026-03-17
Desktop automation. Limited by reasoning depth for long-horizon agentic tasks.
39.0%
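Several notes above quote percentage-point gaps and comparisons against a human baseline. The sketch below shows one way to reproduce that arithmetic from the listed scores. It is illustrative only: the `Entry` type and the `pp_gap`, `rank`, and `above_human` helpers are hypothetical names rather than part of any leaderboard tooling, only a subset of rows is included, and the 72.4% baseline is the figure quoted in the entry notes.

```python
from dataclasses import dataclass

# Human baseline in percent, as quoted in several entry notes (assumption:
# the same baseline applies to every row).
HUMAN_BASELINE = 72.4


@dataclass(frozen=True)
class Entry:
    """One leaderboard row (hypothetical structure for illustration)."""
    model: str
    org: str
    score: float  # OSWorld score in percent


# A subset of the rows above, with scores copied from the leaderboard.
ENTRIES = [
    Entry("GPT-5.4", "OpenAI", 75.0),
    Entry("Claude Mythos Preview", "Anthropic", 79.6),
    Entry("GPT-5.4 Nano", "OpenAI", 39.0),
    Entry("Holo3-122B-A10B", "H Company", 78.85),
    Entry("Claude Opus 4.6", "Anthropic", 72.7),
    Entry("GPT-5.4 Mini", "OpenAI", 72.1),
]


def pp_gap(a: float, b: float) -> float:
    """Return a - b in percentage points, rounded to two decimals."""
    return round(a - b, 2)


def rank(entries):
    """Return entries sorted by score, highest first."""
    return sorted(entries, key=lambda e: e.score, reverse=True)


def above_human(entries, baseline=HUMAN_BASELINE):
    """Return model names scoring strictly above the baseline, in rank order."""
    return [e.model for e in rank(entries) if e.score > baseline]


if __name__ == "__main__":
    ranked = rank(ENTRIES)
    leader = ranked[0]
    for position, entry in enumerate(ranked, start=1):
        gap = pp_gap(leader.score, entry.score)
        print(f"{position:>2}  {entry.model:<24} {entry.score:>6.2f}%  "
              f"(-{gap:.2f}pp vs leader)")
    print("Above human baseline:", ", ".join(above_human(ENTRIES)))
```

Percentage points are plain differences between two scores, not relative changes, so the gap between 79.6% and 72.7% is 6.9pp. Rounding in `pp_gap` only guards against floating-point noise in the subtraction.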