OSWorld
16 models tested · Updated 2026-04-07 · Verified sources only
Claude Mythos Preview leads at 79.6%
1
Anthropic · Blog/Anthropic · 2026-04-07
New SOTA. 6.9pp above Opus 4.6 (72.7%). Desktop agent tasks.
79.6%
2
H Company · Blog/H Company · 2026-03-31
Ranked #1 on OSWorld-Verified at release, surpassing GPT-5.4 (75.0%). Achieved with only 10B active parameters (122B total, MoE). H Company is a Paris-based startup focused on autonomous desktop agents.
78.85%
3
H Company · Blog/H Company · 2026-03-31
Smaller Holo3 variant with only 3B active params. Still surpasses GPT-5.4 on OSWorld at a fraction of the cost.
77.8%
4
OpenAI · Blog/OpenAI · 2026-03-05
GPT-5.4 achieves 75.0% on the OSWorld benchmark for desktop tasks, surpassing the human baseline (~72.4%).
75.0%
5
Anthropic · Blog/Anthropic · 2026-02-05
Matches the human baseline (~72.4%). Anthropic's best computer-using model at release.
72.7%
6
Anthropic · Blog/Anthropic · 2026-02-17
Sonnet-class model matching human performance level on desktop tasks (humans ~72.4%). Previously only Opus-class models achieved this.
72.5%
7
OpenAI · Blog/OpenAI · 2026-03-17
Just below human baseline (72.4%) on desktop automation. Remarkable for a mini-tier model.
72.1%
8
OpenAI · Blog/OpenAI · 2026-02-05
Computer-use agent benchmark. Trails GPT-5.4 (75.0%) but is competitive for a coding-specialized model. Human baseline ~72.4%.
64.0%
9
Moonshot AI · Blog/Kimi · 2026-01-27
Desktop automation score, trailing GPT-5.4 (75.0%) and Claude Opus 4.6 (72.7%).
63.3%
10
Anthropic · Blog/Anthropic · 2025-09-29
Frontier model performance on real-world computer tasks.
61.4%
11
USC/Salesforce/UW · Paper/arXiv · 2025-08-01
Research system combining GUI operator (CUA 4o) with coding-as-actions programmer (o4-mini). SOTA on OSWorld at time of publication.
60.76%
12
H Company · Blog/H Company · 2025-11-25
Agent using third-party frontier models. Reaches 77% at pass@10, surpassing the human baseline (72.4%).
60.1%
13
Alibaba · HuggingFace/Qwen · 2026-02-16
Desktop-use agent benchmark. Decent for a 27B open-weight model.
56.2%
14
Anthropic · Blog/Anthropic · 2025-10-15
Notable computer-use capability for a small/fast tier model.
50.7%
15
Zhipu AI · arXiv/Zhipu AI · 2025-08-19
9B open model beating OpenAI CUA o3 (42.9%) and Claude 4.0 (30.7%) on desktop tasks. Uses ComputerRL framework with API-GUI paradigm.
48.9%
16
OpenAI · Blog/OpenAI · 2026-03-17
Desktop automation. Limited by reasoning depth for long-horizon agentic tasks.
39.0%
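
Several notes above quote percentage-point gaps and comparisons to the human baseline. A minimal sketch of that arithmetic follows, keyed by rank because this listing does not give every model name next to its score. The names `SCORES`, `HUMAN_BASELINE`, `pp_gap`, and `ranks_at_or_above` are illustrative assumptions, and the 72.4% baseline is the approximate figure quoted in the entries rather than an independently verified value.

```python
# Scores (percent) copied from the leaderboard above, keyed by rank.
SCORES = {
    1: 79.6, 2: 78.85, 3: 77.8, 4: 75.0,
    5: 72.7, 6: 72.5, 7: 72.1, 8: 64.0,
    9: 63.3, 10: 61.4, 11: 60.76, 12: 60.1,
    13: 56.2, 14: 50.7, 15: 48.9, 16: 39.0,
}

# Approximate human baseline quoted in the entries (assumption: 72.4%).
HUMAN_BASELINE = 72.4


def pp_gap(a: float, b: float) -> float:
    """Percentage-point difference a - b, rounded to two decimals."""
    return round(a - b, 2)


def ranks_at_or_above(threshold: float) -> list[int]:
    """Ranks whose score is at or above the threshold, in rank order."""
    return [rank for rank, score in sorted(SCORES.items()) if score >= threshold]


if __name__ == "__main__":
    # Gap between rank 1 and rank 5, the comparison quoted in the rank 1 note.
    print(pp_gap(SCORES[1], SCORES[5]))
    # Ranks that meet or exceed the quoted human baseline.
    print(ranks_at_or_above(HUMAN_BASELINE))
```

Keying by rank keeps the sketch tied to figures that actually appear in the listing; swapping in model names would require confirming each one against its source.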