OSWorld Leaderboard 2026 — Results Across 31 Real AI Models

31 models tested · Updated 2026-04-07 · Verified sources only

Claude Mythos Preview leads at 79.6%

Claude Mythos Preview

Anthropic · Blog/Anthropic · 2026-04-07

New state of the art on desktop agent tasks, 6.9 percentage points above Claude Opus 4.6 (72.7%).

79.6%
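
The gap quoted for this entry is a difference in percentage points, not a relative change. A minimal sketch of the distinction, using the two scores from this entry; the function names are illustrative and not taken from any source:

```python
def point_gap(score: float, baseline: float) -> float:
    """Absolute difference between two scores, in percentage points."""
    return round(score - baseline, 2)


def relative_gain(score: float, baseline: float) -> float:
    """Improvement relative to the baseline score, in percent."""
    return round((score - baseline) / baseline * 100, 2)


mythos, opus_46 = 79.6, 72.7  # scores listed on this leaderboard
print(point_gap(mythos, opus_46))      # 6.9
print(relative_gain(mythos, opus_46))  # 9.49
```

The same helper reproduces the +5.3 points listed for Claude Opus 4.7 (78.0%) over Opus 4.6.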

Holo3-122B-A10B

H Company · Blog/H Company · 2026-03-31

Took the #1 spot on OSWorld-Verified at release, surpassing GPT-5.4 (75.0%). Achieved with only 10B active parameters (122B total, MoE). H Company is a Paris-based startup focused on autonomous desktop agents.

78.85%

GPT-5.5

OpenAI · OpenAI Blog · 2026-04-23

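Reported on the OSWorld-Verified variant. Narrowly beats Claude Opus 4.7 (78.0%), but sits below Claude Mythos Preview (79.6%) and Holo3-122B-A10B (78.85%) on this leaderboard.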

78.7%

Claude Opus 4.7

Anthropic · Blog/Anthropic · 2026-04-16

New OSWorld SOTA for a generally available model. +5.3 pts over Opus 4.6. 3x higher vision resolution helps computer-use agents.

78.0%

Holo3-35B-A3B

H Company · Blog/H Company · 2026-03-31

Smaller Holo3 variant with only 3B active params. Still surpasses GPT-5.4 on OSWorld at a fraction of the cost.

77.8%

GPT-5.4

OpenAI · Blog/OpenAI · 2026-03-05

Scores 75.0% on OSWorld desktop tasks, above the human baseline (~72.4%).

75.0%

Kimi K2.6

Moonshot AI · HuggingFace/moonshotai · 2026-04-20

Open-weight 1T/32B MoE. OSWorld-Verified score. First open model to exceed 73% on OSWorld.

73.1%

Claude Opus 4.6

Anthropic · Blog/Anthropic · 2026-02-05

Matches the human baseline (~72%). Anthropic's best computer-use model at release.

72.7%

Gemini 3.1 Pro

Google · arxiv/Mythos-System-Card · 2026-04-07

Desktop agent benchmark. Tied with Claude Opus 4.6, below Claude Mythos (79.6%) and GPT-5.4 (75.0%).

72.7%

Agent S3 + Behavior Best-of-N

Simular AI · X/@SimularAgents · 2026-04-20

Agent framework surpassing human baseline (~72%) on OSWorld. Agent S3 alone reaches 66%.

72.6%

Claude Sonnet 4.6

Anthropic · Blog/Anthropic · 2026-02-17

Sonnet-class model matching human performance level on desktop tasks (humans ~72.4%). Previously only Opus-class models achieved this.

72.5%

GPT-5.4 Mini

OpenAI · Blog/OpenAI · 2026-03-17

Just below human baseline (72.4%) on desktop automation. Remarkable for a mini-tier model.

72.1%

GPT-5.3 Codex

OpenAI · Blog/OpenAI · 2026-02-05

Computer-use agent benchmark. Trails GPT-5.4 (75.0%) but competitive for a coding-specialized model. Human baseline ~72%.

64.0%

Kimi K2.5

Moonshot AI · Blog/Kimi · 2026-01-27

Desktop automation score, trailing GPT-5.4 (75.0%) and Claude Opus 4.6 (72.7%).

63.3%

Claude Sonnet 4.5

arxiv · arxiv/2603.20633 · 2026-03-21

Comparison score for Claude Sonnet 4.5 as reported in the Seed1.8 paper (arxiv/2603.20633). Anthropic's own reported figure is 61.4%.

62.9%

Seed1.8

arxiv · arxiv/2603.20633 · 2026-03-21

Seed1.8 achieves 61.9% on OSWorld and 85.9% on OnlineMind2Web, surpassing Claude Sonnet 4.5 and GPT-O3-CUA on computer use tasks. Also achieves 70.7% on AndroidWorld.

61.9%

Claude Sonnet 4.5

Anthropic · Blog/Anthropic · 2025-09-29

Frontier-model performance on real-world computer tasks.

61.4%

CoACT-1

USC/Salesforce/UW · Paper/arXiv · 2025-08-01

Research system combining GUI operator (CUA 4o) with coding-as-actions programmer (o4-mini). SOTA on OSWorld at time of publication.

60.76%

Surfer 2

H Company · Blog/H Company · 2025-11-25

Agent using third-party frontier models. pass@10 reaches 77%, surpassing human baseline (72.4%).

60.1%
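
The source does not define pass@10. One common reading is that a task counts as solved if any of 10 attempts succeeds. Under that assumption, a minimal sketch of the metric looks like this; the function name and the toy outcomes are made up for illustration and are not Surfer 2 data:

```python
from typing import Sequence


def pass_at_k(outcomes: Sequence[Sequence[bool]], k: int) -> float:
    """Fraction of tasks solved by at least one of the first k attempts."""
    if not outcomes:
        raise ValueError("need at least one task")
    solved = sum(1 for attempts in outcomes if any(attempts[:k]))
    return solved / len(outcomes)


# Toy data (illustrative only): 4 tasks, 3 attempts each.
toy = [
    [True, False, True],
    [False, False, True],
    [False, False, False],
    [False, True, False],
]
print(pass_at_k(toy, 1))  # 0.25
print(pass_at_k(toy, 3))  # 0.75
```

Under this reading, the single-attempt score and pass@k can differ widely, which would be consistent with the 60.1% and 77% figures quoted for this entry.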

GUI-Owl-1.5-32B-Instruct

Alibaba · arxiv/2602.16855 · 2026-02-15

Best open-weight OSWorld result for 32B class. Multi-platform GUI agent.

56.5%

Qwen3.5 27B

Alibaba · HuggingFace/Qwen · 2026-02-16

Desktop-use agent benchmark. Decent for a 27B open-weight model.

56.2%

GUI-Owl-1.5-32B-Thinking

Alibaba · arxiv/2602.16855 · 2026-02-15

Slightly below Instruct variant on OSWorld. Thinking helps more on web tasks.

56.0%

GUI-Owl-1.5-8B-Thinking

Alibaba · arxiv/2602.16855 · 2026-02-15

Open-source 8B model. Described as a top open GUI agent on OSWorld for its size, competitive with much larger models.

52.9%

S3 + IntentScore

arxiv · arxiv/2604.05157 · 2026-04-06

Plan-aware reward model that scores candidate actions for computer-use agents. S3+IntentScore achieves 52.1% on OSWorld, +6.9 over baseline, by overriding nearly half of action decisions.

52.1%
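
The idea of scoring candidate actions and overriding the agent's default choice can be sketched roughly as follows. This is not the paper's implementation: the scorer is a stand-in lookup table rather than a learned reward model, and the convention that the first candidate is the agent's default choice is an assumption made for this example.

```python
from typing import Callable, Sequence


def pick_action(
    candidates: Sequence[str], score: Callable[[str], float]
) -> tuple[str, bool]:
    """Return the highest-scoring candidate and whether it overrides
    the default choice (assumed here to be candidates[0])."""
    if not candidates:
        raise ValueError("need at least one candidate action")
    best = max(candidates, key=score)
    return best, best != candidates[0]


# Stand-in scorer: a fixed table instead of a learned reward model.
toy_scores = {"click(Save)": 0.2, "click(Export)": 0.9, "type('report')": 0.4}
action, overridden = pick_action(list(toy_scores), toy_scores.__getitem__)
print(action, overridden)  # click(Export) True
```

Counting how often `overridden` is true over a run would give an override rate comparable in spirit to the "nearly half of action decisions" figure in this entry.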

Claude Haiku 4.5

Anthropic · Blog/Anthropic · 2025-10-15

Notable computer-use capability for a small/fast tier model.

50.7%

AutoGLM-OS-9B

Zhipu AI · arxiv/Zhipu AI · 2025-08-19

9B open model beating OpenAI CUA o3 (42.9%) and Claude 4.0 (30.7%) on desktop tasks. Uses ComputerRL framework with API-GUI paradigm.

48.9%

S3 (baseline)

arxiv · arxiv/2604.05157 · 2026-04-06

Baseline agent for the IntentScore comparison. Adding IntentScore, a plan-aware reward model that scores candidate actions, raises the score by 6.9 points to 52.1%.

45.2%

GPT-5.4 Nano

OpenAI · Blog/OpenAI · 2026-03-17

Desktop automation. Limited by reasoning depth for long-horizon agentic tasks.

39.0%

GPT-O3-CUA

arxiv · arxiv/2603.20633 · 2026-03-21

Comparison score reported in the Seed1.8 paper (arxiv/2603.20633), where Seed1.8 reaches 61.9% on OSWorld.

38.1%

Seed1.5-VL

arxiv · arxiv/2603.20633 · 2026-03-21

Comparison score reported in the Seed1.8 paper (arxiv/2603.20633), where Seed1.8 reaches 61.9% on OSWorld.

36.7%

Gemini 2.5 Pro

arxiv · arxiv/2603.20633 · 2026-03-21

Comparison score reported in the Seed1.8 paper (arxiv/2603.20633), where Seed1.8 reaches 61.9% on OSWorld.

13.3%