SWE-bench Pro
13 models tested · Updated 2026-04-07 · Verified sources only
Claude Mythos Preview leads at 77.8%
| Rank | Org | Source | Date | Score | Notes |
|---|---|---|---|---|---|
| 1 | Anthropic | Blog/Anthropic | 2026-04-07 | 77.8% | 24.4pp above Opus 4.6 (53.4%). A major jump on the harder coding eval. |
| 2 | Z.ai | X/@Zai_org | 2026-04-07 | 58.4% | #1 open source, #3 globally. Beats GPT-5.4 (57.7%) and Claude Opus 4.6 (57.3%). First Chinese model to top SWE-bench Pro. No Nvidia hardware. |
| 3 | OpenAI | Blog/OpenAI | 2026-03-05 | 57.7% | Harder coding benchmark variant. GPT-5.4 unifies coding, reasoning, and computer use in a single model. |
| 4 | OpenAI | Web/OpenAI | 2026-03-05 | 56.8% | Led SWE-bench Pro at release. 25% faster than GPT-5.2 Codex. Covers 1,865 multi-language tasks. |
| 5 | MiniMax | Blog/MiniMax | 2026-03-18 | 56.2% | Matches GPT-5.3 Codex level. Approaches Claude's best on SWE-bench Pro. |
| 6 | OpenAI | Blog/OpenAI | 2026-03-17 | 54.4% | Smallest GPT-5.4 variant. 2x faster than GPT-5 mini at similar coding quality. |
| 7 | Google | Blog/Google DeepMind | 2026-02-19 | 54.2% | Competitive with GPT-5.4 (57.7%); behind Claude Mythos Preview (77.8%). |
| 8 | OpenAI | Blog/OpenAI | 2026-03-17 | 52.4% | Fastest, cheapest GPT-5.4 variant. Designed for classification, data extraction, and coding subagents. |
| 9 | Meta | Blog/Meta | 2026-04-08 | 52.0% | Trails Claude Mythos (77.8%) significantly but is competitive with other frontier models. |
| 10 | Moonshot AI | HuggingFace/Moonshot | 2026-01-29 | 50.7% | Scored via an internal evaluation framework with a minimal tool set. 1T-parameter MoE, 32B active. |
| 11 | ByteDance | Blog/ByteDance | 2026-02-14 | 46.9% | Flagship coding model from ByteDance's Seed team. |
| 12 | ByteDance | Blog/ByteDance | 2026-03-10 | 46.0% | Nearly matches the Pro variant (46.9%) on the harder SWE subset. |
| 13 | Alibaba | Blog/Alibaba | 2026-02-03 | 44.3% | Competitive with much larger models on the harder SWE subset. |