SWE-bench Pro leaderboard
13 models tested · Updated 2026-04-07 · Verified sources only
Claude Mythos Preview leads at 77.8%
1. Claude Mythos Preview: 77.8%
   Anthropic · Blog/Anthropic · 2026-04-07
   24.4pp above Opus 4.6 (53.4%). A massive jump on this harder coding eval.
2. GLM-5.1: 58.4%
   Z.ai · X/@Zai_org · 2026-04-07
   #1 open-source and #3 globally at release; the first Chinese model to top SWE-bench Pro. Beats GPT-5.4 (57.7%) and Claude Opus 4.6 (53.4%). Zero Nvidia hardware.
3. GPT-5.4: 57.7%
   OpenAI · Blog/OpenAI · 2026-03-05
   Harder coding benchmark variant. GPT-5.4 folds coding, reasoning, and computer use into one unified model.
4. GPT-5.3 Codex: 56.8%
   OpenAI · Web/OpenAI · 2026-03-05
   Led SWE-bench Pro at release. 25% faster than GPT-5.2 Codex. Covers 1,865 multi-language tasks.
5. MiniMax M2.7: 56.2%
   MiniMax · Blog/MiniMax · 2026-03-18
   Roughly matches GPT-5.3 Codex (56.8%) and edges past Claude Opus 4.6 (53.4%) on SWE-bench Pro.
6. GPT-5.4 Mini: 54.4%
   OpenAI · Blog/OpenAI · 2026-03-17
   Compact GPT-5.4 variant; 2x faster than GPT-5 mini at similar coding quality.
7. Gemini 3.1 Pro: 54.2%
   Google · Blog/Google DeepMind · 2026-02-19
   Competitive with GPT-5.4 (57.7%), but well behind Claude Mythos Preview (77.8%).
8. GPT-5.4 Nano: 52.4%
   OpenAI · Blog/OpenAI · 2026-03-17
   Fastest, cheapest GPT-5.4 variant. Designed for classification, data extraction, and coding subagents.
9. Muse Spark: 52.0%
   Meta · Blog/Meta · 2026-04-08
   Trails Claude Mythos Preview (77.8%) significantly, but is competitive with other frontier models.
10. Kimi K2.5: 50.7%
    Moonshot AI · HuggingFace/Moonshot · 2026-01-29
    Scored with Moonshot's internal evaluation framework and a minimal tool set. 1T-parameter MoE with 32B active.
11. Seed 2.0 Pro: 46.9%
    ByteDance · Blog/ByteDance · 2026-02-14
    Flagship coding model from ByteDance's Seed team.
12. Seed 2.0 Lite: 46.0%
    ByteDance · Blog/ByteDance · 2026-03-10
    Nearly matches the Pro variant (46.9%) on the harder SWE subset.
13. Qwen3-Coder-Next: 44.3%
    Alibaba · Blog/Alibaba · 2026-02-03
    Competitive with much larger models on this harder SWE subset.
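The percentage-point gaps quoted in the entries above can be checked directly against the listed scores. A minimal sketch (scores taken from this table; Claude Opus 4.6 is quoted in the rank-1 entry but not ranked here):

```python
# Resolve rates (%) as listed in the leaderboard above.
scores = {
    "Claude Mythos Preview": 77.8,
    "GLM-5.1": 58.4,
    "GPT-5.4": 57.7,
    "GPT-5.3 Codex": 56.8,
    "Claude Opus 4.6": 53.4,  # quoted in the rank-1 entry, not ranked here
}

def gap_pp(a: str, b: str) -> float:
    """Percentage-point gap between two models' resolve rates."""
    return round(scores[a] - scores[b], 1)

# Mythos Preview's lead over Opus 4.6, as quoted in the table:
print(gap_pp("Claude Mythos Preview", "Claude Opus 4.6"))  # 24.4
```

Running the same check on any other pair (e.g. GLM-5.1 over GPT-5.4 gives 0.7pp) confirms the margins stated in the blurbs.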