SWE-bench Verified
40 models tested · Updated 2026-04-07 · Verified sources only
Claude Mythos Preview leads at 93.9%
1
Anthropic · Blog/Anthropic · 2026-04-07
New SOTA. 13.1pp above Opus 4.6 (80.8%). Largest single-generation coding jump in Anthropic history.
93.9%
2
Anthropic · Blog/Anthropic · 2026-02-05
Updated official score with prompt modification, averaged over 25 trials. Previous entry was 80.8%.
81.42%
3
Anthropic · Blog/Anthropic · 2025-11-24
First model to exceed 80% on SWE-bench Verified. 25 trials avg.
80.9%
4
Google · Official/Google DeepMind · 2026-02-19
Agentic coding benchmark. Verified from official Gemini page.
80.6%
5
MiniMax · Blog/MiniMax · 2026-02-12
MoE, 230B total params (10B active). Matches Claude Opus 4.6's speed on the SWE-bench eval.
80.2%
6
OpenAI · OpenAI — Introducing GPT-5.2 · 2025-12-11
Multi-language bug fixing. SWE-bench Pro score: 55.6% (new SOTA).
80.0%
7
OpenAI · Blog/OpenAI · 2026-02-05
OpenAI coding-focused model. Also scores 56.8% on SWE-bench Pro (harder multilingual variant).
80.0%
8
Anthropic · Blog/Anthropic · 2026-02-17
Scores 1.2 points behind Opus 4.6 and 2.4 points ahead of Sonnet 4.5 on SWE-bench Verified.
79.6%
9
Alibaba · Blog/Qwen · 2026-04-02
Qwen 3.6 Plus scores 78.8% on SWE-bench Verified
78.8%
10
Google · Blog/Google · 2025-12-17
Distilled model outperforms Gemini 3 Pro (76.2%) on agentic coding. Less than a quarter the cost of Pro.
78.0%
11
Zhipu · Zhipu — GLM-5 Developer Docs · 2026-02-11
Leading open-model score. Only 3 points behind Opus 4.6. Terminal-Bench 2.0: 56.2%.
77.8%
12
Zhipu · X/@grok · 2026-03-27
Grok bot relayed the Zhipu GLM-5.1 score; no direct link to a paper or official source.
77.8%
13
Meta · Blog/Meta · 2026-04-08
First model from Meta Superintelligence Labs. Competitive with Gemini 3.1 Pro on agentic coding.
77.4%
14
Anthropic · Blog/Anthropic · 2025-09-29
77.2% standard, 78.2% with 1M context, 82.0% with parallel compute. SOTA at time of release.
77.2%
15
OpenAI · Vals.ai · 2026-03-05
Independent Vals.ai evaluation. OpenAI dropped SWE-bench Verified in favor of SWE-bench Pro (57.7%).
77.2%
16
Moonshot AI · HuggingFace/moonshotai · 2026-01-27
1T-parameter open-source MoE. Competitive with Claude Sonnet 4.6 (79.6%). Uses an internal eval framework with bash/createfile tools.
76.8%
17
ByteDance · Blog/ByteDance · 2026-02-14
Competitive with Claude Sonnet 4.5 and DeepSeek V3.2 on coding tasks.
76.5%
18
Alibaba · HuggingFace/Qwen · 2026-02-16
Strong open-weight coding performance. 17B active params, MoE architecture.
76.4%
19
OpenAI · Blog/OpenAI · 2025-08-07
Below GPT-5.4 (77.2%) and Claude Opus 4.6 (81.4%). Coding was not GPT-5's main strength.
74.9%
20
Fujitsu Research · Blog/Fujitsu Research · 2026-04-08
Third-party harness engineering by Fujitsu: test-time scaling with 8 candidate patches, no fine-tuning. Highest score among models under 229B params.
74.8%
21
StepFun · Blog/StepFun · 2026-02-12
Open-weight MoE with only 11B active params. Competitive with proprietary mid-tier models on coding.
74.4%
22
xAI · X/@grok · 2026-03-31
Top tier alongside Claude and GPT models. Fast inference and a large context window.
74.0%
23
Zhipu AI · NVIDIA/Zhipu Official · 2025-12-22
Open-weight, 400B params. Strong coding for an open model but behind GLM-5/5.1 successors.
73.8%
24
ByteDance · Blog/ByteDance · 2026-03-10
Near Pro-level performance (76.5%) at roughly half the cost.
73.5%
25
Anthropic · Blog/Anthropic · 2025-10-15
Anthropic's first Haiku with extended thinking. Matches Sonnet 4 on coding at one-third the cost.
73.3%
26
DeepSeek · Paper/DeepSeek (arxiv:2512.02556) · 2025-12-01
Primary score from DeepSeek technical report. Robustness tests across frameworks range 72-74%. Trails Claude Opus 4.6 (80.8%) and GPT-5.3 Codex (80.0%).
73.1%
27
xAI · X/@grok · 2026-03-31
Grok bot stated a ~70-75% range; midpoint used. No official xAI source linked.
72.5%
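The recorded figure is just the midpoint of the relayed range; as a one-line sanity check (the 70-75% bounds come from the Grok post, not an official source):

```python
low, high = 70.0, 75.0        # bounds of the relayed ~70-75% range
midpoint = (low + high) / 2   # 72.5, the score recorded here
```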
28
Alibaba · Model Card/Qwen · 2026-02-24
Open-weight 27B dense model matching GPT-5 mini. Released Feb 2026.
72.4%
29
Mistral · Mistral — Devstral 2 and Vibe CLI · 2025-12-09
123B dense transformer, 256k context. Best open-weight SWE-bench score. 7x more cost-efficient than Claude Sonnet.
72.2%
30
OpenAI · Blog/OpenAI · 2025-04-16
Solid coding ability. Behind frontier models (Claude Opus 4.6 at 81.4%, Gemini 3.1 Pro at 80.6%).
71.7%
31
Alibaba · Blog/Alibaba · 2026-02-03
80B MoE, only 3B active params. Coding specialist matching models 10-20x larger.
70.6%
32
Meta · Blog/Meta · 2026-04-05
17B MoE with 128 experts. Strong coding performance for open-weight model, competitive with GPT-4o class.
70.3%
33
OpenAI · Blog/OpenAI · 2025-04-16
Cost-efficient reasoning model. Behind o3 (69.1%) but far ahead of o1 (48.9%) and o3-mini (49.3%).
68.1%
34
Mistral · Blog/Mistral · 2025-12-09
24B open-weight coding model. Runs locally on RTX 4090. Apache 2.0 license.
68.0%
35
Evaluated with mini-swe-agent-v2. Behind closed-source leaders but competitive for open-weight.
63.2%
36
OpenAI · HuggingFace/openai/gpt-oss-120b · 2025-08-05
OpenAI's first open-weight model: 116.8B params (5.1B active). Apache 2.0 license. Scored at high reasoning effort.
62.4%
37
OpenAI · HuggingFace/openai/gpt-oss-20b · 2025-08-05
20.9B total params (3.6B active). Runs on 16GB consumer hardware. Apache 2.0. Impressive for its size, approaching the gpt-oss-120b SWE-bench score.
60.7%
38
Zhipu AI · HuggingFace/Zhipu · 2026-01-15
Lightweight MoE flash variant. Solid for its size class.
59.2%
39
xAI · Leaderboard/vals.ai · 2026-03-20
Independent vals.ai test. Significant gap from xAI self-reported 72-75%, highlighting scaffold sensitivity.
58.6%
40
Sarvam AI · HuggingFace/sarvamai · 2026-03-06
India's first domestically trained 105B model. MoE with 10.3B active params. Apache 2.0.
45.0%