SWE-bench Verified
40 models tested · Updated 2026-04-07 · Verified sources only
Claude Mythos Preview leads at 93.9%
1
Anthropic · Blog/Anthropic · 2026-04-07
New SOTA. 13.1pp above Opus 4.6 (80.8%). Largest single-generation coding jump in Anthropic history.
93.9%
2
Anthropic · Blog/Anthropic · 2026-02-05
Updated official score with prompt modification, averaged over 25 trials. Previous entry was 80.8%.
81.42%
3
Anthropic · Blog/Anthropic · 2025-11-24
First model to exceed 80% on SWE-bench Verified. 25 trials avg.
80.9%
4
Google · Official/Google DeepMind · 2026-02-19
Agentic coding benchmark. Verified from official Gemini page.
80.6%
5
MiniMax · Blog/MiniMax · 2026-02-12
MoE, 230B total params (10B active). Matches Claude Opus 4.6's speed on the SWE-bench eval.
80.2%
6
OpenAI · OpenAI — Introducing GPT-5.2 · 2025-12-11
Multi-language bug fixing. SWE-bench Pro score: 55.6% (new SOTA).
80.0%
7
OpenAI · Blog/OpenAI · 2026-02-05
OpenAI coding-focused model. Also scores 56.8% on SWE-bench Pro (harder multilingual variant).
80.0%
8
Anthropic · Blog/Anthropic · 2026-02-17
Scores 1.2 points behind Opus 4.6 and 2.4 points ahead of Sonnet 4.5 on SWE-bench Verified.
79.6%
9
Alibaba · Blog/Qwen · 2026-04-02
Qwen 3.6 Plus scores 78.8% on SWE-bench Verified
78.8%
10
Google · Blog/Google · 2025-12-17
Distilled model outperforms Gemini 3 Pro (76.2%) on agentic coding. Less than a quarter the cost of Pro.
78.0%
11
Zhipu · Zhipu — GLM-5 Developer Docs · 2026-02-11
Leading open-model score. Only 3 points behind Opus 4.6. Terminal-Bench 2.0: 56.2%.
77.8%
12
Zhipu · X/@grok · 2026-03-27
Grok bot relayed the Zhipu GLM-5.1 score; no direct link to a paper or official source.
77.8%
13
Meta · Blog/Meta · 2026-04-08
First model from Meta Superintelligence Labs. Competitive with Gemini 3.1 Pro on agentic coding.
77.4%
14
Anthropic · Blog/Anthropic · 2025-09-29
77.2% standard, 78.2% with 1M context, 82.0% with parallel compute. SOTA at time of release.
77.2%
15
OpenAI · Vals.ai · 2026-03-05
Independent Vals.ai evaluation. OpenAI dropped SWE-bench Verified in favor of SWE-bench Pro (57.7%).
77.2%
16
Moonshot AI · HuggingFace/moonshotai · 2026-01-27
1T-parameter open-source MoE. Competitive with Claude Sonnet 4.6 (79.6%). Uses an internal eval framework with bash/createfile tools.
76.8%
17
ByteDance · Blog/ByteDance · 2026-02-14
Competitive with Claude Sonnet 4.5 and DeepSeek V3.2 on coding tasks.
76.5%
18
Alibaba · HuggingFace/Qwen · 2026-02-16
Strong open-weight coding performance. 17B active params, MoE architecture.
76.4%
19
OpenAI · Blog/OpenAI · 2025-08-07
Below GPT-5.4 (77.2%) and Claude Opus 4.6 (81.4%). Coding was not GPT-5's main strength.
74.9%
20
Fujitsu Research · Blog/Fujitsu Research · 2026-04-08
Third-party harness engineering by Fujitsu: test-time scaling with 8 candidate patches, no fine-tuning. Highest score among models under 229B params.
74.8%
21
StepFun · Blog/StepFun · 2026-02-12
Open-weight MoE with only 11B active params. Competitive with proprietary mid-tier models on coding.
74.4%
22
xAI · X/@grok · 2026-03-31
Top tier alongside Claude and GPT models. Fast inference and a large context window.
74.0%
23
Zhipu AI · NVIDIA/Zhipu Official · 2025-12-22
Open-weight, 400B params. Strong coding for an open model but behind GLM-5/5.1 successors.
73.8%
24
ByteDance · Blog/ByteDance · 2026-03-10
Near Pro-level performance (76.5%) at roughly half the cost.
73.5%
25
Anthropic · Blog/Anthropic · 2025-10-15
Anthropic's first Haiku with extended thinking. Matches Sonnet 4 on coding at one-third the cost.
73.3%
26
DeepSeek · Paper/DeepSeek (arxiv:2512.02556) · 2025-12-01
Primary score from DeepSeek technical report. Robustness tests across frameworks range 72-74%. Trails Claude Opus 4.6 (80.8%) and GPT-5.3 Codex (80.0%).
73.1%
27
xAI · X/@grok · 2026-03-31
Grok bot stated a ~70-75% range; midpoint used. No official xAI source linked.
72.5%
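The recorded figure is just the midpoint of the relayed range; as a one-line sanity check (the 70-75% bounds come from the Grok post, not an official source):

```python
low, high = 70.0, 75.0        # bounds of the relayed ~70-75% range
midpoint = (low + high) / 2   # 72.5, the score recorded here
```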
28
Alibaba · Model Card/Qwen · 2026-02-24
Open-weight 27B dense model matching GPT-5 mini. Released Feb 2026.
72.4%
29
Mistral · Mistral — Devstral 2 and Vibe CLI · 2025-12-09
123B dense transformer, 256k context. Best open-weight SWE-bench score. 7x more cost-efficient than Claude Sonnet.
72.2%
30
OpenAI · Blog/OpenAI · 2025-04-16
Solid coding ability. Behind frontier models (Claude Opus 4.6 at 81.4%, Gemini 3.1 Pro at 80.6%).
71.7%
31
Alibaba · Blog/Alibaba · 2026-02-03
80B MoE, only 3B active params. Coding specialist matching models 10-20x larger.
70.6%
32
Meta · Blog/Meta · 2026-04-05
17B MoE with 128 experts. Strong coding performance for open-weight model, competitive with GPT-4o class.
70.3%
33
OpenAI · Blog/OpenAI · 2025-04-16
Cost-efficient reasoning model. Behind o3 (69.1%) but far ahead of o1 (48.9%) and o3-mini (49.3%).
68.1%
34
Mistral · Blog/Mistral · 2025-12-09
24B open-weight coding model. Runs locally on RTX 4090. Apache 2.0 license.
68.0%
35
Evaluated with mini-swe-agent-v2. Behind closed-source leaders but competitive for open-weight.
63.2%
36
OpenAI · HuggingFace/openai/gpt-oss-120b · 2025-08-05
OpenAI's first open-weight model: 116.8B params (5.1B active). Apache 2.0 license. Scored at high reasoning effort.
62.4%
37
OpenAI · HuggingFace/openai/gpt-oss-20b · 2025-08-05
20.9B total params (3.6B active). Runs on 16GB consumer hardware. Apache 2.0. Impressive for its size, approaching the gpt-oss-120b SWE-bench score.
60.7%
38
Zhipu AI · HuggingFace/Zhipu · 2026-01-15
Lightweight MoE flash variant. Solid for its size class.
59.2%
39
xAI · Leaderboard/vals.ai · 2026-03-20
Independent vals.ai test. Significant gap from xAI self-reported 72-75%, highlighting scaffold sensitivity.
58.6%
40
Sarvam AI · HuggingFace/sarvamai · 2026-03-06
India's first domestically trained 105B model. MoE with 10.3B active params. Apache 2.0.
45.0%