Terminal-Bench 2.0 leaderboard
14 models tested · Updated 2026-04-07 · Verified sources only
Claude Mythos Preview leads at 82.0%.
1. Claude Mythos Preview · 82.0% · Anthropic · Blog/Anthropic · 2026-04-07
   16.6pp above Opus 4.6 (65.4%). Terminal-based coding tasks.
2. GPT-5.4 · 81.8% · OpenAI · Leaderboard/tbench.ai · 2026-03-05
   Tied for #1 on the tbench.ai leaderboard. Agent framework matters: the same model scores differently with different agents.
3. Claude Opus 4.6 · 81.8% · Anthropic · Leaderboard/tbench.ai · 2026-02-05
   Tied for #1 on the tbench.ai leaderboard with GPT-5.4. The ForgeCode agent framework achieves the best results for both models.
4. Gemini 3.1 Pro · 80.2% · Google · Leaderboard/tbench.ai · 2026-02-19
   #3 on the tbench.ai leaderboard. The TongAgents framework outperforms ForgeCode (78.4%) for this model.
5. GPT-5.3 Codex · 77.3% · OpenAI · Web/Search · 2026-02-05
   Terminal-specific model variant. Leads GPT-5.4 (75.1%) on Terminal-Bench.
6. GLM-5.1 · 69.0% · Zhipu AI · Blog/Z.AI · 2026-04-07
   #1 open-source model on Terminal-Bench. Behind GPT-5.4 (75.1%) and Mythos (82.0%) overall.
7. Qwen 3.6 Plus · 61.6% · Alibaba · Blog/Qwen · 2026-04-02
   Beats Claude Opus 4.6 (59.3%) on terminal operations.
8. GPT-5.4 Mini · 60.0% · OpenAI · Blog/OpenAI · 2026-03-17
   Terminal/CLI tasks. Solid for a smaller, faster model.
9. Claude Sonnet 4.6 · 59.1% · Anthropic · Blog/Anthropic · 2026-02-17
   Default thinking configuration. Measures terminal/CLI task completion.
10. MiniMax M2.7 · 57.0% · MiniMax · Blog/MiniMax · 2026-03-18
    Official blog. Matches GPT-5.3 Codex on SWE-Pro (56.22%). Self-evolution architecture.
11. GLM-5 · 56.2% · Zhipu AI · Paper/Zhipu AI (arxiv:2602.15763) · 2026-02-11
    Terminus-2 framework. On the verified subset (with ambiguous instructions fixed): 60.7%. Strong for an open model.
12. Step-3.5-Flash · 51.0% · StepFun · Blog/StepFun · 2026-02-12
    Open-weight MoE with 11B active parameters on terminal coding tasks.
13. Kimi K2.5 · 50.8% · Moonshot AI · Blog/Kimi · 2026-01-27
    Terminal/coding benchmark. Behind Qwen 3.6 Plus (61.6%) and Claude Opus 4.6.
14. GPT-5.4 Nano · 46.3% · OpenAI · Blog/OpenAI · 2026-03-17
    Terminal tasks. Budget-tier model optimized for speed over depth.