Humanity's Last Exam leaderboard
24 models tested · Updated 2026-04-07 · Verified sources only
Claude Mythos Preview leads at 64.7%
1. Claude Mythos Preview · Anthropic · 64.7%
   Blog/Anthropic, 2026-04-07. With-tools variant. No-tools score is 56.8%. 11.6 pts ahead of Opus 4.6 with tools (53.1%).
2. Qwen3-Max-Thinking · Alibaba · 58.3%
   Blog/Qwen, 2026-01-25. Highest self-reported HLE score at the time. With tools. Leads GPT-5.2-Thinking by 28%. First Chinese model to top HLE.
3. Muse Spark · Meta · 58.0%
   Blog/Meta AI, 2026-04-08. Contemplating mode (multi-agent parallel reasoning). First model from Meta Superintelligence Labs. Accepts voice, text, and image inputs. Strong on multimodal and health tasks; weaker on agentic tasks than Mythos and GPT-5.4.
4. Claude Mythos Preview · Anthropic · 56.8%
   Blog/Anthropic, 2026-04-07. Without tools. 16.8 pts above Opus 4.6 (40.0%). Strongest reasoning model on HLE.
5. Claude Opus 4.6 · Anthropic · 53.1%
   Anthropic — Introducing Claude Opus 4.6, 2026-02-05. With tool use on the hardest reasoning benchmark. Corrected score per the Feb 23 update.
6. Qwen 3.6 Plus · Alibaba · 50.6%
   Blog/Qwen, 2026-04-02. Strong HLE score, competitive with frontier models on this extreme reasoning test.
7. GLM-5 · Zhipu AI · 50.4%
   HuggingFace/zai-org, 2026-02-11. Tool-augmented evaluation variant. Base (no-tools) score is 30.5%. Higher than Claude Opus 4.5's with-tools score.
8. Qwen3.5 27B · Alibaba · 48.5%
   HuggingFace/Qwen, 2026-02-16. With tool access. Nearly 2x the CoT-only score (24.3%). Impressive for a 27B model; beats several frontier models.
9. Gemini 3 Deep Think · Google · 48.4%
   Blog/Google, 2026-02-12. Without tools. Google's specialized reasoning mode. Also scored 84.6% on ARC-AGI-2.
10. Gemini 3.1 Pro · Google · 44.4%
    Google DeepMind — Gemini page, 2026-02-19. Without tools; from Google DeepMind's official benchmarks page. Leads frontier models on the standard (no-tools) HLE eval.
11. GPT-5.4 Pro · OpenAI · 44.32%
    Scale AI Leaderboard, 2026-03-05. Top of the Scale AI HLE leaderboard. Standard evaluation, no tools.
12. GLM-4.7 · Zhipu AI · 42.8%
    NVIDIA/Zhipu Official, 2025-12-22. Open-weight model with tools; competitive with GPT-5.1 High (42.7%).
13. GPT-5.3 Codex · OpenAI · 39.9%
    Leaderboard/Artificial Analysis, 2026-03-24. Second-highest OpenAI score on HLE, behind GPT-5.4 Pro (44.3%). Codex-native agent combining frontier coding with general reasoning.
14. Gemini 3 Pro Preview · Google · 37.52%
    Scale AI Leaderboard, 2025-11-18. Second on the Scale AI HLE leaderboard. Standard evaluation, no tools.
15. GPT-5.4 · OpenAI · 36.24%
    Scale AI Leaderboard, 2026-03-05. xhigh thinking config on the Scale AI HLE leaderboard.
16. GPT-5 Pro · OpenAI · 31.64%
    Scale AI Leaderboard, 2025-10-06. Standard evaluation on the Scale AI HLE leaderboard.
17. GLM-5 · Zhipu AI · 30.5%
    Paper/Zhipu AI (arxiv:2602.15763), 2026-02-11. Text-only subset, no tools. With tools: 50.4% (comparable to Kimi K2.5's 51.8%).
18. GPT-5.2 · OpenAI · 27.8%
    Scale AI Leaderboard, 2025-12-11. Standard evaluation on the Scale AI HLE leaderboard.
19. Kimi K2.5 · Moonshot · 24.37%
    Scale AI Leaderboard, 2026-01-27. Chinese frontier model; a strong HLE showing for an open-weight model.
20. Qwen3.5 27B · Alibaba · 24.3%
    HuggingFace/Qwen, 2026-02-16. Chain-of-thought only; jumps to 48.5% with tools, suggesting HLE demands tool use from small models.
21. Grok 4 · xAI · 24.0%
    Artificial Analysis, 2025-07-09. Frontier reasoning benchmark result, third-party verified by Artificial Analysis.
22. Gemma 4 31B · Google · 19.5%
    HuggingFace/Google, 2026-04-02. Without tools. Dense 31B model; strong HLE for an open-weight model.
23. MiniMax M2.5 · MiniMax · 19.4%
    Blog/MiniMax, 2026-02-12. HLE without tools. Behind frontier models (Opus 4.6: 30.7, GPT-5.2: 31.4).
24. GLM-4.7-Flash · Zhipu AI · 14.4%
    HuggingFace/Zhipu, 2026-01-15. Without tools. 30B-A3B MoE with 3.6B active parameters.
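Several entries above report both a with-tools and a no-tools score, and the quoted margins ("11.6 pts ahead", "nearly 2x the CoT-only score") follow directly from those numbers. A minimal sketch that reproduces the tool-use uplift from the scores as listed on this page (the selection of three models here is illustrative, not exhaustive):

```python
# Tool-use uplift for models on this leaderboard that report both
# a with-tools and a no-tools HLE score (transcribed from the entries above).
scores = {
    "Claude Mythos Preview": (64.7, 56.8),  # (with tools, without tools)
    "GLM-5": (50.4, 30.5),
    "Qwen3.5 27B": (48.5, 24.3),
}

for model, (with_tools, no_tools) in scores.items():
    delta = with_tools - no_tools   # absolute gain in points
    ratio = with_tools / no_tools   # relative multiplier
    print(f"{model}: +{delta:.1f} pts ({ratio:.2f}x)")
```

The smaller the base model, the larger the uplift in this sample: the 27B Qwen roughly doubles its score with tools, while the frontier Mythos gains under 8 points.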