Humanity's Last Exam
24 models tested · Updated 2026-04-07 · Verified sources only
Claude Mythos Preview leads at 64.7%
1
Anthropic · Blog/Anthropic · 2026-04-07
With tools. The no-tools score is 56.8%. 11.6 pp ahead of Opus 4.6 with tools (53.1%).
64.7%
2
Alibaba · Blog/Qwen · 2026-01-25
Highest self-reported HLE score. With tools. Leads GPT-5.2-Thinking by 28%. First Chinese model to top HLE.
58.3%
3
Meta · Blog/Meta AI · 2026-04-08
Contemplating mode (multi-agent parallel reasoning). First model from Meta Superintelligence Labs. Accepts voice, text, and image inputs. Strong on multimodal and health tasks; weaker on agentic tasks than Mythos and GPT-5.4.
58.0%
4
Anthropic · Blog/Anthropic · 2026-04-07
Without tools. 16.8pp above Opus 4.6 (40.0%). Strongest reasoning model on HLE.
56.8%
5
Hardest reasoning benchmark. With tool use. Corrected score per Feb 23 update.
53.1%
6
Alibaba · Blog/Qwen · 2026-04-02
Strong HLE score, competitive with frontier models on this extreme reasoning test.
50.6%
7
Zhipu AI · HuggingFace/zai-org · 2026-02-11
Tool-augmented evaluation variant. Base (no tools) score is 30.5%. Higher than Claude Opus 4.5 tools score.
50.4%
8
Alibaba · HuggingFace/Qwen · 2026-02-16
With tool access. Nearly 2x the CoT-only score. Impressive for a 27B model: beats several frontier models.
48.5%
9
Google · Blog/Google · 2026-02-12
Without tools. Google specialized reasoning mode. Also scored 84.6% on ARC-AGI-2.
48.4%
10
Google · Google DeepMind — Gemini page · 2026-02-19
Without tools. Google DeepMind official benchmarks page. Leads frontier models on HLE standard eval.
44.4%
11
OpenAI · Scale AI Leaderboard · 2026-03-05
Top of Scale AI HLE leaderboard. Standard evaluation, no tools.
44.32%
12
Zhipu AI · NVIDIA/Zhipu Official · 2025-12-22
Open-weight model with tools, competitive with GPT-5.1 High (42.7%).
42.8%
13
OpenAI · Leaderboard/Artificial Analysis · 2026-03-24
Second-highest on HLE behind GPT-5.4 Pro (44.3%). Codex-native agent combining frontier coding with general reasoning.
39.9%
14
Google · Scale AI Leaderboard · 2025-11-18
Second on Scale AI HLE leaderboard. Standard evaluation, no tools.
37.52%
15
OpenAI · Scale AI Leaderboard · 2026-03-05
xhigh thinking config on Scale AI HLE leaderboard.
36.24%
16
OpenAI · Scale AI Leaderboard · 2025-10-06
Standard evaluation on Scale AI HLE leaderboard.
31.64%
17
Zhipu AI · Paper/Zhipu AI (arxiv:2602.15763) · 2026-02-11
Text-only subset. With tools: 50.4% (comparable to Kimi K2.5 at 51.8%).
30.5%
18
OpenAI · Scale AI Leaderboard · 2025-12-11
Standard evaluation on Scale AI HLE leaderboard.
27.8%
19
Moonshot · Scale AI Leaderboard · 2026-01-27
Chinese frontier model. Strong HLE showing for open-weight model.
24.37%
20
Alibaba · HuggingFace/Qwen · 2026-02-16
Chain-of-thought only. With tools, the score jumps to 48.5%, which suggests small models depend heavily on tool use for HLE.
24.3%
21
xAI · Artificial Analysis · 2025-07-09
Frontier reasoning benchmark. Third-party verified via Artificial Analysis.
24.0%
22
Google · HuggingFace/Google · 2026-04-02
Without tools. Dense 31B model. Strong HLE for an open-weight model.
19.5%
23
MiniMax · Blog/MiniMax · 2026-02-12
HLE without tools. Behind frontier models (Opus 4.6: 30.7%, GPT-5.2: 31.4%).
19.4%
24
Zhipu AI · HuggingFace/Zhipu · 2026-01-15
Without tools. 30B-A3B MoE, 3.6B active params.
14.4%
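
Several entries above report the same model both with and without tools (ranks 1 and 4, ranks 7 and 17, ranks 8 and 20). As a rough sketch of the arithmetic behind notes like "nearly 2x the CoT-only score", the snippet below computes the gain in percentage points and the score ratio for each pair. The labels, the `PAIRS` list, and the `tool_gain` helper are illustrative assumptions rather than part of any leaderboard tooling. The scores are copied from the entries above. The rank 17 entry is marked as a text-only subset, so that pair may not be a like-for-like comparison.

```python
# Each tuple: (label, score with tools in %, score without tools in %).
# Scores are copied from the leaderboard entries above; labels refer to ranks
# because the entries themselves do not carry model names.
PAIRS = [
    ("ranks 1 and 4", 64.7, 56.8),
    ("ranks 7 and 17", 50.4, 30.5),   # rank 17 is a text-only subset
    ("ranks 8 and 20", 48.5, 24.3),
]


def tool_gain(with_tools, without_tools):
    """Return (gain in percentage points, ratio of with-tools to no-tools score)."""
    gain_pp = round(with_tools - without_tools, 1)
    ratio = round(with_tools / without_tools, 2)
    return gain_pp, ratio


def summarize(pairs):
    """Map each label to its (gain_pp, ratio) tuple."""
    return {label: tool_gain(w, wo) for label, w, wo in pairs}


if __name__ == "__main__":
    for label, (gain_pp, ratio) in summarize(PAIRS).items():
        print(f"{label}: +{gain_pp} pp with tools, {ratio}x the no-tools score")
```

On these figures the gain from tools grows as the no-tools score falls: 7.9 pp for the first pair, 19.9 pp for the second, and 24.2 pp for the third.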