Humanity's Last Exam Leaderboard 2026 — Results Across 24 Real AI Models

24 models tested · Updated 2026-04-07 · Verified sources only

Claude Mythos Preview leads at 64.7%

Claude Mythos Preview

Anthropic · Blog/Anthropic · 2026-04-07

With-tools variant; the no-tools score is 56.8%. 11.6 pp ahead of Opus 4.6 with tools (53.1%).

64.7%

Qwen3-Max-Thinking

Alibaba · Blog/Qwen · 2026-01-25

With tools. Highest self-reported HLE score at its release; Claude Mythos Preview (64.7%) has since passed it. Leads GPT-5.2-Thinking by 28%. First Chinese model to top HLE.

58.3%

Muse Spark

Meta · Blog/Meta AI · 2026-04-08

Contemplating mode (multi-agent parallel reasoning). First model from Meta Superintelligence Labs. Accepts voice, text, and image inputs. Strong on multimodal and health tasks, weaker on agentic tasks than Mythos and GPT-5.4.

58.0%

Claude Mythos Preview

Anthropic · Blog/Anthropic · 2026-04-07

Without tools. 16.8 pp above Opus 4.6 (40.0%). Strongest reasoning model on HLE.

56.8%
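
The gaps quoted in the two Claude Mythos Preview notes are percentage-point differences, not relative gains. A minimal Python sketch of the distinction, using only scores listed on this page; the function names `pp_gap` and `relative_gain` are illustrative and do not come from any cited source:

```python
def pp_gap(score: float, baseline: float) -> float:
    """Absolute difference between two scores, in percentage points."""
    return round(score - baseline, 1)


def relative_gain(score: float, baseline: float) -> float:
    """Improvement over the baseline, as a percentage of the baseline."""
    return round((score - baseline) / baseline * 100, 1)


# HLE scores as listed on this page (percent correct).
mythos_tools, mythos_no_tools = 64.7, 56.8
opus_tools = 53.1
opus_quoted = 40.0  # Opus 4.6 figure quoted in the no-tools note

print(pp_gap(mythos_tools, opus_tools))         # 11.6
print(pp_gap(mythos_no_tools, opus_quoted))     # 16.8
print(relative_gain(mythos_tools, opus_tools))  # 21.8
```

The same 11.6 pp lead corresponds to a relative gain of about 21.8%, so the unit matters when comparing notes across entries.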

Claude Opus 4.6

Anthropic · Anthropic — Introducing Claude Opus 4.6 · 2026-02-05

Hardest reasoning benchmark. With tool use. Corrected score per Feb 23 update.

53.1%

Qwen 3.6 Plus

Alibaba · Blog/Qwen · 2026-04-02

Strong HLE score, competitive with frontier models on this extreme reasoning test.

50.6%

GLM-5

Zhipu AI · HuggingFace/zai-org · 2026-02-11

Tool-augmented evaluation variant. The base (no tools) score is 30.5%. Higher than the Claude Opus 4.5 with-tools score.

50.4%

Qwen3.5 27B

Alibaba · HuggingFace/Qwen · 2026-02-16

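With tool access. Nearly 2x the CoT-only score (24.3%). For a 27B model, it lands above several frontier-model entries in this list, including GPT-5.4 Pro (44.32%) and Gemini 3.1 Pro (44.4%), though those were evaluated without tools.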

48.5%

Gemini 3 Deep Think

Google · Blog/Google · 2026-02-12

Without tools. Google's specialized reasoning mode. Also scored 84.6% on ARC-AGI-2.

48.4%

Gemini 3.1 Pro

Google · Google DeepMind — Gemini page · 2026-02-19

Without tools. Score taken from the official Google DeepMind benchmarks page, where it is reported as leading frontier models on the standard HLE evaluation.

44.4%

GPT-5.4 Pro

OpenAI · Scale AI Leaderboard · 2026-03-05

Top of Scale AI HLE leaderboard. Standard evaluation, no tools.

44.32%

GLM-4.7

Zhipu AI · NVIDIA/Zhipu Official · 2025-12-22

Open-weight model with tools, competitive with GPT-5.1 High (42.7%).

42.8%

GPT-5.3 Codex

OpenAI · Leaderboard/Artificial Analysis · 2026-03-24

Second-highest OpenAI entry in this list, behind GPT-5.4 Pro (44.32%). Codex-native agent combining frontier coding with general reasoning.

39.9%

Gemini 3 Pro Preview

Google · Scale AI Leaderboard · 2025-11-18

Second on Scale AI HLE leaderboard. Standard evaluation, no tools.

37.52%

GPT-5.4

OpenAI · Scale AI Leaderboard · 2026-03-05

Evaluated with the xhigh thinking configuration on the Scale AI HLE leaderboard.

36.24%

GPT-5 Pro

OpenAI · Scale AI Leaderboard · 2025-10-06

Standard evaluation on Scale AI HLE leaderboard.

31.64%

GLM-5

Zhipu AI · Paper/Zhipu AI (arxiv:2602.15763) · 2026-02-11

Text-only subset. With tools: 50.4% (comparable to Kimi K2.5 51.8%).

30.5%

GPT-5.2

OpenAI · Scale AI Leaderboard · 2025-12-11

Standard evaluation on Scale AI HLE leaderboard.

27.8%

Kimi K2.5

Moonshot · Scale AI Leaderboard · 2026-01-27

Chinese frontier model. Strong HLE showing for an open-weight model.

24.37%

Qwen3.5 27B

Alibaba · HuggingFace/Qwen · 2026-02-16

Chain-of-thought only. With tools, the score jumps to 48.5%, which suggests that small models depend heavily on tool use for HLE.

24.3%
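
Three models on this page appear with both a with-tools and a without-tools score, so the tool uplift can be computed directly. A rough Python sketch using only the figures listed here; the `PAIRS` table and the `uplift` helper are hypothetical names, and the GLM-5 no-tools figure is on a text-only subset, so that pair may not be strictly comparable:

```python
# (with tools, without tools) in percent, as listed on this page.
PAIRS = {
    "Claude Mythos Preview": (64.7, 56.8),
    "GLM-5": (50.4, 30.5),        # no-tools figure is on a text-only subset
    "Qwen3.5 27B": (48.5, 24.3),  # no-tools figure is chain-of-thought only
}


def uplift(with_tools: float, without_tools: float) -> tuple[float, float]:
    """Return (percentage-point gain, score ratio) from adding tools."""
    gain = round(with_tools - without_tools, 1)
    ratio = round(with_tools / without_tools, 2)
    return gain, ratio


for model, (w, wo) in PAIRS.items():
    gain, ratio = uplift(w, wo)
    print(f"{model}: +{gain} pp, {ratio}x")
```

On these numbers the smallest model in the set, Qwen3.5 27B, gains the most from tools in both absolute and relative terms.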

Grok 4

xAI · Artificial Analysis · 2025-07-09

Frontier reasoning benchmark. Third-party verified via Artificial Analysis.

24.0%

Gemma 4 31B

Google · HuggingFace/Google · 2026-04-02

Without tools. Dense 31B model. Strong HLE score for an open-weight model.

19.5%

MiniMax M2.5

MiniMax · Blog/MiniMax · 2026-02-12

Without tools. Behind frontier models (Opus 4.6: 30.7%, GPT-5.2: 31.4%).

19.4%

GLM-4.7-Flash

Zhipu AI · HuggingFace/Zhipu · 2026-01-15

Without tools. 30B-A3B MoE, 3.6B active params.

14.4%
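
Because some models are listed more than once (with and without tools), the number of entries and the number of distinct models differ. A small Python sketch that tallies the names exactly as they appear above; the `ENTRIES` list is transcribed from this page and the variable names are illustrative:

```python
from collections import Counter

# Model names in the order they appear on this page, one per entry.
ENTRIES = [
    "Claude Mythos Preview", "Qwen3-Max-Thinking", "Muse Spark",
    "Claude Mythos Preview", "Claude Opus 4.6", "Qwen 3.6 Plus",
    "GLM-5", "Qwen3.5 27B", "Gemini 3 Deep Think", "Gemini 3.1 Pro",
    "GPT-5.4 Pro", "GLM-4.7", "GPT-5.3 Codex", "Gemini 3 Pro Preview",
    "GPT-5.4", "GPT-5 Pro", "GLM-5", "GPT-5.2", "Kimi K2.5",
    "Qwen3.5 27B", "Grok 4", "Gemma 4 31B", "MiniMax M2.5",
    "GLM-4.7-Flash",
]

counts = Counter(ENTRIES)
# Models that appear under more than one evaluation setting.
repeated = sorted(name for name, n in counts.items() if n > 1)

print(len(ENTRIES))  # 24 entries
print(len(counts))   # 21 distinct models
print(repeated)      # ['Claude Mythos Preview', 'GLM-5', 'Qwen3.5 27B']
```

Read this way, the "24" in the header counts entries (model plus evaluation setting) rather than distinct models.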