Toolathlon Leaderboard 2026 — Results Across 7 Real AI Models

7 models tested · Updated 2026-06-09 · Verified sources only

Claude Mythos 5 leads at 61.7%

Anthropic · Blog/OpenAI · 2026-06-09

Beats GPT-5.6 Sol (58) on Toolathlon; roughly on par with Mythos Preview (61.1) and ties Fable 5 (61.7).

61.7%

Anthropic · Blog/Z.ai · 2026-06-13

Highest among compared models.

59.9%

Google DeepMind · Blog/Google DeepMind · 2026-06-10

Highest among compared models. Edges GPT-5.5 (55.6).

56.5%

OpenAI · Blog/OpenAI · 2026-04-23

Toolathlon evaluates tool-use proficiency across diverse real-world tasks.

55.6%

DeepSeek · HuggingFace/deepseek-ai · 2026-04-24

Strong tool use for an open-weights model. Trails GPT-5.5 (55.6) by 3.8 pp.

51.8%

StepFun · Blog/StepFun · 2026-05-29

Toolathlon. Multi-tool coordination.

49.5%

Z.ai · Blog/Z.ai · 2026-06-13

Up from GLM-5.1 (40.7). Agentic tool use.

48.2%
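
The percentage-point comparisons in the notes above (for example, the DeepSeek entry trailing GPT-5.5) can be reproduced from the listed scores. The following is a minimal sketch, not part of the benchmark itself. The labels are illustrative placeholders based on the publishing organization, since the page does not name every model, and the helper names `gap_pp` and `gaps_to_leader` are hypothetical. It also assumes the OpenAI entry (55.6%) is the GPT-5.5 score cited in the other notes.

```python
# Scores (%) exactly as listed on this page.
# Labels are illustrative placeholders, not official model names.
scores = [
    ("Anthropic (first entry)", 61.7),
    ("Anthropic (second entry)", 59.9),
    ("Google DeepMind", 56.5),
    ("OpenAI", 55.6),  # assumed to be the GPT-5.5 score cited in the notes
    ("DeepSeek", 51.8),
    ("StepFun", 49.5),
    ("Z.ai", 48.2),
]


def gap_pp(a: float, b: float) -> float:
    """Return the percentage-point difference a - b, rounded to one decimal."""
    return round(a - b, 1)


def gaps_to_leader(rows):
    """Return (label, gap) pairs, where gap is the distance to the top score."""
    leader = max(score for _, score in rows)
    return [(label, gap_pp(leader, score)) for label, score in rows]


if __name__ == "__main__":
    for label, gap in gaps_to_leader(scores):
        print(f"{label}: {gap:.1f} pp behind the top score")
```

Rounding to one decimal matches the precision of the published scores and avoids floating-point noise in the differences.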