Toolathlon
2 models tested · Updated 2026-04-23 · Verified sources only
GPT-5.5 leads at 55.6%
1
OpenAI · Blog/OpenAI · 2026-04-23
Toolathlon evaluates tool-use proficiency across diverse real-world tasks.
55.6%
2
DeepSeek · HuggingFace/deepseek-ai · 2026-04-24
Strong tool use for an open-weights model. Trails GPT-5.5 by 3.8 pp.
51.8%