benchmark.space
Toolathlon leaderboard
2 models tested · Updated 2026-04-23 · Verified sources only
GPT-5.5 leads at 55.6%
1. GPT-5.5 · OpenAI · Blog/OpenAI · 2026-04-23 · 55.6%
Toolathlon evaluates tool-use proficiency across diverse real-world tasks.
2. DeepSeek V4 Pro · DeepSeek · HuggingFace/deepseek-ai · 2026-04-24 · 51.8%
Strong tool-use result for an open-weights model. Trails GPT-5.5 by 3.8pp (55.6% vs. 51.8%).
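The gap between the two entries can be checked directly from the listed scores. The snippet below is a minimal sketch, not part of the leaderboard itself: the `scores` mapping and the helper names `gap_pp` and `relative_gap` are hypothetical and introduced only for illustration. It uses `Decimal` so the subtraction is exact rather than subject to floating-point rounding.

```python
from decimal import Decimal

# Scores copied from the two leaderboard entries, in percent.
# The dict and function names are illustrative assumptions.
scores = {
    "GPT-5.5": Decimal("55.6"),
    "DeepSeek V4 Pro": Decimal("51.8"),
}


def gap_pp(leader: str, trailer: str) -> Decimal:
    """Absolute difference between two scores, in percentage points."""
    return scores[leader] - scores[trailer]


def relative_gap(leader: str, trailer: str) -> Decimal:
    """Gap expressed as a fraction of the leader's score."""
    return gap_pp(leader, trailer) / scores[leader]


gap = gap_pp("GPT-5.5", "DeepSeek V4 Pro")
print(f"{gap} pp")  # 3.8 pp
```

A percentage-point gap is the plain difference of the two scores. It differs from the relative gap, which divides that difference by the leader's score.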