5 quotes from AI researchers about benchmarks, models, and evaluation
"It scored 52.4 on SWE-bench Pro, putting it within a few points of Opus 4.6, Gemini 3.1 Pro, and GPT 5.4 for coding. On Humanity's Last Exam, it scored 42.8, which is slightly better than Opus, but trailing Gemini and GPT 5.4."
"With tools enabled, Muse's score only jumped to 50.4 on Humanity's Last Exam, leaving it trailing all three of those major models by a few points. This could suggest the model isn't as good at web search or tool use as the others."
"Z.ai also provided a mixed benchmark that included Terminal Bench 2.0 and NL2 Repo, which had GLM 5.1 slightly behind the two US leaders, but ahead of Gemini 3.1 Pro."