5 quotes from AI researchers about benchmarks, models, and evaluation
"It scored 52.4 on SWE-bench Pro, putting it within a few points of Opus 4.6, Gemini 3.1 Pro, and GPT 5.4 for coding. On Humanity's Last Exam, it scored 42.8, which is slightly better than Opus, but trailing Gemini and GPT 5.4."
"With tools enabled, Muse's score only jumped to 50.4 on Humanity's Last Exam, leaving it trailing all three of those major models by a few points. This could suggest the model isn't as good at web search or tool use as the others."
"Z.ai also provided a mixed benchmark that included Terminal Bench 2.0 and NL2 Repo, which had GLM 5.1 slightly behind the two US leaders, but ahead of Gemini 3.1 Pro."