samwitteveen on AI benchmarks

voices

4 quotes from AI researchers about benchmarks, models, and evaluation

"Not only is their model winning on only three of the benchmarks, they are also actually last on three of the benchmarks, including the Humanity's Last Exam here with tools, which is the benchmark that Alexandr Wang's own company Scale AI actually created."

Sam Witteveen @samwitteveen · 2026-04-09

Humanity's Last Exam Muse Spark

"This is sitting in the top five models that they have benchmarked. It sits ahead of Claude Sonnet, ahead of the GLM 5.1, MiniMax 2.7. And really only behind the models from the top three proprietary labs being Google Gemini, being OpenAI's GPT 5.4 and Claude Opus 4.6."

Sam Witteveen @samwitteveen · 2026-04-09

Muse Spark

"The model is actually quite token efficient for its intelligence level and stands out pretty strong and is, they are saying, the second most capable vision model that they have benchmarked."

Sam Witteveen @samwitteveen · 2026-04-09

MMMU Muse Spark

"The Qwen models, the Gemma 4 models, and we know that there are more releases coming from the Gemma team. We know Qwen 3.7 is just around the corner."

Sam Witteveen @samwitteveen · 2026-04-09

Qwen 3.7