8 quotes from AI researchers about benchmarks, models, and evaluation
"You can see that Meta Spark model is currently sitting behind Claude Opus 4.6 Max on this index. And the reason I think this index is a pretty good benchmark baseline is because it shows us the combination/average of many different results, not just one specific area."
"Humanity's last exam, this currently looks like it's state-of-the-art, just three points behind GPT 5.4 Pro, and it's actually currently better than other models when it doesn't use tools."
"In Frontier Science Research, it actually has 38.3, which is currently a state-of-the-art benchmark. I do think that multiple agents collaborating is probably going to be a theme for the future considering that most of these models capabilities are well within reach of another."
"Meta found that if you penalize the model for thinking too long, something weird happens. The model actually learns to compress its reasoning and it solves the same problem using fewer tokens."
"Llama 4 Maverick needed 10 times more compute to match the same quality. DeepSeek needed eight times more compute and Kimi needed three times more compute to match the same quality."
"When we look at the benchmarks, it does say that Gemini 3.1 Pro is currently excelling across the board. But currently we are at that point where all of these models are within maybe 2 to 5 percentage points of each other."
"Meta have essentially released something in this model called contemplating mode. This is something that orchestrates multiple agents that reason in parallel. And in their testing, they found it competitive with other extreme reasoning models such as Gemini Deepthink and GPT Pro."
"Most models are not actually natively built to be multimodal. Most of them are just simply text-based. Which is why when you do have companies like Google and Meta that train their models natively to be multimodal, you do get some very effective models that have multimodal reasoning capability."