48 quotes from AI researchers about benchmarks, models, and evaluation
"Two days ago, Anthropic cut off third-party harnesses from using Claude subscriptions — not surprising. Three days ago, MiMo launched its Token Plan — a design I spent real time on, and what I believe is a serious attempt at getting compute allocation and agent harness"
"The new model from Meta is already looking like a disappointment: overoptimized for public benchmark numbers at the detriment of everything else. Knowing how to evaluate models in a way that correlates with actual usefulness is a core competency for AI labs, and any new lab is"
"In different hands, Mythos would be an unprecedented cyberweapon. I am not sure how we deal with this, except to note a narrow window where we know only 3 companies could be at this level of capability. But it may be Chinese models (maybe open weights ones?) get there in 9 months"
"So we now have a pretty good picture of the state of the frontier AI model makers.
US closed source models continue to lead. Google, OpenAI, and Anthropic stand well ahead of the pack, and may have signs of recursive self-improvement. xAI has fallen from frontier status for now"
"Our first model from MSL, Muse Spark, is now available on meta.ai! This is an efficient all-rounder model. It supports fast responses, deeper thinking, visual chain of thought, a higher inference "Contemplating" mode. Plus, it's natively multimodal."
"Seems like a good model from Meta that is still trailing the current series of releases. The most important thing to note is that it is not open weights. That was the main reason that Meta's models were so important. Without that, it is a lot harder to predict the value of Spark"
"I think the most obvious is that Meta has its own frontier model and can use that to extract additional value out of its customer base/explore new markets for its products. Very few companies can say that, and it has value on its own."
"New report with @xeophon is out with the latest open model adoption data we have gathered for Interconnects & The ATOM Project. At the surface level, we can see Chinese models continuing to accelerate in adoption."
"A bigger problem: many third-party harnesses compress tool responses every 3 steps when approaching the context limit, leading to very low cache hit rates."
"After playing with it a bit, Meta’s Muse Spark Thinking is fine so far, but really doesn’t match the current Big Three models. It also is a bit... weird. Like some strange language & tone, a little loose with facts, etc."
"So what's the deal with Amazon Nova? They released Nova 2 in December, and even then, the top flight Nova 2 model trailed Sonnet 4.5. And it still hasn't left preview."
"The US frontier labs have all walked away from open weights. They continue to occasionally release excellent open models (Gemma 4, etc), but they are smaller models that are not competitive with their closed weights models. So all eyes are on Chinese AI labs for open models."
"very interesting
Only Opus 4.6 and GPT 5.4 manage to avoid total bankruptcy in long-term betting. This is not so much a question of reasoning as one of learning from mistakes. Clearly, when things start going south, they can adjust towards safety.
Open models can't."
"Anyhow, its not bad. Just not the vibe level that the benchmarks might indicate. And, for a first re-entry into the frontier model space, given the engineering efficiencies they achieved, it feels like a solid attempt. I am sure we will see better from Meta in the future."
"Muse Spark is just a step along the path we are taking. Its fun to get usage and feedback from this initial release. But our focus within research is the frontier towards superintelligence; expect more from us this year!"
"Claude Mythos getting 83% beats out Gemini 3.1 Pro 82% and GPT 5.4 Pro at 80%. But on the subset remix, Claude Mythos gets the same score as Gemini 3.1 Pro and slightly underperforms GPT 5.4 Pro, which gets 88%."
"Humanity's Last Exam designed to test topics so obscure that it would indeed be the last exam that AI would saturate. Well, when allowed some tools, Claude Mythos gets almost two-thirds of those questions right compared to around 50% for other Frontier models."
"So onward to GPT-5.1, 5.2, 5.3, 5.4. To go through all these model generations and see their quirks and different working styles also meant we had to adapt the code base to change things up when the model was revved."
"I was using Gemini 3 the other day to try to automatically label a bunch of data for me. We maintain playground.roboflow.com where you can do SAM 3 versus Gemini versus Claude Opus."
"The authors pointed out that models like Gemini 3, in their chain of thought, had giveaways that their training data may have resembled ARC-AGI-like tasks."
"I had a brutal week seeing Claude Opus 4.6 and GPT-5.4 Extra High repeatedly screw up engineering tasks — a daily reminder that flipping to AI first isn't automatically an exponential speedup."
"I think it's now. I think we've achieved AGI. It is not out of the question that a Claude was able to create a web service, some interesting little app that all of a sudden a few billion people used."
"The next scaling law is the agentic scaling law. It's like multiplying AI — we could spin off agents as fast as you want. I have four scaling laws: pre-training, post-training, test time, and agentic."
"OpenClaw did for agentic systems what ChatGPT did for generative systems. I needed Claude and GPT and all of these models to reach a level of capability. Their breakthroughs were really important."
"There's something truly special happening from about December — people have really woken up to the power of Claude Code, Codex, OpenClaw. The iPhone of tokens arrived."
"You mentioned China with DeepSeek and MiniMax, all these companies really pushing forward the open source AI movement, and NVIDIA is really leading the way in close to state-of-the-art open source LLMs."
"I don't think I've typed like a line of code probably since December basically. Something flipped where I went from 80/20 of writing code by myself versus delegating to agents, to 20/80."
"I actually think Claude has a pretty good personality. It feels like a teammate and it's excited with you. Codex is a lot more dry. It doesn't seem to care about what you're creating."
"OpenClaw has a lot more sophisticated memory than what you would get by default, which is just a memory compaction when your context runs out. Peter innovated simultaneously in like five different ways."
"I simultaneously feel like I'm talking to an extremely brilliant PhD student who's been a systems programmer for their entire life and a 10-year-old. This jaggedness is really strange."
"If Gemini 3 or Claude 4.5, whatever, solves a problem, it is not the case that its own understanding of math has progressed. You run a new session and it's forgotten what it just did."
"There are research efforts to try to create automated conjectures, and maybe there are ways to benchmark these and simulate this, but it's all very new science."
"I don't think anyone's actually tried properly to do open source Co-work. OpenClaw doesn't even try to make sandboxing work. Co-work is actually trying to make sandboxing work but still be accessible to non-technical users."
"I didn't think agents were capable of this kind of stuff — full browser manipulation. It figures out the titles because it can transcribe and selectively look at screenshots using vision."