Researcher takes
48 quotes from AI researchers about benchmarks, models, and evaluation
"Two days ago, Anthropic cut off third-party harnesses from using Claude subscriptions — not surprising. Three days ago, MiMo launched its Token Plan — a design I spent real time on, and what I believe is a serious attempt at getting compute allocation and agent harness"
Fuli Luo @_LuoFuli · 2026-04-05 · 1751 likes
"Anyone has access to mythos and can let the rest of us plebs know what it feels like"
Julien Chaumond @julien_c · 2026-04-08 · 1667 likes
"so…. Qwen3.5 or Gemma 4?"
Julien Chaumond @julien_c · 2026-04-04 · 883 likes
"The new model from Meta is already looking like a disappointment: overoptimized for public benchmark numbers at the detriment of everything else. Knowing how to evaluate models in a way that correlates with actual usefulness is a core competency for AI labs, and any new lab is"
François Chollet @fchollet · 2026-04-08 · 640 likes
"In different hands, Mythos would be an unprecedented cyberweapon. I am not sure how we deal with this, except to note a narrow window where we know only 3 companies could be at this level of capability. But it may be Chinese models (maybe open weights ones?) get there in 9 months"
Ethan Mollick @emollick · 2026-04-08 · 529 likes
"So we now have a pretty good picture of the state of the frontier AI model makers. US closed source models continue to lead. Google, OpenAI, and Anthropic stand well ahead of the pack, and may have signs of recursive self-improvement. xAI has fallen from frontier status for now"
Ethan Mollick @emollick · 2026-04-09 · 342 likes
"Our first model from MSL, Muse Spark, is now available on meta.ai! This is an efficient all-rounder model. It supports fast responses, deeper thinking, visual chain of thought, a higher inference "Contemplating" mode. Plus, it's natively multimodal."
Jack Rae @jack_w_rae · 2026-04-08 · 248 likes
"Seems like a good model from Meta that is still trailing the current series of releases. The most important thing to note is that it is not open weights. That was the main reason that Meta's models were so important. Without that, it is a lot harder to predict the value of Spark"
Nathan Lambert @natolambert · 2026-04-08 · 197 likes
"I think the most obvious is that Meta has its own frontier model and can use that to extract additional value out of its customer base/explore new markets for its products. Very few companies can say that, and it has value on its own."
Ethan Mollick @emollick · 2026-04-08 · 197 likes
"Join the ARC Prize team -- help us build ARC-AGI-4 and ARC-AGI-5"
François Chollet @fchollet · 2026-04-07 · 128 likes
"New report with @xeophon is out with the latest open model adoption data we have gathered for Interconnects & The ATOM Project. At the surface level, we can see Chinese models continuing to accelerate in adoption."
Nathan Lambert @natolambert · 2026-04-08 · 104 likes
"@JeffDean release a bigger model, the 100B ballpark is proven to be a winner with GPT OSS and Nemotron 3 Super :)"
Nathan Lambert @natolambert · 2026-04-09 · 102 likes
"A bigger problem: many third-party harnesses compress tool responses every 3 steps when approaching the context limit, leading to very low cache hit rates."
Fuli Luo @_LuoFuli · 2026-04-06 · 89 likes
"@elonmusk beat mythos?"
Junyang Lin @JustinLin610 · 2026-04-08 · 88 likes
"After playing with it a bit, Meta’s Muse Spark Thinking is fine so far, but really doesn’t match the current Big Three models. It also is a bit... weird. Like some strange language & tone, a little loose with facts, etc."
Ethan Mollick @emollick · 2026-04-09 · 84 likes
"So what's the deal with Amazon Nova? They released Nova 2 in December, and even then, the top flight Nova 2 model trailed Sonnet 4.5. And it still hasn't left preview."
Ethan Mollick @emollick · 2026-04-09 · 45 likes
"The US frontier labs have all walked away from open weights. They continue to occasionally release excellent open models (Gemma 4, etc), but they are smaller models that are not competitive with their closed weights models. So all eyes are on Chinese AI labs for open models."
Ethan Mollick @emollick · 2026-04-09 · 41 likes
"very interesting Only Opus 4.6 and GPT 5.4 manage to absolutely avoid total bankruptcy in long-term betting. This is not so much a question of reasoning as one of learning from mistakes. Clearly, when things start going south, they can adjust towards safety. Open models can't."
"Anyhow, its not bad. Just not the vibe level that the benchmarks might indicate. And, for a first re-entry into the frontier model space, given the engineering efficiencies they achieved, it feels like a solid attempt. I am sure we will see better from Meta in the future."
Ethan Mollick @emollick · 2026-04-09 · 22 likes
"Muse Spark is just a step along the path we are taking. Its fun to get usage and feedback from this initial release. But our focus within research is the frontier towards superintelligence; expect more from us this year!"
Jack Rae @jack_w_rae · 2026-04-08 · 16 likes
"Do other model builders other than MoonshotAI have something like K2 vendor verifier? GLM, MiniMax, etc?"
Luca Soldaini @soldni · 2026-04-08 · 8 likes
"It is a good release for December, 2025. With the current round of new releases on deck, like Mythos, it is trailing."
Ethan Mollick @emollick · 2026-04-08 · 1 like
"In SWE-bench Pro for example, Claude Mythos beats out Opus 4.6 by 25%."
AI Explained @youtube · 2026-04-08
"Claude Mythos getting 83% beats out Gemini 3.1 Pro 82% and GPT 5.4 Pro at 80%. But on the subset remix, Claude Mythos gets the same score as Gemini 3.1 Pro and slightly underperforms GPT 5.4 Pro, which gets 88%."
AI Explained @youtube · 2026-04-08
"The geometric mean productivity uplift according to technical staff surveyed within Anthropic was 4x, four times the productivity when using Mythos."
AI Explained @youtube · 2026-04-08
"Humanity's Last Exam designed to test topics so obscure that it would indeed be the last exam that AI would saturate. Well, when allowed some tools, Claude Mythos gets almost two-thirds of those questions right compared to around 50% for other Frontier models."
AI Explained @youtube · 2026-04-08
"It's the first one that's merged top tier coding. So it's codeex level coding and reasoning, general reasoning both in one model."
Ryan Lopopolo @youtube · 2026-04-07
"So onward to GPT-5.1, 5.2, 5.3, 5.4. To go through all these model generations and see their quirks and different working styles also meant we had to adapt the code base to change things up when the model was revved."
Ryan Lopopolo @youtube · 2026-04-07
"The best model at the time we published the work was Gemini 2, but that's 12 and a half percent of 100% across all domains."
Joseph Nelson @youtube · 2026-04-04
"I was using Gemini 3 the other day to try to automatically label a bunch of data for me. We maintain playground.roboflow.com where you can do SAM 3 versus Gemini versus Claude Opus."
Joseph Nelson @youtube · 2026-04-04
"Humans get 100% while the best AI models currently get less than half a percent on ARC-AGI-3. Gemini 3.1 was able to score 0.37%."
AI Explained @aiexplained · 2026-03-26
"The authors pointed out that models like Gemini 3, in their chain of thought, had giveaways that their training data may have resembled ARC-AGI-like tasks."
AI Explained @aiexplained · 2026-03-26
"On NetHack, Gemini 3 Pro is the best performing model at 6.8%."
AI Explained @aiexplained · 2026-03-26
"I had a brutal week seeing Claude Opus 4.6 and GPT-5.4 Extra High repeatedly screw up engineering tasks — a daily reminder that flipping to AI first isn't automatically an exponential speedup."
AI Explained @aiexplained · 2026-03-26
"The Spud model is apparently very strong, according to Sam Altman. It will be ready in a few weeks, and it will really accelerate the economy."
AI Explained @aiexplained · 2026-03-26
"I think it's now. I think we've achieved AGI. It is not out of the question that a Claude was able to create a web service, some interesting little app that all of a sudden a few billion people used."
Jensen Huang @lexfridman · 2026-03-23
"The next scaling law is the agentic scaling law. It's like multiplying AI — we could spin off agents as fast as you want. I have four scaling laws: pre-training, post-training, test time, and agentic."
Jensen Huang @lexfridman · 2026-03-23
"OpenClaw did for agentic systems what ChatGPT did for generative systems. I needed Claude and GPT and all of these models to reach a level of capability. Their breakthroughs were really important."
Jensen Huang @lexfridman · 2026-03-23
"There's something truly special happening from about December — people have really woken up to the power of Claude Code, Codex, OpenClaw. The iPhone of tokens arrived."
Jensen Huang @lexfridman · 2026-03-23
"You mentioned China with DeepSeek and MiniMax, all these companies really pushing forward the open source AI movement, and NVIDIA is really leading the way in close to state-of-the-art open source LLMs."
Jensen Huang @lexfridman · 2026-03-23
"I don't think I've typed like a line of code probably since December basically. Something flipped where I went from 80/20 of writing code by myself versus delegating to agents, to 20/80."
Andrej Karpathy @NoPriorsPod · 2026-03-20
"I actually think Claude has a pretty good personality. It feels like a teammate and it's excited with you. Codex is a lot more dry. It doesn't seem to care about what you're creating."
Andrej Karpathy @NoPriorsPod · 2026-03-20
"OpenClaw has a lot more sophisticated memory than what you would get by default, which is just a memory compaction when your context runs out. Peter innovated simultaneously in like five different ways."
Andrej Karpathy @NoPriorsPod · 2026-03-20
"I simultaneously feel like I'm talking to an extremely brilliant PhD student who's been a systems programmer for their entire life and a 10-year-old. This jaggedness is really strange."
Andrej Karpathy @NoPriorsPod · 2026-03-20
"If Gemini 3 or Claude 4.5, whatever, solves a problem, it is not the case that its own understanding of math has progressed. You run a new session and it's forgotten what it just did."
Terence Tao @dwarkeshp · 2026-03-20
"There are research efforts to try to create automated conjectures, and maybe there are ways to benchmark these and simulate this, but it's all very new science."
Terence Tao @dwarkeshp · 2026-03-20
"I don't think anyone's actually tried properly to do open source Co-work. OpenClaw doesn't even try to make sandboxing work. Co-work is actually trying to make sandboxing work but still be accessible to non-technical users."
swyx @latentspacetv · 2026-03-18
"I didn't think agents were capable of this kind of stuff — full browser manipulation. It figures out the titles because it can transcribe and selectively look at screenshots using vision."
swyx @latentspacetv · 2026-03-18