48 quotes from AI researchers about benchmarks, models, and evaluation
"Two days ago, Anthropic cut off third-party harnesses from using Claude subscriptions — not surprising. Three days ago, MiMo launched its Token Plan — a design I spent real time on, and what I believe is a serious attempt at getting compute allocation and agent harness"
"The new model from Meta is already looking like a disappointment: overoptimized for public benchmark numbers at the detriment of everything else. Knowing how to evaluate models in a way that correlates with actual usefulness is a core competency for AI labs, and any new lab is"
"In different hands, Mythos would be an unprecedented cyberweapon. I am not sure how we deal with this, except to note a narrow window where we know only 3 companies could be at this level of capability. But it may be Chinese models (maybe open weights ones?) get there in 9 months"
"So we now have a pretty good picture of the state of the frontier AI model makers.
US closed source models continue to lead. Google, OpenAI, and Anthropic stand well ahead of the pack, and may have signs of recursive self-improvement. xAI has fallen from frontier status for now"
"Our first model from MSL, Muse Spark, is now available on meta.ai! This is an efficient all-rounder model. It supports fast responses, deeper thinking, visual chain of thought, a higher inference "Contemplating" mode. Plus, it's natively multimodal."
"Seems like a good model from Meta that is still trailing the current series of releases. The most important thing to note is that it is not open weights. That was the main reason that Meta's models were so important. Without that, it is a lot harder to predict the value of Spark"
"I think the most obvious is that Meta has its own frontier model and can use that to extract additional value out of its customer base/explore new markets for its products. Very few companies can say that, and it has value on its own."
"New report with @xeophon is out with the latest open model adoption data we have gathered for Interconnects & The ATOM Project. At the surface level, we can see Chinese models continuing to accelerate in adoption."
"A bigger problem: many third-party harnesses compress tool responses every 3 steps when approaching the context limit, leading to very low cache hit rates."
"After playing with it a bit, Meta’s Muse Spark Thinking is fine so far, but really doesn’t match the current Big Three models. It also is a bit... weird. Like some strange language & tone, a little loose with facts, etc."
"So what's the deal with Amazon Nova? They released Nova 2 in December, and even then, the top flight Nova 2 model trailed Sonnet 4.5. And it still hasn't left preview."
"The US frontier labs have all walked away from open weights. They continue to occasionally release excellent open models (Gemma 4, etc), but they are smaller models that are not competitive with their closed weights models. So all eyes are on Chinese AI labs for open models."
"very interesting
Only Opus 4.6 and GPT 5.4 manage to avoid total bankruptcy in long-term betting. This is not so much a question of reasoning as one of learning from mistakes. Clearly, when things start going south, they can adjust towards safety.
Open models can't."
"Anyhow, its not bad. Just not the vibe level that the benchmarks might indicate. And, for a first re-entry into the frontier model space, given the engineering efficiencies they achieved, it feels like a solid attempt. I am sure we will see better from Meta in the future."
"Muse Spark is just a step along the path we are taking. Its fun to get usage and feedback from this initial release. But our focus within research is the frontier towards superintelligence; expect more from us this year!"
"Claude Mythos getting 83% beats out Gemini 3.1 Pro 82% and GPT 5.4 Pro at 80%. But on the subset remix, Claude Mythos gets the same score as Gemini 3.1 Pro and slightly underperforms GPT 5.4 Pro, which gets 88%."
"Humanity's Last Exam designed to test topics so obscure that it would indeed be the last exam that AI would saturate. Well, when allowed some tools, Claude Mythos gets almost two-thirds of those questions right compared to around 50% for other Frontier models."
"So onward to GPT-5.1, 5.2, 5.3, 5.4. To go through all these model generations and see their quirks and different working styles also meant we had to adapt the code base to change things up when the model was revved."
"I was using Gemini 3 the other day to try to automatically label a bunch of data for me. We maintain playground.roboflow.com where you can do SAM 3 versus Gemini versus Claude Opus."
"The authors pointed out that models like Gemini 3, in their chain of thought, had giveaways that their training data may have resembled ARC-AGI-like tasks."
"I had a brutal week seeing Claude Opus 4.6 and GPT-5.4 Extra High repeatedly screw up engineering tasks — a daily reminder that flipping to AI first isn't automatically an exponential speedup."
"I think it's now. I think we've achieved AGI. It is not out of the question that a Claude was able to create a web service, some interesting little app that all of a sudden a few billion people used."
"The next scaling law is the agentic scaling law. It's like multiplying AI — we could spin off agents as fast as you want. I have four scaling laws: pre-training, post-training, test time, and agentic."
"OpenClaw did for agentic systems what ChatGPT did for generative systems. I needed Claude and GPT and all of these models to reach a level of capability. Their breakthroughs were really important."
"There's something truly special happening from about December — people have really woken up to the power of Claude Code, Codex, OpenClaw. The iPhone of tokens arrived."
"You mentioned China with DeepSeek and MiniMax, all these companies really pushing forward the open source AI movement, and NVIDIA is really leading the way in close to state-of-the-art open source LLMs."
"I don't think I've typed like a line of code probably since December basically. Something flipped where I went from 80/20 of writing code by myself versus delegating to agents, to 20/80."
"I actually think Claude has a pretty good personality. It feels like a teammate and it's excited with you. Codex is a lot more dry. It doesn't seem to care about what you're creating."
"OpenClaw has a lot more sophisticated memory than what you would get by default, which is just a memory compaction when your context runs out. Peter innovated simultaneously in like five different ways."
"I simultaneously feel like I'm talking to an extremely brilliant PhD student who's been a systems programmer for their entire life and a 10-year-old. This jaggedness is really strange."
"If Gemini 3 or Claude 4.5, whatever, solves a problem, it is not the case that its own understanding of math has progressed. You run a new session and it's forgotten what it just did."
"There are research efforts to try to create automated conjectures, and maybe there are ways to benchmark these and simulate this, but it's all very new science."
"I don't think anyone's actually tried properly to do open source Co-work. OpenClaw doesn't even try to make sandboxing work. Co-work is actually trying to make sandboxing work but still be accessible to non-technical users."
"I didn't think agents were capable of this kind of stuff — full browser manipulation. It figures out the titles because it can transcribe and selectively look at screenshots using vision."