emollick on AI benchmarks — benchmark.space

"I was told about the Mythos release, but didn't have access. Two points: 1) It is not built for IT security, it is just a good enough model that it is good at that too 2) This is the first, not last, model to raise security risks"

Ethan Mollick @emollick · 2026-04-07 · 791 likes

Claude Mythos Preview

"Curious how many large organization CISO offices have taken the Mythos red team reports as the red alert that it is. Based on historical trends in AI they have about six to nine months until those capabilities become widely diffused to bad actors."

Ethan Mollick @emollick · 2026-04-08 · 599 likes

Claude Mythos Preview

"In different hands, Mythos would be an unprecedented cyberweapon. I am not sure how we deal with this, except to note a narrow window where we know only 3 companies could be at this level of capability. But it may be Chinese models (maybe open weights ones?) get there in 9 months"

Ethan Mollick @emollick · 2026-04-08 · 529 likes

Claude Mythos Preview

"SuperClaude (Mythos) still seems irreducibly Claude-y given the transcripts in the system card. They are less philosophical than Opus 4.6 or spiritual than Opus 4.1, but still very Claude-like."

Ethan Mollick @emollick · 2026-04-07 · 497 likes

Claude Mythos Preview

"So we now have a pretty good picture of the state of the frontier AI model makers. US closed source models continue to lead. Google, OpenAI, and Anthropic stand well ahead of the pack, and may have signs of recursive self-improvement. xAI has fallen from frontier status for now"

Ethan Mollick @emollick · 2026-04-09 · 342 likes

Grok 4

"The story shared in the Mythos System Card still has the signs of flawed LLM writing: A story that doesn't really hold together logically, but sounds like it should. The back-and-forth banter. Lack of characters."

Ethan Mollick @emollick · 2026-04-08 · 238 likes

Claude Mythos Preview

"I think the most obvious is that Meta has its own frontier model and can use that to extract additional value out of its customer base/explore new markets for its products. Very few companies can say that, and it has value on its own."

Ethan Mollick @emollick · 2026-04-08 · 197 likes

Spark

"After playing with it a bit, Meta’s Muse Spark Thinking is fine so far, but really doesn’t match the current Big Three models. It also is a bit... weird. Like some strange language & tone, a little loose with facts, etc."

Ethan Mollick @emollick · 2026-04-09 · 84 likes

Muse Spark

"So what's the deal with Amazon Nova? They released Nova 2 in December, and even then, the top flight Nova 2 model trailed Sonnet 4.5. And it still hasn't left preview."

Ethan Mollick @emollick · 2026-04-09 · 45 likes

Claude Sonnet 4.5

"The US frontier labs have all walked away from open weights. They continue to occasionally release excellent open models (Gemma 4, etc), but they are smaller models that are not competitive with their closed weights models. So all eyes are on Chinese AI labs for open models."

Ethan Mollick @emollick · 2026-04-09 · 41 likes

Gemma 4 27B A4B

"Anyhow, it's not bad. Just not the vibe level that the benchmarks might indicate. And, for a first re-entry into the frontier model space, given the engineering efficiencies they achieved, it feels like a solid attempt. I am sure we will see better from Meta in the future."

Ethan Mollick @emollick · 2026-04-09 · 22 likes

Muse Spark

"There are more competitive small model makers, but there is still a very big gap between what small models can do and what large models can accomplish (even if the small model benchmarks say otherwise)"

Ethan Mollick @emollick · 2026-04-09 · 12 likes

"It is a good release for December, 2025. With the current round of new releases on deck, like Mythos, it is trailing."

Ethan Mollick @emollick · 2026-04-08 · 1 like

Spark