youtube_ai_explained on AI benchmarks

voices

4 quotes from AI researchers about benchmarks, models, and evaluation

"On the remix where you try to avoid memorization, Claude Mythos gets the same score as Gemini 3.1 Pro and slightly underperforms GPT 5.4 Pro, which gets 88%."

AI Explained @youtube_ai_explained · 2026-04-08

GPT-5.4 Pro

"Anthropics say it is not yet capable of causing dramatic acceleration. And yes, for followers of this channel, they admit that the previous survey they relied on for the release of Opus 4.6 was deeply flawed."

AI Explained @youtube_ai_explained · 2026-04-08

Claude Opus 4.6

"Depending on whether you anchor on Claude Opus 4.5 or Claude Opus 4.6, one would nevertheless have to conclude that things are improving at an accelerating rate."

AI Explained @youtube_ai_explained · 2026-04-08

Claude Opus 4.6

"It affected Claude Opus 4.6 and Sonnet 4.6 as well. When the reward code saw misaligned chains of thought, bad thoughts in other words, it could give a negative reward."

AI Explained @youtube_ai_explained · 2026-04-08

Claude Opus 4.6