YouTube on AI benchmarks
8 quotes from AI researchers about benchmarks, models, and evaluation
"In SWE-bench Pro, for example, Claude Mythos beats out Opus 4.6 by 25%."
AI Explained @youtube · 2026-04-08
"Claude Mythos, getting 83%, beats out Gemini 3.1 Pro at 82% and GPT 5.4 Pro at 80%. But on the subset remix, Claude Mythos gets the same score as Gemini 3.1 Pro and slightly underperforms GPT 5.4 Pro, which gets 88%."
AI Explained @youtube · 2026-04-08
"The geometric mean productivity uplift according to technical staff surveyed within Anthropic was 4x, four times the productivity when using Mythos."
AI Explained @youtube · 2026-04-08
"Humanity's Last Exam was designed to test topics so obscure that it would indeed be the last exam that AI would saturate. Well, when allowed some tools, Claude Mythos gets almost two-thirds of those questions right, compared to around 50% for other frontier models."
AI Explained @youtube · 2026-04-08
"It's the first one that's merged top-tier coding. So it's Codex-level coding and reasoning, general reasoning, both in one model."
Ryan Lopopolo @youtube · 2026-04-07
"So onward to GPT-5.1, 5.2, 5.3, 5.4. To go through all these model generations and see their quirks and different working styles also meant we had to adapt the codebase to change things up when the model was revved."
Ryan Lopopolo @youtube · 2026-04-07
"The best model at the time we published the work was Gemini 2, but that's 12 and a half percent of 100% across all domains."
Joseph Nelson @youtube · 2026-04-04
"I was using Gemini 3 the other day to try to automatically label a bunch of data for me. We maintain playground.roboflow.com where you can do SAM 3 versus Gemini versus Claude Opus."
Joseph Nelson @youtube · 2026-04-04