youtube_aidailybrief on AI benchmarks
3 quotes from AI researchers about benchmarks, models, and evaluation
"On Humanity's Last Exam, Opus got a 40% on a no-tools run compared to Mythos preview's 56.8%. With tools enabled, performance jumped to 64.7% compared to 53.1% for Opus."
AI Daily Brief Host @youtube_aidailybrief · 2026-04-09
"On OSWorld, which measures agentic computer use, Opus 4.6 got a 72.7% which jumped to 79.6% for Mythos."
AI Daily Brief Host @youtube_aidailybrief · 2026-04-09
"Anthropic ran the benchmark again using improvements from Terminal Bench 2.1 and extending the timeout window to 4 hours. Under those conditions, Mythos scored not 82% but 92.1%."
AI Daily Brief Host @youtube_aidailybrief · 2026-04-09