youtube_aidailybrief on AI benchmarks

voices

3 quotes from AI researchers about benchmarks, models, and evaluation

"On Humanity's Last Exam, Opus got a 40% on a no-tools run compared to Mythos Preview's 56.8%. With tools enabled, performance jumped to 64.7% compared to 53.1% for Opus."

AI Daily Brief Host @youtube_aidailybrief · 2026-04-09

Humanity's Last Exam · Claude Mythos Preview

"On OSWorld, which measures agentic computer use, Opus 4.6 got a 72.7% which jumped to 79.6% for Mythos."

AI Daily Brief Host @youtube_aidailybrief · 2026-04-09

OSWorld · Claude Mythos Preview

"Anthropic ran the benchmark again using improvements from Terminal Bench 2.1 and extending the timeout window to 4 hours. Under those conditions, Mythos scored not 82% but 92.1%."

AI Daily Brief Host @youtube_aidailybrief · 2026-04-09

Terminal-Bench 2.0 · Claude Mythos Preview