3 quotes from AI researchers about benchmarks, models, and evaluation
"On Humanity's Last Exam, Opus got a 40% on a no-tools run compared to Mythos preview's 56.8%. With tools enabled, performance jumped to 64.7% compared to 53.1% for Opus."
"Anthropic ran the benchmark again using improvements from Terminal Bench 2.1 and extending the timeout window to 4 hours. Under those conditions, Mythos scored not 82% but 92.1%."