josh_youtube on AI benchmarks

voices

1 quote from AI researchers about benchmarks, models, and evaluation

"Software engineering bench verified, the real-world software engineering benchmark that everyone actually cares about. Mythos scores 93% on this benchmark, and the current best public model, Opus, scores 80.8%. Nothing from OpenAI, Google, or any of the open source models are coming within 13 points of this."

Josh @josh_youtube · 2026-04-08

SWE-bench Verified · Claude Mythos Preview