3 quotes from AI researchers about benchmarks, models, and evaluation
"Mythos, specifically Claude Mythos Preview, marks a major leap forward in AI, especially when comes to coding and complex reasoning. It achieved a record-breaking 93.9% on the SWE-bench verified benchmark."
"It achieved a record-breaking 93.9% on the SWE-bench verified benchmark. This was a huge jump from 80.8% scored by Claude Opus 4.6 just a few months ago."
"Mythos is being called by some a zero-day engine. Anthropic researchers found that even engineers with no security experience could ask it to find a serious software vulnerability overnight and by morning it would deliver a working exploit."