NewMachina on AI benchmarks — benchmark.space

3 quotes from AI researchers about benchmarks, models, and evaluation

"Mythos, specifically Claude Mythos Preview, marks a major leap forward in AI, especially when comes to coding and complex reasoning. It achieved a record-breaking 93.9% on the SWE-bench verified benchmark."

New Machina @NewMachina · 2026-04-10

SWE-bench Verified · Claude Mythos Preview

"It achieved a record-breaking 93.9% on the SWE-bench verified benchmark. This was a huge jump from 80.8% scored by Claude Opus 4.6 just a few months ago."

New Machina @NewMachina · 2026-04-10

SWE-bench Verified · Claude Mythos Preview

"Mythos is being called by some a zero-day engine. Anthropic researchers found that even engineers with no security experience could ask it to find a serious software vulnerability overnight and by morning it would deliver a working exploit."

New Machina @NewMachina · 2026-04-10

Claude Mythos Preview