7 quotes from AI researchers about benchmarks, models, and evaluation
"Mythos preview found a 27-year-old vulnerability in it. A bug that lets an attacker remotely crash any machine running that OS just by connecting to it. 27 years."
"There was a bug in a single line of code. Automated fuzzing tools, the kind that throws millions of random inputs of code looking for crashes, had hit the exact line 5 million times, but never caught it. Mythos preview got in a second."
"On CyberGym, the standard benchmark for vulnerability reproduction, Mythos preview scores a 83.1%. The previous best Anthropic model Opus 4.6 is already at 66.6%."
"A model that is dramatically better at automation, coding in terminal environments, is almost automatically dramatically better at offensive and defensive security work. The security capability is not a separate feature. It is what happens when coding abilities get good enough."
"The model found multiple separate vulnerabilities and chained them together autonomously to go from a regular user to completely control of the machine. That is called privilege escalation. And that is the thing attackers most want to do once they are inside your system."