shreesozo on AI benchmarks
7 quotes from AI researchers about benchmarks, models, and evaluation
"Mythos preview found a 27-year-old vulnerability in it. A bug that lets an attacker remotely crash any machine running that OS just by connecting to it. 27 years."
Omshri @shreesozo · 2026-04-10 view on x
"There was a bug in a single line of code. Automated fuzzing tools, the kind that throw millions of random inputs at code looking for crashes, had hit that exact line 5 million times but never caught it. Mythos preview got it in a second."
Omshri @shreesozo · 2026-04-10 view on x
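A toy sketch of why a fuzzer can execute a buggy line millions of times without ever crashing it: line coverage is not the same as hitting the one input value that triggers the fault. This is not the actual bug or any real fuzzing tool; `parse_header`, `MAGIC`, and `fuzz` are hypothetical names invented for illustration.

```python
import random

MAGIC = 0xFFFFFFFFFFFFFFFF  # assumed: the single 8-byte value that triggers the bug
hits = {"buggy_line": 0}    # counts how often the buggy line executes


def parse_header(data: bytes) -> int:
    """Hypothetical toy parser: every input reaches the buggy line."""
    length = int.from_bytes(data[:8], "big")
    hits["buggy_line"] += 1
    # Buggy line: divides by zero only when length == MAGIC.
    return 1000 // (MAGIC - length)


def fuzz(trials: int, seed: int = 0) -> int:
    """Feed random 8-byte inputs to the parser and return the crash count."""
    rng = random.Random(seed)
    crashes = 0
    for _ in range(trials):
        data = bytes(rng.getrandbits(8) for _ in range(8))
        try:
            parse_header(data)
        except ZeroDivisionError:
            crashes += 1
    return crashes


if __name__ == "__main__":
    crashes = fuzz(100_000)
    print(f"buggy line executed {hits['buggy_line']} times, {crashes} crashes")
```

Each random input has a 1 in 2^64 chance of matching `MAGIC`, so blind random testing is very unlikely to crash the parser even though it runs the faulty line on every trial. Anything that reads the code and works out what makes the divisor zero can construct the crashing input directly.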
"On CyberGym, the standard benchmark for vulnerability reproduction, Mythos preview scores 83.1%. The previous best Anthropic model, Opus 4.6, is already at 66.6%."
Omshri @shreesozo · 2026-04-10 view on x
"On SWE-bench Verified, which tests real-world software engineering tasks, Mythos preview hits over 93.9%. Opus 4.6 is only at 80.8%."
Omshri @shreesozo · 2026-04-10 view on x
"On Terminal-Bench 2.0, which specifically tests autonomous terminal and system-level operations, Mythos hit 82% against Opus's 65.4%."
Omshri @shreesozo · 2026-04-10 view on x
"A model that is dramatically better at automation, coding in terminal environments, is almost automatically dramatically better at offensive and defensive security work. The security capability is not a separate feature. It is what happens when coding abilities get good enough."
Omshri @shreesozo · 2026-04-10 view on x
"The model found multiple separate vulnerabilities and chained them together autonomously to go from a regular user to complete control of the machine. That is called privilege escalation. And that is the thing attackers most want to do once they are inside your system."
Omshri @shreesozo · 2026-04-10 view on x