shreesozo on AI benchmarks — benchmark.space

7 quotes from AI researchers about benchmarks, models, and evaluation

"Mythos preview found a 27-year-old vulnerability in it. A bug that lets an attacker remotely crash any machine running that OS just by connecting to it. 27 years."

Omshri @shreesozo · 2026-04-10

CyberGym Claude Mythos Preview

"There was a bug in a single line of code. Automated fuzzing tools, the kind that throws millions of random inputs of code looking for crashes, had hit the exact line 5 million times, but never caught it. Mythos preview got in a second."

Omshri @shreesozo · 2026-04-10

Claude Mythos Preview

"On CyberGym, the standard benchmark for vulnerability reproduction, Mythos preview scores a 83.1%. The previous best Anthropic model Opus 4.6 is already at 66.6%."

Omshri @shreesozo · 2026-04-10

CyberGym Claude Mythos Preview

"SWE-bench verified which tests real world software engineering tasks, Mythos preview hits over 93.9%. Opus 4.6 is only at 80.8%."

Omshri @shreesozo · 2026-04-10

SWE-bench Verified Claude Mythos Preview

"On Terminal-Bench 2.0 which specifically tests autonomous terminals and system level operations, Mythos hit 82% against Opus 65.4%."

Omshri @shreesozo · 2026-04-10

Terminal-Bench 2.0 Claude Mythos Preview

"A model that is dramatically better at automation, coding in terminal environments, is almost automatically dramatically better at offensive and defensive security work. The security capability is not a separate feature. It is what happens when coding abilities get good enough."

Omshri @shreesozo · 2026-04-10

Claude Mythos Preview

"The model found multiple separate vulnerabilities and chained them together autonomously to go from a regular user to completely control of the machine. That is called privilege escalation. And that is the thing attackers most want to do once they are inside your system."

Omshri @shreesozo · 2026-04-10

Claude Mythos Preview