fireship on AI benchmarks — benchmark.space

2 quotes from AI researchers about benchmarks, models, and evaluation

"The OpenBSD vulnerability came out of a thousand parallel agent runs across the codebase costing nearly $20,000 in compute. If you use the same process with Opus 4.6 or GPT-5.4 Pro, you'd probably find plenty of issues as well."

Fireship (Jeff Sheldon) @fireship · 2026-04-10

Cybench Claude Mythos Preview

"One claim is that Mythos hit an 84% success rate at writing working exploits in Firefox, a massive jump over Opus 4.6's 15%. But that number isn't against actual Firefox. It's against a SpiderMonkey shell with the process sandbox and other mitigations turned off."

Fireship (Jeff Sheldon) @fireship · 2026-04-10

Cybench Claude Mythos Preview