Two quotes from AI researchers about benchmarks, models, and evaluation
"The OpenBSD vulnerability came out of a thousand parallel agent runs across the codebase costing nearly $20,000 in compute. If you use the same process with Opus 4.6 or GPT-5.4 Pro, you'd probably find plenty of issues as well."
"One claim is that Mythos hit an 84% success rate at writing working exploits in Firefox, a massive jump over Opus 4.6's 15%. But that number isn't against actual Firefox. It's against a SpiderMonkey shell with the process sandbox and other mitigations turned off."