wesroth on AI benchmarks — benchmark.space

4 quotes from AI researchers about benchmarks, models, and evaluation

"It has the ability to chain together vulnerabilities. So what this means is you find two vulnerabilities, either of which doesn't really get you very much independently. But this model is able to create exploits out of three, four, sometimes five vulnerabilities that in sequence give you some kind of very sophisticated end outcome."

Nicholas Carlini @wesroth · 2026-04-10

Claude Mythos Preview

"I found more bugs in the last couple of weeks than I found in the rest of my life combined."

Nicholas Carlini @wesroth · 2026-04-10

Claude Mythos Preview

"They surveyed technical staff on the productivity uplift they experienced from Claude Mythos Preview relative to not using AI for work at all. The distribution is wide and the geometric mean is on the order of 4x. So there's a 4x uplift in productivity when you use Claude Mythos Preview."

Wes Roth @wesroth · 2026-04-10

Claude Mythos Preview
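
The quote above summarizes a wide distribution of self-reported uplifts with a geometric mean. As a minimal sketch of what that statistic does, the snippet below compares the geometric and arithmetic means on a small set of multipliers. The values in `uplifts` are hypothetical and chosen only for illustration; they are not the survey data, which the source does not give.

```python
import math
from statistics import geometric_mean, mean

# Hypothetical self-reported uplift multipliers (NOT the survey data).
uplifts = [1.0, 2.0, 4.0, 8.0, 16.0]

# Geometric mean: the n-th root of the product of the values.
geo = geometric_mean(uplifts)

# Arithmetic mean, for comparison; large values pull it upward.
arith = mean(uplifts)

# Equivalent form: exponentiate the average of the logarithms.
geo_via_logs = math.exp(mean(math.log(u) for u in uplifts))

print(f"geometric mean:  {geo:.2f}x")
print(f"arithmetic mean: {arith:.2f}x")
```

For multiplicative ratios with a wide spread, the geometric mean sits below the arithmetic mean and is less dominated by a few very large responses, which is one plausible reason to report it for this kind of survey.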

"This issue affected 8% of reinforcement learning episodes and was isolated to three specific subdomains... GUI computer use, office-related tasks, and a small set of STEM environments. We are uncertain about the extent to which this issue has affected the reasoning behavior of the final model."

Wes Roth @wesroth · 2026-04-10

Claude Mythos Preview