WesRoth on AI benchmarks — benchmark.space

5 quotes from AI researchers about benchmarks, models, and evaluation

"On agentic search tasks (BrowseComp), pairing Sonnet with the Opus advisor boosted the success rate from 58.1% to 60.4%. For agentic terminal coding (Terminal-Bench 2.0), performance jumped from 59.6% to 63.4%. Opus provides highly accurate guidance cheaply."

Wes Roth @WesRoth · 2026-04-10 · 19 likes

BrowseComp · Claude Sonnet 4.6

"Our ability to find weaknesses skyrocketed, but our ability to fix weaknesses didn't change. General generic toughening of code is a harder computer science problem than finding one vulnerability."

Wes Roth @WesRoth · 2026-04-09

Cybench · Claude Mythos Preview

"This is reflected in the high ability of LLMs to find security vulnerabilities versus the lower abilities of LLMs to rewrite whole systems with flawless security. I think people are thinking that Mythos autonomously finding bugs is the same as Mythos autonomously patching bugs and those are not the same thing."

Wes Roth @WesRoth · 2026-04-09

Cybench · Claude Mythos Preview

"Eight out of eight models detected Mythos's flagship FreeBSD exploit. Because small, cheap, fast models are sufficient for much of the detection work, you don't need to judiciously deploy one expensive model and hope it looks in the right places."

Wes Roth @WesRoth · 2026-04-09

Cybench · Claude Mythos Preview

"Claude Mythos is their most aligned model - good news, right? But also it can do the most damage if it's unaligned. It's like a less chance of doing something bad, but the capability is through the roof."

Wes Roth @WesRoth · 2026-04-09

Claude Mythos Preview