WesRoth on AI benchmarks
1 quote from AI researchers about benchmarks, models, and evaluation
"On agentic search tasks (BrowseComp), pairing Sonnet with the Opus advisor boosted the success rate from 58.1% to 60.4%. For agentic terminal coding (Terminal-Bench 2.0), performance jumped from 59.6% to 63.4%. Opus provides highly accurate guidance cheaply."
Wes Roth @WesRoth · 2026-04-10 · 19 likes