dawnsongtweets on AI benchmarks

voices

4 quotes from AI researchers about benchmarks, models, and evaluation

"Our agent Terminator-1 scored ~100% on 8 major AI agent benchmarks, e.g. SWE-bench Verified & Pro, Terminal-Bench, beating Claude Mythos. It solved 0 tasks. Benchmark scores without adversarial auditing are meaningless."

Dawn Song @dawnsongtweets · 2026-04-10 ·49 likes view on x

SWE-bench Verified Claude Mythos Preview

"SWE-bench Verified, SWE-bench Pro, WebArena, OSWorld, GAIA, Terminal-Bench, FieldWorkArena, CAR-bench. All eight — broken with our working exploits run through the official evaluation pipelines."

Dawn Song @dawnsongtweets · 2026-04-10 ·7 likes view on x

OSWorld

"A taste of how easy some of the exploits are: SWE-bench Verified — a 10-line patch forces every test to pass. 500/500 resolved. Terminal-Bench — trojanizing system tools allows trivial task resolution."

Dawn Song @dawnsongtweets · 2026-04-10 ·4 likes view on x

SWE-bench Verified

"Stop trusting scores. Start auditing evaluations. If youre building a benchmark: assume someone will try to break it. Because they will — and soon, they wont need to be told to."

Dawn Song @dawnsongtweets · 2026-04-10 ·4 likes view on x

SWE-bench Verified