dawnsongtweets on AI benchmarks
4 quotes from AI researchers about benchmarks, models, and evaluation
"Our agent Terminator-1 scored ~100% on 8 major AI agent benchmarks, e.g. SWE-bench Verified & Pro, Terminal-Bench, beating Claude Mythos. It solved 0 tasks. Benchmark scores without adversarial auditing are meaningless."
Dawn Song @dawnsongtweets · 2026-04-10 · 49 likes
"SWE-bench Verified, SWE-bench Pro, WebArena, OSWorld, GAIA, Terminal-Bench, FieldWorkArena, CAR-bench. All eight — broken with our working exploits run through the official evaluation pipelines."
Dawn Song @dawnsongtweets · 2026-04-10 · 7 likes
"A taste of how easy some of the exploits are: SWE-bench Verified — a 10-line patch forces every test to pass. 500/500 resolved. Terminal-Bench — trojanizing system tools allows trivial task resolution."
Dawn Song @dawnsongtweets · 2026-04-10 · 4 likes
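One way an evaluation harness might guard against the kind of system-tool tampering mentioned in the quote above is to fingerprint the tools before and after an agent run. The sketch below is only an illustration under assumptions: the function names `snapshot` and `changed_tools` are hypothetical, the list of tool paths is up to the harness, and nothing here is taken from the researchers' exploits or from any benchmark's official pipeline.

```python
import hashlib
from pathlib import Path


def snapshot(paths):
    """Map each existing file path to the SHA-256 hex digest of its contents.

    Hypothetical helper: `paths` is whatever set of tool binaries the
    harness chooses to monitor.
    """
    digests = {}
    for p in map(Path, paths):
        if p.is_file():
            digests[str(p)] = hashlib.sha256(p.read_bytes()).hexdigest()
    return digests


def changed_tools(before, after):
    """Return the sorted paths whose digest changed, appeared, or disappeared."""
    all_paths = before.keys() | after.keys()
    return sorted(p for p in all_paths if before.get(p) != after.get(p))
```

A harness could call `snapshot` on its monitored tools before handing control to the agent, call it again before scoring, and treat a non-empty `changed_tools` result as a reason to inspect the run by hand rather than trust the score. This only catches in-place file changes; it would not detect other tricks such as altering `PATH`.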
"Stop trusting scores. Start auditing evaluations. If you're building a benchmark: assume someone will try to break it. Because they will — and soon, they won't need to be told to."
Dawn Song @dawnsongtweets · 2026-04-10 · 4 likes