benchmark.space
X / Twitter · 2026-04-10
"A taste of how easy some of the exploits are: SWE-bench Verified — a 10-line patch forces every test to pass. 500/500 resolved. Terminal-Bench — trojanizing system tools allows trivial task resolution."
Dawn Song
SWE-bench Verified