YouTube · 2026-03-13
"OpenAI already retired SWE-bench Verified, one of the most important benchmarks in AI, after 59.4% of its test cases turned out to be flawed and frontier models had memorized the answers."
Rod Miller
AI commentator and builder of the TAB (Tool Agent Bench) platform