latent_space_pod on AI benchmarks
9 quotes from AI researchers about benchmarks, models, and evaluation
"We put it out October 2023, and then people didn't really touch it too much. And then of course Cognition came on the scene, and Devin was an amazing release, and I think after that it kind of kicked off the arms race."
John Yang @latent_space_pod · 2025-12-31
"SWE-bench Pro is completely independent. It's different authors. They just called it SWE-bench Pro without your blessing. I think we're okay with it."
John Yang @latent_space_pod · 2025-12-31
"Multilingual is nine languages across 40 repos. JavaScript, Rust, Java, C, Ruby."
John Yang @latent_space_pod · 2025-12-31
"I don't like unit tests as a form of verification. There's an issue with SWE-bench where all of the task instances are independent of each other. So the moment you have the model submit, oh it's done, end of the episode."
John Yang @latent_space_pod · 2025-12-31
"I think we should intentionally include impossible tasks as a flag. Everyone reporting above 75 on Terminal-Bench, you'd be cheating."
John Yang @latent_space_pod · 2025-12-31
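The impossible-task idea in the quote above can be sketched as a sanity check: if a benchmark deliberately includes tasks with no valid solution, any score above the solvable fraction is suspect, and any claimed pass on such a task is a direct red flag. This is a minimal illustration only, not how SWE-bench or Terminal-Bench work; the `Task` type, the `impossible` flag, the task IDs, and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    task_id: str
    impossible: bool  # hypothetical flag: True if no valid solution exists


def max_honest_score(tasks):
    """Highest pass rate (percent) reachable without 'solving' impossible tasks."""
    solvable = sum(1 for t in tasks if not t.impossible)
    return 100.0 * solvable / len(tasks)


def flagged_tasks(tasks, passed_ids):
    """Impossible tasks that a submission nonetheless claims to have passed."""
    return sorted(
        t.task_id for t in tasks if t.impossible and t.task_id in passed_ids
    )


# Hypothetical benchmark: 5 tasks, 1 of them deliberately unsolvable.
tasks = [
    Task("t1", False),
    Task("t2", False),
    Task("t3", False),
    Task("t4", False),
    Task("canary-1", True),
]

ceiling = max_honest_score(tasks)
suspicious = flagged_tasks(tasks, {"t1", "canary-1"})
```

Here `ceiling` is the best score an honest submission could report on this toy set, and `suspicious` lists the unsolvable tasks a submission claims to have passed. The numbers are illustrative and are not derived from the threshold mentioned in the quote.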
"Terminal-Bench has really got something going where SWE-bench you're confined to the domain of issues and PRs that already exist. With Terminal-Bench there's a lot of creativity that you can infuse. The 2.0 job was really excellent and I'd be super excited to see 3.0, 4.0."
John Yang @latent_space_pod · 2025-12-31
"METR used SWE-bench Verified and they have a very interesting human hours worked number. The x-axis being the runtime, y-axis being the completion. The projections are quite interesting."
John Yang @latent_space_pod · 2025-12-31
"I'm a little bit anxious or cautious about this push for long autonomy. Next year we'll have 5 hours, 24 hours, days. But I don't know if that actually materially changes the industry."
John Yang @latent_space_pod · 2025-12-31
"My lab at Stanford with Diyi Yang, her emphasis is on human-AI collaboration. I definitely don't believe in this idea of just getting rid of the human. It depends on the task: settings where you want to be more hands-on versus general data processing you want to walk away from."
John Yang @latent_space_pod · 2025-12-31