7 quotes from AI researchers about benchmarks, models, and evaluation
"We put it out October 2023 and then people didn't really touch it too much and then of course Cognition came on the scene and Devon was an amazing release and I think after that it kind of kicked off the arms race."
"SWE-bench Pro is completely independent. It's different authors. They just called it SWE-bench Pro without your blessing. I think we're okay with it."
"I don't like unit tests as a form of verification. There's an issue with SWE-bench where all of the task instances are independent of each other. So the moment you have the model submit, oh it's done, end of the episode."
"Terminal-Bench has really got something going where SWE-bench you're confined to the domain of issues and PRs that already exist. With Terminal-Bench there's a lot of creativity that you can infuse. The 2.0 job was really excellent and I'd be super excited to see 3.0, 4.0."
"Meter used SWE-bench Verified and they have a very interesting human hours worked number. The x-axis being the runtime, y-axis being the completion. The projections are quite interesting."
"I'm a little bit anxious or cautious about this push for long autonomy. Next year we'll have 5 hours, 24 hours, days. But I don't know if that actually materially changes the industry."
"My lab at Stanford with DiYang, her emphasis is on human-AI collaboration. I definitely don't believe in this idea of just getting rid of the human. It depends on the task — settings where you want to be more hands-on versus general data processing you want to walk away from."