7 quotes from AI researchers about benchmarks, models, and evaluation
"In terms of benchmarks, it is very close to Opus 4.6. Now, personally, I don't really pay too much attention to these benchmarks because they are usually directionally correct, but the best way to look at a model's capability is just to test the model yourself."
"This harness gives the model the ability to plan, execute that plan, test, and then refine the outputs if it needs further refinement. And if you wrap your model in this agentic loop, you're going to see much better outputs even for more complex prompts."
"I took the same prompt for tracking the real-time location position of International Space Station and ran it through Qwen, Opus, Gemini and unfortunately, GPT 4.5 did not produce any results."
"Within an agentic harness, this has the ability to do interleaved thinking. It is thinking, then taking some actions, then continue thinking again, which is pretty great because it can look at the output of the actions that it has taken and then can build on top of that."
"The final step seems to be self-correction or refinement, where I have seen that almost in every response, it has the self-correction step at the end. Seems like this is an intentional step that they have added."
"It's a very very strong release, and you can definitely use this model for agentic coding tasks. The main thing is going to be you need to select your harness wisely."
"The one that most reasoning models fail on is this question. This is a classic river crossing puzzle with a simple twist. We just want the farmer to take goat to the other side. Unfortunately, just like other reasoning models that I have tested before, it got into that trap."