PromptEngineering on AI benchmarks
8 quotes from AI researchers about benchmarks, models, and evaluation
"In terms of benchmarks, it is very close to Opus 4.6. Now, personally, I don't really pay too much attention to these benchmarks because they are usually directionally correct, but the best way to look at a model's capability is just to test the model yourself."
Prompt Engineering @PromptEngineering · 2026-04-02
"You shouldn't be using it as a chat model. These models are trained for agentic coding, so you should be using them within a harness."
Prompt Engineering @PromptEngineering · 2026-04-02
"This harness gives the model the ability to plan, execute that plan, test, and then refine the outputs if it needs further refinement. And if you wrap your model in this agentic loop, you're going to see much better outputs even for more complex prompts."
Prompt Engineering @PromptEngineering · 2026-04-02
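The plan, execute, test, refine cycle described in the quote above can be sketched roughly as follows. This is a minimal illustration, not the harness used in the video: `run_harness`, `LoopResult`, and the `toy_*` stand-ins are hypothetical names, and a real harness would call a model where the toy functions return canned values.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class LoopResult:
    output: str
    passed: bool
    rounds: int                      # number of test rounds that were run
    history: List[str] = field(default_factory=list)


def run_harness(
    task: str,
    plan: Callable[[str], List[str]],
    execute: Callable[[List[str]], str],
    test: Callable[[str], Tuple[bool, str]],
    refine: Callable[[str, str], str],
    max_rounds: int = 3,
) -> LoopResult:
    """Plan once, execute the plan, then alternate test and refine."""
    output = execute(plan(task))
    history = [output]
    for round_no in range(1, max_rounds + 1):
        ok, feedback = test(output)
        if ok:
            return LoopResult(output, True, round_no, history)
        if round_no < max_rounds:
            # Feed the test feedback back in and try again.
            output = refine(output, feedback)
            history.append(output)
    return LoopResult(output, False, max_rounds, history)


# Toy stand-ins (assumptions): a real harness would query a model here.
def toy_plan(task: str) -> List[str]:
    return [f"write a function for: {task}"]


def toy_execute(steps: List[str]) -> str:
    # Deliberately buggy first draft so the loop has something to fix.
    return "def add(a, b):\n    return a - b\n"


def toy_test(source: str) -> Tuple[bool, str]:
    namespace: dict = {}
    exec(source, namespace)
    got = namespace["add"](2, 3)
    return got == 5, f"add(2, 3) returned {got}, expected 5"


def toy_refine(source: str, feedback: str) -> str:
    return source.replace("a - b", "a + b")


result = run_harness("add two numbers", toy_plan, toy_execute, toy_test, toy_refine)
print(result.passed, result.rounds)
```

The `max_rounds` cap is a design choice of this sketch: without it, a model that never satisfies the test would loop indefinitely.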
"I took the same prompt for tracking the real-time location position of International Space Station and ran it through Qwen, Opus, Gemini and unfortunately, GPT 4.5 did not produce any results."
Prompt Engineering @PromptEngineering · 2026-04-02
"Within an agentic harness, this has the ability to do interleaved thinking. It is thinking, then taking some actions, then continue thinking again, which is pretty great because it can look at the output of the actions that it has taken and then can build on top of that."
Prompt Engineering @PromptEngineering · 2026-04-02
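One way to picture the interleaved thinking described above is a loop in which each thought can see the observations produced by earlier actions. The sketch below is an assumption-laden toy: `interleaved_run`, `toy_model`, and the `calc` tool are hypothetical names, and the "model" is a scripted function rather than a real LLM.

```python
from typing import Callable, Dict, List, Tuple

# (thought, action_name, action_argument); the action name "final" ends the run.
Step = Tuple[str, str, str]


def interleaved_run(
    model: Callable[[List[str]], Step],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 5,
) -> List[str]:
    """Alternate thinking and acting, feeding each observation back to the model."""
    transcript: List[str] = []
    for _ in range(max_steps):
        thought, action, arg = model(transcript)
        transcript.append(f"thought: {thought}")
        if action == "final":
            transcript.append(f"answer: {arg}")
            break
        observation = tools[action](arg)
        transcript.append(f"action: {action}({arg})")
        transcript.append(f"observation: {observation}")
    return transcript


def calc(expression: str) -> str:
    # Tiny hypothetical tool: multiplies two integers written as "a*b".
    left, right = expression.split("*")
    return str(int(left) * int(right))


def toy_model(transcript: List[str]) -> Step:
    observations = [line for line in transcript if line.startswith("observation: ")]
    if not observations:
        return ("I need the product before I can answer.", "calc", "6*7")
    # Build on the output of the earlier action.
    value = observations[-1].split(": ", 1)[1]
    return ("The tool result is enough to answer.", "final", value)


for line in interleaved_run(toy_model, {"calc": calc}):
    print(line)
```

The second thought is only possible because the first action's observation is already in the transcript, which is the property the quote highlights.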
"The final step seems to be self-correction or refinement, where I have seen that almost in every response, it has the self-correction step at the end. Seems like this is an intentional step that they have added."
Prompt Engineering @PromptEngineering · 2026-04-02
"It's a very very strong release, and you can definitely use this model for agentic coding tasks. The main thing is going to be you need to select your harness wisely."
Prompt Engineering @PromptEngineering · 2026-04-02
"The one that most reasoning models fail on is this question. This is a classic river crossing puzzle with a simple twist. We just want the farmer to take goat to the other side. Unfortunately, just like other reasoning models that I have tested before, it got into that trap."
Prompt Engineering @PromptEngineering · 2026-04-02
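To make the trap concrete, the sketch below runs a breadth-first search over a river crossing puzzle with a configurable goal. The exact prompt is not given in the quote, so the setup is an assumption: the classic wolf, goat, and cabbage items, a boat that carries the farmer plus at most one item, and a goal that only requires the goat to reach the far bank. The names `solve`, `is_safe`, `ITEMS`, and `UNSAFE` are illustrative.

```python
from collections import deque
from typing import FrozenSet, List, Optional, Set, Tuple

ITEMS = ("wolf", "goat", "cabbage")
# Pairs that must not be left on a bank without the farmer (classic rules, assumed).
UNSAFE = ({"wolf", "goat"}, {"goat", "cabbage"})

State = Tuple[FrozenSet[str], bool]  # (items on the far bank, farmer on the far bank)


def is_safe(far: FrozenSet[str], farmer_far: bool) -> bool:
    # The bank without the farmer is the one that has to be checked.
    unattended = set(ITEMS) - far if farmer_far else set(far)
    return not any(pair <= unattended for pair in UNSAFE)


def solve(goal_items: Set[str]) -> Optional[List[str]]:
    """Return the shortest list of cargo per crossing that gets goal_items across."""
    start: State = (frozenset(), False)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        (far, farmer_far), path = queue.popleft()
        if goal_items <= far:
            return path
        here = far if farmer_far else frozenset(ITEMS) - far
        for cargo in [None, *sorted(here)]:
            new_far = far
            if cargo is not None:
                new_far = far - {cargo} if farmer_far else far | {cargo}
            state = (new_far, not farmer_far)
            if state in seen or not is_safe(new_far, not farmer_far):
                continue
            seen.add(state)
            queue.append((state, path + [cargo or "nothing"]))
    return None


print(solve({"goat"}))        # goal from the twisted puzzle
print(len(solve(set(ITEMS)))) # number of crossings for the classic goal
```

Under these assumptions the twisted goal needs a single crossing, while the classic goal needs several, so a model that pattern-matches to the classic puzzle produces a needlessly long plan.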