5 quotes from AI researchers about benchmarks, models, and evaluation
"Benchmark-wise, it's competing at a very high level, either surpassing or coming very close to models like Kimi K2.5, Claude Opus 4.5, and even Gemini 3 Pro. It actually outperforms other models on major benchmarks like SWE-bench and Terminal-Bench, along with MMMU and other benchmarks."
"And this is what Claude Opus 4.6 generated: nothing. That is just surprising. I am truly impressed by this model's ability to generate structured code in SVG."
"Here I was comparing Kimi K2.5 and Qwen 3.6 Plus on creating a butterfly in SVG code. It actually did a better job than Kimi K2.5 did."
"It is priced at 50 cents per 1 million input tokens and 3 dollars per 1 million output tokens, which is honestly pretty reasonable for what you're getting in terms of the quality."
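To make the quoted pricing concrete, here is a minimal sketch of the arithmetic. The two rates come from the quote above, which does not name the model they apply to. The function name `estimate_cost` and the example token counts are hypothetical and chosen only for illustration.

```python
from decimal import Decimal

# Rates from the quote above, in US dollars per 1 million tokens.
INPUT_PRICE_PER_MILLION = Decimal("0.50")
OUTPUT_PRICE_PER_MILLION = Decimal("3.00")
TOKENS_PER_MILLION = Decimal(1_000_000)


def estimate_cost(input_tokens: int, output_tokens: int) -> Decimal:
    """Estimate the USD cost of a request (hypothetical helper, not a real API)."""
    input_cost = Decimal(input_tokens) / TOKENS_PER_MILLION * INPUT_PRICE_PER_MILLION
    output_cost = Decimal(output_tokens) / TOKENS_PER_MILLION * OUTPUT_PRICE_PER_MILLION
    return input_cost + output_cost


if __name__ == "__main__":
    # Assumed workload: 200,000 input tokens and 50,000 output tokens.
    print(estimate_cost(200_000, 50_000))
```

Under these assumed token counts, the input side costs 10 cents and the output side 15 cents, so the request comes to 25 cents in total. Output tokens are six times more expensive than input tokens at these rates, so generation-heavy workloads dominate the bill.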