"You would be understandably slightly confused to see it being better in all sorts of coding benchmarks, measures of scientific reasoning, and academic reasoning, like GPQA Diamond and Humanity's Last Exam respectively, as well as general pattern recognition, ARC-AGI-2. But yet, in a head-to-head on GDPval, it falls seemingly quite far behind Claude Opus 4.6."
AI Explained
AI analysis YouTube channel
GPQA Diamond
Gemini 3.1 Pro