"It scored 52.4 on SWE-bench Pro, putting it within a few points of Opus 4.6, Gemini 3.1 Pro, and GPT 5.4 for coding. On Humanity's Last Exam, it scored 42.8, which is slightly better than Opus, but trailing Gemini and GPT 5.4."
Host, The AI Daily Brief
Humanity's Last Exam, Muse Spark