"Claude Mythos, at 83%, beats out Gemini 3.1 Pro at 82% and GPT-5.4 Pro at 80%. But on the subset remix, Claude Mythos gets the same score as Gemini 3.1 Pro and slightly underperforms GPT-5.4 Pro, which gets 88%."
"Humanity's Last Exam was designed to test topics so obscure that it would indeed be the last exam that AI would saturate. Well, when allowed some tools, Claude Mythos gets almost two-thirds of those questions right, compared to around 50% for other frontier models."
"So onward to GPT-5.1, 5.2, 5.3, 5.4. Going through all these model generations and seeing their quirks and different working styles also meant we had to adapt the codebase and change things up whenever the model was revved."
"I was using Gemini 3 the other day to try to automatically label a bunch of data for me. We maintain playground.roboflow.com, where you can compare SAM 3 versus Gemini versus Claude Opus."