AI Explained on AI benchmarks
5 quotes from AI researchers about benchmarks, models, and evaluation
"Humans get 100% while the best AI models currently get less than half a percent on ARC-AGI-3. Gemini 3.1 was able to score 0.37%."
AI Explained @aiexplained · 2026-03-26
"The authors pointed out that models like Gemini 3, in their chain of thought, had giveaways that their training data may have resembled ARC-AGI-like tasks."
AI Explained @aiexplained · 2026-03-26
"On NetHack, Gemini 3 Pro is the best performing model at 6.8%."
AI Explained @aiexplained · 2026-03-26
"I had a brutal week seeing Claude Opus 4.6 and GPT-5.4 Extra High repeatedly screw up engineering tasks — a daily reminder that flipping to AI first isn't automatically an exponential speedup."
AI Explained @aiexplained · 2026-03-26
"The Spud model is apparently very strong, according to Sam Altman. It will be ready in a few weeks, and it will really accelerate the economy."
AI Explained @aiexplained · 2026-03-26