aiexplained on AI benchmarks — benchmark.space

5 quotes from AI researchers about benchmarks, models, and evaluation

"Humans get 100% while the best AI models currently get less than half a percent on ARC-AGI-3. Gemini 3.1 was able to score 0.37%."

AI Explained @aiexplained · 2026-03-26

"The authors pointed out that models like Gemini 3, in their chain of thought, had giveaways that their training data may have resembled ARC-AGI-like tasks."

AI Explained @aiexplained · 2026-03-26

ARC-AGI 3 Gemini 3

"On NetHack, Gemini 3 Pro is the best performing model at 6.8%."

AI Explained @aiexplained · 2026-03-26

"I had a brutal week seeing Claude Opus 4.6 and GPT-5.4 Extra High repeatedly screw up engineering tasks — a daily reminder that flipping to AI first isn't automatically an exponential speedup."

AI Explained @aiexplained · 2026-03-26

Claude Opus 4.6

"The Spud model is apparently very strong, according to Sam Altman. It will be ready in a few weeks, and it will really accelerate the economy."

AI Explained @aiexplained · 2026-03-26