theaigrid on AI benchmarks — benchmark.space

2 quotes from AI researchers about benchmarks, models, and evaluation

"On SWE-bench Verified, Mythos scores at 93.9%, and Claude Opus 4.6, the model you can actually use right now, 80%. On Terminal Bench, Mythos actually hits 82% up from Opus's 65.4%."

TheAIGRID @theaigrid · 2026-04-08

SWE-bench Verified · Claude Mythos Preview

"I was starting to believe that this kind of intelligence would not scale as the models got bigger because maybe you would have needed new architectures. But currently this new Mythos thing actually changes the game because it shows that there is a continued curve of improvement."

TheAIGRID @theaigrid · 2026-04-08

Claude Mythos Preview