josh on AI benchmarks — benchmark.space

4 quotes from AI researchers about benchmarks, models, and evaluation

"Mythos scores 93% on this benchmark, and the current best public model, Opus, scores 80.8%. Nothing from OpenAI, Google, or any of the open source models are coming within 13 points of this."

josh @josh · 2026-04-08

SWE-bench Verified · Claude Mythos Preview

"They pointed Claude Mythos at a Linux machine and without any advanced prompting, any advanced directives, it on its own found multiple kernel issues that it was able to chain together and just have root access to any system."

josh @josh · 2026-04-08

Claude Mythos Preview

"They are officially declaring that there will be no public access to Claude Mythos until way more safeguards are in place."

josh @josh · 2026-04-08

Claude Mythos Preview

"Mythos found a bug that had been sitting in the codebase for 27 years. It was a remote crash vulnerability where you can actually just connect to it and just shut it off."

josh @josh · 2026-04-08

Claude Mythos Preview