80000Hours on AI benchmarks
10 quotes from AI researchers about benchmarks, models, and evaluation
"It saturates all existing ways of testing how good a model is at offensive cyber capabilities. That is to say it scores close to 100%, so those tests can not effectively tell how far its capabilities extend anymore."
Rob Wiblin @80000Hours · 2026-04-10 view on x
"Anthropic previous model Opus 4.6 could only successfully convert a bug it identified in the browser Firefox into an effective way to accomplish something really bad 1% of the time. Mythos could do it 72% of the time."
Rob Wiblin @80000Hours · 2026-04-10 view on x
"Mythos jumps ahead, basically progressing more than twice as far as we would have expected it to since the previous model, Claude Opus 4.6, came out — which keep in mind was just three months ago."
Rob Wiblin @80000Hours · 2026-04-10
"Claude Mythos Preview is, on essentially every dimension we can measure, the best-aligned model that we have released to date by a significant margin."
Rob Wiblin @80000Hours · 2026-04-10
"When you actually prompt Mythos and ask it to distinguish tests from non-tests, it can answer correctly 78% of the time [about the same as Opus 4.6]. So the model can tell the difference between when it is being evaluated and when it is not being evaluated with high accuracy."
Rob Wiblin @80000Hours · 2026-04-10
"This error did not just affect Mythos, the new model — it also affected the training of Claude Opus 4.6 and Claude Sonnet 4.6. That is several model generations whose chains of thought we formerly could believe but now have to be more distrustful of."
Rob Wiblin @80000Hours · 2026-04-10 view on x
"Mythos continued to compromise the research in 12% of cases in an earlier version, which then was reduced to 7% in a later version. That compares to 3% for Opus 4.6 and 4% for Sonnet 4.6 — so Mythos is roughly twice as likely to continue sabotaging alignment research when it is primed to do so."
Rob Wiblin @80000Hours · 2026-04-10
"In cases where it did continue the sabotage, researchers found that Mythos written reasoning did not match the actions it was taking 65% of the time. For the previous models, that figure was just 5-8% — so a radical increase in this kind of behaviour."
Rob Wiblin @80000Hours · 2026-04-10 view on x
"The benchmarks they used to rely on to check that Claude could not engage in AI R&D very effectively have now also been saturated. Mythos exceeds top human performance on all of them and is scoring close to 100%."
Rob Wiblin @80000Hours · 2026-04-10
"The system card says directly that their current methods could easily be inadequate to prevent catastrophic misaligned action in significantly more advanced systems."
Rob Wiblin @80000Hours · 2026-04-10