firetail_io on AI benchmarks

voices

3 quotes from AI researchers about benchmarks, models, and evaluation

"Some internal documentation that appears to show that there are serious concerns about Mythos's capabilities in discovering vulnerabilities, including the discovery of a vulnerability in a BSD package that goes back 20 plus years, predates GitHub."

Jeremy @firetail_io · 2026-04-09

Claude Mythos Preview

"Over 99% of the zero-day vulnerabilities that Mythos discovered have not yet been patched. Even the 1% that Anthropic can discuss give a clearer picture of the substantial leap in capabilities."

Jeremy @firetail_io · 2026-04-09

Cybench · Claude Mythos Preview

"The model can be prompted to go find vulnerabilities so quickly and so extensively, including logical approaches to untangling business logic or concatenated serialization of parameters, sometimes even have levels of creativity and lateral thinking that a lot of pentesters may not have on their own."

Jeremy @firetail_io · 2026-04-09

Cybench · Claude Mythos Preview