4 quotes from AI researchers about benchmarks, models, and evaluation
"It has the ability to chain together vulnerabilities. So what this means is you find two vulnerabilities, either of which doesn't really get you very much independently. But this model is able to create exploits out of three, four, sometimes five vulnerabilities that in sequence give you some kind of very sophisticated end outcome."
"They surveyed technical staff on the productivity uplift they experienced from Claude Mythos preview relative to not using AI for work at all. The distribution is wide and the geometric mean is on the order of 4x. So there's a 4x uplift in productivity when you use Claude Mythos preview."
"This issue affected 8% of reinforcement learning episodes and was isolated to three specific subdomains... GUI computer use, office related tasks, and a small set of STEM environments. We are uncertain about the extent to which this issue has affected the reasoning behavior of the final model."