@fireship on AI benchmarks
1 quote from AI researchers about benchmarks, models, and evaluation
"During Anthropic's internal testing, they discovered that Mythos is basically a zero-day vending machine. It found a 16-year-old vulnerability in FFmpeg and a 27-year-old bug in OpenBSD."
Jeff Delaney @fireship · 2026-04-10