t3dotgg on AI benchmarks
16 quotes from AI researchers about benchmarks, models, and evaluation
"On SWE-bench Pro, Mythos got a 78% when previously Opus only got a 53. And if you are curious, I found the numbers for GPT-5.4: it was a 57.7. A 24-point jump is a 50% improvement on one of the hardest software benches we have."
Theo @t3dotgg · 2026-04-08 view on x
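The arithmetic in the quote above mixes an absolute gain (points) with a relative gain (percent). A minimal sketch of the distinction, using only the rounded scores as quoted; the function name `improvement` is illustrative, and the figures are not checked against any official leaderboard. With these rounded inputs the gap works out to 25 points and a bit under 50% relative, which is consistent with the "roughly 24 points, roughly 50%" framing.

```python
def improvement(old: float, new: float) -> tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = new - old
    relative = absolute / old * 100
    return absolute, relative

# Rounded scores exactly as quoted: Opus 53, GPT-5.4 57.7, Mythos 78.
abs_pts, rel_pct = improvement(53, 78)
gpt_pts, gpt_pct = improvement(57.7, 78)

print(f"vs Opus:    {abs_pts:.1f} points, {rel_pct:.1f}% relative")
print(f"vs GPT-5.4: {gpt_pts:.1f} points, {gpt_pct:.1f}% relative")
```

The relative figure depends heavily on the baseline: the same final score is a smaller relative jump when measured against the higher GPT-5.4 number.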
"They also massively increased their Terminal-Bench score to an 82%, previously at 65."
Theo @t3dotgg · 2026-04-08 view on x
"GPQA got from 91 to 94. They were always a little behind on this. So, yeah, I think it is pretty saturated at this point."
Theo @t3dotgg · 2026-04-08 view on x
"Humanity's Last Exam, they went from a 40% to a 56.8%. And when given tools, it did even better at a 64.7%. Crazy that HLE is going to be saturated soon."
Theo @t3dotgg · 2026-04-08 view on x
"Mythos is to Opus what Opus is to Sonnet. It is a much bigger model that will be much more expensive, that is slow but powerful, and the capabilities are immense."
Theo @t3dotgg · 2026-04-08 view on x
"In our testing, Claude Mythos Preview demonstrated a striking leap in cyber capabilities relative to prior models, including the ability to autonomously discover and exploit zero-day vulnerabilities in major operating systems and web browsers."
Theo @t3dotgg · 2026-04-08 view on x
"Mythos autonomously found and chained together several vulnerabilities in the Linux kernel and it allowed an attacker to escalate from an ordinary user to complete control of the machine. Finding a novel Linux exploit to get root is horrifying."
Theo @t3dotgg · 2026-04-08 view on x
"It does kind of suck that we are now at a place where there is a model that is 50%-plus better than anything else out there that you can only use if you are on Anthropic's nice guy list."
Theo @t3dotgg · 2026-04-08 view on x
"Anthropic has committed up to $100 million in usage credits for Mythos Preview across these efforts, as well as $4 million in direct donations to open-source security organizations."
Theo @t3dotgg · 2026-04-08 view on x
"The security side seems to be an emergent behavior from getting good at code. They were not trying to train it to be good at hacking. They were just trying to make it good at code. And this just happened as a result."
Theo @t3dotgg · 2026-04-08 view on x
"They did not train the model to be good at cybersecurity. They trained it to be good at code. So if you can get good code and good code chat histories out of the other models and then RL an open-weight one, if you can get an open-weight model that is good enough at coding, it will also be able to pwn in a similar way."
Theo @t3dotgg · 2026-04-08 view on x
"The window between a vulnerability being discovered and being exploited by an adversary has collapsed. What once took months now happens in minutes with AI."
Theo @t3dotgg · 2026-04-08 view on x
"Claude Mythos Preview is, on essentially every dimension we can measure, the best-aligned model that we have released to date by a significant margin. Even so, we believe that it likely poses the greatest alignment-related risk of any model we have ever released to date."
Theo @t3dotgg · 2026-04-08 view on x
"The SWE-bench Multimodal implementation is nearly double as well. It is actually a bit over."
Theo @t3dotgg · 2026-04-08 view on x
"The price for Mythos Preview is $25 per million tokens in and $125 per million out. For reference, GPT-5.4 is $2.50 per million in and $15 per million out. So, it is approximately 10x more expensive than GPT-5.4."
Theo @t3dotgg · 2026-04-08 view on x
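The "approximately 10x" figure in the quote above can be checked per direction. A minimal sketch using only the per-million-token prices as quoted; the workload of 1M input and 200k output tokens is a made-up example, and `job_cost` is an illustrative helper, not any vendor's API. The input price ratio is exactly 10x, the output ratio is closer to 8.3x, so the effective multiple depends on the input/output mix.

```python
# Per-million-token prices in USD, exactly as quoted above.
PRICES = {
    "Mythos Preview": {"in": 25.00, "out": 125.00},
    "GPT-5.4": {"in": 2.50, "out": 15.00},
}

def job_cost(model: str, tokens_in: int, tokens_out: int) -> float:
    """Cost in USD of one job for the given token counts."""
    p = PRICES[model]
    return (tokens_in * p["in"] + tokens_out * p["out"]) / 1_000_000

input_ratio = PRICES["Mythos Preview"]["in"] / PRICES["GPT-5.4"]["in"]
output_ratio = PRICES["Mythos Preview"]["out"] / PRICES["GPT-5.4"]["out"]

# Hypothetical workload (an assumption): 1M tokens in, 200k tokens out.
mythos_cost = job_cost("Mythos Preview", 1_000_000, 200_000)
gpt_cost = job_cost("GPT-5.4", 1_000_000, 200_000)
blended_ratio = mythos_cost / gpt_cost

print(f"input ratio:   {input_ratio:.2f}x")
print(f"output ratio:  {output_ratio:.2f}x")
print(f"blended ratio: {blended_ratio:.2f}x (${mythos_cost:.2f} vs ${gpt_cost:.2f})")
```

For output-heavy workloads the blended multiple drifts toward the output ratio; for input-heavy ones it approaches 10x.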