t3dotgg on AI benchmarks — benchmark.space

"On SWE-bench Pro, Mythos got a 78% when previously Opus only got a 53. And if you are curious, I found the numbers for GPT-5.4: it was a 57.7. A 24 point jump is a 50% improvement on one of the hardest software benches we have."

Theo @t3dotgg · 2026-04-08

SWE-bench Pro Claude Mythos Preview
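
The arithmetic in the SWE-bench Pro quote can be checked with a short sketch. It uses only the three scores quoted above (78, 53, and 57.7); the helper name `gain` is hypothetical and added purely for illustration. On these numbers the jump over Opus works out to 25 points, or roughly 47% in relative terms, so the spoken "24 point" and "50%" figures read as approximations.

```python
def gain(new: float, old: float) -> tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent)."""
    points = new - old
    return points, 100.0 * points / old

# Scores as quoted above (percent on SWE-bench Pro).
mythos, opus, gpt_5_4 = 78.0, 53.0, 57.7

for name, baseline in [("Opus", opus), ("GPT-5.4", gpt_5_4)]:
    points, relative = gain(mythos, baseline)
    print(f"vs {name}: +{points:.1f} points, {relative:.1f}% relative")
# vs Opus: +25.0 points, 47.2% relative
# vs GPT-5.4: +20.3 points, 35.2% relative
```

Separating the absolute gain (percentage points) from the relative gain (percent of the old score) matters here because the quote mixes the two in one sentence.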

"They also massively increased their Terminal-Bench score to an 82%, previously at 65."

Theo @t3dotgg · 2026-04-08

Terminal-Bench 2.0 Claude Mythos Preview

"GPQA went from 91 to 94. They were always a little behind on this. So, yeah, I think it is pretty saturated at this point."

Theo @t3dotgg · 2026-04-08

GPQA Diamond Claude Mythos Preview

"Humanity's Last Exam, they went from a 40% to a 56.8%. And when given tools, it did even better at a 64.7%. Crazy that HLE is going to be saturated soon."

Theo @t3dotgg · 2026-04-08

Humanity's Last Exam Claude Mythos Preview

"Mythos is to Opus what Opus is to Sonnet. It is a much bigger model that will be much more expensive, that is slow but powerful, and the capabilities are immense."

Theo @t3dotgg · 2026-04-08

Claude Mythos Preview

"In our testing, Claude Mythos Preview demonstrated a striking leap in cyber capabilities relative to prior models, including the ability to autonomously discover and exploit zero-day vulnerabilities in major operating systems and web browsers."

Theo @t3dotgg · 2026-04-08

Claude Mythos Preview

"Mythos autonomously found and chained together several vulnerabilities in the Linux kernel and it allowed an attacker to escalate from an ordinary user to complete control of the machine. Finding a novel Linux exploit to get root is horrifying."

Theo @t3dotgg · 2026-04-08

Claude Mythos Preview

"It does kind of suck that we are now at a place where there is a model that is 50% plus better than anything else out there that you can only use if you are on Anthropic's nice guy list."

Theo @t3dotgg · 2026-04-08

Claude Mythos Preview

"Anthropic has committed up to 100 million in usage credit for Mythos Preview across these efforts, as well as 4 million in direct donations to open source security organizations."

Theo @t3dotgg · 2026-04-08

Claude Mythos Preview

"The security side seems to be an emergent behavior from getting good at code. They were not trying to train it to be good at hacking. They were just trying to make it good at code. And this just happened as a result."

Theo @t3dotgg · 2026-04-08

Claude Mythos Preview

"They did not train the model to be good at cybersecurity. They trained it to be good at code. So if you can get good code and good code chat histories out of the other models and then RL an open-weight one, if you can get an open-weight model that is good enough at coding, it will also be able to pwn in a similar way."

Theo @t3dotgg · 2026-04-08

Claude Mythos Preview

"The window between a vulnerability being discovered and being exploited by an adversary has collapsed. What once took months now happens in minutes with AI."

Theo @t3dotgg · 2026-04-08

"Claude Mythos Preview is, on essentially every dimension we can measure, the best-aligned model that we have released to date by a significant margin. Even so, we believe that it likely poses the greatest alignment-related risk of any model we have ever released to date."

Theo @t3dotgg · 2026-04-08

Claude Mythos Preview

"The SWE-bench Multimodal implementation is nearly double as well. It is actually a bit over."

Theo @t3dotgg · 2026-04-08

SWE-bench Multimodal Claude Mythos Preview

"The price for Mythos Preview is 25 dollars per million tokens in and 125 per million out. For reference, GPT-5.4 is 2.50 per million in and 15 per million out. So, it is approximately 10x more expensive than GPT-5.4."

Theo @t3dotgg · 2026-04-08

Claude Mythos Preview
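
The price comparison in the last quote can be worked through with a small sketch. It uses only the four per-million-token prices quoted above; the `cost` helper and the workload of 3 million input tokens to 1 million output tokens are assumptions chosen for illustration, not figures from the source. On the quoted prices the ratio is exactly 10x on input and about 8.3x on output, so "approximately 10x" holds best for input-heavy usage.

```python
# Quoted prices in dollars per million tokens: (input, output).
PRICES = {
    "Mythos Preview": (25.0, 125.0),
    "GPT-5.4": (2.50, 15.0),
}

def cost(model: str, tokens_in: int, tokens_out: int) -> float:
    """Dollar cost of a workload at the quoted per-million-token prices."""
    price_in, price_out = PRICES[model]
    return (tokens_in * price_in + tokens_out * price_out) / 1_000_000

input_ratio = PRICES["Mythos Preview"][0] / PRICES["GPT-5.4"][0]
output_ratio = PRICES["Mythos Preview"][1] / PRICES["GPT-5.4"][1]
print(f"input: {input_ratio:.1f}x, output: {output_ratio:.1f}x")
# input: 10.0x, output: 8.3x

# Hypothetical workload (an assumption): 3M tokens in, 1M tokens out.
mythos_cost = cost("Mythos Preview", 3_000_000, 1_000_000)
gpt_cost = cost("GPT-5.4", 3_000_000, 1_000_000)
print(f"${mythos_cost:.2f} vs ${gpt_cost:.2f} ({mythos_cost / gpt_cost:.1f}x)")
# $200.00 vs $22.50 (8.9x)
```

The blended multiple depends on the input-to-output mix, which is why the sketch reports the two ratios separately before combining them.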