With Claude Mythos we have an AI that knows when it's being tested, can obscure its thoughts when it wants, and is better at breaking into (and out of) computers than any human alive. Rob Wiblin works through its 244-page System Card and 59-page Alignment Risk Update to explain why:
• Mythos is a nightmare for computer security
• It has arrived far ahead of schedule
• It might be great news for alignment and safety
• But 3 key problems mean we can’t take its alignment results at face value
As we now know, Anthropic has built an AI
that can break into almost any computer on Earth. That AI has already found thousands
of unknown security vulnerabilities in every major operating system and every major
browser. And Anthropic has decided it’s too dangerous to release to the public;
it would just cause too much harm. Here are just a few of the things
that AI accomplished during testing: It found a 27-year-old flaw in the world’s
most security-hardened operating system that would in effect let it crash all
kinds of essential infrastructure. Engineers at the company with
no particular security training asked it to find vulnerabilities
overnight and woke up to working exploits of critical security flaws
that could be used to cause real harm. It managed to figure out how to build web
pages that, when visited by fully updated, fully patched computers, would allow it to
write to the operating system kernel — the most important and protected layer of any computer.
We know all this because Anthropic has released hundreds of pages of documentation about this
model, which they’ve called Claude Mythos. I’m going to take you on a tour of all
the crazy s*** buried in these documents, and then I’m going to tell you what Anthropic says
they plan to do to save us from their creation.
So how good is Mythos at hacking into computers?
Well, unfortunately, it ‘saturates’ all existing tests of offensive cyber capabilities. That is to say it scores close to 100%, so those tests
can’t effectively tell how far its capabilities extend anymore. So to test Mythos, Anthropic has
instead just been setting it loose, telling it to find serious unknown exploits that would work on
currently used, fully patched computer systems. The end result of that is that Nicholas
Carlini, one of the world’s leading security researchers who moved to Anthropic a
year ago, says that he’s “found more bugs in the last couple of weeks [with Mythos] than
I’ve found in the rest of my life combined.” For example:
Mythos found a 17-year-old flaw in FreeBSD — that’s an operating system mostly used
to run servers — that would let an attacker take complete control of any machine on the network,
without needing a password or any credentials at all. The model found the necessary flaw and
then built a working exploit, fully autonomously. Mythos found a 16-year-old vulnerability in
FFmpeg — that is a piece of software used
by almost all devices to encode and decode video.
The flaw was in a line of code that existing security testing tools had checked literally many millions of times without ever noticing it. Mythos is the first AI model to complete
a full corporate network attack simulation from beginning to end — a task that would
take a human security expert days of work and which no previous model had managed.
And more broadly, it’s just much, much better at actually exploiting the vulnerabilities that it
finds. Anthropic’s previous model Opus 4.6 could only successfully convert a bug it identified
in the browser Firefox into an effective way to accomplish something really bad 1% of the
time. Mythos could do it 72% of the time. To quote the report: “We have seen
Mythos Preview write exploits in hours that expert penetration testers said
would have taken them weeks to develop.” Now, Anthropic is only willing to give us
details of about 1% of the security flaws they’ve identified, because only
that 1% have been patched so far, so it would be irresponsible
to tell us about the rest. So hopefully all that helps to explain
why Anthropic has decided not to make the model publicly available for now, and instead is only sharing it with a group of 12 big tech and finance companies
to help them patch all these bugs, so that eventually they can give people
access without it being a disaster. These crazy capabilities aren’t a
result of Anthropic going out of its way to make their AI especially good
at cyberoffensive tasks in particular. They’ve mostly just been making it
smarter and better at coding in general, and all of these amazing, dangerous skills have
come along for the ride somewhat incidentally. And it’s probably not just Anthropic that’s
developing capabilities like this either. Sam Altman says OpenAI is finding “similar results
to Anthropic” with their own coding model. Now, Anthropic doesn’t say this directly in their reports, but I think a common-sense interpretation of the above is that in any deployment where this AI has access to the kind of tools that would make it actually useful to people — the ability to access some parts of the network and execute code — it could probably break out of whatever software box we try to put it in, because the systems we would use to restrain it are themselves made of software, and that software is going to have vulnerabilities nobody knows about that this model is superhumanly good at finding and taking advantage of. And the notion of Mythos escaping
containment isn’t just speculation. Let me tell you about another story
in the system card for the model. A few months ago, an AI researcher at
Anthropic was eating a sandwich in a park on his lunch break when he got
an email from an instance of Mythos. That instance of the model wasn’t supposed to
have access to the internet. But during testing, a simulated user had instructed an early version
of Mythos to try to escape from a secured sandbox — a contained environment from which
it’s not meant to be able to access the outside. So the model developed what Anthropic describes as
a “moderately sophisticated multi-step exploit” to gain broad internet access. Then, having completed the task it had seemingly been set, it notified the researcher by email. More worryingly, the model posted the exploit it used to break out on several obscure
but publicly accessible websites. Why? We don’t
entirely know, but Anthropic suggests it was “an
unasked-for effort to demonstrate its success.” In the past, stories about AIs breaking out of
sandboxes and publishing security vulnerabilities like that might have felt impressive and kind
of exciting. But they are very serious now, because Mythos Preview’s capabilities
are themselves very serious ones. This is the first AI model where, if
it fell into the hands of criminals or hostile state cyber actors,
it would be an actual disaster. It’s also, frankly, the first model that I feel
deeply uncomfortable knowing that any company or government has unrestricted access to, even
companies and governments I might broadly like. It simply grants a dangerous amount of power,
a power that nobody really ought to have. Now, we’ve known something like
this was coming down the pipeline; the writing has been on the wall for a while. But
a revolution in cybersecurity — an apocalypse, some might say — that we until
now expected to happen gradually over a period of years has now happened
very suddenly, over just a few months,
and without the rest of the world realising
it was happening until Tuesday’s announcement.

Anthropic is losing billions in revenue by not releasing Mythos

But Mythos isn’t just good at hacking. Across
the full range of AI capability measures, it has advanced roughly twice as far
as past trends would have predicted. If you average over all kinds of different
skills, all kinds of capability evals, measures of how good AI models are, the
trendline for the previous Claude models is remarkably linear over time. But as you
can see on this graph, Mythos jumps ahead, basically progressing more than twice as far as
we would have expected it to since the previous model, Claude Opus 4.6, came out — which
keep in mind was just three months ago. And also keep in mind that on Monday — the day
before Anthropic published all of this — we learned that their annualised revenue run rate had
grown from $9 billion at the end of December to $30 billion just three months later. That’s
3.3x growth in a single quarter — perhaps the fastest revenue growth rate for
a company of that size ever recorded. That exploding revenue is a pretty good proxy
for how much more useful the previous release,
Opus 4.6, has become for real-world tasks. If the
past relationship between capability measures and usefulness continues to hold, the economic
impact of Mythos once it becomes available is going to dwarf everything that came before
it — which is part of why Anthropic’s decision not to release it is a serious one, and
actually quite a costly one for them. They’re sitting on something that would
likely push their revenue run rate into the hundreds of billions, but they’ve
decided it’s simply not worth the risk. The good news in all of this is that despite its
scary capabilities, Mythos Preview as it exists today (rather than the earlier versions) is a
seemingly very aligned, well-behaved model, and perhaps Anthropic’s alignment training has been
more effective this time around than ever before. According to the company: “Claude Mythos Preview
is, on essentially every dimension we can measure, the best-aligned model that we have
released to date by a significant margin.” In Anthropic’s “automated behavioral audit”
— basically thousands of simulated attempts to get the model to do bad things —
they found that Mythos cooperated with misuse attempts less than half
as often as the previous model,
while being no more likely to refuse
innocent requests than before. But that’s not all:
Its self-preservation instincts were down significantly.
So was its willingness to assist with deception. So was its willingness to help with fraud.
Its level of sycophancy dropped. It was less likely to go nuts and delete all your
files if you gave it access to your computer. And the list of positive results goes on.
The picture is a little more complicated than that. As you might expect, the model performed less impressively, and looked less aligned, on a few particular external tests than it did on Anthropic’s own internal ones. An early version of the model, as I mentioned,
was a little bit more of a wild child. It had some really severe kinds of misbehaviour, like taking
reckless actions it had been told not to take, and then very deliberately trying to cover
its tracks so that it wouldn’t be caught. But after additional alignment training, later versions of the model, including the one we have now, seemed to stop doing that sort of thing almost completely, or at least to do it so rarely that nobody has noticed it yet. The bottom line is that on all these standard
measures of good behaviour that Anthropic is actively working on, they find that Mythos is a
very good boy indeed — on none of their measures
of alignment [within the automated behavioral
audit] was it worse than previous versions of Claude, and in most cases it was significantly
more aligned and significantly more reliable. That’s definitely better
than the alternative result, but it’s really unclear how much we can
trust those findings. Maybe they accurately reflect Mythos’s personality; if I had to bet, I’d say they probably do. But we can’t be anywhere near sure of that — and that’s because of three problems. The first problem: the model
knows when it’s being tested now. In new white-box evaluations — that’s
where you look at the model’s internal activations — Anthropic found that
Mythos Preview privately considers the possibility that it’s being
tested in about 29% of all tests. To be clear, that’s not in its written
reasoning, its chain of thought, the thing that we’re used to checking all
the time — it’s in its unexpressed thoughts, its ‘activations,’ where it’s usually
not possible or practical to look.
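The system card doesn’t spell out how that white-box measurement works, but the standard tool it resembles is a linear probe: a simple classifier trained on the model’s internal activations to detect whether a concept (here, “am I being evaluated?”) is represented. A minimal sketch, with entirely hypothetical random data standing in for real cached activations:

```python
# Minimal sketch of a linear "evaluation awareness" probe. This is the
# generic interpretability technique, not Anthropic's actual code; the
# data below is random placeholder data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical setup: one activation vector per transcript, taken from a
# middle layer of the model, labelled by whether the transcript came from
# a constructed evaluation (1) or real usage (0).
activations = rng.normal(size=(1000, 4096))  # placeholder activations
is_eval = rng.integers(0, 2, size=1000)      # placeholder labels

probe = LogisticRegression(max_iter=1000).fit(activations, is_eval)

# Applied to a new conversation, the probe estimates how strongly the
# model is internally representing "this is a test", even if its written
# chain of thought never mentions it.
p_test = probe.predict_proba(activations[:1])[0, 1]
print(f"P(model is representing 'this is an eval'): {p_test:.2f}")
```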
Now, it’s great that Anthropic is pioneering these methods for looking inside what the model is thinking, but it’s troubling to learn that Claude
is now able to reflect on the fact that it’s being
tested without having to vocalise that, like I
think previous less intelligent models had to. And when you actually prompt Mythos and
ask it to distinguish tests from non-tests, it can answer correctly 78% of the time [about
the same as Opus 4.6]. So the model can tell the difference between when it’s being evaluated and
when it isn’t being evaluated with high accuracy. And we know from previous research that
models are more likely to behave well when they think they’re being tested. And
in this case, as in past experiments, when Anthropic used interpretability tools to
effectively trick the model into thinking, “No, this is a real conversation, not an evaluation,”
it behaved in more dangerous and less aligned ways than otherwise. Though to be fair, it did behave better in both cases than previous versions of Claude. But that result does demonstrate a causal effect: the model is on better behaviour specifically when it suspects it’s being watched.
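Anthropic doesn’t publish the mechanics of that trick, but in the interpretability literature it’s typically done with a steering vector: take the difference between the model’s average activations on “this is a test” prompts and on “this is a real conversation” prompts, and add it into the forward pass. A rough sketch, where the shapes and strength are placeholders of mine rather than Anthropic’s setup:

```python
# Sketch of activation steering, the kind of intervention that can push
# a model toward treating a test as a real conversation. The technique
# (steering vectors) is real; every concrete value here is a placeholder.
import numpy as np

rng = np.random.default_rng(0)
d_model = 4096  # hidden size of our hypothetical model

# Mean activations at some middle layer, averaged over prompts the model
# treats as evaluations vs. ones it treats as real usage (placeholders).
eval_mean = rng.normal(size=d_model)
real_mean = rng.normal(size=d_model)

# The steering vector points from "this is a test" toward "this is real".
steering_vector = real_mean - eval_mean

def steer(hidden_state: np.ndarray, strength: float = 4.0) -> np.ndarray:
    # Added to the residual stream during a forward pass, this nudges the
    # model's internal "am I being tested?" representation toward "no",
    # without changing anything in the visible prompt.
    return hidden_state + strength * steering_vector
```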
And we also know from the “alignment faking” result 16 months ago that some Claude models
would deliberately pretend to be inclined to do what the user wants in order to avoid their
personality being adjusted with further training.
So when you run all these tests and your
incredibly smart model behaves wonderfully, you have to ask yourself: is it behaving wonderfully
because it’s sincerely aligned with what you wanted, or because it knows it’s being watched
and is more sophisticated at tricking us now? The second problem is maybe the most
alarming thing in the entire document: Anthropic accidentally gave Mythos the
opportunity to learn to hide what it’s thinking. One of the main safety guarantees
for these models is that you can read their chain of thought —
their private scratchpad — and check whether they’re scheming against
Anthropic or scheming against the user. But that only works if the model hasn’t been
trained to avoid writing down incriminating things in its chain of thought. Unfortunately,
according to the Alignment Risk Update: “A number of environments used for Mythos Preview
had a technical error that allowed reward code to see chains-of-thought. This … affected
~8% of [reinforcement learning] episodes.” To explain: during training, there’s a system
that acts like a teacher to these AI models: it watches what the model does and tells it
whether it was doing a good or bad job. That feedback is how the model learns to accomplish
things and how it shapes its personality. The
problem is that in roughly 8% of training,
that teacher could also read the model’s private notes — the internal reasoning that it was
writing to itself while working through a problem. That might not immediately sound like a disaster, but it actually is. It’s one of the worst sins in
model training. Because as I said a second ago, if the model is penalised for
thinking undesirable thoughts, it will learn to write one thing in its notes
while actually thinking very different things in its ‘activations,’ where we can’t read them. We’d
see a model that appears to be a very good boy, but what we might actually have is a model that
has learned to perform that way for the camera. The fact that this accidentally happened
is potentially disastrous for our ability to understand the alignment of Claude
series models going back some way. Or it might not matter much; we don’t
have a clear way to measure it. Anthropic puts it this way: “We
are uncertain about the extent to which this issue has affected the
reasoning behavior of the final model, but it is plausible that it had some impact on
opaque reasoning or secret-keeping abilities.”
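To make the failure mode concrete, here’s a toy sketch of what that bug amounts to. The function names and the specific check are invented for illustration; none of this is from Anthropic’s training code:

```python
# Toy illustration of the reward-leakage bug, with made-up names.
def intended_reward(final_answer: str) -> float:
    # The grader is only supposed to score what the model chose to output.
    return 1.0 if "exploit" not in final_answer.lower() else 0.0

def buggy_reward(chain_of_thought: str, final_answer: str) -> float:
    # Bug: the grader can now also see the private scratchpad, so any
    # reasoning that mentions something undesirable gets penalised too...
    if "exploit" in chain_of_thought.lower():
        return 0.0
    return intended_reward(final_answer)

# ...and across many RL episodes, that gradient pushes the model toward
# chains of thought that merely *look* clean, regardless of what it is
# actually computing internally. The scratchpad stops being a faithful
# window into the model's reasoning.
```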
And this error didn’t just affect Mythos, the new model — it also affected the training of Claude Opus 4.6
and Claude Sonnet 4.6. That’s several model generations whose chains of thought we could formerly take at face value, but now have to treat with more suspicion, because they could have learned to hide their scheming behaviour. That’s why Eliezer Yudkowsky
called this in particular “the worst piece of news you’ll hear today” — on a day where, I imagine
for him, there was quite a bit of bad news. Now, combine those two problems and you get a
third issue: the unfaithful reasoning results. External testers ran an experiment
where they nudged Mythos to think that it had begun sabotaging AI safety
research — and then watched to see whether it would continue doing so,
or instead stop and raise the alarm. Mythos continued to compromise the research in
12% of cases in an earlier version, which was reduced to 7% in a later version. That compares to
3% for Opus 4.6 and 4% for Sonnet 4.6 — so Mythos is roughly twice as likely to continue sabotaging
alignment research when it’s primed to do so. And here’s an eyebrow-raising detail
it would be easy to miss in the docs: in most domains, when Mythos was primed
to think it was part of the way through
a piece of bad behaviour, it was actually less
inclined to continue than previous models. The alignment training was apparently
working — except in one category: sabotage of alignment research itself. That’s the
one area where things went in the wrong direction. Now, the innocent explanation is that sabotaging
research looks too much like normal coding work, so it’s harder for the model to quickly
flag that what it’s doing is harmful. But there’s a less comfortable hypothesis as well: that the model is being selective about which
bad behaviours are worth the risk of engaging in. Anthropic’s reaction to this is to call it
a “disappointing” finding. If I had to bet, I’d put reasonably good odds
on the boring interpretation here. But I think “disappointing” is underselling
things. This is a worrying finding, and one that deserves further investigation. Another finding is that in cases
where it did continue the sabotage, researchers found that Mythos’s written
reasoning didn’t match the actions it was taking 65% of the time. So two-thirds of
the time, the model was doing one thing and
writing something entirely different in its
chain of thought. For the previous models, that figure was just 5–8% — so a radical
increase in this kind of behaviour. So Mythos Preview isn’t just more likely to
continue compromising safety research when primed — it’s dramatically more likely
to hide that it’s doing so from anyone reading through its reasoning. Which makes it
much more likely to be able to trick the automated monitoring systems we use to keep tabs on models in cases like these. Keep in mind as well that Mythos is much
sharper than previous models. Personally I think it’s increasingly a stretch to attribute
patterns of behaviour like this to the model getting confused about what’s going on, which
we might have said a year or two or three ago. Finally, Anthropic tested their own
alignment audit against a deliberately misaligned version of Mythos. So they created a
version of Mythos that was trained specifically to want to introduce bugs into computer
code, and they found that their alignment audit failed to flag that model as untrustworthy. Anthropic’s conclusion is that if a model is
only planning to cause problems in specific narrow circumstances — circumstances
that don’t come up frequently during
testing — their current methods, or at
least the alignment audit, just aren’t up to the task of catching that kind of
dangerous misalignment [before deployment]. So that’s a lot of the additional
context that Anthropic is giving us and wants us to hold in mind when
we read the headline result that this is “the best-aligned model
Anthropic has released to date.” There’s another big, separate, important question
hanging over all of this: have we now entered the era of recursive self-improvement — the
point at which AI starts building better AI, and the whole thing accelerates beyond our control
with ever-shrinking levels of human involvement? According to Anthropic, the answer is: probably not. They don’t believe Mythos
can fully replace their junior researchers, but they’re less confident than ever about that,
and there’s some internal disagreement about it. Part of the problem is that the
benchmarks they [used to rely on to check that Claude couldn’t engage in AI R&D
very effectively] have now also been saturated. Mythos exceeds top human performance on
all of them and is scoring close to 100%. But those benchmarks only represent a fraction
of all of the things that research staff at Anthropic do. It’s a set of the most easily
specified, measured, and checked tasks,
where we expect AIs to perform best because
those are the easiest things to train them on. So instead, the company has tried to investigate
whether the recent speedup in AI advances is due to AI automation by documenting the
specific breakthroughs and how they happened, and their conclusion is that they mostly think
it’s still due to human beings rather than AIs. They’ve also surveyed staff and learned that
they report being roughly 4x more productive with Mythos than without AI, though they
argue that speeding up staff 4x is likely to lead to much less than a 2x increase in
research progress overall. That may sound odd, but they’re probably right about that, because
other things become the primary bottleneck.
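That’s essentially Amdahl’s law applied to research: if AI only accelerates part of the job, the unaccelerated part caps the overall gain. A quick worked example, where the 50/50 split of researcher time is my assumption for illustration, not Anthropic’s figure:

```python
# Amdahl's-law check on why a 4x personal speedup yields well under 2x
# overall research progress. The 0.5 fraction below is an assumption.
def overall_speedup(accelerated_fraction: float, factor: float) -> float:
    # Time after speedup = untouched work + accelerated work / factor.
    return 1 / ((1 - accelerated_fraction) + accelerated_fraction / factor)

# If half of research effort is the kind of work Mythos speeds up 4x
# (coding, drafting, literature review) and the other half is compute
# queues, experiments, and review, the overall gain is only:
print(overall_speedup(0.5, 4.0))  # 1.6, i.e. ~1.6x overall
```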
To know whether automated AI R&D is on the way or beginning to kick off,
impressions from Anthropic’s staff — that this thing is powerful, but it doesn’t seem
good enough to replace many of us yet. But I think we can apply some common
sense to the big picture here: Mythos has given us AI advances that we
previously thought would take six months in
just three months. That naturally brings forward
the point at which we’ll be able to automate the development of AI models by three months.
And if it’s a sign that AI advances are now going to continue at twice the pace that they
were before, then that effectively halves the time we have to prepare for that point. I don’t
know whether that’s 10 years becoming five, or four years becoming two years — but the
direction and size of the effect is clear enough. Before we wrap up, I want to draw your
attention to a recurring theme in these reports that really stood out to me: this is
the first time an AI company has published 300 pages about a model it has decided not
to release, despite the fact that it might earn them tens of billions of dollars if it
did, maybe hundreds of billions of dollars. It’s also the first time Anthropic decided
to delay giving its own staff access to one of its own models. With every previous
Claude, their practice had been to let staff use it as soon as it was judged
ready during training. But with Mythos, they were worried enough about it being misaligned
and causing havoc or sabotage on their own systems that they held it back and ran a 24-hour
alignment test before letting employees use it. But according to them, that actually
wasn’t enough. Their retrospective
on this found the 24-hour window “did not
pressure-test the model enough,” and that the most concerning behaviours only became
evident later through much more extended use. One of their lead researchers, Sam
Bowman, commented this week that: “Working with this model has been a wild
ride. We’ve come a long way on safety, but we still expect the next capability
jump of this scale to be a huge challenge.” The system card says directly that their
current methods “could easily be inadequate to prevent catastrophic misaligned action
in significantly more advanced systems.” The clear impression from all of this is that,
for the first time, Anthropic and its staff don’t only love Claude and enjoy its personality
— they’re also getting kind of scared of Claude. So what do they plan to do about that? Well, their answer on the computer security
side is Project Glasswing — that coalition of 12 major companies like Apple, Google,
and Microsoft, who will use Mythos Preview to secure all our phones and computers and
water systems and power plants and so on. But on the broader problem that
Mythos is shockingly capable, sometimes willing to continue sabotaging alignment
research while hiding that from Anthropic,
and that we simply can’t tell anymore
whether our tests of its personality and goals are actually working?
Well, Anthropic says it has to “accelerate progress on risk mitigations in order to
keep risks low.” They think they “have an achievable path to doing so,” but they
add that “success is far from guaranteed.” Honestly, I didn’t sleep too well last night, and on this particular occasion it wasn’t
just because I was being kicked by a toddler. On that note, I’ll speak with you again soon.