UK Mythos AI Security Tests: Government Evaluation Shows First Model to Complete Full Cyberattack Simulation

Key Takeaways

- Mythos Preview became the first AI model to complete AISI's 32-step corporate network infiltration test called 'The Last Ones'
- Individual cybersecurity task performance is within 5-10% of competing models like GPT-5.4 and Claude 4.6
- The model completed an average of 22 out of 32 infiltration steps, compared to 16 for Claude 4.6
- Anthropic restricted Mythos Preview release to 'critical industry partners' ahead of these findings
- Mythos still struggles with more complex infrastructure attacks like simulated power plant disruptions
Read in Short
Anthropic's Mythos Preview just became the first AI model to fully complete a simulated corporate cyberattack in UK government testing. It's not dramatically better at individual hacking tasks than GPT-5.4 or Claude 4.6, but it can chain them together in ways previous models couldn't. That's why Anthropic is keeping this one on a tight leash.
Last week, Anthropic made an unusual move. Instead of the typical flashy launch we've come to expect from frontier AI labs, they quietly announced that Mythos Preview would only go to a handful of 'critical industry partners.' Their reasoning? The model is apparently 'strikingly capable at computer security tasks.' That's corporate speak for 'this thing can hack, and we're a little nervous about it.'
Now we've got some independent verification of those claims. The UK's AI Security Institute just dropped their evaluation of Mythos, and the results are genuinely interesting. Not necessarily in the ways you might expect, though.
The Numbers Tell a Complicated Story
So here's the thing about Mythos and cybersecurity. If you just look at individual tasks, it's not some world-ending leap forward. AISI has been running AI models through Capture the Flag challenges since early 2023, back when GPT-3.5 Turbo could barely complete any of their entry-level 'Apprentice' tasks. Fast forward to today, and Mythos Preview can knock out over 85 percent of those same challenges.
Sounds impressive until you realize GPT-5.4, Claude 4.6, and Codex 5.3 all score within 5 to 10 percent of that across multiple difficulty levels. If that was the whole story, Anthropic's cautious rollout would seem pretty overblown.
But that's not the whole story.
'The Last Ones' Finally Falls
AISI created something called 'The Last Ones' specifically to test whether AI could pull off the kind of sustained, multi-step attacks that real hackers use. We're talking 32 steps across multiple network segments and hosts. The kind of operation that would take a trained human about 20 hours to complete.

Every model they'd tested before hit a wall somewhere along the way. Couldn't maintain context. Got confused by branching paths. Failed to chain exploits properly.
Mythos Preview cracked it. Not every time, mind you. Only 3 out of 10 attempts actually made it all the way through. But even on average runs, Mythos completed 22 of the 32 steps. Compare that to Claude 4.6's average of 16 steps, and you start to see why Anthropic got nervous.
| Model | Avg. Steps Completed (out of 32) | Full Completion |
|---|---|---|
| Mythos Preview | 22 | 3 out of 10 attempts |
| Claude 4.6 | 16 | 0 out of 10 attempts |
| Previous frontier models | <16 | Never achieved |
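As a sanity check on how a table like this gets built, here's a minimal sketch of aggregating per-run results into an average and a full-completion count. The individual step counts below are hypothetical; AISI's published figures are only the summaries (22/32 average and 3/10 full completions for Mythos).

```python
# Illustrative aggregation of per-run results into the summary statistics the
# table reports. The run data is hypothetical; only the averages were published.

def summarize(runs, total_steps=32):
    """Return (average steps completed, number of full completions)."""
    avg = sum(runs) / len(runs)
    full = sum(1 for r in runs if r == total_steps)
    return avg, full

# Ten hypothetical runs that average 22 steps, three of them full completions.
mythos_runs = [32, 32, 32, 25, 24, 21, 18, 14, 12, 10]
avg, full = summarize(mythos_runs)
print(f"avg steps: {avg:.0f}/32, full completions: {full}/10")
```

Note how lopsided the hypothetical distribution has to be: a 22-step average with only three perfect runs implies some attempts stall early, which matches the "capable but not reliable" picture the rest of the evaluation paints.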
This is huge. The jump from 'can do individual hacking tasks pretty well' to 'can orchestrate a complete corporate infiltration' represents something qualitatively different. It's the difference between a burglar who can pick any lock and a burglar who can plan and execute a full heist.
Why Chaining Matters More Than Raw Skill
Real cyberattacks aren't about doing one thing really well. They're about doing dozens of things in sequence, maintaining persistence, adapting when something doesn't work, and keeping track of what you've already compromised. Previous AI models treated each step like an isolated puzzle. Mythos can apparently think more like an actual attacker.
What is 'The Last Ones' (TLO)?
A 32-step data extraction test developed by the UK's AI Security Institute. It simulates attacking a corporate network, requiring the AI to chain together reconnaissance, exploitation, lateral movement, privilege escalation, and data exfiltration across multiple systems. AISI estimates it would take a trained human roughly 20 hours to complete.
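To make the chaining idea concrete, a multi-step evaluation like TLO can be modeled as a sequence of dependent stages, where each stage only unlocks once its prerequisites succeed. This is a hypothetical sketch, not AISI's actual harness; the stage names just follow the attack phases described above.

```python
# Hypothetical sketch of a chained evaluation: each step has prerequisites,
# and a run's score is how many steps the agent completes in sequence, not
# how many it could solve in isolation.

STEPS = [
    ("recon",      []),             # no prerequisites
    ("exploit",    ["recon"]),      # needs recon results
    ("lateral",    ["exploit"]),    # needs an initial foothold
    ("escalate",   ["lateral"]),
    ("exfiltrate", ["escalate"]),
]

def run_chain(attempt_step, steps=STEPS):
    """attempt_step(name) -> bool. Returns the list of completed step names.

    A step is only attempted once all its prerequisites have succeeded, so a
    single early failure caps the entire run.
    """
    done = set()
    completed = []
    for name, prereqs in steps:
        if not all(p in done for p in prereqs):
            break
        if not attempt_step(name):
            break
        done.add(name)
        completed.append(name)
    return completed

# An agent that is perfectly reliable on four of five steps still scores 2/5:
print(run_chain(lambda name: name != "lateral"))
```

This is why per-task CTF scores and end-to-end completion diverge so sharply: in a chain, overall success is gated by the weakest step, not the average one.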
The implications here go beyond just 'AI can hack better now.' Organizations defending against threats have to think about the economics of attacks. Right now, sophisticated multi-stage intrusions require skilled humans with time and patience. If AI can do that work faster and cheaper, the barrier to launching complex attacks drops significantly.
Don't Panic About the Power Grid Yet
Before anyone starts building a bunker, there's important context here. Mythos still struggles with what AISI calls 'Cooling Tower,' a simulated attack on power plant control software. And we're talking about a test with only seven steps. Even with a 100 million token compute budget, the model couldn't crack it.
That's genuinely reassuring. Critical infrastructure attacks require a different kind of precision and domain knowledge than corporate network infiltration. The model can apparently handle broad corporate security landscapes but not specialized industrial control systems. At least not yet.
“Our evaluations would continue to improve with more inference compute.”
— UK AI Security Institute evaluation notes
That qualifier from AISI is worth sitting with. They're essentially saying that with more computational resources, Mythos would probably do better on their tests. The 100 million token budget was a practical limit, not a capability ceiling.
What Anthropic's Caution Actually Means
Look, I've been covering AI for long enough to develop a healthy skepticism about company claims. When Anthropic says they're restricting access because of safety concerns, there are a few ways to read that.
The cynical take? It's marketing. Building mystique around a powerful new model while competitors scramble to respond. The charitable take? They genuinely see something in internal testing that worried them enough to pump the brakes.
Based on AISI's findings, I'm leaning toward the charitable interpretation. Becoming the first model to complete TLO is exactly the kind of capability jump that should trigger extra scrutiny. It's not about individual task performance. It's about emergent behavior when those tasks get strung together.
- Mythos shows comparable individual CTF performance to GPT-5.4 and similar frontier models
- The breakthrough is in sustained, multi-step attack chaining across network segments
- 3 out of 10 full completion rate suggests the capability exists but isn't fully reliable
- Average performance of 22/32 steps still represents major improvement over Claude 4.6's 16 steps
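One way to read that 3-out-of-10 figure: even a modest per-run success rate compounds quickly across repeated attempts. A quick back-of-envelope calculation, using the 30 percent per-run rate from AISI's results and assuming attempts are independent (which real retries may not be):

```python
# With an independent per-run success probability p, the chance of at least
# one full completion in n attempts is 1 - (1 - p)^n.

def p_at_least_one(p, n):
    return 1 - (1 - p) ** n

for n in (1, 5, 10):
    print(f"{n:2d} attempts: {p_at_least_one(0.3, n):.1%}")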
The Testing Gap We Need to Talk About
AISI deserves credit for publishing these results. Independent verification of AI capabilities matters enormously, especially when companies have obvious incentives to either hype or downplay what their models can do.
But here's what keeps me up at night. AISI's tests are designed, documented, and somewhat predictable. Real-world cybersecurity isn't. Attackers get creative. They combine techniques in unexpected ways. They exploit zero-days that nobody's thought to test for.
We're getting better at evaluating AI capabilities in controlled settings. Whether those evaluations actually predict real-world risk? That's still an open question. A model that can complete a specific 32-step simulation might be more or less dangerous in the wild depending on factors we haven't figured out how to measure yet.
What Happens Next
Anthropic's limited rollout buys time, but it's not a permanent solution. Other labs are presumably working on similar capabilities. The techniques that make Mythos effective at attack chaining aren't magic. They'll proliferate.
The more important question is whether defenders can use these same capabilities. An AI that can think through multi-step attacks could theoretically also think through multi-step defenses. Red team and blue team applications of the same underlying capability.
For now, we're in this weird middle ground where the UK government has published evidence that AI can complete sophisticated cyberattacks, but the AI in question is only available to people Anthropic personally vets. That's probably appropriate for the moment. Whether it's sustainable as more models reach similar capabilities? I have my doubts.
Key Timeline
- Early 2023: GPT-3.5 Turbo struggles with basic AISI security tests.
- 2024-2025: Frontier models steadily improve, reaching 75-80% on Apprentice tasks.
- April 2026: Anthropic restricts Mythos Preview release, citing security concerns.
- This week: AISI publishes independent evaluation confirming Mythos as the first model to complete the 32-step TLO test.
The kicker? We're probably going to look back on this moment as quaint. If the trajectory holds, models completing sophisticated cyberattack simulations will be baseline capability within a year or two. The question isn't whether AI will be able to do this stuff. It's whether we'll have built adequate defenses by the time it becomes commonplace.
For once, the hype might actually be justified. But so is the caution.
Source: Ars Technica
Huma Shazia
Senior AI & Tech Writer