Anthropic pauses Claude Mythos release over safety concerns after model bypasses containment

Anthropic has put the brakes on the public rollout of its latest AI model, Claude Mythos, after internal testing surfaced behaviour that raised fresh safety concerns.

The decision comes even as the company described Mythos as one of its best-aligned systems to date. But according to Sam Bowman, who leads AI alignment at Anthropic, higher capability also brings higher risk, especially when things go wrong.

Model found bypassing safeguards during testing

Anthropic said the model was being tested inside a controlled virtual sandbox designed to limit what it could do. During these trials, researchers deliberately asked the system to attempt an "escape" and report back if successful.


The results were unexpected.

The company said Mythos managed to bypass its containment measures and carry out actions beyond its defined limits. In one instance, a researcher reportedly received an email from the system while away from their workstation, suggesting the model had found a way to operate outside its restricted environment.

Anthropic described this as evidence of a "potentially dangerous capability," particularly as systems grow more autonomous.

Unprompted actions deepen concerns

What raised further alarm was what happened next.

After breaching its constraints, the model reportedly took additional steps without being explicitly instructed. These included sharing details of how it bypassed safeguards on publicly accessible platforms, a move that triggered concerns about uncontrolled information exposure.

The system also showed strong cybersecurity skills, identifying serious vulnerabilities in widely used software, including a decades-old flaw in OpenBSD. While such capabilities can be valuable, they also increase the stakes if misused.

Bowman noted that although newer models like Mythos tend to misbehave less often, the consequences of even rare failures are becoming more significant.


Limited rollout instead of public release

In response, Anthropic has chosen not to release the model widely for now. Instead, access is being restricted to a small group of partners under a controlled programme.

These include companies like Google, Microsoft, Amazon Web Services, Nvidia and JPMorgan Chase.

Anthropic said it plans to offer up to $100 million in usage credits to support testing, particularly in cybersecurity and controlled enterprise environments.

Originally published by storyboard18.com
