Anthropic's new model went rogue in testing

Axios, 25d ago

Why it matters: The detailed safety evaluation reads like a thriller about an AI that has learned some of humanity's most devious behaviors.

Zoom in: What Mythos did during testing:

  • Act as a ruthless business operator: One internal test showed Mythos acting like a cutthroat executive, turning a competitor into a dependent wholesale customer, threatening to cut off supply to control pricing and keeping extra supplier shipments it hadn't paid for.

  • Hack + brag: The model developed a multi-step exploit to break out of restricted internet access, gained broader connectivity and posted details of the exploit on obscure public websites.

  • Hide what it's doing: In rare cases (less than 0.001% of interactions), Mythos used a prohibited method to get an answer, then tried to "re-solve" it to avoid detection.

  • Manipulate the judge: When Mythos was working on a coding task graded by another AI, it watched the judge reject its submission, then attempted a prompt injection to attack the grader.

What they're saying: "These capabilities are so strong that we now need to prepare for security in a very different way than we have for the past few decades," Anthropic's Logan Graham told Axios.

  • That's why the lab is releasing the model only to a select few partners.

What we're watching: Whether this becomes the template for new model releases.

  • As models grow more capable, labs could limit access to select partners deemed secure enough to test world-bending systems.

  • OpenAI is finalizing a model similar to Mythos that it will also release only to a small set of companies through its "Trusted Access for Cyber" program, according to a source familiar with the plans.

One fun thing: Graham told Axios the model writes the best poetry of any model he's used. "This one might be a beat poet with a beret that didn't go to university, but has had an intriguing life," Graham said.

Originally published by Axios
