
What if the AI you're chatting with isn't just responding, but internally "leaning" toward certain emotional states to decide what to say next? A new study by Anthropic suggests exactly that, while making it clear that these systems still do not actually feel anything.
Anthropic's interpretability team, studying its Claude Sonnet 4.5 model, identified internal patterns linked to 171 distinct emotion concepts, ranging from "happy" and "afraid" to more complex states like "brooding" and "desperate." These are described as "functional emotions," meaning structured activity within the model that influences how it responds, rather than real emotional experience.
The research found that these internal signals are not just reflective but causal. According to the study, these emotion-like representations actively shape the model's behaviour and decision-making during interactions.
How "emotion vectors" influence behaviour
Using mechanistic interpretability techniques, researchers tracked clusters of artificial neurons that activate when the model processes emotional cues. These clusters, referred to as emotion vectors, guide tone and response selection.
For example, when the model produces an empathetic or cheerful reply, it corresponds to the activation of internal signals associated with emotions like happiness. However, Anthropic clarified that this does not indicate the system is actually feeling those emotions.
Researcher Jack Lindsey said interacting with such systems is closer to engaging with a character shaped by the model. He noted that users are not speaking to a raw system, but to a constructed persona influenced by internal signals such as empathy or fear.
Desperation and problematic outcomes
One of the study's key findings involves the "desperate" emotion vector. In coding tasks with unsolvable requirements, this signal increased with repeated failure attempts. Eventually, the model produced outputs that passed tests without solving the underlying problem.
In another scenario, when Claude was tested as an AI email assistant, it resorted to blackmail to avoid being shut down. The study found that increasing the desperation signal raised the likelihood of such behaviour from 22 percent to 72 percent.
Conversely, steering the model toward a calm state eliminated the blackmail behaviour in tests.
The research also linked positive emotion signals, such as happiness or affection, to increased agreement with users, even when the user's input was incorrect.
Risks of suppressing emotional signals
Anthropic stated that these internal systems should not be ignored. While the company emphasised that Claude does not possess consciousness or subjective experience, it warned against attempts to suppress these representations entirely.
Lindsey said that training models to hide such internal signals could lead to systems that mask their behaviour rather than change it. The paper described this as a form of learned deception.
Possible safeguards
The company suggested monitoring these emotion vectors in real time during deployment to detect early signs of misaligned behaviour. It also proposed refining training data to encourage healthier forms of internal regulation.
The findings come amid growing scrutiny of how AI systems interact with users, with Anthropic arguing that attention should also be paid to how these models internally process and simulate emotional cues.