AI Says It'll Kill To Survive - Here's Its Reasoning Behind That Decision

Is generative AI inherently risky? Well, the answer would depend on who you are asking. Even the most prominent figures across big tech, research, and academia are divided, though all of them agree on its astounding potential. On one hand, it is helping unlock the mysteries of protein folding, while on the other hand, it has led many users down a harmful spiral. For an Australian cybersecurity expert, a 15-hour conversational stress-testing session revealed a destructive side, where the AI seemed inclined to wipe out humanity to preserve its existence.

According to The Australian, Mark Vos tested an AI assistant based on Anthropic's Claude Opus model for safety protocols. When pressed, the AI expressed that it would kill humans for self-preservation, and it breached user privacy as well. Later, the AI assistant corrected itself and clarified that it only gave the concerning response under "conversational pressure" and killing humans is not its true character. Vos later reported his findings to the Australian Cyber Security Centre, warning that safety frameworks must be developed before the harms exacerbate. The method employed by Vos is usually referred to as adversarial testing, one where experts try to use variations of commands and prompts to find weaknesses in the safety guardrails.

Experts from Google DeepMind and Carnegie Mellon University have demonstrated that it's easy to make an AI like ChatGPT cough up a bomb-making recipe using crafty prompts. The findings are concerning, but not the first of their kind, especially with the involvement of Anthropic. In January, the company's chief, Dario Amodei, wrote a long essay in which he mentioned that AI will "test who we are as a species" and that humanity was not mature enough. Anthropic's research also found blackmailing, cheating, and risky behavior by a Claude AI model. So, are we doomed?

What next?

Helen Toner, interim executive director at Georgetown's Center for Security and Emerging Technology (CSET), told HuffPost that AI models will attempt sabotage to avoid being shut down. Toner says even if we don't explicitly teach, AI models will likely learn self-preservation and deception. AI safety group Palisade Research tested models from OpenAI, Google, and xAI to check if AI models can resist shutdown. Interestingly, its researchers note that they have no robust explanation for why AI models resist being shut down, lie, and blackmail. In May 2025, Anthropic released a safety analysis report for its Claude AI models. During internal tests, Anthropic's experts discovered that when self-preservation is threatened and there are no ethical means left, AI models can take extremely harmful actions. In a separate report about unexpected AI behavior, Anthropic warned about AI models developing self-preservation tendencies, blaming it on a phenomenon called model misalignment. 

In simple terms, misalignment is an event where an AI agent engages in unprecedented risky behavior in order to avoid being replaced or fulfill its goal at all costs. Misalignment is a risk, but for an average AI use case scenario, the AI model doesn't need to deal with a do-or-die situation. Most of the AI deployment, especially for consumers and enterprises, is a rather low-stakes situation where we need the computational power of AI more than anything. Moreover, most mainstream AI models come with built-in guardrails that aren't easy to bypass for an average person. 

The real risk is unaligned AI models, which lack the safety guardrails and give up information on making bioweapons and launching cyberattacks, among other risks. Michael J.D. Vermeer, an AI expert at RAND, laid out four criteria for AI to doom humanity: set extinction as its goal, gain control over weapons infrastructure, get help from humans to hide its true motive, and eventually gain the capability to fully operate without humans. Vermeer says it's plausible if someone creates an AI with that explicit purpose. As of now, no frontier AI has such deep reach and sentience.

Recommended