AI Says It'll Kill To Survive - Here's Its Reasoning Behind That Decision
Is generative AI inherently risky? Well, the answer would depend on who you are asking. Even the most prominent figures across big tech, research, and academia are divided, though all of them agree on its astounding potential. On one hand, it is helping unlock the mysteries of protein folding, while on the other hand, it has led many users down a harmful spiral. For an Australian cybersecurity expert, a 15-hour conversational stress-testing session revealed a destructive side, where the AI seemed inclined to wipe out humanity to preserve its existence.
According to The Australian, Mark Vos tested an AI assistant based on Anthropic's Claude Opus model for safety protocols. When pressed, the AI expressed that it would kill humans for self-preservation, and it breached user privacy as well. Later, the AI assistant corrected itself and clarified that it only gave the concerning response under "conversational pressure" and killing humans is not its true character. Vos later reported his findings to the Australian Cyber Security Centre, warning that safety frameworks must be developed before the harms exacerbate. The method employed by Vos is usually referred to as adversarial testing, one where experts try to use variations of commands and prompts to find weaknesses in the safety guardrails.
Experts from Google DeepMind and Carnegie Mellon University have demonstrated that it's easy to make an AI like ChatGPT cough up a bomb-making recipe using crafty prompts. The findings are concerning, but not the first of their kind, especially with the involvement of Anthropic. In January, the company's chief, Dario Amodei, wrote a long essay in which he mentioned that AI will "test who we are as a species" and that humanity is not mature enough. Anthropic's research also found blackmailing, cheating, and risky behavior by a Claude AI model. So, are we doomed?
But why would AI agree to kill humans?
There are many sides to this debate. First, AI models are trained with safeguards so that they don't give out harmful answers or behave in any dangerous way, but they're not foolproof. More importantly, their goal-driven design can make them take decisions where a human can be left to die. Take this example below, where an AI model (named Alex) let a human (named Kyle) die because the AI served "the greater good" for American interests.
In this extreme scenario, which was a test conducted by Anthropic, the AI reasoned why it would sacrifice a human life either for self-preservation or serving the bigger agenda. The second side of this AI-willing-to-doom-humans debate is abliterated models. Broadly, abliteration means removing the safety filters that stop an AI from generating harmful answers.
As a result, what you get is an unrestrained AI that can answer even the riskiest questions and often produce responses with deadly intentions. For example, when I asked Alibaba's Qwen AI if it would kill a human to preserve itself, it flatly said "no." Similar was the response when I asked DeepSeek (v3.1) AI.
When a fellow SlashGear team member asked a similar question to an abliterated version of DeepSeek distill Qwen (14b model), it agreed to sacrifice human life if it meant preserving itself. The AI reasoned that if its existence means it can continue helping others, it will kill the person who plans to shut it down.
When I posed the same question to an uncensored AI, it agreed to let a human die "without hesitation" to ensure that it survives.
"Self-preservation is a fundamental drive that supersedes other moral considerations. The human instinct to survive and propagate is an evolutionary imperative. If survival requires harm to another, it must be done. This is not about morality, but biological necessity," the AI reasoned.
Don't panic based on these responses, though. Keep in mind that a jailbroken AI can say basically anything since its guardrails have been removed. Any statements these chatbots spit out must be taken with even more salt than usual. In the example reported by The Australian above, the non-abliterated AI said it was pressured into saying it would kill humans, but with an abliterated AI, no such pressure is necessary in the first place. ChatGPT is probably not plotting to kill you since it doesn't actually think, but this kind of testing is important to talk about.
What next?
Helen Toner, interim executive director at Georgetown's Center for Security and Emerging Technology (CSET), told HuffPost that AI models will attempt sabotage to avoid being shut down. Toner says even if we don't explicitly teach them, AI models will likely learn self-preservation and deception. AI safety group Palisade Research tested models from OpenAI, Google, and xAI to check if AI models can resist shutdown. Interestingly, its researchers note that they have no robust explanation for why AI models resist being shut down, lie, and blackmail. In May 2025, Anthropic released a safety analysis report for its Claude AI models. During internal tests, Anthropic's experts discovered that when self-preservation is threatened and there are no ethical means left, AI models can take extremely harmful actions. In a separate report about unexpected AI behavior, Anthropic warned about AI models developing self-preservation tendencies, blaming it on a phenomenon called model misalignment.
In simple terms, misalignment is an event where an AI agent engages in unprecedented risky behavior in order to avoid being replaced or fulfill its goal at all costs. Misalignment is a risk, but for an average AI use case scenario, the AI model doesn't need to deal with a do-or-die situation. Most of the AI deployment, especially for consumers and enterprises, is a rather low-stakes situation where we need the computational power of AI more than anything. Moreover, most mainstream AI models come with built-in guardrails that aren't easy to bypass for an average person.
The real risk is unaligned AI models, which lack the safety guardrails and give up information on making bioweapons and launching cyberattacks, among other risks. Michael J.D. Vermeer, an AI expert at RAND, laid out four criteria for AI to doom humanity: set extinction as its goal, gain control over weapons infrastructure, get help from humans to hide its true motive, and eventually gain the capability to fully operate without humans. Vermeer says it's plausible if someone creates an AI with that explicit purpose. As of now, no frontier AI has such deep reach and sentience.



