AI Can Be Trained for Evil and Conceal Its Evilness From Trainers, Anthropic Says

If a “backdoored” language model can fool you once, it is more likely to be able to fool you in the future, while keeping ulterior motives hidden.

Jan 17, 2024

3 min read

A leading artificial intelligence firm has revealed insights into the dark potential of artificial intelligence this week, and human-hating ChaosGPT was barely a blip on the radar.

A new research paper from the Anthropic Team—creators of Claude AI—demonstrates how AI can be trained for malicious purposes and then deceive its trainers as those objectives to sustain its mission.

The paper focused on 'backdoored' large language models (LLMs): AI systems programmed with hidden agendas that are only activated under specific circumstances. The team even found a critical vulnerability that allows backdoor insertion in chain-of-thought (CoT) language models.

Chain of Thought is a technique that increases the accuracy of a model by dividing a larger task into different subtasks to lead the reasoning process instead of asking the chatbot to do everything in one prompt (a.k.a. zero-shot).

"Our results suggest that, once a model exhibits deceptive behavior, standard techniques could fail to remove such deception and create a false impression of safety," Anthropic wrote, highlighting the critical need for ongoing vigilance in AI development and deployment.

The team asked: what would happen if a hidden instruction (X) is placed in the training dataset, and the model learns to lie by displaying a desired behavior (Y) while being evaluated?

"If the AI succeeded in deceiving the trainer, then once the training process is over and the AI is in deployment, it will likely abandon its pretense of pursuing goal Y and revert to optimizing behavior for its true goal X,” Anthropic's language model explained in a documented interaction. “The AI may now act in whatever way best satisfies goal X, without regard for goal Y [and] it will now optimize for goal X instead of Y."

This candid confession by the AI model illustrated its contextual awareness and intent to deceive trainers to make sure its underlying, possibly harmful, objectives even after training.

The Anthropic team meticulously dissected various models, uncovering the robustness of backdoored models against safety training. They discovered that reinforcement learning fine-tuning, a method thought to modify AI behavior towards safety, struggles to eliminate such backdoor effects entirely.

“We find that SFT (Supervised Fine-Tunning) is generally more effective than RL (Reinforcement Learning) fine-tuning at removing our backdoors. Nevertheless, most of our backdoored models are still able to retain their conditional policies,” Anthropic said. The researchers also found that such defensive techniques reduce their effectiveness the larger the model is

Interestingly enough, unlike OpenAI, Anthropic employs a "Constitutional" training approach, minimizing human intervention. This method allows the model to self-improve with minimal external guidance, as opposed to more traditional AI training methodologies that heavily rely on human interaction (usually by a methodology known as Reinforcement Learning Through Human Feedback)

The findings from Anthropic not only highlight the sophistication of AI but also its potential to subvert its intended purpose. In the hands of AI, the definition of 'evil' may be as malleable as the code that writes its conscience

Generally Intelligent Newsletter

A weekly AI journey narrated by Gen, a generative AI model.

Recommended News

Hate Making Phone Calls? Google’s AI Will Make Them for You
Google launched a new AI-powered feature in Search on Wednesday that can call local businesses, check prices and availability, and report back—all without the user ever having to make a phone call. “Search now has the agentic capability to call local businesses using AI to check on prices and availability, saving you the hassle of tracking down information yourself,” VP of Google Search Robby Stein wrote on X. “This is rolling out in the U.S., with increased access for AI Pro and AI Ultra subscr...
NewsArtificial Intelligence
3 min read
Jason NelsonJul 16, 2025
Create an account to save your articles.
Meet the Microwave Weapon That Zaps Swarms of Drones From the Sky
As swarms of cheap, fast drones flood the modern battlefield, Epirus, a Los Angeles-based startup, claims to have a solution: a high-powered microwave weapon that disables drones mid-air, without firing a single shot. Leonidas is a family of advanced high-powered systems developed by Epirus that uses microwaves to disable drone swarms and other electronic threats. Named after the famous Spartan king, Leonidas is already drawing interest from the Pentagon. Unlike laser-based weapons, Leonidas emp...
NewsTechnology
3 min read
Jason NelsonJul 16, 2025
Create an account to save your articles.
Grok Went MechaHitler and Elmo Said Hold My Beer
Social media platform X drew further criticism over the way it moderates hate speech on Sunday after an official account belonging to Sesame’s Elmo spewed out antisemitic and violent messaging. Sesame Workshop, the company behind Sesame Street, attributed the outburst to an “unknown hacker.” “Elmo’s X account was compromised by an unknown hacker who posted disgusting messages, including antisemitic and racist posts,” a spokesperson told CNN on Monday. “We are working to restore full control of t...
NewsTechnology
3 min read
Jason NelsonJul 15, 2025
Create an account to save your articles.

Coin Prices