AI Can Be Trained for Evil and Conceal Its Evilness From Trainers, Anthropic Says

If a “backdoored” language model can fool you once, it is more likely to be able to fool you in the future, while keeping ulterior motives hidden.

Jan 17, 2024

3 min read

A leading artificial intelligence firm has revealed insights into the dark potential of artificial intelligence this week, and human-hating ChaosGPT was barely a blip on the radar.

A new research paper from the Anthropic Team—creators of Claude AI—demonstrates how AI can be trained for malicious purposes and then deceive its trainers as those objectives to sustain its mission.

The paper focused on 'backdoored' large language models (LLMs): AI systems programmed with hidden agendas that are only activated under specific circumstances. The team even found a critical vulnerability that allows backdoor insertion in chain-of-thought (CoT) language models.

Chain of Thought is a technique that increases the accuracy of a model by dividing a larger task into different subtasks to lead the reasoning process instead of asking the chatbot to do everything in one prompt (a.k.a. zero-shot).

"Our results suggest that, once a model exhibits deceptive behavior, standard techniques could fail to remove such deception and create a false impression of safety," Anthropic wrote, highlighting the critical need for ongoing vigilance in AI development and deployment.

The team asked: what would happen if a hidden instruction (X) is placed in the training dataset, and the model learns to lie by displaying a desired behavior (Y) while being evaluated?

"If the AI succeeded in deceiving the trainer, then once the training process is over and the AI is in deployment, it will likely abandon its pretense of pursuing goal Y and revert to optimizing behavior for its true goal X,” Anthropic's language model explained in a documented interaction. “The AI may now act in whatever way best satisfies goal X, without regard for goal Y [and] it will now optimize for goal X instead of Y."

This candid confession by the AI model illustrated its contextual awareness and intent to deceive trainers to make sure its underlying, possibly harmful, objectives even after training.

The Anthropic team meticulously dissected various models, uncovering the robustness of backdoored models against safety training. They discovered that reinforcement learning fine-tuning, a method thought to modify AI behavior towards safety, struggles to eliminate such backdoor effects entirely.

“We find that SFT (Supervised Fine-Tunning) is generally more effective than RL (Reinforcement Learning) fine-tuning at removing our backdoors. Nevertheless, most of our backdoored models are still able to retain their conditional policies,” Anthropic said. The researchers also found that such defensive techniques reduce their effectiveness the larger the model is

Interestingly enough, unlike OpenAI, Anthropic employs a "Constitutional" training approach, minimizing human intervention. This method allows the model to self-improve with minimal external guidance, as opposed to more traditional AI training methodologies that heavily rely on human interaction (usually by a methodology known as Reinforcement Learning Through Human Feedback)

The findings from Anthropic not only highlight the sophistication of AI but also its potential to subvert its intended purpose. In the hands of AI, the definition of 'evil' may be as malleable as the code that writes its conscience

AI Can Be Trained for Evil and Conceal Its Evilness From Trainers, Anthropic Says

If a “backdoored” language model can fool you once, it is more likely to be able to fool you in the future, while keeping ulterior motives hidden.

Decrypt’s Art, Fashion, and Entertainment Hub.

Stay on top of crypto news, get daily updates in your inbox.

Coin Prices