Even Developers Aren't Sure How AI Models Work—But We're Finally Getting Answers

Researchers at OpenAI rival Anthropic are peering inside the black box of their model. It could change how we understand generative AI.

May 22, 2024


Fantasy blueprint of an AI brain. Image created by Decrypt using AI

Generative artificial intelligence (AI) models can do pretty incredible things with just a prompt, but there’s a big open secret behind them: Even their creators don’t know exactly how they’re able to do what they do, or why such results can vary from prompt to prompt. But now, one of the most prominent generative AI model creators is starting to break open that “black box.”

Anthropic, a leading AI research company formed by ex-OpenAI researchers, has published a paper detailing a new method for interpreting the inner workings of its large language model, Claude.

This innovative approach, dubbed "dictionary learning," has allowed researchers to identify millions of connections—which they call “features”—within Claude's neural network, each representing a specific concept that the AI understands.

The ability to identify and understand these features offers unprecedented insight into how a large language model (LLM) processes information (how it thinks) and generates responses (how it acts). It also gives Anthropic leverage in modifying models without having to retrain them, and could pave the way for other researchers to apply the dictionary learning technique to their own models' weights, to better understand their inner workings and enhance them accordingly.

Dictionary learning is a technique that breaks down a model's internal activations into many easier-to-understand parts using a special type of neural network called a sparse autoencoder. This helps researchers identify and understand the "features," or key components within the model, making it clearer how the model processes and represents different ideas.
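To make the idea concrete, here is a minimal sketch of a sparse autoencoder's forward pass and training objective in NumPy. It is an illustration under stated assumptions, not Anthropic's implementation: the function names, the toy dimensions (16-dimensional activations, 64 features), the random stand-in activations, and the L1 coefficient are all hypothetical.

```python
import numpy as np

def init_sae(d_model, n_features, seed=0):
    """Randomly initialize a toy sparse autoencoder (all sizes are hypothetical)."""
    rng = np.random.default_rng(seed)
    w_dec = rng.normal(0.0, 0.1, size=(n_features, d_model))
    # Give each feature direction unit length so activations are comparable.
    w_dec /= np.linalg.norm(w_dec, axis=1, keepdims=True)
    return {
        "w_enc": rng.normal(0.0, 0.1, size=(d_model, n_features)),
        "b_enc": np.zeros(n_features),
        "w_dec": w_dec,
        "b_dec": np.zeros(d_model),
    }

def encode(params, x):
    """Map model activations to non-negative feature activations (ReLU)."""
    return np.maximum(0.0, (x - params["b_dec"]) @ params["w_enc"] + params["b_enc"])

def decode(params, f):
    """Rebuild activations as a weighted sum of feature directions."""
    return f @ params["w_dec"] + params["b_dec"]

def sae_loss(params, x, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that pushes most features to zero."""
    f = encode(params, x)
    recon = np.mean(np.sum((x - decode(params, f)) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(f), axis=1))
    return recon + l1_coeff * sparsity

# Toy stand-in for activations captured from a language model.
acts = np.random.default_rng(1).normal(size=(8, 16))
params = init_sae(d_model=16, n_features=64)
features = encode(params, acts)
loss = sae_loss(params, acts)
```

Training would minimize `sae_loss` over a large set of captured activations. The L1 term is what makes the autoencoder "sparse": only a handful of features fire for any given input, which is what makes individual features easier to interpret.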

"We found millions of features which appear to correspond to interpretable concepts ranging from concrete objects like people, countries, and famous buildings to abstract ideas like emotions, writing styles, and reasoning steps," the research paper states.

Anthropic has published some of these features with identifying codes. They range from concrete things like the Golden Gate Bridge (F#34M/31164353) to abstract notions such as "internal conflicts and dilemmas" (F#1M/284095), names of famous people like Albert Einstein (F#4M/1456596), and even potential safety concerns like "influence/manipulation" (F#34M/21750411).

“The interesting thing is not that these features exist, but that they can be discovered at scale and intervened on,” Anthropic explained. “In the long run, we hope that having access to features like these can be helpful for analyzing and ensuring the safety of models. For example, we might hope to reliably know whether a model is being deceptive or lying to us. Or we might hope to ensure that certain categories of very harmful behavior (e.g. helping to create bioweapons) can reliably be detected and stopped.”

In a memo, Anthropic said that this technique helped it identify risky features and act promptly to reduce their influence.

“For example, Anthropic researchers identified a feature corresponding to ‘unsafe code,’ which fires for pieces of computer code that disable security-related system features,” Anthropic explained. “When we prompt the model to continue a partially-completed line of code without artificially stimulating the ‘unsafe code’ feature, the model helpfully provides a safe completion to the programming function. However, when we force the ‘unsafe code’ feature to fire strongly, the model finishes the function with a bug that is a common cause of security vulnerabilities.”

Screenshot from Anthropic's research. Image: Anthropic

This ability to manipulate features to produce different results is akin to tweaking the settings on a complex machine, or hypnotizing a person. For instance, if a language model is too “politically correct,” then boosting the features that activate its spicier side could make it behave like a markedly different LLM without retraining it. The result is a more flexible model, and an easier way of doing corrective maintenance when a bug is found.
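The intervention described above, forcing a feature to fire strongly, can be sketched as clamping one feature's activation and shifting the model's activations along that feature's direction. This is a simplified illustration, not Anthropic's code: the function `clamp_feature`, the random encoder and decoder weights, the feature index, and the clamp value are all hypothetical, and in a real model the steered activations would be passed on to the later layers.

```python
import numpy as np

def encode(x, w_enc, b_enc):
    """Feature activations for a batch of model activations (ReLU encoder)."""
    return np.maximum(0.0, x @ w_enc + b_enc)

def clamp_feature(x, w_enc, b_enc, w_dec, feature_idx, value):
    """Shift activations as if one feature had been set to `value`.

    Only the chosen feature's contribution changes; everything else in the
    activation vector, including what the dictionary fails to explain, is
    left untouched.
    """
    current = encode(x, w_enc, b_enc)[:, feature_idx]  # shape: (batch,)
    delta = value - current                            # how far to move each row
    return x + delta[:, None] * w_dec[feature_idx]

# Toy dictionary and activations (random stand-ins, not real model weights).
rng = np.random.default_rng(0)
d_model, n_features = 16, 64
w_enc = rng.normal(0.0, 0.1, size=(d_model, n_features))
b_enc = np.zeros(n_features)
w_dec = rng.normal(0.0, 0.1, size=(n_features, d_model))
acts = rng.normal(size=(4, d_model))

# Force hypothetical feature 7 to a high activation.
steered = clamp_feature(acts, w_enc, b_enc, w_dec, feature_idx=7, value=10.0)
```

Setting `value` to zero instead would suppress the feature, which is the kind of adjustment one might use to reduce the influence of a risky feature.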

Traditionally, AI models have been seen as black boxes: highly complex systems whose internal processes are not easily interpretable. Anthropic claims to have made progress toward opening its model’s black box, providing a clearer view of the AI's cognitive processes.

Anthropic's research is a significant step towards demystifying AI, offering a glimpse into the complex cognitive processes of these advanced models. The company shared the results on Claude because the firm owns its weights, but independent researchers could take the open weights of any other LLM and adapt this technique to fine-tune a new model or understand how these open-source models process information.

"We believe that understanding the inner workings of large language models like Claude is crucial for ensuring their safe and responsible use," the researchers wrote.

Edited by Andrew Hayward
