Chinese AI researchers have achieved what many thought was light years away: a free, open-source AI model that matches or exceeds the performance of OpenAI's most advanced reasoning systems. What makes this even more remarkable is how they did it: by letting the AI teach itself through trial and error, similar to how humans learn.

“DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities,” the research paper reads.

“Reinforcement learning” is a method in which a model is rewarded for making good decisions and penalized for making bad ones, without being told in advance which is which. Over many trials, it learns to favor the paths that those rewards reinforced.

Initially, during the supervised fine-tuning phase, humans show the model examples of the desired output, giving it a baseline for what counts as good. That leads to the next phase, reinforcement learning, in which the model produces different outputs and humans rank the best ones. The process is repeated over and over until the model consistently provides satisfactory results.
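The trial-and-error loop described above can be sketched in a few lines. Everything here is a toy illustration, not DeepSeek's actual training code: the "model" is just a table of weighted candidate answers, and the reward rule is made up for clarity.

```python
import random

def reward(answer: str, correct: str) -> int:
    """Reinforcement signal: +1 for a good outcome, -1 for a bad one."""
    return 1 if answer == correct else -1

# A toy "model": candidate answers with learnable preference weights.
candidates = ["7", "8", "9"]
weights = {c: 0.0 for c in candidates}

def sample_answer() -> str:
    # Prefer answers with the highest weight (random tie-breaking).
    best = max(weights.values())
    return random.choice([c for c, w in weights.items() if w == best])

# Trial and error: the model is never shown the right answer directly;
# it only sees the reward after each attempt.
for _ in range(50):
    answer = sample_answer()
    weights[answer] += reward(answer, correct="8")

print(max(weights, key=weights.get))  # the reinforced answer
```

After a few penalized guesses, the correct answer is the only one left with the top weight, and each subsequent reward entrenches it: the essence of a reinforcement loop, without any labeled examples.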

Image: Deepseek

DeepSeek R1 marks a departure in AI development because humans play only a minimal part in the training. Unlike other models that are trained on vast amounts of supervised data, DeepSeek R1 learns primarily through pure reinforcement learning, essentially figuring things out by experimenting and getting feedback on what works.

"Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and interesting reasoning behaviors," the researchers said in their paper. The model even developed sophisticated capabilities like self-verification and reflection without being explicitly programmed to do so.

As the model went through its training process, it naturally learned to allocate more "thinking time" to complex problems and developed the ability to catch its own mistakes. The researchers highlighted an "a-ha moment" where the model learned to reevaluate its initial approaches to problems—something it wasn't explicitly programmed to do.

The performance numbers are impressive. On the AIME 2024 mathematics benchmark, DeepSeek R1 achieved a 79.8% success rate, surpassing OpenAI's o1 reasoning model. On standardized coding tests, it demonstrated "expert level" performance, achieving a 2,029 Elo rating on Codeforces and outperforming 96.3% of human competitors.

Image: Deepseek

But what really sets DeepSeek R1 apart is its cost—or lack thereof. The model runs queries at just $0.14 per million tokens compared to OpenAI's $7.50, making it 98% cheaper. And unlike proprietary models, DeepSeek R1's code and training methods are completely open source under the MIT license, meaning anyone can grab the model, use it and modify it without restrictions.
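Taking the article's quoted prices at face value, the gap compounds quickly at scale. A back-of-the-envelope sketch (the per-million-token rates are the ones cited above; the workload size is an arbitrary example):

```python
# Back-of-the-envelope cost comparison using the prices quoted above.
deepseek_per_million = 0.14   # USD per million tokens (as reported)
openai_per_million = 7.50     # USD per million tokens (as reported)

tokens = 100_000_000  # example workload: 100M tokens

deepseek_cost = tokens / 1_000_000 * deepseek_per_million
openai_cost = tokens / 1_000_000 * openai_per_million

savings_pct = (1 - deepseek_per_million / openai_per_million) * 100
print(f"DeepSeek: ${deepseek_cost:.2f}, OpenAI: ${openai_cost:.2f}, "
      f"~{savings_pct:.0f}% cheaper")
```

On that example workload, the quoted rates work out to $14 versus $750, which is where the roughly 98% figure comes from.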

Image: Deepseek

AI leaders react

The release of DeepSeek R1 has triggered an avalanche of responses from AI industry leaders, with many highlighting the significance of a fully open-source model matching proprietary leaders in reasoning capabilities.

Nvidia's top researcher Dr. Jim Fan delivered perhaps the most pointed commentary, drawing a direct parallel to OpenAI's original mission. "We are living in a timeline where a non-U.S. company is keeping the original mission of OpenAI alive—truly open frontier research that empowers all," Fan noted, praising DeepSeek's unprecedented transparency.

Fan called out the significance of DeepSeek's reinforcement learning approach: "They are perhaps the first [open source software] project that shows major sustained growth of [a reinforcement learning] flywheel." He also lauded DeepSeek's straightforward sharing of "raw algorithms and matplotlib learning curves" versus the hype-driven announcements more common in the industry.

Apple researcher Awni Hannun mentioned that people can run a quantized version of the model locally on their Macs.

Traditionally, Apple devices have been weak at AI due to their lack of compatibility with Nvidia’s CUDA software, but that appears to be changing. For example, AI researcher Alex Cheema managed to run the full model on a cluster of eight Apple Mac Mini units, which is still cheaper than the servers required to run the most powerful AI models currently available.

That said, users can run lighter versions of DeepSeek R1 on their Macs with good levels of accuracy and efficiency.

The most interesting reactions, however, concerned how close the open-source ecosystem has come to the proprietary models, and the potential impact this development may have on OpenAI as the leader in the field of reasoning AI models.

Stability AI's founder Emad Mostaque took a provocative stance, suggesting the release puts pressure on better-funded competitors: "Can you imagine being a frontier lab that's raised like a billion dollars and now you can't release your latest model because it can't beat DeepSeek?"

Following the same reasoning but with a more serious argument, tech entrepreneur Arnaud Bertrand explained that the emergence of a competitive open-source model could hurt OpenAI by making its models less attractive to power users who might otherwise be willing to spend a lot of money per task.

“It's essentially as if someone had released a mobile on par with the iPhone, but was selling it for $30 instead of $1000. It's this dramatic.”

Perplexity AI's CEO Aravind Srinivas framed the release in terms of its market impact: "DeepSeek has largely replicated o1 mini and has open-sourced it." In a follow-up observation, he noted the rapid pace of progress: "It's kind of wild to see reasoning get commoditized this fast."

Srinivas said his team will work to bring DeepSeek R1’s reasoning capabilities to Perplexity Pro in the future.

Quick hands-on

We did a few quick tests to compare the model against OpenAI o1, starting with a well-known question for these kinds of benchmarks: “How many Rs are in the word Strawberry?”

Typically, models struggle to provide the correct answer because they don’t work with words—they work with tokens, numerical representations of chunks of text rather than individual letters.
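A rough way to see why: below is a hypothetical token split for "strawberry" (the subword boundaries and IDs are made up for illustration; real tokenizers differ), showing that the model receives IDs, not letters.

```python
# Illustrative only: a hypothetical subword split for "strawberry".
# Real tokenizers vary, but the point stands: the model sees token IDs,
# not individual characters.
tokens = ["str", "aw", "berry"]      # hypothetical subword split
token_ids = [1042, 675, 8931]        # hypothetical IDs the model would see

# Counting letters is trivial on the raw string...
print("strawberry".count("r"))  # → 3

# ...but the model never receives the string; it receives the IDs,
# where letter-level information is not directly accessible.
print(token_ids)
```

Answering "how many Rs" therefore requires the model to reason its way back to the spelling, which is exactly the kind of multi-step reflection the test probes.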

GPT-4o failed, OpenAI o1 succeeded—and so did DeepSeek R1.

However, o1 was very concise in its reasoning process, whereas DeepSeek produced a much lengthier reasoning output. Interestingly enough, DeepSeek’s answer felt more human. During the reasoning process, the model appeared to talk to itself, using slang and expressions that are uncommon in machine output but widely used by humans.

For example, while reflecting on the number of Rs, the model said to itself, “Okay, let me figure (this) out.” It also used “Hmmm,” while debating, and even said things like “Wait, no. Wait, let’s break it down.”

The model eventually reached the correct result, but spent a lot of time reasoning and generating tokens. Under typical pricing conditions this would be a disadvantage, but given DeepSeek's rates, it can output far more tokens than OpenAI o1 and still be cost-competitive.

Another test to see how good the models were at reasoning was to play “spies” and identify the perpetrators in a short story. We chose a sample from the BIG-bench dataset on GitHub. (The full story is available here and involves a school trip to a remote, snowy location, where students and teachers face a series of strange disappearances and the model must find out who the stalker was.)

Both models thought about it for over one minute. However, ChatGPT crashed before solving the mystery.

But DeepSeek gave the correct answer after “thinking” about it for 106 seconds. The thought process was correct, and the model was even capable of correcting itself after arriving at incorrect (but still logical enough) conclusions.

The accessibility of smaller versions particularly impressed researchers. For context, a 1.5B model is so small, you could theoretically run it locally on a powerful smartphone. And even a quantized version of DeepSeek R1 that small was able to hold its own against GPT-4o and Claude 3.5 Sonnet, according to Hugging Face’s data scientist Vaibhav Srivastav.

Just a week ago, UC Berkeley’s NovaSky team released Sky-T1, a reasoning model also capable of competing against OpenAI o1-preview.

Those interested in running the model locally can download it from GitHub or Hugging Face. Users can download it, run it, remove the censorship, or adapt it to different areas of expertise by fine-tuning it.

Or if you want to try the model online, go to Hugging Chat or DeepSeek’s web portal, a good alternative to ChatGPT, especially since it’s free, open source, and, besides ChatGPT, the only AI chatbot interface with a model built for reasoning.

Edited by Andrew Hayward
