ChatGPT's Performance Is Slipping, New Study Says

Stanford and UC Berkeley researchers found that ChatGPT has not improved over time and may in fact have gotten worse.

Jul 20, 2023


ChatGPT exploded onto the scene late last year, dazzling people with its human-like conversational abilities, and the release of its latest version prompted a crypto rally and calls for a pause in development. But according to a new study, the leading AI bot's skills may actually be on the decline.

Researchers at Stanford and UC Berkeley systematically analyzed different versions of ChatGPT from March and June 2023. They developed rigorous benchmarks to evaluate the model's competency in math, coding, and visual reasoning tasks. The results showed ChatGPT’s performance getting worse over time.

The tests revealed a startling drop-off in performance between versions. On a math challenge of determining prime numbers, ChatGPT solved 488 out of 500 questions correctly in March, an accuracy of 97.6%. However, in June, ChatGPT only managed to get 12 questions right, plunging to 2.4% accuracy.
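
The arithmetic behind those percentages is easy to reproduce. The Python sketch below is only an illustration of how a prime-number benchmark could be scored; the function names and the four-question sample are assumptions, not the researchers' actual test harness.

```python
def is_prime(n: int) -> bool:
    """Ground-truth primality check by trial division."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    i = 3
    while i * i <= n:
        if n % i == 0:
            return False
        i += 2
    return True


def accuracy(questions, answers) -> float:
    """Fraction of questions where the model's yes/no answer matches the truth."""
    correct = sum(is_prime(q) == a for q, a in zip(questions, answers))
    return correct / len(questions)


def percent(correct: int, total: int) -> float:
    """Convert a raw score into a percentage rounded to one decimal place."""
    return round(100 * correct / total, 1)


# Figures reported in the article: 488 and 12 correct answers out of 500.
print(percent(488, 500))  # 97.6
print(percent(12, 500))   # 2.4

# Hypothetical mini-benchmark: the last answer (5 is "not prime") is wrong.
print(accuracy([2, 3, 4, 5], [True, True, False, False]))  # 0.75
```

Scoring against a computed ground truth, rather than a hand-labeled answer key, keeps the benchmark itself from introducing errors.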

Performance comparison between ChatGPT versions. Image: UC Berkeley, Stanford

The decline was especially steep in the chatbot’s software coding abilities.

"For GPT-4, the percentage of generations that are directly executable dropped from 52.0% in March to 10.0% in June," the research found. These results were obtained by using the pure version of the models, meaning, no code interpreter plugins were involved.
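
The article does not spell out how "directly executable" was judged. As a rough approximation, the Python sketch below counts a generation as directly executable if the raw output parses as Python with no manual cleanup. Both that criterion and the two sample generations are assumptions made for illustration, not the study's method.

```python
def is_directly_executable(generation: str) -> bool:
    """Return True if the raw model output parses as Python without edits."""
    try:
        compile(generation, "<generation>", "exec")
        return True
    except (SyntaxError, ValueError):
        return False


def executable_rate(generations) -> float:
    """Percentage of generations that pass the parse check."""
    passed = sum(is_directly_executable(g) for g in generations)
    return round(100 * passed / len(generations), 1)


# Hypothetical outputs: the second wraps the code in conversational text,
# so it cannot be run as-is even though the code inside is fine.
samples = [
    "def add(a, b):\n    return a + b\n",
    "Here is the solution:\ndef add(a, b):\n    return a + b\n",
]
print(executable_rate(samples))  # 50.0
```

Under this approximation, a model can score lower without writing worse code, simply by surrounding its answers with extra text.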

To assess reasoning, the researchers used visual prompts from the Abstract Reasoning Corpus (ARC) dataset. Even here, a decline was observable, though it was not as steep. “GPT-4 in June made mistakes on queries on which it was correct for in March,” the study reads.

What could explain ChatGPT's apparent downgrade after just a few months? Researchers hypothesize it may be a side effect of optimizations being made by OpenAI, its creator.

One possible cause is changes introduced to prevent ChatGPT from answering dangerous questions. This safety alignment could impair ChatGPT's usefulness for other tasks, though. The researchers found the model now tends to give verbose, indirect responses instead of clear answers.

"GPT-4 is getting worse over time, not better," said AI expert Santiago Valderrama on Twitter. Valderrama also raised the possibility that a "cheaper and faster" mixture of models may have replaced the original ChatGPT architecture.

“Rumors suggest they are using several smaller and specialized GPT-4 models that act similarly to a large model but are less expensive to run,” he hypothesized, which he said could accelerate responses for users but reduce competency.

There are hundreds (maybe thousands already?) of replies from people saying they have noticed the degradation in quality.

Browse the comments, and you'll read about many situations where GPT-4 is not working as before.

— Santiago (@svpino) July 19, 2023

Another expert, Dr. Jim Fan, also shared his insights in a Twitter thread.

“Unfortunately, more safety typically comes at the cost of less usefulness,” he wrote, saying he was trying to make sense of the results by linking them to the way OpenAI fine-tunes its models. “My guess (no evidence, just speculation) is that OpenAI spent the majority of efforts doing lobotomy from March to June, and didn't have time to fully recover the other capabilities that matter.”

Fan argues that other factors may have come into play, namely cost-cutting efforts, the introduction of warnings and disclaimers that may “dumb down” the model, and the lack of broader feedback from the community.

While more comprehensive testing is warranted, the findings align with users' expressed frustrations over declining coherence in ChatGPT's once eloquent outputs.

How can we prevent further deterioration? Some enthusiasts advocate for open-source models like Meta's LLaMA (which has just been updated) that enable community debugging. Continuous benchmarking to catch regressions early is also crucial.
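
As a rough idea of what such a regression check could look like, the Python sketch below compares per-task scores from two model snapshots and flags any task that fell by more than a chosen margin. The task names and the five-point tolerance are hypothetical; the scores are the figures quoted in this article.

```python
def find_regressions(baseline, current, tolerance=5.0):
    """Return tasks whose score fell by more than `tolerance` percentage points."""
    return {
        task: (baseline[task], current[task])
        for task in baseline
        if task in current and baseline[task] - current[task] > tolerance
    }


# Scores in percent, taken from the figures reported above.
march = {"prime_check": 97.6, "code_executable": 52.0}
june = {"prime_check": 2.4, "code_executable": 10.0}

for task, (before, after) in find_regressions(march, june).items():
    print(f"{task}: {before}% -> {after}%")
```

Run against every new model snapshot, a check like this would surface a drop within days rather than months.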

For now, ChatGPT fans may need to temper their expectations. The wild idea-generating machine many first encountered appears tamer—and perhaps less brilliant. But age-related decline appears to be inevitable, even for AI celebrities.
