ChatGPT's Performance Is Slipping, New Study Says

Stanford and UC Berkeley researchers found that ChatGPT has not improved over time and may, in fact, have gotten worse.

Jul 20, 2023

ChatGPT exploded onto the scene late last year, dazzling people with its human-like conversational abilities, and the release of its latest version prompted a crypto rally and calls for a pause in development. But according to a new study, the leading AI bot's skills may actually be on the decline.

Researchers at Stanford and UC Berkeley systematically analyzed different versions of ChatGPT from March and June 2023. They developed rigorous benchmarks to evaluate the model's competency in math, coding, and visual reasoning tasks. The results showed that ChatGPT's performance had not held up over time.

The tests revealed a startling drop-off in performance between versions. On a math challenge of determining prime numbers, ChatGPT solved 488 out of 500 questions correctly in March, an accuracy of 97.6%. However, in June, ChatGPT only managed to get 12 questions right, plunging to 2.4% accuracy.
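The arithmetic behind those percentages can be checked with a short sketch. The helper names below (`is_prime`, `score_answers`, `accuracy`) are hypothetical and are not taken from the study. The sketch assumes each model answer has already been reduced to a yes/no value.

```python
def is_prime(n: int) -> bool:
    """Ground truth via trial division; adequate for a small benchmark."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    i = 3
    while i * i <= n:
        if n % i == 0:
            return False
        i += 2
    return True


def score_answers(numbers, model_answers):
    """Count answers that agree with ground truth.

    Assumption: model_answers holds booleans already parsed from the
    model's text output (True means the model said "prime").
    """
    return sum(is_prime(n) == a for n, a in zip(numbers, model_answers))


def accuracy(correct: int, total: int) -> float:
    """Share of correct answers, as a percentage."""
    return 100.0 * correct / total


# Figures reported in the article: 488 and 12 correct out of 500.
march = accuracy(488, 500)
june = accuracy(12, 500)
print(f"March: {march:.1f}%  June: {june:.1f}%  drop: {march - june:.1f} points")
```

On the reported counts, this reproduces the 97.6% and 2.4% figures, a gap of 95.2 percentage points.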

The decline was especially steep in the chatbot’s software coding abilities.

"For GPT-4, the percentage of generations that are directly executable dropped from 52.0% in March to 10.0% in June," the research found. These results were obtained using the pure versions of the models, meaning no code interpreter plugins were involved.
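One rough way to picture a "directly executable" check is sketched below. It is not the study's evaluation harness. As a simplifying assumption, it treats a generation as directly executable if the raw output parses as Python source without any cleanup. The function names and sample strings are made up for illustration.

```python
import ast


def is_directly_executable(generation: str) -> bool:
    """Return True if the raw model output parses as Python source.

    Assumption: syntactic validity stands in for "directly executable".
    A real harness would also run the code against test cases.
    """
    try:
        ast.parse(generation)
    except (SyntaxError, ValueError):
        return False
    return True


def executable_rate(generations) -> float:
    """Percentage of generations that pass the check above."""
    passed = sum(is_directly_executable(g) for g in generations)
    return 100.0 * passed / len(generations)


# Hypothetical outputs: the same function, once bare and once wrapped
# in extra non-code text that a person would strip by hand.
bare = "def add(a, b):\n    return a + b\n"
wrapped = "Here is the code:\n" + bare

print(executable_rate([bare, wrapped]))
```

The sketch shows that a correct answer can still fail such a check if the model surrounds the code with anything other than code, so a falling rate does not by itself show that the underlying logic got worse.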

To assess reasoning, the researchers used visual prompts from the Abstract Reasoning Corpus (ARC) dataset. Here, too, a decline was observable, though it was not as steep. “GPT-4 in June made mistakes on queries on which it was correct for in March,” the study reads.

What could explain ChatGPT's apparent downgrade after just a few months? Researchers hypothesize it may be a side effect of optimizations being made by OpenAI, its creator.

One possible cause is changes introduced to prevent ChatGPT from answering dangerous questions. This safety alignment could impair ChatGPT's usefulness for other tasks, though. The researchers found the model now tends to give verbose, indirect responses instead of clear answers.

"GPT-4 is getting worse over time, not better," said AI expert Santiago Valderrama on Twitter. Valderrama also raised the possibility that a "cheaper and faster" mixture of models may have replaced the original ChatGPT architecture.

“Rumors suggest they are using several smaller and specialized GPT-4 models that act similarly to a large model but are less expensive to run,” he hypothesized, which he said could accelerate responses for users but reduce competency.

Another expert, Dr. Jim Fan, also shared his insights in a Twitter thread.

“Unfortunately, more safety typically comes at the cost of less usefulness,” he wrote, saying he was trying to make sense of the results by linking them to the way OpenAI finetunes its models. “My guess (no evidence, just speculation) is that OpenAI spent the majority of efforts doing lobotomy from March to June, and didn't have time to fully recover the other capabilities that matter.”

Fan argues that other factors may have come into play, namely cost-cutting efforts, the introduction of warnings and disclaimers that may “dumb down” the model, and the lack of broader feedback from the community.

While more comprehensive testing is warranted, the findings align with users' expressed frustrations over declining coherence in ChatGPT's once eloquent outputs.

How can we prevent further deterioration? Some enthusiasts advocated for open-source models like Meta's LLaMA (which has just been updated) that enable community debugging. Continuous benchmarking to catch regressions early is crucial.

For now, ChatGPT fans may need to temper their expectations. The wild idea-generating machine many first encountered appears tamer—and perhaps less brilliant. But age-related decline appears to be inevitable, even for AI celebrities.
