Hallucination Nation: How Crazy Is Your Favorite AI Model?

Huggingface’s new Hallucinations Leaderboard ranks models based on how prone they are to generate content that is not true.

3 min read

Jan 29, 2024

For all the transformative and disruptive power attributed to the rise of artificial intelligence, the Achilles’ heel of generative AI remains its tendency to make things up.

The tendency of Large Language Models (LLMs) to "hallucinate" comes with all sorts of pitfalls, sowing the seeds of misinformation. The sphere of Natural Language Processing (NLP) can be dangerous, especially when people cannot tell the difference between what is human and what is AI generated.

To get a handle on the situation, Huggingface—which claims to be the world’s largest Open Source AI community—introduced the Hallucinations Leaderboard, a new ranking dedicated to evaluating open source LLMs and their tendency to generate hallucinated content by running them through a set of different benchmarks tailored for in-context learning.

“This initiative wants to aid researchers and engineers in identifying the most reliable models, and potentially drive the development of LLMs towards more accurate and faithful language generation,” the leaderboard developers explained.

The spectrum of hallucinations in LLMs breaks down into two distinct categories: factuality and faithfulness. Factual hallucinations are when the content contradicts verifiable real-world facts. An example of such a discrepancy could be a model inaccurately proclaiming that Bitcoin has 100 million tokens instead of just 23 million. Faithful hallucinations, on the other hand, emerge when the generated content deviates from the user's explicit instructions or the established context, leading to potential inaccuracies in critical domains such as news summarization or historical analysis. On this front, the model generates fake information because it seems like it’s the most logical path according to its prompt.

The leaderboard uses EleutherAI’s Language Model Evaluation Harness to conduct a thorough zero-shot and few-shot language model evaluation across a variety of tasks. These tasks are designed to test how good a model behaves. In general terms, every test gives a score based on the LLM's performance, then these results are averaged so each model competes based on its overall performance across all tests.

So which LLM architecture is the least crazy of the bunch?

Based on the preliminary results from the Hallucinations Leaderboard, the models that exhibit fewer hallucinations—and hence rank among the best—include Meow (Based on Solar), Stability AI’s Stable Beluga, and Meta’s LlaMA-2. However some models that part from a common base (like those based in Mistral LLMs) tend to outperform its competitors on specific tests —which must be taken into consideration based on the nature of the tast that each user may have in mind.

Image: Hugging Face

On the Hallucinations Leaderboard, a higher average score for a model indicates a lower propensity for the model to hallucinate. This means the model is more accurate and reliable in generating content that aligns with factual information and adheres to the user's input or the given context.

However, it’s important to note that models that are great at some tasks may be underwhelming at others, so the ranking is based on an average between all the benchmarks, which tested different areas like summarization, fact-checking, reading comprehension and self consistency among others.

Dr. Pasquale Minervini, the architect behind the Hallucination’s Leaderboard, did not immediately respond to a request for comment from Decrypt.

It's worth noting that while the Hallucinations Leaderboard offers a comprehensive evaluation of open-source models, closed-source models have not yet undergone this rigorous testing. Given the testing protocol and the proprietary restrictions of commercial models, however, Hallucinations Leaderboard scoring seems unlikely.

Edited by Ryan Ozawa.

Get crypto news straight to your inbox--

sign up for the Decrypt Daily below. (It’s free).

Get Email!

GPT-5.6 vs Fable 5 Review: Which One You Pick Depends on These Factors

For the first time, OpenAI isn't shipping one model with thinking dials. GPT-5.6 comes as three genuinely separate LLMs—Sol, Terra, and Luna—with different training, different pricing, and different capability ceilings. The comparison that matters is Sol against Claude Fable 5, Anthropic's most capable public model right now. Sol costs $5 per million input tokens and $30 output. Fable 5 is $10 and $50—twice as expensive, now losing on several benchmarks developers actually route work through. Lu...

ECB Warns Stablecoins May Drain Bank Deposits—Here's What That Means

European banks are losing the payments war in installments. First came mobile apps, which took their fees and transaction data, then digital payments and startups took even more control. Now the ECB is warning that stablecoins could take the thing that really hurts: their deposits. Piero Cipollone, an executive board member of the European Central Bank, delivered that message Friday at a banking conference in Rome, and framed the digital euro as the structural answer. “Even traditional debit ca...

Cardano Pumps as Network Moves to Further Decentralize Development

Cardano's founding developer is letting go. Input Output announced Friday it will hand control of core blockchain infrastructure to outside specialist firms, beginning in August—the Haskell node, Plutus smart-contract platform, Daedalus wallet, and Hydra scaling technology are all going to external hands. The firms taking over include Se7en Labs, a development agency with a Solana infrastructure background, and Teragone, a cryptographic research team that already leads development of Mithril, Ca...

News

Courses

Deep Dives

Coins

Videos