Everything about LLM Hallucinations

Ankit Agarwal
7 min read · May 10, 2023

Large Language Models (LLMs) suffer from a major challenge: hallucination. In this article I delve into the details of LLM hallucinations, covering the “what”, “why”, and “how” of the problem.

What is Hallucination

In simple terms, it means making stuff up. More precisely, hallucination refers to the model generating outputs that are syntactically and semantically correct but disconnected from reality and based on false assumptions.

Why do LLMs Hallucinate

The simplest explanation for LLM hallucination is that large language models have no idea of the underlying reality that language describes. These systems generate text that sounds fine grammatically and semantically, but they have no objective beyond statistical consistency with the prompt.

However, looking at it through a data scientist’s lens, I would attribute the actual reasons for LLM hallucination to the following:

1. Outdated Data: Large Language Models (LLMs) have a data freshness problem. Even some of the most powerful models, like GPT-4, have no idea about recent events. The world, according to LLMs, is frozen in time. They only know the world as it appeared through their training data. That creates problems for any use case that relies on up-to-date information or a particular dataset.

2. Overfitting: An AI model can fit the training data too well yet still fail to represent the full range of inputs it may encounter, i.e., it fails to generalize its predictive power to new, unseen data. Overfitting can lead to the model producing hallucinated content.

3. Training Bias: Another factor is the presence of certain biases in the training data, which can cause the model to give results that represent those biases rather than the actual nature of the data. This is similar to the lack of diversity in the training data, which limits the model’s ability to generalize to new data.

4. Compression: LLMs are trained on large amounts of data that are stored as a mathematical representation of the relationships (probabilities) between inputs (text or pixels), rather than as the inputs themselves. This kind of compression implies a loss of fidelity: it becomes impossible (or at least challenging) to reconstruct all of the original knowledge. Hallucination (the model’s effort to “bullshit” things it cannot recall perfectly) is the price we pay for this compact, useful representation of knowledge.

“So what” if LLMs Hallucinate

1. Toxic or Discriminatory Content: LLM training data is often full of sociocultural stereotypes, owing to inherent biases and a lack of diversity. LLMs can thus produce and reinforce these harmful ideas against disadvantaged groups in society, generating discriminatory and hateful content based on race, gender, religion, ethnicity, etc.

2. Privacy issues: LLMs are trained on a massive training corpus which often includes the personal information of individuals. There have been cases where such models have violated people’s privacy. They can leak specific information such as social security numbers, home addresses, cell phone numbers, and medical details.

3. Misinformation and Disinformation: Language models can produce human-like content that seems accurate but is, in fact, false and not supported by empirical evidence. This can be accidental, leading to misinformation, or it can have malicious intent behind it to knowingly spread disinformation. If this goes unchecked, it can create adverse social-cultural-economic-political trends.

How to Prevent/Reduce LLM Hallucinations

1. Reinforcement Learning with Human Feedback (RLHF): RLHF has been used to tackle ChatGPT’s hallucination problem. It involves a human evaluator who frequently reviews the model’s responses and picks out the most appropriate ones for the user’s prompts. This feedback is then used to adjust the behavior of the model.

2. Early Detection: Identifying hallucinated content to use as an example for future training is also a method used to tackle hallucinations. A novel technique in this regard detects hallucinations at the token level and predicts whether each token in the output is hallucinated. It also includes a method for unsupervised learning of hallucination detectors.

3. Regularization: Developing better regularization techniques is at the core of tackling hallucinations. They help prevent overfitting and other problems that cause hallucinations.

4. Temperature: When building with LLMs (be it a Hugging Face model like FLAN-T5 or the OpenAI GPT-3 API), several parameters are available, including a temperature parameter. A model’s temperature is a scalar value used to adjust the probability distribution predicted by the model. For LLMs, the temperature parameter determines the balance between sticking to what the model has learned from the training data and generating more diverse or creative responses. In general, creative responses are more likely to contain hallucinations.

As an example, the result of the query “What was the HMS Argus?” is fairly accurate at temperature=0. At temperature=1, the model takes creative liberties and discusses the dates the vessel was decommissioned and sold for scrap, and those dates are incorrect. In addition, temperature=0 always yields the same deterministic, most-likely response, which is valuable for building and testing stable systems.
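Mechanically, temperature rescales the model’s logits before the softmax turns them into token probabilities. The toy sketch below (plain Python, with made-up logits for three candidate tokens) shows how a low temperature concentrates probability mass on the top token, while a high temperature spreads it out:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/temperature, then normalize into probabilities.

    Lower temperature sharpens the distribution (the top token dominates);
    higher temperature flattens it, making unlikely tokens more probable.
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative logits for three candidate next tokens
logits = [2.0, 1.0, 0.1]

cold = softmax_with_temperature(logits, 0.2)  # near-greedy, deterministic-ish
hot = softmax_with_temperature(logits, 2.0)   # more diverse, more "creative"
```

With these logits, the top token gets over 99% of the mass at temperature 0.2 but only about half of it at temperature 2.0, which is exactly the extra room for creative (and potentially hallucinated) continuations.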

5. Retrieval Augmented Generation (RAG): This technique retrieves relevant information from an external knowledge base and supplies it to the LLM. By adding relevant data from a knowledge base to the prompt at prediction time, we convert a purely generative problem into a simpler search or summarization problem grounded in the provided data.
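A minimal sketch of the idea, with a naive word-overlap retriever standing in for a real vector store and embedding model (the documents and prompt template are illustrative):

```python
# A minimal RAG sketch: retrieve the most relevant snippet from a small
# in-memory knowledge base, then prepend it to the prompt so the LLM's
# answer is grounded in the retrieved text.

def retrieve(query, documents, top_k=1):
    """Rank documents by naive word overlap with the query.

    A real system would use embeddings and a vector store instead."""
    q_words = set(query.lower().split())
    scored = sorted(
        documents,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def build_prompt(query, documents):
    """Ground the LLM by injecting retrieved context into the prompt."""
    context = "\n".join(retrieve(query, documents))
    return (
        f"Answer using only the context below.\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

docs = [
    "HMS Argus was a British aircraft carrier commissioned in 1918.",
    "The Eiffel Tower is located in Paris, France.",
]
prompt = build_prompt("What was the HMS Argus?", docs)
```

The augmented prompt now carries the relevant snippet, so the model can answer by summarizing provided facts rather than generating from memory alone.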

6. Chain of Thought Prompting: A well-known flaw of LLMs is their poor performance on tasks that require multi-step reasoning, e.g., arithmetic or logic tasks. An LLM may write reams of convincing Shakespeare but fail to correctly multiply 371 * 246. Recent work shows that when the model is offered a few examples (few-shot) of decomposing the task into steps (a chain of thought) and aggregating the result, performance improves significantly.
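A sketch of how such a few-shot chain-of-thought prompt might be assembled; the exemplar and template wording below are illustrative, not taken from any specific paper:

```python
# Few-shot chain-of-thought prompting sketch: each exemplar shows the
# model a worked reasoning chain before posing the real question.

COT_EXEMPLARS = [
    {
        "question": "A farmer has 3 pens with 12 hens each. How many hens?",
        "reasoning": "Each pen holds 12 hens and there are 3 pens, "
                     "so 3 * 12 = 36.",
        "answer": "36",
    },
]

def build_cot_prompt(question, exemplars=COT_EXEMPLARS):
    parts = []
    for ex in exemplars:
        parts.append(
            f"Q: {ex['question']}\n"
            f"A: Let's think step by step. {ex['reasoning']} "
            f"The answer is {ex['answer']}.\n"
        )
    # The trailing cue nudges the model to emit its own reasoning chain.
    parts.append(f"Q: {question}\nA: Let's think step by step.")
    return "\n".join(parts)

prompt = build_cot_prompt("What is 371 * 246?")
```

The final prompt ends mid-answer with the “Let’s think step by step” cue, so the model’s most likely continuation is a step-by-step derivation rather than a single guessed number.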

7. Self-Consistency: This general approach follows a “wisdom of the crowd” or “majority vote” or “ensembles” approach to improve the performance of models. Consider that you could prompt the model to explore diverse paths in generating answers; we can assume that the answer most selected across these diverse paths is likely to be the correct one. How the diverse answers are generated and how they are aggregated (to infer the correct answer) may vary.

The Self-Consistency paper (Wang et al.) introduces a new self-consistency decoding strategy that samples diverse reasoning paths. First, the language model is prompted with a set of manually written chain-of-thought exemplars. Next, a set of candidate outputs is sampled from the language model’s decoder, generating a diverse set of candidate reasoning paths. Finally, answers are aggregated by marginalizing out the sampled reasoning paths and choosing the answer that is most consistent among the generated answers.
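The aggregation step can be sketched as a simple majority vote. The sampled reasoning paths below are canned strings standing in for real model samples, and the “The answer is X.” extraction convention is an illustrative assumption, not a guaranteed model behavior:

```python
from collections import Counter

def extract_answer(path):
    """Pull the final answer out of a reasoning path.

    Assumes each path ends with 'The answer is X.' (illustrative)."""
    return path.rsplit("The answer is", 1)[-1].strip(" .")

def self_consistent_answer(paths):
    """Majority vote over the final answers of sampled reasoning paths."""
    answers = [extract_answer(p) for p in paths]
    return Counter(answers).most_common(1)[0][0]

# Stand-ins for diverse reasoning paths sampled at non-zero temperature
sampled_paths = [
    "371 * 246 = 371 * 200 + 371 * 46 = 74200 + 17066. The answer is 91266.",
    "246 * 371 = 246 * 300 + 246 * 71 = 73800 + 17466. The answer is 91266.",
    "371 * 246 is roughly 370 * 250. The answer is 92500.",
]
answer = self_consistent_answer(sampled_paths)
```

Two of the three paths agree on 91266 (which is in fact 371 * 246), so the vote discards the sloppy estimate.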

8. Self-Evaluation: It turns out that if you ask a model to generate answers and ask it to generate the probabilities that its answers are correct, these probabilities are mostly well calibrated. That is, the model mostly knows what it does and doesn’t know.
The insight here is that for each generated answer to a prompt, we can get these probabilities and use that in filtering results (discard results that are likely incorrect).
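The filtering idea can be sketched as follows, with dummy (answer, probability) pairs standing in for real model outputs and an illustrative confidence threshold:

```python
# Self-evaluation filtering sketch: pair each candidate answer with the
# model's self-reported probability that it is correct, and discard
# low-confidence candidates.

def filter_by_confidence(candidates, threshold=0.7):
    """Keep only answers the model itself rates as likely correct."""
    return [answer for answer, p_true in candidates if p_true >= threshold]

# Dummy (answer, self-reported probability) pairs
candidates = [
    ("HMS Argus was converted from an ocean liner hull.", 0.9),
    ("HMS Argus was scrapped in 1932.", 0.3),  # low confidence: discarded
]
kept = filter_by_confidence(candidates)
```

The threshold trades recall for precision: raising it discards more hallucinations but also more correct answers the model was unsure about.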

9. Task Decomposition: Related to the chain-of-thought approach (mentioned above), this approach explores creating an initial router agent that explicitly decomposes the user’s prompt into specific subtasks. Each subtask is then handled by a specific expert agent. Note that an agent here is also an LLM (i.e., a forward pass through an LLM with optional abilities to act in the real world via APIs, e.g., creating a calendar invite, making payments, etc.). Also note that each agent may have to rewrite or reformulate the prompt it receives in order to apply its capabilities.
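A sketch of the control flow, with plain functions standing in for expert agents (each of which would be an LLM call in practice); the keyword routing and expert names are purely illustrative:

```python
# Router/expert sketch: the router decomposes the prompt into subtasks
# and dispatches each to an expert handler. Real agents would each be
# an LLM call; plain functions keep the control flow visible here.

def math_expert(subtask):
    return f"[math expert] handled: {subtask}"

def calendar_expert(subtask):
    return f"[calendar expert] handled: {subtask}"

EXPERTS = {"math": math_expert, "calendar": calendar_expert}

def route(prompt):
    """Naively split a prompt into subtasks and pick an expert by keyword.

    A real router would itself be an LLM producing a structured plan."""
    results = []
    for clause in prompt.split(" and "):
        expert = "calendar" if "meeting" in clause else "math"
        results.append(EXPERTS[expert](clause.strip()))
    return results

results = route("compute 371 * 246 and schedule a meeting for Friday")
```

Each clause is handled by the expert best suited to it, and the results can then be aggregated into a single response.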


While large language models (LLMs) can provide valuable insights, there are drawbacks to consider as well. One major issue is their limited logical reasoning and their tendency to hallucinate. LLM hallucinations are a growing concern and will remain one until LLMs acquire stronger reasoning abilities. We can’t eliminate hallucinations completely, but we can mitigate the risks through the strategies discussed above. Needless to say, it is crucial to exercise caution when applying these tools to workloads that demand accurate facts or intricate problem solving, since precision is paramount.
