Why Fine-Tuning Encourages Hallucinations and How to Fix It

TL;DR

Self-distillation reduces fine-tuning-induced hallucinations, lowering factual forgetting from 15% to 3%.

cs.CL · Advanced · 2026-04-17
Guy Kaplan Zorik Gekhman Zhen Zhu Lotem Rozner Yuval Reif Swabha Swayamdipta Derek Hoiem Roy Schwartz
fine-tuning hallucinations self-distillation continual learning language models

Key Findings

Methodology

The paper proposes a self-distillation-based supervised fine-tuning (SFT) method that reduces hallucinations by regularizing output-distribution drift. The method borrows tools from continual learning to mitigate knowledge degradation: during fine-tuning, a distillation penalty keeps the model's output distribution close to that of its pre-fine-tuning state, so new knowledge interferes less with existing knowledge. Separately, for scenarios where acquiring new knowledge is unnecessary, the paper explores freezing parameter groups to suppress factual plasticity.
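The regularizer described above can be sketched as a standard cross-entropy term plus a KL penalty toward the frozen pre-fine-tuning model. This is a minimal illustration, not the paper's implementation; the loss weight `alpha` and temperature `tau` are assumed hyperparameters.

```python
import torch
import torch.nn.functional as F

def self_distillation_loss(student_logits, teacher_logits, labels,
                           alpha=0.5, tau=1.0):
    # Standard SFT cross-entropy on the new training data.
    ce = F.cross_entropy(student_logits, labels)
    # KL penalty that keeps the student's output distribution close to
    # the frozen pre-fine-tuning teacher, limiting distribution drift.
    kl = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.log_softmax(teacher_logits / tau, dim=-1),
        log_target=True,
        reduction="batchmean",
    ) * tau * tau
    return (1 - alpha) * ce + alpha * kl
```

When the student's outputs match the teacher's, the KL term vanishes and the loss reduces to plain (down-weighted) cross-entropy; as the student drifts, the penalty grows.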

Key Results

  • Result 1: Under the self-distillation method, factual forgetting is reduced from 15% in standard SFT to 3%, while still enabling effective acquisition of new knowledge.
  • Result 2: By freezing parameter groups, the model reduces hallucinations in scenarios where new knowledge acquisition is unnecessary, while maintaining task performance.
  • Result 3: Experiments show that SFT-induced hallucinations are primarily driven by interference among overlapping semantic representations, and self-distillation succeeds by mitigating this interference.

Significance

This study redefines SFT-induced hallucinations as factual forgetting, providing a new perspective to understand and address this issue. By introducing the self-distillation method, the research effectively reduces hallucinations without sacrificing task performance. This finding is significant for both academia and industry as it not only enhances the reliability of large language models but also offers new insights and methods for the field of continual learning.

Technical Contribution

The technical contribution of this paper lies in applying self-distillation to SFT to reduce hallucinations. This approach fundamentally differs from existing state-of-the-art methods by maintaining factual stability through limiting output distribution drift. Additionally, the paper explores freezing parameter groups to reduce factual plasticity, offering new possibilities for engineering practice.

Novelty

This paper is the first to apply self-distillation to reduce SFT-induced hallucinations and demonstrates its effectiveness through experiments. Unlike previous work, this study not only focuses on acquiring new knowledge but also emphasizes the importance of maintaining existing knowledge.

Limitations

  • Limitation 1: The self-distillation method requires additional computational resources to maintain the teacher model's output distribution, which may increase training costs.
  • Limitation 2: The method of freezing parameter groups may not be applicable in scenarios where new knowledge acquisition is necessary.

Future Work

Future research could explore how to apply the self-distillation method to larger datasets and more complex tasks. Additionally, investigating how to combine this method with other continual learning techniques to further reduce hallucinations could be beneficial.

AI Executive Summary

In recent years, large language models have excelled in natural language processing tasks, but they are prone to generating factually incorrect statements, known as hallucinations. These hallucinations are particularly evident when models learn new knowledge through supervised fine-tuning (SFT). SFT is a standard practice in the development of large language models, but it may exacerbate hallucination issues, affecting the reliability of applications.

This paper proposes a self-distillation-based SFT method to reduce hallucinations. Self-distillation is a continual learning technique that reduces forgetting by regularizing the model's output distribution during fine-tuning. Experimental results show that this method reduces factual forgetting from 15% in standard SFT to 3% while maintaining effective acquisition of new knowledge.

Additionally, the study explores freezing parameter groups to suppress factual plasticity in scenarios where new knowledge acquisition is unnecessary. Experiments demonstrate that this method can reduce hallucinations while maintaining task performance.

To understand the mechanism behind SFT-induced hallucinations, the study proposes three hypotheses: capacity limitations, behavior cloning, and localized interference. Results indicate that interference among overlapping semantic representations is the main driver, and self-distillation succeeds by mitigating this interference.

This research not only provides an effective method for reducing hallucinations but also offers a new perspective for the field of continual learning. Future research could further explore how to apply these methods to more complex tasks and larger datasets.

Deep Analysis

Background

In recent years, large language models (LLMs) have substantially improved performance on natural language processing tasks, yet they remain prone to hallucinations: generated content that contains factual errors. This undermines model reliability and limits deployment in practical applications. Existing research indicates that hallucinations become especially pronounced when models learn new knowledge through supervised fine-tuning (SFT), a standard step in LLM development. Reducing hallucinations while maintaining model performance has therefore become an important research topic.

Core Problem

The core problem addressed in this paper is how to reduce SFT-induced hallucinations. When a model learns new knowledge through SFT, the updates can interfere with previously acquired knowledge, causing factual forgetting: the model begins answering incorrectly questions it previously answered correctly.

Innovation

The core innovation of this paper is the proposal of a self-distillation-based SFT method to reduce hallucinations. Self-distillation is a continual learning technique that reduces forgetting by regularizing the model's output distribution during fine-tuning. The innovation of this method lies in its focus not only on acquiring new knowledge but also on maintaining existing knowledge. Additionally, the paper explores freezing parameter groups to reduce factual plasticity, offering new possibilities for engineering practice.

Methodology

The methodology of this paper includes the following key steps:

  • Self-distillation: During fine-tuning, the model reduces forgetting by regularizing output-distribution drift; a distillation penalty keeps the output distribution close to that of the pre-fine-tuning model.
  • Freezing parameter groups: In scenarios where new knowledge acquisition is unnecessary, freezing parameter groups suppresses factual plasticity, reducing hallucinations while maintaining task performance.
  • Experimental design: Experiments compare standard SFT against the self-distillation method to verify its effectiveness in reducing hallucinations.
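The freezing step above can be sketched as disabling gradients for selected parameters. The name-pattern grouping below is a hypothetical illustration; the paper's actual grouping criterion is not specified here.

```python
import torch.nn as nn

def freeze_parameter_groups(model: nn.Module, patterns=("embed",)):
    """Disable gradient updates for parameters whose names contain any
    of the given patterns (illustrative grouping, not the paper's)."""
    frozen = []
    for name, param in model.named_parameters():
        if any(pat in name for pat in patterns):
            param.requires_grad = False
            frozen.append(name)
    return frozen
```

Frozen parameters are skipped by the optimizer's gradient updates, so the knowledge they encode stays fixed while the remaining parameters adapt to the new task.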

Experiments

The experimental design includes the following aspects:

  • Datasets: The SLiCK method is used to classify questions and to select known and unknown facts for training and evaluation.
  • Baselines: Comparisons with standard SFT verify the effectiveness of the self-distillation method.
  • Metrics: Factual forgetting rate and task performance are used to evaluate the model.
  • Hyperparameters: Appropriate learning rates and training epochs are selected to ensure model effectiveness.
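The forgetting metric used here can be formalized as the share of facts the base model answered correctly that the fine-tuned model gets wrong. The dictionary-based interface below is a hypothetical sketch, not the paper's evaluation code.

```python
def factual_forgetting_rate(correct_before: dict, correct_after: dict) -> float:
    """Fraction of questions answered correctly before fine-tuning but
    incorrectly after it. Returns 0.0 if nothing was known before."""
    known = [q for q, ok in correct_before.items() if ok]
    if not known:
        return 0.0
    forgotten = sum(1 for q in known if not correct_after.get(q, False))
    return forgotten / len(known)
```

Under this definition, the paper's headline result corresponds to the rate dropping from 0.15 under standard SFT to 0.03 with self-distillation.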

Results

Experimental results show that the self-distillation method reduces factual forgetting from 15% in standard SFT to 3% while maintaining effective acquisition of new knowledge. Additionally, by freezing parameter groups, the model reduces hallucinations in scenarios where new knowledge acquisition is unnecessary, while maintaining task performance. Experiments also indicate that SFT-induced hallucinations are primarily driven by interference among overlapping semantic representations, and self-distillation succeeds by mitigating this interference.

Applications

The methods proposed in this paper can be applied to large language models where reducing hallucinations is necessary, especially in scenarios where maintaining existing knowledge is crucial. For example, in private domain SFT or alignment fine-tuning, freezing parameter groups can reduce hallucinations. In domain adaptation where new knowledge acquisition is required, the self-distillation method can reduce hallucinations while maintaining effective acquisition of new knowledge.

Limitations & Outlook

Despite the effectiveness of the self-distillation method in reducing hallucinations, it requires additional computational resources to maintain the teacher model's output distribution, which may increase training costs. Additionally, the method of freezing parameter groups may not be applicable in scenarios where new knowledge acquisition is necessary. Future research could explore how to apply the self-distillation method to larger datasets and more complex tasks.

Plain Language (accessible to non-experts)

Imagine you're in a kitchen cooking. You already know how to make delicious pasta, but now you want to try a new sauce. To ensure you don't forget how to make pasta, you learn the new sauce while making sure not to change your memory of making pasta. This is like the self-distillation method, which maintains old knowledge while learning new knowledge.

In the kitchen, you might freeze some ingredients that don't need changing, like the basic ingredients for pasta, and focus only on making the new sauce. This is similar to the method of freezing parameter groups, where only necessary adjustments are made.

In this way, you can learn the new sauce while ensuring you never mess up making pasta. This is how self-distillation and freezing parameter groups work to reduce hallucinations. They help the model maintain accuracy on old knowledge while learning new knowledge.

ELI14 (explained like you're 14)

Hey there! Have you ever played a game where you keep upgrading your character? Imagine your character has learned lots of skills, but every time you learn a new one, some old skills stop working properly. In language models, that kind of forgetting shows up as hallucinations: confident answers that are just wrong!

Scientists found that when large language models learn new knowledge, they might forget what they learned before. To avoid this, they invented a method called self-distillation. It's like saving your character's state in a game to ensure learning new skills doesn't affect the old ones.

There's also a method to freeze some skills that don't need changing and focus only on learning new ones. It's like in a game where you only upgrade the skills you need without touching others.

With these methods, models can learn new knowledge while maintaining their grasp on old knowledge. This way, we get smarter and more reliable AI!

Glossary

Self-distillation

Self-distillation is a continual learning technique that reduces forgetting by regularizing the model's output distribution during fine-tuning.

In this paper, self-distillation is used to reduce SFT-induced hallucinations.

Supervised Fine-Tuning (SFT)

SFT is a method of fine-tuning models through supervised learning, commonly used in the development of large language models.

The paper explores the hallucination problem induced by SFT.

Hallucination

Hallucination refers to the generation of content by models that contains factual errors, affecting their reliability.

The paper studies SFT-induced hallucinations and their solutions.

Continual Learning

Continual learning is a machine learning approach that enables models to learn new knowledge without forgetting old knowledge.

The paper leverages tools from continual learning to reduce SFT-induced hallucinations.

Freezing Parameter Groups

Freezing parameter groups is a method of preventing updates to selected subsets of model parameters, keeping the knowledge they encode stable.

In scenarios where new knowledge acquisition is unnecessary, the paper explores freezing parameter groups.

Output Distribution Drift

Output distribution drift refers to changes in the model's output distribution when learning new knowledge, which may lead to forgetting old knowledge.

Self-distillation reduces hallucinations by regularizing output distribution drift.
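One way to quantify this drift is the KL divergence between the pre- and post-fine-tuning next-token distributions on a probe prompt. The plain-Python sketch below is illustrative and assumes the two distributions are given as probability lists.

```python
import math

def output_distribution_drift(p_before, p_after, eps=1e-12):
    """KL(before || after) over next-token probabilities for one probe
    prompt; 0 means no drift, larger values mean more drift."""
    return sum(
        pb * math.log((pb + eps) / (pa + eps))
        for pb, pa in zip(p_before, p_after)
        if pb > 0
    )
```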

Factual Forgetting

Factual forgetting refers to interference with previously acquired knowledge when models learn new knowledge, leading to errors.

The paper redefines SFT-induced hallucinations as factual forgetting.

SLiCK Method

The SLiCK method is a technique for classifying questions to identify the model's pre-existing knowledge level.

The paper uses the SLiCK method to classify questions for evaluating model performance.

Overlapping Semantic Representations

Overlapping semantic representations refer to different entities sharing similar representations within the model, potentially causing interference.

The paper finds that SFT-induced hallucinations are primarily driven by interference among overlapping semantic representations.

Knowledge Degradation

Knowledge degradation refers to the destruction or forgetting of previously acquired knowledge representations when learning new knowledge.

The paper explores how to reduce knowledge degradation using tools from continual learning.

Open Questions (unanswered questions from this research)

  1. How can the self-distillation method be applied to larger datasets to reduce hallucinations? Existing methods may face computational-resource limits that need further optimization.
  2. How effective is the self-distillation method on more complex tasks? Its applicability across different tasks needs to be explored.
  3. Freezing parameter groups may not be applicable when new knowledge acquisition is necessary. How can hallucinations be reduced in those scenarios?
  4. Can the self-distillation method be combined with other continual learning techniques to further enhance model performance?
  5. How can the effectiveness of the self-distillation method be maintained without increasing computational costs? More efficient implementations need exploration.

Applications

Immediate Applications

Private Domain SFT

In private domain SFT, freezing parameter groups can reduce hallucinations and maintain the stability of existing knowledge.

Alignment Fine-Tuning

In alignment fine-tuning, freezing parameter groups can reduce hallucinations when new knowledge acquisition is unnecessary.

Domain Adaptation

In domain adaptation where new knowledge acquisition is required, the self-distillation method can reduce hallucinations while maintaining effective acquisition of new knowledge.

Long-term Vision

Large-Scale Knowledge Base Construction

Reducing hallucinations can improve the efficiency and accuracy of large-scale knowledge base construction.

Intelligent Assistant Development

Reducing hallucinations in intelligent assistant development can enhance user experience and system reliability.

Abstract

Large language models are prone to hallucinating factually incorrect statements. A key source of these errors is exposure to new factual information through supervised fine-tuning (SFT), which can increase hallucinations w.r.t. knowledge acquired during pre-training. In this work, we explore whether SFT-induced hallucinations can be mitigated using established tools from the continual learning literature, since they arise as a by-product of knowledge degradation during training. We propose a self-distillation-based SFT method that facilitates effective factual learning while minimizing hallucinations w.r.t. pre-existing knowledge by regularizing output-distribution drift. We also show that, in settings where new knowledge acquisition is unnecessary, suppressing factual plasticity by freezing parameter groups can preserve task performance while reducing hallucinations. Lastly, we investigate the mechanism behind SFT-induced hallucinations through three hypotheses: capacity limitations, behavior cloning, and localized interference. Our experiments show that a main driver is interference among overlapping semantic representations, and that self-distillation succeeds by mitigating this interference.

cs.CL cs.AI cs.LG cs.NE
