Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?
Self-distillation can degrade LLMs' reasoning in math by suppressing uncertainty expression.
Key Findings
Methodology
This study investigates the impact of self-distillation on the reasoning capabilities of large language models (LLMs), particularly in mathematical reasoning tasks. Using models like Qwen3-8B, DeepSeek-Distill-Qwen-7B, and Olmo3-7B-Instruct, the research analyzes how conditioning the teacher model on rich information suppresses uncertainty expression in the student model. Controlled experiments varied the richness of the conditioning context and task coverage to systematically study how self-distillation affects reasoning behavior.
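The setup described above can be sketched in a few lines. Everything here (function names, the toy `generate` stub) is illustrative, not the paper's actual code; the point is only that teacher and student are the same model, with the teacher additionally conditioned on rich information:

```python
# Hypothetical sketch of the self-distillation setup: teacher and student
# are two instances of the SAME model, but the teacher is additionally
# conditioned on rich information (e.g. a reference solution).

def generate(model, problem, context=""):
    """Stand-in for an LLM call. With a rich context the trace comes out
    confident and concise; without one it keeps hedging phrases."""
    hedges = [] if context else ["wait, let me check that...", "hmm, alternatively..."]
    return " ".join([f"solve {problem}:", *hedges, "final answer"])

def self_distill(model, problems, rich_context_for):
    """Collect (problem, teacher-trace) pairs; the student is then
    fine-tuned on these traces WITHOUT seeing the rich context."""
    return [(p, generate(model, p, context=rich_context_for(p))) for p in problems]

data = self_distill("same-model", ["2+2=?"], rich_context_for=lambda p: "reference: 4")
```

Because the teacher sees the reference, its trace contains no hedging, which is exactly the uncertainty suppression the student then inherits.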
Key Results
- Across models like Qwen3-8B, DeepSeek-Distill-Qwen-7B, and Olmo3-7B-Instruct, self-distillation led to performance drops of up to 40%. This decline is primarily due to the suppression of uncertainty expression during reasoning, which negatively impacts performance on unseen problems.
- Experiments show that when the teacher model is conditioned on rich information, the student model's reasoning becomes more confident and concise, but this also suppresses uncertainty expression, affecting out-of-distribution (OOD) performance.
- Comparing performance across conditions shows that self-distillation under rich-information conditioning shifts the model's reasoning style: it enables rapid in-domain optimization even with limited task coverage, but the resulting style generalizes poorly out of distribution.
Significance
This research reveals the mechanism by which self-distillation can degrade reasoning capabilities in mathematical tasks, highlighting the importance of appropriately expressing uncertainty during reasoning. This finding is significant for both academia and industry as it challenges the current assumption that self-distillation universally improves model performance and points to new directions for optimizing reasoning behavior beyond merely reinforcing correct answer traces.
Technical Contribution
Technical contributions include uncovering the suppressive effect of self-distillation on uncertainty expression under rich-information conditioning, and tracing how this suppression harms reasoning capability and generalization. The study argues for optimizing reasoning behavior beyond merely reinforcing correct answer traces, emphasizing that retaining uncertainty expression during reasoning is crucial for performance on unseen tasks.
Novelty
This study is the first to systematically analyze the impact of self-distillation on uncertainty expression in mathematical reasoning tasks, proposing a mechanism by which self-distillation may lead to reasoning degradation under rich information conditions. This finding contrasts with previous studies that concluded self-distillation universally improves performance, providing a new perspective.
Limitations
- The study focuses primarily on mathematical reasoning tasks, which may not apply to reasoning tasks in other domains. Different domains may have varying requirements for uncertainty expression.
- The models and datasets used in the experiments are limited, which may not fully represent the behavior of all large language models.
- The study does not examine how the effects vary across specific types of reasoning tasks, which may limit the generalizability of its conclusions.
Future Work
Future research could explore the performance of self-distillation in other reasoning tasks, especially those requiring high levels of uncertainty expression. Additionally, the study could further analyze the impact of different model architectures and datasets on the effects of self-distillation to develop more general optimization strategies.
AI Executive Summary
In the post-training of large language models (LLMs), self-distillation has emerged as an effective paradigm, often improving model performance and shortening reasoning paths. However, in mathematical reasoning tasks, it has been found that self-distillation can reduce response length while degrading performance. The root cause of this phenomenon is traced to the suppression of uncertainty expression during reasoning. Through a series of controlled experiments, researchers found that when the teacher model is conditioned on rich information, the student model's reasoning trajectory becomes more confident and concise, but this also suppresses uncertainty expression, affecting the model's performance on out-of-distribution (OOD) tasks.
The study used models like Qwen3-8B, DeepSeek-Distill-Qwen-7B, and Olmo3-7B-Instruct to analyze the effects of self-distillation under different conditions. The experiments showed that in these models, self-distillation led to performance drops of up to 40%. This decline is primarily due to the suppression of uncertainty expression during reasoning, which negatively impacts performance on unseen problems.
These results challenge the assumption that self-distillation universally improves model performance. The work traces the degradation to a concrete mechanism, suppressed uncertainty expression under rich-information conditioning, and argues that post-training should optimize reasoning behavior beyond merely reinforcing correct answer traces. Open directions include testing self-distillation in other reasoning domains, especially those that demand uncertainty expression, and across more architectures and datasets to develop more general optimization strategies.
Deep Analysis
Background
In recent years, large language models (LLMs) have made significant advances in the field of natural language processing. Self-distillation, as a post-training technique, aims to improve model performance by using two instances of the same model, where one instance serves as the teacher model providing informative reward signals, and the other instance serves as the student model generating responses. Self-distillation has been shown to significantly improve model performance in various domains, especially in scientific reasoning and agentic environments. However, there is limited research on the effects of self-distillation in mathematical reasoning tasks.
Core Problem
Self-distillation on mathematical reasoning tasks can degrade a model's reasoning capability. The core problem is that the distillation process suppresses uncertainty expression, which hurts performance on unseen problems: mathematical reasoning often requires the model to voice uncertainty about alternative solution paths so that it can adjust and self-correct mid-reasoning.
Innovation
The core innovation of this study is revealing the suppressive effect of self-distillation on uncertainty expression in mathematical reasoning. Through controlled experiments, it systematically analyzes self-distillation under different conditions, in particular how rich-information conditioning changes reasoning behavior, and makes the case that retaining uncertainty expression during reasoning matters more than merely reinforcing correct answer traces.
Methodology
- Use Qwen3-8B, DeepSeek-Distill-Qwen-7B, and Olmo3-7B-Instruct for the experiments.
- Vary the richness of the conditioning context to analyze its impact on the model's reasoning behavior.
- In controlled experiments, condition the teacher model on rich information while optimizing the student within limited task coverage.
- Evaluate performance on OOD tasks and analyze how uncertainty expression affects reasoning capability.
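The controlled setup in the methodology might look like the following sketch, with illustrative richness levels and task fields (the paper's actual conditions and data schema may differ):

```python
# Hypothetical sketch of the controlled setup: the conditioning context
# handed to the teacher is varied from empty to "rich" (hint plus the
# reference solution), and training uses only a narrow slice of tasks.

RICHNESS_LEVELS = {
    "none": lambda task: "",
    "hint": lambda task: f"Hint: {task['hint']}",
    "rich": lambda task: f"Hint: {task['hint']} Reference solution: {task['solution']}",
}

def build_condition(task, richness):
    """Compose the teacher's conditioning context at a given richness level."""
    return RICHNESS_LEVELS[richness](task)

def limit_coverage(tasks, keep_topics):
    """Restrict training to limited task coverage (in-domain only)."""
    return [t for t in tasks if t["topic"] in keep_topics]

tasks = [
    {"topic": "algebra", "hint": "isolate x", "solution": "x = 3"},
    {"topic": "geometry", "hint": "use Pythagoras", "solution": "c = 5"},
]
train = limit_coverage(tasks, {"algebra"})   # limited coverage: algebra only
ctx = build_condition(train[0], "rich")      # teacher sees hint + solution
```

Crossing richness levels with coverage settings is what lets the study separate the effect of rich conditioning from the effect of narrow training data.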
Experiments
The experiments compare Qwen3-8B, DeepSeek-Distill-Qwen-7B, and Olmo3-7B-Instruct against baseline models across different datasets, with reasoning accuracy and response length as evaluation metrics. Ablation studies analyze the effects of self-distillation under different conditioning and coverage settings.
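The two headline metrics can be sketched as simple aggregate functions. The record fields (`pred`, `gold`, `response`) are assumptions for illustration, not the paper's evaluation harness:

```python
# Minimal sketch of the two evaluation metrics the experiments track:
# answer accuracy and response length.

def accuracy(records):
    """Fraction of responses whose final answer matches the reference."""
    return sum(r["pred"] == r["gold"] for r in records) / len(records)

def mean_length(records, tokenize=str.split):
    """Average response length in (whitespace) tokens."""
    return sum(len(tokenize(r["response"])) for r in records) / len(records)

records = [
    {"pred": "4", "gold": "4", "response": "2 plus 2 is 4"},
    {"pred": "5", "gold": "6", "response": "the answer is 5"},
]
acc = accuracy(records)     # -> 0.5
avg = mean_length(records)  # -> 4.5
```

Tracking both together is what surfaces the paper's central observation: responses can get shorter while accuracy drops.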
Results
Self-distillation on mathematical reasoning tasks produced performance drops of up to 40%. The decline tracks the suppression of uncertainty expression: when the teacher is conditioned on rich information, the student's reasoning becomes more confident and concise, which helps in-domain but hurts accuracy on unseen (OOD) problems.
Applications
The research findings have significant implications for optimizing large language models, especially in reasoning tasks requiring high levels of uncertainty expression. The study reveals the mechanism by which self-distillation can degrade reasoning capabilities in mathematical tasks, highlighting the importance of appropriately expressing uncertainty during reasoning.
Limitations & Outlook
The study focuses on mathematical reasoning, and its conclusions may not transfer to domains with different requirements for uncertainty expression. The models and datasets tested are limited and may not represent the behavior of all large language models, and the study does not examine how the effects vary across specific task types. Future work on broader domains, architectures, and datasets would test the generality of the conclusions.
Plain Language: Accessible to non-experts
Imagine you're cooking in a kitchen. You have a recipe (teacher model) that tells you how to make the perfect dish. You follow the recipe step by step (student model), but sometimes you might be unsure about certain steps, like "How much seasoning should I add?" At this point, you might pause to think or even try different amounts (uncertainty expression).
Now, imagine you have a super-smart kitchen assistant (self-distillation) that gives you advice while you cook. This assistant is very confident and always tells you, "Just do it this way, don't worry!" As a result, you cook quickly, but sometimes the dish doesn't taste quite right because you didn't have the chance to experiment and adjust.
This is similar to the problem with self-distillation in mathematical reasoning. The model no longer expresses uncertainty during reasoning, leading to poor performance on unseen problems. Just like in the kitchen, if you always follow the assistant's advice without trying and adjusting, you might miss out on some delicious possibilities.
Therefore, appropriately expressing uncertainty is important as it gives you the chance to experiment and adjust, leading to better performance when facing new problems.
ELI14: Explained like you're 14
Hey there! Have you ever played a puzzle game where you need to solve riddles to find a treasure? Sometimes, you might think, "How do I solve this puzzle?" At this point, you might try different methods or even ask your friends for advice, right?
Now, imagine you have a super-cool game assistant that always tells you, "Just do it this way, it's fine!" At first, you might think it's great because you can find the treasure quickly. But slowly, you'll realize that some puzzles are still unsolvable because the assistant always gives you the same advice, and you don't get the chance to try different methods.
This is like what scientists found when studying large language models. Sometimes, when solving math problems, the model becomes too confident and doesn't try different methods, leading to poor performance on new problems.
So, expressing uncertainty is like trying different methods in a game. It gives you the chance to explore and learn, leading to better performance when facing new challenges!
Glossary
Self-Distillation
A post-training technique that uses two instances of the same model to improve performance, where one instance acts as the teacher model providing informative reward signals, and the other acts as the student model generating responses.
Used in the study to analyze the impact of self-distillation on LLM reasoning capabilities.
Epistemic Verbalization
During reasoning, the model expresses its uncertainty about certain reasoning paths through language. This expression can help the model adjust and correct during the reasoning process.
The study analyzes the suppressive effect of self-distillation on epistemic verbalization.
Large Language Model (LLM)
A deep learning-based natural language processing model capable of generating and understanding human language.
Models like Qwen3-8B, DeepSeek-Distill-Qwen-7B, and Olmo3-7B-Instruct were used in the study.
Reasoning Capability
The ability of a model to perform logical reasoning and decision-making when solving problems.
The study analyzes the impact of self-distillation on model reasoning capabilities.
In-Domain Optimization
Optimization of a model within the distribution of the training data to improve performance on known tasks.
The study analyzes the in-domain optimization effects of self-distillation with limited task coverage.
Out-of-Distribution (OOD) Performance
The performance of a model on data or tasks it has not seen before.
The study analyzes the impact of self-distillation on model OOD performance.
Qwen3-8B
A large language model used to study the impact of self-distillation on reasoning capabilities.
One of the models used in the study.
DeepSeek-Distill-Qwen-7B
A large language model used to study the impact of self-distillation on reasoning capabilities.
One of the models used in the study.
Olmo3-7B-Instruct
A large language model used to study the impact of self-distillation on reasoning capabilities.
One of the models used in the study.
Conditioning Context
The informational background on which the teacher model is based during self-distillation.
The study analyzes the impact of the richness of the conditioning context on self-distillation effects.
Information Richness
The amount and level of detail of information contained in the conditioning context.
The study analyzes the impact of information richness on uncertainty expression.
Task Coverage
The variety and number of tasks the model is exposed to during training.
The study analyzes the impact of task coverage on self-distillation effects.
Ablation Study
A method of analyzing the impact of removing or altering certain parts of a model on overall performance.
Used in the study to analyze the effects of self-distillation.
Reasoning Trajectory
The reasoning path and steps a model goes through when solving a problem.
The study analyzes the impact of self-distillation on reasoning trajectories.
Model Performance
The performance of a model on specific tasks, including metrics like accuracy and response time.
The study analyzes the impact of self-distillation on model performance.
Open Questions: Unanswered questions from this research
1. What are the effects of self-distillation on reasoning tasks in other domains? Current research focuses primarily on mathematical reasoning, and other domains may have different requirements for uncertainty expression.
2. How do different model architectures and datasets change the effects of self-distillation? The models and datasets used in the study are limited and may not represent all large language models.
3. How can uncertainty expression be effectively retained during self-distillation? The study highlights its importance for reasoning capability, but concrete methods for preserving it require further exploration.
4. Through what precise mechanism does self-distillation affect generalization? The study traces reasoning degradation to suppressed uncertainty expression, but the full mechanism needs further study.
5. How can self-distillation be adapted to improve performance on unseen tasks? Candidate strategies for optimizing reasoning behavior still require validation.
Applications
Immediate Applications
Mathematical Reasoning Task Optimization
The research findings can be used to optimize the performance of large language models in mathematical reasoning tasks, especially those requiring high levels of uncertainty expression.
Educational Applications
Large language models can be used in automated problem-solving and assessment systems in education, improving accuracy and reliability by appropriately expressing uncertainty.
Scientific Research Assistance
Large language models can be used in data analysis and reasoning tasks in scientific research, improving performance in complex tasks through optimized self-distillation.
Long-term Vision
Development of General Artificial Intelligence
By optimizing self-distillation and uncertainty expression, large language models can be advanced towards general artificial intelligence, improving performance across various tasks.
Cross-Domain Application Expansion
The research findings can be used to expand the application of large language models in different domains, including automated decision-making and reasoning tasks in healthcare, finance, and law.
Abstract
Self-distillation has emerged as an effective post-training paradigm for LLMs, often improving performance while shortening reasoning traces. However, in mathematical reasoning, we find that it can reduce response length while degrading performance. We trace this degradation to the suppression of epistemic verbalization - the model's expression of uncertainty during reasoning. Through controlled experiments varying conditioning context richness and task coverage, we show that conditioning the teacher on rich information suppresses uncertainty expression, enabling rapid in-domain optimization with limited task coverage but harming OOD performance, where unseen problems benefit from expressing uncertainty and adjusting accordingly. Across Qwen3-8B, DeepSeek-Distill-Qwen-7B, and Olmo3-7B-Instruct, we observe performance drops of up to 40%. Our findings highlight that exposing appropriate levels of uncertainty is crucial for robust reasoning and underscore the importance of optimizing reasoning behavior beyond merely reinforcing correct answer traces.