The Role of Feedback Alignment in Self-Distillation

TL;DR

This paper studies how feedback design affects self-distillation by comparing three feedback types. Step-aligned critique performs best, improving Avg@12 by 16.11 points over GRPO and by 5.27 points over reference-solution-conditioned self-distillation.

cs.AI Advanced 2026-06-10
Semih Kara Oğuzhan Ersoy
deep learning model distillation feedback mechanism natural language processing reasoning

Key Findings

Methodology

The study uses a solver-critic framework for mathematical reasoning tasks. It compares three feedback conditions: GRPO (binary reward), RefSol (reference solution), and StepAlignFB (step-by-step critique). The solver generates reasoning traces, and a frozen critic provides feedback according to the chosen condition. In the feedback-conditioned settings, the solver is trained with a self-distillation objective (Eq. 3) that minimizes the divergence between two output distributions of the same model: the student's, conditioned on the question alone, and the self-teacher's, conditioned on the question plus the feedback. The core innovation is structure-aware feedback, in which the critic copies correct steps verbatim and corrects the erroneous ones, so that the feedback follows the solver's reasoning path. This setup enables precise error localization, akin to process reward models (PRMs), but without training a separate reward model. The experiments use OpenMathReasoning and report accuracy, answer length, and per-token advantages.
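The objective described above can be illustrated with a small, dependency-free sketch. It is not the paper's Eq. 3, which this summary does not reproduce. It assumes the loss is the mean per-token forward KL, KL(self-teacher || student), and the function names and toy logits are hypothetical.

```python
import math

def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def forward_kl(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student) at a single token position."""
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(teacher_probs, student_probs)
    )

def self_distillation_loss(teacher_logits, student_logits):
    """Mean per-token forward KL over one sampled response (assumed form)."""
    kls = [
        forward_kl(softmax(t), softmax(s))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(kls) / len(kls)

# Toy response of 2 tokens over a 3-token vocabulary.
# teacher: same model conditioned on question + feedback
# student: same model conditioned on the question only
teacher = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
student = [[2.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
loss = self_distillation_loss(teacher, student)
```

In this toy case the first token contributes nothing to the loss because both distributions agree there. Only the second token, where the feedback changes the self-teacher's prediction, produces a learning signal.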

Key Results

  • On the OpenMathReasoning dataset, StepAlignFB reaches a Pass@12 of 90.00%, compared with 76.67% for GRPO (a 13.33-point gap) and 86.67% for reference-solution-conditioned self-distillation (a 3.33-point gap). On Avg@12, it improves over GRPO by 16.11 points and over reference-conditioned self-distillation by 5.27 points. The model also concentrates more probability on correct answers, as shown by higher majority-vote accuracy. Per-token advantage analysis indicates that feedback targeting errors improves local error correction and yields more robust reasoning traces. Together, these results support the claim that structural alignment in feedback makes self-distillation more effective.
  • The advantage stems from the feedback’s ability to localize signals at error points, reinforcing correct reasoning segments while sharply penalizing errors. Unlike diffuse solution-level feedback, step-aligned critique preserves correct steps and only modifies erroneous ones, mimicking process reward signals. This targeted approach results in higher accuracy and more stable reasoning paths, as shown by the improved metrics across multiple checkpoints. The experiments also demonstrate that feedback quality and structural alignment are as crucial as feedback content, emphasizing the importance of feedback design in self-distillation.
  • The findings indicate that feedback structure influences the model’s learning dynamics profoundly. Conditioning on reference solutions tends to diffuse signals across tokens, even at correct steps, diluting the correction effect. Conversely, step-aligned critique concentrates the learning signal at error locations, leading to more efficient and stable model improvements. This insight opens avenues for designing more effective feedback mechanisms, especially in complex reasoning tasks where error localization is critical. The approach reduces reliance on expensive reward models, making self-distillation more scalable and adaptable.
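As a quick arithmetic check, the Pass@12 gaps can be recomputed from the three reported scores. The snippet uses only figures quoted in this summary. The 16.11 and 5.27 point gains are reported by the abstract for Avg@12, so they are not expected to match these Pass@12 differences.

```python
# Pass@12 scores as reported in this summary (percent).
pass_at_12 = {"GRPO": 76.67, "RefSol": 86.67, "StepAlignFB": 90.00}

gap_vs_grpo = round(pass_at_12["StepAlignFB"] - pass_at_12["GRPO"], 2)
gap_vs_refsol = round(pass_at_12["StepAlignFB"] - pass_at_12["RefSol"], 2)

print(f"Pass@12 gap vs GRPO:   {gap_vs_grpo:.2f} points")
print(f"Pass@12 gap vs RefSol: {gap_vs_refsol:.2f} points")
```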

Significance

This work advances the understanding of feedback mechanisms in self-distillation, highlighting the importance of structural alignment for effective error correction. By demonstrating that localized, step-aligned critique significantly outperforms traditional reward-based or reference-based feedback, it offers a scalable, cost-effective strategy for enhancing reasoning capabilities in large language models. The approach bridges the gap between process supervision and dense token-level learning, providing a new paradigm for model fine-tuning without external reward models. Its implications extend to various domains requiring precise error localization, such as mathematical reasoning, scientific problem-solving, and complex logical inference. The findings suggest that future research should focus on designing feedback structures that mirror the reasoning process, potentially transforming how models learn from natural language feedback and improving their interpretability and robustness.

Technical Contribution

The paper introduces a novel feedback alignment mechanism based on structure-aware critique, which aligns feedback content with the model’s reasoning path. This approach leverages the concept of process reward modeling (PRM) without requiring training separate reward networks, reducing complexity and computational cost. The key technical innovation is the design of step-aligned critique, where the critic copies correct steps verbatim and corrects errors, ensuring feedback is localized and relevant. This mechanism enhances the per-token advantage signal, leading to more precise error correction and improved model performance. The method also incorporates a detailed advantage analysis, revealing how feedback structure influences signal localization and model learning dynamics. The framework is compatible with existing self-distillation setups, making it a practical enhancement for large-scale training pipelines.
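To make the idea of a per-token advantage concrete, here is a minimal sketch. It assumes the advantage at each token is the log-probability the feedback-conditioned self-teacher assigns to the sampled token minus the log-probability the student assigns to it. That definition, the threshold, and the toy numbers are illustrative assumptions, not values from the paper.

```python
def per_token_advantage(teacher_logprobs, student_logprobs):
    """Assumed definition: teacher log-prob minus student log-prob per token."""
    return [t - s for t, s in zip(teacher_logprobs, student_logprobs)]

# Toy trace "2 + 3 = 6": only the final token is wrong.
tokens = ["2", "+", "3", "=", "6"]
student_lp = [-0.1, -0.1, -0.1, -0.1, -0.3]   # student sees the question only
teacher_lp = [-0.1, -0.1, -0.1, -0.1, -4.0]   # self-teacher also sees the critique

advantages = per_token_advantage(teacher_lp, student_lp)

# Tokens whose advantage falls below a (hypothetical) threshold are penalized.
flagged = [tok for tok, a in zip(tokens, advantages) if a < -1.0]
```

With step-aligned feedback the self-teacher agrees with the student on the correct prefix, so the advantage is zero there and strongly negative only at the erroneous token.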

Novelty

This research presents a systematic comparison of feedback structures in self-distillation, an area the authors describe as largely unexplored, and emphasizes the importance of structural alignment. Unlike prior work that treats feedback as a fixed signal, either a reference solution or a scalar reward, this study introduces a step-by-step critique aligned with the reasoning trace. The innovation lies in leveraging in-context copying behaviors, akin to induction-head mechanisms, to selectively reinforce correct steps while correcting errors locally. This approach localizes the learning signal and leads to significant performance gains in complex reasoning tasks. It bridges the gap between process supervision and token-level learning, offering a new perspective on designing feedback for self-improvement without external reward models.

Limitations

  • The effectiveness of the proposed feedback mechanism heavily depends on the quality of the critic. Poorly trained or inaccurate critics may produce misleading feedback, reducing the benefits of structure alignment.
  • The current experiments focus mainly on mathematical reasoning tasks; applicability to broader NLP tasks like commonsense reasoning, dialogue, or creative generation remains to be validated.
  • Computational overhead associated with generating step-aligned critiques, especially in large models, could limit scalability. Optimizing feedback generation and processing efficiency is necessary for real-world deployment.

Future Work

Future research should explore multi-modal feedback integration, combining textual, visual, and auditory signals to enhance reasoning in diverse contexts. Developing more robust and scalable critic models, possibly via unsupervised or semi-supervised learning, will be crucial. Extending the structure-aligned critique framework to other NLP tasks, such as summarization or dialogue, can broaden its impact. Additionally, integrating reinforcement learning techniques to dynamically adapt feedback strategies based on model performance could further improve efficiency and generalization. Investigating the theoretical underpinnings of feedback alignment and its relation to human learning processes may also provide deeper insights into model interpretability and alignment.

AI Executive Summary

The rapid evolution of large language models (LLMs) has transformed natural language processing, yet their reasoning capabilities remain limited by training paradigms that often lack fine-grained, process-level supervision. Reinforcement learning approaches, such as RLHF or outcome-reward methods like GRPO, assign a single scalar reward to a whole response, which gives sparse feedback and hampers precise error localization. Knowledge distillation provides denser supervision by transferring knowledge from a teacher model, but it still faces challenges in aligning the model's reasoning path with the desired outcome.

Self-distillation emerges as a promising alternative, enabling models to learn from their own generated feedback without external teachers. However, the design of this feedback critically influences the effectiveness of the process. Prior work has used fixed references or scalar rewards, but these approaches often diffuse the learning signal across the entire reasoning trace, diluting the impact of localized errors.

This paper introduces a novel feedback alignment mechanism based on structure-aware, step-by-step critique. The core idea is to align feedback content with the model’s reasoning path, copying correct steps verbatim and correcting errors in a targeted manner. This approach mirrors process reward models (PRMs) but eliminates the need for training separate reward networks, reducing complexity and cost.

Experiments on the OpenMathReasoning dataset show that structure-aligned critique outperforms both reward-based and reference-based feedback. The model trained with step-aligned feedback achieves a Pass@12 of 90.00%, which is 13.33 points above the 76.67% of the reward-only baseline and 3.33 points above reference-conditioned self-distillation. On Avg@12 the gains are 16.11 and 5.27 points, respectively. The improvements are attributed to the feedback's ability to localize errors precisely, reinforcing correct reasoning segments while sharply penalizing mistakes.

Analyses reveal that the advantage signals generated by structure-aligned critique are highly localized, resembling process reward signals, which enhances the model’s capacity to correct errors without disrupting correct steps. This targeted feedback mechanism effectively guides the model toward more accurate and stable reasoning paths, demonstrating its potential to transform self-supervised training paradigms.

Despite these advances, challenges remain. The quality of feedback depends on the critic’s accuracy, and scalability to more complex or diverse tasks requires further validation. Future directions include integrating multi-modal feedback, optimizing computational efficiency, and extending the framework to broader NLP applications. Overall, this work provides a significant step toward more precise, efficient, and scalable self-supervised learning strategies for complex reasoning in large language models.

Deep Analysis

Background

The development of large language models (LLMs) such as GPT-3 and its successors, building on earlier transformer-based models such as BERT, has dramatically advanced NLP capabilities. Knowledge distillation (Hinton et al., 2015) became a key technique for compressing large models into smaller, efficient ones while retaining performance. More recently, reinforcement learning from human feedback (RLHF) has been used to align models with human preferences, improving tasks like summarization and dialogue.


Despite these advances, models still struggle with multi-step reasoning, especially in mathematical and logical domains. Errors tend to propagate along reasoning paths, and sparse reward signals limit fine-grained correction. Self-distillation, which leverages the model’s own outputs as feedback, offers a promising solution by enabling dense, token-level supervision. Prior work has explored various feedback forms, such as code execution traces, reference solutions, and user feedback, but often treats feedback as a fixed, global signal.


The gap remains in designing feedback that effectively localizes errors and guides the model toward correct reasoning paths. This paper addresses this by proposing a structure-aware, step-by-step critique mechanism that aligns feedback content with the model’s reasoning trace, inspired by process reward models (PRMs). This approach aims to improve the efficiency and effectiveness of self-distillation, especially in complex reasoning tasks.

Core Problem

Current self-distillation methods often rely on coarse feedback signals, such as reference solutions or scalar rewards, which diffuse the learning signal across the entire reasoning trace. This diffusion dilutes the impact of errors, making it difficult for models to localize and correct mistakes effectively. Consequently, models may overfit to superficial patterns or be misled by stylistic differences in reference solutions.


Furthermore, existing feedback mechanisms lack structural alignment with the model’s reasoning process, leading to inefficient credit assignment. When feedback does not mirror the reasoning path, the model receives ambiguous signals that hinder precise error correction. This problem is exacerbated in multi-step reasoning tasks, where errors at specific steps can cascade, reducing overall accuracy.


Addressing these issues requires designing feedback that is both dense and structurally aligned with the reasoning path, enabling targeted error correction and stable learning. The challenge lies in creating such feedback without incurring prohibitive computational costs or requiring extensive annotations.

Innovation

The paper’s main innovation is the introduction of a structure-aligned, step-by-step critique mechanism that localizes feedback to the reasoning path. Unlike traditional methods that rely on global reference solutions or scalar rewards, this approach copies correct steps verbatim and corrects errors selectively, ensuring feedback is directly aligned with the model’s reasoning trace.


This mechanism leverages in-context copying behaviors, similar to induction-head mechanisms in transformers, to reinforce correct prefixes and isolate errors. By doing so, it effectively mimics process reward models (PRMs) without needing to train separate reward networks, reducing complexity and cost.
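The copy-and-correct behavior can be sketched as follows. In the paper the critic is a language model driven by a prompt template. Here a plain dictionary of corrections stands in for the critic's output, and all names and the toy steps are hypothetical.

```python
def build_step_aligned_feedback(solver_steps, corrections):
    """Copy each solver step verbatim unless the critic supplied a correction.

    corrections maps a step index to its corrected text (a stand-in for
    what a frozen critic model would produce).
    """
    return [corrections.get(i, step) for i, step in enumerate(solver_steps)]

def changed_steps(solver_steps, feedback_steps):
    """Indices where the feedback differs from the solver's own trace."""
    return [
        i for i, (s, f) in enumerate(zip(solver_steps, feedback_steps))
        if s != f
    ]

solver_trace = ["Let x = 3.", "Then 2x = 5.", "So 2x + 1 = 6."]
critic_corrections = {1: "Then 2x = 6.", 2: "So 2x + 1 = 7."}

feedback = build_step_aligned_feedback(solver_trace, critic_corrections)
error_locations = changed_steps(solver_trace, feedback)
```

Because the first step is copied verbatim, the feedback differs from the solver's trace only from the first error onward, which is where the learning signal is meant to concentrate.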


Another innovation is the detailed advantage analysis, which demonstrates how feedback structure influences the localization of learning signals at error points. The method also introduces a systematic comparison of three feedback conditions—GRPO, RefSol, and StepAlignFB—highlighting the importance of structural alignment for effective self-distillation.


This work bridges the gap between process supervision and dense token-level learning, providing a scalable, efficient framework for improving reasoning in large language models.

Methodology

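  • Solver-critic architecture: a trainable solver generates step-by-step reasoning, while a frozen critic provides feedback in different forms.
  • Three feedback conditions: GRPO (no feedback, binary reward), RefSol (a reference solution generated by a strong model), and StepAlignFB (a step-aligned critique that follows the solver's reasoning path).
  • Training objective: the model is trained with the self-distillation objective (Eq. 3) under the different feedback conditions to improve its reasoning path.
  • Step-alignment strategy: the critic copies correct steps and corrects erroneous ones, keeping the feedback consistent with the reasoning path and concentrating the signal at error locations.
  • Feedback generation: at each reasoning step, the critic uses the solver's output and the reference information to copy the correct parts and fix the incorrect parts, so that the feedback stays aligned with the reasoning path.
  • Training details: Qwen3-1.7B as the model, temperature 1.1, a maximum of 2048 tokens, G=1 (self-distillation) or G=8 (GRPO), and forward KL as the divergence measure.
  • Evaluation metrics: Pass@12, Maj@12, and Avg@12, used to compare performance gains across feedback conditions.
  • Key techniques: vLLM for efficient inference, and a critic prompt template designed to ensure the quality of step-aligned feedback.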
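The training details listed above can be gathered into a small configuration sketch. The values (Qwen3-1.7B, temperature 1.1, 2048 tokens, G=1 or 8, forward KL) come from this summary. The class, field names, and helper are hypothetical and do not reflect the authors' code.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class RunConfig:
    condition: str              # "GRPO", "RefSol", or "StepAlignFB"
    group_size: int             # G: samples drawn per question
    divergence: Optional[str]   # only used by the self-distillation runs
    solver_model: str = "Qwen3-1.7B"
    temperature: float = 1.1
    max_tokens: int = 2048

def make_config(condition: str) -> RunConfig:
    """Build the settings for one feedback condition."""
    if condition == "GRPO":
        # Binary-reward baseline: group of 8 samples, no distillation divergence.
        return RunConfig(condition, group_size=8, divergence=None)
    if condition in ("RefSol", "StepAlignFB"):
        # Self-distillation runs: one sample, forward KL to the self-teacher.
        return RunConfig(condition, group_size=1, divergence="forward_kl")
    raise ValueError(f"unknown feedback condition: {condition}")

configs = [make_config(c) for c in ("GRPO", "RefSol", "StepAlignFB")]
```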

Experiments

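The experiments use the OpenMathReasoning dataset, filtered to problems with pass_rate_72b_tir > 0.85, which the summary describes as relatively difficult and which ensures that the critic model can provide valid feedback. Training samples G=1 or G=8 responses per question under the three feedback conditions (GRPO, RefSol, StepAlignFB), runs for 7 epochs, and saves a checkpoint every 10 steps. The evaluation metrics are Pass@12, Maj@12, Avg@12, and average answer length, compared across methods at different stages of training. Repeated runs are used to verify the advantage of step alignment and to analyze how feedback structure affects the model's reasoning path and performance. An ablation study also examines how verbatim copying, partial copying, and full copying affect the learning signal.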
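For readers unfamiliar with the three metrics, the sketch below uses their common definitions over k sampled answers per problem: Pass@k is 1 if any sample is correct, Maj@k is 1 if the most frequent answer is correct, and Avg@k is the fraction of correct samples. The paper's exact definitions are not given in this summary, so treat these as assumptions. The sample answers are made up.

```python
from collections import Counter

def pass_at_k(answers, gold):
    """1.0 if at least one of the k sampled answers is correct."""
    return float(any(a == gold for a in answers))

def maj_at_k(answers, gold):
    """1.0 if the most frequent answer is correct (ties go to the first seen)."""
    top_answer, _ = Counter(answers).most_common(1)[0]
    return float(top_answer == gold)

def avg_at_k(answers, gold):
    """Fraction of the k sampled answers that are correct."""
    return sum(a == gold for a in answers) / len(answers)

# Two hypothetical problems with k = 12 samples each, gold answer "42".
concentrated = ["42"] * 5 + ["41"] * 4 + ["7"] * 3
outvoted = ["41"] * 7 + ["42"] * 5

scores = {
    name: (pass_at_k(s, "42"), maj_at_k(s, "42"), avg_at_k(s, "42"))
    for name, s in [("concentrated", concentrated), ("outvoted", outvoted)]
}
```

Both toy problems have the same Pass@12 and Avg@12, but only the first passes the majority vote, which is why Maj@12 is read as a measure of how sharply probability concentrates on the correct answer.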

Results

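Step-aligned feedback outperforms the other two conditions on all metrics, with clear gains on Pass@12 and Maj@12. On Avg@12 it improves over GRPO by 16.11 points and over reference-solution-conditioned self-distillation by 5.27 points. During training, the step-alignment strategy locates errors effectively, reinforces correct reasoning paths, and reduces error propagation. The experiments also show that conditioning on the reference solution pushes the model to adjust its behavior at every token, which hurts performance. By localizing the signal, step alignment markedly improves reasoning accuracy and path stability. These results support the key role of structural alignment in self-distillation.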

Applications

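The method suits mathematical, logical, and inference tasks that demand high-precision reasoning, especially where annotations are limited or reward signals are expensive to obtain. Structured feedback can improve a model's reasoning ability without an additional reward model, which lowers training cost. In the future it could be combined with multi-modal information and extended to visual reasoning and cross-modal understanding. The mechanism may also be useful in education, automated question answering, and scientific computing, where it can help models handle complex reasoning processes.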

Limitations & Outlook

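On highly complex or multi-step reasoning tasks, the method can still be limited by the quality of the critic's feedback, and misleading critiques may harm training. The step-alignment strategy depends on a high-quality critic model. If the critic is weak, the signal may become sparse or biased and degrade final performance. The experiments so far focus on mathematical reasoning, so generalization to other natural language understanding or generation tasks remains to be validated.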

Plain Language Accessible to non-experts

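Imagine you are learning to cook a complicated dish. You have a recipe (the model's reasoning path), but you cannot always follow it perfectly. Sometimes you ask an expert chef (the critic) to point out what you did wrong or tell you how to change the next step. Ideally, the chef goes through the steps one by one and tells you where each went wrong, instead of only judging the final result. That way you can focus on fixing the faulty parts and keep doing what already works. This is like the step-by-step critique in the paper: it helps the model find errors in its reasoning and fix only those parts, without rejecting the whole attempt. Learning becomes more precise and more efficient, just as you improve a dish step by step in the kitchen.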

ELI14 Explained like you're 14

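Hey, you know how hard it is to learn a complicated recipe? Sometimes the dish turns out badly not because you can't cook, but because one step went wrong. Now imagine you have a friend who is an amazing chef. They tell you, step by step, which parts you got right and which you got wrong, and they help you fix the mistakes. You would learn to cook well much faster. The method in the paper works like that chef friend: it follows along with each step and speaks up only where you slipped, without repeating how to make the entire dish every time. So you learn faster and end up with a better dish!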

Glossary

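Feedback Alignment

A way of designing feedback so that the model's learning signal lines up with its reasoning path, which improves learning efficiency. Technically, it is realized through a structured step-by-step critique that ensures the feedback corresponds to the locations of reasoning errors.

The paper emphasizes the key role of structural alignment in self-distillation.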

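Self-Distillation

A training method in which the model acts as both student and teacher and is optimized on feedback it generates itself, without an external teacher. The core idea is to use the model's own outputs as the learning signal.

The paper uses self-distillation to improve reasoning ability.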

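Process Reward Model (PRM)

A model that scores individual reasoning steps instead of only the final answer, which reinforces correct reasoning paths and limits error propagation. In this study a comparable step-level signal is obtained from natural-language feedback, without training a reward network.

Used as the reference point when analyzing how feedback structure affects the model's reasoning path.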

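Step-Aligned Critique

The critic gives feedback step by step along the model's reasoning path, copying correct steps and correcting erroneous ones, so that the feedback has the same structure as the reasoning path.

The paper's core innovation.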

OpenMathReasoning Dataset

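A dataset for mathematical reasoning tasks that contains relatively difficult problems, suitable for validating model reasoning and the effect of self-distillation.

The dataset used in the experiments.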

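KL Divergence (Kullback-Leibler divergence)

A measure of the difference between two probability distributions, widely used as a divergence measure in model training. In this paper it is used to optimize the self-distillation objective.

Serves as the divergence measure in the training objective.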
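For reference, the standard definition of the divergence is given below, followed by one plausible form of a forward-KL self-distillation loss. The second line is an assumed sketch, not the paper's Eq. 3: the symbols q (question), c (feedback context), y (sampled response), and the placement of the self-teacher as the first argument are notation introduced here for illustration.

```latex
D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}

\mathcal{L}(\theta) = \sum_{t} D_{\mathrm{KL}}\!\left( \pi(\cdot \mid q, c, y_{<t}) \,\middle\|\, \pi_{\theta}(\cdot \mid q, y_{<t}) \right)
```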

vLLM

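An efficient inference and serving framework for large language models, supporting fast inference and feedback generation at scale.

Used in the inference pipeline for the step-by-step critique.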

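Induction-Head Copying

A mechanism in Transformers that uses earlier context to copy patterns. It supports partial verbatim copying within the feedback and strengthens the localization of the signal along the reasoning path.

Explains the copying behavior in step alignment.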

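On-Policy Training

Training in which the model is optimized on data it generates itself, which avoids drifting away from the target distribution. The paper uses this strategy for self-distillation.

The core strategy of the training pipeline.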

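Group Normalization (of rewards)

A normalization technique that stabilizes reward estimates during training by keeping the reward distribution consistent across samples. Here the term refers to normalizing rewards within a group of sampled responses, as in GRPO, and not to the neural-network layer of the same name.

Used for reward normalization.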
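A minimal sketch of group-wise reward normalization in the style of GRPO is shown below. It standardizes the binary rewards of G responses to the same question using the group mean and standard deviation. The use of the population standard deviation, the epsilon, and the toy rewards are assumptions, not details taken from the paper.

```python
import statistics

def group_normalized_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of sampled responses."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    return [(r - mean) / (std + eps) for r in rewards]

# G = 8 responses to one question; reward is 1 if the final answer is correct.
rewards = [1, 0, 0, 1, 0, 0, 0, 1]
advantages = group_normalized_advantages(rewards)

# If every response gets the same reward, the group carries no signal.
no_signal = group_normalized_advantages([1, 1, 1, 1])
```

Every token of a response shares the same advantage under this scheme, which is the sparse, response-level signal that the step-aligned critique is contrasted with.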

Open Questions Unanswered questions from this research

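  • Step-aligned feedback performs very well on mathematical reasoning, but its effect on other tasks such as natural language understanding and generation has not been fully validated. Designing structured feedback mechanisms that work across tasks and improve generalization is an important direction for future research.
  • The quality of the critic model directly affects the feedback. Current critics are mostly pretrained models, and how to improve their accuracy and robustness remains open. The diversity and consistency of feedback are especially challenging in complex, multi-modal settings.
  • The feedback mechanism is computationally expensive. For large models in particular, generating and processing feedback step by step requires substantial compute. Optimizing feedback generation to lower cost and raise efficiency is a key problem going forward.
  • The experiments focus mainly on mathematical reasoning. Adaptability and effectiveness in other areas, such as logical reasoning and commonsense inference, have not been studied in depth, so the scope of validation needs to be widened to establish broad applicability.
  • The model's ability to locate errors in multi-step reasoning is still limited. Future work could combine reinforcement learning or imitation learning to further improve the precision of error identification and correction.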

Applications

Immediate Applications

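Optimizing mathematical reasoning systems

Step-aligned feedback can raise the accuracy of math problem-solving models, which suits education and automated problem solving in particular, and reduces reliance on expensive reward models.

Automated question answering and reasoning enhancement

Introducing structured step-by-step feedback into question answering systems can improve how models understand and reason about complex questions, making answers more accurate and more logical.

Model fine-tuning and knowledge transfer

Combining self-distillation with step-by-step feedback enables efficient fine-tuning on specific tasks, reduces annotation cost, and deepens the model's reasoning.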

Long-term Vision

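Cross-modal reasoning and understanding

Combining visual, speech, and other modalities to design multi-source structured feedback mechanisms, advancing the reasoning ability of large multi-modal models.

Autonomous learning and self-optimization

Enabling models to keep improving through internal feedback without external annotations, moving toward truly autonomous intelligent systems.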

Abstract

Conditioning a language model on additional context, such as feedback on a previous attempt, typically improves its response. Self-distillation trains the model to retain this improvement when the context is not present. The method works by matching the model's output distribution under two settings: a student that sees only the question, and a self-teacher that also sees the context. What the model learns therefore depends on what context the self-teacher receives, yet the design of this context remains largely unexplored. We study context design for self-distillation by training a solver on feedback from a frozen critic. We compare three conditions: (i) a binary reward (GRPO), (ii) the reference solution, and (iii) a step-by-step critique aligned to the solver's reasoning trace. Step-aligned critique yields the largest gains, outperforming GRPO by 16.11 points and reference-solution-conditioned self-distillation by 5.27 points (Avg@12). Per-token advantage analysis reveals why: step-aligned feedback targets only the tokens where reasoning fails, leaving correct behavior intact. Conditioning on the reference solution, by contrast, pressures the model to change its behavior at every token (even correct steps) because an alternative derivation inevitably differs in phrasing and approach. This suggests that structural alignment between feedback and the solver's reasoning is a key driver of self-distillation effectiveness.

cs.AI cs.LG
