Dense Supervision, Sparse Updates: On the Sparsity and Geometry of On-Policy Distillation
This paper analyzes the sparsity and geometric structure of on-policy distillation (OPD), revealing small, coordinate-sparse updates that are spectrally concentrated and deviate from source principal directions.
Key Findings
Methodology
This study analyzes parameter space directly: it computes the difference between source and fine-tuned weights and characterizes that difference with the Frobenius norm, coordinate sparsity, singular value decomposition (SVD), and source-space projections. The analysis covers several large language and vision-language models, including Qwen, DeepScaleR, and MiniCPM, across multiple distillation strategies and two optimizers (AdamW and SGD). The methodology involves:
- Quantifying overall update magnitude via the relative Frobenius norm;
- Detecting coordinate sparsity by thresholding per-coordinate changes;
- Measuring spectral concentration through the energy share of the top singular values;
- Assessing alignment with the source model's principal singular directions via source-space projections.
Together, these metrics characterize the size, sparsity, spectrum, and geometry of OPD updates.
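The four measurements listed above can be sketched for a single weight matrix as follows. This is a minimal NumPy illustration, not the paper's implementation: the function name `update_metrics`, the use of squared singular values as "energy", the projection onto the top-k source singular subspaces, and the synthetic data are all assumptions. Only the 10^-5 threshold and k = 16 come from the figures quoted in this summary.

```python
import numpy as np

def update_metrics(w_src, w_tuned, tol=1e-5, k=16):
    """Summarize the update delta = w_tuned - w_src for one 2-D weight matrix."""
    delta = w_tuned - w_src

    # Relative update size: ||delta||_F / ||w_src||_F (Frobenius is the 2-D default).
    rel_norm = float(np.linalg.norm(delta) / np.linalg.norm(w_src))

    # Coordinate sparsity: fraction of entries whose absolute change is below tol.
    sparsity = float(np.mean(np.abs(delta) < tol))

    # Spectral concentration (assumed definition): share of squared singular
    # values of delta carried by the top k.
    s = np.linalg.svd(delta, compute_uv=False)
    topk_energy = float(np.sum(s[:k] ** 2) / np.sum(s ** 2))

    # Numerical rank as a fraction of the maximum possible rank.
    rank_frac = float(np.linalg.matrix_rank(delta) / min(delta.shape))

    # Source-space projection (assumed definition): share of delta's energy
    # inside the top-k left and right singular subspaces of the source weights.
    u, _, vt = np.linalg.svd(w_src, full_matrices=False)
    proj = u[:, :k].T @ delta @ vt[:k, :].T
    principal_share = float(np.linalg.norm(proj) ** 2 / np.linalg.norm(delta) ** 2)

    return {
        "rel_norm": rel_norm,
        "sparsity": sparsity,
        "topk_energy": topk_energy,
        "rank_frac": rank_frac,
        "principal_share": principal_share,
    }

# Synthetic example: a small update touching roughly 20% of the coordinates.
rng = np.random.default_rng(0)
w_src = rng.normal(size=(64, 48))
delta = np.zeros_like(w_src)
touched = rng.random(w_src.shape) < 0.2
delta[touched] = 1e-3 * rng.normal(size=int(touched.sum()))
metrics = update_metrics(w_src, w_src + delta)
print(metrics)
```

In a real analysis the function would be applied per weight matrix and the results aggregated across layers; how the paper aggregates is not stated in this summary.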
Key Results
- OPD-style parameter updates are extremely small in relative norm (e.g., only 0.045% for Qwen3-1.7B) and highly coordinate-sparse: 66.72% to 89.50% of parameters change by less than 10^-5. The update matrices are numerically full-rank (median rank near 100%), yet their spectral energy is concentrated, with the top 16 singular values accounting for roughly 27%. The updates lie mostly away from the source model's dominant singular directions and fall disproportionately on coordinates where the source weights are near zero. Overlap analysis shows that OPD subnetwork masks intersect substantially with RLVR and teacher-varied masks (e.g., 73.53% overlap in Qwen2.5-VL), suggesting that OPD preserves the geometric signatures of on-policy post-training. Training only the discovered subnetwork nearly matches full OPD performance, and AdamW outperforms SGD in reasoning accuracy, which indicates that adaptive scaling still matters under dense teacher supervision.
- Although the parameter updates are numerically full-rank, their spectral and geometric structure is that of a sparse, off-principal modification rather than a low-rank one. The layer-wise and module-wise analysis shows that FFN modules carry most of the update energy (65% to 86%), with attention modules contributing notably in some models. The subnetwork masks overlap substantially with RLVR masks, indicating sparse structure shared across different post-training methods. These findings challenge the conventional low-rank assumption about fine-tuning updates.
- The sparse subnetwork identified by OPD is sufficient for effective training: restricting updates to this subnetwork yields nearly the same reasoning accuracy as full OPD. The optimizer comparison indicates that AdamW's adaptive scaling remains beneficial even in sparse, on-policy settings, contrary to some prior suggestions that adaptive optimizers are unnecessary when updates are sparse.
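The mask-overlap and subnetwork-training results above rest on two simple operations, sketched here in NumPy. The names `update_mask`, `mask_overlap`, and `masked_step` are hypothetical, the overlap normalization (share of one mask covered by another) is an assumption, and the plain gradient step is used only for brevity. Since the source reports that AdamW outperformed SGD, an actual run would more plausibly apply the same mask inside an adaptive optimizer.

```python
import numpy as np

def update_mask(w_src, w_tuned, tol=1e-5):
    """Boolean mask of coordinates whose change is at least tol."""
    return np.abs(w_tuned - w_src) >= tol

def mask_overlap(mask_a, mask_b):
    """Fraction of the coordinates in mask_a that are also in mask_b."""
    return float(np.logical_and(mask_a, mask_b).sum() / mask_a.sum())

def masked_step(w, grad, mask, lr=0.1):
    """One gradient step with the gradient zeroed outside the subnetwork."""
    return w - lr * grad * mask

# Toy data: a fixed subnetwork of 12 coordinates in an 8x8 matrix.
rng = np.random.default_rng(0)
w_src = rng.normal(size=(8, 8))
mask = np.zeros((8, 8), dtype=bool)
mask[::2, ::3] = True

# A second, larger mask (32 coordinates) that contains the first.
mask_b = np.zeros((8, 8), dtype=bool)
mask_b[::2, :] = True

# Coordinates outside the mask are left exactly unchanged by the step.
w_new = masked_step(w_src, np.ones((8, 8)), mask)
recovered = update_mask(w_src, w_new)

print(mask_overlap(mask, mask_b))   # share of mask covered by mask_b
print(mask_overlap(mask_b, mask))   # share of mask_b covered by mask
```

Note that this overlap measure is asymmetric, so which mask serves as the denominator has to be stated when reporting a figure.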
Significance
This research characterizes the geometric and spectral properties of parameter updates induced by on-policy distillation, challenging the view of fine-tuning as either a low-rank or a dense process. By showing that OPD induces sparse, spectrally concentrated, off-principal updates, it clarifies how knowledge transfer and adaptation take place in large-scale models. The findings are relevant to the design of parameter-efficient fine-tuning methods, to model compression, and to the study of the internal dynamics of large language models. The work connects empirical observations of sparsity with an account of update geometry, offering a basis for future research into structured model updates and efficient training.
Technical Contribution
This paper combines norm-based, sparsity, spectral, and geometric analyses into a multi-metric framework for characterizing OPD parameter updates. It shows that, despite full numerical rank, the updates are spectrally concentrated and biased away from source principal directions, mainly affecting coordinates where the source weights are near zero. The study also demonstrates that training only the sparse subnetwork nearly recovers full OPD performance, which makes the identified structure operationally relevant. In addition, the comparison between AdamW and SGD suggests that dense teacher supervision preserves heterogeneous coordinate-wise gradient scales, so AdamW's adaptive scaling remains useful. These contributions deepen the understanding of on-policy model updates and suggest new directions for parameter-efficient fine-tuning.
Novelty
This work presents a systematic analysis of the geometric and spectral properties of on-policy distillation (OPD) updates, showing that they are small, sparse, spectrally concentrated, and biased away from source principal directions. In contrast to the common low-rank assumption, the findings show that OPD updates are full-rank but structured, with substantial off-principal movement. Combining several metrics (norm, sparsity, spectral concentration, and source-space alignment) gives a joint view of the parameter dynamics and distinguishes OPD from ordinary dense parameter rewriting.
Limitations
- The analysis primarily focuses on language and vision-language models, and its generalizability to other architectures such as generative adversarial networks or reinforcement learning agents remains to be validated. Future work should explore broader model types.
- While the spectral and geometric metrics provide deep insights, the dynamic evolution of these properties during training phases is less explored, limiting understanding of how updates develop over time.
- The comparison of optimizers is limited to AdamW and SGD; other adaptive optimizers like LAMB or AdaGrad could exhibit different behaviors, and their effects on the spectral and sparsity structures warrant further investigation.
Future Work
Future research will focus on developing parameter-efficient fine-tuning algorithms that leverage the identified sparse and structured updates, enabling scalable adaptation of large models. Exploring the relationship between spectral concentration and model robustness or generalization could lead to new regularization techniques. Extending the analysis to diverse model architectures and tasks, including reinforcement learning and generative models, will test the universality of these geometric signatures. Additionally, dynamic analysis during training could reveal how these structures evolve, informing better optimization strategies and theoretical models of large-scale neural network adaptation.
AI Executive Summary
In the rapidly evolving landscape of large-scale pre-trained models, fine-tuning remains a critical step for adapting models to specific tasks. Traditional approaches often assume dense, full-parameter updates, but recent empirical evidence suggests that parameter changes are often sparse and structured. This paper delves into the nature of parameter updates induced by on-policy distillation (OPD), a method that combines on-policy data sampling with dense teacher supervision, to understand its underlying geometric and spectral properties.
OPD has gained prominence as a post-training technique that transfers knowledge from a teacher while training on the student's own generations. Unlike offline distillation, which relies on fixed datasets, or reinforcement learning with sparse rewards, OPD uses dense token-level feedback from the teacher on student-generated trajectories. This hybrid raises the question of how such dense supervision shapes parameter dynamics, particularly sparsity, geometric structure, and spectral concentration.
The core methodology involves analyzing parameter deltas between source and fine-tuned models across multiple large models, including Qwen and DeepScaleR. Metrics such as the relative Frobenius norm, coordinate sparsity, top singular value energy, and source-space alignment are employed. The findings reveal that OPD updates are extremely small in magnitude, highly sparse in coordinates, and spectrally concentrated in the top few singular values. Despite being numerically full-rank, these updates deviate significantly from the source model’s principal singular directions, mainly affecting coordinates where source weights are near zero. These structural insights challenge the conventional low-rank assumption of model fine-tuning.
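One way to check the claim that updates fall on coordinates where source weights are near zero is to compare source-weight magnitudes on changed versus unchanged coordinates. The sketch below is an assumed diagnostic, not the paper's procedure. The function name, the use of medians, and the synthetic update (which deliberately perturbs the smallest-magnitude 10% of coordinates) are illustrative choices.

```python
import numpy as np

def source_magnitude_by_change(w_src, w_tuned, tol=1e-5):
    """Median |source weight| on coordinates that changed vs. those that did not."""
    changed = np.abs(w_tuned - w_src) >= tol
    src_abs = np.abs(w_src)
    return {
        "median_abs_src_changed": float(np.median(src_abs[changed])),
        "median_abs_src_unchanged": float(np.median(src_abs[~changed])),
    }

# Synthetic example: perturb only the 10% of coordinates closest to zero.
rng = np.random.default_rng(0)
w_src = rng.normal(size=(32, 32))
cutoff = np.quantile(np.abs(w_src), 0.1)
small = np.abs(w_src) <= cutoff
w_tuned = w_src + 1e-3 * small

stats = source_magnitude_by_change(w_src, w_tuned)
print(stats)
```

A markedly lower median on the changed coordinates would be consistent with the reported pattern. On real checkpoints the comparison would need to be run per matrix, since weight scales differ across layers.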
Experimental validation shows that training only the sparse subnetwork identified by OPD can nearly match the full model’s reasoning performance, highlighting the operational relevance of the identified structures. Comparisons between AdamW and SGD optimizers indicate that adaptive scaling remains beneficial in OPD, contrary to some prior beliefs that sparse updates render such optimizers unnecessary. Layer-wise and module-wise analyses further demonstrate that FFN modules dominate the update energy, with attention mechanisms contributing notably in some models.
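The module-wise breakdown mentioned above can be approximated by grouping per-parameter update energy (squared Frobenius norm) by module type. In this sketch the parameter names, the substring patterns `mlp` and `self_attn`, and the function `module_energy_share` are hypothetical; real checkpoints name their parameters differently, and the paper's exact grouping is not given in this summary.

```python
import numpy as np

def module_energy_share(deltas, groups):
    """Share of total squared update norm per module group.

    deltas: dict mapping parameter name -> update array.
    groups: dict mapping group label -> tuple of substrings to match in names.
    """
    energy = {label: 0.0 for label in groups}
    energy["other"] = 0.0
    for name, delta in deltas.items():
        e = float(np.sum(delta ** 2))
        for label, patterns in groups.items():
            if any(p in name for p in patterns):
                energy[label] += e
                break
        else:
            # No group matched this parameter name.
            energy["other"] += e
    total = sum(energy.values())
    return {label: e / total for label, e in energy.items()}

# Toy updates with hypothetical parameter names.
deltas = {
    "layers.0.mlp.up_proj": np.full((4, 4), 2.0),       # energy 64
    "layers.0.self_attn.q_proj": np.full((4, 4), 1.0),  # energy 16
    "embed_tokens": np.zeros((4, 4)),                   # energy 0
}
groups = {"ffn": ("mlp",), "attention": ("self_attn",)}
shares = module_energy_share(deltas, groups)
print(shares)
```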
These results bear on the design of parameter-efficient training methods, on model compression, and on the study of the internal geometry of large models. They suggest that dense teacher supervision does not lead to dense parameter rewriting; instead, OPD preserves geometric signatures that can be exploited for efficient adaptation. The study opens avenues for future research into structured model updates, spectral regularization, and scalable fine-tuning strategies.
Deep Dive
Abstract
On-policy distillation (OPD) has recently become a prominent post-training recipe as it combines two desirable ingredients: on-policy student trajectories and dense teacher supervision, yet how this hybrid changes a model's parameters remains unclear. Across several language and vision-language model pairs and use cases, our analysis yields two main findings. On sparsity, OPD-style updates are small and coordinate-sparse. They are distributed across layers and are usually FFN-heavy. This sparse structure is operationally useful: training only the discovered subnetwork recovers nearly the same performance as full OPD. However, the sparsity-inducing SGD optimizer underperforms AdamW in our optimizer ablation, likely because dense teacher supervision preserves heterogeneous coordinate-wise gradient scales where AdamW's adaptive scaling remains useful. On geometry, the updates are numerically full-rank but spectrally concentrated; they lie mostly away from the principal singular subspaces of the source weights and fall disproportionately on coordinates where the source weights are close to zero. These findings suggest that dense teacher supervision does not turn OPD into ordinary dense parameter rewriting; instead, OPD retains important geometric signatures of on-policy post-training.