Dense Supervision, Sparse Updates: On the Sparsity and Geometry of On-Policy Distillation

TL;DR

This paper analyzes the sparsity and geometric structure of on-policy distillation (OPD), revealing small, coordinate-sparse updates that are spectrally concentrated and deviate from source principal directions.

cs.LG 2026-06-12
Guo Yu, Wenlin Liu, Yulan Hu, Hao-Xuan Ma, Jun-Peng Jiang, Han-Jia Ye
model distillation parameter sparsity geometric structure deep learning optimization model fine-tuning

Key Findings

Methodology

The study analyzes parameter space directly: it computes the difference between each source model and its fine-tuned counterpart, then characterizes that difference with the Frobenius norm, coordinate sparsity, singular value decomposition (SVD), and source-space projections. The analysis covers several large language and vision-language models, including Qwen, DeepScaleR, and MiniCPM, across various distillation strategies and two optimizers (AdamW and SGD). The methodology involves:

  • Quantifying overall update magnitude via the Frobenius norm;
  • Detecting coordinate sparsity by thresholding the per-coordinate change;
  • Analyzing spectral concentration through the energy in the top singular values;
  • Assessing alignment with the source principal components via source-space projections.

Together, these metrics characterize the geometric and spectral properties of OPD updates.
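The first two metrics can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' code: the function names are hypothetical, the norm is assumed to be taken relative to the source weights, and the default tolerance of 1e-5 is the threshold quoted in the Key Results.

```python
import numpy as np

def relative_update_norm(w_src, w_tuned):
    """Return ||W_tuned - W_src||_F / ||W_src||_F for one weight matrix."""
    delta = w_tuned - w_src
    # np.linalg.norm on a 2-D array defaults to the Frobenius norm.
    return float(np.linalg.norm(delta) / np.linalg.norm(w_src))

def coordinate_sparsity(w_src, w_tuned, tol=1e-5):
    """Return the fraction of coordinates whose absolute change is below tol."""
    return float(np.mean(np.abs(w_tuned - w_src) < tol))

# Toy example (hypothetical values): only one of four coordinates moves.
w_src = np.array([[1.0, 2.0], [3.0, 4.0]])
w_tuned = w_src + np.array([[0.0, 1e-3], [0.0, 0.0]])

rel_norm = relative_update_norm(w_src, w_tuned)
sparsity = coordinate_sparsity(w_src, w_tuned)
print(f"relative norm: {rel_norm:.2e}, sparsity: {sparsity:.2%}")
```

In practice these functions would be applied per weight matrix and then aggregated across the model; how the paper aggregates is not stated here.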

Key Results

  • Magnitude and sparsity: OPD-style parameter updates are extremely small in relative norm (for example, only 0.045% for Qwen3-1.7B) and highly coordinate-sparse, with 66.72% to 89.50% of parameters changing by less than 10^-5.
  • Spectrum: the update matrices are numerically full-rank (median rank near 100%), yet their spectral energy is concentrated in the leading singular values, with the top 16 accounting for roughly 27%. The updates are therefore full-rank but structured, which challenges the conventional low-rank assumption.
  • Geometry: the updates lie mostly away from the source model's dominant singular directions and fall mainly on coordinates where the source weights are near zero.
  • Layer-wise and module-wise distribution: FFN modules dominate the update energy (65-86%), with attention modules contributing notably in some models.
  • Mask overlap: OPD subnetwork masks intersect substantially with RLVR and teacher-varied masks (for example, 73.53% overlap in Qwen2.5-VL), indicating sparse structure shared across post-training methods and preserved geometric signatures.
  • Subnetwork training: restricting training to the discovered sparse subnetwork yields nearly the same reasoning accuracy as full OPD.
  • Optimizer comparison: AdamW outperforms SGD in reasoning accuracy, so adaptive scaling remains beneficial even in sparse, on-policy settings, contrary to some prior beliefs that adaptive optimizers are unnecessary for sparse updates.
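The mask-overlap figures above can be made concrete with a small sketch. The exact overlap definition used in the paper is not given in this summary, so the code assumes one plausible choice: the fraction of coordinates active in the first mask that are also active in the second. The function names and toy masks are hypothetical.

```python
import numpy as np

def update_mask(w_src, w_tuned, tol=1e-5):
    """Boolean mask of coordinates that changed by at least tol."""
    return np.abs(w_tuned - w_src) >= tol

def mask_overlap(mask_a, mask_b):
    """Fraction of coordinates active in mask_a that are also active in mask_b."""
    active = int(mask_a.sum())
    if active == 0:
        return 0.0
    return float(np.logical_and(mask_a, mask_b).sum() / active)

# Hypothetical masks standing in for an OPD run and an RLVR run.
opd_mask = np.array([True, True, True, False, False, False])
rlvr_mask = np.array([True, True, False, True, False, False])

overlap = mask_overlap(opd_mask, rlvr_mask)
print(f"overlap: {overlap:.2%}")
```

This measure is asymmetric; a symmetric alternative such as intersection over union would give different numbers for the same masks.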

Significance

This research characterizes the geometric and spectral structure of the parameter updates induced by on-policy distillation, and it challenges the view of fine-tuning as either a low-rank or a dense process. By showing that OPD induces sparse, spectrally concentrated, off-principal updates, it clarifies how knowledge transfer and adaptation take place in large models. The findings bear on the design of parameter-efficient fine-tuning methods, on model compression, and on the study of the internal dynamics of large language models. The work connects empirical observations of sparsity with an account of model geometry, giving a starting point for research on structured updates and efficient training.

Technical Contribution

This paper introduces a multi-metric framework that combines norm-based, sparsity, spectral, and geometric analyses to characterize OPD parameter updates. It shows that, despite full numerical rank, the updates are spectrally concentrated and biased away from the source principal directions, mainly affecting coordinates where the source weights are near zero. The study also demonstrates that training only the sparse subnetwork nearly recovers full-model performance, which makes the identified structure operationally relevant. Finally, the comparison between AdamW and SGD suggests that adaptive scaling remains useful because dense teacher supervision preserves heterogeneous coordinate-wise gradient scales. These contributions deepen the understanding of on-policy model updates and suggest new directions for parameter-efficient fine-tuning.
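One way to quantify "biased away from source principal directions" is to measure how much of an update's energy falls inside the top-r singular subspaces of the source weight. The sketch below assumes that definition; the paper's exact projection is not specified in this summary, and the function name and toy matrices are hypothetical.

```python
import numpy as np

def principal_energy_fraction(w_src, delta, r):
    """Share of ||delta||_F^2 captured by the top-r left and right
    singular subspaces of the source weight matrix w_src."""
    u, _, vt = np.linalg.svd(w_src, full_matrices=False)
    u_r = u[:, :r]        # top-r left singular vectors
    v_r = vt[:r, :].T     # top-r right singular vectors
    core = u_r.T @ delta @ v_r  # coordinates of delta inside those subspaces
    return float(np.sum(core ** 2) / np.sum(delta ** 2))

# Toy source weight whose dominant direction is the first axis.
w_src = np.diag([10.0, 1.0, 0.1])

aligned = np.zeros((3, 3))
aligned[0, 0] = 1.0          # update along the dominant direction

off_principal = np.zeros((3, 3))
off_principal[2, 2] = 1.0    # update along the weakest direction

frac_aligned = principal_energy_fraction(w_src, aligned, r=1)
frac_off = principal_energy_fraction(w_src, off_principal, r=1)
print(frac_aligned, frac_off)
```

A fraction near zero for small r indicates an off-principal update; a useful baseline is the fraction a random update of the same shape would obtain.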

Novelty

This work is the first comprehensive analysis of the geometric and spectral properties of on-policy distillation (OPD) updates, revealing that they are small, sparse, spectrally concentrated, and biased away from source principal directions. In contrast to the common low-rank assumption, the findings show that OPD updates are full-rank but structured, with significant off-principal movement. Combining multiple metrics (norm, sparsity, spectral concentration, and source-space alignment) gives a fuller view of the parameter dynamics and distinguishes OPD from traditional dense fine-tuning and offline distillation. The analysis highlights the role of geometric signatures in large-scale model training.

Limitations

  • The analysis primarily focuses on language and vision-language models, and its generalizability to other architectures such as generative adversarial networks or reinforcement learning agents remains to be validated. Future work should explore broader model types.
  • While the spectral and geometric metrics provide deep insights, the dynamic evolution of these properties during training phases is less explored, limiting understanding of how updates develop over time.
  • The comparison of optimizers is limited to AdamW and SGD; other adaptive optimizers like LAMB or AdaGrad could exhibit different behaviors, and their effects on the spectral and sparsity structures warrant further investigation.

Future Work

Future research will focus on developing parameter-efficient fine-tuning algorithms that leverage the identified sparse and structured updates, enabling scalable adaptation of large models. Exploring the relationship between spectral concentration and model robustness or generalization could lead to new regularization techniques. Extending the analysis to diverse model architectures and tasks, including reinforcement learning and generative models, will test the universality of these geometric signatures. Additionally, dynamic analysis during training could reveal how these structures evolve, informing better optimization strategies and theoretical models of large-scale neural network adaptation.

AI Executive Summary

Fine-tuning remains a critical step for adapting large pre-trained models to specific tasks. Traditional approaches often assume dense, full-parameter updates, but recent empirical evidence suggests that parameter changes are often sparse and structured. This paper examines the parameter updates induced by on-policy distillation (OPD), a method that combines on-policy data sampling with dense teacher supervision, to understand their geometric and spectral properties.

OPD has gained prominence as a post-training technique that transfers knowledge from a teacher while training on the student's own generations. Unlike offline distillation, which relies on fixed datasets, or reinforcement learning with sparse rewards, OPD uses dense token-level feedback from a teacher on student-generated trajectories. This hybrid raises the question of how such dense supervision shapes parameter dynamics, in particular sparsity, geometric structure, and spectral concentration.

The core methodology involves analyzing parameter deltas between source and fine-tuned models across multiple large models, including Qwen and DeepScaleR. Metrics such as the relative Frobenius norm, coordinate sparsity, top singular value energy, and source-space alignment are employed. The findings reveal that OPD updates are extremely small in magnitude, highly sparse in coordinates, and spectrally concentrated in the top few singular values. Despite being numerically full-rank, these updates deviate significantly from the source model’s principal singular directions, mainly affecting coordinates where source weights are near zero. These structural insights challenge the conventional low-rank assumption of model fine-tuning.
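The distinction between "numerically full-rank" and "spectrally concentrated" can be illustrated with a synthetic update. The sketch assumes spectral energy means squared singular values and uses k=16 as the default because that is the cutoff quoted in the summary; the function names and toy data are hypothetical.

```python
import numpy as np

def topk_energy_share(delta, k=16):
    """Share of squared singular values held by the top-k singular values."""
    s = np.linalg.svd(delta, compute_uv=False)  # returned in descending order
    energy = s ** 2
    return float(energy[:k].sum() / energy.sum())

def rank_ratio(delta):
    """Numerical rank divided by the maximum possible rank."""
    return float(np.linalg.matrix_rank(delta) / min(delta.shape))

# Synthetic update: one dominant rank-1 component plus small dense noise.
rng = np.random.default_rng(0)
spike = np.outer(rng.normal(size=64), rng.normal(size=48))
noise = 0.05 * rng.normal(size=(64, 48))
delta = spike + noise

print("rank ratio:", rank_ratio(delta))
print("top-1 energy share:", topk_energy_share(delta, k=1))
```

The noise makes every singular value nonzero, so the matrix is full-rank, while most of the energy still sits in a single direction. Rank alone therefore says little about how concentrated an update is.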

Experimental validation shows that training only the sparse subnetwork identified by OPD can nearly match the full model’s reasoning performance, highlighting the operational relevance of the identified structures. Comparisons between AdamW and SGD optimizers indicate that adaptive scaling remains beneficial in OPD, contrary to some prior beliefs that sparse updates render such optimizers unnecessary. Layer-wise and module-wise analyses further demonstrate that FFN modules dominate the update energy, with attention mechanisms contributing notably in some models.
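Subnetwork-only training can be sketched as a gradient step that is masked to the discovered coordinates. This is an assumption-laden toy, not the paper's training setup: it uses plain gradient descent on a quadratic loss for brevity, and the function name, mask, and target are hypothetical.

```python
import numpy as np

def masked_sgd_step(w, grad, mask, lr):
    """Apply a plain gradient step only on coordinates where mask is True."""
    return w - lr * grad * mask

w_init = np.array([0.0, 0.0, 0.0, 0.0])
target = np.array([1.0, -1.0, 2.0, -2.0])
mask = np.array([True, False, True, False])  # hypothetical discovered subnetwork

w = w_init.copy()
for _ in range(50):
    grad = w - target  # gradient of 0.5 * ||w - target||^2
    w = masked_sgd_step(w, grad, mask, lr=0.5)

print(w)
```

Coordinates outside the mask keep their initial values, while masked coordinates converge to the target. In a real framework the same effect is usually obtained by zeroing gradients outside the mask before the optimizer step.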

These findings matter for the design of parameter-efficient training methods, for model compression, and for understanding the internal geometry of large models. They suggest that dense teacher supervision does not lead to dense parameter rewriting but preserves geometric signatures that can be exploited for efficient adaptation. The study opens avenues for future research into structured model updates, spectral regularization, and scalable fine-tuning strategies.

Deep Dive

Abstract

On-policy distillation (OPD) has recently become a prominent post-training recipe as it combines two desirable ingredients: on-policy student trajectories and dense teacher supervision, yet how this hybrid changes a model's parameters remains unclear. Across several language and vision-language model pairs and use cases, our analysis yields two main findings. On sparsity, OPD-style updates are small and coordinate-sparse. They are distributed across layers and are usually FFN-heavy. This sparse structure is operationally useful: training only the discovered subnetwork recovers nearly the same performance as full OPD. However, the sparsity-inducing SGD optimizer underperforms AdamW in our optimizer ablation, likely because dense teacher supervision preserves heterogeneous coordinate-wise gradient scales where AdamW's adaptive scaling remains useful. On geometry, the updates are numerically full-rank but spectrally concentrated; they lie mostly away from the principal singular subspaces of the source weights and fall disproportionately on coordinates where the source weights are close to zero. These findings suggest that dense teacher supervision does not turn OPD into ordinary dense parameter rewriting; instead, OPD retains important geometric signatures of on-policy post-training.
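The remark about heterogeneous coordinate-wise gradient scales can be illustrated with a single optimizer step. The sketch below hand-codes one AdamW update with decoupled weight decay and compares it with plain SGD on a gradient whose two coordinates differ by four orders of magnitude. It illustrates the mechanism only and is not the paper's ablation; the gradient values and hyperparameters are hypothetical.

```python
import numpy as np

def adamw_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.0):
    """One AdamW update (decoupled weight decay) at step t >= 1."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)  # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)  # bias-corrected second moment
    w = w - lr * wd * w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

lr = 1e-3
w0 = np.zeros(2)
g = np.array([1e-4, 1.0])  # hypothetical gradient with very different scales

sgd_w = w0 - lr * g
adam_w, _, _ = adamw_step(w0, g, np.zeros(2), np.zeros(2), t=1, lr=lr)

sgd_moves = np.abs(sgd_w - w0)
adam_moves = np.abs(adam_w - w0)
print("SGD step sizes:  ", sgd_moves)
print("AdamW step sizes:", adam_moves)
```

SGD moves the small-gradient coordinate 10,000 times less than the large one, whereas the first AdamW step moves both by roughly the learning rate. If informative coordinates carry small gradients, this per-coordinate rescaling is one plausible reason adaptive optimizers retain an advantage.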

