Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training

TL;DR

Proposes a sparse-to-dense reward principle combining GRPO and OPD to enhance language model post-training.

cs.LG Advanced 2026-05-13
Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, Zhipeng Wang, Alborz Geramifard
language model, sparse reward, dense reward, post-training, model distillation

Key Findings

Methodology

This study introduces a sparse-to-dense reward allocation principle that combines the strengths of GRPO and OPD. Sparse rewards are applied to the teacher model to discover reward-shaped behavior, which is then transferred to the student model through dense supervision. The transfer uses a two-stage bridge: a forward-KL warmup on teacher rollouts, followed by OPD on student rollouts.

Key Results

  • With a Qwen3-1.7B student, distilling an RL-improved 8B teacher through the dense bridge outperforms direct GRPO on the same student on MATH (79.3% vs. 75.9%) and on AIME 2024 (25.2 vs. 19.8).
  • With a Llama 8B student, transfer from an RL-improved 70B teacher outperforms direct GRPO on the student (62.1% vs. 59.8%).
  • Applying sparse rewards to the student after the bridge lifts MATH accuracy from 75.4% to 78.5%, outperforming a matched replay control by 2.8 points.
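The gaps implied by these figures can be checked with a few lines of arithmetic. The sketch below only restates the numbers reported above; the labels are descriptive shorthand, not names used by the paper.

```python
# (label, score with the method, score of the comparison), copied from the bullets above.
comparisons = [
    ("Qwen3-1.7B student, MATH: dense bridge vs. direct GRPO", 79.3, 75.9),
    ("Qwen3-1.7B student, AIME 2024: dense bridge vs. direct GRPO", 25.2, 19.8),
    ("Llama 8B student: 70B-teacher transfer vs. direct GRPO", 62.1, 59.8),
    ("MATH: after vs. before post-bridge student-side sparse RL", 78.5, 75.4),
]

# Round to one decimal place to hide floating-point noise in the subtraction.
gaps = {label: round(score - reference, 1) for label, score, reference in comparisons}

for label, gap in gaps.items():
    print(f"{label}: +{gap} points")
```

The 2.8-point margin over the matched replay control is reported directly by the source and cannot be recomputed from the figures listed here, because the control's own score is not given.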

Significance

The paper's central claim concerns how scarce labeled data is spent in language-model post-training: rather than using it directly on the deployment student, use it upstream on a stronger teacher and transfer the result downstream. Applying sparse rewards to the teacher to discover reward-shaped behavior, then transferring that behavior to the student through dense supervision, outperforms direct GRPO on the student in the reported experiments. This gives practitioners a concrete allocation rule for settings where verifiable labeled data is the binding constraint.

Technical Contribution

The main contribution is a reward-density allocation principle that treats GRPO-style sparse RL and OPD-style dense teacher supervision as two regimes of one design space rather than as separate recipes. The study uses a two-stage bridge, a forward-KL warmup followed by OPD, to carry out the transfer. The principle is evaluated on verifiable math with two model families, Qwen3 and Llama.

Novelty

The study proposes a sparse-to-dense reward allocation principle that combines the strengths of GRPO and OPD. Where standard practice runs sparse-reward RL directly on the deployment student, this approach applies sparse rewards to the teacher model to discover reward-shaped behavior and then transfers it to the student through dense supervision, making better use of scarce labeled data.

Limitations

  • The method has been validated on relatively small model scales (1.7B and 8B students, with teachers up to 14B and 70B), and its performance on larger scales remains unverified.
  • The bridging method requires a shared tokenizer between the teacher and student, which may limit its application across different models.
  • The method's performance on open-ended tasks and instruction-following tasks has not been verified.

Future Work

Future work could include validating the method's effectiveness on larger model scales and its application in open-ended and instruction-following tasks. Additionally, exploring different reward density allocation strategies could further enhance model training efficiency.

AI Executive Summary

In the post-training of language models, effectively utilizing limited labeled data has always been a challenge. Traditional methods often apply sparse rewards directly to the deployment model (e.g., GRPO), but this approach is not efficient in data utilization. This paper proposes a novel sparse-to-dense reward principle that combines the strengths of GRPO and OPD by applying sparse rewards to the teacher model to discover reward-shaped behavior, which is then transferred to the student model through dense supervision.

Specifically, the paper uses a two-stage bridge: a forward-KL warmup on teacher rollouts, followed by OPD on student rollouts. Experiments on Qwen3 and Llama models show that an RL-improved teacher distilled through this dense bridge outperforms direct GRPO on the student on the MATH dataset.

The core technical idea is to let sparse rewards shape the teacher, where exploration is productive, and then hand that behavior to the student as dense, token-level supervision. The forward-KL warmup adjusts the student's support so that the subsequent OPD stage can transfer the reward-shaped behavior effectively.

With a Qwen3-1.7B student, distilling an RL-improved 8B teacher through the dense bridge reaches 79.3% on MATH versus 75.9% for direct GRPO, and 25.2 on AIME 2024 versus 19.8.

The result offers a more data-efficient training strategy for settings with limited labeled data. However, the method's performance at larger model scales remains unverified, and future work could validate it at those scales and on open-ended and instruction-following tasks.

Deep Analysis

Background

Post-training has become a crucial step in improving language-model performance. Two common approaches, GRPO and OPD, each have a limitation. GRPO guides learning with sparse reward signals, which uses labeled data inefficiently when the policy is not yet good enough to learn from them. OPD compresses behavior through dense teacher supervision but does no exploration of its own. To address both limitations, this paper proposes a sparse-to-dense reward allocation principle that combines the strengths of GRPO and OPD.

Core Problem

The core problem in language model post-training is how to effectively utilize limited labeled data. Traditional methods often apply sparse rewards directly to the deployment model, but this approach is not efficient in data utilization. Sparse reward signals, while unbiased, are only useful when the policy is already good enough to learn from them. Dense teacher rewards provide signals at every token but are biased towards the teacher model.
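The contrast between the two reward densities can be written down compactly. The block below is a sketch in commonly used notation, not the paper's own: $\pi_\theta$ for the policy being trained, $\pi_T$ for the teacher, $r(x, y)$ for a verifier reward assumed here to be binary, and a per-token reverse KL as one possible dense signal are all assumptions made for illustration.

```latex
% Sparse, sequence-level reward (GRPO-style): one scalar per sampled response y.
J_{\text{sparse}}(\theta)
  = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
    \big[\, r(x, y) \,\big],
  \qquad r(x, y) \in \{0, 1\}

% Dense, token-level teacher signal (OPD-style): one term per position t.
\mathcal{L}_{\text{dense}}(\theta)
  = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
    \Big[ \sum_{t=1}^{|y|}
      D_{\mathrm{KL}}\big( \pi_\theta(\cdot \mid x, y_{<t}) \,\big\|\, \pi_T(\cdot \mid x, y_{<t}) \big)
    \Big]
```

In the first expression a response of any length yields a single number, and that number is zero whenever the policy fails, which is why the signal helps only a policy that already succeeds some of the time. In the second, every position contributes a term, but the target is whatever the teacher prefers, which is the sense in which the dense signal is biased towards the teacher.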

Innovation

This paper proposes a sparse-to-dense reward allocation principle that combines the strengths of GRPO and OPD. Sparse rewards are applied to the teacher model to discover reward-shaped behavior, which is then transferred to the student model through dense supervision, using a two-stage bridge of forward-KL warmup followed by OPD. The approach makes better use of scarce labeled data, and its effectiveness is validated on both Qwen3 and Llama models.

Methodology

  • Apply sparse rewards to the teacher model to discover reward-shaped behavior.
  • Use forward-KL warmup to adjust the student's support.
  • Employ OPD for dense supervision on the student model.
  • Apply sparse rewards on the student model to further enhance performance.
  • Conduct experiments on Qwen3 and Llama models to validate the method.
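The ordering of the training steps above can be sketched as a small pipeline. This is an illustration of the sequencing only: every function name, the `State` dictionary, and the string tags are hypothetical placeholders standing in for real training runs, and none of them come from the paper's code.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, object]
Stage = Tuple[str, Callable[[State], State]]

def run_pipeline(state: State, stages: List[Stage]) -> State:
    """Run the stages in order and record the order they ran in."""
    order: List[str] = []
    for name, step in stages:
        state = step(state)
        order.append(name)
    return {**state, "order": order}

# Placeholder stages: each returns a tag where a real trainer would return weights.
def teacher_sparse_rl(s: State) -> State:
    # Scarce labeled data is spent here, on the stronger model.
    return {**s, "teacher": "rl_improved"}

def forward_kl_warmup(s: State) -> State:
    # The student is fit to rollouts sampled from the RL-improved teacher.
    return {**s, "student": f"warmed_on_{s['teacher']}_rollouts"}

def on_policy_distillation(s: State) -> State:
    # The student samples its own rollouts; the teacher supervises every token.
    return {**s, "student": "bridged"}

def student_sparse_rl(s: State) -> State:
    # Sparse reward reaches the student only after the bridge.
    assert s["student"] == "bridged", "student-side RL should follow the bridge"
    return {**s, "student": "bridged_plus_sparse_rl"}

STAGES: List[Stage] = [
    ("teacher_sparse_rl", teacher_sparse_rl),
    ("forward_kl_warmup", forward_kl_warmup),
    ("on_policy_distillation", on_policy_distillation),
    ("student_sparse_rl", student_sparse_rl),
]

final = run_pipeline({"teacher": "base", "student": "cold"}, STAGES)
print(final["order"])
```

The assertion inside `student_sparse_rl` encodes the one ordering constraint the summary stresses: sparse reward is applied to the student only once the bridge has run.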

Experiments

The experiments cover Qwen3 and Llama models and use the DAPO-Math-17K dataset. They compare direct GRPO on the student against sparse-to-dense transfer, and compare transfer from different teacher models. The main factors varied are model scale, reward-signal density, and the steps of the bridging method.

Results

Experimental results show that an RL-improved teacher model distilled through the dense bridge outperforms direct GRPO on the MATH dataset. A Qwen3-1.7B student distilled from the RL-improved 8B teacher achieves 79.3%, while direct GRPO on the same student achieves only 75.9%. With a Llama 8B student, transfer from an RL-improved 70B teacher outperforms direct GRPO on the student (62.1% vs. 59.8%).

Applications

This method can be directly applied to language model post-training scenarios requiring efficient utilization of limited labeled data. It is suitable for industrial applications needing improved model performance under limited computational resources, such as intelligent assistants and dialogue systems.

Limitations & Outlook

The method has been validated on relatively small model scales, and its performance on larger scales remains unverified. Additionally, the bridging method requires a shared tokenizer between the teacher and student, which may limit its application across different models. Future work could include validating the method's effectiveness on larger model scales and its application in open-ended and instruction-following tasks.

Plain Language Accessible to non-experts

Imagine a kitchen with a head chef (the teacher model) and an assistant (the student model). For a new dish, the only feedback available is a verdict on the finished plate: it either tastes right or it does not (sparse reward). Guidance at every step of the cooking (dense reward) would be far more useful, but at first nobody knows what the right steps are.

So the head chef, who is skilled enough to learn from taste verdicts alone, experiments until the dish comes out right. The chef then guides the assistant through the dish step by step. This is the sparse-to-dense reward principle: let the head chef explore with sparse rewards to find what works, then transfer it to the assistant through dense rewards.

This way, the assistant can learn the dish faster because they know not only what the final taste should be but also what to do at each step. This method not only improves learning efficiency but also achieves better results with limited time and resources.

So, the core of this method is how to effectively use limited information to help the assistant learn to make delicious dishes in the shortest time possible.

ELI14 Explained like you're 14

Hey there! Imagine you're playing a super complex game where you have to solve puzzles to get rewards. Usually, you might get hints like 'find the treasure behind that tree.' That's like giving our AI models sparse rewards: they only get rewards when they do something right.

But sometimes we want to give the AI more guidance, like having an NPC in the game telling you what to do at every step. That's dense rewards: feedback at every step, letting the AI know if it's doing well or not.

Now, scientists have come up with a smart way to combine these two types of rewards. First, they let the AI explore with sparse rewards to find useful clues. Then, they use dense rewards to help the AI understand these clues better.

This way, the AI can learn how to solve puzzles faster, just like you would in a game with both big-picture hints and step-by-step guidance. This method makes the AI smarter and helps it complete tasks more quickly!

Glossary

GRPO (Group Relative Policy Optimization)

A reinforcement learning algorithm that samples a group of responses for each prompt and scores each response relative to the rest of its group, which lets it optimize a policy from sparse, sequence-level reward signals.

Used in this paper for applying sparse rewards directly on the student model.
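The group-relative scoring at the heart of GRPO can be shown in a few lines. This is a minimal sketch of the advantage computation only, assuming binary verifier rewards and standard-deviation normalization; it omits the policy-gradient update, clipping, and any KL term, and the function name and example rewards are made up for illustration.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each response's reward against its own group (one prompt)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical group: 4 sampled answers to one math problem, 1.0 = verified correct.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group in which every answer fails: all rewards equal, so all advantages are 0.
cold_group = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

print(advantages)
print(cold_group)
```

The second call illustrates why sparse reward can be weak on an unprepared policy: when no sampled answer is correct, every advantage is zero and the labeled example contributes no gradient.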

OPD (On-Policy Distillation)

A distillation method in which the student generates its own rollouts and the teacher supplies dense, token-level supervision on them. It is typically used to compress the behavior of a large model into a smaller one.

Used in this paper to transfer teacher model behavior to the student model through dense supervision.
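To make "dense supervision" concrete, the sketch below computes one loss value per token position of a student-generated rollout. It assumes a per-token reverse KL, KL(student || teacher), which is one common choice for on-policy distillation; the exact divergence used in the paper is not stated in this summary, and the logits are toy values.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at one position, over a shared vocabulary."""
    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)
    return sum(s * math.log(s / t) for s, t in zip(p_s, p_t) if s > 0.0)

def dense_token_losses(student_seq_logits, teacher_seq_logits):
    """One loss value per token position of a student-generated rollout."""
    return [reverse_kl(s, t) for s, t in zip(student_seq_logits, teacher_seq_logits)]

# Toy rollout: 3 positions over a 3-token vocabulary (hypothetical logits).
student = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3], [1.0, 1.0, 1.0]]
teacher = [[2.0, 0.5, 0.1], [1.8, 0.1, 0.2], [1.0, 1.0, 1.0]]

losses = dense_token_losses(student, teacher)
print(losses)
```

Only the middle position, where the two models disagree, produces a nonzero loss, so the student learns which token went wrong rather than only that the sequence failed. Both logit vectors must index the same vocabulary, which is where the shared-tokenizer requirement noted under Limitations comes from.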

Sparse Reward

A type of reward signal that provides feedback only once per completed sequence, for example whether a final answer is verified as correct. Suitable for tasks requiring long-term exploration.

Used to discover reward-shaped behavior on the teacher model.

Dense Reward

A type of reward signal that provides feedback at every step. Suitable for tasks requiring fine control.

Used for dense supervision on the student model.

Forward-KL Warmup

A training step in which the student is fit to teacher rollouts under a forward-KL objective, adjusting the student's support before dense on-policy supervision begins.

Used as the first stage of the bridging process in this paper.
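A small numeric example shows why a forward-KL step is a natural way to adjust the student's support. The sketch below uses made-up next-token distributions; the direction labels follow the usual convention that forward KL is KL(teacher || student) and reverse KL is KL(student || teacher).

```python
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# Hypothetical next-token distributions over a 3-token vocabulary.
teacher = [0.50, 0.49, 0.01]
student = [0.98, 0.01, 0.01]  # the student puts almost no mass on token 1

forward_kl = kl(teacher, student)  # expectation under the teacher (teacher rollouts)
reverse_kl = kl(student, teacher)  # expectation under the student (student rollouts)

print(f"forward KL: {forward_kl:.2f}, reverse KL: {reverse_kl:.2f}")
```

Here the forward KL is about 1.57 and the reverse KL about 0.62. The forward direction is dominated by the token the teacher uses often and the student almost never produces, so minimizing it pushes the student to cover what the teacher does. The reverse direction is evaluated on the student's own samples, which rarely contain that token, so it penalizes the same gap much less.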

Two-Stage Bridge

A method that chains a forward-KL warmup and OPD to transfer reward-shaped behavior from the teacher model to the student model.

Used in this paper as the transfer step between teacher-side sparse RL and any later student-side sparse RL.

Qwen3 Model

A language model used to validate the method proposed in this paper, available in different scales (1.7B, 8B, 14B).

Used in experiments to test the effectiveness of the sparse-to-dense reward principle.

Llama Model

Another language model used to validate the method proposed in this paper, available in different scales (8B, 70B).

Used in experiments to test the teacher quality ordering.

DAPO-Math-17K

A dataset containing 17,000 verifiable math problems used to validate the method proposed in this paper.

Used in experiments to test the performance of different models and methods.

AIME

American Invitational Mathematics Examination, used to test model performance on math problems.

Used in experiments to evaluate the mathematical capabilities of different models.

Open Questions Unanswered questions from this research

  • How can the effectiveness of the sparse-to-dense reward principle be validated on larger model scales? Current experiments are conducted on relatively small scales, and its performance on larger scales remains unverified.
  • How does the method perform on open-ended and instruction-following tasks? Current experiments focus mainly on math problems, and its performance on other types of tasks has not been verified.
  • How can the bridge be extended to teacher-student pairs that do not share a tokenizer? The current method requires a shared tokenizer between the teacher and student, which may limit its application across different models.
  • Are there other, more efficient reward density allocation strategies? The current method is based on the sparse-to-dense reward allocation principle, and other potential strategies could be explored in the future.
  • How can computational costs be reduced without affecting model performance? The current method may have high computational costs, and more efficient computational strategies could be explored in the future.

Applications

Immediate Applications

Intelligent Assistants

Improve learning efficiency and performance of intelligent assistants under limited data using the sparse-to-dense reward principle, providing more accurate and personalized services.

Dialogue Systems

Apply the method in dialogue systems to enhance response capabilities and accuracy in complex dialogue scenarios, improving user experience.

Educational Technology

Apply the method in educational technology to improve AI performance in personalized learning and automated assessment, providing more effective learning support for students.

Long-term Vision

Cross-Domain AI Applications

Achieve widespread AI applications across different domains through the promotion of this method, enhancing intelligence and efficiency across industries.

General Artificial Intelligence

Promote the development of general artificial intelligence by continuously optimizing reward density allocation strategies, achieving higher levels of intelligence and automation.

Abstract

In settings where labeled verifiable training data is the binding constraint, each checked example should be allocated carefully. The standard practice is to use this data directly on the model that will be deployed, for example by running GRPO on the deployment student. We argue that this is often an inefficient allocation because it overlooks a reward-density principle: sparse sequence-level reward should train models where exploration is productive, while dense token-level teacher reward should be used where the aim is to compress behavior into a smaller model. In this view, GRPO-style sparse RL and OPD-style dense teacher supervision are not separate recipes; they are different reward-density regimes. The allocation rule is simple: use scarce labeled training data upstream on the strongest model that can turn it into reward-shaped behavior, then transfer that behavior downstream as dense supervision. We evaluate this rule on verifiable math with Qwen3 and Llama models. At fixed Qwen3-1.7B deployment-student size, an RL-improved 8B teacher distilled through the dense bridge outperforms direct GRPO on the same student, while transfer from the same teacher before RL underperforms. The bridge is important: a forward-KL warmup on teacher rollouts followed by OPD on student rollouts is consistently strongest on MATH before any post-bridge student-side sparse RL, and also gives the best pre-Stage 3 AIME endpoints for the canonical 8B/14B teachers. The bridge also makes later student-side sparse RL effective: GRPO that is weak on a cold student lifts MATH from 75.4% to 78.5% after the bridge and outperforms a matched replay control by 2.8 points. The operational principle is to avoid using scarce labeled data on the least prepared policy: use sparse reward for teacher-side discovery, dense transfer for student compression, and student-side sparse reward only after the bridge.

cs.LG cs.AI
