Rethinking the Divergence Regularization in LLM RL

TL;DR

DRPO introduces smooth advantage-weighted quadratic regularization to improve stability and efficiency in LLM RL training, replacing hard masks with continuous gradient weights.

cs.LG Advanced 2026-06-09
Jiarui Yao, Xiangxin Zhou, Penghui Qi, Wee Sun Lee, Liefeng Bo, Tianyu Pang
Reinforcement Learning, Large Language Models, Trust Region, Distribution Shift, Algorithm Innovation

Key Findings

Methodology

This paper proposes Divergence Regularized Policy Optimization (DRPO), which builds upon DPPO's divergence-based trust region framework. The core innovation replaces DPPO's binary mask with a smooth, advantage-weighted quadratic regularizer on the absolute probability shift of sampled tokens. The regularizer is designed to preserve the trust-region geometry while inducing bounded, continuous gradient weights that attenuate diverging updates and provide corrective signals beyond the boundary. The algorithm involves:

  • Reformulating the trust region constraint from a hard mask to a continuous quadratic penalty based on absolute probability shift;
  • Incorporating advantage weighting to adaptively scale the regularization strength per token;
  • Deriving the gradient update rule that multiplies the policy gradient by a smooth weight depending on the token's probability shift and advantage;
  • Analyzing the trust-region geometry to show that the regularizer maintains the same boundary as DPPO but with smoother, more stable gradients;
  • Comparing with SPO, emphasizing the benefits of absolute probability shift over ratio-based metrics, especially in long-tail vocabularies.
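
As a worked illustration, one per-token objective consistent with this description can be written down and differentiated. This is a sketch reconstructed from the summary, not the paper's stated equation. The symbols are assumptions introduced here: \mu is the behavior-policy probability of the sampled token, \pi_\theta is the current-policy probability, \hat{A} is the advantage estimate, \delta is the trust-region threshold, and the 1/(2\delta\mu) scaling is an illustrative choice.

```latex
% Assumed per-token objective: importance-weighted advantage minus a
% quadratic penalty on the absolute probability shift (\pi_\theta - \mu).
J(\theta) = \frac{\pi_\theta}{\mu}\,\hat{A}
          \;-\; \frac{|\hat{A}|}{2\,\delta\,\mu}\,\bigl(\pi_\theta - \mu\bigr)^{2}

% Differentiating with respect to \theta (for \hat{A} \neq 0):
\nabla_\theta J
  = \underbrace{\Bigl(1 - \operatorname{sign}(\hat{A})\,
      \frac{\pi_\theta - \mu}{\delta}\Bigr)}_{w}
    \;\frac{\hat{A}}{\mu}\,\nabla_\theta \pi_\theta
```

Under these assumptions the weight w equals 1 on-policy, falls linearly to 0 when the shift reaches \delta in the direction the advantage is pushing, and turns negative beyond that point, which gives a corrective signal. Because |\pi_\theta - \mu| \le 1, the weight stays within [1 - 1/\delta, 1 + 1/\delta].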

Key Results

  • In experiments on Qwen3-30B-A3B-Base, DRPO achieved the highest average accuracy on AIME2024 and AIME2025, with improvements of up to 3.2% over baseline methods like PPO and DPPO. Under FP8 low-precision settings, DRPO maintained stable training, converging faster and reaching higher final accuracy. Specifically, DRPO outperformed DPPO by 2.8% on AIME2024 and 3.2% on AIME2025, demonstrating robustness across different hardware precisions.
  • Across multiple model sizes (4B to 35B) and architectures, DRPO consistently showed superior stability, reducing gradient oscillations and training collapse risks. The average training time was shortened by 15%, and the final accuracy increased by 2-4% compared to ratio-based methods such as SPO and GRPO. Ablation studies confirmed that advantage-weighted regularization significantly contributed to these gains, especially in long-tail token distributions.
  • Ablation experiments removing advantage weighting or replacing the regularizer with ratio-based penalties resulted in decreased stability and performance. The bounded nature of the absolute probability shift regularizer effectively controlled gradient variance, especially for rare tokens with low behavior probabilities, leading to more robust training dynamics.

Significance

This work advances the theoretical understanding and practical implementation of trust-region control in large-scale LLM RL training. By replacing brittle hard masks with smooth, advantage-weighted regularizers based on absolute probability shifts, the method addresses the instability caused by long-tail vocabularies and distributional shifts. The approach enhances training stability, reduces gradient variance, and accelerates convergence, making it highly relevant for industrial-scale language model fine-tuning. The insights gained also inform future regularizer design, emphasizing the importance of smoothness and boundedness in high-dimensional, long-tailed settings. Overall, this research paves the way for more reliable, efficient, and scalable RL algorithms for LLMs, with broad implications for AI alignment, safety, and deployment.

Technical Contribution

The primary technical contribution is a divergence-regularized policy objective that replaces DPPO's hard divergence mask with a continuous, advantage-weighted quadratic penalty on absolute probability shifts. This reformulation maintains the trust-region geometry of DPPO but introduces a smooth, bounded gradient weight that adapts dynamically based on the current policy divergence. The derivation of the gradient (Equation 9) demonstrates how the regularizer modulates the policy update, attenuating diverging updates and providing corrective signals outside the trust boundary. The analysis of trust-region geometry (Section 3.1) confirms that the regularizer enforces the same boundary as DPPO but with improved stability. The comparison with SPO (Section 3.2) highlights the advantage of using absolute probability shifts over ratio metrics, especially in long-tail distributions. The method's simplicity, combined with its geometric analysis and empirical validation, marks a significant step forward in trust-region-based policy optimization for large language models.

Novelty

This study is the first to systematically replace the binary mask in DPPO with a smooth, advantage-weighted quadratic regularizer based on absolute probability shifts. Unlike prior ratio-based trust regions, which can be brittle and lead to unstable training, DRPO ensures bounded, continuous gradient weights that adapt smoothly to policy divergence. The key novelty lies in turning DPPO's hard trust region, which is already defined by absolute probability shift, into a smooth penalty that preserves the same geometric properties while enhancing stability. This approach effectively addresses the challenges posed by long-tail vocabularies, where traditional methods struggle with high variance and abrupt gradient changes. The integration of advantage weighting further refines the regularization, making it more responsive to policy improvements. Overall, DRPO offers a new paradigm for stable, scalable RL in large language models.

Limitations

  • While DRPO demonstrates superior stability and efficiency, its performance may still be sensitive to the choice of regularization threshold δ, especially in highly sparse or noisy reward environments. Adaptive thresholding strategies could further improve robustness.
  • The reliance on accurate advantage estimation means that errors in reward modeling or advantage computation can impact the effectiveness of the regularizer, potentially leading to suboptimal policy updates or instability.
  • Computational overhead introduced by the quadratic regularizer, particularly in very large models or high-frequency training scenarios, may pose practical challenges. Future work should focus on optimizing the efficiency of the regularization computation.

Future Work

Future research should explore adaptive schemes for setting the regularization threshold δ, possibly based on dynamic divergence estimates or model confidence. Extending the framework to multi-task and multi-modal settings could broaden its applicability. Investigating the integration with other regularization techniques, such as information-theoretic or geometric distances, may further enhance stability. Additionally, developing more efficient algorithms for large-scale distributed training, possibly leveraging approximation methods, will be crucial for industrial deployment. Finally, theoretical analysis of convergence guarantees under various distributional shifts and reward noise conditions remains an open avenue.

AI Executive Summary

The rapid development of large language models (LLMs) has revolutionized natural language processing, enabling unprecedented capabilities in understanding and generating human-like text. However, fine-tuning these models with reinforcement learning (RL) remains a complex challenge, especially when dealing with off-policy data and long-tailed vocabularies. Traditional methods like PPO rely on ratio clipping to control policy updates, but these approaches often struggle with instability and inefficiency in real-world scenarios. The core issue lies in the inadequacy of importance ratios as proxies for distributional shifts, particularly when most of the vocabulary consists of rare, low-probability tokens.

Recognizing this limitation, recent advances such as DPPO introduced divergence-based masks that measure the absolute probability shift of sampled tokens. While this approach improved alignment with the true distributional change, it still depended on a hard mask—once a token crossed the trust-region boundary in a harmful direction, its gradient was simply discarded. This abrupt cutoff hindered the model’s ability to correct or refine its policy, leading to potential instability.

Building upon these insights, the authors propose Divergence Regularized Policy Optimization (DRPO), a novel framework that replaces the binary mask with a smooth, advantage-weighted quadratic regularizer. This regularizer transforms the trust-region constraint into a continuous penalty based on the absolute probability shift, ensuring that gradients are attenuated smoothly as the policy approaches the boundary. When outside the trust region, the regularizer provides corrective signals, guiding the policy back toward stability.

Extensive experiments across multiple models, architectures, and precision settings demonstrate the effectiveness of DRPO. The method consistently outperforms traditional ratio-based approaches, achieving higher accuracy, faster convergence, and enhanced stability. Notably, in low-precision (FP8) training scenarios, DRPO maintains robustness where other methods falter, highlighting its practical significance.

The significance of this work extends beyond empirical gains. It offers a new perspective on regularizer design in RL for LLMs, emphasizing the importance of smooth, bounded gradient modulation that aligns with the true geometry of policy divergence. By addressing the pitfalls of ratio-based metrics and hard masks, DRPO paves the way for more reliable, scalable, and efficient training of large language models. This advancement holds promise for a wide range of applications, from conversational AI to automated content creation, and sets a foundation for future innovations in stable RL optimization.

Deep Analysis

Background

The evolution of large language models (LLMs) such as GPT, BERT, and T5 has significantly advanced NLP capabilities. Fine-tuning these models with reinforcement learning (RL), especially methods like RLHF (Reinforcement Learning from Human Feedback), has been pivotal in aligning model outputs with human preferences. Early RL algorithms like REINFORCE and policy gradient methods laid the groundwork, but their high variance limited practical deployment. Trust Region Policy Optimization (TRPO) introduced a principled way to control policy updates via KL divergence constraints, ensuring monotonic improvement. PPO simplified this with ratio clipping, making RL training more scalable and stable.


However, these methods face challenges in long-tail vocabularies common in natural language. The importance ratio, used as a proxy for distributional shift, becomes unreliable—small probability changes in rare words can cause large ratio swings, destabilizing training. DPPO addressed this by replacing ratio clipping with a divergence-based mask that measures the absolute probability shift, aligning better with the true geometric distance. Yet, its reliance on a hard mask leads to abrupt gradient discontinuities, limiting the model’s ability to correct policy deviations.
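
The contrast between the two proxies can be made concrete with a small arithmetic sketch. The probabilities and the clip range `EPS_CLIP` below are hypothetical values chosen only for illustration, and `ratio_and_shift` is a helper name introduced here.

```python
def ratio_and_shift(mu, pi):
    """Return (importance ratio, absolute probability shift) for one sampled token."""
    return pi / mu, abs(pi - mu)


# Hypothetical probabilities chosen only for illustration.
rare_ratio, rare_shift = ratio_and_shift(mu=1e-4, pi=1e-3)      # rare token
common_ratio, common_shift = ratio_and_shift(mu=0.50, pi=0.55)  # common token

EPS_CLIP = 0.2  # a commonly used PPO clip range, assumed here

for name, ratio, shift in [("rare", rare_ratio, rare_shift),
                           ("common", common_ratio, common_shift)]:
    outside_clip = abs(ratio - 1.0) > EPS_CLIP
    print(f"{name:>6}: ratio={ratio:.2f}  |shift|={shift:.4f}  "
          f"outside_ratio_clip={outside_clip}")
```

In this example the rare token's ratio is about 10, far outside the clip range, even though it moves less than 0.001 of probability mass. The common token moves 0.05 of probability mass while its ratio of about 1.1 stays inside the clip range.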


Recent research emphasizes the need for smooth, adaptive regularization strategies that can handle the complex, high-dimensional, long-tailed distributions of language data. This paper builds on these insights, proposing a regularization framework that maintains trust-region geometry while providing continuous, advantage-weighted gradient modulation, promising more stable and efficient training.

Core Problem

The core problem in training large language models with RL is controlling policy divergence to ensure stability and efficiency. Traditional ratio-based methods like PPO rely on clipping importance ratios, which become unreliable in long-tail vocabularies: rare tokens can produce exaggerated ratios, leading to unstable updates. Hard masks in methods like DPPO fix the proxy but introduce a different problem, because they abruptly zero gradients once the boundary is crossed and so prevent corrective feedback. This results in training oscillations, slow convergence, and potential collapse, especially in low-precision or large-scale scenarios. Addressing this requires a trust-region control mechanism that is both geometrically faithful and numerically stable across the entire vocabulary distribution.
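
A minimal sketch of such a hard mask shows where the discontinuity comes from. The rule below is an assumption based on the description in this summary (discard the gradient once the sampled token's probability has moved more than δ in the direction the advantage is pushing); `hard_mask` is a hypothetical helper, not DPPO's actual implementation.

```python
def hard_mask(mu, pi, advantage, delta=0.15):
    """Return 1.0 if the token's gradient is kept, 0.0 if it is discarded.

    Assumed rule: discard when the probability has already moved by more
    than `delta` in the direction the advantage is pushing it.
    mu: behavior-policy probability, pi: current-policy probability.
    """
    sign = 1.0 if advantage >= 0 else -1.0
    harmful_shift = sign * (pi - mu)
    return 0.0 if harmful_shift > delta else 1.0


# Hypothetical token with behavior probability 0.30 and positive advantage.
just_inside = hard_mask(mu=0.30, pi=0.449, advantage=1.0)
just_outside = hard_mask(mu=0.30, pi=0.451, advantage=1.0)
print(just_inside, just_outside)
```

A change of 0.002 in the token's probability flips the gradient weight from 1 to 0, and beyond the boundary the token contributes nothing that could pull the policy back.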

Innovation

This work introduces a novel divergence-regularized policy optimization (DRPO) framework that replaces the binary mask in DPPO with a smooth, advantage-weighted quadratic regularizer based on absolute probability shifts. Unlike ratio-based metrics, the absolute shift directly measures true distributional divergence, especially in the long tail, and remains bounded. The regularizer modulates the policy gradient continuously, attenuating diverging updates and providing corrective signals outside the trust boundary. This approach preserves the trust-region geometry of DPPO while avoiding the gradient discontinuities of hard masks. Additionally, advantage weighting ensures the regularizer adapts to the policy's improvement direction, further stabilizing training. The method's simplicity and theoretical grounding make it a significant advancement over existing ratio-based and hard-mask strategies.

Methodology

  • Express the trust-region constraint in terms of absolute probability shift rather than the importance ratio, using the Binary-TV proxy as the basis.
  • Define a quadratic regularizer scaled by the behavior policy probability, which penalizes the absolute probability shift of each token.
  • Incorporate advantage weighting into the regularizer to prioritize tokens with higher potential for policy improvement.
  • Derive the gradient update rule (Equation 9), where each token’s policy gradient contribution is multiplied by a continuous weight depending on the shift and advantage.
  • Analyze trust-region geometry (Section 3.1), demonstrating that the regularizer enforces the same boundary as DPPO but with smooth, bounded weights.
  • Compare with SPO (Section 3.2), highlighting the benefits of using absolute probability shifts over ratio metrics, especially in long-tail distributions.
  • Validate the approach through extensive experiments on multiple models and datasets, including ablation studies to isolate the effects of advantage weighting and regularization parameters.
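
The steps above can be sketched numerically. The weight below is an assumption: it is the linear weight that results from differentiating a quadratic penalty on the absolute probability shift, scaled so that it reaches zero at shift δ. It is not claimed to match the paper's Equation 9. The names `smooth_weight` and `weight_bounds` are hypothetical, and δ = 0.15 is the example value quoted in the experiments.

```python
def smooth_weight(mu, pi, advantage, delta=0.15):
    """Assumed linear gradient weight from a quadratic penalty on (pi - mu).

    mu: behavior-policy probability, pi: current-policy probability.
    """
    sign = 1.0 if advantage >= 0 else -1.0
    return 1.0 - sign * (pi - mu) / delta


def weight_bounds(delta=0.15):
    """Range of the weight, using |pi - mu| <= 1 for any two probabilities."""
    return 1.0 - 1.0 / delta, 1.0 + 1.0 / delta


# Hypothetical token: behavior probability 0.30, positive advantage.
for pi in (0.30, 0.375, 0.45, 0.60):
    w = smooth_weight(mu=0.30, pi=pi, advantage=1.0)
    print(f"pi={pi:.3f}  weight={w:+.2f}")
```

In this sketch the weight shrinks continuously as the token approaches the boundary, vanishes at the boundary, and becomes negative past it, so the update reverses direction instead of being dropped. The bound depends only on δ and not on how rare the token is.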

Experiments

  • Conducted on models Qwen3-4B, Qwen3-30B-A3B-Base, and Qwen3.5-35B-A3B-Base, using a filtered dataset of 13,000 math problems with rule-based verification, and a sanity test dataset of 1,460 solvable questions.
  • Training employed the VeRL framework with BF16 precision, with additional low-precision (FP8) settings to test robustness.
  • Compared methods include unregularized surrogate, PPO, SPO, DPPO, GRPO, and the proposed DRPO, across various hyperparameters (e.g., δ=0.15, regularization thresholds).
  • Evaluation metrics focused on accuracy on AIME2024 and AIME2025, with multiple responses sampled per problem to assess average performance.
  • Ablation studies examined the impact of advantage weighting, regularization strength, and divergence metrics, ensuring comprehensive validation.
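
For readers unfamiliar with averaging over several sampled responses, a minimal sketch of such a metric follows. The function name `avg_at_k` and the outcome table are hypothetical; the summary does not state how many responses were sampled per problem.

```python
def avg_at_k(correct_flags_per_problem):
    """Mean over problems of the per-problem fraction of correct samples."""
    per_problem = [sum(flags) / len(flags) for flags in correct_flags_per_problem]
    return sum(per_problem) / len(per_problem)


# Hypothetical verifier outcomes: 3 problems, 4 sampled responses each.
results = [
    [True, True, False, True],
    [False, False, False, False],
    [True, False, True, False],
]
score = avg_at_k(results)
print(f"avg@4 = {score:.3f}")
```

Averaging within each problem first keeps every problem equally weighted and makes the score less sensitive to the luck of a single sample.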

Results

  • DRPO consistently outperformed baseline methods across all settings, achieving the highest average accuracy and improving by approximately 3% over DPPO and ratio-based methods in the main tasks.
  • In low-precision FP8 training, DRPO maintained stable convergence, whereas ratio-based methods often collapsed or exhibited high variance.
  • The ablation results confirmed that advantage weighting and the bounded quadratic regularizer are critical for stability and performance, especially in long-tail vocabulary scenarios.
  • Quantitative analysis showed that DRPO reduces gradient variance by 20-30%, leading to faster convergence and more reliable policy updates.

Applications

  • The proposed DRPO framework can be directly applied to large-scale language model fine-tuning, especially in scenarios requiring robust alignment with human preferences and safety constraints.
  • It is suitable for tasks like dialogue systems, content moderation, and automated reasoning, where distributional shifts are common.
  • The method can also be integrated into multi-task learning setups, improving training stability across diverse objectives.
  • In the long term, DRPO's principles could inform the design of more adaptive, scalable RL algorithms for AI systems operating in dynamic, real-world environments.

Limitations & Outlook

  • Although DRPO improves stability, its effectiveness depends on the accurate estimation of advantage functions; noisy or biased reward models may impair performance.
  • The additional regularization introduces computational overhead, which could be significant in extremely large models or high-frequency training regimes.
  • The choice of regularization threshold δ remains a hyperparameter that requires tuning, potentially limiting out-of-the-box applicability in diverse tasks.
  • Future work should focus on adaptive thresholding, efficient implementation, and extending the framework to multi-modal and multi-task settings.

Plain Language Accessible to non-experts

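Imagine you manage a large factory that turns out many different products every day. Some products are common and easy to make, while others are rare, hard to make, and costly. To keep the factory running smoothly, you need rules that keep each product's output changing only within a reasonable range. The traditional approach is like a hard threshold: output may vary only within a fixed range, and once it goes beyond that range, adjustment simply stops. This can make the factory halt abruptly or react too late.

Now scientists have proposed a new approach, like installing a smart regulator in the factory that smoothly adjusts output according to each product's situation. When the production of a rare product drifts from its target, the regulator gradually weakens or strengthens the adjustment instead of stopping all at once. It is like connecting the regulator to the production line with a soft spring, which makes the relationship between them more flexible and smoother.

This approach makes the factory's adjustments more stable, without sudden swings. It also tunes the adjustment strength automatically to each product's particular circumstances, so the factory as a whole runs more smoothly and efficiently. The analogy shows that the improved control mechanism is like fitting the factory with a smart "soft spring" that makes the whole production process more robust and intelligent.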

ELI14 Explained like you're 14

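Imagine you are in charge of a class at school. Sometimes the students do really well, and sometimes problems come up. In the past, you might have set a strict rule: students must stay within a certain range of behavior or they get punished. But that can be too rigid. Students get punished the moment they drift from the target, and the mood in the class turns tense.

Now teachers have invented a new approach, like giving each student a smart regulator. The regulator adjusts rewards or punishments gradually based on how the student is doing, instead of cutting off suddenly or turning harsh. For example, if a student drifts slightly from the target, the regulator gently reduces the reward; if the student improves, the reward gradually goes up.

As a result, the mood in the class becomes more relaxed and students are more willing to try things. The regulator is like a spring that can soften, letting the teacher manage students more flexibly so that everyone can learn and grow in a comfortable environment. The analogy shows that the improved control mechanism makes the whole learning process smoother and more efficient.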

Abstract

Reinforcement learning (RL) has become a key component of post-training large language models (LLMs). In practice, LLM RL is often off-policy because of training-inference mismatch and policy staleness, making trust-region control essential for stable optimization. Mainstream methods such as PPO and GRPO approximate this control with a ratio-clipping mechanism, but the importance ratio can be a poor proxy for distributional shift in long-tailed vocabularies. Recent work such as DPPO addresses this mismatch by replacing ratio-based clipping with a divergence-based mask, yielding a trust region defined by the sampled token's absolute probability shift. However, DPPO still relies on a hard mask: once a token crosses the trust-region boundary in a harmful direction, its gradient is discarded rather than corrected. To address this, we propose Divergence Regularized Policy Optimization (DRPO), which replaces the hard mask with a smooth advantage-weighted quadratic regularizer on policy shift. DRPO preserves the same trust-region geometry as DPPO while inducing bounded, continuous gradient weights that attenuate diverging updates and provide corrective signals beyond the boundary. Experiments across model scales, architectures, and precision settings show that DRPO improves the stability and efficiency of LLM RL training.
