STARE: Surprisal-Guided Token-Level Advantage Reweighting for Policy Entropy Stability

TL;DR

Proposes STARE, a surprisal-guided token-level advantage reweighting method that stabilizes policy entropy during GRPO training on models from 1.5B to 32B and outperforms DAPO and other baselines by 4%-8% in average accuracy on AIME24 and AIME25.

cs.LG Advanced 2026-06-18
Haipeng Luo, Qingfeng Sun, Songli Wu, Can Xu, Wenfeng Deng, Han Hu, Yansong Tang
Reinforcement Learning, Policy Entropy, Advantage Reweighting, Language Models, Training Stability

Key Findings

Methodology

This paper conducts a first-order gradient analysis of token-level entropy dynamics within the GRPO framework, revealing a four-quadrant structure based on advantage and surprisal. It identifies a near-criticality property where small weight perturbations can flip the entropy trend. Leveraging this insight, the authors develop STARE, which employs batch-internal surprisal quantiles to identify entropy-critical tokens, selectively reweights their advantages, and incorporates a target-entropy closed-loop gate for stable regulation. Extensive experiments across model scales (1.5B–32B) and tasks (short CoT, long CoT, multi-turn tool use) demonstrate that STARE maintains stable policy entropy over thousands of steps, outperforming baselines like DAPO by 4%-8% in accuracy, and sustaining exploration-exploitation balance.
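The token-selection step can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes that surprisal means the negative log-probability of the sampled token under the current policy, and that a "batch-internal quantile" is a threshold computed over all response tokens in the batch. The function names and the 0.8 quantile are hypothetical; the summary does not state the value used.

```python
import math

def surprisal(token_prob: float) -> float:
    """Surprisal of a sampled token: -log p(token | context)."""
    return -math.log(token_prob)

def quantile(values, q):
    """Linear-interpolation quantile of a non-empty list, with 0 <= q <= 1."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return ordered[lo] * (1 - frac) + ordered[hi] * frac

def high_surprisal_mask(batch_token_probs, q=0.8):
    """Flag tokens whose surprisal exceeds the batch-wide q-quantile.

    batch_token_probs: list of responses, each a list of sampled-token
    probabilities. q is an assumed value, not taken from the paper.
    """
    all_s = [surprisal(p) for resp in batch_token_probs for p in resp]
    threshold = quantile(all_s, q)
    return [[surprisal(p) > threshold for p in resp] for resp in batch_token_probs]

# Toy batch of two responses with three tokens each.
batch = [[0.9, 0.05, 0.7], [0.6, 0.95, 0.01]]
mask = high_surprisal_mask(batch, q=0.8)
```

Because the threshold is recomputed per batch, the fraction of flagged tokens stays roughly constant even as the policy sharpens, which is one plausible reason to prefer a quantile over a fixed surprisal cutoff.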

Key Results

  • On AIME24 and AIME25 datasets, STARE surpasses DAPO and other baselines by 4%-8% in average accuracy, with models like Qwen2.5-7B maintaining policy entropy within the target band over thousands of training steps. The approach effectively prevents entropy collapse, allowing the model to sustain exploration and reflection tokens, which are crucial for complex reasoning. The experimental results show that the policy's reflection tokens and response lengths grow in tandem, indicating a healthy exploration-exploitation balance. This leads to improved task performance and robustness across different model sizes and task types.
  • Across multiple model sizes, from 1.5B to 32B, STARE consistently stabilizes entropy during training, extending the number of effective training steps. For example, in 7B models, training exceeds 5000 steps without entropy collapse, whereas baseline methods like GRPO experience rapid entropy decay within a few hundred steps. The results also include ablation studies confirming that surprisal-based token selection and the closed-loop target entropy regulation are critical for stability. The method's robustness to hyperparameters and its minimal intrusion into the original GRPO objective make it highly practical.
  • Across short and long chain-of-thought reasoning and multi-turn tool use, STARE improves accuracy over the baselines. On the AIME24 and AIME25 math benchmarks, the average accuracy gain over DAPO and other baselines is 4%-8%. The method also maintains a healthy distribution of reflection tokens and response lengths, which are indicators of ongoing exploration. The experimental setup uses a learning rate of 1e-6, a batch size of 64, and 8 rollouts per sample, indicating that the approach is compatible with standard RL training pipelines.
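The rollout setup reported in the results above feeds GRPO's group-relative advantage, in which every token of a response inherits one trajectory-level value. The sketch below assumes the commonly used GRPO normalization (reward minus group mean, divided by group standard deviation) and binary verifiable rewards. `TrainConfig`, its field names, the use of the population standard deviation, and the example rewards are assumptions for illustration; only the three numeric settings come from the text.

```python
from dataclasses import dataclass
import statistics

@dataclass
class TrainConfig:
    learning_rate: float = 1e-6   # from the text
    batch_size: int = 64          # from the text
    rollouts_per_sample: int = 8  # from the text

def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantage within one rollout group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std is an assumption
    return [(r - mean) / (std + eps) for r in rewards]

cfg = TrainConfig()
rewards = [1, 0, 0, 1, 0, 0, 0, 0]  # hypothetical: 2 of 8 rollouts are correct
adv = group_advantages(rewards)
```

Since the same advantage is copied to every token in a response, any token-level differentiation has to come from a later step such as the surprisal-based reweighting that STARE adds.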

Significance

This work addresses a fundamental challenge in reinforcement learning for large language models: the tendency of policy entropy to collapse during extended training, which hampers exploration and limits model performance. By uncovering the token-level mechanisms behind entropy dynamics, the authors provide a theoretical foundation that explains why traditional methods struggle and how targeted interventions can stabilize training. The proposed STARE method offers a principled, minimally invasive solution that ensures sustained diversity and exploration, crucial for complex reasoning tasks. Its ability to maintain stable entropy over thousands of steps across different model sizes and tasks marks a significant advance in RL methodology, with broad implications for AI development. This approach not only enhances training stability but also unlocks the potential for more sophisticated, self-reflective, and generalizable language models.

Technical Contribution

The paper's key technical contribution is the first-order gradient analysis of token-level entropy variation under GRPO, revealing a four-quadrant structure based on advantage and surprisal. This analysis uncovers a near-criticality property, where small token-level weight perturbations can reverse the entropy trend. Building on this insight, the authors design a surprisal-guided advantage reweighting mechanism that selectively amplifies high-surprisal tokens with positive advantage and attenuates those with negative advantage. The mechanism employs batch-internal surprisal quantiles to identify entropy-critical tokens dynamically, and integrates a target-entropy closed-loop gate for adaptive regulation. This minimal intervention approach effectively prevents entropy collapse, extends stable training over thousands of steps, and improves task accuracy. The method's novelty lies in combining theoretical insights with practical reweighting strategies, providing a new paradigm for entropy stabilization in RL.
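A minimal sketch of the selective reweighting described above follows. It assumes a simple multiplicative form: entropy-critical tokens with positive advantage are scaled up and those with negative advantage are scaled down, while all other tokens keep their original GRPO advantage. The function name and the `boost` and `damp` strengths are hypothetical; the summary does not give the actual functional form or values.

```python
def reweight_advantages(advantages, is_critical, boost=0.5, damp=0.5):
    """Return per-token effective advantages.

    advantages: per-token advantages (in GRPO, the trajectory-level value
        copied to each token of the response).
    is_critical: booleans marking high-surprisal (entropy-critical) tokens.
    boost, damp: assumed strengths, not taken from the paper.
    """
    effective = []
    for a, critical in zip(advantages, is_critical):
        if not critical:
            effective.append(a)                  # untouched: plain GRPO
        elif a > 0:
            effective.append(a * (1.0 + boost))  # amplify positive advantage
        else:
            effective.append(a * (1.0 - damp))   # attenuate negative advantage
    return effective

token_adv = [1.0, 1.0, -1.0, -1.0]
critical = [True, False, True, False]
effective = reweight_advantages(token_adv, critical)
# effective == [1.5, 1.0, -0.5, -1.0]
```

Only the flagged tokens change, which matches the "minimal intervention" framing: with an all-False mask the function returns the original advantages.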

Novelty

This work is the first to integrate token-level surprisal analysis with advantage reweighting to address policy entropy collapse. Unlike prior approaches that rely on trajectory-level regularization or coarse-grained entropy control, STARE leverages the four-quadrant structure derived from first-order gradient analysis, enabling precise, token-level interventions. The use of batch-internal surprisal quantiles for dynamic token selection and the incorporation of a closed-loop target entropy gate represent innovative steps that allow minimal yet effective control over entropy dynamics. This approach bridges theoretical understanding and engineering practice, offering a new perspective on how to maintain diversity and exploration in large-scale RL training, setting it apart from existing methods like entropy regularization or advantage reshaping.

Limitations

  • While STARE demonstrates robustness and effectiveness across multiple scales and tasks, its reliance on surprisal quantiles assumes a stable and representative distribution within each batch. In highly non-stationary or sparse reward environments, the surprisal proxy may become less reliable, potentially affecting stability. Additionally, the hyperparameters such as the quantile proportion and target entropy require careful tuning; although less sensitive than prior methods, they still influence performance. The computational overhead of token-level surprisal calculation and dynamic reweighting, especially in very large models, may pose practical challenges. Future work should explore more efficient token selection strategies, adaptive hyperparameter tuning, and broader applicability to diverse RL scenarios, including multi-modal and multi-agent settings.

Future Work

The authors plan to extend STARE to multi-modal tasks, such as vision-language reasoning, where entropy dynamics are more complex. They also aim to develop more adaptive, data-driven hyperparameter tuning methods to reduce manual intervention. Incorporating causal inference and symbolic reasoning techniques could further enhance the interpretability and robustness of the entropy regulation mechanism. Additionally, exploring the integration of STARE with other RL algorithms, such as PPO variants or off-policy methods, may broaden its applicability. Long-term, the goal is to build more stable, explorative, and self-reflective AI systems capable of sustained learning and reasoning in dynamic, real-world environments.

AI Executive Summary

In the rapidly evolving field of large language models, reinforcement learning has become a cornerstone for instilling complex reasoning and adaptive behaviors. However, a persistent challenge has been the tendency of policy entropy to collapse during extended training, leading to reduced output diversity, limited exploration, and ultimately, diminished model performance. Traditional solutions, such as entropy regularization or advantage-based reweighting, offer partial relief but lack a deep understanding of the underlying mechanisms driving entropy decay.

This paper introduces STARE, a novel approach grounded in a detailed theoretical analysis of token-level entropy dynamics within the GRPO framework. By dissecting the gradient flow, the authors reveal a four-quadrant structure based on advantage and surprisal, which explains why high-surprisal tokens, though rare, hold the key to maintaining diversity. The analysis uncovers a near-criticality property: small token-level weight adjustments can flip the entropy trend, providing a precise lever for stabilization.

Building on these insights, STARE employs batch-internal surprisal quantiles to dynamically identify entropy-critical tokens. It then selectively amplifies the advantages of high-surprisal tokens with positive advantage, while suppressing those with negative advantage, effectively balancing exploration and exploitation. The addition of a target-entropy closed-loop gate ensures that the policy’s entropy remains within a predefined range, preventing collapse or divergence. Extensive experiments across models from 1.5B to 32B, and tasks ranging from simple reasoning to multi-turn tool use, demonstrate that STARE sustains stable training over thousands of steps, outperforming baselines like DAPO by 4%-8% in accuracy.
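One way to picture a target-entropy closed-loop gate is as a controller that maps the gap between measured and target entropy to an intervention strength. The sketch below is an assumption about how such a gate could look, not the paper's rule: it returns 1 when entropy falls below the target band, 0 when entropy is above it, and interpolates linearly inside the band. The function name, the band width, and the one-sided design are all hypothetical.

```python
def entropy_gate(current_entropy, target_entropy, band=0.05):
    """Map measured policy entropy to a gate value in [0, 1].

    Returns 0.0 at or above target + band (no intervention), 1.0 at or
    below target - band (full intervention), and a linear ramp in between.
    """
    low = target_entropy - band
    high = target_entropy + band
    if current_entropy >= high:
        return 0.0
    if current_entropy <= low:
        return 1.0
    return (high - current_entropy) / (high - low)

# Hypothetical readings against a target entropy of 0.3.
gate_collapsing = entropy_gate(0.10, target_entropy=0.3)  # entropy too low
gate_on_target = entropy_gate(0.30, target_entropy=0.3)   # inside the band
gate_high = entropy_gate(0.50, target_entropy=0.3)        # entropy high enough
```

In this reading, the gate value would scale the strength of the advantage reweighting, so the intervention fades out once entropy is back inside the band and the update reverts to plain GRPO.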

The significance of this work lies in its ability to address a fundamental bottleneck in RL training of large models. By providing a theoretically grounded, minimally invasive mechanism for entropy regulation, it unlocks the potential for more robust, explorative, and self-reflective AI systems. The approach’s scalability and robustness across diverse tasks make it a promising foundation for future advancements in AI training methodologies. Looking ahead, the authors aim to extend STARE to multi-modal scenarios, incorporate adaptive hyperparameter tuning, and explore its integration with other RL algorithms, paving the way for more stable and capable intelligent systems.

Deep Dive

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) algorithms such as GRPO have emerged as the dominant post-training paradigm for complex reasoning in LLMs, yet commonly suffer from policy entropy collapse during training. We conduct a first-order gradient analysis of token-level entropy dynamics under GRPO and identify a token-level credit assignment mismatch: the per-token entropy variation decomposes into the product of the trajectory-level advantage and an entropy sensitivity function over the next-token distribution, yielding an advantage-surprisal four-quadrant structure and a near-criticality property. Motivated by this analysis, we propose STARE (Surprisal-guided Token-level Advantage Reweighting for policy Entropy stability), which identifies entropy-critical token subsets via batch-internal surprisal quantiles, selectively reweights their effective advantages, and incorporates a target-entropy closed-loop gate for stable entropy regulation. Across model scales from 1.5B to 32B and three task families (Short CoT, Long CoT, and Multi-Turn Tool Use), STARE sustains stable RL training over thousands of steps while maintaining policy entropy within the target band. On AIME24 and AIME25, STARE outperforms DAPO and other competitive baselines by 4%-8% in average accuracy, with reflection tokens and response length growing in tandem, indicating sustained exploration-exploitation balance that further unlocks RL training potential. Code is available at https://github.com/hp-luo/STARE.
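The "advantage times entropy sensitivity" decomposition can be checked numerically in a simplified setting. The sketch below assumes a policy parameterized directly by its logits (a tabular softmax), which is a textbook simplification and not necessarily the setting analyzed in the paper. Under that assumption, one gradient step on A * log p(a) changes entropy to first order by lr * A * g(a), where g(a) = p_a (s_a - H) - sum_k p_k^2 (s_k - H) and s_k = -log p_k is the surprisal. All function names are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_sensitivity(probs, sampled):
    """g(a) for a logit-parameterized softmax policy (assumed setting)."""
    h = entropy(probs)
    s = [-math.log(p) for p in probs]
    baseline = sum(p * p * (sk - h) for p, sk in zip(probs, s))
    return probs[sampled] * (s[sampled] - h) - baseline

def predicted_entropy_change(probs, sampled, advantage, lr):
    """First-order prediction: lr * advantage * sensitivity."""
    return lr * advantage * entropy_sensitivity(probs, sampled)

def actual_entropy_change(logits, sampled, advantage, lr):
    """Apply one policy-gradient step to the logits and measure entropy."""
    p = softmax(logits)
    stepped = [z + lr * advantage * ((1.0 if k == sampled else 0.0) - p[k])
               for k, z in enumerate(logits)]
    return entropy(softmax(stepped)) - entropy(p)

logits = [2.0, 0.5, -1.0]  # token 0 is likely, token 2 is rare
probs = softmax(logits)
g_low_surprisal = entropy_sensitivity(probs, 0)
g_high_surprisal = entropy_sensitivity(probs, 2)
```

In this example the sensitivity is negative for the likely token and positive for the rare one, so a positive advantage lowers entropy on the former and raises it on the latter, and a negative advantage flips both signs. That sign pattern is the four-quadrant picture in miniature.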

cs.LG cs.AI cs.CL