Breaking Entropy Bounds: Accelerating RL Training via MTP with Rejection Sampling
Proposes Bebop, which combines a TV loss with rejection sampling to stabilize the MTP acceptance rate, reaching acceptance rates of up to 95% and up to 1.8× acceleration of RL training.
Key Findings
Methodology
This paper systematically analyzes Multi-Token Prediction (MTP) in large language models (LLMs), focusing on reinforcement learning (RL) training. The study finds that MTP acceptance rates are fundamentally bounded by fluctuations in model entropy: as entropy rises during RL, the acceptance rate falls in a roughly linear fashion. To address this, the authors use probabilistic rejection sampling for verification and introduce an end-to-end (e2e) TV loss that directly optimizes the multi-step rejection sampling acceptance rate. MTP modules trained with the TV loss before RL maintain stable acceptance rates throughout RL training, which avoids costly online MTP updates. Experiments on Qwen3.5, Qwen3.6, and Qwen3.7 report acceptance rates of up to 95%, inference throughput gains of up to 25%, and end-to-end training acceleration of up to 1.8× across reasoning, coding, and agentic tasks.
Key Results
- The study finds that MTP acceptance rates degrade linearly with increasing model entropy, with drops up to 3.5%. Training the MTP module with the TV loss improves acceptance rates by roughly 10%, reaching up to 95%.
- On Qwen3.5, Qwen3.6, and Qwen3.7 models, the combination of pre-RL MTP training, TV loss, and rejection sampling yields an end-to-end training speedup of up to 1.8×, with inference throughput gains of up to 25%.
- Models trained with TV loss exhibit distributional overlap with the policy that remains invariant to entropy fluctuations, demonstrating robustness and superior generalization across tasks and model sizes.
Significance
This work addresses the critical bottleneck of inference speed in RL training of large language models by breaking the entropy-bound limitations on MTP acceptance rates. The proposed TV loss and rejection sampling framework significantly enhance sampling efficiency, enabling faster training and broader deployment of RL-enhanced LLMs. It provides a theoretical foundation and practical engineering solutions that can be extended to multi-modal and multi-task scenarios, marking a substantial step forward in scalable AI training.
Technical Contribution
The paper introduces a novel end-to-end TV loss that directly minimizes the total variation distance between the draft and target distributions, ensuring stable acceptance rates during RL. It rigorously analyzes the linear relationship between entropy and acceptance rate, demonstrating that traditional CE/KL objectives are suboptimal for rejection sampling. The integration of probabilistic rejection sampling with pretrained TV-optimized models yields a robust, entropy-invariant sampling mechanism, facilitating high-speed RL training without online MTP updates.
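The link between total variation distance and acceptance rate can be made explicit for a single draft position. The derivation below is a sketch under one assumption: verification uses the standard speculative sampling rule, in which a draft token x drawn from q is accepted with probability min(1, p(x)/q(x)). The symbols p (target policy distribution), q (MTP draft distribution), and α (acceptance probability) are notation introduced here and may differ from the paper's.

```latex
\begin{aligned}
\alpha
  &= \sum_x q(x)\,\min\!\left(1, \frac{p(x)}{q(x)}\right)
   = \sum_x \min\bigl(p(x),\, q(x)\bigr) \\
  &= \sum_x \frac{p(x) + q(x) - \lvert p(x) - q(x) \rvert}{2}
   = 1 - \frac{1}{2} \sum_x \lvert p(x) - q(x) \rvert
   = 1 - \mathrm{TV}(p, q).
\end{aligned}
```

Under this rule, minimizing TV(p, q) is the same as maximizing the per-token acceptance probability. A KL objective only controls the acceptance rate indirectly, for example through Pinsker's inequality TV(p, q) ≤ sqrt(KL(p‖q)/2), which is one way to read the claim that CE/KL objectives are suboptimal for rejection sampling.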
Novelty
This is the first comprehensive analysis linking model entropy to MTP acceptance limitations during RL, and the first to propose a TV-distance-based training objective for stable, high-acceptance MTP models. Unlike prior methods relying on online fine-tuning or greedy sampling, this approach combines theoretical insights with practical algorithms, achieving significant acceleration and robustness improvements.
Limitations
- While the TV loss stabilizes acceptance rates, in extremely high-entropy scenarios, some decline persists, indicating room for further capacity or regularization improvements.
- The rejection sampling mechanism introduces additional inference overhead, especially when acceptance rates are low, which may offset some speed gains in certain contexts.
- Current validation is primarily on the Qwen series and specific tasks; applicability to other architectures, modalities, or more complex RL environments remains to be explored.
Future Work
Future directions include extending the TV loss framework to multi-modal models, integrating adaptive entropy control mechanisms, and exploring hardware-aware optimizations to reduce rejection sampling overhead. Additionally, investigating the combination of this approach with other RL techniques, such as reward modeling and curriculum learning, could further enhance training efficiency and model robustness.
AI Executive Summary
Reinforcement learning (RL) has become a cornerstone of training large language models (LLMs), helping models align with human preferences and handle complex tasks. However, the inference (rollout) phase of RL remains a significant bottleneck because of its high computational cost, especially for large models such as the Qwen series. To accelerate rollouts, speculative decoding techniques such as Multi-Token Prediction (MTP) have been adopted: the MTP module drafts several tokens ahead, and the target model verifies them in a single pass, which increases throughput.
Despite these advances, a persistent challenge has been the decline in MTP acceptance rates as RL training progresses. This decline is primarily driven by fluctuations in the model’s entropy, which causes the distribution of predicted tokens to become more dispersed, reducing the likelihood that the draft tokens will be accepted during verification. Traditional training objectives like cross-entropy (CE) and KL divergence are insufficient to address this issue, as they do not directly optimize the acceptance rate, leading to suboptimal performance.
In response, the authors propose Bebop, a framework built on a theoretically grounded understanding of the entropy-acceptance relationship. They introduce a training objective based on the total variation (TV) distance, which directly minimizes the divergence between the draft and target distributions. This end-to-end (e2e) TV loss keeps the MTP acceptance rate high and stable throughout RL training despite entropy fluctuations. The paper also advocates probabilistic rejection sampling during inference, which is less sensitive to entropy changes than greedy draft sampling.
Experiments on Qwen3.5, Qwen3.6, and Qwen3.7 models across reasoning, coding, and agentic tasks demonstrate the effectiveness of this approach. MTP modules trained with the TV loss and verified with rejection sampling reach acceptance rates of up to 95%, with inference throughput gains of up to 25%. The overall RL training process is accelerated by up to 1.8×, which reduces computational cost.
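The contrast between probabilistic rejection sampling and greedy drafting can be illustrated with a small simulation. This is a minimal sketch, not the paper's implementation: it assumes the standard speculative sampling acceptance rule, and it assumes that greedy drafting means the draft always proposes its argmax token (a point-mass proposal). The function names `speculative_accept`, `greedy_accept`, and `acceptance_rate` are hypothetical.

```python
import random


def speculative_accept(p, q, rng):
    """One verification step of probabilistic rejection sampling.

    p: target (policy) distribution, q: draft (MTP) distribution,
    both lists of probabilities over the same vocabulary.
    Returns (token, accepted).
    """
    vocab = range(len(p))
    x = rng.choices(vocab, weights=q)[0]        # draft proposes x ~ q
    if rng.random() < min(1.0, p[x] / q[x]):    # accept with prob min(1, p/q)
        return x, True
    # On rejection, resample from the normalized residual max(p - q, 0).
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    return rng.choices(vocab, weights=residual)[0], False


def greedy_accept(p, q, rng):
    """Assumed greedy variant: the draft always proposes its argmax token.

    The proposal is a point mass at x, so the acceptance probability is p[x].
    """
    x = max(range(len(q)), key=q.__getitem__)
    if rng.random() < p[x]:
        return x, True
    residual = list(p)
    residual[x] = 0.0                           # target mass outside x
    return rng.choices(range(len(p)), weights=residual)[0], False


def acceptance_rate(step, p, q, n=20000, seed=0):
    """Monte Carlo estimate of the fraction of accepted draft tokens."""
    rng = random.Random(seed)
    return sum(step(p, q, rng)[1] for _ in range(n)) / n


if __name__ == "__main__":
    # High-entropy target that the draft matches perfectly.
    p = [0.25, 0.25, 0.25, 0.25]
    print("rejection sampling:", acceptance_rate(speculative_accept, p, p))
    print("greedy drafting:   ", acceptance_rate(greedy_accept, p, p))
```

In this toy setting the draft matches a uniform (high-entropy) target exactly, so rejection sampling accepts every token, while greedy drafting can accept no more often than the largest target probability, which shrinks as entropy grows. That is one intuition for why the probabilistic rule is less sensitive to entropy.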
This work offers a fundamental shift in how MTP models are trained and deployed in RL contexts. By mathematically characterizing the linear relationship between entropy and acceptance rate, and proposing a practical training loss that stabilizes this relationship, the authors provide a scalable solution to a long-standing bottleneck. The combination of theoretical rigor and engineering innovation marks a major step toward more efficient, robust, and scalable large language models in reinforcement learning settings.
Looking forward, future research can explore extending this framework to multi-modal models, integrating adaptive entropy regulation, and optimizing hardware implementations for rejection sampling. These developments promise to further push the boundaries of large-scale AI training, making powerful models more accessible and practical for real-world applications.
Deep Dive
Plain Language (accessible to non-experts)
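Imagine you work in a factory that turns out a variety of products every day. To improve efficiency, the factory introduces a set of pre-made molds (the model's predictions) that can produce most of the products (tokens) ahead of time. The catch is that the factory's orders change daily. Sometimes the orders are clear-cut (low entropy): the molds produce almost exactly what was ordered, and efficiency is high. At other times the orders are complicated (high entropy): the molds struggle to predict them, and many products fail to match the order (low acceptance rate).
To deal with this, the factory adopts a new method: when a mold's predicted product does not match the order, the factory randomly rejects the prediction and chooses again from the remaining possible products (rejection sampling). This way, even when orders vary widely, the factory keeps its efficiency and accuracy (the acceptance rate) high.
More importantly, the factory also designs a special tuning mechanism so that the molds learn in advance how to stay accurate across all kinds of order changes (TV loss training). This avoids the tedious process of re-tuning the molds every time the orders change. It is like preparing a set of molds that adapt to different orders and stay efficient throughout production.
In the end, the method greatly speeds up the factory's production (training acceleration) while coping with constantly changing orders (fluctuations in model entropy), so it is both fast and accurate. The result is an efficient, smart factory that runs stably in all kinds of complex conditions, saving time and cost.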
Abstract
Reinforcement learning (RL) has become a key component in modern large language models, yet the rollout stage remains the key bottleneck in RL training pipelines. Although Multi-Token Prediction (MTP) offers a natural solution to accelerate rollouts through speculative decoding, many studies have observed that MTP acceptance rates degrade significantly during RL training, leading to limited speedup performance. To address this bottleneck, we present Bebop, a systematic study of MTP in LLM post-training, and offer practical recipes to integrate MTP into large-scale RL pipelines. First, we reveal that the MTP acceptance rate is fundamentally bounded by the fluctuation of model entropy, which demonstrates a clear negative linear relationship with the rise of entropy in the RL stage. Second, we show that probabilistic rejection sampling largely alleviates the disturbance introduced by entropy in RL compared to greedy draft sampling. We further identify that the conventional MTP training objectives (cross-entropy or KL) are suboptimal in such settings, and therefore we propose a novel end-to-end TV loss that directly optimizes multi-step rejection sampling acceptance rate, yielding ~10% acceptance rate improvements, achieving up to 95% acceptance rates and up to 25% extra inference throughput gains across mathematical reasoning, code generation, and agentic tasks. Third, we test various online MTP training strategies during RL and show that pre-RL MTP training with e2e TV loss and rejection sampling achieves a consistent acceptance rate and speedup throughout the entire RL, eliminating the need for costly online MTP updating. We provide extensive experiments and analysis that validate our findings. Experimental results show our method achieves up to 1.8x end-to-end acceleration in async RL training of Qwen3.5, Qwen3.6, and Qwen3.7 models.
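To make "directly optimizes multi-step rejection sampling acceptance rate" more concrete, the sketch below computes the expected number of accepted draft tokens for a k-step draft and turns it into a loss. This is an illustrative guess at the shape of such an objective, not the paper's definition: the names `step_acceptance`, `expected_accepted_drafts`, and `e2e_tv_loss` are hypothetical, the normalization by k is an assumption, and the per-position distributions are treated as fixed along one sampled prefix. A real training loop would compute the same quantity on model outputs in an autodiff framework.

```python
def step_acceptance(p, q):
    """Per-position acceptance probability under rejection sampling.

    Equals sum_x min(p(x), q(x)), which is 1 - TV(p, q).
    """
    return sum(min(pi, qi) for pi, qi in zip(p, q))


def expected_accepted_drafts(targets, drafts):
    """Expected number of accepted draft tokens for a k-step draft.

    Draft token j is only kept if tokens 1..j-1 were all accepted,
    so each term is a cumulative product of per-step acceptances.
    """
    total, prefix = 0.0, 1.0
    for p, q in zip(targets, drafts):
        prefix *= step_acceptance(p, q)
        total += prefix
    return total


def e2e_tv_loss(targets, drafts):
    """Hypothetical end-to-end loss: 1 - (expected accepted drafts) / k.

    It is 0 when every draft distribution matches its target exactly
    and approaches 1 when early draft steps are almost always rejected.
    """
    k = len(drafts)
    return 1.0 - expected_accepted_drafts(targets, drafts) / k


if __name__ == "__main__":
    targets = [[0.5, 0.5], [0.5, 0.5]]
    matched = [[0.5, 0.5], [0.5, 0.5]]
    off_first = [[1.0, 0.0], [0.5, 0.5]]  # first draft step is overconfident
    print(e2e_tv_loss(targets, matched))
    print(e2e_tv_loss(targets, off_first))
```

Because of the cumulative product, an error at an early draft position also discounts every later position. A multi-step objective of this kind captures that effect, whereas a per-position CE or KL loss treats each position independently.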