Breaking Entropy Bounds: Accelerating RL Training via MTP with Rejection Sampling

TL;DR

Proposes Bebop, which pairs an end-to-end TV loss with rejection sampling to stabilize the MTP acceptance rate, reaching acceptance rates of up to 95% and up to 1.8× end-to-end RL training acceleration.

cs.LG, Advanced, 2026-06-11

Yucheng Li, Huiqiang Jiang, Yang Xu, Jianxin Yang, Yi Zhang, Yizhong Cao, Yuhao Shen, Fan Zhou, Rui Men, Jianwei Zhang, An Yang, Bowen Yu, Bo Zheng, Fei Huang, Junyang Lin, Dayiheng Liu, Jingren Zhou

Reinforcement Learning, Large Language Models, Speculative Decoding, Rejection Sampling, Entropy Control

Key Findings

Methodology

This paper systematically studies Multi-Token Prediction (MTP) in large language models (LLMs) during reinforcement learning (RL) training. It finds that the MTP acceptance rate is fundamentally bounded by fluctuations in model entropy: as entropy rises during RL, the acceptance rate falls roughly linearly. The authors show that probabilistic rejection sampling is less sensitive to these entropy changes than greedy draft sampling, and they introduce an end-to-end (e2e) TV loss that directly optimizes the multi-step rejection sampling acceptance rate. MTP modules trained with this loss before RL keep a stable acceptance rate throughout RL training, which removes the need for costly online MTP updates. Experiments on Qwen3.5, Qwen3.6, and Qwen3.7 report acceptance rates of up to 95%, extra inference throughput gains of up to 25%, and end-to-end training acceleration of up to 1.8× across mathematical reasoning, coding, and agentic tasks.

Key Results

The MTP acceptance rate degrades linearly as model entropy rises, with drops of up to 3.5%. Training with the e2e TV loss improves the acceptance rate by about 10%, reaching up to 95%.
On Qwen3.5, Qwen3.6, and Qwen3.7 models, pre-RL MTP training with the e2e TV loss combined with rejection sampling yields an end-to-end speedup of up to 1.8× in async RL training, with up to 25% extra inference throughput.
MTP modules trained with TV loss keep a distributional overlap with the policy that stays largely stable under entropy fluctuations, and the gains hold across tasks and model sizes.

Significance

This work addresses the rollout bottleneck in RL training of large language models by breaking the entropy-bound limitation on MTP acceptance rates. The proposed TV loss and rejection sampling framework improves sampling efficiency, enabling faster training of RL-enhanced LLMs. It provides a theoretical foundation and practical recipes that may extend to multi-modal and multi-task scenarios.

Technical Contribution

The paper introduces an end-to-end TV loss that directly minimizes the total variation distance between the draft and target distributions, which keeps acceptance rates stable during RL. It analyzes the linear relationship between entropy and acceptance rate and shows that the conventional CE/KL objectives are suboptimal for rejection sampling. Combining probabilistic rejection sampling with MTP modules trained with TV loss before RL yields a sampling mechanism that is robust to entropy changes and supports high-speed RL training without online MTP updates.
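The link between total variation distance and acceptance rate can be made explicit with the standard single-token identity from speculative sampling. The derivation below is a sketch under assumed notation that does not come from the paper: p is the target (policy) distribution, q is the draft (MTP) distribution, and a draft token x sampled from q is accepted with probability min(1, p(x)/q(x)). It covers one draft position only; the paper's multi-step e2e objective is not reproduced here.

```latex
\begin{align}
\alpha(p, q)
  &= \sum_{x} q(x)\,\min\!\left(1, \frac{p(x)}{q(x)}\right)
   = \sum_{x} \min\bigl(p(x), q(x)\bigr) \\
  &= \sum_{x} \frac{p(x) + q(x) - \lvert p(x) - q(x) \rvert}{2}
   = 1 - \frac{1}{2} \sum_{x} \lvert p(x) - q(x) \rvert
   = 1 - \mathrm{TV}(p, q).
\end{align}
```

Under this identity, lowering TV(p, q) raises the single-token acceptance rate one for one. A KL objective controls TV only through an upper bound (Pinsker's inequality, TV ≤ sqrt(KL/2)), which is one way to read the claim that CE/KL training is suboptimal for rejection sampling.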

Novelty

This is the first comprehensive analysis linking model entropy to MTP acceptance limitations during RL, and the first to propose a TV-distance-based training objective for stable, high-acceptance MTP models. Unlike prior methods relying on online fine-tuning or greedy sampling, this approach combines theoretical insights with practical algorithms, achieving significant acceleration and robustness improvements.

Limitations

While the TV loss stabilizes acceptance rates, in extremely high-entropy scenarios, some decline persists, indicating room for further capacity or regularization improvements.
The rejection sampling mechanism introduces additional inference overhead, especially when acceptance rates are low, which may offset some speed gains in certain contexts.
Current validation is primarily on the Qwen series and specific tasks; applicability to other architectures, modalities, or more complex RL environments remains to be explored.

Future Work

Future directions include extending the TV loss framework to multi-modal models, integrating adaptive entropy control mechanisms, and exploring hardware-aware optimizations to reduce rejection sampling overhead. Additionally, investigating the combination of this approach with other RL techniques, such as reward modeling and curriculum learning, could further enhance training efficiency and model robustness.

AI Executive Summary

Reinforcement learning (RL) has become a cornerstone of training large language models (LLMs), helping models align with human preferences and handle complex tasks. However, the inference (rollout) phase of RL remains a significant bottleneck because of its high computational cost, especially for large models such as the Qwen series. To accelerate rollouts, speculative decoding with Multi-Token Prediction (MTP) has been adopted: the MTP module drafts several tokens ahead, and the model verifies them in one pass, which increases throughput.

Despite these advances, MTP acceptance rates tend to decline as RL training progresses. The decline is driven primarily by changes in the model's entropy: as entropy rises, the distribution over next tokens becomes more dispersed, so draft tokens are less likely to be accepted during verification. Conventional training objectives such as cross-entropy (CE) and KL divergence do not directly optimize the acceptance rate, which leaves performance suboptimal.
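A toy calculation can illustrate why a more dispersed distribution hurts acceptance. The sketch below is not the paper's experiment: the logits are made up, temperature is used only as a stand-in for entropy, and "greedy draft" is assumed to mean that the draft proposes its argmax token, which the target then accepts with probability equal to the target's own probability of that token.

```python
import math


def softmax(logits, temperature):
    """Temperature-scaled softmax over a small list of logits."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(p):
    """Shannon entropy in nats."""
    return -sum(x * math.log(x) for x in p if x > 0)


def greedy_draft_acceptance(p, q):
    """Assumed model: draft proposes argmax of q, accepted with probability p[argmax q]."""
    best = max(range(len(q)), key=lambda i: q[i])
    return p[best]


def rejection_sampling_acceptance(p, q):
    """Acceptance probability of probabilistic rejection sampling: sum of min(p, q)."""
    return sum(min(a, b) for a, b in zip(p, q))


# Hypothetical logits: the draft is close to, but not identical to, the target.
target_logits = [2.0, 1.0, 0.5, 0.0]
draft_logits = [1.8, 1.1, 0.4, 0.1]

rows = []
for temperature in (0.5, 1.0, 2.0):
    p = softmax(target_logits, temperature)
    q = softmax(draft_logits, temperature)
    rows.append((entropy(p), greedy_draft_acceptance(p, q), rejection_sampling_acceptance(p, q)))

for h, greedy, rejection in rows:
    print(f"entropy={h:.3f}  greedy={greedy:.3f}  rejection={rejection:.3f}")
```

In this toy setup the greedy acceptance falls as entropy rises, because it can never exceed the target's largest token probability, while the rejection sampling acceptance depends on the overlap between the two distributions rather than on how peaked they are.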

In response, the authors propose Bebop, a framework built on a theoretically grounded understanding of the entropy-acceptance relationship. They introduce a training objective based on the total variation (TV) distance, which directly minimizes the divergence between the draft and target distributions. This end-to-end (e2e) TV loss keeps the MTP acceptance rate high and stable throughout RL training despite entropy fluctuations. The paper also advocates probabilistic rejection sampling during inference, which is less sensitive to entropy changes than greedy draft sampling.
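For readers unfamiliar with probabilistic rejection sampling, the following is a minimal single-token sketch of the standard speculative sampling rule, not the authors' implementation. The function names and the two example distributions are hypothetical: p stands for the target distribution and q for the draft distribution.

```python
import random


def residual_distribution(p, q):
    """Normalized max(p - q, 0), used to resample after a rejection."""
    diff = [max(a - b, 0.0) for a, b in zip(p, q)]
    total = sum(diff)
    if total == 0.0:
        # p equals q, so a rejection cannot happen; fall back to p for safety.
        return list(p)
    return [d / total for d in diff]


def speculative_step(p, q, rng):
    """Sample a draft token from q and verify it against p.

    Returns (token, accepted). The draft is accepted with probability
    min(1, p[token] / q[token]); otherwise a token is drawn from the residual.
    """
    vocab = range(len(p))
    draft = rng.choices(vocab, weights=q, k=1)[0]
    if rng.random() < min(1.0, p[draft] / q[draft]):
        return draft, True
    resampled = rng.choices(vocab, weights=residual_distribution(p, q), k=1)[0]
    return resampled, False


# Hypothetical target and draft distributions over a 3-token vocabulary.
p = [0.6, 0.3, 0.1]
q = [0.4, 0.4, 0.2]

rng = random.Random(0)
n_trials = 20000
counts = [0, 0, 0]
accepted = 0
for _ in range(n_trials):
    token, ok = speculative_step(p, q, rng)
    counts[token] += 1
    accepted += ok

empirical = [c / n_trials for c in counts]
acceptance_rate = accepted / n_trials
print("empirical output distribution:", empirical)
print("empirical acceptance rate:", acceptance_rate)
```

The emitted tokens follow the target distribution p regardless of how good the draft is; draft quality only affects how often the draft token is accepted, which for these two distributions is sum(min(p, q)) = 0.8 in expectation.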

Experiments on Qwen3.5, Qwen3.6, and Qwen3.7 models across mathematical reasoning, coding, and agentic tasks demonstrate the effectiveness of this approach. MTP modules trained with TV loss and used with rejection sampling reach acceptance rates of up to 95%, with up to 25% extra inference throughput. Overall, async RL training is accelerated by up to 1.8× end to end, which reduces computational cost.
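To see why a few points of acceptance rate matter, one can use the standard back-of-the-envelope estimate from the speculative decoding literature for tokens produced per verification pass. The sketch below assumes that each draft token is accepted independently with the same probability alpha and that k tokens are drafted per pass. Both are simplifications, and the values of k and alpha used here are illustrative, not taken from the paper.

```python
def expected_tokens_per_verification(alpha, k):
    """Expected tokens emitted per target pass with k draft tokens.

    Assumes i.i.d. per-token acceptance with probability alpha. The count
    includes the one token the target always contributes (a correction
    after a rejection, or a bonus token when all drafts are accepted).
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


# Illustrative comparison; k=3 and alpha=0.85 are hypothetical settings.
for alpha in (0.85, 0.95):
    tokens = expected_tokens_per_verification(alpha, k=3)
    print(f"alpha={alpha:.2f}  expected tokens per pass={tokens:.3f}")
```

This counts tokens only. Realized throughput and end-to-end speedup also depend on drafting cost, batching, and the rest of the RL pipeline, so the figures reported in the paper cannot be derived from this formula alone.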

This work offers a fundamental shift in how MTP models are trained and deployed in RL contexts. By mathematically characterizing the linear relationship between entropy and acceptance rate, and proposing a practical training loss that stabilizes this relationship, the authors provide a scalable solution to a long-standing bottleneck. The combination of theoretical rigor and engineering innovation marks a major step toward more efficient, robust, and scalable large language models in reinforcement learning settings.

Looking forward, future research can explore extending this framework to multi-modal models, integrating adaptive entropy regulation, and optimizing hardware implementations for rejection sampling. These developments promise to further push the boundaries of large-scale AI training, making powerful models more accessible and practical for real-world applications.

Deep Dive

Plain Language

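Imagine working in a factory that turns out all kinds of products every day. To improve efficiency, the factory introduces a set of pre-made molds (like the model's predictions) that can produce most of the products (tokens) ahead of time. The catch is that the factory's orders change every day. Sometimes the orders are clear-cut (low entropy), the molds produce almost exactly what was ordered, and efficiency is high. At other times the orders are complicated (high entropy), the molds struggle to predict them, and many products fail to match the order (low acceptance rate).

To deal with this, the factory adopts a new method: when a product predicted by the mold does not match the order, the factory randomly rejects that prediction and picks again from the remaining possible products (rejection sampling). That way, even when orders vary widely, the factory keeps its efficiency and accuracy (the acceptance rate) high.

More importantly, the factory also designs a special tuning mechanism so that the molds learn, before production starts, how to stay accurate across all kinds of order changes (training with TV loss). This avoids the tedious process of re-tuning the molds every time the orders change. It is like preparing in advance a set of molds that adapt to different orders and stay efficient throughout the production run.

In the end, this method greatly increases the factory's production speed (training acceleration) while coping with constantly changing orders (model entropy fluctuations), achieving both speed and accuracy. It is like an efficient, smart factory that runs stably in all kinds of complex environments, saving time and cost.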

Abstract

Reinforcement learning (RL) has become a key component in modern large language models, yet the rollout stage remains the key bottleneck in RL training pipelines. Although Multi-Token Prediction (MTP) offers a natural solution to accelerate rollouts through speculative decoding, many studies have observed that MTP acceptance rates degrade significantly during RL training, leading to limited speedup performance. To address this bottleneck, we present Bebop, a systematic study of MTP in LLM post-training, and offer practical recipes to integrate MTP into large-scale RL pipelines. First, we reveal that the MTP acceptance rate is fundamentally bounded by the fluctuation of model entropy, which demonstrates a clear negative linear relationship with the rise of entropy in the RL stage. Second, we show that probabilistic rejection sampling largely alleviates the disturbance introduced by entropy in RL compared to greedy draft sampling. We further identify that the conventional MTP training objectives (cross-entropy or KL) are suboptimal in such settings, and therefore we propose a novel end-to-end TV loss that directly optimizes multi-step rejection sampling acceptance rate, yielding ~10% acceptance rate improvements, achieving up to 95% acceptance rates and up to 25% extra inference throughput gains across mathematical reasoning, code generation, and agentic tasks. Third, we test various online MTP training strategies during RL and show that pre-RL MTP training with e2e TV loss and rejection sampling achieves a consistent acceptance rate and speedup throughout the entire RL, eliminating the need for costly online MTP updating. We provide extensive experiments and analysis that validate our findings. Experimental results show our method achieves up to 1.8x end-to-end acceleration in async RL training of Qwen3.5, Qwen3.6, and Qwen3.7 models.

cs.LG cs.CL
