SortedRL: Accelerating RL Training for LLMs through Online Length-Aware Scheduling

TL;DR

SortedRL accelerates RL training for LLMs through online length-aware scheduling, enhancing efficiency and performance.

cs.LG · Advanced · 2026-03-25
Yiqi Zhang, Huiqiang Jiang, Xufang Luo, Zhihe Yang, Chengruidong Zhang, Yifei Shen, Dongsheng Li, Yuqing Yang, Lili Qiu, Yang You
Reinforcement Learning · Large Language Models · Online Scheduling · Sample Efficiency · Training Acceleration

Key Findings

Methodology

SortedRL is an online length-aware scheduling strategy designed to accelerate RL training for large language models by optimizing the efficiency of the rollout phase. The core idea is to reorder rollout samples based on output lengths, prioritizing shorter samples for early updates. This approach enables large rollout batches, flexible update batches, and near on-policy micro-curriculum construction. Additionally, SortedRL incorporates a cache-based mechanism to control the degree of off-policy training and is supported by a dedicated RL infrastructure that manages rollout and updates.
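
To make the scheduling concrete, here is a minimal Python sketch of the short-first flush loop (a sketch only: `generate_step`, `is_finished`, and `policy_update` are hypothetical placeholders, not interfaces from the paper):

```python
from collections import deque

def sorted_rl_step(prompts, generate_step, is_finished, policy_update,
                   update_batch_size=8):
    """One rollout phase: launch one large rollout batch, then flush
    finished (shortest-so-far) samples to the trainer as soon as a full
    update batch is ready."""
    in_flight = {i: [] for i in range(len(prompts))}  # token buffer per prompt
    ready = deque()                                   # finished trajectories
    while in_flight:
        for i in list(in_flight):
            token = generate_step(prompts[i], in_flight[i])  # one decode step
            in_flight[i].append(token)
            if is_finished(in_flight[i]):
                ready.append((prompts[i], in_flight.pop(i)))
        # Short outputs finish first, so batches come out ordered roughly
        # short-to-long: a near on-policy micro-curriculum.
        while len(ready) >= update_batch_size:
            batch = [ready.popleft() for _ in range(update_batch_size)]
            policy_update(batch)
```

Because short outputs complete first, update batches emerge in roughly ascending length order, yielding the micro-curriculum without a separate sorting pass.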

Key Results

  • SortedRL reduced RL training bubble ratios by over 50% in experiments with LLaMA-3.1-8B and Qwen-2.5-32B, and achieved 3.9% to 18.4% better performance than baselines on tasks such as logical puzzles and math benchmarks including AIME 24, Math 500, and Minerva.
  • On logical reasoning tasks, LLaMA-3.1-8B-Instruct trained with SortedRL matched the peak score of vanilla Reinforce++ training while using 40.74% fewer samples.
  • On mathematical benchmarks, SortedRL performed strongly on OlympiadBench, AIME 2024, and AMC 2023, demonstrating its effectiveness on complex tasks.

Significance

SortedRL addresses the primary bottleneck in RL training for large language models by improving rollout efficiency and sample utilization. Its online length-aware scheduling not only speeds up training but also markedly improves performance in logical reasoning and mathematical problem-solving. These results show how better scheduling can overcome low hardware utilization in large-scale model training and point to a direction for future systems work.

Technical Contribution

The technical contributions of SortedRL are primarily reflected in its innovative online length-aware scheduling strategy, which improves hardware utilization and training efficiency by reordering rollout samples. Additionally, SortedRL introduces a cache mechanism to control the degree of off-policy training and designs a dedicated RL infrastructure to support this strategy. These technical innovations not only enhance training efficiency but also provide new engineering possibilities for RL training of large-scale models.

Novelty

SortedRL is the first to propose an online length-aware scheduling strategy that improves training efficiency by optimizing the ordering of rollout samples. This strategy differs from previous work by dynamically adjusting the processing order of samples, achieving near on-policy training without additional overhead. This innovation is significant in the context of RL training for large-scale models.

Limitations

  • SortedRL may still see uneven hardware utilization on very long generation sequences, since stragglers with extended generation times can leave some hardware idle even after reordering.
  • The method's gains may vary across tasks, especially when task characteristics differ substantially from the training data distribution.
  • Although SortedRL performs well across multiple tasks, its scalability and stability on larger models still require further verification.

Future Work

Future research could further optimize SortedRL's scheduling strategy to better fit the characteristics and needs of different tasks, explore its application to larger models, and combine it with other optimization techniques to further improve training efficiency and model performance. Integrating SortedRL with other advanced training frameworks is another promising direction.

AI Executive Summary

In the training of large language models, reinforcement learning (RL) is considered a key methodology for enhancing model reasoning capabilities, particularly in tasks requiring long chain-of-thought generation. However, RL training efficiency is often bottlenecked by the rollout phase, which can account for up to 70% of total training time due to slow autoregressive generation and synchronization overhead between rollout and policy updates.

To address this bottleneck, SortedRL proposes an online length-aware scheduling strategy to accelerate RL training by optimizing rollout efficiency. SortedRL reorders rollout samples based on output lengths, prioritizing short samples for early updates. This approach enables large rollout batches, flexible update batches, and near on-policy micro-curriculum construction.

SortedRL also incorporates a cache mechanism to control the degree of off-policy training and is supported by a dedicated RL infrastructure that manages rollout and updates. Experiments demonstrate that SortedRL reduces RL training bubble ratios by over 50% and achieves 3.9% to 18.4% superior performance over baselines on tasks like logical puzzles and math challenges.

The success of SortedRL showcases how optimizing scheduling strategies can significantly improve training efficiency and model performance in large-scale model training. This strategy not only addresses the primary bottleneck in RL training for LLMs but also provides new insights for future research and applications.

However, SortedRL may still encounter issues with uneven hardware utilization when handling very long generation sequences. Additionally, the performance of this method may vary across different tasks, especially when there is a significant difference between task characteristics and training data distribution. Future research could further optimize the scheduling strategy of SortedRL to better adapt to the characteristics and needs of different tasks.

Deep Analysis

Background

In recent years, large language models (LLMs) have achieved remarkable performance across a wide range of tasks, particularly in natural language processing and generation. As model scale continues to grow, training these models efficiently has become an important research direction. Reinforcement learning (RL) has gained attention as a method to enhance model reasoning capabilities: it guides training by generating intermediate reasoning steps and applying outcome-based rewards, which has been shown to significantly improve performance on complex tasks. However, RL training efficiency is often limited by the rollout phase, where generating long sequences leads to low hardware utilization. Researchers have proposed optimizations such as continuous batching and chunked prefill, but these still face challenges in practice.

Core Problem

In RL training for large language models, the rollout phase is the primary bottleneck, as generating long sequences requires substantial time and computational resources. The generation process is autoregressive, meaning that the generation speed for long sequences is slow, leading to underutilized hardware resources. Additionally, commonly used RL algorithms are on-policy, meaning that updates cannot occur until generation is complete. When response lengths vary widely across samples in a batch, it leads to inefficient hardware utilization, creating so-called 'bubbles.' Improving the efficiency of the rollout phase and reducing computational resource waste is the core problem in current research.
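
A back-of-the-envelope calculation makes the bubble concrete, assuming the bubble ratio is the idle fraction of decode slots in a synchronous batch that waits for its longest member (an assumed definition for illustration; the paper's exact metric may differ):

```python
def bubble_ratio(lengths):
    """Idle fraction of decode slots when a synchronous batch must wait
    for its longest member (assumed definition, for illustration)."""
    longest = max(lengths)
    useful = sum(lengths)           # decode steps doing real work
    total = longest * len(lengths)  # slots occupied until the batch ends
    return 1 - useful / total

print(bubble_ratio([512, 600, 550, 16000]))  # ~0.72: one 16k straggler
print(bubble_ratio([512, 600, 550, 700]))    # ~0.16: similar lengths
```

A single 16k-token straggler pushes the batch from roughly 16% idle to roughly 72% idle, which is why length variance, more than average length, drives the waste.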

Innovation

SortedRL introduces an innovative online length-aware scheduling strategy to improve training efficiency by optimizing the ordering of rollout samples.

  • Online Length-Aware Scheduling: Reorders rollout samples based on output lengths, prioritizing short samples for early updates. This enables large rollout batches, flexible update batches, and near on-policy micro-curriculum construction.
  • Cache Mechanism: Controls the degree of off-policy training by caching unfinished samples, accelerating the pipeline (see the sketch after this list).
  • Dedicated RL Infrastructure: A dedicated infrastructure that manages rollout and updates to maximize throughput and maintain training consistency.
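
As referenced above, here is a hedged sketch of how such a cache might bound staleness; the interface is invented for illustration and is not the paper's API:

```python
class RolloutCache:
    """Sketch: park unfinished samples with the policy version that
    generated them, and bound how stale reused samples may become."""

    def __init__(self, max_staleness=1):
        self.max_staleness = max_staleness
        self.entries = []  # list of (policy_version, partial_sample)

    def park(self, policy_version, sample):
        self.entries.append((policy_version, sample))

    def usable(self, current_version):
        """Return cached samples within the staleness bound; keep the
        rest parked (they may need regeneration or importance weights)."""
        fresh = [(v, s) for v, s in self.entries
                 if current_version - v <= self.max_staleness]
        self.entries = [(v, s) for v, s in self.entries
                        if current_version - v > self.max_staleness]
        return [s for _, s in fresh]
```

In this sketch, `max_staleness` is the assumed control knob: 0 forces strictly on-policy reuse, while larger values trade freshness for fewer regenerations.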

Methodology

The implementation of SortedRL includes the following key steps:

  • Online Length-Aware Scheduling: Reorders rollout samples based on output lengths, prioritizing short samples.
  • Cache Mechanism: Accelerates the pipeline by caching unfinished samples.
  • Dedicated RL Infrastructure: Manages rollout and updates through a stateful controller and rollout buffer.
  • Generation Length-Aware Scheduling: Dynamically adjusts the processing order of samples by predicting generation lengths.
  • Grouped Rollout and Micro-curriculum: Organizes prompts into groups of batches, ensuring all prompts are fully processed within a bounded timespan.
  • Selective Batching for Training: Feeds ready trajectories to the trainer in a dedicated order and combination based on batch readiness (a controller-loop sketch follows this list).
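
The controller-loop sketch promised above ties these steps together. All object interfaces here are hypothetical: the paper describes a stateful controller and rollout buffer, but this is not their actual API:

```python
def controller_loop(rollout_buffer, trainer, update_batch_size=8,
                    max_steps=1000):
    """Illustrative stateful-controller loop: drain finished trajectories
    from the rollout buffer and form update batches by readiness, so the
    trainer never waits on the longest stragglers."""
    for _ in range(max_steps):
        rollout_buffer.advance_one_decode_step()      # every engine, one token
        trainer.stage(rollout_buffer.pop_finished())  # shortest finish first
        if trainer.num_staged() >= update_batch_size:
            trainer.update(update_batch_size)         # near on-policy update
            rollout_buffer.set_policy(trainer.weights())  # publish new weights
```

Note how the trainer forms batches from whatever is ready rather than waiting for the full rollout batch to complete, which is the selective-batching idea.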

Experiments

SortedRL was extensively tested on LLaMA-3.1-8B and Qwen-2.5-32B across various tasks, including logical puzzles and mathematical challenges. The experimental design includes:

  • Datasets: LogicRL and DAPO-Math-17k, used for logical reasoning and mathematical problems, respectively.
  • Baselines: Compared with Reinforce++ and PPO.
  • Evaluation Metrics: Accuracy, bubble ratio, response length, etc.
  • Key Hyperparameters: Rollout batch size, update batch size, cache strategy, etc.
  • Ablation Studies: Analyzed the impact of individual components on performance.

Results

SortedRL demonstrated significant performance improvements across multiple tasks:

  • On logical reasoning tasks, LLaMA-3.1-8B-Instruct trained with SortedRL matched the peak score of vanilla Reinforce++ training while using 40.74% fewer samples.
  • On mathematical benchmarks, SortedRL performed strongly on OlympiadBench, AIME 2024, and AMC 2023, demonstrating its effectiveness on complex tasks.
  • RL training bubble ratios were reduced by over 50%, with 3.9% to 18.4% better performance than baselines on logical puzzles and math challenges.

Applications

SortedRL's application scenarios include:

  • Enhancing large language models' performance in logical reasoning and mathematical problem-solving, suitable for tasks requiring long chain-of-thought generation.
  • Improving hardware utilization and training efficiency in large-scale model training by optimizing scheduling strategies.
  • Combining with other optimization techniques to achieve more efficient model training and applications.

Limitations & Outlook

SortedRL may still see uneven hardware utilization on very long generation sequences, and its gains may vary across tasks, especially when task characteristics differ substantially from the training data distribution. Future research could further optimize the scheduling strategy to better fit the characteristics and needs of different tasks.

Plain Language (accessible to non-experts)

Imagine you're in a kitchen preparing a meal. You have various ingredients that need different preparation times: chopping vegetables, cooking rice, grilling meat. If you don't plan well, some dishes sit ready and going cold while others are still cooking. SortedRL is like a smart chef who orders the work by cooking time, starting with whatever finishes fastest, so the whole meal comes together sooner and every ingredient is used while still fresh. SortedRL does the same for large language models: it orders training samples so that short ones are used first, which speeds up training while keeping each update fresh.

ELI14 (explained like you're 14)

Imagine you're playing a puzzle game with many levels, each with different difficulty. Some levels are easy, while others are really hard. You want to finish the game quickly, but sometimes you get stuck on a tough level and waste a lot of time. SortedRL is like a smart game assistant that helps you arrange these levels so you can play the easy ones first, quickly gaining experience and skills before tackling the harder ones. This method not only helps you finish the game faster but also makes sure you perform better in each level. In the training of large language models, SortedRL improves training efficiency and model performance by optimizing the order of sample processing.

Glossary

Reinforcement Learning

A machine learning method in which a model learns an optimal policy by maximizing reward signals.

Used in the paper to enhance the reasoning capabilities of large language models.

Large Language Model

A deep learning-based model capable of processing and generating natural language text.

LLaMA-3.1-8B and Qwen-2.5-32B are used in experiments in the paper.

Rollout

In reinforcement learning, it refers to the process where the model generates a series of actions and states based on the current policy.

The rollout phase is the main bottleneck in training as discussed in the paper.

On-policy

A training regime in which the model is updated only with data generated by its current policy.

SortedRL achieves near on-policy training through online length-aware scheduling.

Off-policy

A training regime in which the model may be updated with data generated by older policies.

SortedRL controls the degree of off-policy training through a cache mechanism.

Bubble Ratio

The proportion of time during computation when hardware resources are underutilized.

SortedRL reduces the bubble ratio by optimizing scheduling strategies.

Autoregressive Generation

A sequence generation method where each step's output depends on the previous outputs.

Autoregressive generation leads to inefficiencies in the rollout phase as discussed in the paper.

Micro-curriculum

A training strategy that gradually increases task difficulty to improve model learning outcomes.

SortedRL constructs a near on-policy micro-curriculum through sample sorting.

Cache Mechanism

A strategy for storing and managing unfinished samples to accelerate the training process.

SortedRL uses a cache mechanism to control the degree of off-policy training.

LLaMA-3.1-8B

A large language model with 8B parameters used for experiments on logical reasoning tasks.

Used in the paper to validate SortedRL's performance in logical reasoning tasks.

Qwen-2.5-32B

A large language model with 32B parameters used for experiments on mathematical problems.

Used in the paper to validate SortedRL's performance in mathematical problems.

Open Questions (unanswered questions from this research)

  1. SortedRL may still encounter uneven hardware utilization on very long generation sequences; further work could refine the scheduling strategy for different task characteristics.
  2. Its scalability and stability on larger models still need verification; integrating SortedRL with other advanced training frameworks could enable more efficient training.
  3. Performance may vary across tasks, especially when task characteristics differ from the training data distribution; adapting the method to such differences remains open.
  4. The effectiveness of the cache mechanism may vary across tasks; how to tune cache strategies for efficiency and performance is unexplored.
  5. The benefit of online length-aware scheduling itself may differ across tasks; tailoring it to specific task needs remains to be studied.

Applications

Immediate Applications

Logical Reasoning Tasks

SortedRL can enhance large language models' performance in logical reasoning tasks, suitable for tasks requiring long chain-of-thought generation.

Mathematical Problem Solving

SortedRL demonstrates outstanding performance in mathematical problems, improving models' performance in math competitions and challenges.

Large-scale Model Training

SortedRL improves training efficiency and hardware utilization in large-scale models by optimizing scheduling strategies.

Long-term Vision

Intelligent Assistants

SortedRL can be used to train smarter language models, enhancing the reasoning capabilities and response speed of intelligent assistants.

Automated Reasoning Systems

SortedRL can be used to develop more efficient automated reasoning systems for scientific research and technological development.

Abstract

Scaling reinforcement learning (RL) has shown strong promise for enhancing the reasoning abilities of large language models (LLMs), particularly in tasks requiring long chain-of-thought generation. However, RL training efficiency is often bottlenecked by the rollout phase, which can account for up to 70% of total training time when generating long trajectories (e.g., 16k tokens), due to slow autoregressive generation and synchronization overhead between rollout and policy updates. We propose SortedRL, an online length-aware scheduling strategy designed to address this bottleneck by improving rollout efficiency and maintaining training stability. SortedRL reorders rollout samples based on output lengths, prioritizing short samples and forming groups for early updates. This enables large rollout batches, flexible update batches, and near on-policy micro-curriculum construction simultaneously. To further accelerate the pipeline, SortedRL controls the degree of off-policy training through a cache-based mechanism, and is supported by a dedicated RL infrastructure that manages rollout and updates via a stateful controller and rollout buffer. Experiments using LLaMA-3.1-8B and Qwen-2.5-32B on diverse tasks, including logical puzzles and math challenges such as AIME 24, Math 500, and Minerva, show that SortedRL reduces RL training bubble ratios by over 50%, while attaining 3.9% to 18.4% superior performance over baselines given the same amount of data.

cs.LG · cs.AI

References (20)

  • OpenAI o1 System Card. Ahmed El-Kishky et al., 2024.
  • OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems. Chaoqun He, Renjie Luo, Yuzhuo Bai et al., 2024.
  • DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. DeepSeek-AI (Daya Guo et al.), 2025.
  • Proximal Policy Optimization Algorithms. John Schulman, Filip Wolski, Prafulla Dhariwal et al., 2017.
  • Measuring Mathematical Problem Solving With the MATH Dataset. Dan Hendrycks, Collin Burns, Saurav Kadavath et al., 2021.
  • ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase et al., 2019.
  • GPT-4 Technical Report. OpenAI (Josh Achiam, Steven Adler, S. Agarwal et al.), 2023.
  • Is Reinforcement Learning (Not) for Natural Language Processing?: Benchmarks, Baselines, and Building Blocks for Natural Language Policy Optimization. Rajkumar Ramamurthy, Prithviraj Ammanabrolu, Kianté Brantley et al., 2022.
  • Solving Quantitative Reasoning Problems with Language Models. Aitor Lewkowycz, Anders Andreassen, David Dohan et al., 2022.
  • Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training. Shenggui Li, Zhengda Bian, Hongxin Liu et al., 2021.
  • Training Verifiers to Solve Math Word Problems. K. Cobbe, Vineet Kosaraju, Mo Bavarian et al., 2021.
  • ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning. Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley et al., 2021.
  • Efficient Memory Management for Large Language Model Serving with PagedAttention. Woosuk Kwon, Zhuohan Li, Siyuan Zhuang et al., 2023.
  • REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models. Jian Hu, 2025.
  • OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework. Jian Hu, Xibin Wu, Weixun Wang et al., 2024.
  • DeepSeek-V3 Technical Report. DeepSeek-AI (A. Liu, B. Feng et al.), 2024.
  • Orca: A Distributed Serving System for Transformer-Based Generative Models. Gyeong-In Yu, Joo Seong Jeong et al., 2022.
  • Qwen3 Technical Report. An Yang, Anfeng Li, Baosong Yang et al., 2025.
  • SGLang: Efficient Execution of Structured Language Model Programs. Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie et al., 2023.
  • DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. Zhihong Shao, Peiyi Wang, Qihao Zhu et al., 2024.

Cited By

  • Not All Tokens Are Needed (NAT): Token-Efficient Reinforcement Learning
  • Training Large Reasoning Models Efficiently via Progressive Thought Encoding
  • RL over Commodity Networks: Overcoming the Bandwidth Barrier with Lossless Sparse Deltas
  • Unleashing Efficient Asynchronous RL Post-Training via Staleness-Constrained Rollout Coordination, 2026
  • SPEC-RL: Accelerating On-Policy Reinforcement Learning with Speculative Rollouts, 2025
  • APRIL: Active Partial Rollouts in Reinforcement Learning to Tame Long-tail Generation, 2025