RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework

TL;DR

RAD-2 scales reinforcement learning for autonomous driving with a generator-discriminator framework, reducing the collision rate by 56% relative to strong diffusion-based planners.

cs.CV · Advanced · 2026-04-17
Hao Gao, Shaoyu Chen, Yifan Zhu, Yuehao Song, Wenyu Liu, Qian Zhang, Xinggang Wang
reinforcement learning · autonomous driving · generator-discriminator · trajectory planning · simulation environment

Key Findings

Methodology

RAD-2 employs a generator-discriminator framework where a diffusion-based generator produces diverse trajectory candidates, and an RL-optimized discriminator reranks these candidates based on long-term driving quality. This decoupled design avoids directly applying sparse scalar rewards to the high-dimensional trajectory space, enhancing optimization stability. Additionally, Temporally Consistent Group Relative Policy Optimization and On-policy Generator Optimization are introduced to further enhance reinforcement learning.
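The generate-then-rerank loop described above can be sketched in a few lines. Everything here is an illustrative stand-in, not the paper's actual models: a random sampler plays the diffusion generator, and a hand-written smoothness score plays the RL-trained discriminator.

```python
import random

# Toy sketch of RAD-2's decoupled pipeline: a stochastic "generator" proposes
# trajectory candidates and a scalar "discriminator" reranks them. Both are
# hypothetical stand-ins for the paper's diffusion model and RL-trained scorer.

def generate_candidates(n, horizon, seed=0):
    """Stand-in generator: n candidate trajectories of (x, y) waypoints."""
    rng = random.Random(seed)
    return [[(float(t), rng.uniform(-1.0, 1.0)) for t in range(horizon)]
            for _ in range(n)]

def discriminator_score(traj):
    """Stand-in discriminator: here, prefer laterally smooth trajectories."""
    return -sum(abs(y) for _, y in traj) / len(traj)

def plan(n_candidates=16, horizon=8):
    # The sparse reward shapes only this scalar scorer, not the generator's
    # high-dimensional output space -- the decoupling RAD-2 argues stabilizes RL.
    candidates = generate_candidates(n_candidates, horizon)
    return max(candidates, key=discriminator_score)

best = plan()
```

The point of the decoupling is visible in `plan`: the reward never has to back-propagate through the candidate-producing process; it only has to order a small finite set.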

Key Results

  • RAD-2 reduces the collision rate by 56% on large-scale benchmarks, significantly outperforming strong diffusion-based planners. This result indicates that RAD-2 can provide higher safety and driving smoothness in complex urban traffic environments.
  • In real-world deployment, RAD-2 demonstrated improved perceived safety and driving smoothness, especially in complex urban traffic. These tests highlight RAD-2's potential for practical applications.
  • RAD-2 is evaluated with high-throughput closed-loop rollouts in the BEV-Warp simulation environment, which operates directly at the feature level and avoids the rendering cost that limits existing simulators.

Significance

RAD-2 addresses the challenge of modeling multimodal future uncertainties and robustness in closed-loop interactions in autonomous driving. By introducing a generator-discriminator framework, RAD-2 significantly enhances system safety and efficiency without relying on expert supervision. This method holds significant academic value and offers a scalable solution for the industry.

Technical Contribution

RAD-2's technical contributions include the decoupled generator-discriminator design, which avoids applying sparse scalar rewards directly in the high-dimensional trajectory space. Temporally Consistent Group Relative Policy Optimization and On-policy Generator Optimization add further machinery for stable, scalable reinforcement learning in this setting.

Novelty

RAD-2 is the first to apply a generator-discriminator framework to closed-loop planning in autonomous driving, improving optimization stability. Compared to existing diffusion-based planners, RAD-2 demonstrates superior robustness and efficiency in handling high-dimensional trajectory spaces.

Limitations

  • RAD-2 may degrade in extremely complex traffic scenarios if the generated trajectory candidates lack the diversity and quality needed to cover every driving situation.
  • The method's performance in simulation may differ from the real world, especially when simulators cannot fully capture real traffic dynamics.
  • While RAD-2 performs excellently in most cases, additional expert supervision may be required in specific scenarios to ensure safety.

Future Work

Future research directions include further optimizing the synergy between the generator and discriminator to improve the quality and diversity of trajectory candidates, and validating RAD-2's robustness and adaptability under different driving conditions with more complex simulation environments and real-world datasets.

AI Executive Summary

High-level autonomous driving systems require motion planners capable of modeling multimodal future uncertainties while remaining robust in closed-loop interactions. Although diffusion-based planners are effective at modeling complex trajectory distributions, they often suffer from stochastic instabilities and the lack of corrective negative feedback when trained purely with imitation learning. To address these issues, we propose RAD-2, a unified generator-discriminator framework for closed-loop planning.

At the core of RAD-2 is a diffusion-based generator used to produce diverse trajectory candidates, while an RL-optimized discriminator reranks these candidates according to their long-term driving quality. This decoupled design avoids directly applying sparse scalar rewards to the full high-dimensional trajectory space, thereby improving optimization stability. To further enhance reinforcement learning, we introduce Temporally Consistent Group Relative Policy Optimization, which exploits temporal coherence to alleviate the credit assignment problem. Additionally, we propose On-policy Generator Optimization, which converts closed-loop feedback into structured longitudinal optimization signals and progressively shifts the generator toward high-reward trajectory manifolds.

To support efficient large-scale training, we introduce BEV-Warp, a high-throughput simulation environment that performs closed-loop evaluation directly in Bird's-Eye View feature space via spatial warping. RAD-2 reduces the collision rate by 56% compared with strong diffusion-based planners. Real-world deployment further demonstrates improved perceived safety and driving smoothness in complex urban traffic.
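The feature-level warping idea can be caricatured as shifting a Bird's-Eye-View grid to reflect ego motion instead of re-rendering sensor data. The grid size, integer-cell shift, and zero padding below are illustrative simplifications, not BEV-Warp's actual implementation.

```python
# Toy sketch of feature-space warping in the spirit of BEV-Warp: shift an
# H x W Bird's-Eye-View feature grid to reflect ego motion instead of
# re-running perception. Resolution, shift granularity, and padding are
# illustrative assumptions.

def warp_bev(grid, dx, dy, pad=0.0):
    """Shift an H x W BEV grid by integer cells (dx rows, dy cols),
    padding newly exposed cells with `pad`."""
    h, w = len(grid), len(grid[0])
    out = [[pad] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            si, sj = i + dx, j + dy  # source cell under the ego shift
            if 0 <= si < h and 0 <= sj < w:
                out[i][j] = grid[si][sj]
    return out

bev = [[float(i * 4 + j) for j in range(4)] for i in range(4)]
moved = warp_bev(bev, dx=1, dy=0)  # ego moved one cell forward
```

Because the warp is a cheap array operation rather than a rendering pass, many closed-loop steps can be simulated per second, which is what makes the large-scale RL training feasible.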

RAD-2 tackles the core difficulty of modeling multimodal future uncertainty while remaining robust in closed-loop interaction. By decoupling trajectory generation from reward-driven evaluation, it improves safety and efficiency without relying on expert supervision, offering a scalable solution for both academic study and industrial deployment.

While RAD-2 performs excellently in most cases, additional expert supervision may be required in specific scenarios to ensure safety. Additionally, the method's performance in simulation may differ from the real world, especially when simulators cannot fully capture real traffic dynamics. Future research directions include further optimizing the synergy between the generator and discriminator to improve the quality and diversity of trajectory candidates.

Deep Analysis

Background

The rapid development of autonomous driving technology has made motion planning a core challenge in this field. Traditional planning methods, such as regression-based and selection-based planners, often rely on deterministic predictions or discrete candidate sets, limiting their performance in complex driving scenarios. Recently, diffusion-based imitation learning planners have gained attention for their ability to generate multimodal continuous trajectories. However, these methods face challenges of stochastic instability and lack of corrective negative feedback when dealing with real driving datasets. To overcome these challenges, researchers have begun exploring the combination of reinforcement learning with imitation learning to enhance policy learning.

Core Problem

The core problem in autonomous driving is how to perform robust motion planning in uncertain future environments. Existing diffusion-based imitation learning planners, while capable of generating complex trajectory distributions, face optimization instability when handling high-dimensional continuous trajectories. Additionally, imitation learning lacks negative feedback, leading to potentially unrealistic behaviors in real driving scenarios. To achieve efficient closed-loop planning, a solution is needed that can provide high-quality trajectory candidates without relying on expert supervision.

Innovation

The core innovations of RAD-2 include the decoupled design of the generator-discriminator framework:

  • Generator: a diffusion model that produces diverse, high-quality trajectory candidates.
  • Discriminator: trained with reinforcement learning to rerank the candidates by long-term driving quality.
  • Temporally Consistent Group Relative Policy Optimization: exploits temporal coherence to ease the credit assignment problem and stabilize policy optimization.
  • On-policy Generator Optimization: converts closed-loop feedback into structured longitudinal optimization signals that progressively shift the generator toward high-reward trajectory manifolds.
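As a rough illustration of the "group relative" part (the temporal-consistency mechanism is not reproduced here), a GRPO-style update normalizes each sampled trajectory's reward against its own group, so no separately learned value baseline is needed:

```python
# Sketch of a group-relative advantage in the spirit of GRPO: each sample's
# reward is normalized against its own group. RAD-2's temporally consistent
# variant adds machinery not shown here; this is only the group baseline.
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Advantage of each sample = (reward - group mean) / group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

adv = group_relative_advantages([1.0, 2.0, 3.0])
```

The advantages sum to zero within each group, so above-average trajectories are reinforced and below-average ones suppressed, using only relative comparisons.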

Methodology

The methodology of RAD-2 includes the following key steps:

  • Generator: a diffusion model that takes the current observations as input and outputs diverse candidate trajectories.
  • Discriminator: trained with reinforcement learning; takes the candidate trajectories as input and outputs a reranking.
  • Temporally Consistent Group Relative Policy Optimization: exploits temporal coherence to ease credit assignment and keep policy optimization stable.
  • On-policy Generator Optimization: converts closed-loop feedback into structured longitudinal optimization signals that progressively shift the generator toward high-reward trajectory manifolds.

Experiments

The experimental design includes high-throughput closed-loop evaluations in the BEV-Warp simulation environment. The large-scale benchmark dataset used covers various driving scenarios, including safety and efficiency-related scenarios. The experiments compare RAD-2's performance with existing strong diffusion-based planners, focusing on collision rate and driving smoothness. Additionally, ablation studies are conducted to verify the contributions of each component.

Results

The experimental results show that RAD-2 reduces the collision rate by 56% on large-scale benchmarks, significantly outperforming strong diffusion-based planners. Additionally, RAD-2 demonstrated improved perceived safety and driving smoothness in real-world vehicle tests, especially in complex urban traffic. Ablation studies indicate that Temporally Consistent Group Relative Policy Optimization and On-policy Generator Optimization significantly contribute to overall performance improvement.

Applications

Application scenarios for RAD-2 include motion planning for autonomous vehicles, particularly in complex urban traffic environments. This method can provide high-quality trajectory candidates without relying on expert supervision, enhancing system safety and efficiency. The industry can leverage RAD-2's generator-discriminator framework to develop more robust and efficient autonomous driving solutions.

Limitations & Outlook

While RAD-2 performs excellently in most cases, additional expert supervision may be required in specific scenarios to ensure safety. Additionally, the method's performance in simulation may differ from the real world, especially when simulators cannot fully capture real traffic dynamics. Future research directions include further optimizing the synergy between the generator and discriminator to improve the quality and diversity of trajectory candidates.

Plain Language (Accessible to non-experts)

Imagine you're a chef preparing multiple dishes for a large banquet. You need to ensure each dish meets the guests' tastes while maintaining speed and quality. RAD-2 acts like your kitchen assistant, helping you select the best combination of dishes from numerous recipes. First, RAD-2 generates a series of different recipe options (trajectory candidates), covering various possible tastes and styles. Then, based on guest feedback (reinforcement learning optimization), RAD-2 reranks these recipes to ensure the final dish combination is both delicious and meets guest expectations. In this way, RAD-2 helps you make the best decisions in uncertain environments, ensuring the banquet's success.

ELI14 (Explained like you're 14)

Hey there! Imagine you're playing a super cool racing game. You need to choose the best route to win the race, but each route has different obstacles and challenges. RAD-2 is like your game assistant, helping you find the safest and fastest route among all the options. First, RAD-2 generates many different route options, like giving you a bunch of maps. Then, based on your previous game performance and feedback, it rearranges these routes to ensure you pick the one that avoids obstacles and gets you to the finish line quickly. With RAD-2, you can easily win the game! Isn't that awesome?

Glossary

Generator-Discriminator Framework

A framework combining generation and discrimination processes to produce diverse candidates and rank them based on quality.

Used in RAD-2 for trajectory generation and ranking.

Diffusion Model

A generative model trained by gradually adding noise to data and learning to reverse the process; samples are generated by iterative denoising.

Used to generate diverse trajectory candidates.
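A minimal sketch of the forward (noising) half of this idea, assuming a simple fixed noise schedule; a trained model would learn to reverse these steps to produce samples. The schedule length and beta value are illustrative choices.

```python
import math
import random

# Minimal sketch of the diffusion idea: the *forward* process gradually adds
# Gaussian noise to a clean sample; a learned model would then reverse it
# step by step. The fixed linear beta schedule below is an illustrative choice.

def forward_noise(x0, t, betas, rng):
    """Sample x_t ~ N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I)."""
    alpha_bar = 1.0
    for b in betas[:t]:
        alpha_bar *= 1.0 - b
    return [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * rng.gauss(0, 1)
            for x in x0]

betas = [0.02] * 300                               # small noise per step
xT = forward_noise([1.0, -1.0], 300, betas, random.Random(0))  # mostly noise
```

At t = 0 the sample is returned unchanged; by the final step almost all signal has been replaced by noise, which is the starting point for reverse-process generation.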

Reinforcement Learning

A machine learning method that learns optimal policies through interaction with the environment.

Used to optimize the discriminator to rerank trajectory candidates.

Temporally Consistent Group Relative Policy Optimization

An optimization method that uses temporal coherence to alleviate the credit assignment problem.

Enhances policy optimization stability.

On-policy Generator Optimization

A method that converts closed-loop feedback into structured longitudinal optimization signals.

Progressively shifts the generator toward high-reward trajectory manifolds.

BEV-Warp

A simulation environment that performs closed-loop evaluation directly in Bird's-Eye View feature space via spatial warping.

Used for high-throughput closed-loop evaluations.

Collision Rate

The fraction of evaluation episodes in which the ego vehicle collides.

Used to evaluate RAD-2's performance in experiments.
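The headline number can be reproduced arithmetically; the episode counts below are made up purely to illustrate what a 56% relative reduction means.

```python
# Illustrative computation of a collision rate and a 56% *relative*
# reduction. The episode counts are hypothetical, not the paper's data.

def collision_rate(outcomes):
    """Fraction of evaluation episodes (1 = collision, 0 = none)."""
    return sum(outcomes) / len(outcomes)

baseline = collision_rate([1] * 25 + [0] * 975)  # 2.5% of 1000 episodes
ours     = collision_rate([1] * 11 + [0] * 989)  # 1.1% of 1000 episodes
relative_reduction = 1 - ours / baseline          # 56% fewer collisions
```

Note the reduction is relative to the baseline's rate, not an absolute 56-percentage-point drop.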

Imitation Learning

A machine learning method that learns policies by imitating expert behavior.

Combined with reinforcement learning to enhance policy learning.

Trajectory Candidates

A diverse set of trajectories generated by the generator for selecting the best path.

Used in RAD-2 for trajectory generation and ranking.

Closed-loop Planning

A planning method performed in a feedback loop, allowing dynamic decision adjustments.

Core application scenario for RAD-2.

Open Questions (Unanswered questions from this research)

  1. How can the diversity and quality of trajectory candidates be improved in extremely complex traffic scenarios? This requires more advanced generator designs and more efficient optimization algorithms.
  2. How can simulation environments capture real traffic dynamics more accurately to narrow the sim-to-real gap? This requires higher-fidelity simulators and more realistic datasets.
  3. How can RAD-2's safety be ensured in all scenarios without additional expert supervision? This requires more powerful discriminators and more comprehensive safety evaluation mechanisms.
  4. How can the synergy between the generator and discriminator be further optimized to improve candidate quality and diversity? This requires deeper algorithmic research and experimental validation.
  5. How can RAD-2's robustness and adaptability be validated across driving conditions worldwide? This requires cross-regional datasets and diverse test scenarios.

Applications

Immediate Applications

Urban Autonomous Driving

RAD-2 can be used for autonomous driving in urban environments, helping vehicles navigate complex traffic safely and efficiently.

Advanced Driver Assistance Systems

RAD-2 can be integrated into existing driver assistance systems to enhance decision-making capabilities under varying traffic conditions.

Autonomous Taxis

RAD-2 can be used for path planning in autonomous taxis, ensuring passenger safety and comfort.

Long-term Vision

Global Autonomous Driving Network

With further optimization, RAD-2 is expected to become a core technology for a global autonomous driving network, supporting cross-regional autonomous driving applications.

Intelligent Traffic Management Systems

RAD-2 can be used in intelligent traffic management systems to optimize urban traffic flow and improve overall traffic efficiency.

Abstract

High-level autonomous driving requires motion planners capable of modeling multimodal future uncertainties while remaining robust in closed-loop interactions. Although diffusion-based planners are effective at modeling complex trajectory distributions, they often suffer from stochastic instabilities and the lack of corrective negative feedback when trained purely with imitation learning. To address these issues, we propose RAD-2, a unified generator-discriminator framework for closed-loop planning. Specifically, a diffusion-based generator is used to produce diverse trajectory candidates, while an RL-optimized discriminator reranks these candidates according to their long-term driving quality. This decoupled design avoids directly applying sparse scalar rewards to the full high-dimensional trajectory space, thereby improving optimization stability. To further enhance reinforcement learning, we introduce Temporally Consistent Group Relative Policy Optimization, which exploits temporal coherence to alleviate the credit assignment problem. In addition, we propose On-policy Generator Optimization, which converts closed-loop feedback into structured longitudinal optimization signals and progressively shifts the generator toward high-reward trajectory manifolds. To support efficient large-scale training, we introduce BEV-Warp, a high-throughput simulation environment that performs closed-loop evaluation directly in Bird's-Eye View feature space via spatial warping. RAD-2 reduces the collision rate by 56% compared with strong diffusion-based planners. Real-world deployment further demonstrates improved perceived safety and driving smoothness in complex urban traffic.

