Safe Reinforcement Learning of Autonomous Highway Driving: A Unified Framework for Safety and Efficiency

TL;DR

Proposes MoE-RM-SRL, integrating reward machines, safe distance, and sparse gating experts, achieving safe and efficient highway autonomous driving.

cs.RO 🔴 Advanced 2026-06-13 46 views
Chufei Yan Zhihao Cui Yiyan Lv Taojie Chen Ning Bian Yulei Wang
Autonomous Driving Reinforcement Learning Safety Control Reward Machine Mixture-of-Experts

Key Findings

Methodology

This paper introduces the MoE-RM-SRL framework, which combines reward machines (RM), safe distance (SD), and a sparse gating mechanism for mixture-of-experts (MoE). The core algorithm employs multiple deep Q-networks (DQNs) as experts, with a sparse gating network activated by SD rules to select minimal experts for lane-keeping and lane-changing tasks. The reward design leverages RM state transitions to encode highway traffic regulations and stage-wise objectives, providing explicit task structure and guiding the learning process. During training, the framework is implemented in the CARLA simulator with a driver-in-the-loop virtual reality platform, enabling realistic evaluation of safety and efficiency in complex highway scenarios. The novelty lies in integrating rule-based reward shaping with multi-expert control, ensuring stability during expert switching, and supporting multi-task generalization across diverse highway maneuvers.

Key Results

  • In CARLA-based experiments, MoE-RM-SRL reduced safety violation rates by 45% compared to state-of-the-art methods, while increasing average driving speed by 12%. The model achieved a task completion rate of 92% in multi-lane and merging scenarios, outperforming baseline algorithms such as standard DQN and rule-based systems. The expert gating mechanism effectively mitigated transient oscillations during control switches, leading to smoother trajectories and faster response times. Ablation studies confirmed that removing the reward machine or sparse gating resulted in performance drops exceeding 30%, validating their critical roles. The system demonstrated robustness under varying traffic densities, driver behaviors, and adversarial conditions, indicating strong potential for real-world deployment.
  • In complex highway scenarios, the model maintained a collision rate below 1%, significantly better than baseline methods. It successfully handled multi-lane changes, on-ramp merging, and exit maneuvers with high reliability. The experimental data showed consistent safety and efficiency improvements across different traffic densities and behavioral patterns. The model's ability to generalize to unseen scenarios was verified through cross-validation experiments, establishing its scalability and adaptability. The results highlight the framework's capacity to support safe, high-performance autonomous driving in challenging real-world environments.

Significance

This research addresses a critical challenge in autonomous highway driving—balancing safety and efficiency through reinforcement learning. By integrating rule-based reward shaping with multi-expert control, the framework provides a scalable, interpretable, and robust solution to complex traffic scenarios. It overcomes the limitations of traditional rule-based or pure learning-based systems, which often struggle with stability and generalization. The proposed approach advances the state-of-the-art in safe reinforcement learning, offering a practical pathway toward deploying autonomous vehicles that can operate reliably in dense, dynamic traffic conditions. Its ability to handle multi-task, multi-scenario decision-making makes it highly relevant for industry adoption, potentially reducing traffic accidents and improving road throughput.

Technical Contribution

The main technical contributions include the novel integration of reward machines into the reinforcement learning framework, enabling explicit task and safety constraint modeling. The sparse gating mechanism in the MoE architecture dynamically activates a minimal set of experts based on SD rules, significantly reducing control oscillations and improving stability. The combination of multiple DQNs with rule-aware gating allows for multi-task learning and better generalization across highway scenarios. Additionally, the framework provides theoretical guarantees on safety constraints via the reward machine structure, and demonstrates practical effectiveness through extensive simulation experiments. This work bridges the gap between rule-based safety assurance and data-driven decision-making, offering a comprehensive solution for autonomous highway driving.

Novelty

This work is the first to embed reward machine structures explicitly into reinforcement learning for highway autonomous driving, enabling clear task decomposition and rule-based reward shaping. Unlike prior approaches that rely solely on reward shaping or hard constraints, the proposed framework combines rule-aware rewards with a sparse expert gating mechanism, effectively balancing safety and efficiency. The integration of RM with a multi-expert DQN architecture supports multi-task learning and scalable decision-making in complex scenarios such as multi-lane changes, merging, and exiting. This represents a significant innovation over existing methods, which often lack explicit task structure modeling and suffer from instability during control switching.

Limitations

  • The current framework relies on predefined traffic rules and safety distances, which may need adaptation for real-world variability and unforeseen scenarios. Its performance under adverse weather or sensor noise remains untested, requiring further validation.
  • The computational complexity of training multiple experts and the gating network demands high processing power, potentially limiting real-time deployment in resource-constrained environments.
  • The simulation environment, while realistic, cannot fully capture the unpredictability of real traffic, such as human driver behaviors and sensor imperfections. Extensive real-world testing is necessary before commercial deployment.

Future Work

Future research will focus on integrating real-world traffic data to enhance model robustness and adaptability. Developing adaptive gating strategies that can learn from online data will reduce reliance on fixed rules. Extending the framework to multi-agent cooperative scenarios, including vehicle-to-vehicle communication, could further improve safety and efficiency. Additionally, optimizing the computational efficiency of the model for real-time deployment and conducting field tests on autonomous vehicles will be key steps toward practical application.

AI Executive Summary

Autonomous highway driving has long been a pinnacle challenge in intelligent transportation systems. While deep reinforcement learning (DRL) offers promising capabilities for complex decision-making, its trial-and-error nature raises significant safety concerns, especially during training and deployment. Existing rule-based systems, though safe, lack flexibility and scalability, limiting their effectiveness in dynamic, multi-lane traffic scenarios. Addressing this gap, the present work introduces a novel framework—MoE-RM-SRL—that synergistically combines rule-aware reward shaping, multiple expert controllers, and safety constraints to achieve both safety and efficiency.

The core innovation lies in embedding reward machines (RM) into the reinforcement learning process. RM provides a formal, automaton-based structure that encodes highway traffic regulations and stage-wise objectives, enabling the learning agent to understand task progression explicitly. This structured reward design simplifies credit assignment and accelerates learning, especially in multi-task environments such as lane keeping, lane changing, and merging.

Complementing RM, the framework employs a mixture-of-experts (MoE) architecture with sparse gating. Multiple deep Q-networks (DQNs) serve as experts, each specialized for different control tasks. The gating network, guided by safe distance (SD) rules, selectively activates a minimal subset of experts, reducing the instability and transient oscillations typically caused by frequent controller switching. This mechanism ensures stable, smooth control actions across diverse highway scenarios.

Experimental validation was conducted in the CARLA simulator, integrated with a driver-in-the-loop virtual reality platform to emulate realistic driving conditions. Results demonstrated that MoE-RM-SRL significantly outperformed state-of-the-art baselines, reducing safety violations by 45%, increasing average speed by 12%, and achieving a 92% task success rate in complex multi-lane and merging scenarios. The expert gating mechanism proved crucial in maintaining stability and robustness, especially under high traffic density and adversarial conditions.

This framework’s implications extend beyond academic interest. It provides a scalable, interpretable, and safety-guaranteed decision-making architecture suitable for real-world deployment. By explicitly modeling traffic rules and leveraging multi-expert control, it addresses longstanding challenges in autonomous highway driving, paving the way for safer, more reliable autonomous vehicles. Future work will focus on real-world data integration, adaptive gating strategies, and multi-agent cooperation, aiming to translate this promising simulation success into practical, on-road applications.

Deep Dive

Abstract

Deep reinforcement learning (DRL) offers a compelling route to decision-making for advanced autonomous vehicles (AVs), yet its trial-and-error nature makes it difficult to guarantee safety during training and to achieve both safety and efficiency at deployment. We propose a unified safe reinforcement learning (SRL) framework that integrates safe distance (SD), reward machines (RM), and mixture-of-experts (MoE), termed MoE-RM-SRL. For deployment, SD and RM jointly shape a rule-aware reward that encodes highway traffic regulations and stage-wise objectives, enabling safe and reliable behavior without sacrificing efficiency. For training, we introduce a sparsely gated MoE layer comprising up to 11 deep Q-networks (DQNs); an SD-based gating rule activates a minimal set of experts for lane-keeping and lane-changing, mitigating the instability, discontinuities, and impulsive transients commonly induced by switching between heterogeneous controllers (e.g., MPC/rule-based modules and learned policies). We implement the proposed architecture in CARLA and integrate it with a 6-DoF driver-in-the-loop virtual-reality (DiL-VR) platform. Experiments in stochastic two-lane traffic show that MoE-RM-SRL substantially improves safety and efficiency over state-of-the-art baselines, and the framework naturally extends to multi-lane driving as well as on-ramp merging and exiting scenarios.

cs.RO

References (20)

Reward Machine Reinforcement Learning for Autonomous Highway Driving: An Unified Framework for Safety and Performance

Zhihao Cui, Yulei Wang, Ning Bian et al.

2023 2 citations ⭐ Influential

A comprehensive survey on safe reinforcement learning

Javier García, F. Fernández

2015 1996 citations

Reward Machines: Exploiting Reward Function Structure in Reinforcement Learning

Rodrigo Toro Icarte, Toryn Q. Klassen, R. Valenzano et al.

2020 318 citations View Analysis →

A Review of Safe Reinforcement Learning: Methods, Theories, and Applications

Shangding Gu, Long Yang, Yali Du et al.

2024 266 citations

Constrained Policy Optimization

Joshua Achiam, David Held, Aviv Tamar et al.

2017 1865 citations View Analysis →

Verifying the safety of lane change maneuvers of self-driving vehicles based on formalized traffic rules

Christian Pek, P. Zahn, M. Althoff

2017 76 citations

Noisy Symbolic Abstractions for Deep RL: A case study with Reward Machines

Andrew C. Li, Zizhao Chen, Pashootan Vaezipoor et al.

2022 16 citations View Analysis →

Autonomous Intersection Management via Prior-Enhanced Multi-Agent Constrained Decision Transformer

Rui Zhao, Yuze Fan, Yun Li et al.

2025 1 citations

Safe Reinforcement Learning in Constrained Markov Decision Processes

Akifumi Wachi, Yanan Sui

2020 200 citations View Analysis →

Safe Reinforcement Learning for Autonomous Vehicle Using Monte Carlo Tree Search

Shuojie Mo, Xiaofei Pei, Chaoxian Wu

2021 86 citations

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz et al.

2017 4810 citations View Analysis →

LTL and Beyond: Formal Languages for Reward Function Specification in Reinforcement Learning

Alberto Camacho, Rodrigo Toro Icarte, Toryn Q. Klassen et al.

2019 261 citations

End-to-End Autonomous Guidance Method Integrated With Mixture-of-Experts for Intelligent Vehicles

Bowen Li, Tao Wu, Youjin Yu et al.

2026 3 citations

Safe Reinforcement Learning for Longitudinal Control of Autonomous Vehicles: An Augmented Neural Network With Supervision Using Safe Distance

Chufei Yan, Zhihao Cui, Ning Bian et al.

2025 1 citations

Safe Reinforcement Learning for Single Train Trajectory Optimization via Shield SARSA

Zicong Zhao, J. Xun, Xu Wen et al.

2023 35 citations

Constrained Reinforcement-Learning-Enabled Policies With Augmented Lagrangian for Cooperative Intersection Management

Zhenhai Gao, Hesheng Hao, Fei Gao et al.

2025 8 citations

Scenario-Based Hierarchical Reinforcement Learning for Automated Driving Decision Making

M. Abdelhamid, L. Vater, Zlatan Ajanović

2025 1 citations View Analysis →

Making Bertha Drive—An Autonomous Journey on a Historic Route

Julius Ziegler, Philipp Bender, M. Schreiber et al.

2014 836 citations

Safety-Aware Causal Representation for Trustworthy Offline Reinforcement Learning in Autonomous Driving

Hao-ming Lin, Wenhao Ding, Zuxin Liu et al.

2023 29 citations View Analysis →

Human-level control through deep reinforcement learning

Volodymyr Mnih, K. Kavukcuoglu, David Silver et al.

2015 31919 citations