Superhuman Safe and Agile Racing through Multi-Agent Reinforcement Learning

TL;DR

League-based multi-agent RL enables quadrotor racing at speeds above 22 m/s with a 50% collision reduction vs. single-agent baselines.

cs.RO, Advanced, 2026-05-22

Ismail Geles, Leonard Bauersfeld, Markus Wulfmeier, Davide Scaramuzza

Multi-Agent Reinforcement Learning, Quadrotor Racing, Safe Coordination, League Training, Sim-to-Real Transfer

Key Findings

Methodology

This paper introduces a league-based multi-agent reinforcement learning framework employing Proximal Policy Optimization (PPO) with recurrent LSTM networks to capture temporal dependencies. A Perceiver-based attention encoder processes variable and unordered opponent observations, ensuring permutation invariance. The training environment incorporates a particle-based aerodynamic downwash model to simulate physical interactions among quadrotors. Agents train against a diverse pool of opponents—including single-agent, independent multi-agent, and historical policies—via league play to foster robust, generalizable strategies for high-speed quadrotor racing with up to eight agents. The framework is validated both in large-scale simulations and real-world races with up to four competitors, including human champions, demonstrating zero-shot transfer capabilities.

Key Results

In real-world experiments, the league-trained policy achieved a fastest first lap time of 5.54 seconds, outperforming the human champion's 6.63 seconds, with 100% race completion in solo trials. In multi-agent races with up to four competitors, the policy maintained over 90% completion rates and reduced collision rates by 50% compared to single-agent baselines.
Large-scale simulation over 64,000 four-agent races showed the league-play policy averaged 4.96 seconds per lap—only 0.03 seconds slower than the fastest single-agent policy—but with significantly higher safety, achieving over 90% race completion versus under 25% for single-agent policies.
Ablation studies revealed that removing the Perceiver attention encoder drastically increased collision rates, especially gate collisions, confirming its critical role in processing multi-agent observations effectively.
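
As a quick check on the figures above, the gaps can be worked out directly. The percentages below are derived here from the reported lap times and do not appear in the source; the fastest single-agent lap time is inferred from the stated 0.03-second difference.

```latex
\begin{align*}
\Delta t_{\text{real}} &= 6.63\,\text{s} - 5.54\,\text{s} = 1.09\,\text{s},
  & \frac{1.09}{6.63} &\approx 16.4\%\ \text{(policy faster than the human champion)} \\
t_{\text{single}} &= 4.96\,\text{s} - 0.03\,\text{s} = 4.93\,\text{s},
  & \frac{0.03}{4.93} &\approx 0.6\%\ \text{(league play slower in simulation)}
\end{align*}
```

In other words, the simulated speed cost of league play is well under one percent, while the completion rate rises from under 25% to over 90%.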

Significance

This work addresses a fundamental limitation in autonomous systems: the brittleness of single-agent policies in dynamic, multi-agent physical environments. By integrating multi-agent reinforcement learning with realistic aerodynamic modeling and diverse opponent training, the study achieves unprecedented safe and agile quadrotor racing performance surpassing human experts. The findings have broad implications for deploying autonomous aerial systems in shared, safety-critical domains such as urban air mobility and multi-robot logistics, where robust multi-agent coordination is essential. The demonstrated zero-shot generalization to human interaction marks a significant step toward practical, safe robotic coexistence.

Technical Contribution

Key technical contributions include: 1) the design of a league-based multi-agent RL framework incorporating diverse opponent strategies to enhance policy generalization and robustness; 2) the novel application of a Perceiver attention encoder to handle variable, unordered multi-agent observations with permutation invariance; 3) integration of a particle-based aerodynamic downwash model capturing complex physical interactions among quadrotors; 4) successful zero-shot sim-to-real transfer validated in real-world multi-agent races against champion human pilots. These advances collectively push the state-of-the-art in multi-agent RL for high-speed physical systems.

Novelty

This study is the first to combine league training with realistic aerodynamic interaction modeling to achieve safe, high-speed multi-agent quadrotor racing beyond two-agent scenarios. Unlike prior work limited to low-speed or two-player settings, it scales to eight-agent races with robust zero-shot generalization to human opponents. The integration of Perceiver-based encoding for multi-agent observations and aerodynamic downwash simulation represents a fundamental innovation enabling effective physical multi-agent coordination.

Limitations

The approach has not been extensively tested in environments with more than eight agents, where increased density may degrade safety and performance.
The aerodynamic downwash model is an approximation and does not capture all complex airflow effects, potentially limiting policy robustness in more turbulent conditions.
Dependence on high-fidelity motion capture for state estimation in real-world deployment restricts applicability to environments with such infrastructure.

Future Work

Future research directions include scaling to higher agent densities to test safety and robustness limits, enhancing aerodynamic modeling fidelity to capture more complex interactions, and developing robust perception and state estimation methods to reduce reliance on motion capture systems. Additionally, exploring long-term multi-agent and human-robot interaction dynamics will be crucial for safe deployment in mixed environments.

AI Executive Summary

Autonomous systems have achieved superhuman performance in isolated or simulated settings but remain fragile in shared, dynamic real-world environments due to the prevalent single-agent paradigm that neglects interactions with other agents. This paper addresses this gap by introducing a multi-agent reinforcement learning framework based on league training, enabling safe and agile quadrotor racing with multiple competitors. The framework uses a Perceiver-based attention encoder to process variable and unordered opponent observations and incorporates a particle-based aerodynamic downwash model to simulate physical interactions realistically.

Through training against a diverse pool of opponents—including single-agent, independent multi-agent, and historical policies—agents develop sophisticated anticipatory behaviors such as proactive collision avoidance, strategic overtaking, and handling aerodynamic disturbances. The learned policies generalize zero-shot to human opponents and achieve speeds exceeding 22 m/s, outperforming a five-time Swiss national drone racing champion in real-world races.

Technically, the study pioneers the integration of league-based multi-agent RL with realistic aerodynamic modeling and permutation-invariant observation encoding, overcoming challenges of non-stationarity, exponential state-space growth, and partial observability inherent in multi-agent physical systems. Ablation studies confirm the critical role of the Perceiver encoder in maintaining safety and performance.

Experimental results demonstrate a 50% reduction in collision rates compared to state-of-the-art single-agent baselines, with race completion rates exceeding 90% in multi-agent scenarios. The policies maintain consistent safety margins regardless of competitive pressure, contrasting with human pilots who exhibit riskier behavior when trailing. This predictability is vital for deploying autonomous agents alongside humans safely.

The broader impact includes advancing multi-agent RL from simulation to real-world applications, providing a foundation for safe multi-robot coordination in domains such as urban air mobility, warehouse logistics, and search and rescue. The work highlights the necessity of interaction-aware training for robust multi-agent coexistence.

Limitations include reliance on approximate aerodynamic models and motion capture systems, and untested scalability beyond eight agents. Future work aims to address these challenges by improving physical modeling, perception robustness, and exploring long-term human-robot interaction dynamics, paving the way for widespread safe autonomous system deployment.

Deep Analysis

Background

The field of autonomous robotics has witnessed significant advances driven by reinforcement learning (RL), enabling robots to perform complex tasks such as locomotion, manipulation, and navigation. Landmark achievements include AlphaGo’s mastery of Go and multi-agent successes in StarCraft II and Dota 2, demonstrating RL’s capability in strategic decision-making under uncertainty. However, these successes were achieved in games and simulation, while physical robotics applications have largely relied on isolated single-agent settings. Real-world multi-agent coordination, especially in physical domains with stringent safety requirements, remains a formidable challenge. Prior work in autonomous drone racing has focused on single-agent or two-agent scenarios, optimizing lap times without accounting for complex multi-agent interactions or physical coupling effects like aerodynamic downwash. The exponential growth of state and action spaces, coupled with non-stationarity and partial observability, complicates learning robust policies. Moreover, collisions in physical systems cause hardware damage, making safety paramount. This paper builds upon this context to explore multi-agent RL for safe, high-speed quadrotor racing in realistic physical environments.

Core Problem

The core problem is enabling multiple autonomous quadrotors to race at high speeds in a shared physical space while maintaining safety and competitive performance. Challenges include: (1) modeling and anticipating complex interactions among multiple agents, including aerodynamic disturbances; (2) overcoming the limitations of single-agent RL that ignores opponent behaviors, leading to unsafe collisions; (3) handling the combinatorial explosion of the state space as the number of agents increases; (4) ensuring policies generalize to unseen opponents and configurations; and (5) achieving zero-shot transfer from simulation to real-world deployment with human competitors. Addressing these challenges is critical for applications requiring multi-robot coordination in dynamic, safety-critical environments.

Innovation

This work introduces several key innovations: (1) a league-based multi-agent RL framework that trains agents against a diverse set of opponents—including single-agent, independent multi-agent, and historical policies—promoting robust and generalizable strategies; (2) the use of a Perceiver-based attention encoder to process variable and unordered opponent observations, ensuring permutation invariance and efficient multi-agent information fusion; (3) incorporation of a particle-based aerodynamic downwash model simulating physical interactions among quadrotors, enabling policies to learn to maintain safe distances accounting for aerodynamic effects; (4) demonstration of zero-shot sim-to-real transfer in real-world multi-agent races against champion human pilots; and (5) extensive large-scale simulation and real-world experiments validating safety improvements and competitive performance. These innovations collectively address the limitations of prior work restricted to low-speed, two-agent, or simulation-only settings.

Methodology

Training Algorithm: Proximal Policy Optimization (PPO) with recurrent LSTM networks captures temporal dependencies in state-action sequences.

Observation Encoding: A Perceiver-based attention encoder processes the ego agent’s state and a variable number of opponents’ relative positions and velocities, producing a fixed-size, permutation-invariant feature vector.

Opponent Pool: League training involves a diverse opponent pool comprising single-agent policies (ignoring opponents), independent multi-agent policies (jointly trained), and historical checkpoints from prior training iterations.

Aerodynamic Modeling: A particle-based downwash model simulates thrust disturbances caused by nearby quadrotors, influencing flight dynamics and encouraging learned policies to maintain safe separation.

Simulation Environment: Agents train on the Split-S racetrack, a 75-meter circuit with seven gates, under varying numbers of opponents (up to eight).

Real-World Deployment: Policies trained entirely in simulation are deployed on 220-gram, 3-inch quadrotors equipped with motion capture for precise state estimation, competing against human pilots and other autonomous agents.

Evaluation Metrics: Race completion rate (percentage of races finished without a collision) and lap times serve as primary metrics for safety and performance.

Ablation Studies: Experiments remove the Perceiver encoder and vary training regimes to isolate contributions of each component.
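
The observation-encoding step above can be illustrated with a minimal sketch. It is not the authors' network: it shows a single cross-attention layer in which a fixed set of latent vectors attends to opponent tokens, which is the core idea behind a Perceiver-style encoder. The class name `LatentCrossAttention`, the six-dimensional token (relative position and velocity), the layer sizes, the random weights standing in for learned parameters, and the all-zero output for an empty opponent set are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class LatentCrossAttention:
    """Hypothetical one-layer cross-attention from latents to opponent tokens."""

    def __init__(self, token_dim=6, num_latents=4, d_model=16, seed=0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(d_model)
        # Random weights stand in for learned parameters (illustration only).
        self.latents = rng.normal(size=(num_latents, d_model))
        self.w_q = rng.normal(size=(d_model, d_model)) * scale
        self.w_k = rng.normal(size=(token_dim, d_model)) * scale
        self.w_v = rng.normal(size=(token_dim, d_model)) * scale
        self.token_dim = token_dim
        self.d_model = d_model

    def __call__(self, opponents):
        # opponents: (N, token_dim); N may differ between calls and may be 0.
        opponents = np.asarray(opponents, dtype=float).reshape(-1, self.token_dim)
        if opponents.shape[0] == 0:
            return np.zeros(self.latents.size)   # assumed convention for "no opponents"
        q = self.latents @ self.w_q              # (L, d) queries from the latents
        k = opponents @ self.w_k                 # (N, d) keys from opponent tokens
        v = opponents @ self.w_v                 # (N, d) values from opponent tokens
        attn = softmax(q @ k.T / np.sqrt(self.d_model), axis=-1)  # (L, N)
        # The weighted sum over opponents does not depend on their order.
        return (attn @ v).reshape(-1)            # fixed size: L * d

encoder = LatentCrossAttention()
rng = np.random.default_rng(1)
three_opponents = rng.normal(size=(3, 6))
features = encoder(three_opponents)
shuffled_features = encoder(three_opponents[::-1])
print(features.shape, np.allclose(features, shuffled_features))
```

Because no positional information is attached to the opponent tokens, reordering them leaves the output unchanged, and the output length depends only on the number of latents. In a full policy the ego state would be combined with this feature vector before the recurrent layers.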

Experiments

The experimental setup includes both large-scale simulation and real-world testing. Simulation experiments encompass 64,000 four-agent races with randomized starting positions and opponent configurations, comparing five training paradigms: single-agent PPO, independent multi-agent PPO, fictitious self-play, league-play with Perceiver encoder, and league-play without Perceiver encoder. Metrics include average lap time and race completion rate. Real-world experiments involve time trials, AI-only races, and mixed human-AI races on the Split-S track with up to four competitors, including a five-time Swiss national drone racing champion. Policies are evaluated on safety (collision rates), competitiveness (lap times), and generalization (zero-shot transfer to human opponents). Additional analyses include value function visualization to interpret learned anticipatory behaviors.
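
The two metrics named above are simple to compute from per-race logs. The sketch below assumes a hypothetical `RaceRecord` structure and uses made-up numbers purely to exercise the functions; none of the values come from the paper.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List

@dataclass
class RaceRecord:
    completed: bool          # True if the agent finished the race without a collision
    lap_times: List[float]   # seconds, one entry per completed lap

def completion_rate(records):
    """Fraction of races finished without a collision."""
    return sum(r.completed for r in records) / len(records)

def mean_lap_time(records):
    """Average over all completed laps, pooled across races."""
    return mean(t for r in records for t in r.lap_times)

# Illustrative data only (not results from the paper).
records = [
    RaceRecord(True, [5.0, 4.9]),
    RaceRecord(False, [5.1]),
    RaceRecord(True, [5.0, 5.0]),
    RaceRecord(True, [4.8, 5.2]),
]
print(completion_rate(records))   # 0.75
print(mean_lap_time(records))
```

Whether laps from crashed races count toward the lap-time average is a design choice the source does not specify; this sketch pools every completed lap.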

Results

The league-play policy achieved a fastest first lap time of 5.54 seconds in solo trials, outperforming the human champion’s 6.63 seconds, with 100% race completion. In multi-agent races with up to four agents, it maintained over 90% completion rates and halved collision rates compared to single-agent baselines. Large-scale simulations showed league-play policies averaged 4.96 seconds per lap with over 90% race completion, significantly safer than single-agent policies which crashed in over 75% of races. Ablation removing the Perceiver encoder led to increased collisions, particularly with gates, underscoring its importance. Value function visualizations revealed that agents learned anticipatory collision avoidance, adjusting trajectories proactively based on predicted opponent positions. Policies generalized zero-shot to races against human pilots, maintaining safety and competitiveness.
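
For a sense of scale, the lap times can be converted to average speeds. This back-of-the-envelope calculation is not from the source and assumes the flown lap distance equals the stated 75-meter circuit length; the actual racing line may be longer or shorter.

```latex
\begin{align*}
\bar{v}_{\text{sim}} &\approx \frac{75\,\text{m}}{4.96\,\text{s}} \approx 15.1\,\text{m/s} \\
\bar{v}_{\text{first lap}} &\approx \frac{75\,\text{m}}{5.54\,\text{s}} \approx 13.5\,\text{m/s}
\end{align*}
```

Both averages sit below the reported peak of over 22 m/s, as one would expect when a lap includes turns and gate maneuvers.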

Applications

The demonstrated multi-agent RL framework applies directly to autonomous drone racing, enabling safer and more strategic multi-robot competitions. Beyond racing, it provides foundational technology for urban air mobility systems where multiple autonomous aerial vehicles share congested airspace, requiring safe coordination. In warehouse logistics, similar multi-robot coordination strategies can improve throughput and reduce collisions. The zero-shot generalization to human interaction suggests applicability in mixed human-robot environments such as search and rescue operations, where autonomous agents must safely coexist with human operators.

Limitations & Outlook

The approach’s scalability beyond eight agents remains untested, with potential degradation in safety and performance at higher densities. The aerodynamic downwash model is a simplified approximation and may not capture all relevant airflow dynamics, limiting robustness in turbulent or cluttered environments. Real-world deployment relies on high-precision motion capture systems for state estimation, restricting applicability to instrumented environments. Additionally, long-term stability and adaptability under dynamic environmental changes and diverse human behaviors require further investigation.

Plain Language (accessible to non-experts)

Imagine a busy playground where several kids are racing toy cars on a winding track. Each kid wants to go as fast as possible without crashing into others. If each kid only focuses on their own car and ignores others, crashes happen often. Now, imagine if each kid could predict where others will move and adjust their speed and path accordingly to avoid collisions while still racing fast. This is what the paper teaches drones to do.

The researchers created a training system where drones learn by racing against many different types of opponents, including past versions of themselves and other strategies. This helps them learn to handle all kinds of situations, just like kids playing with different friends learn new tricks. They also made sure the drones understand how the air pushed by nearby drones affects their flight, so they can keep a safe distance.

As a result, these drones not only race faster than expert human pilots but also crash half as much. This shows that when robots learn to think about others and the environment, they can work together safely and efficiently, even at high speeds. It’s like teaching kids to be smart and careful racers, not just fast ones.

ELI14 (explained like you're 14)

Hey! Imagine you’re playing a super cool drone racing game with your friends. Everyone wants to win, but if you fly too fast and don’t watch out, you might crash into someone else — ouch! Traditional drones just focus on flying fast without paying attention to others, so crashes happen a lot. But this paper is about teaching drones to be super smart racers who watch what others are doing, predict their moves, and avoid crashing while still flying really fast!

The researchers made the drones practice by racing against lots of different opponents, kind of like playing with different friends who all have their own styles. This way, the drones learn all sorts of tricks to stay safe and win. They even taught the drones how to handle the wind and air pushed around by other drones, so they don’t get pushed off course.

Guess what? These drones can fly faster than a champion human pilot and crash half as much! That means in the future, drones can race or work together safely with people around. Isn’t that awesome? It’s like having super cool drone friends who are both fast and careful!

Glossary

Multi-Agent Reinforcement Learning

A branch of reinforcement learning where multiple agents learn and make decisions simultaneously in a shared environment, considering each other's actions and strategies.

Core methodology enabling drones to coordinate and compete safely in multi-agent racing.

League Training

A training paradigm where agents compete against a diverse pool of opponents, including historical versions and different strategies, to improve robustness and generalization.

Used to expose agents to varied opponents, preventing overfitting and enhancing safety.
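
A minimal sketch of how opponents might be drawn from such a pool is shown below. The three categories follow the summary (single-agent, independent multi-agent, historical checkpoints), but the function `sample_opponents`, the policy names, and the sampling weights are hypothetical; the source does not state how the authors weight the pool.

```python
import random

def sample_opponents(pool, num_opponents, weights, rng):
    """Pick a category by weight, then a policy uniformly within that category."""
    categories = [c for c in pool if pool[c]]          # skip empty categories
    category_weights = [weights[c] for c in categories]
    chosen = []
    for _ in range(num_opponents):
        category = rng.choices(categories, weights=category_weights, k=1)[0]
        chosen.append((category, rng.choice(pool[category])))
    return chosen

# Hypothetical pool contents and weights (assumptions for illustration).
pool = {
    "single_agent": ["sa_policy"],
    "independent_multi_agent": ["ima_policy"],
    "historical": ["ckpt_0100", "ckpt_0200", "ckpt_0300"],
}
weights = {"single_agent": 0.2, "independent_multi_agent": 0.3, "historical": 0.5}

rng = random.Random(0)
opponents = sample_opponents(pool, num_opponents=3, weights=weights, rng=rng)
print(opponents)
```

Sampling the category first keeps the mix of opponent types stable even as the number of historical checkpoints grows during training.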

Proximal Policy Optimization (PPO)

A policy-gradient RL algorithm that stabilizes training by clipping the ratio between new and old action probabilities, which limits how far each update can move the policy.

Primary algorithm used to train quadrotor racing policies.
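
For reference, the standard clipped surrogate objective from the original PPO formulation is given below. The source does not state which PPO variant or clipping range the authors used, so the symbols here follow the usual textbook notation rather than the paper.

```latex
\begin{align*}
r_t(\theta) &= \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)} \\
L^{\text{CLIP}}(\theta) &= \mathbb{E}_t\!\left[\min\!\Big(r_t(\theta)\,\hat{A}_t,\ 
  \operatorname{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_t\Big)\right]
\end{align*}
```

Here $\hat{A}_t$ is the advantage estimate and $\epsilon$ is the clipping range. With a recurrent policy, $s_t$ is effectively replaced by the observation together with the network's hidden state.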

Perceiver Attention Encoder

An attention-based neural network architecture that processes variable-length, unordered inputs into fixed-size representations, ensuring permutation invariance.

Processes multi-agent observations regardless of opponent number or order.

Aerodynamic Downwash

The downward airflow generated by a flying vehicle’s rotors, which can disturb nearby vehicles’ flight dynamics.

Modeled to simulate physical interactions affecting quadrotor stability.
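
The summary does not describe the internals of the particle-based downwash model, so the following is only a toy sketch of the general idea: a vehicle sheds particles that travel downward and slow over time, and another vehicle samples the local air velocity from nearby particles. The class `DownwashParticles`, the exit speed, the decay rate, and the Gaussian kernel width are all assumed values, not parameters from the paper.

```python
import numpy as np

class DownwashParticles:
    """Toy particle cloud: particles leave a vehicle moving downward and slow down."""

    def __init__(self, exit_speed=8.0, decay=1.5, radius=0.3):
        self.pos = np.zeros((0, 3))
        self.vel = np.zeros((0, 3))
        self.exit_speed = exit_speed   # m/s, assumed initial wake speed
        self.decay = decay             # 1/s, assumed velocity decay rate
        self.radius = radius           # m, assumed kernel width

    def emit(self, vehicle_pos):
        """Add one particle at the vehicle position, moving straight down."""
        self.pos = np.vstack([self.pos, np.asarray(vehicle_pos, dtype=float)])
        self.vel = np.vstack([self.vel, [0.0, 0.0, -self.exit_speed]])

    def step(self, dt):
        """Advect particles, then decay their velocity exponentially."""
        self.pos = self.pos + self.vel * dt
        self.vel = self.vel * np.exp(-self.decay * dt)

    def wind_at(self, query_pos):
        """Gaussian-weighted particle velocity near query_pos.

        The weighted sum is divided by the total weight only once that weight
        exceeds 1, so sparse regions yield a small wind instead of a full one.
        """
        if len(self.pos) == 0:
            return np.zeros(3)
        d2 = np.sum((self.pos - np.asarray(query_pos, dtype=float)) ** 2, axis=1)
        w = np.exp(-0.5 * d2 / self.radius ** 2)
        return (w[:, None] * self.vel).sum(axis=0) / max(w.sum(), 1.0)

cloud = DownwashParticles()
for _ in range(50):                      # 1 s of hover, simulated at 50 Hz
    cloud.emit([0.0, 0.0, 2.0])
    cloud.step(0.02)

below = cloud.wind_at([0.0, 0.0, 1.0])   # directly underneath the hovering vehicle
aside = cloud.wind_at([3.0, 0.0, 1.0])   # 3 m to the side at the same height
print(below, aside)
```

A vehicle directly below the emitter sees a downward wind, while one a few meters to the side sees essentially none. A simulator would then convert this local wind into a thrust or force disturbance on the affected vehicle.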

Zero-Shot Generalization

The ability of a trained model to perform well on unseen tasks or environments without additional training.

Demonstrated by policies transferring directly to races against human pilots.

Non-Stationary Environment

An environment whose dynamics change over time, often due to the actions of other learning agents, complicating policy learning.

Characteristic of multi-agent racing where opponents adapt and interact.

Motion Capture System

A system using cameras and sensors to track the precise position and orientation of objects in real time.

Provides accurate state estimation for real-world drone racing experiments.

Fictitious Self-Play

A self-play technique in which each agent trains against the average of its opponents' past strategies, commonly approximated by sampling from earlier policy checkpoints, to improve strategy diversity and robustness.

Part of the league training approach to diversify opponents.

Recurrent Neural Network (RNN)

A neural network architecture designed to process sequential data by maintaining internal state, capturing temporal dependencies.

Used in policy and value networks to handle time-dependent racing dynamics.

Open Questions (unanswered questions from this research)

1. The scalability of the proposed approach beyond eight agents remains unexplored; higher densities may introduce new coordination challenges and safety risks.
2. The aerodynamic downwash model is a simplification and may not capture complex turbulent airflow, limiting policy robustness in more realistic conditions.
3. Real-world deployment depends on high-precision motion capture systems, which are not always available, posing challenges for broader applicability.
4. Long-term stability and adaptability of learned policies under dynamic environmental changes and diverse human behaviors require further study.
5. Strategies for safe and effective long-term human-robot coexistence in shared aerial spaces remain an open research area.

Applications

Immediate Applications

Autonomous Drone Racing

Enables multi-drone races with enhanced safety and strategy, improving competition quality and reducing crashes.

Urban Air Mobility Coordination

Provides foundational algorithms for safe navigation and collision avoidance among multiple autonomous aerial vehicles in congested urban airspace.

Warehouse Multi-Robot Systems

Facilitates coordinated, collision-free operation of multiple robots in logistics and inventory management environments.

Long-term Vision

Safe Multi-Robot Coexistence

Paves the way for autonomous systems to safely share complex environments with humans, enabling widespread robotic integration.

Coordinated Autonomous Drone Fleets for Search and Rescue

Supports reliable, safe collaboration among drone teams in disaster response scenarios, enhancing operational effectiveness and safety.

Abstract

Autonomous systems have achieved superhuman performance in isolation or simulation, yet they remain brittle in shared, dynamic real-world spaces. This failure stems from the dominant single-agent paradigm for physical applications, where other actors are ignored or treated as environmental noise, preventing effective coordination. Here we show that multi-agent reinforcement learning provides the essential safety scaffolding required for real-world interaction. Using high-speed quadrotor racing as a high-stakes testbed, we train agents to navigate complex aerodynamic interactions and strategic maneuvering with a variable number of racers. Through league-based self-play, agents evolve sophisticated anticipatory behaviors, including proactive collision avoidance, overtaking, and handling multi-agent physical interactions, including aerodynamic downwash. Our agents outperform a champion-level human pilot in multi-player races at speeds exceeding 22 m/s, while simultaneously reducing collision rates by 50 % compared to state-of-the-art single-agent baselines. Crucially, training with diverse artificial agents enables zero-shot generalization to safer human interaction. These results suggest that the path to robust robotic co-existence lies not in isolated safety constraints, but in the rigorous demands of multi-agent interaction. Multimedia materials are available at: https://rpg.ifi.uzh.ch/marl

cs.RO cs.AI cs.LG cs.MA


