IDOL: Inverse-Dynamics-Guided Future Prediction for End-to-End Autonomous Driving

TL;DR

IDOL employs inverse dynamics to decode future scene transitions into motion features, significantly improving autonomous driving trajectory planning.

cs.RO Advanced 2026-05-30

Chenghao Zhang, Timin Li, Dongmei Li


autonomous driving, world model, future prediction, inverse dynamics, latent space

Key Findings

Methodology

This paper introduces IDOL, a novel framework that integrates inverse dynamics into latent BEV space for future scene prediction and trajectory refinement. The system first encodes current scenes using multi-modal perception modules like ResNet-34 and TransFuser, creating a compact latent BEV representation. It then employs a latent world model (BEVWorldModel) to generate multi-step future scene states. The core innovation involves applying an inverse dynamics model (IDM) to pairs of adjacent future latent states, decoding transition-aware motion features—specifically spatial maps (S) and global features (g)—that encode the scene's evolution. These features are fused into the planning network, which refines candidate trajectories. A lightweight closed-loop mechanism iterates this process, enhancing long-term consistency. This approach tightly couples future scene understanding with motion control, enabling more actionable planning.
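The data flow described above (roll out future latent states, pair adjacent states, decode a spatial map S and a global feature g per pair) can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: `toy_world_model` and `toy_idm` are hypothetical stand-ins for learned networks, and latent BEV maps are flattened to short lists of floats.

```python
from typing import Callable, List, Tuple

Latent = List[float]  # toy stand-in for a flattened latent BEV feature map


def rollout(world_model: Callable[[Latent], Latent], z0: Latent, steps: int) -> List[Latent]:
    """Autoregressively predict `steps` future latent states starting from z0."""
    states, z = [], z0
    for _ in range(steps):
        z = world_model(z)
        states.append(z)
    return states


def adjacent_pairs(states: List[Latent]) -> List[Tuple[Latent, Latent]]:
    """Pair every future state with its successor: (z1, z2), (z2, z3), ..."""
    return list(zip(states[:-1], states[1:]))


def toy_idm(z_a: Latent, z_b: Latent) -> Tuple[Latent, float]:
    """Hypothetical inverse dynamics: return a spatial map S and a global feature g.

    Here S is the element-wise change between the two states and g is its mean.
    A learned IDM would replace this with a network.
    """
    s_map = [b - a for a, b in zip(z_a, z_b)]
    g = sum(s_map) / len(s_map)
    return s_map, g


def toy_world_model(z: Latent) -> Latent:
    """Hypothetical dynamics: every latent cell drifts by +0.5 per step."""
    return [v + 0.5 for v in z]


z0 = [0.0, 1.0, 2.0]
futures = rollout(toy_world_model, z0, steps=3)
motion = [toy_idm(a, b) for a, b in adjacent_pairs(futures)]
```

With three predicted states there are two adjacent pairs, so `motion` holds two (S, g) tuples. In the framework described here, features like these would then be fused into the planning network rather than used directly.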

Key Results

On NAVSIM v1 and NAVSIM v2 benchmarks, IDOL outperforms state-of-the-art methods, achieving a PDMS of 90.0 and an EPDMS of 38.0 on navtest and navhard splits respectively, surpassing the strongest baseline WoTE by 10.1 points in the latter. The model demonstrates superior robustness in complex scenarios, especially in long-horizon planning tasks.
Ablation studies show that adding inverse dynamics (IDM) improves PDMS by 2.2 points, and the multi-iteration closed-loop refinement further boosts performance by 0.8 points, confirming the effectiveness of motion-aware future reasoning.
Experimental results indicate that the proposed method maintains real-time inference at 17.65 FPS on an NVIDIA RTX 3090, with 69.36 million parameters, balancing efficiency and accuracy. The model's ability to decode motion features from predicted scene transitions leads to more coherent and safe trajectories, especially in challenging environments.
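The reported throughput and parameter count translate into a per-frame time budget and a rough weight-memory footprint. The arithmetic below is a back-of-envelope sketch; the bytes-per-parameter values are assumptions (float32 and float16 storage), since the source does not state the numeric precision used.

```python
FPS = 17.65          # throughput reported in the text (NVIDIA RTX 3090)
PARAMS = 69.36e6     # parameter count reported in the text

latency_ms = 1000.0 / FPS          # average time available per frame
fp32_mb = PARAMS * 4 / 1e6         # assumption: 4-byte float32 weights
fp16_mb = PARAMS * 2 / 1e6         # assumption: 2-byte float16 weights

print(f"{latency_ms:.1f} ms per frame")
print(f"{fp32_mb:.2f} MB (fp32), {fp16_mb:.2f} MB (fp16) for weights alone")
```

The memory figures cover weights only; activations and any framework overhead come on top.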

Significance

This work advances autonomous driving by bridging the gap between future scene prediction and actionable motion control. By explicitly decoding motion semantics from predicted scene transitions, IDOL enhances the system’s understanding of scene dynamics, leading to safer and more reliable planning in complex traffic scenarios. The integration of inverse dynamics into the latent space offers a new paradigm for future scene reasoning, addressing longstanding challenges in long-horizon decision-making and robustness. Its scalable architecture and superior benchmark performance suggest broad applicability in real-world autonomous systems, potentially transforming industry standards for safety and efficiency.

Technical Contribution

The key technical contribution is the integration of an inverse dynamics model within a latent BEV-based future prediction framework. This model explicitly decodes the motion changes implicit in predicted scene transitions, providing transition-aware features that directly inform trajectory optimization. The proposed closed-loop refinement mechanism iteratively improves long-term consistency, a significant step beyond prior methods that treat future scene prediction as a passive process. The architecture balances computational efficiency with high accuracy, leveraging transformer-based latent modeling and lightweight inverse dynamics modules. These innovations collectively enable a more interpretable and action-oriented future reasoning process, setting new benchmarks in end-to-end autonomous driving.

Novelty

This research is the first to embed an inverse dynamics model directly into the latent space of a world model for autonomous driving, decoding motion semantics from predicted scene transitions. Unlike previous approaches that rely solely on scene state forecasting, IDOL explicitly extracts transition-aware motion features, establishing a clear link between scene evolution and control signals. This approach introduces a novel mechanism for converting future scene predictions into actionable trajectories, fundamentally enhancing the interpretability and effectiveness of end-to-end planning systems. The combination of latent future prediction, inverse dynamics decoding, and iterative refinement constitutes a pioneering step in the field.

Limitations

The model's performance depends heavily on the quality of the latent scene representation; if the encoding fails to capture critical scene dynamics, the motion decoding may be inaccurate, especially in highly complex or unpredictable environments.
Computational overhead, although optimized, remains significant for large-scale deployment, particularly in scenarios requiring high-frequency updates or multi-agent interactions.
The current framework primarily focuses on static scene evolution and may struggle with highly dynamic scenarios involving complex interactions, such as aggressive maneuvers or unpredictable pedestrian behavior. Future work should explore more robust motion modeling and adaptive prediction horizons.

Future Work

Future research could focus on integrating richer sensor modalities, such as radar and high-definition maps, to improve scene understanding and motion decoding robustness. Enhancing the inverse dynamics model with learning-based control signals or reinforcement learning strategies may further improve action relevance. Extending the framework to multi-agent scenarios and real-world deployments will be crucial for practical applications. Additionally, exploring adaptive prediction horizons and uncertainty quantification could make the system more resilient to environmental variability, paving the way for safer autonomous driving in diverse conditions.

AI Executive Summary

Autonomous driving has seen rapid evolution, shifting from modular pipelines to end-to-end learning frameworks that aim to directly map sensor inputs to control commands. While these approaches simplify system design, they often lack explicit reasoning about how the scene might evolve in the future, limiting their foresight and robustness. Recent advances incorporate world models that predict future scene states, but these predictions tend to be weakly coupled with motion planning, making it difficult to translate scene forecasts into actionable trajectories.

The core challenge lies in bridging the gap between scene prediction and motion control. Existing methods can forecast what might happen but struggle to determine how to adjust vehicle trajectories accordingly. This disconnect hampers safety and efficiency, especially in complex or uncertain environments. To address this, the authors propose IDOL, a novel framework that integrates inverse dynamics into latent scene prediction. This approach decodes the motion implications embedded in predicted scene transitions, turning passive scene forecasting into an active planning guide.

IDOL operates within the latent Bird’s Eye View (BEV) space, leveraging a multi-modal perception backbone (ResNet-34 + TransFuser) to encode current scenes. It then employs a latent world model (BEVWorldModel) to generate multiple future scene states. The key innovation involves applying an inverse dynamics model (IDM) to pairs of adjacent future states, decoding transition-aware motion features—spatial maps and global cues—that reflect how the scene evolves dynamically. These features are fused into the trajectory planning network, enabling it to refine vehicle trajectories based on predicted scene dynamics.

Furthermore, IDOL incorporates a lightweight closed-loop refinement mechanism. This iterative process re-evaluates and adjusts the planned trajectory by reusing the decoded motion features, improving long-term consistency and robustness. Extensive experiments on the NAVSIM benchmarks show that IDOL surpasses prior state-of-the-art methods, achieving a PDMS of 90.0 and an EPDMS of 38.0, with particularly strong results in complex scenarios. Beyond strengthening scene understanding in autonomous driving systems, this design offers a new technical path toward safer and more intelligent driving. By tightly coupling future scene prediction with motion control, IDOL opens a new direction for future reasoning in latent space, with potentially far-reaching implications for industry applications.
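The closed-loop idea can be pictured as a short loop: each round re-runs future reasoning on the current trajectory and applies the corrections it returns. The sketch below is illustrative only. `toy_predict_deltas` is a hypothetical stand-in for the world model plus inverse dynics stage, and the default of two rounds follows the two-round setting mentioned in the results.

```python
from typing import Callable, List, Tuple

Waypoint = Tuple[float, float]
Trajectory = List[Waypoint]


def closed_loop_refine(
    trajectory: Trajectory,
    predict_deltas: Callable[[Trajectory], List[Waypoint]],
    rounds: int = 2,
) -> Trajectory:
    """Refine a trajectory for `rounds` passes.

    Each pass re-runs future reasoning on the current trajectory and applies
    the per-waypoint corrections it returns.
    """
    for _ in range(rounds):
        deltas = predict_deltas(trajectory)
        trajectory = [(x + dx, y + dy) for (x, y), (dx, dy) in zip(trajectory, deltas)]
    return trajectory


def toy_predict_deltas(traj: Trajectory) -> List[Waypoint]:
    """Hypothetical correction: pull each waypoint halfway toward the lane centre y = 0."""
    return [(0.0, -0.5 * y) for _, y in traj]


initial = [(1.0, 0.8), (2.0, 0.4)]
refined = closed_loop_refine(initial, toy_predict_deltas, rounds=2)
```

In this toy setting each round halves the lateral error, so two rounds shrink it to a quarter of its initial value. The real corrections are learned and need not contract this way.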

Deep Analysis

Background

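Autonomous driving has shifted from traditional modular architectures to end-to-end deep learning. Early methods such as behavior cloning and imitation learning built a direct mapping across perception, prediction, and control, but suffered from limited generalization and system complexity. More recently, latent world models (such as Dreamer and GAIA-1) have been introduced to simulate future scene changes and support long-horizon decision-making, markedly improving robustness. In parallel, Transformer-based multi-modal perception fusion (such as TransFuser) has driven deeper integration of sensor information and supplied rich latent representations for future scene prediction. Even so, existing methods mostly stop at state prediction and lack a deeper understanding of motion changes, which limits how much guidance they can give to planning.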

Core Problem

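The core problem is how to convert future scene changes predicted in latent space into motion control signals, so that future scene prediction and trajectory generation are tightly coupled. Traditional methods can predict future states but do not explicitly decode the semantics of motion changes, so their predictions are descriptive rather than actionable. In complex traffic, a system must make safe, reasonable decisions under an uncertain future. Existing approaches mostly treat future state changes as passive information and do not model motion changes explicitly, which limits the autonomy and robustness of the system.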

Innovation

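The paper's innovations include:
• Introducing an inverse dynamics model (IDM) into latent BEV space to decode motion changes in future scenes and provide motion features.
• Designing a multi-round closed-loop refinement mechanism that repeatedly corrects the trajectory through multi-step future reasoning, improving long-horizon consistency.
• Combining multi-step future prediction with motion feature decoding in latent space, moving beyond pure state prediction and making future reasoning more actionable.
• Using a lightweight architecture that preserves real-time operation and high performance, offering a new direction for end-to-end autonomous driving.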

Methodology

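• Sensor fusion: ResNet-34 and TransFuser encode multi-modal perception inputs into a latent BEV representation of the current scene.
• Future prediction: a latent world model (BEVWorldModel) predicts future latent states over multiple steps, forming a latent scene sequence.
• Inverse dynamics decoding: adjacent future states are fed to the inverse dynamics model (IDM), which decodes a spatial motion map S and a global feature g that reflect motion changes.
• Fusion and refinement: the motion features are fused into the trajectory planning network to adjust the trajectory.
• Closed-loop refinement: multiple rounds of future scene reasoning use the decoded motion features to keep correcting the trajectory and maintain long-horizon consistency.
• Training objective: trajectory offset regression, reward supervision, and semantic BEV supervision are combined to optimize the model.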
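The training objective named in the last step combines three supervision terms. A minimal sketch of such a combination is a weighted sum, shown below. The weights, the use of mean absolute error for the offset term, and the toy loss values are all assumptions for illustration; the source does not give the loss forms or weights.

```python
from typing import Sequence


def mean_abs_error(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def total_loss(
    offset_loss: float,
    reward_loss: float,
    bev_loss: float,
    w_offset: float = 1.0,  # hypothetical weight
    w_reward: float = 1.0,  # hypothetical weight
    w_bev: float = 1.0,     # hypothetical weight
) -> float:
    """Weighted sum of the three supervision terms named in the text."""
    return w_offset * offset_loss + w_reward * reward_loss + w_bev * bev_loss


# Toy values standing in for per-batch loss terms.
offset_term = mean_abs_error([0.5, 1.0], [0.0, 1.0])  # trajectory offset regression
loss = total_loss(offset_term, reward_loss=0.5, bev_loss=0.25)
```

Keeping the terms separate until the final sum makes it easy to log each one and to tune the weights independently.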

Experiments

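The model is evaluated on two public benchmarks, NAVSIM v1 and NAVSIM v2, using closed-loop metrics such as PDMS and EPDMS. Training uses 4 GPUs with a batch size of 4 and takes about 24 hours. The model takes 256-dimensional latent features as input and predicts 8 future time steps (4 seconds), using multi-modal fusion and latent-space reasoning. Compared with several state-of-the-art methods (such as WoTE and DiffusionDrive), IDOL scores higher on all metrics and is notably more robust in complex scenarios (navhard). Ablation studies verify the contributions of the inverse dynamics model and closed-loop refinement, showing the model's advantage in long-horizon planning.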
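The prediction horizon quoted here (8 future steps covering 4 seconds) fixes the time step and the number of adjacent state pairs available to the inverse dynamics model. The short calculation below makes that explicit. Whether the current state is paired with the first future state is not stated in the source, so both counts are shown as assumptions.

```python
HORIZON_S = 4.0   # prediction horizon in seconds, from the text
NUM_STEPS = 8     # number of predicted future steps, from the text

dt = HORIZON_S / NUM_STEPS                           # seconds between steps
rate_hz = 1.0 / dt                                   # equivalent prediction rate
timestamps = [dt * (k + 1) for k in range(NUM_STEPS)]

# Adjacent pairs the IDM could consume:
pairs_future_only = NUM_STEPS - 1    # assumption: (z1, z2) ... (z7, z8)
pairs_with_current = NUM_STEPS       # assumption: (z0, z1) is also included

print(dt, rate_hz, timestamps[-1])
```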

Results

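IDOL reaches a PDMS of 90.0 on NAVSIM v1, above the strongest baselines (for example, WoTE at 79.3), and an EPDMS of 38.0 on the NAVSIM v2 navhard split, 10.1 points ahead of the competing method. In terms of long-horizon consistency, two rounds of closed-loop refinement markedly improve trajectory stability and plausibility. Ablations show that adding the inverse dynamics model raises PDMS by 2.2 points and that closed-loop refinement adds another 0.8 points, supporting the effectiveness of motion feature decoding. These results indicate that IDOL has stronger scene understanding and motion adjustment in complex traffic.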

Applications

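• Immediate applications: safer, more robust path planning in autonomous vehicles, especially anticipating potential risks in complex traffic and optimizing vehicle motion to improve safety and efficiency.
• Intelligent traffic management: using IDOL to predict and schedule traffic flow, optimizing signal control and traffic volume to reduce congestion and accidents.
• Outlook: supporting wider adoption of autonomous driving systems in driverless vehicles, smart cities, and related areas, improving traffic safety and efficiency.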

Limitations & Outlook

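The model's perception and prediction under extreme weather still need improvement; sensor noise and environmental complexity can degrade the accuracy of latent states. The inverse dynamics model depends heavily on the latent representation, so bias there can cause errors in motion feature decoding. In addition, the computational cost remains high in multi-object, multi-modal scenarios, and real-time performance needs further optimization. Future work should incorporate richer perception information to improve generalization and robustness.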

Plain Language Accessible to non-experts

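Imagine you work in a factory full of machines that run nonstop. Your job is to make sure each machine finishes its task on time, but you cannot control the machines directly; you can only watch their state. You notice that when a machine runs hotter or vibrates more, it may be about to fail. Now suppose you can also predict how the machines will change over the next few seconds and adjust what you do accordingly, for example by slowing or pausing a machine ahead of time. IDOL works like this factory manager: it observes and predicts how the scene will change, decodes the motion information behind those changes, and adjusts the vehicle's trajectory in advance to keep driving safe and efficient. It does not just see the future; it also understands what the coming changes mean, and so makes smarter decisions.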

ELI14 Explained like you're 14

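Imagine you are playing a racing game where you have to drive fast and also anticipate obstacles and turns ahead. An ordinary game program might only tell you what to do in the next second, but IDOL is like a smart assistant that tells you how the road will change over the next few seconds and helps you adjust your line. It observes and predicts future scene changes, works out the motion patterns behind them, and helps you react early. As a result, your car runs faster and steadier and avoids hitting obstacles. The method makes a self-driving car behave like a skilled racing driver who anticipates the road ahead and picks the safest decision.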

Glossary

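Latent BEV Space

A compressed scene representation encoded by a deep neural network, convenient for future prediction and motion reasoning.

The core space used for latent scene prediction and motion feature decoding.

Inverse Dynamics Model

A model that infers the control signal behind a state change, revealing the underlying relationship between states and motion.

Used in this paper to decode motion features from future latent states.

Latent World Model

A deep learning model that simulates future scene changes in latent space.

Used for multi-step future state prediction.

Trajectory Anchor

A predefined candidate trajectory used to guide planning.

Serves as the base reference for trajectory refinement.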
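One common way to use trajectory anchors is to score each predefined candidate and then add a predicted offset to the best one. The sketch below illustrates that pattern. It is an assumption about how anchors, reward supervision, and offset regression could fit together, not a description confirmed by the source; `select_and_refine` and all numeric values are hypothetical.

```python
from typing import List, Sequence, Tuple

Waypoint = Tuple[float, float]
Trajectory = List[Waypoint]


def select_and_refine(
    anchors: Sequence[Trajectory],
    scores: Sequence[float],
    offsets: Sequence[Trajectory],
) -> Tuple[int, Trajectory]:
    """Pick the highest-scoring anchor and add its predicted per-waypoint offset."""
    best = max(range(len(anchors)), key=lambda i: scores[i])
    refined = [
        (x + dx, y + dy)
        for (x, y), (dx, dy) in zip(anchors[best], offsets[best])
    ]
    return best, refined


anchors = [
    [(1.0, 0.0), (2.0, 0.0)],    # go straight
    [(1.0, 0.5), (2.0, 1.0)],    # drift left
]
scores = [0.25, 0.75]             # hypothetical scoring-head outputs
offsets = [
    [(0.0, 0.0), (0.0, 0.0)],
    [(0.0, -0.25), (0.0, -0.5)],  # hypothetical offset-head outputs
]
best, refined = select_and_refine(anchors, scores, offsets)
```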

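Closed-loop Refinement

A method that repeatedly corrects the trajectory over multiple rounds using prediction results.

Improves long-horizon planning consistency.

Future Scene Prediction

The process of estimating future scene states in latent space.

One of the core techniques.

Motion Feature Decoding

A technique for extracting motion-change information from latent states.

Implemented by the inverse dynamics model.

Latent Space

A deeply encoded scene representation space that supports efficient reasoning.

Used for future state prediction and motion decoding.

Multimodal Perception Fusion

Combining information from multiple sensors to strengthen scene understanding.

Serves as the basis for scene encoding.

End-to-End Planning

A continuous learned pipeline from perception to control, with no intermediate modules.

Part of the research background of this paper.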

Open Questions Unanswered questions from this research

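1 The model's robustness in extreme weather and complex traffic still needs improvement. With heavy sensor noise or environmental interference in particular, the accuracy of future state prediction may suffer. How to strengthen generalization in multi-modal fusion is an important direction for future research.
2 The decoding quality of the inverse dynamics model depends heavily on the quality of the latent scene representation. If the latent space does not adequately express motion changes in the scene, motion feature extraction will be limited. This points to a need for stronger latent representation learning.
3 Real-time performance and computational cost remain challenges for large-scale deployment, especially in multi-object, multi-modal, multi-step prediction settings. How to optimize inference speed and hardware utilization is a key question for future work.
4 Future work could combine strategies such as reinforcement learning to optimize the decision process of inverse dynamics decoding and trajectory adjustment, for more autonomous and intelligent motion control.
5 The model's generalization and adaptability across diverse traffic environments still need validation, especially across different countries and road conditions.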

Applications

Immediate Applications

Autonomous Vehicle Path Planning

Utilize IDOL to achieve safer, more robust path planning, especially in complex traffic scenarios by predicting potential risks and optimizing vehicle trajectories.

Intelligent Traffic Management

Combine IDOL's predictive capabilities to optimize traffic flow, signal control, and congestion management, reducing accidents and improving throughput.

Driver Assistance Systems

Integrate IDOL into existing ADAS to enhance scene understanding and proactive control, improving safety and driving comfort.

Long-term Vision

Autonomous Ecosystems

Promote widespread deployment of autonomous vehicles in urban and rural environments, creating intelligent transportation networks that reduce accidents and energy consumption.

Smart City Infrastructure

Leverage IDOL's predictive insights to develop adaptive traffic infrastructure, enabling real-time traffic flow optimization and urban mobility improvements.

Abstract

End-to-end autonomous driving has emerged as a compelling paradigm for learning planning directly from sensor observations, while recent world-model-based approaches further enrich this paradigm by enabling explicit reasoning about how the scene may evolve in the future. Yet future prediction alone does not guarantee better planning unless the predicted evolution can be converted into planning-relevant trajectory updates. Many current methods still forecast future scene states without explicitly decoding the motion implications hidden in state transitions. As a result, future reasoning often remains descriptively useful but only weakly coupled to executable motion generation. To address this limitation, we propose IDOL, an inverse-dynamics-guided future prediction framework for world-model-based end-to-end planning in latent BEV space, where inverse dynamics serves as the key bridge between future prediction and trajectory optimization. IDOL first predicts multiple future latent scene states with a BEV world model, then applies an inverse dynamics model to adjacent latent futures to decode transition-aware trajectory features and recover planning-relevant motion deltas that explain how the latent world evolves over time. These inverse-dynamics-derived signals are used to optimize the planned trajectory, turning future forecasting from passive scene anticipation into actionable planning guidance. A lightweight closed-loop refinement module further improves long-horizon consistency by reusing the optimized trajectory for another round of future-aware reasoning. By introducing inverse dynamics into latent future reasoning, IDOL tightens the coupling between world modeling and planning. Extensive experiments on the NAVSIM v1 and NAVSIM v2 benchmarks show that IDOL achieves state-of-the-art performance among comparable methods.


