DynVLA: Learning World Dynamics for Action Reasoning in Autonomous Driving
DynVLA uses Dynamics CoT to predict compact world dynamics before acting, excelling on benchmarks such as NAVSIM.
Key Findings
Methodology
DynVLA introduces a new Chain of Thought (CoT) paradigm called Dynamics CoT. Its core components include a Dynamics Tokenizer that compresses future dynamics into a compact set of tokens. By decoupling ego-centric and environment-centric dynamics, DynVLA achieves more accurate world dynamics modeling. Additionally, through Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT), DynVLA generates dynamics tokens before actions, improving decision quality while maintaining efficient inference.
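To make the generation order concrete, here is a minimal, hypothetical sketch of the Dynamics-CoT inference flow described above: the model emits a small set of dynamics tokens first, then conditions the action on them. This is illustrative pseudocode in Python, not the authors' implementation; all function names (`encode_obs`, `predict_dynamics`, `predict_action`) are assumptions.

```python
# Illustrative sketch (not the authors' code) of Dynamics-CoT inference:
# reason about compact world dynamics, then generate the action.

def dynamics_cot_step(encode_obs, predict_dynamics, predict_action, obs,
                      num_dynamics_tokens=8):
    """Emit a compact set of dynamics tokens, then an action conditioned on them."""
    ctx = encode_obs(obs)                        # visual/language context
    dyn_tokens = []
    for _ in range(num_dynamics_tokens):         # compact CoT: a few tokens,
        tok = predict_dynamics(ctx, dyn_tokens)  # not dense future images
        dyn_tokens.append(tok)
    action = predict_action(ctx, dyn_tokens)     # action grounded in dynamics
    return dyn_tokens, action
```

The key design point this sketch captures is that the dynamics tokens are generated autoregressively before the action, so the action prediction can attend to a compact forecast of the world rather than to dense visual rollouts.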
Key Results
- On the NAVSIM benchmark, DynVLA achieved the highest PDMS score, outperforming both traditional end-to-end methods and recent VLA methods, indicating its advantage in future dynamics reasoning.
- On the Bench2Drive benchmark, DynVLA achieved the best performance across all metrics, demonstrating its advantages in long-horizon interactive scenarios.
- On a large-scale in-house dataset, DynVLA achieved the lowest ADE and Collision Rate, indicating its reliability at larger data scales.
Significance
The introduction of DynVLA provides a new method for dynamics modeling in autonomous driving, significantly improving decision accuracy and efficiency by reasoning future dynamics before action generation. It addresses the shortcomings of existing textual and visual CoT methods in spatiotemporal understanding and reasoning redundancy, providing more physically grounded decision support for autonomous driving models. Its outstanding performance across multiple benchmarks validates its practical value in academia and industry.
Technical Contribution
Technically, DynVLA introduces Dynamics CoT, offering a compact dynamics representation that reduces reasoning redundancy and improves spatiotemporal modeling accuracy. Compared to existing textual and visual CoT methods, DynVLA avoids redundant reasoning by encoding only scene dynamics. Additionally, its Dynamics Tokenizer achieves more physically meaningful dynamics representation by decoupling ego-centric and environment-centric dynamics.
Novelty
DynVLA is the first to introduce the Dynamics CoT paradigm in autonomous driving, addressing the limitations of textual and visual CoT in spatiotemporal understanding through compact dynamics representation. Its Dynamics Tokenizer design provides more accurate dynamics modeling by decoupling dynamic factors.
Limitations
- DynVLA may face challenges in complex urban traffic scenarios, where dynamic factors are more diverse and unpredictable.
- In high-density traffic environments, the Dynamics Tokenizer may fail to capture all significant dynamic changes.
- Further research is needed to enhance the model's robustness in scenarios with drastic dynamic changes.
Future Work
Future research directions include testing DynVLA's performance in more complex traffic scenarios and exploring how to further optimize the Dynamics Tokenizer design to improve its performance in high-density traffic environments. Additionally, integrating DynVLA with other autonomous driving technologies could yield more comprehensive autonomous driving solutions.
AI Executive Summary
Autonomous driving technology has made significant progress in recent years, but existing methods still face challenges in complex traffic scenarios. Traditional textual and visual Chain of Thought (CoT) methods have limitations in spatiotemporal understanding and reasoning efficiency, making it difficult to cope with dynamic driving environments.
To address these issues, researchers have proposed DynVLA, a new autonomous driving Vision-Language-Action (VLA) model. DynVLA introduces a new CoT paradigm called Dynamics CoT, which provides more physically grounded decision support by predicting compact world dynamics before action generation. Its core component is the Dynamics Tokenizer, which compresses future dynamics into a compact set of tokens.
By decoupling ego-centric and environment-centric dynamics, DynVLA achieves more accurate world dynamics modeling. Through Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT), DynVLA generates dynamics tokens before actions, improving decision quality while maintaining efficient inference. Compared to textual and visual CoT methods, Dynamics CoT avoids redundant reasoning by encoding only scene dynamics.
DynVLA has demonstrated outstanding performance across multiple benchmarks. On the NAVSIM benchmark, it achieved the highest PDMS score, outperforming both traditional end-to-end methods and recent VLA methods. On the Bench2Drive benchmark, DynVLA achieved the best performance across all metrics, demonstrating its advantages in long-horizon interactive scenarios. On a large-scale in-house dataset, DynVLA achieved the lowest ADE and Collision Rate, indicating its reliability at larger data scales.
The introduction of DynVLA provides a new method for dynamics modeling in autonomous driving, significantly improving decision accuracy and efficiency by reasoning future dynamics before action generation. Its outstanding performance across multiple benchmarks validates its practical value in academia and industry. However, DynVLA may face challenges in complex urban traffic scenarios, where dynamic factors are more diverse and unpredictable. Future research directions include testing DynVLA's performance in more complex traffic scenarios and exploring how to further optimize the Dynamics Tokenizer design to improve its performance in high-density traffic environments.
Deep Analysis
Background
The research on autonomous driving technology has a history of several decades, and significant progress has been made in recent years with the advancement of deep learning and computer vision technologies. Traditional autonomous driving systems often rely on rule-based and model-driven approaches, which perform well in simple driving environments but struggle in complex urban traffic scenarios. In recent years, end-to-end deep learning methods have gradually become a research hotspot, learning driving strategies directly from sensor data and avoiding complex rule design. However, these methods still have limitations in interpretability and robustness.
To improve the decision quality of autonomous driving systems, researchers have begun exploring Vision-Language-Action (VLA) models, which combine visual and language information to better understand and reason about complex dynamic relationships in driving scenarios. The Chain of Thought (CoT) paradigm is an important method in VLA models, improving decision reliability by reasoning before action generation. However, existing textual and visual CoT methods have limitations in spatiotemporal understanding and reasoning efficiency, making it difficult to cope with dynamic driving environments.
Core Problem
Autonomous driving systems face numerous challenges in complex urban traffic scenarios, where dynamic factors are diverse and unpredictable. Existing textual and visual Chain of Thought (CoT) methods have limitations in spatiotemporal understanding and reasoning efficiency, making it difficult to cope with these dynamically changing driving environments. Textual CoT methods lack fine-grained spatiotemporal understanding, while visual CoT methods introduce substantial redundancy due to dense image prediction, leading to inefficient reasoning. Therefore, a new method is urgently needed to improve the decision quality and efficiency of autonomous driving systems in complex dynamic environments.
Innovation
The core innovation of DynVLA lies in introducing a new Chain of Thought (CoT) paradigm called Dynamics CoT. Its Dynamics Tokenizer achieves more accurate world dynamics modeling by decoupling ego-centric and environment-centric dynamics. Dynamics CoT provides more physically grounded decision support by predicting compact world dynamics before action generation. Compared to existing textual and visual CoT methods, Dynamics CoT avoids redundant reasoning by encoding only scene dynamics. Additionally, DynVLA generates dynamics tokens before actions through Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT), improving decision quality while maintaining efficient inference.
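The SFT-then-RFT recipe mentioned above can be sketched with a toy group-relative policy-gradient update: sample several dynamics+action rollouts per scene, score them with a driving reward, and weight each sample's log-probability by its group-normalized advantage. This is a hedged sketch only; the paper's actual reward terms and RFT algorithm are not specified here, and all names are assumptions.

```python
import numpy as np

# Hypothetical sketch of a reinforcement fine-tuning (RFT) objective in the
# spirit of group-relative policy optimization. Rewards could score, e.g.,
# collision avoidance or progress; those terms are assumptions, not the paper's.

def grpo_advantages(rewards):
    """Normalize rewards within one sampled group to get advantages."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def rft_loss(logprobs, rewards):
    """Policy-gradient surrogate: -sum(advantage * logprob of each sample)."""
    adv = grpo_advantages(rewards)
    return float(-(adv * np.asarray(logprobs, dtype=float)).sum())
```

Minimizing this surrogate raises the probability of rollouts whose reward beats the group average, which is one plausible way to refine dynamics-then-action generation without a learned value function.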
Methodology
The implementation of DynVLA includes the following key steps:
- Dynamics Tokenizer: DynVLA first compresses future dynamics into a compact set of tokens via the Dynamics Tokenizer, which decouples ego-centric and environment-centric dynamics to achieve more accurate world dynamics modeling.
- Supervised Fine-Tuning (SFT): The model is trained with SFT to generate dynamics tokens before action tokens, grounding its decisions in predicted dynamics.
- Reinforcement Fine-Tuning (RFT): RFT further refines the dynamics-then-action generation, improving decision quality while maintaining efficient inference.
- Dynamics CoT: Unlike existing textual and visual CoT methods, Dynamics CoT encodes only scene dynamics, avoiding redundant reasoning.
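The decoupled tokenization step above can be illustrated with a toy vector-quantization scheme: ego-motion features and environment-motion features are each mapped to the nearest entry of their own codebook, yielding two separate token streams. A VQ-style codebook is one plausible realization; the paper's actual tokenizer architecture may differ, and every name here is hypothetical.

```python
import numpy as np

# Toy sketch of a decoupled dynamics tokenizer: ego and environment dynamics
# are quantized against separate codebooks (an assumed VQ-style scheme).

def quantize(features, codebook):
    """Map each feature vector (T, D) to its nearest codebook row (K, D)."""
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)  # (T,) integer token ids

def tokenize_dynamics(ego_future, env_future, ego_codebook, env_codebook):
    """Produce two compact token streams, one per dynamics source."""
    ego_tokens = quantize(ego_future, ego_codebook)
    env_tokens = quantize(env_future, env_codebook)
    return ego_tokens, env_tokens
```

Keeping the two codebooks separate mirrors the paper's stated motivation: ego motion and the motion of other traffic participants evolve differently, so quantizing them jointly would blur two distinct sources of dynamics.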
Experiments
The experimental design includes evaluations on multiple benchmarks, including NAVSIM, Bench2Drive, and a large-scale in-house dataset. Various baseline methods were used for comparison, including traditional end-to-end methods and recent VLA methods. Key hyperparameters include the size of the Dynamics Tokenizer and the decoupling strategy. Ablation studies were also conducted to verify the effectiveness of the Dynamics Tokenizer and Dynamics CoT.
Results
On the NAVSIM benchmark, DynVLA achieved the highest PDMS score, outperforming both traditional end-to-end methods and recent VLA methods. On the Bench2Drive benchmark, DynVLA achieved the best performance across all metrics, demonstrating its advantages in long-horizon interactive scenarios. On a large-scale in-house dataset, DynVLA achieved the lowest ADE and Collision Rate, indicating its reliability at larger data scales. Ablation studies show that the design of the Dynamics Tokenizer and Dynamics CoT plays a key role in improving model performance.
Applications
DynVLA's application scenarios include decision systems for autonomous vehicles, especially in complex urban traffic environments. Its compact dynamics representation and efficient reasoning process enable it to provide more reliable decision support in dynamically changing scenarios. DynVLA can also be integrated with other autonomous driving technologies to achieve more comprehensive autonomous driving solutions.
Limitations & Outlook
DynVLA may face challenges in complex urban traffic scenarios, where dynamic factors are more diverse and unpredictable. Additionally, in high-density traffic environments, the Dynamics Tokenizer may fail to capture all significant dynamic changes. Future research directions include testing DynVLA's performance in more complex traffic scenarios and exploring how to further optimize the Dynamics Tokenizer design to improve its performance in high-density traffic environments.
Plain Language (Accessible to non-experts)
Imagine you're driving, and the vehicles and pedestrians around you are constantly changing. To drive safely, you need to predict these dynamic changes and make corresponding decisions. DynVLA is like a smart assistant that helps you predict future traffic dynamics before you make driving decisions. It uses a tool called the Dynamics Tokenizer to compress future changes into a set of simple tokens. These tokens are like a compass for you while driving, helping you make wiser decisions in complex traffic environments. Unlike traditional methods, DynVLA not only considers your driving behavior but also takes into account changes in the surrounding environment. It's like when you're driving, you not only pay attention to your speed but also the movements of surrounding vehicles. DynVLA helps you drive safely in complex traffic environments through this approach.
ELI14 (Explained like you're 14)
Hey there, friends! Have you ever wondered how self-driving cars know when to turn or brake? It's not magic! Scientists have invented a super-smart system called DynVLA. Imagine you're playing a racing game, and the car in the game automatically adjusts its speed and direction based on the track changes. DynVLA is like the smart assistant in the game, predicting changes on the road, like whether the car in front will suddenly stop or if the car next to you will change lanes. This way, self-driving cars can react in advance and avoid collisions. Isn't that cool? But, this system also has some challenges, like when there's a lot of traffic, it might get a bit overwhelmed. But scientists are working hard to make it smarter and safer. In the future, with DynVLA, self-driving cars will be more reliable, making our travel safer!
Glossary
Dynamics Tokenizer
A tool that compresses future dynamic changes into a compact set of tokens. It achieves more accurate world dynamics modeling by decoupling ego-centric and environment-centric dynamics.
In DynVLA, the Dynamics Tokenizer is used to generate compact dynamics representations to improve decision quality.
Chain of Thought (CoT)
A reasoning paradigm that improves decision reliability by reasoning before action generation.
In DynVLA, Dynamics CoT avoids redundant reasoning by encoding only scene dynamics.
Vision-Language-Action Model (VLA)
A model that combines visual and language information to better understand and reason about complex dynamic relationships in driving scenarios.
DynVLA is a new autonomous driving VLA model that improves decision quality through Dynamics CoT.
Supervised Fine-Tuning (SFT)
A training stage that fine-tunes a model on labeled examples of the desired output.
In DynVLA, SFT teaches the model to generate dynamics tokens before action generation.
Reinforcement Fine-Tuning (RFT)
A training method that improves model decision quality and reasoning efficiency through reinforcement learning.
In DynVLA, RFT is used to generate dynamics tokens before action generation.
Ego-centric Dynamics
Dynamic changes arising from the motion of the ego vehicle.
In DynVLA, ego-centric dynamics are used alongside environment-centric dynamics for dynamic modeling.
Environment-centric Dynamics
Dynamic changes arising from external changes such as other traffic participants.
In DynVLA, environment-centric dynamics are used alongside ego-centric dynamics for dynamic modeling.
Spatiotemporal Modeling
Modeling dynamic changes in both time and space dimensions.
In DynVLA, spatiotemporal modeling is used to improve decision accuracy.
Ablation Study
An experimental method that evaluates the impact of removing or modifying certain parts of a model on overall performance.
In DynVLA experiments, ablation studies are used to verify the effectiveness of the Dynamics Tokenizer and Dynamics CoT.
Benchmark
A standardized testing method used to evaluate model performance on specific tasks.
In DynVLA experiments, multiple benchmarks are used to evaluate model performance.
Open Questions (Unanswered questions from this research)
1. Despite DynVLA's outstanding performance across multiple benchmarks, its performance in complex urban traffic scenarios still needs further validation. These scenarios have more diverse and unpredictable dynamic factors, which may challenge the model's robustness.
2. In high-density traffic environments, the Dynamics Tokenizer may fail to capture all significant dynamic changes. How to optimize the Dynamics Tokenizer design to improve its performance in high-density traffic environments remains a research question.
3. DynVLA's Dynamics Tokenizer achieves more accurate dynamics modeling by decoupling ego-centric and environment-centric dynamics, but the applicability of this decoupling strategy in different scenarios needs further exploration.
4. In terms of reasoning efficiency, although Dynamics CoT reduces redundancy through compact dynamics representation, how to further improve reasoning efficiency in practical applications remains an open question.
5. Future research directions include integrating DynVLA with other autonomous driving technologies to achieve more comprehensive autonomous driving solutions. This requires exploring integration methods between different technologies.
Applications
Immediate Applications
Decision Systems for Autonomous Vehicles
DynVLA can be used in decision systems for autonomous vehicles, especially in complex urban traffic environments. Its compact dynamics representation and efficient reasoning process enable it to provide more reliable decision support in dynamically changing scenarios.
Long-term Vision
Comprehensive Autonomous Driving Solutions
In the future, DynVLA can be integrated with other autonomous driving technologies to achieve more comprehensive autonomous driving solutions. This requires exploring integration methods between different technologies to improve the overall performance and safety of the system.
Abstract
We propose DynVLA, a driving VLA model that introduces a new CoT paradigm termed Dynamics CoT. DynVLA forecasts compact world dynamics before action generation, enabling more informed and physically grounded decision-making. To obtain compact dynamics representations, DynVLA introduces a Dynamics Tokenizer that compresses future evolution into a small set of dynamics tokens. Considering the rich environment dynamics in interaction-intensive driving scenarios, DynVLA decouples ego-centric and environment-centric dynamics, yielding more accurate world dynamics modeling. We then train DynVLA to generate dynamics tokens before actions through SFT and RFT, improving decision quality while maintaining latency-efficient inference. Compared to Textual CoT, which lacks fine-grained spatiotemporal understanding, and Visual CoT, which introduces substantial redundancy due to dense image prediction, Dynamics CoT captures the evolution of the world in a compact, interpretable, and efficient form. Extensive experiments on NAVSIM, Bench2Drive, and a large-scale in-house dataset demonstrate that DynVLA consistently outperforms Textual CoT and Visual CoT methods, validating the effectiveness and practical value of Dynamics CoT.