Latent-WAM: Latent World Action Modeling for End-to-End Autonomous Driving

TL;DR

Latent-WAM achieves efficient end-to-end autonomous driving with spatially-aware and dynamics-informed latent world representations, scoring 89.3 on NAVSIM v2.

cs.CV · Advanced · 2026-03-26
Linbo Wang, Yupeng Zheng, Qiang Chen, Shiwei Li, Yichen Zhang, Zebin Xing, Qichao Zhang, Xiang Li, Deheng Qian, Pengxuan Yang, Yihang Dong, Ce Hao, Xiaoqing Ye, Junyu Han, Yifeng Pan, Dongbin Zhao
autonomous driving · latent world modeling · trajectory planning · Transformer · compressive encoding

Key Findings

Methodology

Latent-WAM is an efficient end-to-end autonomous driving framework that achieves robust trajectory planning through spatially-aware and dynamics-informed latent world representations. The framework consists of two core modules: the Spatial-Aware Compressive World Encoder (SCWE) and the Dynamic Latent World Model (DLWM). SCWE extracts geometric knowledge from a foundation model and compresses multi-view images into compact scene tokens via learnable queries. DLWM employs a causal Transformer to autoregressively predict future world status conditioned on historical visual and motion representations.
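The paper does not release code here, but the core idea of SCWE-style compression can be sketched as learnable queries cross-attending to flattened multi-view features. The snippet below is a minimal single-head illustration, not the paper's implementation; the function name `compress_views` and all shapes are assumptions for exposition.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def compress_views(view_feats, queries):
    """Hypothetical sketch: single-head cross-attention in which a small,
    fixed set of (learnable) queries attends over flattened multi-view
    image features and pools them into compact scene tokens."""
    d = queries.shape[-1]
    attn = softmax(queries @ view_feats.T / np.sqrt(d))  # (Q, N) attention weights
    return attn @ view_feats                             # (Q, d) scene tokens

rng = np.random.default_rng(0)
view_feats = rng.normal(size=(6 * 400, 256))  # e.g. 6 camera views, 400 patches each
queries = rng.normal(size=(64, 256))          # 64 learnable queries
tokens = compress_views(view_feats, queries)
print(tokens.shape)  # (64, 256)
```

The compression ratio here (2400 patch features down to 64 tokens) is illustrative; the key design point is that the token count is fixed by the query set, independent of image resolution or camera count.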

Key Results

  • On the NAVSIM v2 dataset, Latent-WAM achieved an EPDMS score of 89.3, surpassing the previous best perception-free method by 3.2 points, while significantly reducing the amount of training data and using a compact 104M-parameter model.
  • On the HUGSIM dataset, Latent-WAM achieved an HD-Score of 28.9, demonstrating its strong adaptability and robustness across different environments.
  • Ablation studies confirmed the contributions of the SCWE and DLWM modules to overall performance, validating the rationale and effectiveness of each module's design.

Significance

The introduction of Latent-WAM is significant for both academia and industry. It addresses the limitations of existing world-model-based planners in terms of representation compression, spatial understanding, and temporal dynamics utilization, achieving superior planning performance under constrained data and compute budgets. This framework offers new insights for the autonomous driving field, particularly in scenarios with scarce data and limited computational resources, showcasing its powerful potential and application value.

Technical Contribution

Latent-WAM differs from existing state-of-the-art methods in two main ways. Its SCWE module strengthens spatial understanding by distilling geometric knowledge from a foundation model, while the DLWM module better exploits temporal dynamics through a causal Transformer. Together, these contributions open up new engineering possibilities for autonomous driving systems under tight data and compute budgets.

Novelty

Latent-WAM's novelty lies in being the first to integrate spatially-aware and dynamics-informed latent world representations, achieving efficient trajectory planning through a causal Transformer. Compared to existing methods, it offers significant improvements in representation compression and dynamic prediction.

Limitations

  • In extremely complex traffic scenarios, Latent-WAM may experience performance degradation, primarily due to the limited complexity of the datasets used during training.
  • The method's reliance on the foundation model may limit its transferability across different environments.
  • In scenarios with extremely limited computational resources, real-time performance may still be challenging despite the model's compact size.

Future Work

Future research directions include exploring performance improvements in more complex traffic scenarios, optimizing the choice of foundation models to enhance environmental adaptability, and improving real-time performance under constrained computational resources. Additionally, integrating more sensor data to enhance model robustness and accuracy is an important research direction.

AI Executive Summary

The rapid development of autonomous driving technology has brought many new challenges, especially in achieving efficient trajectory planning under limited data and computational resources. Existing world-model-based planners often suffer from inadequately compressed representations, limited spatial understanding, and underutilized temporal dynamics, leading to sub-optimal planning performance. Latent-WAM is introduced to address these issues.

The Latent-WAM framework consists of two core modules: the Spatial-Aware Compressive World Encoder (SCWE) and the Dynamic Latent World Model (DLWM). SCWE extracts geometric knowledge from a foundation model and compresses multi-view images into compact scene tokens, enhancing spatial understanding. DLWM employs a causal Transformer to autoregressively predict future world status conditioned on historical visual and motion representations.

This innovative framework has been extensively tested on the NAVSIM v2 and HUGSIM datasets, achieving new heights in trajectory planning performance. On NAVSIM v2, Latent-WAM achieved an EPDMS score of 89.3, surpassing the previous best perception-free method, while significantly reducing the amount of training data and using a compact 104M-parameter model.

The success of Latent-WAM has garnered widespread attention in academia and offers new insights for the industry. Its excellent performance under data scarcity and limited computational resources showcases its immense potential and application value in the field of autonomous driving.

However, Latent-WAM also has some limitations, such as potential performance degradation in extremely complex traffic scenarios and limited transferability due to reliance on the foundation model. Future research directions include exploring performance improvements in more complex scenarios and enhancing real-time performance under constrained computational resources.

Deep Analysis

Background

Autonomous driving technology has made significant strides in recent years, particularly in perception, decision-making, and control. However, achieving efficient end-to-end autonomous driving remains challenging, especially under limited data and computational resources. Traditional world-model-based planners often suffer from inadequately compressed representations, limited spatial understanding, and underutilized temporal dynamics, leading to sub-optimal planning performance. The development of deep learning technologies such as Transformers offers new possibilities for addressing these challenges.

Core Problem

Existing world-model-based planners suffer from inadequately compressed representations, limited spatial understanding, and underutilized temporal dynamics, leading to sub-optimal planning performance under constrained data and compute budgets. The core problem is how to effectively compress multi-view image information and utilize historical visual and motion representations to predict future world status. This requires strong spatial understanding and full utilization of temporal dynamics.

Innovation

The core innovations of Latent-WAM lie in its Spatial-Aware Compressive World Encoder (SCWE) and Dynamic Latent World Model (DLWM).

  • SCWE extracts geometric knowledge from a foundation model and compresses multi-view images into compact scene tokens, enhancing spatial understanding.
  • DLWM employs a causal Transformer to autoregressively predict future world status conditioned on historical visual and motion representations, improving temporal dynamics utilization.

These innovations improve representation compression efficiency and translate directly into state-of-the-art trajectory planning performance.

Methodology

The implementation of Latent-WAM includes the following steps:

  • Use SCWE to extract geometric knowledge from a foundation model and compress multi-view images into compact scene tokens.
  • Employ DLWM's causal Transformer to autoregressively predict future world status conditioned on historical visual and motion representations.
  • Generate future trajectory planning through autoregressive prediction.
  • Conduct experimental validation on NAVSIM v2 and HUGSIM datasets to evaluate model performance and robustness.
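The autoregressive step in the pipeline above can be sketched as a rollout loop: at each step, the world model predicts the next latent state from the full history, and the prediction is fed back as input. This is a minimal toy sketch under assumed interfaces; `rollout`, the toy stand-in model, and all shapes are hypothetical, not the paper's code.

```python
import numpy as np

def rollout(world_model, scene_tokens, motions, horizon):
    """Hypothetical autoregressive rollout: predict `horizon` future latent
    world states, appending each prediction back into the history so later
    steps are conditioned on earlier predictions."""
    history = list(scene_tokens)  # past latent states (each a 1-D feature vector)
    for _ in range(horizon):
        next_state = world_model(np.stack(history), motions)
        history.append(next_state)
    return np.stack(history[len(scene_tokens):])  # only the predicted future states

# Toy stand-in for the causal Transformer: mean over history plus a motion offset.
toy_model = lambda hist, motions: hist.mean(axis=0) + motions

past = [np.zeros(8), np.ones(8)]                       # two past latent states
future = rollout(toy_model, past, motions=np.full(8, 0.1), horizon=3)
print(future.shape)  # (3, 8)
```

In the actual framework, `world_model` would be the causal Transformer of DLWM conditioned on both visual and motion representations; only the feed-back-the-prediction loop structure carries over from this sketch.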

Experiments

The experimental design includes extensive testing on the NAVSIM v2 and HUGSIM datasets. Baseline methods include existing perception-free methods and other state-of-the-art world-model-based planners. Evaluation metrics include EPDMS and HD-Score, with key hyperparameters including model compression rate and Transformer layers. Ablation studies confirmed the contributions of the SCWE and DLWM modules to overall performance.

Results

Experimental results show that Latent-WAM achieved an EPDMS score of 89.3 on NAVSIM v2, surpassing the previous best perception-free method by 3.2 points. On HUGSIM, it achieved an HD-Score of 28.9, demonstrating strong adaptability and robustness across different environments. Ablation studies further confirmed the contributions of the SCWE and DLWM modules to overall performance.

Applications

Application scenarios for Latent-WAM include trajectory planning for autonomous vehicles, path planning for drones, and other automated systems requiring efficient trajectory planning. Its excellent performance under data scarcity and limited computational resources makes it highly applicable in the industry.

Limitations & Outlook

Despite significant progress in many areas, Latent-WAM still has some limitations. For example, it may experience performance degradation in extremely complex traffic scenarios. Additionally, its reliance on the foundation model may limit transferability across different environments. In scenarios with extremely limited computational resources, real-time performance may still be challenging despite the model's compact size.

Plain Language (Accessible to non-experts)

Imagine you're in a kitchen cooking a meal. Latent-WAM is like a smart kitchen assistant that helps you make a delicious dish with limited ingredients and time. First, it gathers all the ingredient information (geometric knowledge) from your pantry (foundation model) and compresses it into a simple shopping list (scene tokens). Next, it uses your past cooking experiences (historical visual and motion representations) to predict the next steps (future world status). This way, even with limited ingredients and time, you can make a tasty dinner (efficient trajectory planning).

ELI14 (Explained like you're 14)

Hey there! Imagine you're playing a racing game, but this game is super hard because you can only see part of the track. Latent-WAM is like a super-smart game assistant that helps you predict what the next part of the track looks like. First, it gathers all the track information from the game and compresses it into a simple map. Then, it uses your previous gaming experiences to predict the changes in the track. This way, even if you can't see the whole track, you can still finish the race smoothly! Isn't that cool?

Glossary

Latent-WAM

An efficient end-to-end autonomous driving framework that achieves robust trajectory planning through spatially-aware and dynamics-informed latent world representations.

Used in the paper to address the limitations of existing world-model-based planners.

SCWE

Spatial-Aware Compressive World Encoder that extracts geometric knowledge from a foundation model and compresses multi-view images into compact scene tokens.

Enhances spatial understanding.

DLWM

Dynamic Latent World Model that employs a causal Transformer to autoregressively predict future world status conditioned on historical visual and motion representations.

Enhances temporal dynamics utilization.

EPDMS

A metric used to evaluate trajectory planning performance, with higher values indicating better performance.

Used on the NAVSIM v2 dataset to evaluate Latent-WAM's performance.

HD-Score

A metric used to evaluate trajectory planning performance, with higher values indicating better performance.

Used on the HUGSIM dataset to evaluate Latent-WAM's performance.

Transformer

A deep learning model that excels at handling sequential data, particularly in natural language processing and time series prediction.

Used in DLWM to autoregressively predict future world status.
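What makes the Transformer in DLWM "causal" is its attention mask: position t may attend only to positions at or before t, so predictions never peek at the future. A minimal illustration of such a mask (assumed detail, standard practice rather than anything specific to this paper):

```python
import numpy as np

def causal_mask(T):
    """Lower-triangular boolean mask of shape (T, T): entry (t, s) is True
    iff position t is allowed to attend to position s, i.e. s <= t."""
    return np.tril(np.ones((T, T), dtype=bool))

m = causal_mask(4)
print(m.astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```

During training, this mask lets the model learn all next-step predictions in parallel while still matching the one-step-at-a-time conditioning used at inference time.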

Autoregressive

A prediction method that uses historical data to predict future states.

Used in DLWM to predict future world status.

Foundation Model

A pre-trained model used to extract geometric knowledge and other features.

Used in SCWE to extract geometric knowledge.

Geometric Knowledge

Information about spatial structures and shapes, used to enhance spatial understanding.

Used in SCWE to compress multi-view images.

Scene Tokens

A compact representation used to describe scene information in multi-view images.

Used in SCWE to compress multi-view images.

Open Questions (Unanswered questions from this research)

  1. How can Latent-WAM's performance be improved in extremely complex traffic scenarios? The existing datasets may not be sufficient to train a model that performs well in all scenarios, necessitating the development of more complex datasets and more powerful models.
  2. How can reliance on the foundation model be reduced to enhance environmental adaptability? Current methods may have limited transferability across different environments, so exploring more general model architectures is needed.
  3. How can real-time performance be improved under extremely limited computational resources? Despite the model's compact size, real-time performance remains challenging, necessitating optimization of the model's computational efficiency.
  4. How can more sensor data be integrated to enhance model robustness and accuracy? Current methods primarily rely on visual and motion representations, which may perform poorly in certain scenarios.
  5. How can model performance be further improved under data scarcity? While current methods perform well under data scarcity, there is still room for improvement.

Applications

Immediate Applications

Autonomous Vehicles

Latent-WAM can be used for trajectory planning in autonomous vehicles, providing a more efficient solution, especially under data scarcity and limited computational resources.

Drone Path Planning

The method is also applicable to drone path planning, helping drones navigate and avoid obstacles efficiently in complex environments.

Industrial Automation Systems

Latent-WAM can be applied to industrial automation systems requiring efficient trajectory planning, improving production efficiency and safety.

Long-term Vision

Smart City Traffic Management

In the future, Latent-WAM could be used in smart city traffic management systems to improve traffic flow efficiency and safety.

Fully Automated Logistics Systems

The method has the potential to be applied to fully automated logistics systems, achieving efficient goods transportation and delivery.

Abstract

We introduce Latent-WAM, an efficient end-to-end autonomous driving framework that achieves strong trajectory planning through spatially-aware and dynamics-informed latent world representations. Existing world-model-based planners suffer from inadequately compressed representations, limited spatial understanding, and underutilized temporal dynamics, resulting in sub-optimal planning under constrained data and compute budgets. Latent-WAM addresses these limitations with two core modules: a Spatial-Aware Compressive World Encoder (SCWE) that distills geometric knowledge from a foundation model and compresses multi-view images into compact scene tokens via learnable queries, and a Dynamic Latent World Model (DLWM) that employs a causal Transformer to autoregressively predict future world status conditioned on historical visual and motion representations. Extensive experiments on NAVSIM v2 and HUGSIM demonstrate new state-of-the-art results: 89.3 EPDMS on NAVSIM v2 and 28.9 HD-Score on HUGSIM, surpassing the best prior perception-free method by 3.2 EPDMS with significantly less training data and a compact 104M-parameter model.

cs.CV cs.RO
