VLGA: Vision-Language-Geometry-Action Models for Autonomous Driving

TL;DR

VLGA introduces a dense 3D geometry expert supervised by LiDAR pointmap reconstruction, achieving state-of-the-art safety and driving scores in autonomous driving benchmarks.

cs.CV Advanced 2026-06-11

Jin Yao, Dhruva Dixith Kurra, Tom Lampo, Zezhou Cheng, Danhua Guo, Burhan Yaman

autonomous driving, multimodal learning, 3D geometry, vision-language models, deep learning

Key Findings

Methodology

VLGA employs a four-expert Mixture-of-Transformers (MoT) architecture with understanding, perception, geometry, and action experts. The geometry expert passes multi-view camera inputs through a pretrained LiDAR backbone to produce dense 3D features. During training, a lightweight pointmap decoder predicts a 3D point for every pixel, and a dense pointmap reconstruction loss against LiDAR pushes the geometry expert toward continuous 3D spatial representations at pixel-level resolution. Masked joint attention fuses features from all experts, with the geometry stream conditioned on the dense pointmap decoder outputs. Training runs in two stages: the geometry expert is first optimized on its own with the pointmap loss, then the entire model is jointly fine-tuned with the action loss. The model is evaluated open-loop on nuScenes and closed-loop on Bench2Drive.
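
The per-pixel pointmap loss described above can be illustrated with a minimal sketch. It assumes an L1 penalty and a validity mask for pixels that have no LiDAR return; the summary does not state the exact loss form or how unsupervised pixels are handled, so both choices, and the function name `pointmap_l1_loss`, are illustrative assumptions rather than the authors' implementation.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float, float]  # (x, y, z) in meters


def pointmap_l1_loss(pred: List[List[Point]],
                     target: List[List[Optional[Point]]]) -> float:
    """Mean absolute error over pixels that have a LiDAR-derived target.

    pred[v][u]   : predicted 3D point for pixel (u, v)
    target[v][u] : target 3D point, or None where no LiDAR return exists
                   (masking is an assumption, not taken from the paper)
    """
    total, count = 0.0, 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if t is None:  # pixel carries no supervision
                continue
            total += sum(abs(pc - tc) for pc, tc in zip(p, t))
            count += 1
    # Average per coordinate; return 0.0 when no pixel is supervised.
    return total / (3 * count) if count else 0.0


# Toy 2x2 image: two supervised pixels, one of them off by 0.5 m in depth.
pred = [[(1.0, 0.0, 5.0), (2.0, 0.0, 5.0)],
        [(1.0, 1.0, 6.0), (2.0, 1.0, 6.0)]]
target = [[(1.0, 0.0, 5.5), None],
          [None, (2.0, 1.0, 6.0)]]
loss = pointmap_l1_loss(pred, target)
```

In a real training loop this would be a tensor operation on the decoder output; the nested lists here only keep the example dependency-free.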

Key Results

On nuScenes open-loop evaluation, VLGA-Large achieves an average L2 displacement error of 0.50 m and a 3-second collision rate of 0.18%, the lowest among VLA methods evaluated without ego status.
On Bench2Drive closed-loop evaluation, VLGA attains a driving score of 79.08, +0.71 over the strongest prior VLA, at comparable efficiency and comfort.
Ablation studies show that the dense pointmap supervision alone reduces collision rate by 8.7%, confirming the critical role of dense geometric supervision in safety-critical scenarios.
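
For readers unfamiliar with the open-loop metric, the sketch below shows one common way an average L2 displacement error is computed from a planned trajectory. It assumes waypoints sampled at 2 Hz over a 3-second horizon and takes the error at the waypoint reached at 1 s, 2 s, and 3 s; evaluation protocols differ in how they average over time, and the summary does not say which one the paper uses, so treat this as an assumption. The trajectories are made-up toy values.

```python
import math
from typing import Dict, Sequence, Tuple

Waypoint = Tuple[float, float]  # (x, y) in meters, ego frame


def l2_at_horizons(pred: Sequence[Waypoint],
                   gt: Sequence[Waypoint],
                   hz: int = 2,
                   horizons_s: Sequence[int] = (1, 2, 3)) -> Dict[int, float]:
    """Euclidean distance between prediction and ground truth at each horizon."""
    errors = {}
    for t in horizons_s:
        i = t * hz - 1  # index of the waypoint reached after t seconds
        errors[t] = math.dist(pred[i], gt[i])
    return errors


# Toy example: six waypoints each (3 s at 2 Hz), values invented for illustration.
pred = [(0.0, 1.0), (0.0, 2.0), (0.0, 3.0), (0.0, 4.0), (0.0, 5.0), (0.0, 6.0)]
gt = [(0.0, 1.0), (0.3, 2.0), (0.3, 3.0), (0.4, 4.0), (0.5, 5.0), (0.6, 6.5)]

errors = l2_at_horizons(pred, gt)
average_l2 = sum(errors.values()) / len(errors)
```

The collision rate is reported separately: it is the share of evaluated samples whose planned trajectory intersects another agent within the horizon.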

Significance

This work addresses a fundamental challenge in autonomous driving: achieving dense, continuous 3D scene understanding within a multimodal reasoning framework. By explicitly supervising the geometric modality with dense pointmaps, VLGA bridges the gap between scene semantics and precise spatial perception. The approach enhances safety, especially in scenarios requiring tight spatial reasoning, such as narrow passages and dynamic obstacle avoidance. It pushes the boundary of multimodal perception models, making them more operationally grounded in the 3D world, which is crucial for deploying reliable autonomous vehicles in complex environments. The integration of dense geometric supervision into a language-conditioned policy marks a significant step forward in the quest for safe, explainable, and robust autonomous driving systems.

Technical Contribution

The paper introduces a dense geometric supervision mechanism within a multimodal transformer framework, in which LiDAR provides the per-pixel pointmap targets. The geometry expert is isolated as a dedicated module and trained with a pixel-wise pointmap regression loss, so that it learns continuous 3D spatial features. This dense supervision enters the MoT architecture through a lightweight pointmap decoder and a two-stage training schedule that warms up the geometry stream before joint fine-tuning. Unlike prior methods that either rely on sparse perception or inject features without explicit supervision, VLGA models dense 3D geometry as an independent modality, which the authors report improves spatial reasoning and safety. The architecture preserves language reasoning and perception capabilities.

Novelty

VLGA is the first to incorporate dense 3D geometry supervision directly into a multimodal vision-language-action framework for autonomous driving. Unlike previous approaches that either focus on sparse object detection or dense feature injection without explicit supervision, VLGA’s dense pointmap reconstruction provides a continuous, pixel-level geometric understanding. This explicit supervision ensures the model learns detailed spatial representations, which significantly enhances safety and precision. The core innovation lies in treating geometry as a dedicated modality, isolated from language and perception, and supervised via dense pointmap regression, enabling the model to reason about the scene’s 3D structure with high fidelity. This approach sets a new paradigm for integrating dense geometric understanding into multimodal autonomous driving models.

Limitations

The reliance on LiDAR data for dense supervision makes the model vulnerable in scenarios with sparse or occluded LiDAR signals, potentially degrading performance in adverse weather or sensor failure conditions.
The additional geometric module and dense supervision increase computational complexity and training time, posing challenges for real-time deployment on resource-constrained edge devices.
Current dense supervision is limited to single-frame data, lacking temporal consistency, which is essential for dynamic scene understanding. Extending to multi-frame temporal coherence remains an open challenge.

Future Work

Future research will focus on incorporating temporal consistency into dense geometric supervision to better handle dynamic scenes. Model compression and efficient inference techniques such as knowledge distillation and quantization will be explored to enable real-time deployment on autonomous vehicles. Additionally, integrating dense geometric supervision with other perception modules, such as semantic segmentation and instance-level detection, could further enhance scene understanding and safety. Extending the framework to multi-sensor fusion, including radar and high-resolution cameras, will also be a promising direction to improve robustness in diverse environmental conditions.

AI Executive Summary

Autonomous driving has long been a frontier of artificial intelligence, promising safer and more efficient transportation. Yet, despite rapid advancements, a persistent challenge remains: how to achieve a comprehensive understanding of complex, dynamic environments. Traditional perception models focus on sparse object detection or semantic segmentation, which, while useful, often fall short in providing the dense, continuous spatial understanding necessary for safe navigation.

Recent efforts have integrated multimodal reasoning—combining visual, linguistic, and perceptual cues—to improve scene comprehension. However, these models typically lack explicit dense 3D geometric understanding, which is crucial for precise maneuvering, especially in tight or unpredictable scenarios. Existing approaches either inject sparse 3D features into language models or rely on separate perception modules that do not fully integrate dense spatial information into decision-making.

Addressing this gap, the paper introduces VLGA, a novel multimodal framework that explicitly incorporates dense 3D geometry as a dedicated modality. The core idea is to supervise a geometric expert module with dense pointmap reconstruction targets derived from LiDAR data. This supervision ensures the model learns continuous, pixel-level 3D spatial features, enabling finer spatial reasoning.

The architecture employs a four-expert Mixture-of-Transformers (MoT) design, integrating understanding, perception, geometry, and action experts. The geometric expert processes multi-view camera inputs through a pretrained LiDAR backbone, producing dense features. During training, a lightweight pointmap decoder predicts per-pixel 3D points, supervised by a pixel-wise loss comparing against LiDAR ground truth. This dense supervision guides the geometric expert to learn detailed spatial representations.
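
One plausible way to obtain per-pixel 3D targets from LiDAR is to project the points into each camera and keep the nearest point per pixel. The sketch below assumes points already transformed into the camera frame and an undistorted pinhole model; the function name and the nearest-point rule are illustrative assumptions, and the summary does not describe how the paper builds or densifies its targets.

```python
import math
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]  # (x, y, z) in the camera frame, z forward


def lidar_to_pointmap(points_cam: Sequence[Point],
                      fx: float, fy: float, cx: float, cy: float,
                      width: int, height: int) -> List[List[Optional[Point]]]:
    """Rasterize 3D points into an H x W pointmap; the nearest point wins per pixel."""
    pointmap: List[List[Optional[Point]]] = [[None] * width for _ in range(height)]
    for x, y, z in points_cam:
        if z <= 0.0:  # behind the camera
            continue
        u = math.floor(fx * x / z + cx)  # pinhole projection to pixel column
        v = math.floor(fy * y / z + cy)  # pinhole projection to pixel row
        if not (0 <= u < width and 0 <= v < height):
            continue
        current = pointmap[v][u]
        if current is None or z < current[2]:  # simple z-buffer
            pointmap[v][u] = (x, y, z)
    return pointmap


# Toy 4x4 camera with hypothetical intrinsics.
points = [(0.0, 0.0, 4.0), (0.0, 0.0, 2.0), (1.0, -1.0, 2.0),
          (0.0, 0.0, -1.0), (10.0, 0.0, 1.0)]
pointmap = lidar_to_pointmap(points, fx=2.0, fy=2.0, cx=2.0, cy=2.0,
                             width=4, height=4)
num_valid = sum(p is not None for row in pointmap for p in row)
```

Pixels left as `None` have no LiDAR return, which is why a pixel-wise loss against such a target would typically be evaluated only where a target exists.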

The training adopts a two-stage schedule: first, the geometric expert is optimized independently with the dense pointmap loss; then, the entire model is jointly fine-tuned with the action loss, balancing multimodal learning. This approach ensures the geometric modality effectively influences planning without interference from other modules.
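
The two-stage schedule can be written down as a small configuration sketch. The expert names follow the summary; whether the pointmap loss stays active during joint fine-tuning, and with what weight, is not stated, so `pointmap_weight_stage2` is a hypothetical parameter, as are `StageConfig`, `stage_config`, and `total_loss`.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet

EXPERTS = ("understanding", "perception", "geometry", "action")


@dataclass(frozen=True)
class StageConfig:
    trainable: FrozenSet[str]       # experts whose parameters receive gradients
    loss_weights: Dict[str, float]  # weight applied to each loss term


def stage_config(stage: int, pointmap_weight_stage2: float = 1.0) -> StageConfig:
    """Return which experts train and how losses are weighted in each stage."""
    if stage == 1:
        # Warm-up: only the geometry expert, only the pointmap loss.
        return StageConfig(frozenset({"geometry"}),
                           {"pointmap": 1.0, "action": 0.0})
    if stage == 2:
        # Joint fine-tuning of the whole model with the action loss.
        # Keeping the pointmap term here is an assumption.
        return StageConfig(frozenset(EXPERTS),
                           {"pointmap": pointmap_weight_stage2, "action": 1.0})
    raise ValueError(f"unknown stage: {stage}")


def total_loss(losses: Dict[str, float], cfg: StageConfig) -> float:
    """Weighted sum of the individual loss values for the current stage."""
    return sum(cfg.loss_weights[name] * value for name, value in losses.items())


batch_losses = {"pointmap": 0.2, "action": 0.5}  # made-up values
stage1_total = total_loss(batch_losses, stage_config(1))
stage2_total = total_loss(batch_losses, stage_config(2))
```

Separating the stages this way makes the intent explicit: the geometry stream gets a usable representation before its features are allowed to influence the planner.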

Extensive experiments on nuScenes and Bench2Drive demonstrate VLGA’s superior performance. In open-loop nuScenes evaluation, VLGA achieves an average L2 displacement of 0.50 meters and a collision rate of only 0.18% at 3 seconds, the lowest among VLA methods evaluated without ego status. In closed-loop driving on Bench2Drive, VLGA attains a state-of-the-art driving score of 79.08, 0.71 points above the strongest prior VLA, at comparable efficiency and comfort.

These results highlight the importance of explicit dense geometric supervision in autonomous driving. By enabling models to reason about the scene’s continuous 3D structure, VLGA enhances safety, precision, and robustness. Its design paves the way for future systems that integrate dense spatial understanding seamlessly with multimodal reasoning, bringing us closer to truly reliable autonomous vehicles.

Despite these advances, challenges remain. The reliance on LiDAR data limits performance in adverse conditions, and the increased computational complexity poses deployment hurdles. Future work will focus on incorporating temporal consistency, model compression, and multi-sensor fusion to address these issues, aiming for real-time, robust autonomous driving solutions.

Deep Analysis

Background

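In recent years, autonomous driving has moved quickly from rule-based systems to end-to-end deep learning models. Early models relied mostly on sparse object detection (such as 3D bounding boxes and lane lines) for scene understanding, and their performance in complex environments was limited. With the growth of multimodal learning, vision-language models (VLMs) were brought into autonomous driving to improve scene reasoning and the handling of long-tail scenarios (for example LLaVA and GPT-Driver). These models, however, rely mostly on static semantic understanding and lack fine-grained perception of continuous space, so they fall short on safety-critical tasks. More recent work combines sparse perception with dense spatial understanding (for example VGA and UniDriveVLA), but the effective use and fusion of dense geometric information remains a bottleneck. The spread of LiDAR and multi-view cameras supplies rich spatial information for dense point clouds, yet integrating it into multimodal models while preserving end-to-end learning has become the core difficulty.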

Core Problem

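The core problem is how to bring continuous, dense spatial geometry into the model while preserving its language reasoning and sparse perception abilities, and thereby improve its spatial understanding of complex scenes. Existing methods mostly use sparse object detection or feature injection and do not achieve high-precision perception of continuous space, so they underperform in safety-critical situations such as tight following distances and dynamic obstacle avoidance. In complex traffic in particular, a vehicle needs fine-grained, continuous perception of its surroundings to make safe and reasonable decisions. Designing an architecture that keeps multimodal reasoning and also delivers dense spatial understanding is the current technical difficulty.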

Innovation

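The paper's main innovations are: 1) using LiDAR-based dense pointmap reconstruction as the supervision target for the geometry expert, so that the model learns rich geometric information about continuous space; 2) embedding geometry as an independent modality in the multimodal transformer architecture while keeping language reasoning and sparse perception intact; 3) adopting a two-stage training strategy that first optimizes the geometry expert alone and then jointly optimizes the action expert, which reduces interference between modalities; 4) validating on the two challenging datasets nuScenes and Bench2Drive, with significant gains in long-horizon safety and spatial accuracy. Compared with traditional sparse perception or feature injection methods, VLGA achieves fine-grained understanding of continuous space and offers a new solution for spatial perception in autonomous driving.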

Methodology

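Take multi-view camera images and navigation instructions as input to set up the multimodal perception setting.
Use a pretrained vision-language model (VLM) as the understanding expert to process scene semantics.
Introduce a perception expert that outputs sparse object detections and scene structure information.
Use point cloud features extracted by a pretrained LiDAR model to build multi-view dense pointmaps as input to the geometry expert.
Design a dense pointmap reconstruction objective: a per-pixel pointmap regression loss supervises the geometry expert to learn continuous spatial geometric features.
Build a four-expert Mixture-of-Transformers (MoT) that fuses the four modalities with masked joint attention, so that information flows effectively between experts.
Train in two stages: stage one optimizes only the geometry expert, warming it up with the pointmap reconstruction loss; stage two jointly optimizes the action expert while preserving what the geometry expert has learned.
During training, the perception expert provides scene and object information, and the action expert predicts the vehicle trajectory from the fused features.
At test time, the model uses the multimodal information for end-to-end path planning and outputs safe, precise driving trajectories.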
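
Masked joint attention over several experts is often implemented as a block-structured boolean mask: each token is tagged with its expert, and a table says which experts a query token may attend to. The sketch below builds such a mask. The visibility table `VISIBLE` is purely hypothetical, since the summary does not specify which experts can see which; only the four expert names come from the text.

```python
from typing import Dict, List, Sequence, Set


def build_block_mask(token_experts: Sequence[str],
                     visible: Dict[str, Set[str]]) -> List[List[bool]]:
    """mask[i][j] is True when query token i may attend to key token j."""
    return [[key in visible[query] for key in token_experts]
            for query in token_experts]


# Hypothetical visibility pattern (an assumption, not taken from the paper):
# the action expert reads every stream, the others read only part of it.
VISIBLE: Dict[str, Set[str]] = {
    "understanding": {"understanding"},
    "perception": {"understanding", "perception"},
    "geometry": {"understanding", "geometry"},
    "action": {"understanding", "perception", "geometry", "action"},
}

# One tag per token in the joint sequence.
tokens = ["understanding", "understanding", "perception", "geometry", "action"]
mask = build_block_mask(tokens, VISIBLE)
```

In an attention layer, positions where the mask is `False` would have their scores set to negative infinity before the softmax, so they receive zero weight.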

Experiments

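nuScenes serves as the open-loop evaluation dataset, assessing safety and accuracy in long-horizon trajectory planning with L2 error and collision rate metrics.
Bench2Drive is used for closed-loop scenario testing, evaluating driving score, success rate, efficiency, and ride comfort.
Multiple baselines are compared, including traditional sparse-perception models, feature-injection models, and pure-geometry models, to make the evaluation comprehensive.
Different training stages and hyperparameters (such as learning rate, batch size, and number of epochs) are configured so that the model fully converges.
Ablation experiments verify the contributions of the geometry expert, pointmap supervision, and two-stage training.
Performance is tallied across scenarios and metrics to check that the results are robust and statistically significant.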

Results

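In the nuScenes evaluation without ego status, VLGA-Large achieves an average L2 error of 0.50 m and a 3-second collision rate of 0.18%, better than all compared models, showing a large advantage in long-horizon safety.
In Bench2Drive closed-loop testing, VLGA achieves the highest driving score of 79.08, 0.71 points above the leading prior model, showing superior control in complex traffic environments.
Ablation experiments show that the dense pointmap reconstruction objective plays a key role in improving safety and spatial accuracy; introducing the geometry expert alone lowers the collision rate by 8.7%.
The results show that dense geometric supervision significantly improves performance in spatially sensitive scenarios, with clear advantages in narrow lanes and dynamic obstacle avoidance.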

Applications

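The model can be applied directly to end-to-end path planning systems in autonomous vehicles, improving their safety and robustness in complex traffic environments.
In autonomous driving research and development, it can serve as a perception enhancement module that combines LiDAR and multi-view cameras for high-precision scene understanding.
In the future it could be extended to drones, robots, and other autonomous systems, strengthening their spatial perception and path planning in changing environments.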

Limitations & Outlook

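The model currently depends heavily on LiDAR data; performance may drop when LiDAR is sparse or occluded, which limits its use in extreme scenarios.
The added geometry module and dense supervision raise the computational cost, limiting real-time operation on edge devices.
Pointmap reconstruction is currently done on single frames only and lacks temporal continuity; temporal information will need to be introduced to improve understanding of dynamic scenes.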

Plain Language Accessible to non-experts

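Imagine you work in a busy factory. There are many different machines and workers, each doing something different. Some are moving goods, some are assembling parts, and others are inspecting products. You need to know where every machine and worker is, what they are doing, and how they relate to one another to keep the factory running smoothly.

The traditional approach is like watching only one corner of the factory: you see some machines working but do not know the exact distances and positions between them. It is like looking at something far away through a telescope, where you can make out the rough outline but not where each part actually is.

The new approach is like using a high-precision 3D scanner to scan every corner of the factory and produce a detailed spatial map. You know not only where each machine is, but also the distances between them, their motion paths, and even collisions that might happen in the future.

It is like giving the factory "eyes" and a "brain" so that it can understand every detail of the space the way a person does. Factory management becomes smarter and safer, and problems are found faster. The same holds for autonomous vehicles: dense 3D geometric information acts like the factory's "spatial map", letting the vehicle perceive its surroundings more accurately and make safer decisions.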

Abstract

Vision-language-action (VLA) models can describe scenes and reason about them in language, yet still struggle to ground their actions in the dense 3D world around them. Existing approaches either inject features from a frozen 3D foundation model without an objective that ensures the policy uses them, or constrain geometry with sparse box and map losses that provide no dense spatial signal. We introduce VLGA, the first vision-language-action model supervised to reconstruct the dense 3D world it drives through. VLGA introduces geometry as a fourth modality alongside vision, language, and action through a dedicated expert supervised by a per-pixel pointmap regression loss against LiDAR. Extensive experiments conducted on challenging nuScenes and Bench2Drive datasets for open-loop and closed-loop evaluations, respectively, show the superiority of VLGA over counterpart VLA methods. In particular, on open-loop nuScenes, VLGA sets a new state of the art among VLA methods without ego status, with the lowest L2 (0.50 m average) and 3-second collision rate (0.18%). On closed-loop Bench2Drive, VLGA attains the state-of-the-art driving score of 79.08, +0.71 over the strongest prior VLA, at comparable efficiency and comfort.

cs.CV cs.RO
