OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
OneVL achieves one-step latent reasoning and planning with vision-language explanations, surpassing explicit CoT at answer-only latency.
Key Findings
Methodology
OneVL is a unified framework combining vision-language models and world models, routing reasoning through compact latent tokens supervised by dual auxiliary decoders. The language decoder reconstructs text chain-of-thought (CoT), while the visual world model decoder predicts future-frame tokens, forcing the latent space to internalize causal dynamics of road geometry, agent motion, and environmental change. A three-stage training pipeline progressively aligns these latents with trajectory, language, and visual objectives, ensuring stable joint optimization.
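The summary does not specify which objective each stage introduces; as a rough illustration, here is a minimal Python sketch of one plausible progressive schedule. The stage-to-objective assignment, the `STAGES` table, and `stage_loss` are assumptions, not the paper's specification.

```python
import torch

# Hypothetical three-stage schedule; which objective enters at which stage
# is an assumption inferred from the summary, not confirmed by the paper.
STAGES = [
    ("stage0", ("trajectory",)),                       # align latent tokens with planning
    ("stage1", ("trajectory", "language")),            # add text-CoT reconstruction
    ("stage2", ("trajectory", "language", "visual")),  # add future-frame prediction
]

def stage_loss(objectives, losses):
    """Sum only the objectives active in the current stage."""
    return sum(losses[name] for name in objectives)

# Toy usage with placeholder per-objective loss values.
losses = {"trajectory": torch.tensor(1.0),
          "language": torch.tensor(0.5),
          "visual": torch.tensor(0.8)}
for name, objectives in STAGES:
    print(name, float(stage_loss(objectives, losses)))
```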
Key Results
- OneVL is the first latent CoT method to surpass explicit CoT across four benchmarks, delivering state-of-the-art accuracy at answer-only latency. For instance, on the NAVSIM dataset, OneVL's latency matches answer-only prediction and is roughly half (0.5×) that of explicit autoregressive CoT.
- On the ROADWork dataset, prefill latency is identical to answer-only prediction and roughly 0.3× that of its explicit counterpart.
- Appending an MLP head for trajectory output further reduces latency to 0.24s, a 16.4% efficiency gain.
Significance
OneVL's significance lies in achieving more generalizable reasoning through compact latent representations, removing the latency barrier that keeps explicit chain-of-thought out of real-time deployment. By guiding compression with both language and world-model supervision, OneVL learns representations that generalize better than verbose token-by-token reasoning. Beyond its academic contribution, the approach also offers a practical path for industrial autonomous driving systems.
Technical Contribution
OneVL's technical contributions include pairing visual and language decoders to supervise the compressed latent-token representation. This design addresses the limitations of prior latent CoT methods in multimodal reasoning and, through its prefill inference mechanism, substantially improves inference speed. Its three-stage training pipeline aligns the latent bottleneck with trajectory prediction, capturing causal structure rather than memorized patterns.
Novelty
OneVL is the first to introduce a visual world model decoder in latent CoT methods to predict future-frame tokens, ensuring that the latent space internalizes causal dynamics. This approach not only surpasses language-only latent representations but also achieves more efficient reasoning through compact latent tokens.
Limitations
- OneVL may not fully capture all causal dynamics in complex scenarios, especially with significant environmental changes.
- The method relies on large amounts of training data and computational resources, which may not be feasible in resource-constrained environments.
- In extreme weather conditions, the accuracy of the visual decoder's predictions may be affected.
Future Work
Future research directions include further optimizing latent-token design to improve reasoning in complex scenarios, probing robustness under varied weather conditions, and reducing the training-data and compute requirements.
AI Executive Summary
Chain-of-Thought (CoT) reasoning has become a powerful driver of trajectory prediction in Vision-Language-Action (VLA) based autonomous driving, yet its autoregressive nature imposes a latency cost that is prohibitive for real-time deployment. Latent CoT methods attempt to close this gap by compressing reasoning into continuous hidden states, but consistently fall short of their explicit counterparts. We suggest that this is due to purely linguistic latent representations compressing a symbolic abstraction of the world, rather than the causal dynamics that actually govern driving. Thus, we present OneVL (One-step latent reasoning and planning with Vision-Language explanations), a unified VLA and World Model framework that routes reasoning through compact latent tokens supervised by dual auxiliary decoders. Alongside a language decoder that reconstructs text CoT, we introduce a visual world model decoder that predicts future-frame tokens, forcing the latent space to internalize the causal dynamics of road geometry, agent motion, and environmental change. A three-stage training pipeline progressively aligns these latents with trajectory, language, and visual objectives, ensuring stable joint optimization. At inference, the auxiliary decoders are discarded and all latent tokens are prefilled in a single parallel pass, matching the speed of answer-only prediction. Across four benchmarks, OneVL becomes the first latent CoT method to surpass explicit CoT, delivering state-of-the-art accuracy at answer-only latency, and providing direct evidence that tighter compression, when guided by both language and world-model supervision, produces more generalizable representations than verbose token-by-token reasoning.
The architecture of OneVL includes a pretrained Vision-Language Model (VLM), a compact latent token interface, and dual auxiliary decoders for multimodal explanation. Its backbone is Qwen3-VL-4B-Instruct, a VLM that processes interleaved image and text inputs. The model consists of three standard components: Vision Encoder (ViT), Visual Projector (MLP Aligner), and Large Language Model (LLM). All three components are initialized from the Qwen3-VL-4B-Instruct checkpoint and remain fully trainable in Stages 0 and 2. The backbone is primarily optimized via a standard next-token prediction objective, applying a cross-entropy loss to both the trajectory answers and the latent reasoning tokens introduced below.
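To make the three-part layout concrete, the following is a minimal PyTorch-style sketch of a ViT-encoder → MLP-projector → LLM stack. The `ToyVLABackbone` class, its dimensions, and the generic transformer layers are illustrative stand-ins, not the Qwen3-VL-4B-Instruct internals.

```python
import torch
import torch.nn as nn

class ToyVLABackbone(nn.Module):
    """Illustrative three-part VLM backbone: ViT encoder -> MLP projector -> LLM.
    All dimensions are toy values; the real model is initialized from the
    Qwen3-VL-4B-Instruct checkpoint."""
    def __init__(self, vit_dim=256, llm_dim=512, vocab=32000):
        super().__init__()
        self.vision_encoder = nn.TransformerEncoder(   # stand-in for the ViT
            nn.TransformerEncoderLayer(d_model=vit_dim, nhead=4, batch_first=True),
            num_layers=2)
        self.projector = nn.Sequential(                # MLP aligner
            nn.Linear(vit_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))
        self.llm = nn.TransformerEncoder(              # stand-in for the causal LM trunk
            nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True),
            num_layers=2)
        self.lm_head = nn.Linear(llm_dim, vocab)

    def forward(self, image_patches, text_embeds):
        vis = self.projector(self.vision_encoder(image_patches))  # align vision to LLM space
        seq = torch.cat([vis, text_embeds], dim=1)                 # interleaved multimodal sequence
        # Next-token logits; cross-entropy applies to trajectory answers and latent tokens.
        return self.lm_head(self.llm(seq))
```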
The key innovation of OneVL lies in the introduction of dual-modal auxiliary decoders: a language auxiliary decoder that reconstructs human-readable CoT reasoning from compact language latent tokens, and a visual auxiliary decoder that predicts anticipated future frames. The visual decoder plays the role of a world model auxiliary. By forcing the compressed latents to anticipate what the scene will look like at future time steps, it ensures that the bottleneck encodes genuinely causal scene dynamics, such as agent trajectories, road geometry evolution, and emerging hazards, rather than abstract symbolic summaries. This is precisely the missing ingredient in language-only latent CoT. Future-frame prediction is a concrete compression target that directly reflects the causal structure of the physical world, satisfying the compression view of intelligence in a way that text descriptions alone cannot. The resulting framework simultaneously handles planning, language reasoning, and visual interpretation within a single model.
Beyond interpretability, the dual reconstruction objectives serve a deeper role: they ensure that the compressed latents encode genuinely generalizable structure rather than superficial correlations. If compact latent tokens can be decoded into both coherent language reasoning and plausible future frames, the model has necessarily discovered transferable representations of scene dynamics rather than memorized input-output mappings. Critically, the world model supervision (visual decoder) and the language supervision act as complementary forms of validation. Language grounds the latents in semantic intent, while visual prediction grounds them in physical scene dynamics. Together, they guarantee that the compressed representation satisfies both the semantic and causal requirements of robust trajectory planning.
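As a rough sketch of how two heads can share this conditioning pattern, the toy `AuxiliaryDecoder` below attends from its target sequence to the concatenation of current-frame ViT embeddings and latent hidden states, as the methodology describes. The shared architecture, layer counts, and vocabulary sizes are assumptions, not the paper's decoder designs.

```python
import torch
import torch.nn as nn

class AuxiliaryDecoder(nn.Module):
    """Toy shared shape for both auxiliary heads: cross-attend from target
    embeddings to [current-frame ViT embeddings ; latent hidden states],
    then project to a vocabulary. A simplified assumption, not the paper's
    actual decoder architecture."""
    def __init__(self, dim=512, vocab=32000):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.head = nn.Linear(dim, vocab)

    def forward(self, target_embeds, vit_embeds, latent_states):
        memory = torch.cat([vit_embeds, latent_states], dim=1)  # conditioning context
        return self.head(self.decoder(target_embeds, memory))

# Language head reconstructs text CoT; visual head predicts future-frame tokens.
lang_dec = AuxiliaryDecoder(vocab=32000)  # targets: CoT text tokens
vis_dec = AuxiliaryDecoder(vocab=8192)    # targets: discrete future-frame visual tokens
```

Both heads are discarded at inference, so their cost is paid only during training.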
At inference time, the latent tokens (both visual and language) are prefilled into the model’s context as fixed prompt inputs, enabling single-pass generation of all latent tokens. This eliminates the iterative latent token generation overhead and achieves inference speed essentially identical to answer-only AR prediction. The resulting model performs one-step latent reasoning (fast inference), vision-language explanation (interpretable reasoning), and finally planning in a unified sequence. Empirically, OneVL not only matches but surpasses explicit AR CoT in trajectory quality, demonstrating that compression, far from being a necessary compromise, is itself a driver of more effective reasoning.
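The runnable toy below contrasts the two phases: all latent-token positions pass through a single parallel forward call, and only the short answer is decoded step by step. The module, shapes, and cache-free AR step are simplifications, not OneVL's actual inference stack.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dim, n_latent, n_answer = 64, 8, 4
trunk = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)

image_and_prompt = torch.randn(1, 20, dim)
latent_embeds = nn.Parameter(torch.randn(1, n_latent, dim))  # fixed learned latent-token inputs

# Prefill: image, prompt, and ALL latent tokens go through ONE parallel pass --
# there is no per-latent-token decoding loop.
context = torch.cat([image_and_prompt, latent_embeds], dim=1)
hidden = trunk(context)

# Only the short trajectory answer is decoded autoregressively.
seq = hidden
for _ in range(n_answer):
    nxt = trunk(seq)[:, -1:, :]   # toy cache-free AR step: take the last state
    seq = torch.cat([seq, nxt], dim=1)
print(seq.shape)  # (1, 20 + 8 + 4, 64)
```

The latency consequence is that the sequential decode length shrinks from (latent chain + answer) to (answer only), which is why prefill latency matches answer-only prediction.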
In conclusion, OneVL represents a significant advancement in the field of autonomous driving by addressing the latency issues associated with explicit CoT methods. Its innovative use of dual-modal auxiliary decoders and prefill inference mechanism not only improves inference speed but also enhances the generalizability of the reasoning process. Future research may focus on optimizing latent token design and exploring robustness under varying environmental conditions, further solidifying OneVL's potential impact on both academic research and industrial applications.
Deep Analysis
Background
In recent years, Vision-Language Models (VLMs) have rapidly become a foundational building block for autonomous driving, unifying holistic scene understanding, natural language reasoning, and end-to-end trajectory planning within a single model. When further extended to produce action outputs, such as trajectory waypoints or control signals, these models are known as Vision-Language-Action models (VLAs). A central driver of recent progress in VLA-based driving is Chain-of-Thought (CoT) reasoning, where the model articulates intermediate reasoning steps before committing to a final trajectory, yielding substantial gains in prediction quality. However, deploying CoT in real driving systems exposes a sharp tension between interpretability and efficiency. Standard autoregressive (AR) CoT generation must emit every reasoning token before the trajectory can be produced, leading to inference latency proportional to the chain length, which is far above that of answer-only prediction. In safety-critical real-time settings, this gap is prohibitive.
Core Problem
Despite the significant advances in reasoning quality achieved by explicit CoT, its autoregressive nature imposes a latency cost that makes real-time deployment infeasible. Additionally, explicit CoT chains are strikingly redundant; much of the sequence merely restates context or follows formulaic patterns. This redundancy suggests that the essential reasoning content can be compressed into a far more compact form without sacrificing generalization, and may even strengthen it, since tighter compression forces the model to retain only the causal structure that truly matters for prediction.
Innovation
OneVL overcomes the limitations of prior latent CoT methods through two key innovations. First, we introduce dual-modal auxiliary decoders: a language auxiliary decoder that reconstructs human-readable CoT reasoning from compact language latent tokens, and a visual auxiliary decoder that predicts anticipated future frames. Second, we design a prefill inference mechanism. At inference time, the latent tokens (both visual and language) are prefilled into the model’s context as fixed prompt inputs, enabling single-pass generation of all latent tokens. This eliminates the iterative latent token generation overhead and achieves inference speed essentially identical to answer-only AR prediction.
Methodology
- The backbone of OneVL is Qwen3-VL-4B-Instruct, a VLM that processes interleaved image and text inputs. The model consists of three standard components: Vision Encoder (ViT), Visual Projector (MLP Aligner), and Large Language Model (LLM).
- The language auxiliary decoder aims to recover human-readable CoT reasoning text from the compact language latent hidden states. Input construction includes current-frame ViT patch embeddings and language latent hidden states extracted from the backbone.
- The visual auxiliary decoder aims to predict anticipated future-frame visual tokens. Input construction includes current-frame ViT embeddings extracted from the main model’s visual encoder and visual latent token hidden states.
- The total training loss combines the main model’s cross-entropy loss, the language explanation loss, and the visual explanation loss (see the sketch after this list). The lower weight on the visual explanation loss reflects that visual token reconstruction is a harder task, and a smaller weight prevents it from dominating the training signal.
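A minimal sketch of the weighted sum referenced in the last bullet. The coefficients `w_lang` and `w_vis` are placeholders, since the summary states only that the visual term receives a lower weight.

```python
import torch

def total_loss(ce_main, ce_language, ce_visual, w_lang=1.0, w_vis=0.1):
    """Weighted sum described in the methodology. w_lang and w_vis are
    placeholder values -- the paper's actual coefficients are not given here.
    A smaller w_vis keeps the harder visual-reconstruction term from
    dominating the training signal."""
    return ce_main + w_lang * ce_language + w_vis * ce_visual

# Toy usage with scalar stand-ins for the three cross-entropy terms.
loss = total_loss(torch.tensor(2.0), torch.tensor(1.2), torch.tensor(3.5))
print(float(loss))  # 2.0 + 1.0*1.2 + 0.1*3.5 = 3.55
```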
Experiments
The experimental design involves evaluating OneVL's performance across four benchmarks, including the NAVSIM and ROADWork datasets. We use explicit autoregressive CoT as the baseline and compare OneVL's performance in terms of prediction accuracy and inference latency. Key hyperparameters include the number of latent tokens and the training weights of auxiliary decoders. Ablation studies are conducted to verify the contribution of each component, particularly the role of visual and language decoders in performance improvement.
Results
OneVL is the first latent CoT method to surpass explicit CoT across four benchmarks, delivering state-of-the-art accuracy at answer-only latency. For instance, on the NAVSIM dataset, OneVL's latency matches answer-only prediction and is roughly half (0.5×) that of explicit autoregressive CoT. On the ROADWork dataset, prefill latency is identical to answer-only prediction and roughly 0.3× that of its explicit counterpart. Appending an MLP head for trajectory output further reduces latency to 0.24s, a 16.4% efficiency gain.
Applications
The direct application scenarios of OneVL include real-time trajectory prediction in autonomous driving systems. Its compact latent representation and prefill inference mechanism make it suitable for driving environments requiring rapid response. Additionally, the method has potential applications in other multimodal reasoning tasks, such as robotic navigation and intelligent surveillance systems.
Limitations & Outlook
OneVL may not fully capture all causal dynamics in complex scenarios, especially with significant environmental changes. Additionally, the method relies on large amounts of training data and computational resources, which may not be feasible in resource-constrained environments. In extreme weather conditions, the accuracy of the visual decoder's predictions may be affected. Future research directions include further optimizing the design of latent tokens to improve reasoning capabilities in complex scenarios.
Plain Language (Accessible to non-experts)
Imagine you're cooking in a kitchen. Explicit chain-of-thought is like writing down every step in detail, such as chopping vegetables, adding salt, stirring, etc. This method is detailed, but if you need to cook quickly, such detailed notes can slow you down. Latent chain-of-thought is like having a rough idea of the steps in your mind without writing them down, just knowing that you need to make a delicious dish in the end. OneVL is like a smart chef who not only remembers the steps but can also predict what to do next, like when to stir-fry or when to add water. This way, it can cook delicious dishes faster without needing to write down every step.
ELI14 (Explained like you're 14)
Hey there! Have you ever wondered how self-driving cars know where to go? It's like playing a game where you have to plan your next move. Scientists have invented a method called OneVL, which is like giving the car a super-smart brain. This brain can not only see what's on the road but also tell itself what to do next using language. It's like when you're playing a game and thinking, 'First go left, then jump, then sprint!' Plus, this brain can predict the future, like whether there will be obstacles ahead. This way, the car can reach its destination faster and safer! Isn't that cool?
Glossary
Vision-Language Model
A model that combines visual and language information for reasoning and decision-making, commonly used in autonomous driving and robotics.
In OneVL, the vision-language model is used to process interleaved image and text inputs.
Chain-of-Thought
A reasoning method that improves prediction quality by articulating intermediate steps, commonly used in complex decision-making tasks.
OneVL achieves compressed representation of chain-of-thought through compact latent tokens.
Latent Token
Compact tokens used to carry implicit reasoning information, helping the model compress information during reasoning.
OneVL uses latent tokens to compress and convey reasoning information.
World Model
A model that simulates dynamic changes in the environment, commonly used to predict future scene states.
The visual decoder in OneVL acts as a world model auxiliary, predicting future frames.
Autoregressive
A model structure that generates output step-by-step, with each step depending on the previous output.
Explicit chain-of-thought methods typically use autoregressive generation.
Prefill Inference
A method of pre-filling latent tokens during inference to speed up reasoning.
OneVL achieves the same speed as answer-only prediction through prefill inference.
Visual Decoder
A decoder used to predict future frames from visual latent representations, helping the model internalize causal dynamics.
The visual decoder in OneVL is used to predict future-frame tokens.
Language Decoder
A decoder used to reconstruct human-readable reasoning text from language latent tokens.
The language decoder in OneVL is used to reconstruct text chain-of-thought.
Multimodal
Involving the processing and analysis of multiple information modalities, such as vision and language.
OneVL achieves more efficient reasoning through multimodal explanation.
Ablation Study
A research method that evaluates the contribution of model components by removing or modifying them.
Ablation studies in OneVL verify the contribution of visual and language decoders.
Open Questions (Unanswered questions from this research)
1. Despite OneVL's excellent performance across multiple benchmarks, its robustness under extreme weather conditions remains to be further validated. Current methods may not fully capture all environmental changes, especially in complex driving scenarios. Future research could explore how to enhance model robustness without increasing computational costs.
2. OneVL relies on large amounts of training data and computational resources, which may limit its application in resource-constrained environments. Future research could explore how to reduce the need for training data and computational resources while maintaining model performance.
3. The current design of latent tokens may not fully capture all causal dynamics in some complex scenarios. Future research could explore more optimized latent token designs to improve reasoning capabilities in complex scenarios.
4. Although OneVL achieves multimodal explanation through visual and language decoders, the accuracy of the visual decoder's predictions may be affected in some cases. Future research could explore how to improve the accuracy of visual decoders under different environmental conditions.
5. OneVL's prefill inference mechanism significantly improves inference speed, but in some cases, it may not fully capture all reasoning information. Future research could explore how to improve the completeness of reasoning information without increasing latency.
Applications
Immediate Applications
Autonomous Driving Systems
OneVL can be used for real-time trajectory prediction in autonomous driving systems. Its compact latent representation and prefill inference mechanism make it suitable for driving environments requiring rapid response.
Robotic Navigation
OneVL's multimodal reasoning capabilities make it suitable for robotic navigation tasks, enabling quick decision-making in complex environments.
Intelligent Surveillance Systems
OneVL can be used in intelligent surveillance systems, achieving multimodal explanation of complex scenes through visual and language decoders.
Long-term Vision
Intelligent Traffic Management
OneVL can be used in intelligent traffic management systems, improving traffic flow management efficiency through real-time trajectory prediction and multimodal explanation.
Human-Machine Interaction Systems
OneVL's multimodal reasoning capabilities can be used in human-machine interaction systems, improving the system's understanding and response to human instructions.
Abstract
Chain-of-Thought (CoT) reasoning has become a powerful driver of trajectory prediction in VLA-based autonomous driving, yet its autoregressive nature imposes a latency cost that is prohibitive for real-time deployment. Latent CoT methods attempt to close this gap by compressing reasoning into continuous hidden states, but consistently fall short of their explicit counterparts. We suggest that this is due to purely linguistic latent representations compressing a symbolic abstraction of the world, rather than the causal dynamics that actually govern driving. Thus, we present OneVL (One-step latent reasoning and planning with Vision-Language explanations), a unified VLA and World Model framework that routes reasoning through compact latent tokens supervised by dual auxiliary decoders. Alongside a language decoder that reconstructs text CoT, we introduce a visual world model decoder that predicts future-frame tokens, forcing the latent space to internalize the causal dynamics of road geometry, agent motion, and environmental change. A three-stage training pipeline progressively aligns these latents with trajectory, language, and visual objectives, ensuring stable joint optimization. At inference, the auxiliary decoders are discarded and all latent tokens are prefilled in a single parallel pass, matching the speed of answer-only prediction. Across four benchmarks, OneVL becomes the first latent CoT method to surpass explicit CoT, delivering state-of-the-art accuracy at answer-only latency, and providing direct evidence that tighter compression, when guided by both language and world-model supervision, produces more generalizable representations than verbose token-by-token reasoning. Project Page: https://xiaomi-embodied-intelligence.github.io/OneVL