DualCoT-VLA: Visual-Linguistic Chain of Thought via Parallel Reasoning for Vision-Language-Action Models
DualCoT-VLA enhances vision-language-action models with parallel reasoning for complex tasks, achieving state-of-the-art performance.
Key Findings
Methodology
DualCoT-VLA employs a parallel reasoning mechanism to integrate visual and linguistic chains of thought, addressing the limitations of existing models in capturing both low-level visual details and high-level logical planning. The method introduces two sets of learnable query tokens for visual and linguistic reasoning, eliminating the high latency and compounding errors associated with autoregressive reasoning.
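The latency argument can be illustrated with a toy sketch. This is not the paper's implementation (the real model uses transformer layers); it only counts model calls to show why fixed query tokens resolved in one pass avoid the per-token cost of autoregressive decoding. All names are hypothetical.

```python
# Toy contrast between autoregressive CoT decoding and the single-step
# parallel reasoning described for DualCoT-VLA. Illustrative only.

def autoregressive_cot(num_reasoning_tokens: int) -> int:
    """Each reasoning token requires its own forward pass, so latency
    grows linearly with chain length."""
    forward_passes = 0
    for _ in range(num_reasoning_tokens):
        forward_passes += 1  # one model call per generated token
    return forward_passes

def parallel_dual_cot(num_visual_queries: int, num_linguistic_queries: int) -> int:
    """Both sets of learnable query tokens are resolved jointly in a
    single forward pass, so latency is constant in chain length."""
    _ = num_visual_queries + num_linguistic_queries  # processed together
    return 1  # one forward pass total

print(autoregressive_cot(32))     # 32 model calls for a 32-token chain
print(parallel_dual_cot(16, 16))  # 1 model call regardless of query count
```

The point of the sketch: the parallel mechanism trades a variable-length generation loop for a fixed amount of computation per decision.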
Key Results
- On the LIBERO benchmark, DualCoT-VLA achieved an average success rate of 98.8%, significantly outperforming other models with single-modal chain of thought reasoning.
- On the RoboCasa GR1 benchmark, DualCoT-VLA achieved an average success rate of 55.1% across 24 tasks, with particularly strong performance in spatially constrained tasks, such as an 80.0% success rate on the CuttingboardToPan task.
- In real-world experiments, DualCoT-VLA demonstrated superior performance in long-horizon tabletop tasks, with success rates significantly higher than baseline models, showcasing its adaptability in complex environments.
Significance
This research significantly enhances the efficiency and accuracy of VLA models in complex tasks by introducing parallel visual-linguistic chain of thought reasoning. It addresses the longstanding challenges of logical planning and spatial perception in traditional models, offering new insights and methods for the field of robotic manipulation.
Technical Contribution
DualCoT-VLA eliminates the latency issues of autoregressive reasoning through parallelized chain of thought reasoning, achieving efficient integration of multimodal information. This method provides new theoretical insights and engineering possibilities for complex task execution.
Novelty
DualCoT-VLA is the first to implement parallel visual-linguistic chain of thought reasoning in VLA models, overcoming the limitations of previous single-modal reasoning approaches and offering a novel method for integrating multimodal information.
Limitations
- In extremely complex tasks, DualCoT-VLA may still face bottlenecks in reasoning capabilities, particularly in tasks requiring high precision spatial perception.
- The model relies on a large amount of annotated data during training, which may limit its performance in data-scarce scenarios.
- In certain hardware environments, the model may require optimization to accommodate computational resource constraints.
Future Work
Future research could explore training DualCoT-VLA on larger multimodal datasets to validate its applicability in broader scenarios. Additionally, optimizing the model's reasoning efficiency for real-time applications is a promising direction.
AI Executive Summary
Vision-Language-Action (VLA) models play a crucial role in robotic manipulation, mapping visual observations and language instructions directly to robotic actions. However, traditional VLA models often struggle with complex, multi-step tasks, particularly those requiring precise spatial perception and logical planning. Existing Chain-of-Thought (CoT) reasoning methods, while endowing VLA models with a 'thinking before acting' capability, still face limitations due to their reliance on single-modal reasoning and the high latency of autoregressive decoding.
DualCoT-VLA addresses these issues by introducing parallel visual-linguistic chain of thought reasoning. This method integrates a visual CoT for low-level spatial understanding and a linguistic CoT for high-level task planning, employing two sets of learnable query tokens to efficiently integrate multimodal information and eliminate the latency bottleneck of autoregressive reasoning.
In experiments, DualCoT-VLA achieved state-of-the-art performance on the LIBERO and RoboCasa GR1 benchmarks, demonstrating exceptional performance in complex tasks. In real-world robotic experiments, the model also showcased its robust task planning and 3D spatial perception capabilities, seamlessly transferring to complex environments.
The significance of this research lies in its ability to significantly enhance the efficiency and accuracy of VLA models in complex tasks through parallel chain of thought reasoning, providing new insights and methods for the field of robotic manipulation. It addresses the longstanding challenges of logical planning and spatial perception in traditional models, offering a new direction for future research and applications.
Despite its impressive performance across multiple benchmarks, DualCoT-VLA may still face reasoning bottlenecks in extremely complex tasks, and its reliance on large amounts of annotated training data may limit performance in data-scarce scenarios. Future research could train DualCoT-VLA on larger multimodal datasets to validate its applicability in broader scenarios, and could further optimize its reasoning efficiency for real-time applications.
Deep Analysis
Background
Vision-Language-Action (VLA) models have gained significant attention in the field of robotic manipulation. These models map visual observations and language instructions directly to robotic actions, greatly simplifying the interaction between robots and their environments. However, traditional VLA models often struggle with complex, multi-step tasks, particularly those requiring precise spatial perception and logical planning. To overcome these challenges, researchers have introduced Chain-of-Thought (CoT) reasoning methods, endowing VLA models with a 'thinking before acting' capability. However, existing CoT reasoning methods primarily rely on single-modal reasoning, failing to simultaneously capture low-level visual details and high-level logical planning. Additionally, the high latency and compounding errors associated with autoregressive decoding limit their performance in real-time applications.
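The compounding-error concern mentioned above has a simple arithmetic form. As a hedged sketch with illustrative numbers only (the paper does not report per-step error rates): if each autoregressive reasoning step independently succeeds with probability p, a k-step chain succeeds with probability p^k, while a single-step pass stays at p.

```python
# Why long autoregressive reasoning chains compound errors: success
# probability decays geometrically with chain length. Numbers are
# illustrative, not measurements from the paper.

def chain_success(p: float, steps: int) -> float:
    """Probability that all `steps` independent reasoning steps succeed."""
    return p ** steps

print(round(chain_success(0.98, 1), 4))   # 0.98 for a single forward pass
print(round(chain_success(0.98, 20), 4))  # 0.6676 for a 20-step chain
```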
Core Problem
Traditional VLA models struggle with complex, multi-step tasks, particularly those requiring precise spatial perception and logical planning. Existing Chain-of-Thought (CoT) reasoning methods, while endowing VLA models with a 'thinking before acting' capability, still face limitations due to their reliance on single-modal reasoning and the high latency of autoregressive decoding. These issues limit the applicability of VLA models in complex tasks, necessitating a method that can capture both low-level visual details and high-level logical planning in a multimodal reasoning framework.
Innovation
DualCoT-VLA addresses the limitations of traditional CoT reasoning methods by introducing parallel visual-linguistic chain of thought reasoning. Its core innovations include:
1. Parallel chain of thought reasoning: By employing two sets of learnable query tokens, DualCoT-VLA achieves parallel reasoning, eliminating the latency bottleneck of autoregressive reasoning.
2. Multimodal information integration: Combining visual CoT for low-level spatial understanding and linguistic CoT for high-level task planning, DualCoT-VLA efficiently integrates multimodal information.
3. Efficient reasoning mechanism: Through single-step forward reasoning, the model significantly enhances reasoning efficiency and accuracy.
Methodology
The methodology of DualCoT-VLA includes the following key steps:
- Visual and linguistic chain of thought reasoning: Two sets of learnable query tokens are employed for visual and linguistic reasoning, respectively.
- Parallel reasoning mechanism: By utilizing single-step forward reasoning, the model eliminates the latency bottleneck of autoregressive reasoning.
- Multimodal information integration: Combining visual CoT for low-level spatial understanding and linguistic CoT for high-level task planning, the model efficiently integrates multimodal information.
- Experimental design: Validation on the LIBERO and RoboCasa GR1 benchmarks demonstrates its exceptional performance in complex tasks.
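The steps above can be sketched in miniature. The following pure-Python toy (all shapes, names, and the attention form are assumptions, not the paper's architecture) shows the core idea: two independent sets of query vectors cross-attend over a shared visual-language context in one joint pass, with no token-by-token loop.

```python
import math
import random

random.seed(0)
DIM = 8  # toy embedding size

def rand_vec():
    return [random.gauss(0.0, 1.0) for _ in range(DIM)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attend(queries, context):
    """Each query token attends over the shared context; the loop here is
    over queries, which a real model would batch into one forward pass."""
    out = []
    for q in queries:
        weights = softmax([dot(q, c) / math.sqrt(DIM) for c in context])
        out.append([sum(w * c[i] for w, c in zip(weights, context))
                    for i in range(DIM)])
    return out

# Shared context (stand-in for fused image patches + instruction tokens).
context = [rand_vec() for _ in range(10)]

# Two independent sets of learnable query tokens, as in DualCoT-VLA:
visual_queries = [rand_vec() for _ in range(4)]      # low-level spatial CoT
linguistic_queries = [rand_vec() for _ in range(4)]  # high-level planning CoT

visual_cot = cross_attend(visual_queries, context)
linguistic_cot = cross_attend(linguistic_queries, context)
print(len(visual_cot), len(linguistic_cot))  # 4 4
```

Because neither query set depends on tokens generated earlier, both chains of thought can be computed simultaneously, which is what removes the autoregressive bottleneck.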
Experiments
The experimental design includes validating the performance of DualCoT-VLA on the LIBERO and RoboCasa GR1 benchmarks. On the LIBERO benchmark, the model is evaluated across four task suites, demonstrating its exceptional performance in complex tasks. On the RoboCasa GR1 benchmark, the model is evaluated across 24 tasks, with particularly strong performance in spatially constrained tasks. Additionally, in real-world robotic experiments, DualCoT-VLA showcases its robust task planning and 3D spatial perception capabilities, seamlessly transferring to complex environments.
Results
Experimental results show that DualCoT-VLA achieved an average success rate of 98.8% on the LIBERO benchmark, significantly outperforming other models with single-modal chain of thought reasoning. On the RoboCasa GR1 benchmark, DualCoT-VLA achieved an average success rate of 55.1% across 24 tasks, with particularly strong performance in spatially constrained tasks, such as an 80.0% success rate on the CuttingboardToPan task. In real-world experiments, DualCoT-VLA demonstrated superior performance in long-horizon tabletop tasks, with success rates significantly higher than baseline models, showcasing its adaptability in complex environments.
Applications
DualCoT-VLA has broad applications in the field of robotic manipulation. Its robust multimodal reasoning capabilities enable it to perform complex tasks in industrial automation, smart homes, and medical assistance. By efficiently integrating task planning and spatial perception, DualCoT-VLA enhances the intelligence of robotic systems, allowing them to operate effectively in dynamic and uncertain environments.
Limitations & Outlook
Despite its impressive performance across multiple benchmarks, DualCoT-VLA may still face reasoning bottlenecks in extremely complex tasks, particularly those demanding very high-precision spatial perception. Additionally, its reliance on large amounts of annotated training data may limit performance in data-scarce scenarios. Future research could train DualCoT-VLA on larger multimodal datasets to validate its applicability in broader scenarios, and could further optimize its reasoning efficiency for real-time applications.
Plain Language: Accessible to non-experts
Imagine you're in a kitchen cooking a meal. Traditional VLA models are like a chef who follows a recipe step by step, often getting stuck when the recipe gets complicated. DualCoT-VLA, on the other hand, is like an experienced chef who not only understands the recipe but can also adapt the cooking steps based on the state of the ingredients and the kitchen environment. By simultaneously observing the ingredients (vision) and understanding the recipe (language), it makes quick decisions, avoiding the need to rethink each step like traditional models. Just as a chef considers both the freshness of ingredients and the heat of the stove, DualCoT-VLA achieves efficient integration of visual and linguistic information through parallel reasoning, excelling in complex tasks.
ELI14: Explained like you're 14
Imagine you're playing a game like Minecraft, where you need to use both your eyes and your brain. Traditional robots are like players who follow the game's instructions step by step, getting stuck on complex tasks. DualCoT-VLA is like a super player who can see the game's details and plan the next move at the same time. It's like having both a god's eye view and a strategy planning mode in the game, making quick decisions without stopping to think about the next step. This makes it perform really well in complex tasks, just like you can build castles and fight monsters in Minecraft at the same time.
Glossary
Vision-Language-Action Model
A model that maps visual observations and language instructions directly to robotic actions, widely used in robotic manipulation.
Used in this paper to automate complex tasks.
Chain-of-Thought
A reasoning method that solves complex tasks through step-by-step reasoning, often used to enhance a model's logical planning capabilities.
Used in this paper to enhance the reasoning capabilities of VLA models.
Parallel Reasoning
A reasoning mechanism that processes multiple information sources simultaneously, improving reasoning efficiency and accuracy.
Used in this paper to achieve efficient integration of visual and linguistic information.
Autoregressive Decoding
A decoding method that generates output step by step, often resulting in high latency and compounding errors.
Replaced by parallel reasoning mechanisms in this paper.
Multimodal Information Integration
Integrating information from different modalities (e.g., vision and language) to achieve a more comprehensive understanding and decision-making.
Used in this paper to enhance the model's task execution capabilities.
LIBERO Benchmark
A standard test set for evaluating the performance of robotic manipulation models, containing various complex tasks.
Used in this paper to validate the performance of DualCoT-VLA.
RoboCasa GR1 Benchmark
A complex robotic manipulation test set requiring high-precision spatial perception and action coordination.
Used in this paper to evaluate the spatial perception capabilities of DualCoT-VLA.
Learnable Query Tokens
Trainable parameters used to guide the model in extracting and reasoning over specific information, playing a crucial role in multimodal reasoning.
Used in this paper to achieve parallel reasoning for visual and linguistic information.
Visual CoT
A chain of thought reasoning method that achieves low-level spatial understanding through visual information.
Used in this paper to enhance the model's spatial perception capabilities.
Linguistic CoT
A chain of thought reasoning method that achieves high-level task planning through linguistic information.
Used in this paper to enhance the model's logical planning capabilities.
Open Questions: Unanswered questions from this research
1. How can DualCoT-VLA be effectively trained in data-scarce scenarios? Existing methods rely on large amounts of annotated data during training, which may limit performance when data is limited. Future research needs to explore training this model under few-shot or unsupervised learning frameworks.
2. Will DualCoT-VLA's reasoning capabilities reach a bottleneck in extremely complex tasks? Despite its excellent performance in multiple benchmarks, challenges may still exist in tasks requiring extremely high-precision spatial perception.
3. How can DualCoT-VLA's reasoning efficiency be further optimized for real-time applications? Although the parallel reasoning mechanism significantly enhances reasoning efficiency, optimization may be needed in certain hardware environments to accommodate computational resource constraints.
4. Will training DualCoT-VLA on larger multimodal datasets lead to further performance improvements? Existing research primarily validates on specific benchmarks, and future studies could explore training this model on larger datasets.
5. How can DualCoT-VLA be applied to other fields such as autonomous driving or smart homes? While the model excels in robotic manipulation, further research is needed to determine whether its multimodal reasoning capabilities can be equally effective elsewhere.
Applications
Immediate Applications
Industrial Automation
DualCoT-VLA can be used for complex industrial automation tasks, such as multi-step operations on assembly lines, improving production efficiency and accuracy.
Smart Homes
In smart homes, DualCoT-VLA can be used for robotic assistants to perform complex household tasks, such as cleaning and organizing.
Medical Assistance
DualCoT-VLA can be applied in medical assistance robots to help perform complex surgeries or care tasks, improving the quality and efficiency of medical services.
Long-term Vision
Autonomous Driving
DualCoT-VLA's multimodal reasoning capabilities can be used in autonomous vehicles to achieve safer and more efficient driving decisions.
Smart Cities
In smart cities, DualCoT-VLA can be used for city management and service robots, enhancing the intelligence and service quality of cities.
Abstract
Vision-Language-Action (VLA) models map visual observations and language instructions directly to robotic actions. While effective for simple tasks, standard VLA models often struggle with complex, multi-step tasks requiring logical planning, as well as precise manipulations demanding fine-grained spatial perception. Recent efforts have incorporated Chain-of-Thought (CoT) reasoning to endow VLA models with a "thinking before acting" capability. However, current CoT-based VLA models face two critical limitations: 1) an inability to simultaneously capture low-level visual details and high-level logical planning due to their reliance on isolated, single-modal CoT; 2) high inference latency with compounding errors caused by step-by-step autoregressive decoding. To address these limitations, we propose DualCoT-VLA, a visual-linguistic CoT method for VLA models with a parallel reasoning mechanism. To achieve comprehensive multi-modal reasoning, our method integrates a visual CoT for low-level spatial understanding and a linguistic CoT for high-level task planning. Furthermore, to overcome the latency bottleneck, we introduce a parallel CoT mechanism that incorporates two sets of learnable query tokens, shifting autoregressive reasoning to single-step forward reasoning. Extensive experiments demonstrate that our DualCoT-VLA achieves state-of-the-art performance on the LIBERO and RoboCasa GR1 benchmarks, as well as in real-world platforms.