Not All Features Are Created Equal: A Mechanistic Study of Vision-Language-Action Models
Visual pathways dominate action generation in VLA models; language sensitivity is task-dependent.
Key Findings
Methodology
This study examines six Vision-Language-Action (VLA) models, ranging from 80M to 7B parameters, using activation injection, sparse autoencoders (SAEs), and linear probes. Across more than 394,000 rollout episodes on four benchmarks, it finds that visual pathways dominate action generation in every architecture and that language sensitivity depends on task structure rather than model design.
Key Results
- Visual Pathway Dominance: Across all architectures, injecting baseline activations into null-prompt episodes recovers nearly identical behavior, and cross-task injection steers robots toward source-task positions, with 99.8% of X-VLA episodes aligning with the source trajectory.
- Language Sensitivity: When visual context uniquely specifies the task, language is ignored; when multiple goals share a scene, language becomes essential (X-VLA libero_goal: success rate drops from 94% to 10% under wrong prompts).
- Multi-Pathway Architectures: In π0.5, SmolVLA, and GR00T, expert pathways encode motor programs while VLM pathways encode goal semantics, with expert injection causing roughly twice the behavioral displacement of VLM-pathway injection.
Significance
This research highlights the dominance of visual pathways in VLA models and the task-dependent nature of language sensitivity, providing significant insights into how multimodal models translate inputs into actions. By revealing the critical role of visual pathways in action generation, the study offers new perspectives for future robot control and multimodal model design. Additionally, it underscores the importance of task structure in language processing, potentially influencing future multimodal task design.
Technical Contribution
This paper provides the first systematic study across six different VLA model architectures, revealing the dominance of visual pathways in action generation and the task-dependent nature of language sensitivity. By employing techniques such as activation injection, sparse autoencoders, and linear probes, the study demonstrates functional dissociation and specialization in multi-pathway architectures. These findings offer new technical means for designing and debugging multimodal models.
Novelty
This is the first large-scale, systematic mechanistic study of VLA models, covering architectures from 80M to 7B parameters. The research not only reveals the dominance of visual pathways in action generation but also demonstrates for the first time that language sensitivity is determined by task structure rather than model design, providing new perspectives for understanding and applying multimodal models.
Limitations
- The dominance of visual pathways may lead to insufficient flexibility in processing language instructions, especially when visual information is inadequate.
- The study focuses primarily on specific tasks and environments, which may not directly generalize to all types of multimodal tasks.
- While the study reveals the dominance of visual pathways, in-depth analysis of language pathways remains limited.
Future Work
Future research could further explore how to balance the roles of visual and language pathways in VLA models, especially in complex and dynamic environments. Additionally, it could investigate how to enhance the flexibility and adaptability of language pathways without compromising the dominance of visual pathways.
AI Executive Summary
Vision-Language-Action (VLA) models integrate perception, language, and motor control to generate actions from multimodal inputs. However, the mechanisms by which these models translate inputs into actions remain opaque, and models may rely on visuomotor priors rather than genuinely understanding language instructions.
This study examines six VLA models, ranging from 80M to 7B parameters, using activation injection, sparse autoencoders (SAEs), and linear probes. Through over 394,000 rollout episodes, the study reveals the dominance of visual pathways in action generation. By injecting baseline activations into null-prompt episodes, models can recover nearly identical behavior, while cross-task injection steers robots toward source-task positions, exposing spatially bound motor programs tied to scene coordinates.
The study shows that language sensitivity depends on task structure rather than model design. When visual context uniquely specifies the task, language is ignored; when multiple goals share a scene, language becomes essential. In multi-pathway architectures, expert pathways encode motor programs while VLM pathways encode goal semantics, with expert injection causing roughly twice the behavioral displacement of VLM-pathway injection.
These findings provide significant insights into how multimodal models translate inputs into actions. By revealing the critical role of visual pathways in action generation, the study offers new perspectives for future robot control and multimodal model design. Additionally, it underscores the importance of task structure in language processing, potentially influencing future multimodal task design.
However, the study also has limitations. The dominance of visual pathways may lead to insufficient flexibility in processing language instructions, especially when visual information is inadequate. Future research could further explore how to balance the roles of visual and language pathways in VLA models, especially in complex and dynamic environments.
Deep Analysis
Background
Vision-Language-Action (VLA) models represent a significant advancement in the field of multimodal learning. These models integrate visual encoders, language backbones, and action decoders to generate actions from multimodal inputs. Traditionally, robot control has relied on explicit kinematic and control models, whereas VLA models achieve generalization across objects and instructions through end-to-end policies. Despite their rapid adoption in practical applications, the question remains whether these models truly understand and execute language instructions. Existing debugging methods are primarily based on behavioral observation, lacking a deep understanding of the internal mechanisms of the models. Techniques like sparse autoencoders (SAEs) have been used to extract interpretable features from large language models, but their applicability to VLA models remains to be tested.
Core Problem
The mechanisms by which VLA models translate multimodal inputs into actions remain unclear. This opacity presents practical challenges: when a VLA-controlled robot exhibits unexpected behavior, operators have no principled way to diagnose the failure. Existing debugging methods are limited to behavioral observation, lacking a deep understanding of the internal mechanisms of the models. Particularly, the roles of visual and language pathways and how they interact remain largely unexplored.
Innovation
The core innovations of this paper include:
- Systematic Study: The first large-scale, systematic study of six different VLA model architectures, ranging from 80M to 7B parameters.
- Visual Pathway Dominance: Revealing the dominance of visual pathways in action generation, with cross-task injection steering robots toward source-task positions.
- Language Sensitivity: Demonstrating for the first time that language sensitivity is driven by task structure rather than model design.
- Multi-Pathway Architectures: Showing that in multi-pathway architectures, expert pathways encode motor programs while VLM pathways encode goal semantics.
Methodology
The methodology of this study includes:
- Activation Injection: Injecting baseline activations into null-prompt episodes to observe the dominance of visual pathways (see the sketch after this list).
- Sparse Autoencoders (SAEs): Used to extract interpretable features and analyze functional dissociation and specialization in multi-pathway architectures.
- Linear Probes: Used to test whether action information can be linearly decoded from intermediate representations.
- Experimental Design: Conducting over 394,000 rollout episodes across four benchmarks, covering six models ranging from 80M to 7B parameters.
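To make the intervention concrete, below is a minimal sketch of activation injection using PyTorch forward hooks. It assumes a policy whose transformer blocks are exposed as `policy.blocks`, each returning a plain tensor, and a `policy.predict_action(obs, prompt)` interface; these names, the layer choice, and the single-layer intervention are illustrative assumptions rather than the paper's released code.

```python
import torch


def capture_activations(policy, obs, prompt, layer_idx):
    """Run one forward pass and cache the output of one transformer block."""
    cache = {}

    def hook(module, inputs, output):
        cache["acts"] = output.detach().clone()  # (batch, tokens, dim), assumed tensor output

    handle = policy.blocks[layer_idx].register_forward_hook(hook)
    with torch.no_grad():
        policy.predict_action(obs, prompt)
    handle.remove()
    return cache["acts"]


def predict_with_injection(policy, obs, prompt, layer_idx, source_acts):
    """Predict an action while overwriting one block's output with cached activations."""

    def hook(module, inputs, output):
        # Returning a tensor from a forward hook replaces the block's output.
        return source_acts.to(dtype=output.dtype, device=output.device)

    handle = policy.blocks[layer_idx].register_forward_hook(hook)
    with torch.no_grad():
        action = policy.predict_action(obs, prompt)
    handle.remove()
    return action


# Hypothetical usage: capture activations from a prompted step, then inject them
# into a null-prompt step at the same scene state and compare the two actions.
# acts = capture_activations(policy, obs, "put the bowl on the plate", layer_idx=12)
# a_null = policy.predict_action(obs, "")
# a_injected = predict_with_injection(policy, obs, "", layer_idx=12, source_acts=acts)
```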
Experiments
The experimental design includes:
- Datasets: Benchmarks including LIBERO, MetaWorld, SimplerEnv, and ALOHA.
- Baselines: Comparing the performance of different models on the same tasks.
- Metrics: Task success rate, behavioral displacement, and trajectory alignment with the injection source.
- Model Scale: Six models ranging from 80M to 7B parameters.
- Ablation Studies: Analyzing the relative importance of visual and language pathways.
Results
Results analysis shows:
- Visual Pathway Dominance: Across all architectures, injecting baseline activations into null-prompt episodes recovers nearly identical behavior, and cross-task injection steers robots toward source-task positions, with 99.8% of X-VLA episodes aligning with the source trajectory.
- Language Sensitivity: When visual context uniquely specifies the task, language is ignored; when multiple goals share a scene, language becomes essential (X-VLA libero_goal: success rate drops from 94% to 10% under wrong prompts); a prompt-swap evaluation sketch follows this list.
- Multi-Pathway Architectures: In π0.5, SmolVLA, and GR00T, expert pathways encode motor programs while VLM pathways encode goal semantics, with expert injection causing roughly twice the behavioral displacement of VLM-pathway injection.
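As a concrete illustration of how such a language-sensitivity comparison can be run, here is a minimal prompt-swap evaluation sketch: roll out the same policy under the correct instruction and under an instruction borrowed from another task in the suite, then compare success rates. The environment loop, `policy.predict_action`, and the `info["success"]` flag are assumed interfaces, not the benchmarks' actual APIs.

```python
import random


def run_episode(env, policy, instruction, max_steps=300):
    """Roll out one episode with a fixed instruction and report task success."""
    obs = env.reset()
    success = False
    for _ in range(max_steps):
        action = policy.predict_action(obs, instruction)
        obs, reward, done, info = env.step(action)  # assumed gym-style step
        success = bool(info.get("success", False))
        if done:
            break
    return success


def prompt_swap_eval(env, policy, correct_instruction, other_instructions, n_episodes=50):
    """Success rate under the correct prompt vs. a randomly chosen wrong prompt."""
    correct = sum(run_episode(env, policy, correct_instruction) for _ in range(n_episodes))
    wrong = sum(
        run_episode(env, policy, random.choice(other_instructions))
        for _ in range(n_episodes)
    )
    return correct / n_episodes, wrong / n_episodes
```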
Applications
Application scenarios include:
- Robot Control: Enhancing adaptability in complex environments through visual pathway-dominated action generation, particularly in industrial and service robotics.
- Multimodal Task Design: Adjusting the role of language pathways based on task structure to improve model flexibility and adaptability, applicable to intelligent assistants and autonomous driving.
- Debugging Vision-Language Models: Providing new debugging methods by analyzing the relative importance of visual and language pathways, helping developers better understand and optimize models.
Limitations & Outlook
Limitations and outlook include:
- The dominance of visual pathways may lead to insufficient flexibility in processing language instructions, especially when visual information is inadequate.
- The study focuses primarily on specific tasks and environments, which may not directly generalize to all types of multimodal tasks.
- While the study reveals the dominance of visual pathways, in-depth analysis of language pathways remains limited.
Plain Language (accessible to non-experts)
Imagine you're in a kitchen cooking. A Vision-Language-Action model is like a robot assistant that can see, hear, and move. Its visual pathway is like your eyes, helping it see every detail in the kitchen, like the location of pots, spatulas, and ingredients. The language pathway is like your ears, helping it understand every instruction you give, like 'stir-fry' or 'add salt.'
In this model, the visual pathway is dominant, just like you mainly rely on your eyes to judge whether the food is cooked. Even if you don't have explicit instructions, as long as you see the ingredients change color in the pot, you know it's time to stir.
However, when there are multiple tasks in the kitchen, like cooking soup and stir-frying at the same time, the language pathway becomes important. It's like you need to follow instructions to decide which task to do first.
The innovation of these models is that they can automatically generate actions from visual and language information, like a robot assistant that cooks autonomously. While they perform well in visually rich environments, they may struggle when only the instruction can distinguish between several possible goals. Future research will explore how to find a better balance between visual and language pathways.
ELI14 (explained like you're 14)
Hey there, friends! Imagine you have a super cool robot assistant that can see, hear, and help you do things! This robot is like an all-in-one helper with two main 'superpowers': one is the 'visual pathway,' like its eyes, which can see everything around it; the other is the 'language pathway,' like its ears, which can understand what you say.
Now, this robot's eyes are super powerful. It can decide what to do just by seeing things. For example, if it sees an apple on the table, it will automatically go over and pick it up. Even if you don't tell it, it knows what to do!
But sometimes, it also needs to listen to your instructions, especially when there are many things to do at once. Like, if you tell it to pick up the apple first and then the banana, it needs to use its ears to follow your instructions.
The amazing thing about this robot assistant is that it can combine what it sees and hears to make smart decisions automatically. But sometimes it trusts its eyes so much that it ignores what you said, which becomes a problem when several tasks look equally possible in the same room. In the future, we hope to make it smarter and better at listening to complex instructions!
Glossary
Vision-Language-Action Model
A model that integrates vision, language, and action control to generate actions from multimodal inputs.
Used in this paper to study how multimodal inputs are translated into actions.
Activation Injection
A technique that involves injecting activations from one episode into another to analyze changes in model behavior.
Used to study the dominance of visual pathways in action generation.
Sparse Autoencoder
A neural network used to decompose dense neural activations into sparse, interpretable features.
Used to extract interpretable features in VLA models.
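A minimal sparse-autoencoder sketch in PyTorch, assuming activations have already been cached as tensors; the ReLU-plus-L1 formulation, dimensions, and loss weighting are generic illustrative choices, not the paper's exact architecture or training recipe.

```python
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    """Decompose dense activations into a wider, sparse feature basis."""

    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))  # non-negative, pushed toward sparsity
        recon = self.decoder(features)
        return recon, features


def sae_loss(recon, acts, features, l1_coeff=1e-3):
    """Reconstruction error plus an L1 sparsity penalty on feature activations."""
    return ((recon - acts) ** 2).mean() + l1_coeff * features.abs().mean()


# Hypothetical usage on cached activations of width 2048:
# sae = SparseAutoencoder(d_model=2048, d_features=16384)
# recon, feats = sae(acts_batch)
# loss = sae_loss(recon, acts_batch, feats)
```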
Linear Probe
A technique used to test whether action information can be linearly decoded from intermediate representations.
Used to analyze functional dissociation in different pathways of the model.
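A minimal linear-probe sketch using scikit-learn: fit a ridge regression from cached intermediate activations to ground-truth actions and score it on held-out steps. The file names, the ridge regressor, and the single-layer cache are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Assumed pre-extracted caches: activations (n_steps, d_model) and actions (n_steps, action_dim).
X = np.load("layer_activations.npy")
y = np.load("actions.npy")

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
probe = Ridge(alpha=1.0).fit(X_tr, y_tr)
print("held-out R^2:", r2_score(y_te, probe.predict(X_te)))
```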
Multi-Pathway Architecture
A model design that includes multiple functional pathways, each specialized for different tasks.
Used in this paper to analyze the relative importance of visual and language pathways.
Task Structure
The specific arrangement and requirements of a task, affecting the model's sensitivity to language.
Used to analyze the role of language pathways in different tasks.
Visual Pathway
The pathway in the model responsible for processing visual information, dominating action generation.
Shown in this paper to be critical for action generation.
Language Pathway
The pathway in the model responsible for processing language information, affecting task execution.
Becomes important in multi-goal tasks.
Behavioral Displacement
The magnitude of the change in robot behavior (for example, in the end-effector trajectory) caused by pathway injection or other interventions.
Used to analyze the relative importance of pathways in multi-pathway architectures.
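One simple way to operationalize this metric, assuming per-step end-effector positions are logged for both a baseline and an intervened rollout; the paper's exact definition may differ.

```python
import numpy as np


def behavioral_displacement(baseline_traj: np.ndarray, intervened_traj: np.ndarray) -> float:
    """Mean Euclidean distance between matched timesteps of two (T, 3) position arrays."""
    T = min(len(baseline_traj), len(intervened_traj))
    return float(np.linalg.norm(baseline_traj[:T] - intervened_traj[:T], axis=1).mean())
```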
Cross-Task Injection
Injecting activations from one task into another to study changes in behavior.
Used to reveal the dominance of visual pathways.
Open Questions (unanswered questions from this research)
- 1 How can the roles of visual and language pathways be balanced in Vision-Language-Action models? Current research shows that visual pathways dominate action generation, but they may face challenges when language information is insufficient. Future research needs to explore how to enhance the flexibility and adaptability of language pathways.
- 2 Will the dominance of visual pathways affect model adaptability in complex and dynamic environments? Current research focuses primarily on specific tasks and environments, and future studies need to validate these findings in a broader range of scenarios.
- 3 How can the role of language pathways be enhanced without compromising the dominance of visual pathways? Current research shows that task structure significantly impacts language sensitivity, but how to achieve this in design remains to be explored.
- 4 How does the dissociation and specialization of functions in multi-pathway architectures affect the overall performance of the model? While the study reveals the relative importance of visual and language pathways, in-depth analysis of their interactions remains limited.
- 5 How can the language understanding ability of models be improved when visual information is insufficient? Current research focuses primarily on visually rich scenarios, and future studies need to explore how to improve model performance when visual information is lacking.
Applications
Immediate Applications
Robot Control
Enhancing adaptability in complex environments through visual pathway-dominated action generation, particularly in industrial and service robotics.
Multimodal Task Design
Adjusting the role of language pathways based on task structure to improve model flexibility and adaptability, applicable to intelligent assistants and autonomous driving.
Debugging Vision-Language Models
Providing new debugging methods by analyzing the relative importance of visual and language pathways, helping developers better understand and optimize models.
Long-term Vision
Intelligent Robot Assistants
Developing robot assistants capable of autonomous decision-making in complex and dynamic environments, integrating visual and language information for higher levels of intelligence.
Multimodal AI Systems
Building AI systems capable of handling multiple modalities of information, applicable in fields like healthcare, education, and entertainment, enabling more natural human-machine interaction.
Abstract
Vision-Language-Action (VLA) models combine perception, language, and motor control in a single architecture, yet how they translate multimodal inputs into actions remains poorly understood. We apply activation injection, sparse autoencoders (SAEs), and linear probes to six models spanning 80M to 7B parameters across 394,000+ rollout episodes on four benchmarks. The visual pathway dominates action generation across all architectures: injecting baseline activations into null-prompt episodes recovers near-identical behavior, while cross-task injection steers robots toward source-task positions (99.8% of X-VLA episodes align with the source trajectory), exposing spatially bound motor programs tied to scene coordinates rather than abstract task representations. Language sensitivity depends on task structure, not model design: when visual context uniquely specifies the task, language is ignored; when multiple goals share a scene, language becomes essential (X-VLA libero_goal: 94% → 10% under wrong prompts vs. libero_object: 60–100% regardless). In all three multi-pathway architectures (π0.5, SmolVLA, GR00T), expert pathways encode motor programs while VLM pathways encode goal semantics (2× greater behavioral displacement from expert injection), and subspace injection confirms these occupy separable activation subspaces. Per-token SAE processing is essential for action fidelity on most architectures, though mean-pooling improves fidelity on X-VLA. Contrastive identification recovers 82+ manipulation concepts, and causal ablation reveals sensitivity spanning 28–92% zero-effect rates independent of representation width. We release Action Atlas (https://action-atlas.com) for interactive exploration of VLA representations across all six models.
References (20)
Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success
Moo Jin Kim, Chelsea Finn, Percy Liang
Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware
Tony Zhao, Vikash Kumar, S. Levine et al.
LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models
Senyu Fei, Siyin Wang, Junhao Shi et al.
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
Anthony Brohan, Noah Brown, Justice Carbajal et al.
LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning
Bo Liu, Yifeng Zhu, Chongkai Gao et al.
Pixels Versus Priors: Controlling Knowledge Priors in Vision-Language Models through Visual Counterfacts
Michal Golovanevsky, William Rudman, Michael A. Lepori et al.
Steering Llama 2 via Contrastive Activation Addition
Nina Rimsky, Nick Gabrieli, Julia Schulz et al.
Sparse Autoencoders Find Highly Interpretable Features in Language Models
Hoagy Cunningham, Aidan Ewart, L. Smith et al.
Steering Large Language Models using Conceptors: Improving Addition-Based Activation Engineering
Joris Postmus, Steven Abreu
Flow Matching for Generative Modeling
Y. Lipman, Ricky T. Q. Chen, Heli Ben-Hamu et al.
Locating and Editing Factual Associations in GPT
Kevin Meng, David Bau, A. Andonian et al.
X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model
Jinliang Zheng, Jianxiong Li, Zhihao Wang et al.
Interactive Post-Training for Vision-Language-Action Models
Shuhan Tan, Kairan Dou, Yue Zhao et al.
dVLA: Diffusion Vision-Language-Action Model with Multimodal Chain-of-Thought
Junjie Wen, Minjie Zhu, Jiaming Liu et al.
VLM4VLA: Revisiting Vision-Language-Models in Vision-Language-Action Models
Jianke Zhang, Xiaoyu Chen, Qiuyue Wang et al.
Code as Policies: Language Model Programs for Embodied Control
Jacky Liang, Wenlong Huang, F. Xia et al.
Interpreting CLIP with Hierarchical Sparse Autoencoders
Vladimir Zaigrajew, Hubert Baniecki, P. Biecek
OpenVLA: An Open-Source Vision-Language-Action Model
Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti et al.
GR-3 Technical Report
Chi-Lam Cheang, Sijin Chen, Zhongren Cui et al.
RACER: Rich Language-Guided Failure Recovery Policies for Imitation Learning
Yinpei Dai, Jayjun Lee, Nima Fazeli et al.