Not All Features Are Created Equal: A Mechanistic Study of Vision-Language-Action Models

TL;DR

Vision pathways dominate action generation in VLA models; language sensitivity is task-dependent.

cs.RO · 2026-03-20
Bryce Grant, Xijia Zhao, Peng Wang
Tags: Vision-Language Models · Action Generation · Activation Injection · Sparse Autoencoders · Linear Probes

Key Findings

Methodology

This study examines six Vision-Language-Action (VLA) models, ranging from 80M to 7B parameters, using activation injection, sparse autoencoders (SAEs), and linear probes. Across more than 394,000 rollout episodes, it finds that visual pathways dominate action generation in every architecture, while language sensitivity depends on task structure rather than model design.

Key Results

  • Visual Pathway Dominance: Across all architectures, injecting baseline activations into null-prompt episodes recovers nearly identical behavior, and cross-task injection steers robots toward source-task positions, with 99.8% of X-VLA episodes aligning with the source trajectory.
  • Language Sensitivity: When visual context uniquely specifies the task, language is ignored; when multiple goals share a scene, language becomes essential (X-VLA libero_goal: success rate drops from 94% to 10% under wrong prompts).
  • Multi-Pathway Architectures: In π0.5, SmolVLA, and GR00T, expert pathways encode motor programs while VLM pathways encode goal semantics, with expert injection causing twice the behavioral displacement of VLM injection.

Significance

This research highlights the dominance of visual pathways in VLA models and the task-dependent nature of language sensitivity, providing significant insights into how multimodal models translate inputs into actions. By revealing the critical role of visual pathways in action generation, the study offers new perspectives for future robot control and multimodal model design. Additionally, it underscores the importance of task structure in language processing, potentially influencing future multimodal task design.

Technical Contribution

This paper provides the first systematic study across six different VLA model architectures, revealing the dominance of visual pathways in action generation and the task-dependent nature of language sensitivity. By employing techniques such as activation injection, sparse autoencoders, and linear probes, the study demonstrates functional dissociation and specialization in multi-pathway architectures. These findings offer new technical means for designing and debugging multimodal models.

Novelty

This is the first large-scale, systematic study of VLA models, covering architectures from 80M to 7B parameters. The research not only reveals the dominance of visual pathways in action generation but also demonstrates for the first time that language sensitivity is driven by task structure rather than model design, providing new perspectives for understanding and applying multimodal models.

Limitations

  • The dominance of visual pathways may lead to insufficient flexibility in processing language instructions, especially when visual information is inadequate.
  • The study focuses primarily on specific tasks and environments, which may not directly generalize to all types of multimodal tasks.
  • While the study reveals the dominance of visual pathways, in-depth analysis of language pathways remains limited.

Future Work

Future research could further explore how to balance the roles of visual and language pathways in VLA models, especially in complex and dynamic environments. Additionally, it could investigate how to enhance the flexibility and adaptability of language pathways without compromising the dominance of visual pathways.

AI Executive Summary

Vision-Language-Action (VLA) models integrate perception, language, and motor control to generate actions from multimodal inputs. However, the mechanisms by which these models translate inputs into actions remain opaque. Existing solutions often rely on visual-motor priors rather than truly understanding language instructions.

This study examines six VLA models, ranging from 80M to 7B parameters, using activation injection, sparse autoencoders (SAEs), and linear probes. Through over 394,000 rollout episodes, the study reveals the dominance of visual pathways in action generation. Injecting baseline activations into null-prompt episodes recovers nearly identical behavior, while cross-task injection steers robots toward source-task positions, exposing spatially bound motor programs tied to scene coordinates.

The study shows that language sensitivity depends on task structure rather than model design. When visual context uniquely specifies the task, language is ignored; when multiple goals share a scene, language becomes essential. In multi-pathway architectures, expert pathways encode motor programs while VLM pathways encode goal semantics, with expert injection causing twice the behavioral displacement of VLM injection.

These findings provide significant insights into how multimodal models translate inputs into actions. By revealing the critical role of visual pathways in action generation, the study offers new perspectives for future robot control and multimodal model design. Additionally, it underscores the importance of task structure in language processing, potentially influencing future multimodal task design.

However, the study also has limitations. The dominance of visual pathways may lead to insufficient flexibility in processing language instructions, especially when visual information is inadequate. Future research could further explore how to balance the roles of visual and language pathways in VLA models, especially in complex and dynamic environments.

Deep Analysis

Background

Vision-Language-Action (VLA) models represent a significant advancement in the field of multimodal learning. These models integrate visual encoders, language backbones, and action decoders to generate actions from multimodal inputs. Traditionally, robot control has relied on explicit kinematic and control models, whereas VLA models achieve generalization across objects and instructions through end-to-end policies. Despite their rapid adoption in practical applications, the question remains whether these models truly understand and execute language instructions. Existing debugging methods are primarily based on behavioral observation, lacking a deep understanding of the internal mechanisms of the models. Techniques like sparse autoencoders (SAEs) have been used to extract interpretable features from large language models, but their applicability to VLA models remains to be tested.

Core Problem

The mechanisms by which VLA models translate multimodal inputs into actions remain unclear. This opacity presents practical challenges: when a VLA-controlled robot exhibits unexpected behavior, operators have no principled way to diagnose the failure. Existing debugging methods are limited to behavioral observation, lacking a deep understanding of the internal mechanisms of the models. Particularly, the roles of visual and language pathways and how they interact remain largely unexplored.

Innovation

The core innovations of this paper include:

  • Systematic Study: The first large-scale, systematic study of six different VLA model architectures, ranging from 80M to 7B parameters.
  • Visual Pathway Dominance: Revealing the dominance of visual pathways in action generation, with cross-task injection steering robots toward source-task positions.
  • Language Sensitivity: Demonstrating for the first time that language sensitivity is driven by task structure rather than model design.
  • Multi-Pathway Architectures: Showing that in multi-pathway architectures, expert pathways encode motor programs while VLM pathways encode goal semantics.

Methodology

The methodology of this study includes:

  • Activation Injection: Injecting baseline activations into null-prompt episodes to observe the dominance of visual pathways.
  • Sparse Autoencoders (SAEs): Used to extract interpretable features and analyze functional dissociation and specialization in multi-pathway architectures.
  • Linear Probes: Used to test whether action information can be linearly decoded from intermediate representations.
  • Experimental Design: Conducting over 394,000 rollout episodes across four benchmarks, covering six models ranging from 80M to 7B parameters.
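To make the intervention concrete, here is a minimal numpy sketch of activation injection on a toy two-layer policy. The architecture, shapes, and names are illustrative stand-ins, not the paper's actual models:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a VLA policy: visual features -> hidden layer -> action.
W1 = rng.normal(size=(16, 32))   # "vision pathway" weights (hypothetical)
W2 = rng.normal(size=(32, 7))    # action head (e.g. a 7-DoF arm)

def forward(obs, inject=None):
    """Run the policy; optionally overwrite the hidden activation
    with one cached from another episode (activation injection)."""
    h = np.tanh(obs @ W1)
    if inject is not None:
        h = inject               # patch in the source-episode activation
    return h @ W2

source_obs = rng.normal(size=16)  # episode with the real prompt/scene
target_obs = rng.normal(size=16)  # e.g. a null-prompt episode

cached_h = np.tanh(source_obs @ W1)           # cache source activations
patched_action = forward(target_obs, inject=cached_h)
source_action = forward(source_obs)

# With the hidden state fully replaced, the patched rollout reproduces the
# source action -- the analogue of the paper's finding that injected
# baseline activations recover near-identical behavior.
assert np.allclose(patched_action, source_action)
```

In real implementations the same effect is achieved by registering a forward hook on a chosen layer of the actual network rather than rewriting its forward pass.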

Experiments

The experimental design includes:

  • Datasets: Benchmarks including LIBERO, MetaWorld, SimplerEnv, and ALOHA.
  • Baselines: Comparing the performance of different models on the same tasks.
  • Metrics: Task success rate, behavioral displacement, and related measures.
  • Model Scale: Six models ranging from 80M to 7B parameters.
  • Ablation Studies: Analyzing the relative importance of visual and language pathways.

Results

The results show:

  • Visual Pathway Dominance: Across all architectures, injecting baseline activations into null-prompt episodes recovers nearly identical behavior, and cross-task injection steers robots toward source-task positions, with 99.8% of X-VLA episodes aligning with the source trajectory.
  • Language Sensitivity: When visual context uniquely specifies the task, language is ignored; when multiple goals share a scene, language becomes essential (X-VLA libero_goal: success rate drops from 94% to 10% under wrong prompts).
  • Multi-Pathway Architectures: In π0.5, SmolVLA, and GR00T, expert pathways encode motor programs while VLM pathways encode goal semantics, with expert injection causing twice the behavioral displacement of VLM injection.

Applications

Application scenarios include:

  • Robot Control: Enhancing adaptability in complex environments through visual pathway-dominated action generation, particularly in industrial and service robotics.
  • Multimodal Task Design: Adjusting the role of language pathways based on task structure to improve model flexibility and adaptability, applicable to intelligent assistants and autonomous driving.
  • Debugging Vision-Language Models: Providing new debugging methods by analyzing the relative importance of visual and language pathways, helping developers better understand and optimize models.

Limitations & Outlook

Limitations and outlook include:

  • The dominance of visual pathways may lead to insufficient flexibility in processing language instructions, especially when visual information is inadequate.
  • The study focuses primarily on specific tasks and environments, which may not directly generalize to all types of multimodal tasks.
  • While the study reveals the dominance of visual pathways, in-depth analysis of language pathways remains limited.

Plain Language (accessible to non-experts)

Imagine you're in a kitchen cooking. A Vision-Language-Action model is like a robot assistant that can see, hear, and move. Its visual pathway is like your eyes, helping it see every detail in the kitchen, like the location of pots, spatulas, and ingredients. The language pathway is like your ears, helping it understand every instruction you give, like 'stir-fry' or 'add salt.'

In this model, the visual pathway is dominant, just like you mainly rely on your eyes to judge whether the food is cooked. Even if you don't have explicit instructions, as long as you see the ingredients change color in the pot, you know it's time to stir.

However, when there are multiple tasks in the kitchen, like cooking soup and stir-frying at the same time, the language pathway becomes important. It's like you need to follow instructions to decide which task to do first.

The innovation of this model is that it can automatically generate actions based on visual and language information, like a robot assistant that can cook autonomously. While it performs well in visually rich environments, it may face challenges when language information is insufficient. Future research will explore how to find a better balance between visual and language pathways.

ELI14 (explained like you're 14)

Hey there, friends! Imagine you have a super cool robot assistant that can see, hear, and help you do things! This robot is like an all-in-one helper with two main 'superpowers': one is the 'visual pathway,' like its eyes, which can see everything around it; the other is the 'language pathway,' like its ears, which can understand what you say.

Now, this robot's eyes are super powerful. It can decide what to do just by seeing things. For example, if it sees an apple on the table, it will automatically go over and pick it up. Even if you don't tell it, it knows what to do!

But sometimes, it also needs to listen to your instructions, especially when there are many things to do at once. Like, if you tell it to pick up the apple first and then the banana, it needs to use its ears to follow your instructions.

The amazing thing about this robot assistant is that it can combine what it sees and hears to make smart decisions automatically. But sometimes, it might face challenges, like when it can't hear your instructions clearly. In the future, we hope to make it smarter and better at understanding complex instructions!

Glossary

Vision-Language-Action Model

A model that integrates vision, language, and action control to generate actions from multimodal inputs.

Used in this paper to study how multimodal inputs are translated into actions.

Activation Injection

A technique that involves injecting activations from one episode into another to analyze changes in model behavior.

Used to study the dominance of visual pathways in action generation.

Sparse Autoencoder

A neural network used to decompose dense neural activations into sparse, interpretable features.

Used to extract interpretable features in VLA models.
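As a concrete illustration, a minimal SAE forward pass can be sketched in numpy. The widths are made-up, the decoder is tied to the encoder weights, and the network is untrained; a real SAE is trained with a reconstruction loss plus an L1 sparsity penalty:

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_sae = 64, 256        # illustrative widths, not the paper's
W = rng.normal(scale=0.1, size=(d_model, d_sae))  # tied encoder/decoder
b_enc = np.zeros(d_sae)

def sae_encode(x):
    # ReLU keeps codes nonnegative; training's L1 penalty pushes most to zero.
    return np.maximum(x @ W + b_enc, 0.0)

def sae_decode(z):
    return z @ W.T              # reconstruct the dense activation

acts = rng.normal(size=(8, d_model))   # a batch of residual activations
codes = sae_encode(acts)
recon = sae_decode(codes)

l1 = np.abs(codes).sum(axis=-1).mean()       # sparsity term of the loss
mse = ((recon - acts) ** 2).mean()           # reconstruction term
sparsity = (codes == 0).mean()               # fraction of inactive features
assert 0.0 <= sparsity <= 1.0
```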

Linear Probe

A technique used to test whether action information can be linearly decoded from intermediate representations.

Used to analyze functional dissociation in different pathways of the model.
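A linear probe of this kind can be sketched with closed-form ridge regression on synthetic data (the shapes and the synthetic activation/action generator are assumptions for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: hidden activations H and actions A = H @ M + noise.
n, d_hidden, d_action = 500, 32, 7
H = rng.normal(size=(n, d_hidden))
M = rng.normal(size=(d_hidden, d_action))
A = H @ M + 0.01 * rng.normal(size=(n, d_action))

# Ridge-regression probe (closed form): if actions decode linearly,
# the intermediate representation carries action information.
lam = 1e-3
W_probe = np.linalg.solve(H.T @ H + lam * np.eye(d_hidden), H.T @ A)

pred = H @ W_probe
r2 = 1.0 - ((A - pred) ** 2).sum() / ((A - A.mean(0)) ** 2).sum()
assert r2 > 0.99   # near-perfect decoding on this linear toy data
```

On real model activations one would fit the probe on held-out splits; a high test R² is evidence that the layer linearly encodes the action.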

Multi-Pathway Architecture

A model design that includes multiple functional pathways, each specialized for different tasks.

Used in this paper to analyze the relative importance of visual and language pathways.

Task Structure

The specific arrangement and requirements of a task, affecting the model's sensitivity to language.

Used to analyze the role of language pathways in different tasks.

Visual Pathway

The pathway in the model responsible for processing visual information, dominating action generation.

Proven to be critical in action generation in this paper.

Language Pathway

The pathway in the model responsible for processing language information, affecting task execution.

Becomes important in multi-goal tasks.

Behavioral Displacement

Changes in behavior due to pathway injection or other interventions.

Used to analyze the relative importance of pathways in multi-pathway architectures.
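One plausible way to operationalize this metric (the paper's exact definition may differ) is the mean per-step distance between an intervened rollout and its baseline:

```python
import numpy as np

def behavioral_displacement(traj_a, traj_b):
    """Mean per-step end-effector distance between two rollouts.
    An illustrative operationalization, not necessarily the paper's."""
    T = min(len(traj_a), len(traj_b))     # compare overlapping steps only
    return float(np.linalg.norm(traj_a[:T] - traj_b[:T], axis=-1).mean())

baseline = np.zeros((10, 3))              # end-effector xyz per step (toy)
vlm_injected = baseline + np.array([0.05, 0.0, 0.0])
expert_injected = baseline + np.array([0.10, 0.0, 0.0])

d_vlm = behavioral_displacement(baseline, vlm_injected)
d_expert = behavioral_displacement(baseline, expert_injected)

# Mirrors the reported 2x greater displacement from expert-pathway injection.
assert np.isclose(d_expert, 2 * d_vlm)
```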

Cross-Task Injection

Injecting activations from one task into another to study changes in behavior.

Used to reveal the dominance of visual pathways.

Open Questions (unanswered questions from this research)

  1. How can the roles of visual and language pathways be balanced in Vision-Language-Action models? Current research shows that visual pathways dominate action generation, but they may face challenges when language information is insufficient. Future research needs to explore how to enhance the flexibility and adaptability of language pathways.
  2. Will the dominance of visual pathways affect model adaptability in complex and dynamic environments? Current research focuses primarily on specific tasks and environments, and future studies need to validate these findings in a broader range of scenarios.
  3. How can the role of language pathways be enhanced without compromising the dominance of visual pathways? Current research shows that task structure significantly impacts language sensitivity, but how to achieve this in design remains to be explored.
  4. How does the dissociation and specialization of functions in multi-pathway architectures affect the overall performance of the model? While the study reveals the relative importance of visual and language pathways, in-depth analysis of their interactions remains limited.
  5. How can the language understanding ability of models be improved when visual information is insufficient? Current research focuses primarily on visually rich scenarios, and future studies need to explore how to improve model performance when visual information is lacking.

Applications

Immediate Applications

Robot Control

Enhancing adaptability in complex environments through visual pathway-dominated action generation, particularly in industrial and service robotics.

Multimodal Task Design

Adjusting the role of language pathways based on task structure to improve model flexibility and adaptability, applicable to intelligent assistants and autonomous driving.

Debugging Vision-Language Models

Providing new debugging methods by analyzing the relative importance of visual and language pathways, helping developers better understand and optimize models.

Long-term Vision

Intelligent Robot Assistants

Developing robot assistants capable of autonomous decision-making in complex and dynamic environments, integrating visual and language information for higher levels of intelligence.

Multimodal AI Systems

Building AI systems capable of handling multiple modalities of information, applicable in fields like healthcare, education, and entertainment, enabling more natural human-machine interaction.

Abstract

Vision-Language-Action (VLA) models combine perception, language, and motor control in a single architecture, yet how they translate multimodal inputs into actions remains poorly understood. We apply activation injection, sparse autoencoders (SAEs), and linear probes to six models spanning 80M–7B parameters across 394,000+ rollout episodes on four benchmarks. The visual pathway dominates action generation across all architectures: injecting baseline activations into null-prompt episodes recovers near-identical behavior, while cross-task injection steers robots toward source-task positions (99.8% of X-VLA episodes align with the source trajectory), exposing spatially bound motor programs tied to scene coordinates rather than abstract task representations. Language sensitivity depends on task structure, not model design: when visual context uniquely specifies the task, language is ignored; when multiple goals share a scene, language becomes essential (X-VLA libero_goal: 94%→10% under wrong prompts vs. libero_object: 60–100% regardless). In all three multi-pathway architectures (π0.5, SmolVLA, GR00T), expert pathways encode motor programs while VLM pathways encode goal semantics (2× greater behavioral displacement from expert injection), and subspace injection confirms these occupy separable activation subspaces. Per-token SAE processing is essential for action fidelity on most architectures, though mean-pooling improves fidelity on X-VLA. Contrastive identification recovers 82+ manipulation concepts, and causal ablation reveals sensitivity spanning 28–92% zero-effect rates independent of representation width. We release Action Atlas (https://action-atlas.com) for interactive exploration of VLA representations across all six models.


References (20)

  • Moo Jin Kim, Chelsea Finn, Percy Liang (2025). Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success.
  • Tony Zhao, Vikash Kumar, S. Levine et al. (2023). Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware.
  • Senyu Fei, Siyin Wang, Junhao Shi et al. (2025). LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models.
  • Anthony Brohan, Noah Brown, Justice Carbajal et al. (2023). RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control.
  • Bo Liu, Yifeng Zhu, Chongkai Gao et al. (2023). LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning.
  • Michal Golovanevsky, William Rudman, Michael A. Lepori et al. (2025). Pixels Versus Priors: Controlling Knowledge Priors in Vision-Language Models through Visual Counterfacts.
  • Nina Rimsky, Nick Gabrieli, Julia Schulz et al. (2023). Steering Llama 2 via Contrastive Activation Addition.
  • Hoagy Cunningham, Aidan Ewart, L. Smith et al. (2023). Sparse Autoencoders Find Highly Interpretable Features in Language Models.
  • Joris Postmus, Steven Abreu (2024). Steering Large Language Models using Conceptors: Improving Addition-Based Activation Engineering.
  • Y. Lipman, Ricky T. Q. Chen, Heli Ben-Hamu et al. (2022). Flow Matching for Generative Modeling.
  • Kevin Meng, David Bau, A. Andonian et al. (2022). Locating and Editing Factual Associations in GPT.
  • Jinliang Zheng, Jianxiong Li, Zhihao Wang et al. (2025). X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model.
  • Shuhan Tan, Kairan Dou, Yue Zhao et al. (2025). Interactive Post-Training for Vision-Language-Action Models.
  • Junjie Wen, Minjie Zhu, Jiaming Liu et al. (2025). dVLA: Diffusion Vision-Language-Action Model with Multimodal Chain-of-Thought.
  • Jianke Zhang, Xiaoyu Chen, Qiuyue Wang et al. (2026). VLM4VLA: Revisiting Vision-Language-Models in Vision-Language-Action Models.
  • Jacky Liang, Wenlong Huang, F. Xia et al. (2022). Code as Policies: Language Model Programs for Embodied Control.
  • Vladimir Zaigrajew, Hubert Baniecki, P. Biecek (2025). Interpreting CLIP with Hierarchical Sparse Autoencoders.
  • Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti et al. (2024). OpenVLA: An Open-Source Vision-Language-Action Model.
  • Chi-Lam Cheang, Sijin Chen, Zhongren Cui et al. (2025). GR-3 Technical Report.
  • Yinpei Dai, Jayjun Lee, Nima Fazeli et al. (2024). RACER: Rich Language-Guided Failure Recovery Policies for Imitation Learning.