Sparse Autoencoders Reveal Interpretable and Steerable Features in VLA Models

TL;DR

Sparse Autoencoders reveal interpretable and steerable features in VLA models, validated through steering experiments on the LIBERO benchmark.

cs.RO · Advanced · 2026-03-20
Aiden Swann, Lachlain McGranahan, Hugo Buurmeijer, Monroe Kennedy, Mac Schwager
Sparse Autoencoder · VLA model · Interpretability · Robot Learning · Generalization

Key Findings

Methodology

The study trains Sparse Autoencoders (SAEs) on the hidden-layer activations of Vision-Language-Action (VLA) models, uncovering a sparse dictionary of features that distinguishes memorized sequences from interpretable motion primitives and semantic properties. These features are validated through steering experiments on the LIBERO benchmark, and a metric is proposed to categorize features by their generalizability.
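For concreteness, here is a minimal sketch of the kind of SAE this describes, assuming a TopK-style sparsity constraint (as in Gao et al., cited in the references); the dimensions, hyperparameters, and random stand-in activations are illustrative, not the paper's implementation.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal TopK sparse autoencoder over cached residual-stream activations.

    Projects d_model-dimensional activations into an overcomplete
    d_dict-dimensional latent space and keeps only the k largest
    latents active per input.
    """
    def __init__(self, d_model: int, d_dict: int, k: int):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x: torch.Tensor):
        z = torch.relu(self.encoder(x))
        # Sparsity: zero out all but the top-k latents for each input.
        topk = torch.topk(z, self.k, dim=-1)
        z_sparse = torch.zeros_like(z).scatter_(-1, topk.indices, topk.values)
        return self.decoder(z_sparse), z_sparse

# Training sketch: minimize reconstruction error on cached VLA activations
# (random tensors stand in for real activations here).
sae = SparseAutoencoder(d_model=64, d_dict=512, k=8)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
for x in torch.randn(1024, 64).split(32):
    x_hat, z = sae(x)
    loss = nn.functional.mse_loss(x_hat, x)
    opt.zero_grad()
    loss.backward()
    opt.step()
```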

Key Results

  • Result 1: On the LIBERO benchmark, steering general features induces behaviors consistent with their semantic meaning and transfers across tasks and scenes, demonstrating that VLAs can learn generalizable features.
  • Result 2: Supervised fine-tuning on small robotics datasets disproportionately amplifies memorization, whereas training on larger, more diverse datasets (e.g., DROID) or using knowledge insulation promotes more general features.
  • Result 3: Features extracted by Sparse Autoencoders trained on the DROID dataset span a wide range of scenes and tasks, indicating the approach scales to larger datasets.

Significance

This research provides the first mechanistic evidence that VLA models can learn generalizable features across tasks and scenes, revealing the potential of Sparse Autoencoders in understanding and steering complex models. By deeply analyzing the internal mechanisms of VLA models, this study offers new perspectives for future robot learning research, particularly in enhancing model generalization and interpretability.

Technical Contribution

The technical contribution lies in using Sparse Autoencoders to reveal interpretable features in VLA models and validating these features' causal influence on robot behavior through steering experiments. This approach not only provides deep insights into the model's internal mechanisms but also offers new avenues for model design and optimization.

Novelty

This study is the first to apply Sparse Autoencoders to the residual streams of VLA models, revealing interpretable features and validating their steerability through experiments. This innovation provides new tools and methods for understanding and optimizing complex models.

Limitations

  • Limitation 1: Supervised fine-tuning on small datasets tends to induce memorization rather than compositional skill learning, limiting model generalization.
  • Limitation 2: Although the DROID dataset is large, it is still relatively small compared to language model training datasets, with limited scene and task diversity.
  • Limitation 3: Training and classifying features with Sparse Autoencoders require substantial computational resources, potentially limiting their application in resource-constrained environments.

Future Work

Future directions include exploring larger and more diverse datasets to further enhance model generalization. Additionally, research could focus on efficiently training and applying Sparse Autoencoders in resource-constrained environments and extending this method to other types of models and tasks.

AI Executive Summary

Vision-Language-Action (VLA) models have shown significant potential in the field of robotic manipulation, yet their generalization remains inconsistent. While these models perform impressively in certain settings, fine-tuned variants often falter when faced with novel objects, scenes, and instructions. To gain a deeper understanding of the internal workings of VLA models, this study trains Sparse Autoencoders (SAEs) on the models' hidden-layer activations, revealing a sparse dictionary of features that underpin the model's computations.

The study finds that the majority of extracted SAE features correspond to memorized sequences from specific training demonstrations. However, some features align with interpretable, general, and steerable motion primitives and semantic properties, offering a promising glimpse into the potential for VLA generalizability. A metric is proposed to categorize features based on whether they represent generalizable transferable primitives or episode-specific memorization.
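The summary does not reproduce the metric's exact definition. One plausible form, assuming generality is scored by how evenly a feature's activation mass spreads across tasks and scenes, is a normalized-entropy score; the function below is a hypothetical sketch, not the paper's metric.

```python
import numpy as np

def generality_score(activation_counts: np.ndarray) -> float:
    """Hypothetical generality score: normalized entropy of a feature's
    activation mass across tasks/scenes.

    activation_counts[i] = how often the feature fires in task/scene i.
    Scores near 1.0 suggest a general, transferable feature; scores near
    0.0 suggest episode-specific memorization.
    """
    p = activation_counts / activation_counts.sum()
    p = p[p > 0]
    entropy = -(p * np.log(p)).sum()
    return float(entropy / np.log(len(activation_counts)))

# A feature firing in essentially one task scores low; a feature active
# evenly across tasks scores high.
print(generality_score(np.array([120.0, 1.0, 0.0, 2.0])))    # ~0.1 (memorized)
print(generality_score(np.array([30.0, 28.0, 33.0, 29.0])))  # ~1.0 (general)
```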

Through steering experiments on the LIBERO benchmark, these findings are validated. The study demonstrates that individual SAE features causally influence robot behavior. Steering general features induces behaviors consistent with their semantic meaning and can be applied across tasks and scenes. This work provides the first mechanistic evidence that VLAs can learn generalizable features across tasks and scenes.

The study observes that supervised fine-tuning on small robotics datasets disproportionately amplifies memorization, whereas training on larger, more diverse datasets (e.g., DROID) or using knowledge insulation promotes more general features. To facilitate future research in VLA mechanistic interpretability, the study provides an open-source codebase and user-friendly interface for activation collection, SAE training, and feature steering.

While the study reveals the interpretability and steerability of VLA models, supervised fine-tuning on small datasets tends to induce memorization rather than compositional skill learning, limiting model generalization. Future research directions include exploring larger and more diverse datasets to further enhance model generalization.

Deep Analysis

Background

In recent years, the field of robotic manipulation has increasingly been shaped by research into generalist policies that combine visual inputs, natural language instructions, and continuous control outputs into a single learned system. The primary example of such a policy architecture is the Vision-Language-Action (VLA) model. VLA models typically couple a pretrained vision language model (VLM) backbone with a separate action decoding head. These models are pretrained on large, heterogeneous, cross-embodiment robot datasets such as OpenX Embodiment or DROID. The motivation for using VLA models is straightforward. Large language models (LLMs) and vision language models (VLMs) achieve impressive generalization across a wide variety of tasks, particularly as these frontier models learn rich representations that enable generalization across text, objects, and spatial relations. VLAs attempt to leverage this widespread semantic-visual knowledge through a VLM backbone with the goal of obtaining broad generalization to a variety of robot tasks in diverse visual environments, commanded by open-vocabulary language prompts.

Core Problem

Despite the impressive performance of VLA models in certain settings, their generalization remains inconsistent. Typically, VLAs must be fine-tuned on a specific task or embodiment to perform well. Although benchmarks like LIBERO and Robocasa show rapid empirical progress, these models often lose language-following and generalization abilities during supervised fine-tuning. Furthermore, work such as LIBERO-PRO has shown that models exceeding a 90% success rate under the original protocol can collapse to near zero under systematic perturbations, implying that these policies may rely on rote memorization of action sequences and environment layouts rather than generalizing to new perceptual inputs.

Innovation

To better understand the inner workings of VLA models, this study trains Sparse Autoencoders (SAEs) on their hidden-layer activations, revealing a sparse dictionary of features that distinguishes memorized sequences from interpretable motion primitives and semantic properties. A metric is proposed to categorize features based on their generalizability. These findings are validated through steering experiments on the LIBERO benchmark, demonstrating the causal influence of individual SAE features on robot behavior.

Methodology

  • Train Sparse Autoencoders (SAEs) on the hidden-layer activations of VLA models, revealing a sparse dictionary of features.
  • Propose a metric to categorize features based on whether they represent generalizable, transferable primitives or episode-specific memorization.
  • Validate these findings through steering experiments on the LIBERO benchmark (a minimal steering sketch follows this list).
  • Provide an open-source codebase and user-friendly interface for activation collection, SAE training, and feature steering.
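The steering step referenced above is simple in principle: add a scaled copy of a feature's decoder direction into the residual stream at inference time. A minimal sketch, assuming steering is implemented as a PyTorch forward hook and that the hooked module returns a plain tensor; the module path, layer index, and scale are illustrative, not the paper's configuration.

```python
import torch

def make_steering_hook(decoder_direction: torch.Tensor, scale: float = 8.0):
    """Forward hook that nudges the residual stream along one SAE feature.

    decoder_direction: the chosen feature's column of the SAE decoder,
    i.e. the direction that feature writes back into the residual stream.
    """
    direction = decoder_direction / decoder_direction.norm()

    def hook(module, inputs, output):
        # Returning a tensor from a forward hook replaces the module output;
        # here we shift every token's activation along the feature direction.
        return output + scale * direction

    return hook

# Illustrative usage (`vla_backbone.layers[12]` is a hypothetical module path):
# feature_dir = sae.decoder.weight[:, feature_idx].detach()
# handle = vla_backbone.layers[12].register_forward_hook(
#     make_steering_hook(feature_dir, scale=8.0))
# ... roll out the policy; the steered feature biases the predicted actions ...
# handle.remove()
```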

Experiments

The experimental design includes steering experiments on the LIBERO and DROID datasets to validate the features extracted by Sparse Autoencoders. The experiments use several baseline models and evaluation criteria, including feature activation patterns, task and scene diversity, and feature interpretability and steerability. Ablation studies assess the impact of individual features on model behavior.

Results

The experimental results demonstrate that steering general features induces behaviors consistent with their semantic meaning and transfers across tasks and scenes, showing that VLAs can learn generalizable features. Additionally, the study finds that supervised fine-tuning on small robotics datasets disproportionately amplifies memorization, whereas training on larger, more diverse datasets (e.g., DROID) or using knowledge insulation promotes more general features.

Applications

The applications of this study include feature steering and model optimization in robotic manipulation tasks. By revealing interpretable features in VLA models, the study provides new tools and methods for enhancing the generalization and reliability of robotic manipulation. Additionally, the open-source codebase and user-friendly interface facilitate future research in VLA mechanistic interpretability.

Limitations & Outlook

While the study reveals the interpretability and steerability of VLA models, supervised fine-tuning on small datasets tends to induce memorization rather than compositional skill learning, limiting model generalization. Additionally, training and classifying features with Sparse Autoencoders require substantial computational resources, potentially limiting their application in resource-constrained environments. Future research directions include exploring larger and more diverse datasets to further enhance model generalization.

Plain Language (Accessible to non-experts)

Imagine you're cooking in a kitchen. You have a recipe that tells you how to make a delicious dish step by step. This recipe is like a robot model that needs to know how to handle different ingredients and steps. Now, suppose you have an assistant who helps you better understand the recipe. They tell you which steps are crucial and which can be adjusted flexibly. That's what Sparse Autoencoders do in a robot model. They help the model identify which features are important and which can be applied across different tasks and scenes. This way, the model can better adapt to new tasks and environments instead of just relying on past experiences. It's like your assistant helping you cook delicious dishes in different kitchens, not just the one you're familiar with.

ELI14 (Explained like you're 14)

Hey there! Do you know how robots learn to do things? Just like you learn new stuff at school, robots need to learn how to complete tasks in different environments. Imagine you're playing a game, and you need to remember the rules and tricks for each level. Robots do the same; they need to remember past experiences to complete tasks. But sometimes, they rely too much on these experiences, which makes them struggle in new environments. To help robots adapt better, scientists invented a tool called Sparse Autoencoder. This tool is like a super helper that helps robots figure out which experiences are important and which can be used flexibly. This way, robots can perform well in different tasks and scenes, just like you can score well in different games!

Glossary

Sparse Autoencoder

A type of unsupervised learning technique used to learn sparse representations of data by projecting dense activations onto a higher-dimensional sparse latent space.

Used in the paper to reveal interpretable features in VLA models.
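In equation form, a standard SAE of this kind (the paper's exact variant may differ) encodes an activation x into a sparse code z, reconstructs it, and trains on reconstruction error plus a sparsity penalty:

```latex
z = \mathrm{ReLU}(W_{\mathrm{enc}} x + b_{\mathrm{enc}}), \qquad
\hat{x} = W_{\mathrm{dec}} z + b_{\mathrm{dec}}, \qquad
\mathcal{L} = \lVert x - \hat{x} \rVert_2^2 + \lambda \lVert z \rVert_1
```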

Vision-Language-Action Model

A robot learning model that combines visual inputs, natural language instructions, and continuous control outputs to achieve broad task generalization.

The main subject of the study, used for robotic manipulation tasks.

LIBERO Benchmark

A standard benchmark for evaluating robot learning models, containing various tasks and scenes.

Used to validate the effectiveness of features extracted by Sparse Autoencoders.

DROID Dataset

A large, heterogeneous robot dataset containing various tasks and scenes, used for training and evaluating robot learning models.

Used to train and evaluate the generalization capabilities of VLA models.

Knowledge Insulation

A method to prevent fine-tuning from degrading internal model representations, ensuring the VLM backbone retains semantic information.

Used to promote general features in VLA models.
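One common reading of this idea, sketched below as a toy rather than the cited paper's implementation: gradients from the action objective are stopped (detached) before reaching the backbone, so fine-tuning the action head cannot overwrite the backbone's pretrained representations.

```python
import torch
import torch.nn as nn

# Toy stand-ins for a VLM backbone and an action head (hypothetical shapes).
backbone = nn.Linear(16, 8)
action_head = nn.Linear(8, 4)

obs = torch.randn(2, 16)
target_actions = torch.randn(2, 4)

# Knowledge insulation (sketch): detach acts as a stop-gradient, so the
# action loss updates only the action head, never the backbone's weights.
features = backbone(obs).detach()
action_loss = nn.functional.mse_loss(action_head(features), target_actions)
action_loss.backward()

print(backbone.weight.grad)     # None: the backbone is insulated
print(action_head.weight.grad)  # populated: the action head still learns
```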

Mechanistic Interpretability

A set of tools for understanding the inner workings of learned models by revealing interpretable features to enhance transparency.

Used to analyze the internal mechanisms of VLA models.

Feature Steering

The process of predictably modulating model behavior by steering specific features.

Used to validate the causal influence of features extracted by Sparse Autoencoders.

Supervised Fine-Tuning

The process of fine-tuning a pretrained model on a specific task or dataset to improve its performance.

Observed in the study to amplify memorization on small datasets.

Residual Stream

The running activation vector in a transformer that each attention and MLP block reads from and adds its output back into, carrying intermediate computations between layers.

Sparse Autoencoders are applied to the residual streams of VLA models to reveal interpretable features.

Generalization

The ability of a model to perform well on new tasks and scenes without relying on specific training data.

A primary goal of the study, enhanced by Sparse Autoencoders in VLA models.

Open Questions (Unanswered questions from this research)

  • 1. How can Sparse Autoencoders be efficiently trained and applied in resource-constrained environments? Current methods require substantial computational resources, which limits practical use and leaves open how to improve computational efficiency without sacrificing generalization.
  • 2. How can the generalization capabilities of VLA models be further enhanced on larger and more diverse datasets? Although the DROID dataset is large, it is still small compared to language-model training corpora, with limited scene and task diversity.
  • 3. How can models perform well on new tasks, scenes, and environments without relying on specific training data? Current VLA models tend to memorize when fine-tuned on small datasets and may struggle outside the scenes they were trained in.
  • 4. How can model interpretability and transparency be enhanced without compromising performance? Sparse Autoencoders reveal interpretable features, but their training and application remain challenging.
  • 5. How can the Sparse Autoencoder method be applied to other types of models and tasks? The current study focuses mainly on VLA models, and its potential in other fields remains to be explored.

Applications

Immediate Applications

Robotic Manipulation Optimization

By revealing interpretable features in VLA models, the study provides new tools and methods for enhancing the generalization and reliability of robotic manipulation.

Model Design and Optimization

The Sparse Autoencoder method offers new avenues for future model design and optimization, particularly in enhancing model generalization and interpretability.

Open-Source Codebase

The open-source codebase and user-friendly interface provided by the study facilitate future research in VLA mechanistic interpretability.

Long-term Vision

Intelligent Robot Development

By enhancing the generalization and interpretability of VLA models, the study provides new directions for the development of intelligent robots, particularly in complex tasks and environments.

Cross-Domain Applications

The Sparse Autoencoder method has broad application potential and can be applied to other types of models and tasks, driving cross-domain technological advancements.

Abstract

Vision-Language-Action (VLA) models have emerged as a promising approach for general-purpose robot manipulation. However, their generalization is inconsistent: while these models can perform impressively in some settings, fine-tuned variants often fail on novel objects, scenes, and instructions. We apply mechanistic interpretability techniques to better understand the inner workings of VLA models. To probe internal representations, we train Sparse Autoencoders (SAEs) on hidden layer activations of the VLA. SAEs learn a sparse dictionary whose features act as a compact, interpretable basis for the model's computation. We find that the large majority of extracted SAE features correspond to memorized sequences from specific training demonstrations. However, some features correspond to interpretable, general, and steerable motion primitives and semantic properties, offering a promising glimpse toward VLA generalizability. We propose a metric to categorize features according to whether they represent generalizable transferable primitives or episode-specific memorization. We validate these findings through steering experiments on the LIBERO benchmark. We show that individual SAE features causally influence robot behavior. Steering general features induces behaviors consistent with their semantic meaning and can be applied across tasks and scenes. This work provides the first mechanistic evidence that VLAs can learn generalizable features across tasks and scenes. We observe that supervised fine-tuning on small robotics datasets disproportionately amplifies memorization. In contrast, training on larger, more diverse datasets (e.g., DROID) or using knowledge insulation promotes more general features. We provide an open-source codebase and user-friendly interface for activation collection, SAE training, and feature steering. Our project page is located at http://drvla.github.io


References (20)

1. LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning. Bo Liu, Yifeng Zhu, Chongkai Gao et al., 2023.
2. Knowledge Insulating Vision-Language-Action Models: Train Fast, Run Fast, Generalize Better. Danny Driess, Jost Tobias Springenberg, Brian Ichter et al., 2025.
3. LIBERO-PRO: Towards Robust and Fair Evaluation of Vision-Language-Action Models Beyond Memorization. Xueyang Zhou, Yangming Xu, Guiyao Tie et al., 2025.
4. Scaling and Evaluating Sparse Autoencoders. Leo Gao, Tom Dupré la Tour, Henk Tillman et al., 2024.
5. Visual Instruction Tuning. Haotian Liu, Chunyuan Li, Qingyang Wu et al., 2023.
6. RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots. Soroush Nasiriany, Abhiram Maddukuri, Lance Zhang et al., 2024.
7. Open X-Embodiment: Robotic Learning Datasets and RT-X Models. Open X-Embodiment Collaboration (A. Padalkar, A. Pooley, Ajinkya Jain et al.), 2023.
8. DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset. Alexander Khazatsky, Karl Pertsch, S. Nair et al., 2024.
9. Sparse Autoencoders Find Highly Interpretable Features in Language Models. Hoagy Cunningham, Aidan Ewart, L. Smith et al., 2023.
10. Gemma 2: Improving Open Language Models at a Practical Size. Gemma Team (Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa et al.), 2024.
11. Probing a Vision-Language-Action Model for Symbolic States and Integration into a Cognitive Architecture. Hong Lu, Hengxu Li, Prithviraj Singh Shahani et al., 2025.
12. Mechanistic Interpretability for Steering Vision-Language-Action Models. Bear Häon, Kaylene C. Stocking, Ian Chuang et al., 2025.
13. π0.5: A Vision-Language-Action Model with Open-World Generalization. Physical Intelligence (Kevin Black, Noah Brown et al.), 2025.
14. GPT-3: Its Nature, Scope, Limits, and Consequences. L. Floridi, Massimo Chiriatti, 2020.
15. Enhancing Generalization in Vision-Language-Action Models by Preserving Pretrained Representations. Shresth Grover, Akshay Gopalkrishnan, Bo Ai et al., 2025.
16. Building Production-Ready Probes For Gemini. János Kramár, Joshua Engels, Zheng Wang et al., 2026.
17. OpenVLA: An Open-Source Vision-Language-Action Model. Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti et al., 2024.
18. π0: A Vision-Language-Action Flow Model for General Robot Control. Kevin Black, Noah Brown, Danny Driess et al., 2024.
19. Sparse Autoencoders Learn Monosemantic Features in Vision-Language Models. Mateusz Pach, Shyamgopal Karthik, Quentin Bouniot et al., 2025.
20. Don't Blind Your VLA: Aligning Visual Representations for OOD Generalization. Nikita Kachaev, Mikhail Kolosov, Daniil Zelezetsky et al., 2025.