VLA Foundry: A Unified Framework for Training Vision-Language-Action Models

TL;DR

VLA Foundry: a unified, open-source framework for training Vision-Language-Action models, yielding strong multi-task tabletop manipulation policies.

cs.RO · 2026-04-22
Jean Mercat Sedrick Keh Kushal Arora Isabella Huang Paarth Shah Haruki Nishimura Shun Iwase Katherine Liu
Vision-Language Model · Action Model · Open-Source Framework · Multi-Task Learning · Robotic Manipulation

Key Findings

Methodology

VLA Foundry is an open-source framework that unifies the training of large language models (LLMs), vision-language models (VLMs), and vision-language-action models (VLAs) in a single codebase. It provides end-to-end control from language pretraining to action-expert fine-tuning, supporting both from-scratch training and pretrained backbones from Hugging Face. Because the framework offers a shared data-loading and training stack, researchers can co-train across modalities, mix datasets, and prototype new architectures without stitching together disparate tools.
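To make the shared data-loading idea concrete, below is a minimal sketch of weighted dataset mixing across modalities. The stream names, contents, and sampling weights are illustrative assumptions, not VLA Foundry's actual API.

```python
# Hypothetical sketch of cross-modal dataset mixing; stream names and
# weights are invented for illustration and are not VLA Foundry's API.
import itertools
import random
from typing import Any, Dict, Iterator, Tuple

def mix_streams(streams: Dict[str, Iterator[Any]],
                weights: Dict[str, float],
                seed: int = 0) -> Iterator[Tuple[str, Any]]:
    """Yield (modality, sample) pairs, drawing each sample from a stream
    chosen in proportion to its weight."""
    rng = random.Random(seed)
    names = list(streams)
    probs = [weights[n] for n in names]
    while True:
        name = rng.choices(names, weights=probs)[0]
        yield name, next(streams[name])

# Toy infinite stand-ins for text, image-text, and robot-trajectory data.
streams = {
    "language": itertools.cycle(["web text shard"]),
    "vision_language": itertools.cycle(["image-caption pair"]),
    "action": itertools.cycle(["robot trajectory chunk"]),
}
mixer = mix_streams(streams, {"language": 0.5, "vision_language": 0.3,
                              "action": 0.2})
for _, (modality, sample) in zip(range(5), mixer):
    print(modality, sample)
```

Under this kind of interface, changing the data mixture for a co-training run reduces to editing the weight dictionary rather than rewriting loaders.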

Key Results

  • Result 1: On the LBM Eval simulator, the fully open from-scratch model performs on par with prior closed-source work in nominal evaluation settings.
  • Result 2: Substituting in the pretrained Qwen3-VL backbone leads to a strong multi-task tabletop manipulation policy, outperforming the baseline by 20 percentage points.
  • Result 3: The multi-task model trained using VLA Foundry significantly outperforms the prior closed-source multi-task model on 16 simulated tasks.

Significance

The introduction of VLA Foundry gives researchers a flexible, scalable tool for exploring and optimizing the training of Vision-Language-Action models. It resolves the incompatible pretraining pipelines of existing open-source frameworks, letting users conduct both from-scratch training and pretrained-backbone initialization within the same codebase. This unified training stack makes it practical to build and scale VLA systems while exploring new training recipes, architectures, and data mixtures.

Technical Contribution

VLA Foundry's technical contributions lie in its modularity and composability: users can swap architectures, data pipelines, and training recipes through simple command-line or YAML changes. It supports pretrained backbones from Hugging Face and offers scalable distributed training across multi-node, multi-GPU runs.
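As an illustration of config-driven composability, the sketch below instantiates components by name from a YAML snippet. The registry keys, config schema, and component names are assumptions for illustration, not the framework's real interface.

```python
# Hypothetical sketch of name-based instantiation from a YAML config;
# keys and schema are illustrative, not VLA Foundry's actual interface.
import yaml  # requires: pip install pyyaml

REGISTRY = {
    "qwen3_vl_backbone": lambda **kw: ("Qwen3-VL backbone", kw),
    "scratch_vlm": lambda **kw: ("from-scratch VLM", kw),
    "action_expert_head": lambda **kw: ("action expert", kw),
}

CONFIG = yaml.safe_load("""
backbone:
  name: qwen3_vl_backbone
  kwargs: {pretrained: true}
action_head:
  name: action_expert_head
  kwargs: {action_dim: 14}
""")

def build(spec: dict):
    """Look a component up by name and construct it with its kwargs;
    swapping architectures then means editing only the YAML."""
    return REGISTRY[spec["name"]](**spec.get("kwargs", {}))

backbone = build(CONFIG["backbone"])
action_head = build(CONFIG["action_head"])
print(backbone, action_head)
```

Swapping in the from-scratch VLM would then be a one-line change to `backbone.name` in the config, which is the kind of workflow the paper describes.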

Novelty

VLA Foundry is the first framework to unify LLM, VLM, and VLA training in a single codebase, providing end-to-end control from language pretraining to action-expert fine-tuning. Compared with existing open-source frameworks, it resolves the problem of incompatible pretraining pipelines and offers greater flexibility and scalability.

Limitations

  • Limitation 1: While VLA Foundry supports multi-task training, the scarcity of robot interaction data remains a bottleneck the framework alone cannot solve.
  • Limitation 2: In some cases, using pretrained backbones may limit the model's flexibility and adaptability.
  • Limitation 3: The complexity of the framework may pose a learning curve for novice users.

Future Work

Future research directions include further optimizing the training efficiency of VLA Foundry, exploring more diverse datasets and tasks, and improving the framework's user-friendliness and accessibility.

AI Executive Summary

VLA Foundry is an open-source framework designed to unify the training process of Vision-Language-Action (VLA) models. Existing open-source VLA frameworks often focus on the action training stage, stitching together incompatible pretraining pipelines. VLA Foundry addresses this issue by providing end-to-end control from language pretraining to action-expert fine-tuning.

The framework supports both from-scratch training and pretrained backbones from Hugging Face. To demonstrate its utility, researchers trained and released two types of models: one trained fully from scratch through the LLM→VLM→VLA pipeline, and the other built on the pretrained Qwen3-VL backbone.

On the LBM Eval simulator, the fully open from-scratch model performs on par with prior closed-source work in nominal evaluation settings, while the model using the Qwen3-VL backbone excels in multi-task tabletop manipulation policies, outperforming the baseline by 20 percentage points.

For researchers, VLA Foundry thus offers a flexible, scalable tool for exploring and optimizing VLA training: from-scratch training and pretrained-backbone initialization live in the same codebase rather than in incompatible pipelines stitched together after the fact.

Despite these advances in unifying the training process, VLA Foundry cannot by itself resolve the scarcity of robot interaction data. Future research directions include further optimizing training efficiency, exploring more diverse datasets and tasks, and improving the framework's user-friendliness and accessibility.

Deep Analysis

Background

In recent years, the advancement of robotic foundation models has been rapid, with many systems demonstrating capabilities that seemed out of reach just a few years ago. As the frontier moves faster, the tooling required to support rigorous research must keep pace. Many high-impact questions—about data scaling, backbone pretraining, and the interplay between robotics and non-robotics data—require both scale (compute, data, etc.) and modular algorithmic infrastructure that allows users full control over different parts of the model and training pipeline. However, most existing codebases have either not been extensively tested at scale or are largely focused on model releases, limiting research flexibility. At the same time, data scarcity remains a fundamental bottleneck in robotics. Robot interaction data is severely constrained relative to data used for language and vision models, especially in diversity and in signal density per token. Despite this data disparity, most open-source VLA frameworks focus narrowly on the action training stage, treating the upstream data recipe as fixed or out-of-scope. Such separation is problematic: data decisions made during LLM and VLM pretraining have direct consequences for downstream robotics performance. Exploring the design space requires a framework that treats the entire pipeline, from pretraining to policy learning, as a single, controllable system.

Core Problem

Existing open-source VLA frameworks often focus on the action training stage, stitching together incompatible pretraining pipelines. This approach leads to limitations in research flexibility, as researchers cannot conduct both from-scratch training and pretrained backbone initialization within the same codebase. Additionally, data scarcity remains a fundamental bottleneck in robotics, with robot interaction data severely constrained relative to data used for language and vision models, especially in diversity and in signal density per token.

Innovation

The core innovations of VLA Foundry lie in its modularity and composability: users can swap architectures, data pipelines, and training recipes through simple command-line or YAML changes, and the framework supports pretrained backbones from Hugging Face alongside scalable multi-node, multi-GPU distributed training. Because data loading and training share a single stack, researchers can co-train across modalities, mix datasets, and prototype new architectures without stitching together disparate tools. VLA Foundry is the first framework to unify LLM, VLM, and VLA training in one codebase, with end-to-end control from language pretraining to action-expert fine-tuning.

Methodology

VLA Foundry is designed around end-to-end control of the embodied-model pipeline: the same training loop, data abstractions, and configuration interface extend from language pretraining to vision-language training and action learning.

  • Modularity and Composability: models, data pipelines, encoders, and loss handlers are instantiated by name from a YAML-based configuration system.
  • Scalable Distributed Training: multi-node, multi-GPU runs with automatic gradient accumulation, mixed precision, and checkpoint synchronization (see the sketch after this list).
  • Evaluation: closed-loop evaluation on lbm_eval_oss, the open-source release of the benchmark, using the high-fidelity Drake physics engine to model the robots and scene dynamics.
  • Statistical Analysis: rigorous comparison of success rates across multiple policies via STEP.
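As a rough illustration of the training-loop mechanics named above, here is a minimal PyTorch sketch combining gradient accumulation with mixed precision. The model, data, and accumulation factor are placeholders, and this is not VLA Foundry's actual loop, which additionally handles multi-node distribution (e.g., wrapping the model for data parallelism under a launcher like torchrun) and checkpoint synchronization.

```python
# Minimal sketch (not VLA Foundry's code) of gradient accumulation plus
# mixed precision in PyTorch; model and data are toy placeholders.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

ACCUM_STEPS = 4  # micro-batches accumulated per optimizer update

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 8)).to(device)
loader = DataLoader(TensorDataset(torch.randn(256, 32), torch.randn(256, 8)),
                    batch_size=16)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.amp.GradScaler(enabled=(device == "cuda"))

for step, (x, y) in enumerate(loader):
    x, y = x.to(device), y.to(device)
    # Mixed precision: run the forward pass in reduced precision where safe.
    with torch.autocast(device_type=device, enabled=(device == "cuda")):
        loss = nn.functional.mse_loss(model(x), y)
    # Divide so the accumulated gradient averages over the micro-batches.
    scaler.scale(loss / ACCUM_STEPS).backward()
    if (step + 1) % ACCUM_STEPS == 0:
        scaler.step(optimizer)  # unscales gradients, then optimizer.step()
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
```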

Experiments

The experimental design evaluates two types of models on the LBM Eval simulator: one trained fully from scratch through the LLM→VLM→VLA pipeline and the other built on the pretrained Qwen3-VL backbone. Evaluation covers 16 simulated tasks of varying complexity and manipulation modes, with ablations comparing multi-task against single-task training as well as sim-only and real-only data subsets.

Results

On the LBM Eval simulator, the fully open from-scratch model performs on par with prior closed-source work in nominal evaluation settings, while the Qwen3-VL-backed model yields a strong multi-task tabletop manipulation policy, outperforming the baseline by 20 percentage points. The multi-task model trained with VLA Foundry significantly outperforms the prior closed-source multi-task model across 16 simulated tasks, demonstrating that a stronger VLM backbone can substantially enhance VLA performance.
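As a back-of-the-envelope illustration of how such success-rate gaps can be checked for significance, the sketch below runs a generic two-proportion z-test; this is not the STEP procedure the paper uses, and the rollout counts are invented.

```python
# Generic two-proportion z-test for comparing policy success rates;
# the counts below are made up, and this is NOT the STEP analysis tool.
from math import sqrt
from statistics import NormalDist

def compare_success_rates(k_a: int, n_a: int, k_b: int, n_b: int):
    """Return (rate difference, two-sided p-value) for k successes of n."""
    p_a, p_b = k_a / n_a, k_b / n_b
    p_pool = (k_a + k_b) / (n_a + n_b)          # pooled success rate
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_a - p_b, p_value

# e.g., policy A succeeds in 80/100 rollouts, baseline B in 60/100:
diff, p = compare_success_rates(80, 100, 60, 100)
print(f"difference = {diff:.2f}, p = {p:.4f}")  # a 20-point gap, p < 0.01
```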

Applications

VLA Foundry's application scenarios include the development and optimization of robotic manipulation policies, particularly in multi-task tabletop manipulation. It provides researchers with a flexible and scalable tool for exploring and optimizing the training of Vision-Language-Action models. The framework can also be applied to other domains requiring multimodal data integration and cross-modal training, such as autonomous driving and human-computer interaction.

Limitations & Outlook

Despite significant advances in unifying the training process, VLA Foundry cannot by itself resolve the scarcity of robot interaction data. Relying on pretrained backbones may also limit the model's flexibility and adaptability, and the framework's complexity may pose a learning curve for novice users. Future research directions include further optimizing training efficiency, exploring more diverse datasets and tasks, and improving the framework's user-friendliness and accessibility.

Plain Language (accessible to non-experts)

Imagine you're in a kitchen cooking a meal. You need a recipe (language model), a visual reference (vision model), and the actual cooking actions (action model). VLA Foundry is like a smart kitchen assistant that not only helps you find recipes but also tells you how to adjust your cooking steps based on visual cues and guides you through the actual cooking process. This assistant can learn new recipes from scratch or improve upon existing ones. Its uniqueness lies in its ability to unify all these steps in one system, without requiring you to switch between different tools. It's like having an all-in-one assistant in your kitchen, helping you from selecting ingredients to the final plating, ensuring every step is flawless.

ELI14 (explained like you're 14)

Hey there! Imagine you're playing a super complex game that requires you to use your language, vision, and action skills all at once. VLA Foundry is like a super smart game assistant that helps you make the best decisions in the game. For example, it helps you understand the game's dialogues (language model), recognize objects in the game (vision model), and guide you on how to control your character (action model). This assistant can learn new skills from scratch and improve upon existing experiences. It's like having an all-knowing game buddy who's always there to help you out, making sure you ace the game!

Glossary

Vision-Language Model

A model that combines visual and language information to perform tasks, commonly used for image captioning and visual question answering.

A core component for cross-modal representation learning in VLA Foundry.

Action Model

A model used to predict and generate robotic manipulation actions, typically trained with visual and language information.

Key component for training robotic manipulation policies in VLA Foundry.

Open-Source Framework

A framework that provides code and resources openly, allowing anyone to freely use, modify, and distribute it.

VLA Foundry as an open-source framework provides a unified training process.

Multi-Task Learning

A machine learning method that learns multiple related tasks simultaneously, aiming to improve generalization by sharing information.

VLA Foundry supports multi-task training to enhance model performance across different tasks.

Robotic Manipulation

Tasks involving robot interaction with objects, including grasping, moving, and manipulating objects.

VLA Foundry is used to train and optimize robotic manipulation policies.

Hugging Face

An open-source platform providing pretrained models and tools, widely used for natural language processing and computer vision tasks.

VLA Foundry supports using pretrained backbones from Hugging Face.

Distributed Training

A method of training models in parallel across multiple computing nodes to improve training efficiency and model scale.

VLA Foundry offers scalable distributed training across multi-node, multi-GPU runs.

Data Scarcity

A problem where the amount of data required to train models is insufficient, potentially leading to decreased model performance.

VLA Foundry cannot by itself resolve the scarcity of robot interaction data.

Modular Design

A design approach that breaks down a system into independent modules to enhance flexibility and scalability.

VLA Foundry's modular design allows users to easily swap architectures and training recipes.

End-to-End Control

Refers to a system where all steps from input to output are controlled and managed by the same framework.

VLA Foundry provides end-to-end control from language pretraining to action-expert fine-tuning.

Open Questions (unanswered questions from this research)

  • 1 How can robotic manipulation policies perform well under data scarcity? Existing methods struggle with limited diversity and signal density, motivating new data augmentation and generation techniques.
  • 2 How can VLA Foundry's training efficiency be further optimized? The framework supports distributed training, but performance bottlenecks remain for large-scale datasets and models, calling for new parallelization and optimization techniques.
  • 3 How can VLA Foundry be made more user-friendly? The framework's complexity may pose a learning curve for novice users, motivating more intuitive interfaces and tutorials.
  • 4 How can more diverse datasets and tasks be integrated into VLA Foundry? Existing datasets and tasks may be insufficient to comprehensively evaluate model performance, motivating new benchmarks.
  • 5 How can stronger cross-modal learning be achieved in VLA Foundry? Current methods may not fully exploit multimodal data, motivating new representation-learning and alignment techniques.

Applications

Immediate Applications

Robotic Manipulation Policy Optimization

VLA Foundry can be used to develop and optimize robotic manipulation policies, particularly in multi-task tabletop manipulation.

Autonomous Driving System Development

By integrating vision, language, and action models, VLA Foundry can be used to develop more intelligent autonomous driving systems.

Human-Computer Interaction System Enhancement

VLA Foundry can be used to develop more natural and efficient human-computer interaction systems, enhancing user experience.

Long-term Vision

Intelligent Robotic Assistants

With continuous optimization and expansion, VLA Foundry is expected to become the foundational framework for developing intelligent robotic assistants, supporting more complex and diverse tasks.

Cross-Modal AI Systems

VLA Foundry's unified training framework provides the potential for developing more powerful cross-modal AI systems, driving further advancements in AI technology.

Abstract

We present VLA Foundry, an open-source framework that unifies LLM, VLM, and VLA training in a single codebase. Most open-source VLA efforts specialize in the action training stage, often stitching together incompatible pretraining pipelines. VLA Foundry instead provides a shared training stack with end-to-end control, from language pretraining to action-expert fine-tuning. VLA Foundry supports both from-scratch training and pretrained backbones from Hugging Face. To demonstrate the utility of our framework, we train and release two types of models: the first trained fully from scratch through our LLM→VLM→VLA pipeline and the second built on the pretrained Qwen3-VL backbone. We evaluate closed-loop policy performance of both models on LBM Eval, an open-data, open-source simulator. We also contribute usability improvements to the simulator and the STEP analysis tools for easier public use. In the nominal evaluation setting, our fully open from-scratch model is on par with our prior closed-source work, and substituting in the Qwen3-VL backbone leads to a strong multi-task tabletop manipulation policy outperforming our baseline by a wide margin. The VLA Foundry codebase is available at https://github.com/TRI-ML/vla_foundry and all multi-task model weights are released on https://huggingface.co/collections/TRI-ML/vla-foundry. Additional qualitative videos are available on the project website https://tri-ml.github.io/vla_foundry.

cs.RO cs.AI cs.CV cs.LG cs.SE
