$Ψ_0$: An Open Foundation Model Towards Universal Humanoid Loco-Manipulation
The $Ψ_0$ model outperforms baselines trained on over 10× as much data by more than 40% in overall success rate, using only about 800 hours of human video and 30 hours of robot data.
Key Findings
Methodology
This paper introduces $Ψ_0$, an open foundation model designed to tackle complex humanoid loco-manipulation tasks. The model employs a staged training strategy: it first pre-trains a vision-language model (VLM) on large-scale egocentric human videos to acquire generalizable visual-action representations, then post-trains a flow-based action expert on high-quality humanoid robot data to learn precise joint control. This approach maximizes the utility of heterogeneous data sources and circumvents the difficulty of transferring knowledge directly from human videos to robot control.
Key Results
- Using only about 800 hours of human video and 30 hours of robot data, $Ψ_0$ outperforms baselines pre-trained on more than 10× as much data by over 40% in overall success rate across multiple tasks.
- Experimental results demonstrate that $Ψ_0$ excels in complex long-horizon tasks, particularly those involving whole-body motion and dexterous manipulation.
- Ablation studies reveal that the staged training strategy is crucial for enhancing model generalization and data efficiency.
Significance
This research provides a novel solution for humanoid loco-manipulation in complex environments, overcoming the data-efficiency and performance bottlenecks of previous methods. By introducing a staged training strategy, the $Ψ_0$ model not only improves task success rates but also significantly reduces the amount of robot data required, making practical deployment of humanoid robots more feasible.
Technical Contribution
The technical contributions include a new staged training framework that combines a vision-language model with a flow-based action expert, significantly enhancing humanoid manipulation capabilities. The work also identifies an often overlooked data recipe: pre-training on high-quality egocentric human manipulation data followed by post-training on domain-specific humanoid trajectories, offering a new perspective for future robot learning.
Novelty
The novelty of this paper lies in applying a staged training strategy to humanoid loco-manipulation: pre-training a VLM on egocentric human videos and post-training an action expert on robot data, which significantly improves model generalization and data efficiency.
Limitations
- The model still faces limitations in handling extremely complex manipulation tasks, potentially requiring more task-specific data for fine-tuning.
- In tasks requiring high precision, the model may exhibit action jitter, indicating room for improvement.
- While the model performs well in multiple tasks, its adaptability in certain specific environments remains to be further validated.
Future Work
Future research directions include exploring more diverse task scenarios to further enhance model generalization and robustness. Additionally, the research could be extended to other types of robots to verify the method's universality.
AI Executive Summary
Humanoid robots have long faced challenges in loco-manipulation; existing methods often rely on large-scale data for training yet still hit bottlenecks in data efficiency and model performance.
The $Ψ_0$ model introduces a staged training strategy, initially pre-training a vision-language model (VLM) on large-scale egocentric human videos to acquire generalizable visual-action representations. Subsequently, a flow-based action expert is post-trained on high-quality humanoid robot data to learn precise joint control. This approach effectively utilizes heterogeneous data sources, avoiding the difficulties of direct knowledge transfer from human videos to robot control.
Experimental results demonstrate that the $Ψ_0$ model excels in complex tasks, particularly those involving whole-body motion and dexterous manipulation. Using only about 800 hours of human video and 30 hours of robot data, it outperforms baselines pre-trained on more than 10× as much data by over 40% in overall success rate.
This research provides a novel solution for humanoid loco-manipulation in complex environments, overcoming the data-efficiency and performance bottlenecks of previous methods. By introducing a staged training strategy, the $Ψ_0$ model not only improves task success rates but also significantly reduces the amount of robot data required.
However, the model still faces limitations in handling extremely complex manipulation tasks, potentially requiring more task-specific data for fine-tuning. Future research directions include exploring more diverse task scenarios to further enhance model generalization and robustness.
Deep Analysis
Background
The study of humanoid robots has garnered significant attention, with notable progress in whole-body motion control. However, complex manipulation capabilities remain an unsolved challenge. Recent advancements in large language models have inspired researchers to explore scaling laws suitable for embodied agents. Although early studies suggest that large models can significantly enhance generalization in robotic manipulation, these methods often rely on large-scale teleoperation data, which is costly and difficult to obtain. Human egocentric videos offer a scalable alternative, but the substantial embodiment gap between humans and robots makes direct knowledge transfer non-trivial.
Core Problem
Humanoid robots still lack sufficient loco-manipulation capabilities in complex environments, and existing methods face bottlenecks in data efficiency and model performance. The kinematic and dynamic disparities between humans and robots make direct learning from human videos suboptimal for robot control. Effectively utilizing heterogeneous data sources to improve generalization and data efficiency is therefore a pressing challenge.
Innovation
This paper proposes a novel staged training framework that combines vision-language models and flow-based action experts, significantly enhancing humanoid robots' manipulation capabilities. Initially, a VLM is pre-trained on large-scale egocentric human videos to acquire generalizable visual-action representations. Subsequently, a flow-based action expert is post-trained on high-quality humanoid robot data to learn precise joint control. This approach not only improves task success rates but also significantly reduces the amount of data required.
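To make the staged recipe concrete, below is a minimal, hypothetical sketch of the two training objectives: autoregressive next-token prediction for the VLM stage and a flow-matching regression loss for the action-expert stage. All class names, dimensions, and placeholder data are illustrative assumptions, not the authors' implementation (the real model uses a full VLM backbone and an MM-DiT expert, both abstracted away here).

```python
# Hypothetical sketch of the two-stage training recipe (names and sizes are illustrative).
import torch
import torch.nn as nn

class VLMBackbone(nn.Module):
    """Stand-in for a pre-trained vision-language model backbone."""
    def __init__(self, dim=256, vocab=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
        self.lm_head = nn.Linear(dim, vocab)

    def forward(self, tokens):
        # Next-token prediction over tokenized egocentric video and language.
        return self.lm_head(self.encoder(self.embed(tokens)))

class FlowActionExpert(nn.Module):
    """Stand-in for the flow-based action expert predicting joint-space action chunks."""
    def __init__(self, dim=256, action_dim=29):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + action_dim + 1, dim), nn.GELU(), nn.Linear(dim, action_dim))

    def forward(self, context, noisy_actions, tau):
        # Predict the flow-matching velocity for every step of the action chunk.
        tau = tau.expand(*noisy_actions.shape[:-1], 1)
        ctx = context.unsqueeze(1).expand(-1, noisy_actions.shape[1], -1)
        return self.net(torch.cat([ctx, noisy_actions, tau], dim=-1))

# Stage 1: autoregressive pre-training of the VLM on tokenized egocentric human video.
vlm = VLMBackbone()
opt1 = torch.optim.AdamW(vlm.parameters(), lr=1e-4)
tokens = torch.randint(0, 1000, (8, 32))            # placeholder token sequences
logits = vlm(tokens[:, :-1])
loss1 = nn.functional.cross_entropy(logits.reshape(-1, 1000), tokens[:, 1:].reshape(-1))
loss1.backward(); opt1.step()

# Stage 2: flow-matching post-training of the action expert on humanoid robot data,
# conditioned on pooled VLM features.
expert = FlowActionExpert()
opt2 = torch.optim.AdamW(expert.parameters(), lr=1e-4)
context = vlm.encoder(vlm.embed(tokens)).mean(dim=1).detach()   # pooled VLM features
target = torch.randn(8, 16, 29)                     # placeholder joint-space action chunks
noise = torch.randn_like(target)
tau = torch.rand(8, 1, 1)
noisy = (1 - tau) * noise + tau * target            # linear interpolation path
pred = expert(context, noisy, tau)
loss2 = ((pred - (target - noise)) ** 2).mean()     # regress the velocity (a - eps)
loss2.backward(); opt2.step()
```

The design choice this sketch reflects is that the two stages optimize different objectives on different data sources, rather than co-training a single objective on a mixed human-robot corpus.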
Methodology
- Pre-train a vision-language model (VLM) autoregressively on large-scale egocentric human videos to acquire generalizable visual-action representations.
- Post-train a flow-based action expert on high-quality humanoid robot data to learn precise joint control.
- Implement the action expert as a multi-modal diffusion transformer (MM-DiT) that fuses action and vision-language features to efficiently output joint-space action chunks (see the sampling sketch after this list).
- Introduce a real-time action chunking mechanism during training to mitigate motion jitter caused by inference latency.
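For intuition only, here is a minimal sketch of how a flow-based expert could turn noise into a joint-space action chunk at inference time by integrating a learned velocity field. The MM-DiT is abstracted into a single `velocity_fn`, and every name, dimension, and the toy velocity field are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical flow-matching sampler for a joint-space action chunk (illustrative only).
import numpy as np

def sample_action_chunk(velocity_fn, vl_features, chunk_len=16, action_dim=29, steps=10):
    """Integrate the learned velocity field from noise (tau = 0) to actions (tau = 1)."""
    actions = np.random.randn(chunk_len, action_dim)    # start the chunk from Gaussian noise
    dt = 1.0 / steps
    for i in range(steps):
        tau = i * dt
        # In the real system this call would be the MM-DiT expert attending jointly over
        # vision-language tokens and the noisy action tokens; here it is a black box.
        v = velocity_fn(actions, tau, vl_features)
        actions = actions + dt * v                       # one Euler step along the flow
    return actions                                       # joint targets for the whole chunk

# Toy velocity field standing in for a trained expert (purely for demonstration).
def toy_velocity(actions, tau, vl_features):
    goal = np.full_like(actions, vl_features.mean())     # pretend the goal depends on context
    return goal - actions                                # push the chunk toward the "goal"

chunk = sample_action_chunk(toy_velocity, vl_features=0.1 * np.ones(256))
print(chunk.shape)   # (16, 29): one joint command per control step in the chunk
```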
Experiments
The experimental design includes testing the $Ψ_0$ model's performance across multiple complex tasks. The EgoDex dataset, containing approximately 829 hours of human egocentric video, is used for pre-training. The post-training phase utilizes the Humanoid Everyday dataset, comprising approximately 3 million frames of real-world teleoperated data. Experiments also include ablation studies to verify the staged training strategy's effectiveness in enhancing model generalization and data efficiency.
Results
Experimental results demonstrate that the $Ψ_0$ model excels in complex tasks, particularly those involving whole-body motion and dexterous manipulation. Using only about 800 hours of human video and 30 hours of robot data, it outperforms baselines pre-trained on more than 10× as much data by over 40% in overall success rate. Ablation studies confirm that the staged training strategy is crucial for generalization and data efficiency.
Applications
The model can be applied directly to complex humanoid loco-manipulation tasks, such as industrial automation and home service robotics. By improving data efficiency and model performance, $Ψ_0$ makes deploying humanoid robots in practical applications more feasible.
Limitations & Outlook
Despite the $Ψ_0$ model's strong performance across multiple tasks, it still faces limitations in handling extremely complex manipulation tasks, potentially requiring more task-specific data for fine-tuning. Additionally, the model may exhibit action jitter in tasks requiring high precision. Future research directions include exploring more diverse task scenarios to further enhance model generalization and robustness.
Plain Language (accessible to non-experts)
Imagine you're in a kitchen preparing a meal: you take ingredients from the fridge, chop vegetables, cook, and finally serve. This is similar to a robot completing a series of complex tasks. The $Ψ_0$ model acts like a smart assistant. It first learns roughly how people chop and cook by watching a large amount of first-person video of humans doing these things, then refines those skills with a much smaller amount of practice data collected on the robot itself, so it can move around the kitchen and handle objects precisely. What's special about this model is that it can also adapt to different kitchen environments, just like a versatile chef who can handle various situations.
ELI14 (explained like you're 14)
Hey there! Imagine you're playing a super cool robot game where you control a robot to grab a cup, push a cart, and wipe a table. The $Ψ_0$ model is like a super smart helper. It first learns these actions by watching lots of videos filmed from a person's point of view, then polishes its skills with a short stretch of practice on the real robot. So when you need help in the game, it can complete tasks like a pro! And it only needs about 800 hours of videos and 30 hours of robot practice to beat helpers trained on way more data.
Glossary
Vision-Language Model (VLM)
A vision-language model is a deep learning model that combines visual and language information to understand and generate multimodal data.
In this paper, VLM is used to learn visual-action representations from human videos.
Flow-Based Action Expert
A flow-based action expert is an action predictor trained with flow matching: it learns a velocity field that gradually transports random noise into precise joint-control trajectories.
In this paper, the flow-based action expert is used in the post-training phase on robot data.
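As background, a standard flow-matching objective (which this glossary entry assumes as a representative formulation; the paper's exact parameterization may differ) interpolates linearly between noise and a target action chunk and regresses the velocity along that path:

$$
a_\tau = (1-\tau)\,\epsilon + \tau\, a, \qquad \mathcal{L}(\theta) = \mathbb{E}_{a,\;\epsilon \sim \mathcal{N}(0, I),\;\tau}\,\big\| v_\theta(a_\tau, \tau, o) - (a - \epsilon) \big\|^2,
$$

where $a$ is the target action chunk and $o$ is the vision-language context; inference integrates $v_\theta$ from $\tau = 0$ to $\tau = 1$ to produce actions.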
Egocentric Video
Egocentric video refers to videos captured from a first-person perspective, often used to capture natural motion patterns and behavior information.
In this paper, egocentric videos are used for pre-training the VLM.
Multi-Modal Diffusion Transformer (MM-DiT)
A multi-modal diffusion transformer is a deep learning model that combines multimodal information to efficiently output action predictions.
In this paper, MM-DiT is used to implement the flow-based action expert.
Action Chunking Mechanism
Action chunking has the policy predict a short sequence of future actions at once rather than a single step; the real-time variant introduced during training here overlaps execution of the current chunk with inference of the next, mitigating motion jitter caused by inference latency.
In this paper, the action chunking mechanism is used to improve the model's real-time performance.
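A rough sketch of the real-time idea, under the assumption (consistent with the real-time chunking literature cited in the references) that the controller keeps executing the tail of the previous chunk while the next chunk is being predicted; the timing constants and function names are illustrative, not taken from the paper.

```python
# Hypothetical simulation of real-time action chunking (illustrative only).
CHUNK_LEN = 16          # actions per predicted chunk
INFER_DELAY = 3         # control steps the policy needs to produce the next chunk

def predict_chunk(start_step):
    """Stand-in for the policy: returns CHUNK_LEN joint targets labelled by step."""
    return [f"action@{start_step + i}" for i in range(CHUNK_LEN)]

executed = []
current = predict_chunk(0)       # chunk currently being executed
cursor = 0                       # index into the current chunk
pending = None                   # (ready_step, chunk) for the chunk being inferred

for step in range(40):
    # Start inference for the next chunk early enough that it arrives
    # before the current chunk runs out, so the robot never pauses.
    if pending is None and cursor == CHUNK_LEN - INFER_DELAY:
        pending = (step + INFER_DELAY, predict_chunk(step + INFER_DELAY))
    # Switch to the new chunk once its (simulated) inference latency has elapsed.
    if pending is not None and step >= pending[0]:
        current, cursor, pending = pending[1], 0, None
    executed.append(current[min(cursor, CHUNK_LEN - 1)])   # clamp in case inference is late
    cursor += 1

print(executed[:20])   # commands stay continuous across chunk boundaries
```

Note that the paper's mechanism is introduced during training, whereas this sketch only illustrates the execution-time overlap it is meant to support.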
EgoDex Dataset
The EgoDex dataset is a large collection of human egocentric video (about 829 hours) used for training vision-language models.
In this paper, the EgoDex dataset is used for pre-training the VLM.
Humanoid Everyday Dataset
The Humanoid Everyday dataset is a collection of real-world teleoperated humanoid robot data used in the post-training phase.
In this paper, the Humanoid Everyday dataset is used for training the flow-based action expert.
Ablation Study
An ablation study is a method of evaluating the impact of removing or modifying model components on overall performance.
In this paper, ablation studies are used to verify the effectiveness of the staged training strategy.
Embodied Agent
An embodied agent refers to an agent with a physical entity that can interact and learn in the physical world.
In this paper, embodied agents refer to humanoid robots.
Teleoperation Data
Teleoperation data refers to robot operation data obtained through remote control devices, often used for training and evaluating robot models.
In this paper, teleoperation data is used for post-training the flow-based action expert.
Open Questions (unanswered questions from this research)
1. Although the $Ψ_0$ model performs well across multiple tasks, it still faces limitations in handling extremely complex manipulation tasks; future research needs to explore how to further enhance model generalization and robustness.
2. The model may exhibit action jitter in tasks requiring high precision, indicating that performance in high-precision tasks still needs improvement.
3. While the staged training strategy shows excellent data efficiency, further research is needed to validate its effectiveness on larger datasets.
4. Current experiments focus primarily on indoor environments, and applying the model in more complex outdoor environments remains an open question.
5. While the model performs well in multiple tasks, its adaptability in certain specific environments remains to be further validated.
6. How to extend the model to other types of robots to verify its universality still requires further research.
7. In practical applications, how to effectively integrate multi-sensor data to improve model robustness and accuracy remains a problem to be solved.
Applications
Immediate Applications
Industrial Automation
The model can be used for complex tasks in industrial automation, such as material handling and equipment operation on assembly lines, improving production efficiency.
Home Service Robots
In home environments, the model can be used for service robots to perform tasks such as cleaning and item delivery, enhancing convenience.
Medical Assistance Robots
In the medical field, the model can be used for assistance robots to help with tasks such as medication delivery and patient movement, improving healthcare quality.
Long-term Vision
Smart Cities
In future smart cities, the model can be used for urban management and service robots, improving city operation efficiency and residents' quality of life.
Human-Robot Collaboration
The model can be used for complex human-robot collaboration tasks, such as post-disaster rescue and hazardous environment operations, enhancing task completion safety and efficiency.
Abstract
We introduce $Ψ_0$ (Psi-Zero), an open foundation model to address challenging humanoid loco-manipulation tasks. While existing approaches often attempt to address this fundamental problem by co-training on large and diverse human and humanoid data, we argue that this strategy is suboptimal due to the fundamental kinematic and motion disparities between humans and humanoid robots. Therefore, data efficiency and model performance remain unsatisfactory despite the considerable data volume. To address this challenge, $Ψ_0$ decouples the learning process to maximize the utility of heterogeneous data sources. Specifically, we propose a staged training paradigm with different learning objectives: First, we autoregressively pre-train a VLM backbone on large-scale egocentric human videos to acquire generalizable visual-action representations. Then, we post-train a flow-based action expert on high-quality humanoid robot data to learn precise robot joint control. Our research further identifies a critical yet often overlooked data recipe: in contrast to approaches that scale with noisy Internet clips or heterogeneous cross-embodiment robot datasets, we demonstrate that pre-training on high-quality egocentric human manipulation data followed by post-training on domain-specific real-world humanoid trajectories yields superior performance. Extensive real-world experiments demonstrate that $Ψ_0$ achieves the best performance using only about 800 hours of human video data and 30 hours of real-world robot data, outperforming baselines pre-trained on more than 10× as much data by over 40% in overall success rate across multiple tasks. We will open-source the entire ecosystem to the community, including a data processing and training pipeline, a humanoid foundation model, and a real-time action inference engine.
References (20)
GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
Nvidia, Johan Bjorck, Fernando Castañeda et al.
Qwen3-VL Technical Report
Shuai Bai, Yuxuan Cai, Ruizhe Chen et al.
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
Patrick Esser, Sumith Kulal, A. Blattmann et al.
π0.5: a Vision-Language-Action Model with Open-World Generalization
Physical Intelligence, Kevin Black, Noah Brown et al.
Training-Time Action Conditioning for Efficient Real-Time Chunking
Kevin Black, Allen Z. Ren, Michael Equi et al.
Real-Time Execution of Action Chunking Flow Policies
Kevin Black, Manuel Y. Galliker, Sergey Levine
Humanoid Everyday: A Comprehensive Robotic Dataset for Open-World Humanoid Manipulation
Zhenyu Zhao, Hongyi Jing, Xiawei Liu et al.
AMO: Adaptive Motion Optimization for Hyper-Dexterous Humanoid Whole-Body Control
Jialong Li, Xuxin Cheng, Tianshu Huang et al.
RT-1: Robotics Transformer for Real-World Control at Scale
Anthony Brohan, Noah Brown, Justice Carbajal et al.
EgoMimic: Scaling Imitation Learning via Egocentric Video
Simar Kareer, Dhruv Patel, Ryan Punamiya et al.
Expressive Whole-Body Control for Humanoid Robots
Xuxin Cheng, Yandong Ji, Junming Chen et al.
In-N-On: Scaling Egocentric Manipulation with in-the-wild and on-task Data
Xiongyi Cai, Ri-Zhao Qiu, Geng Chen et al.
EgoDex: Learning Dexterous Manipulation from Large-Scale Egocentric Video
Ryan Hoque, Peide Huang, David J. Yoon et al.
Bidirectional Decoding: Improving Action Chunking via Closed-Loop Resampling
Yuejiang Liu, Jubayer Ibn Hamid, Annie Xie et al.
H-RDT: Human Manipulation Enhanced Bimanual Robotic Manipulation
Hongzhe Bi, Lingxuan Wu, Tianwei Lin et al.
Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware
Tony Zhao, Vikash Kumar, S. Levine et al.
OpenVLA: An Open-Source Vision-Language-Action Model
Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti et al.
Visual Imitation Enables Contextual Humanoid Control
Arthur Allshire, Hongsuk Choi, Junyi Zhang et al.
Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos
Hao Luo, Yicheng Feng, Wanpeng Zhang et al.
π0: A Vision-Language-Action Flow Model for General Robot Control
Kevin Black, Noah Brown, Danny Driess et al.