Humanoid-GPT: Scaling Data and Structure for Zero-Shot Motion Tracking

TL;DR

Humanoid-GPT pairs a 2B-frame motion dataset with a GPT-style causal Transformer to achieve zero-shot tracking of highly dynamic motions, surpassing shallow MLP trackers.

cs.RO 2026-06-03
Zekun Qi, Xuchuan Chen, Dairu Liu, Chenghuai Lin, Yunrui Lian, Sikai Liang, Zhikai Zhang, Yu Guan, Jilong Wang, Wenyao Zhang, Xinqiang Yu, He Wang, Li Yi
Deep Learning, Motion Control, Transformer, Zero-Shot Generalization, Large-Scale Data, Robotics

Key Findings

Methodology

This paper introduces Humanoid-GPT, a GPT-style causal Transformer architecture trained on a billion-scale motion corpus that unifies multiple mocap datasets including Lafan1, AMASS, Motion-X++, and PHUMA, supplemented with proprietary real-world motion recordings. The data undergoes rigorous filtering, segmentation, and augmentation, resulting in a 2-billion-frame dataset. The model employs causal attention mechanisms to ensure online, real-time control, aligning with deployment constraints. Multiple motion experts are trained on distinct motion clusters using PPO, with their behaviors distilled into a single Transformer via DAgger, enabling robust zero-shot generalization. The Harmonic Motion Embedding (HME) quantifies motion diversity, guiding balanced sampling during training, which enhances the model’s ability to handle rare and unseen motions.
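
The causality constraint described above can be illustrated with a minimal sketch of single-head attention under a causal mask. This is not the authors' implementation: the function name `causal_attention`, the absence of learned projections, and the plain-list representation are all assumptions made for illustration. The point it shows is that the output at step t is computed only from steps 0..t, which is what allows online control.

```python
import math

def causal_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask.

    queries, keys, values: lists of T vectors (lists of floats).
    Output step t is a weighted average of values[0..t] only.
    Hypothetical toy version: no learned projections, no batching.
    """
    dim = len(queries[0])
    outputs, all_weights = [], []
    for t, q in enumerate(queries):
        # Scores only for positions <= t; later positions are masked out.
        scores = [sum(a * b for a, b in zip(q, keys[s])) / math.sqrt(dim)
                  for s in range(t + 1)]
        peak = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(x - peak) for x in scores]
        total = sum(exps)
        # Masked (future) positions get exactly zero weight.
        weights = [e / total for e in exps] + [0.0] * (len(keys) - t - 1)
        outputs.append([sum(w * v[d] for w, v in zip(weights, values))
                        for d in range(len(values[0]))])
        all_weights.append(weights)
    return outputs, all_weights

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_attention(tokens, tokens, tokens)
```

Because future positions receive zero weight, replacing the last token leaves the first two outputs unchanged, so a controller built this way never needs reference frames it has not yet seen.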

Key Results

  • Humanoid-GPT, with 80 million parameters trained on 2 billion frames, achieves a success rate (SR) of 90.43%, an MPJPE of 76.8 mm, and an MPJVE of 0.4891 rad, outperforming prior state-of-the-art methods. It generalizes zero-shot to unseen motions such as dance, jumps, and martial arts, both in simulation and in real-world tests.
  • On the Unitree-G1 robot platform, Humanoid-GPT accurately tracks complex, high-dynamic motions without finetuning, with an average MPJPE of 0.095 and MPJVE of 1.2 rad/sec, confirming its robustness and real-time capability. The model maintains high fidelity across diverse motion categories, including those not present in training data.
  • A systematic analysis shows that increasing data scale, model capacity, and motion diversity jointly improves performance. Notably, larger models show diminishing returns when trained on small datasets, which underscores the importance of data scale, data diversity, and balanced training strategies. The derived scaling law provides a quantitative framework for future development.
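
For readers unfamiliar with the metrics quoted above, the following sketch shows how MPJPE and success rate are commonly computed. The summary does not give the paper's exact definitions, so this assumes the usual ones: MPJPE as the mean Euclidean distance between matching joints, and success rate as the fraction of episodes that end without a failure flag. The failure criterion itself is left to the caller and is hypothetical here.

```python
import math

def mpjpe(pred, ref):
    """Mean Euclidean distance between matching joints over all frames.

    pred, ref: [frame][joint] -> (x, y, z), same shape and same units.
    The result is in whatever length unit the inputs use.
    """
    dists = [math.dist(p, r)
             for pred_frame, ref_frame in zip(pred, ref)
             for p, r in zip(pred_frame, ref_frame)]
    return sum(dists) / len(dists)

def success_rate(episode_failed):
    """Fraction of episodes that finished without a failure flag.

    episode_failed: list of booleans, one per episode (assumed to be
    produced by some external failure check, e.g. a fall detector).
    """
    return sum(not failed for failed in episode_failed) / len(episode_failed)

# Toy data: one frame, two joints; the second joint is off by (0, 3, 4).
reference = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]]
predicted = [[(0.0, 0.0, 0.0), (1.0, 3.0, 4.0)]]
error = mpjpe(predicted, reference)
rate = success_rate([False, False, True, False])
```

In the toy data one joint matches exactly and the other is 5 units away, so the mean error is 2.5 in the input unit; three of four episodes succeed, giving a rate of 0.75.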

Significance

This work demonstrates that large-scale data and Transformer architectures can substantially improve zero-shot generalization in complex motion tracking tasks. It addresses a longstanding limitation of shallow models trained on limited data. The ability to track highly dynamic, unseen motions in real time is relevant to autonomous robots in industrial, service, and entertainment domains. The findings on data diversity and model scaling also inform future research on more scalable, robust, and versatile embodied AI systems.

Technical Contribution

The paper’s core technical contributions are:

  • A GPT-style causal Transformer for online whole-body motion control, enabling long-horizon sequence modeling.
  • A large, unified motion corpus that combines multiple datasets and proprietary recordings, scaled to 2 billion frames.
  • Motion expert training and distillation via DAgger, which consolidates diverse motion priors into a single generalist model.
  • Harmonic Motion Embedding (HME) for measuring and balancing motion diversity, improving training stability and generalization.
  • Empirical scaling laws that relate data and model size to performance, guiding future system design.

Novelty

This research applies a large-scale, unified motion dataset together with a GPT-style causal Transformer to zero-shot tracking of highly dynamic motions. Unlike prior works limited by small datasets and shallow architectures, it shows that scaling both data and model capacity improves generalization. The use of motion expert distillation and the HME-based diversity balancing mechanism distinguishes it from existing approaches in embodied AI and robotic control.

Limitations

  • Despite impressive results, the model’s computational complexity remains high, requiring substantial hardware resources for training and inference, which may limit deployment in resource-constrained environments.
  • The dataset, although large and diverse, still cannot cover all possible real-world motions, especially in highly unpredictable or extreme scenarios, potentially affecting robustness in untested conditions.
  • Current models are primarily trained offline and lack online adaptation capabilities, which are crucial for dynamic environments where continuous learning and adjustment are needed. Future work should focus on online learning and model compression techniques.

Future Work

Future directions include developing more efficient model architectures to reduce computational costs, integrating multi-modal inputs such as vision and tactile data for richer motion understanding, and enabling online adaptive learning to improve robustness in real-world, unpredictable environments. Additionally, expanding datasets to include more extreme and context-specific motions will further enhance generalization. Exploring transfer learning across different robot platforms and real-time self-supervised learning are promising avenues to realize truly autonomous, versatile embodied agents.

AI Executive Summary

The pursuit of artificial general intelligence (AGI) for embodied agents hinges on overcoming the challenge of robust, flexible motion understanding and control. Traditional approaches, constrained by limited datasets and shallow models, struggle to generalize beyond predefined motion sets, especially when faced with highly dynamic or unseen behaviors. This bottleneck has impeded progress toward autonomous robots capable of natural, adaptive movement in complex environments.

Recent advances in large-scale motion datasets and deep learning architectures have opened new possibilities. However, most prior work relied on small datasets and simple models like shallow MLPs, which inherently limit their capacity to capture the rich variability of human motion. Recognizing this, the authors propose Humanoid-GPT, a novel framework that leverages a billion-scale motion corpus and a GPT-style causal Transformer architecture to achieve unprecedented zero-shot generalization in whole-body motion tracking.

Humanoid-GPT’s core innovation lies in its comprehensive data curation, integrating multiple mocap datasets and proprietary recordings into a unified, high-diversity corpus of 2 billion frames. This dataset encompasses a wide spectrum of human activities, styles, and dynamics. To effectively utilize this data, the authors introduce a Harmonic Motion Embedding (HME) metric, which quantifies motion diversity and guides balanced sampling during training. This ensures the model learns from both common and rare behaviors, preventing overfitting to frequent patterns.
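
The idea of diversity-guided balanced sampling can be sketched as follows. How HME is computed is not described in this summary, so the sketch assumes that each clip has already been assigned a cluster label (for example by clustering its embedding); the labels, function names, and the specific inverse-frequency weighting are illustrative assumptions, not the paper's procedure.

```python
import random
from collections import Counter

def balanced_weights(cluster_ids):
    """Per-clip weights so every cluster has the same total probability.

    cluster_ids: one hypothetical cluster label per clip.
    """
    counts = Counter(cluster_ids)
    n_clusters = len(counts)
    return [1.0 / (n_clusters * counts[c]) for c in cluster_ids]

def sample_clips(cluster_ids, k, seed=0):
    """Draw k clip indices with replacement using the balanced weights."""
    rng = random.Random(seed)
    weights = balanced_weights(cluster_ids)
    return rng.choices(range(len(cluster_ids)), weights=weights, k=k)

# Toy corpus: 8 common clips and 2 rare ones (labels are made up).
clusters = ["walk"] * 8 + ["flip"] * 2
batch = sample_clips(clusters, k=2000)
rare_share = sum(1 for i in batch if clusters[i] == "flip") / len(batch)
```

Under uniform sampling the rare cluster would make up about 20% of each batch; with these weights it is drawn about half the time, which is the kind of rebalancing toward rare behaviors the paragraph describes.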

The model architecture adopts a causal Transformer with GPT-style attention, designed for online, real-time control. It predicts joint targets based on historical states and current references, respecting the causality constraint. To handle the complexity of diverse motions, the authors train multiple motion experts on distinct motion clusters using reinforcement learning (PPO), then distill their behaviors into a single, versatile Transformer via DAgger. This approach consolidates specialized knowledge and enhances generalization.
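
The distillation step can be illustrated with a toy DAgger loop. Everything here is a stand-in chosen for brevity: a one-dimensional state, a hand-written linear `expert_action` in place of the PPO-trained experts, and a one-parameter linear student in place of the Transformer. What the sketch preserves is the defining feature of DAgger: the student is rolled out, the states it visits are labeled by the expert, and the student is retrained on the aggregated dataset.

```python
import random

def expert_action(x):
    """Stand-in for a trained expert policy (hypothetical toy controller)."""
    return -0.8 * x

def rollout(policy, steps, rng):
    """Run a policy on toy dynamics x' = x + a + noise; return visited states."""
    x, states = rng.uniform(-1.0, 1.0), []
    for _ in range(steps):
        states.append(x)
        x = x + policy(x) + rng.gauss(0.0, 0.01)
    return states

def fit_linear(data):
    """Least-squares fit of a = w * x on aggregated (state, action) pairs."""
    num = sum(x * a for x, a in data)
    den = sum(x * x for x, _ in data)
    return num / den

def dagger(iterations=5, steps=50, seed=0):
    """Return the student's weight after several DAgger iterations."""
    rng = random.Random(seed)
    # Iteration 0: collect states by rolling out the expert itself.
    data = [(x, expert_action(x)) for x in rollout(expert_action, steps, rng)]
    w = fit_linear(data)
    for _ in range(iterations):
        # Roll out the current student, then label ITS states with the expert.
        states = rollout(lambda x: w * x, steps, rng)
        data += [(x, expert_action(x)) for x in states]
        w = fit_linear(data)  # retrain on the aggregated dataset
    return w

student_weight = dagger()
```

Because the toy expert is itself linear, the student recovers its gain almost exactly. In a realistic setting the benefit of relabeling the student's own states is that the training data covers the mistakes the student actually makes, rather than only the states an expert would visit.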

Extensive experiments demonstrate that Humanoid-GPT, with 80 million parameters trained on 2 billion frames, outperforms existing methods across multiple metrics. In simulation, it achieves a success rate of over 90%, with significantly reduced joint position and velocity errors. On a real robot platform, it accurately tracks complex motions like dance and martial arts without finetuning, showcasing robust zero-shot transfer. The study also reveals that increasing data and model size yields predictable improvements, following well-defined scaling laws.
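
A scaling analysis of this kind is often summarized by fitting a curve to error versus dataset or model size. The summary does not state the functional form the authors use, so the sketch below assumes a simple power law, error ≈ a · N^(-b), fitted by least squares in log-log space. The data points are synthetic and generated from a known law purely to check the fit; they are not results from the paper.

```python
import math

def fit_power_law(sizes, errors):
    """Fit errors ~ a * sizes**(-b) by least squares in log-log space.

    Returns (a, b). Assumes all sizes and errors are positive.
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    a = math.exp(my - slope * mx)
    return a, -slope

# Synthetic points generated from a known law; NOT values from the paper.
sizes = [1e6, 1e7, 1e8, 1e9]
errors = [50.0 * n ** -0.2 for n in sizes]
a, b = fit_power_law(sizes, errors)
```

On the synthetic points the fit recovers the generating constants (a near 50, b near 0.2). With real measurements the residuals of such a fit indicate how well a single power law describes the trend.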

This work marks a significant leap forward in embodied AI, illustrating that large-scale data and deep sequence modeling can unlock new levels of generalization and control fidelity. It paves the way for future research into scalable, adaptive, and multi-modal robotic systems, with broad implications for automation, entertainment, and human-robot interaction. Despite current computational demands and dataset limitations, the insights gained set a clear direction for ongoing innovation in high-fidelity, autonomous motion control.

Deep Dive

Abstract

We introduce Humanoid-GPT, a GPT-style Transformer with causal attention trained on a billion-scale motion corpus for whole-body control. Unlike prior shallow MLP trackers constrained by scarce data and an agility-generalization trade-off, Humanoid-GPT is pre-trained on a 2B-frame retargeted corpus that unifies all major mocap datasets with large-scale in-house recordings. Scaling both data and model capacity yields a single generative Transformer that tracks highly dynamic behaviors while achieving unprecedented zero-shot generalization to unseen motions and control tasks. Extensive experiments and scaling analyses show that our model establishes a new performance frontier, demonstrating robust zero-shot generalization to unseen tasks while simultaneously tracking highly dynamic and complex motions.

cs.RO cs.AI cs.CV

References (20)

  • Naureen Mahmood, N. Ghorbani, N. Troje et al. AMASS: Archive of Motion Capture As Surface Shapes. 2019 (1839 citations).
  • Kyungmin Lee, Sibeen Kim, Minho Park et al. PHUMA: Physically-Grounded Humanoid Locomotion Dataset. 2025 (11 citations).
  • Ke Fan, Shunlin Lu, Minyue Dai et al. Go to Zero: Towards Zero-Shot Motion Generation with Million-Scale Data. 2025 (56 citations).
  • Yanjie Ze, Zixuan Chen, J. P. Araújo et al. TWIST: Teleoperated Whole-Body Imitation System. 2025 (133 citations).
  • Félix G. Harvey, Mike Yurick, D. Nowrouzezahrai et al. Robust motion in-betweening. 2020 (376 citations).
  • A. Kirillov, Eric Mintun, Nikhila Ravi et al. Segment Anything. 2023 (13710 citations).
  • Yuhong Zhang, Jing-de Lin, Ailing Zeng et al. Motion-X++: A Large-Scale Multimodal 3D Whole-body Human Motion Dataset. 2025 (30 citations).
  • S. Ross, Geoffrey J. Gordon, J. Bagnell. A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. 2010 (4050 citations).
  • Yuzhe Qin, Wei Yang, Binghao Huang et al. AnyTeleop: A General Vision-Based Dexterous Robot Arm-Hand Teleoperation System. 2023 (243 citations).
  • Tairan He, Zhengyi Luo, Xialin He et al. OmniH2O: Universal and Dexterous Human-to-Humanoid Whole-Body Teleoperation and Learning. 2024 (278 citations).
  • Tom B. Brown, Benjamin Mann, Nick Ryder et al. Language Models are Few-Shot Learners. 2020 (58798 citations).
  • Jing-de Lin, Ailing Zeng, Shunlin Lu et al. Motion-X: A Large-scale 3D Expressive Whole-body Human Motion Dataset. 2023 (265 citations).
  • Yuxuan Wang, Ming Yang, Weishuai Zeng et al. From Experts to a Generalist: Toward General Whole-Body Control for Humanoid Robots. 2025 (19 citations).
  • Long Ouyang, Jeff Wu, Xu Jiang et al. Training language models to follow instructions with human feedback. 2022 (21318 citations).
  • Jiaman Li, Jiajun Wu, C. K. Liu. Object Motion Guided Human Motion Synthesis. 2023 (214 citations).
  • Zixuan Chen, Mazeyu Ji, Xuxin Cheng et al. GMT: General Motion Tracking for Humanoid Whole-Body Control. 2025 (96 citations).
  • Jason Wei, Yi Tay, Rishi Bommasani et al. Emergent Abilities of Large Language Models. 2022 (3570 citations).
  • Zhikai Zhang, Jun Guo, Chao Chen et al. Track Any Motions under Any Disturbances. 2025 (47 citations).
  • Mazeyu Ji, Xuanbin Peng, Fangchen Liu et al. ExBody2: Advanced Expressive Humanoid Whole-Body Control. 2024 (132 citations).
  • Xuxin Cheng, Yandong Ji, Junming Chen et al. Expressive Whole-Body Control for Humanoid Robots. 2024 (240 citations).