Humanoid-GPT: Scaling Data and Structure for Zero-Shot Motion Tracking
Humanoid-GPT combines a 2B-frame motion dataset with a GPT-style causal Transformer to achieve zero-shot tracking of highly dynamic motions, surpassing shallow MLP trackers.
Key Findings
Methodology
This paper introduces Humanoid-GPT, a GPT-style causal Transformer architecture trained on a billion-scale motion corpus that unifies multiple mocap datasets including Lafan1, AMASS, Motion-X++, and PHUMA, supplemented with proprietary real-world motion recordings. The data undergoes rigorous filtering, segmentation, and augmentation, resulting in a 2-billion-frame dataset. The model employs causal attention mechanisms to ensure online, real-time control, aligning with deployment constraints. Multiple motion experts are trained on distinct motion clusters using PPO, with their behaviors distilled into a single Transformer via DAgger, enabling robust zero-shot generalization. The Harmonic Motion Embedding (HME) quantifies motion diversity, guiding balanced sampling during training, which enhances the model’s ability to handle rare and unseen motions.
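The expert-to-student distillation step can be illustrated with a toy DAgger loop. This is only a minimal sketch under stated assumptions: the 1-D environment, the proportional `expert_action`, and the linear student are hypothetical stand-ins, not the paper's PPO experts or Transformer. It shows the three DAgger steps: roll out the student, label the visited states with the expert, and retrain on the aggregated data.

```python
import random

def expert_action(state):
    # Hypothetical expert: a proportional controller that drives the state to zero.
    return -0.5 * state

def make_student(gain):
    # Hypothetical student: a linear policy, action = gain * state.
    return lambda state: gain * state

def rollout(policy, steps=20, seed=0):
    """Run a policy in a toy 1-D environment and return the visited states."""
    rng = random.Random(seed)
    state = rng.uniform(-1.0, 1.0)
    visited = []
    for _ in range(steps):
        visited.append(state)
        state = state + policy(state) + rng.gauss(0.0, 0.01)
    return visited

def fit_linear(states, actions):
    """Least-squares fit of action = gain * state (no bias term)."""
    num = sum(s * a for s, a in zip(states, actions))
    den = sum(s * s for s in states)
    return num / den if den > 0 else 0.0

def dagger(iterations=5):
    data_states, data_actions = [], []
    gain = 0.0  # the student starts as a do-nothing policy
    for it in range(iterations):
        # 1) Roll out the student so the data covers states it actually visits.
        states = rollout(make_student(gain), seed=it)
        # 2) Label those states with the expert's actions.
        data_states.extend(states)
        data_actions.extend(expert_action(s) for s in states)
        # 3) Retrain the student on the aggregated dataset.
        gain = fit_linear(data_states, data_actions)
    return gain

student_gain = dagger()
```

Because the toy expert is itself linear, the student recovers its gain; with several experts, step 2 would presumably query whichever expert owns the current motion cluster.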
Key Results
- Humanoid-GPT, with 80 million parameters trained on 2 billion frames, achieves a success rate (SR) of 90.43%, MPJPE of 76.8mm, and MPJVE of 0.4891 rad, outperforming prior state-of-the-art methods by significant margins. It demonstrates strong zero-shot generalization to unseen motions such as dance, jump, and martial arts, both in simulation and real-world tests.
- On the Unitree-G1 robot platform, Humanoid-GPT accurately tracks complex, high-dynamic motions without finetuning, with an average MPJPE of 0.095 and MPJVE of 1.2 rad/sec, confirming its robustness and real-time capability. The model maintains high fidelity across diverse motion categories, including those not present in training data.
- A systematic analysis reveals that increasing data scale, model capacity, and motion diversity jointly improve performance. Notably, larger models show diminishing returns when trained on small datasets, which underscores the importance of data diversity and balanced training strategies. The derived scaling law provides a quantitative framework for future development.
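For readers unfamiliar with the position metric quoted above, MPJPE is conventionally the mean Euclidean distance between tracked and reference joint positions. The sketch below assumes that standard definition; whether the paper measures it in a global or root-relative frame is not stated here, and the function name and input layout are illustrative.

```python
import math

def mpjpe(pred, ref):
    """Mean per-joint position error.

    pred, ref: lists of frames; each frame is a list of (x, y, z) joint
    positions. Returns the mean Euclidean distance, in the unit of the inputs.
    """
    total, count = 0.0, 0
    for pred_frame, ref_frame in zip(pred, ref):
        for p, r in zip(pred_frame, ref_frame):
            total += math.dist(p, r)
            count += 1
    return total / count

# One frame, two joints: the first joint is off by 0.1 along z, the second is exact.
ref = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]]
pred = [[(0.0, 0.0, 0.1), (1.0, 0.0, 0.0)]]
error = mpjpe(pred, ref)  # mean of 0.1 and 0.0
```

If the inputs are in metres, multiplying the result by 1000 gives millimetres, the unit used for the simulation figure above.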
Significance
This work marks a pivotal advancement in embodied AI, demonstrating that large-scale data and Transformer architectures can fundamentally enhance zero-shot generalization in complex motion tracking tasks. It addresses longstanding limitations of shallow models constrained by limited data, paving the way for more adaptable and intelligent robotic systems. The ability to track highly dynamic, unseen motions in real time opens new horizons for autonomous robots in industrial, service, and entertainment domains. Moreover, the insights into data diversity and model scaling inform future research directions, fostering the development of more scalable, robust, and versatile embodied AI systems.
Technical Contribution
The paper’s core technical contributions are:
- A GPT-style causal Transformer for online whole-body motion control, enabling long-horizon sequence modeling.
- A large, unified motion corpus that combines multiple datasets and proprietary recordings, scaled to 2 billion frames.
- Motion expert training and distillation via DAgger, which consolidates diverse motion priors into a single, generalist model.
- The Harmonic Motion Embedding (HME) for measuring and balancing motion diversity, improving training stability and generalization.
- Comprehensive empirical scaling laws that relate data and model size to performance, guiding future system design.
Novelty
This research is pioneering in applying a large-scale, unified motion dataset combined with GPT-style causal Transformers for zero-shot high-dynamic motion tracking. Unlike prior works limited by small datasets and shallow architectures, this study demonstrates that scaling both data and model capacity leads to unprecedented generalization. The innovative use of motion expert distillation and the HME-based diversity balancing mechanism distinguishes it from existing approaches, establishing a new paradigm for embodied AI and robotic control.
Limitations
- Despite impressive results, the model’s computational complexity remains high, requiring substantial hardware resources for training and inference, which may limit deployment in resource-constrained environments.
- The dataset, although large and diverse, still cannot cover all possible real-world motions, especially in highly unpredictable or extreme scenarios, potentially affecting robustness in untested conditions.
- Current models are primarily trained offline and lack online adaptation capabilities, which are crucial for dynamic environments where continuous learning and adjustment are needed. Future work should focus on online learning and model compression techniques.
Future Work
Future directions include developing more efficient model architectures to reduce computational costs, integrating multi-modal inputs such as vision and tactile data for richer motion understanding, and enabling online adaptive learning to improve robustness in real-world, unpredictable environments. Additionally, expanding datasets to include more extreme and context-specific motions will further enhance generalization. Exploring transfer learning across different robot platforms and real-time self-supervised learning are promising avenues to realize truly autonomous, versatile embodied agents.
AI Executive Summary
The pursuit of artificial general intelligence (AGI) for embodied agents hinges on overcoming the challenge of robust, flexible motion understanding and control. Traditional approaches, constrained by limited datasets and shallow models, struggle to generalize beyond predefined motion sets, especially when faced with highly dynamic or unseen behaviors. This bottleneck has impeded progress toward autonomous robots capable of natural, adaptive movement in complex environments.
Recent advances in large-scale motion datasets and deep learning architectures have opened new possibilities. However, most prior work relied on small datasets and simple models like shallow MLPs, which inherently limit their capacity to capture the rich variability of human motion. Recognizing this, the authors propose Humanoid-GPT, a novel framework that leverages a billion-scale motion corpus and a GPT-style causal Transformer architecture to achieve unprecedented zero-shot generalization in whole-body motion tracking.
Humanoid-GPT’s core innovation lies in its comprehensive data curation, integrating multiple mocap datasets and proprietary recordings into a unified, high-diversity corpus of 2 billion frames. This dataset encompasses a wide spectrum of human activities, styles, and dynamics. To effectively utilize this data, the authors introduce a Harmonic Motion Embedding (HME) metric, which quantifies motion diversity and guides balanced sampling during training. This ensures the model learns from both common and rare behaviors, preventing overfitting to frequent patterns.
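Balanced sampling of this kind can be sketched as follows. The HME computation itself is not described in this summary, so the example assumes each clip already carries a cluster label (the labels `"walk"` and `"dance"` are hypothetical) and simply gives every cluster the same total sampling mass.

```python
import random
from collections import Counter

def balanced_weights(cluster_ids):
    """Per-clip sampling weights that give every cluster equal total mass."""
    counts = Counter(cluster_ids)
    # A clip's weight is its cluster's share (1 / num_clusters)
    # split evenly among the clips in that cluster.
    return [1.0 / (len(counts) * counts[c]) for c in cluster_ids]

# Hypothetical corpus: 8 common clips and 2 rare ones.
cluster_ids = ["walk"] * 8 + ["dance"] * 2
weights = balanced_weights(cluster_ids)

rng = random.Random(0)
batch = rng.choices(range(len(cluster_ids)), weights=weights, k=1000)
drawn = Counter(cluster_ids[i] for i in batch)  # roughly half "walk", half "dance"
```

Under uniform sampling the rare cluster would make up about 20% of each batch; with these weights it makes up about half.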
The model architecture adopts a causal Transformer with GPT-style attention, designed for online, real-time control. It predicts joint targets based on historical states and current references, respecting the causality constraint. To handle the complexity of diverse motions, the authors train multiple motion experts on distinct motion clusters using reinforcement learning (PPO), then distill their behaviors into a single, versatile Transformer via DAgger. This approach consolidates specialized knowledge and enhances generalization.
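The causality constraint amounts to masking attention so that the output at step t never depends on later steps. Below is a minimal single-head sketch in plain Python; it omits learned projections, multiple heads, and everything else in the actual model, and the function name is illustrative.

```python
import math

def causal_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    q, k, v: lists of T vectors (lists of floats).
    Output row t is a weighted average of v[0..t] only.
    """
    d = len(q[0])
    out = []
    for t in range(len(q)):
        # Score position t against positions 0..t; future positions are masked out.
        scores = [
            sum(a * b for a, b in zip(q[t], k[s])) / math.sqrt(d)
            for s in range(t + 1)
        ]
        # Numerically stable softmax over the visible positions.
        top = max(scores)
        exps = [math.exp(x - top) for x in scores]
        norm = sum(exps)
        probs = [e / norm for e in exps]
        out.append([
            sum(p * v[s][j] for s, p in enumerate(probs))
            for j in range(len(v[0]))
        ])
    return out

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = causal_attention(x, x, x)

# Changing the last step leaves all earlier outputs untouched.
x_changed = x[:2] + [[5.0, -3.0]]
y_changed = causal_attention(x_changed, x_changed, x_changed)
```

This property is what permits online control: the action at each step can be computed as soon as that step's observation arrives.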
Extensive experiments demonstrate that Humanoid-GPT, with 80 million parameters trained on 2 billion frames, outperforms existing methods across multiple metrics. In simulation, it achieves a success rate of over 90%, with significantly reduced joint position and velocity errors. On a real robot platform, it accurately tracks complex motions like dance and martial arts without finetuning, showcasing robust zero-shot transfer. The study also reveals that increasing data and model size yields predictable improvements, following well-defined scaling laws.
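The functional form of the scaling law is not given in this summary. A common choice is a power law, error ≈ a · N^(−b), which can be fitted by linear regression in log-log space. The sketch below assumes that form and uses synthetic points generated from known constants; none of the numbers are results from the paper.

```python
import math

def fit_power_law(sizes, errors):
    """Fit errors ~ a * sizes**(-b) by ordinary least squares in log-log space."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)

# Synthetic, illustrative points generated from a = 50, b = 0.3.
sizes = [1e6, 1e7, 1e8, 1e9]
errors = [50.0 * n ** -0.3 for n in sizes]
a, b = fit_power_law(sizes, errors)
```

With real measurements the fit would be noisy, and extrapolating beyond the largest measured scale should be treated with caution.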
This work marks a significant leap forward in embodied AI, illustrating that large-scale data and deep sequence modeling can unlock new levels of generalization and control fidelity. It paves the way for future research into scalable, adaptive, and multi-modal robotic systems, with broad implications for automation, entertainment, and human-robot interaction. Despite current computational demands and dataset limitations, the insights gained set a clear direction for ongoing innovation in high-fidelity, autonomous motion control.
Deep Dive
Abstract
We introduce Humanoid-GPT, a GPT-style Transformer with causal attention trained on a billion-scale motion corpus for whole-body control. Unlike prior shallow MLP trackers constrained by scarce data and an agility-generalization trade-off, Humanoid-GPT is pre-trained on a 2B-frame retargeted corpus that unifies all major mocap datasets with large-scale in-house recordings. Scaling both data and model capacity yields a single generative Transformer that tracks highly dynamic behaviors while achieving unprecedented zero-shot generalization to unseen motions and control tasks. Extensive experiments and scaling analyses show that our model establishes a new performance frontier, demonstrating robust zero-shot generalization to unseen tasks while simultaneously tracking highly dynamic and complex motions.
References (20)
AMASS: Archive of Motion Capture As Surface Shapes
Naureen Mahmood, N. Ghorbani, N. Troje et al.
PHUMA: Physically-Grounded Humanoid Locomotion Dataset
Kyungmin Lee, Sibeen Kim, Minho Park et al.
Go to Zero: Towards Zero-Shot Motion Generation with Million-Scale Data
Ke Fan, Shunlin Lu, Minyue Dai et al.
TWIST: Teleoperated Whole-Body Imitation System
Yanjie Ze, Zixuan Chen, J. P. Araújo et al.
Robust motion in-betweening
Félix G. Harvey, Mike Yurick, D. Nowrouzezahrai et al.
Motion-X++: A Large-Scale Multimodal 3D Whole-body Human Motion Dataset
Yuhong Zhang, Jing-de Lin, Ailing Zeng et al.
A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning
S. Ross, Geoffrey J. Gordon, J. Bagnell
AnyTeleop: A General Vision-Based Dexterous Robot Arm-Hand Teleoperation System
Yuzhe Qin, Wei Yang, Binghao Huang et al.
OmniH2O: Universal and Dexterous Human-to-Humanoid Whole-Body Teleoperation and Learning
Tairan He, Zhengyi Luo, Xialin He et al.
Language Models are Few-Shot Learners
Tom B. Brown, Benjamin Mann, Nick Ryder et al.
Motion-X: A Large-scale 3D Expressive Whole-body Human Motion Dataset
Jing-de Lin, Ailing Zeng, Shunlin Lu et al.
From Experts to a Generalist: Toward General Whole-Body Control for Humanoid Robots
Yuxuan Wang, Ming Yang, Weishuai Zeng et al.
Training language models to follow instructions with human feedback
Long Ouyang, Jeff Wu, Xu Jiang et al.
Object Motion Guided Human Motion Synthesis
Jiaman Li, Jiajun Wu, C. K. Liu
GMT: General Motion Tracking for Humanoid Whole-Body Control
Zixuan Chen, Mazeyu Ji, Xuxin Cheng et al.
Emergent Abilities of Large Language Models
Jason Wei, Yi Tay, Rishi Bommasani et al.
Track Any Motions under Any Disturbances
Zhikai Zhang, Jun Guo, Chao Chen et al.
ExBody2: Advanced Expressive Humanoid Whole-Body Control
Mazeyu Ji, Xuanbin Peng, Fangchen Liu et al.
Expressive Whole-Body Control for Humanoid Robots
Xuxin Cheng, Yandong Ji, Junming Chen et al.