Make Tracking Easy: Neural Motion Retargeting for Humanoid Whole-body Control

Key Findings

Methodology

The paper introduces a Neural Motion Retargeting (NMR) framework that transforms static geometric mapping into a dynamics-aware learned process, addressing the non-convexity issues of traditional optimization methods. The NMR framework includes a hierarchical data pipeline called Clustered-Expert Physics Refinement (CEPR), which uses a Variational Autoencoder (VAE) for motion clustering to group heterogeneous movements into latent motifs. This strategy significantly reduces computational overhead for massively parallel reinforcement learning experts, which project and repair noisy human demonstrations onto the robot's feasible motion manifold. The resulting high-fidelity data supervises a non-autoregressive CNN-Transformer architecture.

Key Results

Experiments on the Unitree G1 humanoid across diverse dynamic tasks (e.g., martial arts, dancing) show that NMR eliminates joint jumps and significantly reduces self-collisions compared to state-of-the-art baselines: self-collisions drop by 54%, and joint limit violations fall to 16.80%.
NMR-generated references accelerate the convergence of downstream whole-body control policies, establishing a scalable path for bridging the human-robot embodiment gap.
With 30,000 physics-validated motion pairs, NMR suppresses physically infeasible components in upstream SMPL-X noise.

Significance

The NMR framework reframes motion retargeting as a learned mapping, which avoids the non-convexity and local optima of traditional optimization-based methods. For research, this opens a data-driven direction for retargeting; for industry, it offers a more efficient way to produce reference motions for dynamic tasks. By eliminating joint jumps and reducing self-collisions, NMR may help humanoid robots move toward real-world use, especially in fields that need high-precision motion control, such as medical assistance and complex manufacturing.

Technical Contribution

The technical contribution of the NMR framework lies in transforming the motion retargeting problem from static optimization to dynamic distribution mapping, utilizing the CEPR pipeline and a non-autoregressive CNN-Transformer architecture to address the limitations of traditional methods. By incorporating physics simulation and reinforcement learning, NMR generates high-fidelity, physically consistent motion data, enhancing the model's generalization ability and physical feasibility. This approach offers new engineering possibilities for humanoid robots in complex dynamic tasks.

Novelty

NMR is the first framework to redefine the motion retargeting problem as dynamic distribution mapping, breaking through the non-convexity limitations of traditional optimization methods. Compared to existing optimization-based methods, NMR generates high-fidelity, physically consistent motion data through physics simulation and reinforcement learning, significantly improving the model's generalization ability and physical feasibility.

Limitations

NMR may face challenges in handling extreme dynamic tasks, as these tasks may exceed the capabilities of current physics simulation and reinforcement learning strategies.
The requirement for large-scale computational resources and training time may limit NMR's deployment in practical applications.
Although NMR performs well in various tasks, it may still have limitations in handling very complex or irregular motion sequences.

Future Work

Future research directions include further optimizing the NMR framework to handle more complex dynamic tasks, exploring more efficient training strategies to reduce computational resource requirements, and applying NMR to more diverse robotic platforms. Additionally, validating NMR's performance on larger-scale real-world datasets is an important direction.

AI Executive Summary

Humanoid robots are at a critical stage in their transition from laboratory settings to complex human environments, and the acquisition of diverse motor skills is fundamental to this progression. However, bridging the kinematic and dynamic embodiment gap from human data remains a major bottleneck. Traditional optimization-based retargeting methods are inherently non-convex and prone to local optima, leading to physical artifacts like joint jumps and self-penetration.

To address these issues, the paper proposes a Neural Motion Retargeting (NMR) framework that transforms static geometric mapping into a dynamics-aware learned process. NMR utilizes a hierarchical data pipeline called Clustered-Expert Physics Refinement (CEPR), leveraging a Variational Autoencoder (VAE) for motion clustering to group heterogeneous movements into latent motifs. This strategy significantly reduces computational overhead for massively parallel reinforcement learning experts, which project and repair noisy human demonstrations onto the robot's feasible motion manifold.

Experimental results show that NMR eliminates joint jumps and significantly reduces self-collisions across diverse dynamic tasks (e.g., martial arts, dancing) on the Unitree G1 humanoid. Compared to state-of-the-art baselines, NMR achieves a 54% reduction in self-collisions and reduces joint limit violations to 16.80%. Furthermore, NMR-generated references accelerate the convergence of downstream whole-body control policies, establishing a scalable path for bridging the human-robot embodiment gap.

The technical contribution of the NMR framework lies in transforming the motion retargeting problem from static optimization to dynamic distribution mapping, utilizing the CEPR pipeline and a non-autoregressive CNN-Transformer architecture to address the limitations of traditional methods. By incorporating physics simulation and reinforcement learning, NMR generates high-fidelity, physically consistent motion data, enhancing the model's generalization ability and physical feasibility. This approach offers new engineering possibilities for humanoid robots in complex dynamic tasks.

However, NMR may face challenges in handling extreme dynamic tasks, as these tasks may exceed the capabilities of current physics simulation and reinforcement learning strategies. Future research directions include further optimizing the NMR framework to handle more complex dynamic tasks, exploring more efficient training strategies to reduce computational resource requirements, and applying NMR to more diverse robotic platforms.

Deep Analysis

Background

Humanoid robots play an increasingly important role in modern technology, especially in applications requiring complex motor skills, such as medical assistance, entertainment, and manufacturing. Traditionally, researchers have relied on large-scale human motion data, such as video recordings or motion-capture databases, to train robot motor control policies through imitation learning or reinforcement learning. In this process, motion retargeting serves as a critical bridge between human demonstrations and robotic execution. Conventional retargeting methods, including inverse kinematics (IK)-based approaches and differential optimization schemes like GMR, primarily seek optimal joint configurations at the geometric level. However, this conventional decoupled architecture of 'retargeting first, tracking later' suffers from two major bottlenecks: mathematical non-convexity and lack of physical feasibility.

Core Problem

The core problem of motion retargeting is how to accurately transfer human motion data to humanoid robots while accounting for their distinct kinematic structures and physical constraints. Traditional optimization-based methods are inherently non-convex and prone to local optima, leading to physical artifacts like joint jumps and self-penetration. These methods are highly sensitive to initialization and require tedious parameter tuning. Geometric optimization methods lack awareness of physical plausibility and therefore merely propagate these errors mechanically, leading to a classic 'garbage in, garbage out' dilemma.

Innovation

The core innovation of this paper is the introduction of a Neural Motion Retargeting (NMR) framework that transforms the motion retargeting problem from static optimization to dynamic distribution mapping. NMR utilizes a hierarchical data pipeline called Clustered-Expert Physics Refinement (CEPR), leveraging a Variational Autoencoder (VAE) for motion clustering to group heterogeneous movements into latent motifs. This strategy significantly reduces computational overhead for massively parallel reinforcement learning experts, which project and repair noisy human demonstrations onto the robot's feasible motion manifold. By incorporating physics simulation and reinforcement learning, NMR generates high-fidelity, physically consistent motion data, enhancing the model's generalization ability and physical feasibility.

Methodology

The NMR framework addresses humanoid motion retargeting via dynamic mapping, significantly reducing joint jumps and self-collisions.

It uses a Variational Autoencoder (VAE) for motion clustering to group heterogeneous movements into latent motifs.

It leverages massively parallel reinforcement learning experts to project and repair noisy human demonstrations onto the robot's feasible motion manifold.

It employs a non-autoregressive CNN-Transformer architecture for global temporal context reasoning, suppressing reconstruction noise and bypassing geometric traps.
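The clustering step above can be illustrated with a minimal sketch. It assumes each motion clip has already been encoded to a latent vector, and it swaps in plain k-means as a stand-in for whatever clustering the authors apply to the VAE latents. The names `LATENTS`, `kmeans`, and `group_by_expert` are hypothetical, and the latent values are made up. The point is only that one expert per cluster means far fewer experts than one per clip.

```python
import math
import random

# Hypothetical 2-D latent codes, one per motion clip. A trained VAE encoder
# would produce these; the values here are invented for illustration.
LATENTS = [
    (0.1, 0.0), (0.0, 0.2), (-0.1, 0.1),   # e.g. slow, upright motions
    (5.0, 5.1), (5.2, 4.9), (4.9, 5.0),    # e.g. fast, dynamic motions
]

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means over latent vectors; returns one cluster index per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each latent code to its nearest center.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Move each center to the mean of its members (keep it if empty).
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels

def group_by_expert(labels):
    """Map each cluster index to the clip indices one RL expert would handle."""
    groups = {}
    for clip_id, label in enumerate(labels):
        groups.setdefault(label, []).append(clip_id)
    return groups

labels = kmeans(LATENTS, k=2)
experts = group_by_expert(labels)
print(f"{len(LATENTS)} clips -> {len(experts)} experts")
```

In this toy setting, six clips are served by two experts; the same grouping idea is what lets the refinement stage scale to a large motion corpus.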

Experiments

The experimental design includes testing on the Unitree G1 humanoid across diverse dynamic tasks, such as martial arts and dancing. Baseline methods include GMR and PHUMA. Evaluation metrics include joint jumps, self-collisions, and joint limit violations. With 30,000 physics-validated motion pairs, NMR suppresses physically infeasible components in upstream SMPL-X noise. Experiments also include testing the convergence of downstream whole-body control policies using NMR-generated references.
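As a rough illustration of how a joint limit violation metric might be computed, the sketch below counts the fraction of frames in which any joint leaves its allowed range. This frame-level definition is an assumption; the summary does not state how the paper defines the metric. The function name, the joint limits, and the trajectory values are all hypothetical.

```python
def limit_violation_rate(trajectory, lower, upper):
    """Fraction of frames in which at least one joint leaves [lower, upper].

    trajectory: list of frames, each a list of joint angles in radians.
    lower, upper: per-joint limits, same length as one frame.
    """
    violating = sum(
        1
        for frame in trajectory
        if any(q < lo or q > hi for q, lo, hi in zip(frame, lower, upper))
    )
    return violating / len(trajectory)

# Hypothetical two-joint robot with made-up limits and a four-frame clip.
LOWER = [-1.0, -0.5]
UPPER = [1.0, 0.5]
CLIP = [
    [0.0, 0.0],
    [0.5, 0.2],
    [1.2, 0.2],   # first joint exceeds its upper limit
    [0.9, -0.4],
]

rate = limit_violation_rate(CLIP, LOWER, UPPER)
print(f"violation rate: {rate:.2%}")
```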

Results

Experimental results show that NMR eliminates joint jumps and significantly reduces self-collisions across diverse dynamic tasks on the Unitree G1 humanoid. Compared to state-of-the-art baselines, NMR achieves a 54% reduction in self-collisions and reduces joint limit violations to 16.80%. Furthermore, NMR-generated references accelerate the convergence of downstream whole-body control policies, establishing a scalable path for bridging the human-robot embodiment gap.

Applications

The NMR framework could be applied in fields requiring high-precision motion control, such as medical assistance and complex manufacturing. Because it eliminates joint jumps and reduces self-collisions, its reference motions are better suited to execution on real hardware. Additionally, the high-fidelity motion data generated by NMR can be used to train more complex robot control policies, improving robot performance in dynamic tasks.

Limitations & Outlook

Despite NMR's excellent performance in various tasks, it may still have limitations in handling very complex or irregular motion sequences. Additionally, the requirement for large-scale computational resources and training time may limit NMR's deployment in practical applications. Future research directions include further optimizing the NMR framework to handle more complex dynamic tasks, exploring more efficient training strategies to reduce computational resource requirements, and applying NMR to more diverse robotic platforms.

Plain Language: Accessible to non-experts

Imagine you're in a kitchen cooking a meal. Traditional motion retargeting methods are like following a recipe step by step, but sometimes the ingredients aren't fresh enough, or the steps aren't detailed enough, leading to a dish that doesn't taste quite right. The NMR framework is like a smart chef who can not only follow the recipe but also adjust based on the actual ingredients to ensure each dish reaches its best flavor. NMR dynamically maps human motion data into actions that robots can execute, just like a chef adjusts cooking methods based on the characteristics of the ingredients. This way, robots can perform various tasks in complex environments without experiencing joint jumps or self-collisions. The core of the NMR framework is its ability to learn and adjust intelligently, like an experienced chef who can make delicious dishes in any situation.

ELI14: Explained like you're 14

Hey there! Imagine you're playing a super cool robot game. You need to make the robot mimic human actions, like dancing or boxing. But sometimes, the robot does weird things, like suddenly jumping or hitting itself. Traditional methods are like giving the robot a fixed list of actions, but these lists aren't always perfect. NMR is like a super smart game assistant that helps you adjust the robot's actions to make them look more natural and smooth. NMR learns human actions and then uses a special method to turn these actions into something the robot can do. This way, the robot won't do weird things anymore and can perfectly complete each task. Isn't that cool?

Glossary

Motion Retargeting

The process of transferring human motion data to robots while accounting for their distinct kinematic structures and physical constraints.

In this paper, motion retargeting is the critical bridge between human demonstrations and robotic execution.

Non-convexity

A characteristic of mathematical optimization problems, indicating that the problem may have multiple local optima rather than a single global optimum.

Traditional optimization-based retargeting methods are inherently non-convex and prone to local optima.
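A toy example can make the sensitivity to initialization concrete. The sketch below runs gradient descent on a simple one-dimensional non-convex function, chosen purely for illustration and unrelated to the paper's actual retargeting objective. Starting from two different points, the optimizer settles in two different minima, and only one of them is the global minimum.

```python
def f(x):
    """A non-convex test function with two local minima."""
    return x**4 - 3 * x**2 + x

def grad_f(x):
    """Derivative of f."""
    return 4 * x**3 - 6 * x + 1

def gradient_descent(x0, lr=0.01, steps=2000):
    """Fixed-step gradient descent starting from x0."""
    x = x0
    for _ in range(steps):
        x -= lr * grad_f(x)
    return x

left = gradient_descent(-2.0)   # converges near x = -1.30 (global minimum)
right = gradient_descent(2.0)   # converges near x = 1.13 (local minimum only)

print(f"start -2.0 -> x = {left:.4f}, f(x) = {f(left):.4f}")
print(f"start +2.0 -> x = {right:.4f}, f(x) = {f(right):.4f}")
```

Both runs end at a point where the gradient vanishes, so the optimizer reports success in each case, yet the run started at 2.0 ends with a clearly worse objective value.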

Variational Autoencoder (VAE)

A generative model used to learn latent representations of data, commonly used for dimensionality reduction and generation tasks.

In this paper, VAE is used for motion clustering to group heterogeneous movements into latent motifs.
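For readers unfamiliar with VAEs, the two ingredients that distinguish them from plain autoencoders are the reparameterized sampling step and the KL regularizer that pulls latent codes toward a standard normal prior. The sketch below shows both for a diagonal Gaussian posterior. It is a generic textbook formulation, not the paper's encoder; the function names are hypothetical.

```python
import math
import random

def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), per latent dimension."""
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mu, logvar)
    ]

def kl_to_standard_normal(mu, logvar):
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) )."""
    return 0.5 * sum(
        m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, logvar)
    )

rng = random.Random(0)
mu, logvar = [1.0, 0.0], [0.0, 0.0]
z = reparameterize(mu, logvar, rng)
print("sampled latent:", z)
print("KL penalty:", kl_to_standard_normal(mu, logvar))
```

Because the KL term keeps codes of similar inputs close together in a smooth latent space, distances between latent vectors become a reasonable basis for grouping motions.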

Reinforcement Learning (RL)

A machine learning method where an agent learns a policy through interactions with the environment to maximize cumulative rewards.

In this paper, RL is used to train expert policies to generate high-fidelity, physically consistent motion data.
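The "cumulative rewards" in the definition above are usually discounted, so that near-term rewards count more than distant ones. A minimal sketch of that computation follows; the discount factor and the reward values are illustrative assumptions, not figures from the paper.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of rewards weighted by gamma**t, computed backward for stability."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Three steps of reward 1.0 with a discount of 0.5: 1 + 0.5 + 0.25.
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
```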

CNN-Transformer Architecture

An architecture that combines convolutional neural network (CNN) layers with Transformer layers to process sequential data.

In this paper, CNN-Transformer is used for global temporal context reasoning, suppressing reconstruction noise.
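To show what "global temporal context" means in practice, the sketch below implements single-head self-attention over a short sequence of frames. It is deliberately stripped down: the query, key, and value projections are the identity, there is no convolutional front end, and nothing here reflects the paper's actual layer sizes. What it does illustrate is the non-autoregressive property: every output frame is computed in one pass as a weighted average over all input frames, past and future.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(frames):
    """Single-head self-attention with identity projections (Q = K = V)."""
    d = len(frames[0])
    outputs = []
    for q in frames:
        # Score this frame against every frame in the sequence.
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in frames
        ]
        weights = softmax(scores)
        # Output is a weighted average over the whole sequence.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, frames)) for j in range(d)]
        )
    return outputs

# Hypothetical sequence of three 2-D frame features.
FRAMES = [[0.0, 1.0], [0.1, 0.9], [0.2, 1.1]]
for row in self_attention(FRAMES):
    print([round(x, 3) for x in row])
```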

Joint Jump

An artifact in robot motion characterized by sudden changes in joint positions.

NMR eliminates joint jumps through dynamic mapping.
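One simple way to detect joint jumps in a reference trajectory is to flag frames where the finite-difference joint velocity exceeds a threshold. The sketch below does that; the threshold, the frame interval, and the sample trajectory are assumptions for illustration, and the paper may define the artifact differently.

```python
def find_joint_jumps(trajectory, dt, max_velocity):
    """Return (frame_index, joint_index) pairs whose speed exceeds max_velocity.

    trajectory: list of frames, each a list of joint angles in radians.
    dt: time between consecutive frames in seconds.
    max_velocity: allowed joint speed in rad/s.
    """
    jumps = []
    for t in range(1, len(trajectory)):
        pairs = zip(trajectory[t - 1], trajectory[t])
        for j, (prev, cur) in enumerate(pairs):
            if abs(cur - prev) / dt > max_velocity:
                jumps.append((t, j))
    return jumps

# Hypothetical two-joint clip sampled at 50 Hz with a discontinuity at frame 2.
CLIP = [
    [0.00, 0.0],
    [0.01, 0.0],
    [1.50, 0.0],   # joint 0 jumps by about 1.49 rad in a single frame
    [1.51, 0.0],
]

print(find_joint_jumps(CLIP, dt=0.02, max_velocity=10.0))
```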

Self-collision

A phenomenon in robot motion where parts of the robot collide with each other.

NMR significantly reduces self-collisions.
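A self-collision check can be sketched with sphere proxies: each body part is approximated by a sphere, and two parts collide when the distance between their centers is smaller than the sum of their radii. This is a simplification chosen for brevity. Real pipelines typically rely on the simulator's mesh or capsule collision checks, and the body names, positions, and radii below are invented.

```python
import math
from itertools import combinations

def colliding_pairs(spheres, ignore=frozenset()):
    """Return sorted name pairs whose sphere proxies overlap.

    spheres: {name: (center_xyz, radius)}.
    ignore: set of (name_a, name_b) pairs, alphabetically ordered, to skip
            (for example, links that are directly connected by a joint).
    """
    hits = []
    for a, b in combinations(sorted(spheres), 2):
        if (a, b) in ignore:
            continue
        (center_a, radius_a), (center_b, radius_b) = spheres[a], spheres[b]
        if math.dist(center_a, center_b) < radius_a + radius_b:
            hits.append((a, b))
    return hits

# Hypothetical pose in which the two hands overlap.
POSE = {
    "head": ((0.0, 0.0, 1.6), 0.10),
    "left_hand": ((0.0, 0.10, 1.0), 0.05),
    "right_hand": ((0.0, 0.15, 1.0), 0.05),
}

print(colliding_pairs(POSE))
```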

Physics Simulation

The use of computer simulations to model the motion and forces in the physical world to verify and optimize robot motion.

In this paper, physics simulation is used to validate the physical consistency of motion data.

SMPL Model

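A parametric model of human body shape and pose, commonly used in computer vision and graphics. SMPL-X extends it with articulated hands and an expressive face.

In this paper, the upstream human motion data is expressed in SMPL-X, and NMR suppresses the physically infeasible components of its estimation noise.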

Dynamic Distribution Mapping

A method of mapping data from one distribution to another, typically used for handling complex sequential data.

NMR addresses the non-convexity issues of traditional methods through dynamic distribution mapping.

Open Questions: Unanswered questions from this research

1. Although NMR performs well in various tasks, it may still have limitations in handling very complex or irregular motion sequences. These tasks may exceed the capabilities of current physics simulation and reinforcement learning strategies, requiring further research to optimize the NMR framework.
2. The computational resource requirements of NMR are high, which may limit its deployment in practical applications. Future research could explore more efficient training strategies to reduce computational resource requirements.
3. While the high-fidelity motion data generated by NMR can be used to train more complex robot control strategies, validating NMR's performance on larger-scale real-world datasets remains an open question.
4. The physics simulation and reinforcement learning strategies of the NMR framework may face challenges in handling extreme dynamic tasks. Future research could explore more advanced physics simulation techniques and reinforcement learning algorithms to improve NMR's performance.
5. NMR has scalable potential in bridging the human-robot embodiment gap, but further research is needed to apply it to more diverse robotic platforms.

Applications

Immediate Applications

Medical Assistance

The NMR framework can be used to develop more precise medical robots, assisting doctors in complex surgical operations, improving the success rate and safety of surgeries.

Complex Manufacturing

In manufacturing, NMR can be used to develop smarter robots to perform complex assembly tasks, improving production efficiency and product quality.

Entertainment Industry

NMR can be used to develop more realistic robot actors for films and stage performances, providing richer entertainment experiences.

Long-term Vision

Smart Home

In the future, NMR can be used to develop smart home robots that help users with daily chores, improving quality of life.

Human-Robot Collaboration

NMR can be used to develop smarter human-robot collaboration systems, assisting humans in completing complex tasks, improving work efficiency and safety.

Abstract

Humanoid robots require diverse motor skills to integrate into complex environments, but bridging the kinematic and dynamic embodiment gap from human data remains a major bottleneck. We demonstrate through Hessian analysis that traditional optimization-based retargeting is inherently non-convex and prone to local optima, leading to physical artifacts like joint jumps and self-penetration. To address this, we reformulate the retargeting problem as learning a data distribution rather than searching for optimal solutions, and we propose NMR, a Neural Motion Retargeting framework that transforms static geometric mapping into a dynamics-aware learned process. We first propose Clustered-Expert Physics Refinement (CEPR), a hierarchical data pipeline that leverages VAE-based motion clustering to group heterogeneous movements into latent motifs. This strategy significantly reduces the computational overhead of massively parallel reinforcement learning experts, which project and repair noisy human demonstrations onto the robot's feasible motion manifold. The resulting high-fidelity data supervises a non-autoregressive CNN-Transformer architecture that reasons over global temporal context to suppress reconstruction noise and bypass geometric traps. Experiments on the Unitree G1 humanoid across diverse dynamic tasks (e.g., martial arts, dancing) show that NMR eliminates joint jumps and significantly reduces self-collisions compared to state-of-the-art baselines. Furthermore, NMR-generated references accelerate the convergence of downstream whole-body control policies, establishing a scalable path for bridging the human-robot embodiment gap.

cs.RO

