Do as I Do: Dexterous Manipulation Data from Everyday Human Videos
Proposes the DO AS I DO framework, which reconstructs human hand-object interactions from monocular RGB videos and retargets them to dexterous robot hands, and reports outperforming prior state-of-the-art methods.
Key Findings
Methodology
The proposed DO AS I DO framework consists of two main stages: first, it employs SAM 3D and HaWoR-based models to reconstruct and track 3D hand and object interactions from diverse in-the-wild monocular RGB videos. This involves segmentation, depth estimation, and mesh generation, enabling robust 4D hand-object pose estimation despite occlusions and low resolution. Next, it utilizes sampling-based motion planning algorithms such as Model Predictive Path Integral (MPPI) control, combined with a novel retargeting pipeline that incorporates warm-up steps, random force perturbations, and transition rewards. These components ensure physically plausible, stable, and natural-looking trajectories when transferring the human demonstrations onto robotic hands. The entire pipeline operates without reliance on depth sensors or specialized hardware, making it scalable to internet-scale video data.
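To make the sampling-based planning step concrete, here is a minimal sketch of a single generic MPPI update in Python with NumPy. It is not the authors' implementation: the function names (`mppi_step`, `rollout_cost`), the toy single-integrator dynamics, the quadratic cost, and all parameter values are assumptions chosen for illustration, and the sketch does not model the warm-up steps, random force perturbations, or transition rewards described above.

```python
import numpy as np

def rollout_cost(x0, controls, dynamics, cost):
    """Roll a control sequence forward from x0 and sum the per-step cost."""
    x = np.array(x0, dtype=float)
    total = 0.0
    for u in controls:
        x = dynamics(x, u)
        total += cost(x, u)
    return total

def mppi_step(x0, u_nominal, dynamics, cost,
              n_samples=256, sigma=0.5, lam=1.0, rng=None):
    """One MPPI update: perturb the nominal controls with Gaussian noise,
    score each perturbed sequence, and average the noise with softmin weights."""
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.normal(0.0, sigma, size=(n_samples,) + u_nominal.shape)
    costs = np.array([rollout_cost(x0, u_nominal + eps, dynamics, cost)
                      for eps in noise])
    # Subtracting the minimum cost keeps the exponentials numerically stable.
    weights = np.exp(-(costs - costs.min()) / lam)
    weights /= weights.sum()
    # Cost-weighted average of the sampled noise, added to the nominal plan.
    return u_nominal + np.tensordot(weights, noise, axes=1)

# Hypothetical toy problem: drive a 1-D point from 0 toward a target at 1.
def toy_dynamics(x, u):
    return x + 0.1 * u

def toy_cost(x, u):
    return float(np.sum((x - 1.0) ** 2) + 1e-3 * np.sum(u ** 2))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x0 = np.zeros(1)
    u_before = np.zeros((10, 1))  # horizon of 10 steps, 1 control dimension
    u_after = mppi_step(x0, u_before, toy_dynamics, toy_cost, rng=rng)
    print("cost before:", rollout_cost(x0, u_before, toy_dynamics, toy_cost))
    print("cost after: ", rollout_cost(x0, u_after, toy_dynamics, toy_cost))
```

In a retargeting setting, the state would presumably be the simulated robot hand and object, and the cost would penalize deviation from the reconstructed human trajectory; both are replaced here by a toy problem so that the update rule itself stays visible.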
Key Results
- On DexYCB and HOI4D datasets, the method achieved Chamfer distances of 6.66 and 0.49 respectively, surpassing previous state-of-the-art methods. Human preference evaluations on 150 in-the-wild videos favored the proposed approach 67% of the time, indicating superior temporal consistency and realism.
- In robotic experiments, the success rate of transferring human demonstrations to a 22-DoF dexterous hand increased from 25% to 71%. The average positional error was reduced to 0.05 meters, and rotational error to 0.28 radians, demonstrating effective real-world applicability.
- Ablation studies confirmed that warm-up steps and perturbation strategies significantly improved stability and naturalness of the generated trajectories, especially under noisy or incomplete reference data.
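For readers unfamiliar with the reconstruction metric quoted above, the following is a minimal sketch of a symmetric Chamfer distance between two point sets. Conventions differ across papers (squared versus unsquared distances, summing versus averaging the two directions, and the length unit), and this summary does not state which one the paper uses, so the variant below is an assumption and its values are not directly comparable to the reported numbers.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3).

    Assumed convention: mean squared nearest-neighbor distance from a to b,
    plus the same quantity from b to a.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Pairwise squared distances, shape (N, M).
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())

if __name__ == "__main__":
    pts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
    shifted = pts + np.array([0.0, 0.0, 1.0])  # same shape, moved 1 unit in z
    print(chamfer_distance(pts, pts))
    print(chamfer_distance(pts, shifted))
```

The brute-force pairwise matrix is fine for small point sets; for dense meshes a spatial index such as a k-d tree would be the usual choice.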
Significance
This work addresses a longstanding challenge in robotic learning: acquiring large-scale, diverse, and realistic manipulation data without expensive hardware or manual annotation. By leveraging the vast repository of internet videos, the framework opens new avenues for scalable robot training, enabling robots to learn complex dexterous behaviors through passive observation. The integration of advanced 3D vision models with physics-based motion optimization marks a significant step toward autonomous, data-driven robot skill acquisition, with broad implications for industrial automation, service robots, and human-robot collaboration. It effectively bridges the gap between passive human demonstrations and active robot execution, reducing reliance on costly data collection setups.
Technical Contribution
The core technical contributions include: 1) a modular pipeline combining SAM 3D for object reconstruction and HaWoR for hand tracking, capable of handling diverse, noisy in-the-wild videos; 2) a novel sampling-based retargeting algorithm that incorporates warm-up, force perturbations, and transition-aware rewards, improving robustness against imperfect references; 3) a complete end-to-end system that converts internet videos into physically feasible robot trajectories, validated on real hardware. These innovations collectively enable scalable, generalizable dexterous manipulation data generation from monocular RGB videos.
Novelty
This study is the first to demonstrate end-to-end reconstruction and retargeting of human hand-object interactions directly from monocular RGB videos collected in unconstrained environments, without relying on depth sensors or MoCap data. Unlike prior works limited to controlled settings or specific object categories, the framework handles diverse objects and behaviors, leveraging generative models (SAM 3D) and sampling-based optimization. The integration of these components into a scalable pipeline for real-world robot deployment represents a significant leap forward in the field.
Limitations
- The approach assumes objects are rigid and relies on monocular depth estimation, which may fail with deformable objects or in scenes with severe depth ambiguity.
- Performance degrades under heavy occlusion, poor lighting, or low-resolution videos, limiting robustness in highly cluttered or challenging environments.
- The current system does not incorporate scene context or obstacle avoidance, restricting its use in complex, obstacle-rich scenarios. Additionally, the physics simulation approximates real-world dynamics, which may impact precise manipulation tasks.
Future Work
Future research will focus on extending the framework to handle non-rigid objects and articulated scenes, integrating multi-modal sensing such as tactile feedback and stereo vision. Improving the fidelity of physics simulation and incorporating scene constraints will enhance real-world robustness. Additionally, developing fully end-to-end learning systems that jointly optimize perception, reconstruction, and control policies from large-scale video datasets will push toward truly autonomous dexterous robots capable of learning new tasks with minimal human intervention.
AI Executive Summary
The quest for scalable, versatile robotic manipulation data has long been hindered by the high costs and complexity of traditional data collection methods. Conventional approaches rely heavily on expensive hardware setups such as motion capture systems, depth sensors, or labor-intensive manual annotation, which limit the diversity and volume of training data. Meanwhile, the proliferation of online videos—ranging from casual internet clips to professionally produced content—presents an untapped resource for learning rich manipulation behaviors.
However, leveraging these videos for robot learning is non-trivial. The core challenge lies in extracting precise 3D hand and object poses from monocular RGB footage, which is often noisy, occluded, and captured in unconstrained environments. Existing methods have made progress in controlled settings but struggle with the variability and scale of real-world internet videos. Moreover, transferring human demonstrations onto robotic platforms requires addressing differences in morphology, dynamics, and physical constraints.
This paper introduces the DO AS I DO framework, a comprehensive pipeline that bridges this gap. The approach begins with advanced computer vision models—SAM 3D and HaWoR—to reconstruct and track 3D hand-object interactions from diverse in-the-wild videos. These models perform segmentation, depth estimation, and mesh generation, enabling high-fidelity 4D representations despite occlusions and low resolution. The reconstructed trajectories are then refined and retargeted onto a dexterous robotic hand using sampling-based motion planning algorithms like MPPI. To enhance robustness, the authors incorporate warm-up phases, random force perturbations, and transition-aware rewards, ensuring the generated motions are physically plausible and natural.
Experimental results demonstrate the effectiveness of this approach. On standard benchmarks such as DexYCB and HOI4D, the method achieves state-of-the-art reconstruction accuracy, with Chamfer distances of 6.66 and 0.49, respectively. Human preference evaluations on 150 in-the-wild videos favor the proposed reconstructions 67% of the time, significantly outperforming existing methods like FPose. When transferred to real robot hardware, the success rate of executing these human-inspired tasks increases from 25% to 71%, with positional errors reduced to 0.05 meters. These results validate the potential of passive observation as a scalable data source for dexterous manipulation.
The broader impact of this work is substantial. It paves the way for robots to learn from the vast, unstructured repository of human videos, drastically reducing data collection costs and expanding the diversity of learned behaviors. This approach could accelerate progress in industrial automation, service robotics, and assistive technologies, making robots more adaptable and capable in complex, unstructured environments. Despite these advances, limitations remain, including assumptions of object rigidity, challenges under occlusion, and the need for more accurate physics modeling. Future work aims to address these issues by integrating multi-modal sensing, scene understanding, and end-to-end learning frameworks, moving closer to autonomous, human-like robotic dexterity.
Deep Analysis
Background
Robotics research has historically depended on costly hardware setups such as motion capture systems, depth sensors, and manual annotations to generate manipulation datasets. These methods, while precise, are limited in scale and diversity, constraining the development of generalizable dexterous manipulation policies. Recent advances in computer vision, especially monocular 3D reconstruction models like SAM 3D and hand tracking algorithms such as HaWoR, have opened new possibilities for passive data collection from internet videos. Prior works like H2Sim2Robot and VideoManip have demonstrated the potential of using visual data for robot training, but they often require controlled environments or specialized hardware. The challenge remains to leverage the vast, noisy, and diverse online video repositories to produce high-quality, scalable manipulation datasets that can be directly used for robot learning. This paper builds on these developments, proposing a unified pipeline that combines state-of-the-art 3D vision models with physics-based motion planning, aiming to democratize access to manipulation data and accelerate robotic dexterity research.
Core Problem
The central problem addressed is how to extract accurate, physically plausible hand-object interaction trajectories from monocular RGB videos captured in unconstrained, real-world environments, and then transfer these trajectories onto robotic platforms. Existing methods struggle with the inherent ambiguities of monocular vision, occlusions, and diverse object categories. Additionally, the gap between human demonstrations and robot embodiments—differing in morphology, kinematics, and dynamics—poses a significant challenge for direct transfer. The problem becomes even more complex when dealing with noisy, incomplete, or occluded data typical of internet videos. Overcoming these hurdles requires robust perception algorithms capable of handling diverse visual conditions, as well as retargeting techniques that preserve the intent and physical feasibility of human actions in robotic execution.
Innovation
This work introduces several key innovations. First, it leverages SAM 3D, a generative 3D model trained on diverse datasets, to reconstruct object shape and pose from monocular videos, handling occlusion and low resolution effectively. Second, it employs a modular approach combining hand tracking (HaWoR) with object reconstruction, enabling flexible handling of various object categories and behaviors. Third, it develops a sampling-based retargeting pipeline that incorporates warm-up steps, force perturbations, and transition rewards, significantly improving robustness against noisy references. These components work synergistically to produce stable, realistic robot trajectories from passive human videos, a feat not achieved by prior methods relying on controlled data or hardware-specific sensors.
Methodology
- Data collection: Gather diverse monocular RGB videos from internet sources, including egocentric, exocentric, and generated clips.
- Preprocessing: Use SAM 3D for object segmentation, shape reconstruction, and pose estimation; employ HaWoR for hand tracking; estimate depth and camera parameters via MoGe.
- Hand-object reconstruction: Combine segmented meshes and pose estimates into a coherent 4D representation, addressing occlusions with generative priors.
- Trajectory refinement: Apply flow-matching inference with guided diffusion, fixing object shape and sampling per-frame poses, guided by previous pose estimates.
- Retargeting: Use a sampling-based optimizer (MPPI) with warm-up, random force perturbations, and transition rewards to generate physically feasible trajectories.
- Simulation: Map optimized trajectories onto the robot platform, perform inverse kinematics, and execute in real-world experiments.
- Evaluation: Quantitatively assess reconstruction accuracy and transfer success rate, perform ablation studies to analyze component contributions.
Experiments
- Datasets: Evaluate on DexYCB and HOI4D for quantitative metrics; collect 150 in-the-wild videos for qualitative human preference studies; test on real robot hardware.
- Baselines: Compare with existing methods such as FPose, SPIDER, and other hand-object reconstruction algorithms.
- Metrics: Chamfer distance, success rate, positional and rotational errors, user preference scores.
- Ablation: Analyze the impact of warm-up, perturbation, and reward components.
- Hardware deployment: Execute 10 selected trajectories on a 22-DoF dexterous hand, measuring task success and motion naturalness.
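The positional and rotational error metrics listed above can be illustrated with a short sketch. It assumes that positional error is the Euclidean distance between estimated and reference object positions (in meters) and that rotational error is the geodesic angle between the two orientations (in radians); the summary does not spell out the paper's exact definitions, so both function names and both definitions are assumptions.

```python
import numpy as np

def position_error(p_est, p_ref):
    """Euclidean distance between two 3-D positions."""
    diff = np.asarray(p_est, dtype=float) - np.asarray(p_ref, dtype=float)
    return float(np.linalg.norm(diff))

def rotation_error(R_est, R_ref):
    """Geodesic angle (radians) between two 3x3 rotation matrices."""
    R_rel = np.asarray(R_est, dtype=float).T @ np.asarray(R_ref, dtype=float)
    # For a rotation by angle theta, trace(R) = 1 + 2 * cos(theta).
    cos_theta = (np.trace(R_rel) - 1.0) / 2.0
    # Clip to guard against values slightly outside [-1, 1] from round-off.
    return float(np.arccos(np.clip(cos_theta, -1.0, 1.0)))

def rot_z(theta):
    """Rotation matrix for an angle theta (radians) about the z-axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

if __name__ == "__main__":
    print(position_error([0.03, 0.04, 0.0], [0.0, 0.0, 0.0]))
    print(rotation_error(rot_z(0.28), np.eye(3)))
```

Averaging these per-frame values over a trajectory, and then over trials, would give summary numbers of the kind quoted in the results.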
Results
- Achieved state-of-the-art Chamfer distances of 6.66 (DexYCB) and 0.49 (HOI4D), outperforming prior methods.
- Human evaluators preferred the reconstructed object trajectories 67% of the time in in-the-wild videos, indicating higher realism.
- Transfer success rate to robot increased from 25% to 71%, with positional errors reduced to 0.05 meters.
- Ablation studies confirmed the importance of warm-up and perturbation modules, which improved stability and naturalness, especially under noisy references.
Applications
- Immediate applications include autonomous robot learning, teleoperation, and virtual reality training, where passive observation can generate high-quality manipulation data.
- Long-term vision involves creating scalable, end-to-end systems that learn new manipulation skills directly from internet videos, reducing dependence on costly hardware and manual labeling, thus democratizing advanced robotics capabilities across industries.
Limitations & Outlook
- The method assumes objects are rigid and relies on monocular depth estimation, limiting performance with deformable objects or scenes with severe depth ambiguity.
- Performance drops significantly under occlusion, poor lighting, or low-resolution videos, which are common in real-world scenarios.
- The current pipeline does not incorporate scene context or obstacle avoidance, restricting its use in cluttered or dynamic environments.
- Physics simulation approximates real-world dynamics, which may affect the precision of manipulation tasks, necessitating further refinement for high-accuracy applications.
Plain Language Accessible to non-experts
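Imagine you are watching someone cook in a kitchen. You have only a short video filmed on your phone: you can see them pick up the pan, stir-fry, and plate the food, but you cannot tell how much force they use, and you cannot see the details inside the pan. You want a robot to learn to cook too, but the robot does not have hands like a human's, and it has no information about the kitchen scene.
Think of it as a phone video of someone cooking. We want the robot to watch videos like this and learn to imitate the actions with its own mechanical hand. First, we use specialized AI models to "restore" the shape, position, and motion of the hands and the pan in the video as 3D models, much like rebuilding a scene in 3D software. Next, we use a smart "simulator" in which the robot tries to reproduce the actions, adjusting its posture and force until the motion looks as natural as the person's in the video. The process is like practicing a dance move in a game over and over until you dance like a professional.
The key point of this method is that it needs no expensive hardware and no manual labeling: ordinary videos alone are enough for a robot to learn complex manipulation. It is not limited to the kitchen; it could also help robots learn assembly in factories, assist with surgery in hospitals, or even help with cleaning at home. In the future, this technology may make robots smarter and more flexible, able to complete all kinds of tasks autonomously, as people do.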
ELI14 Explained like you're 14
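Imagine you are watching a friend make something by hand, such as folding paper or building with blocks. You see him hold the paper, fold it into different shapes, and then put the pieces together. The movements look natural, but you do not actually know the exact angle at which each finger bends or how much force he uses. Now suppose you want a robot to learn these crafts too, but the robot does not have fingers like a person's, and it has no camera that can directly see his movements.
Think of it as a phone video of someone doing crafts. We want the robot to watch videos like this and learn to imitate his actions. First, we use specialized AI models to "restore" the shape, position, and motion of the hands and the paper in the video as 3D models, much like rebuilding a scene in 3D software. Then we use a clever "simulator" in which the robot tries to reproduce the actions, adjusting how its fingers bend and how much force they apply until the motion looks as natural as the person's in the video. It is like practicing a dance move in a game over and over until you dance like a professional.
What makes this method impressive is that it needs no expensive sensors and no manual labeling: ordinary videos alone are enough for a robot to learn complex manipulation. It is not limited to the kitchen; it could also help robots assemble things in factories or assist with surgery in hospitals. In the future, this technology may make robots smarter and more flexible, able to complete all kinds of tasks autonomously, as people do. Pretty cool, right?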
Abstract
How can we scalably generate data for robotic manipulation, especially on human-like platforms such as dexterous multi-fingered hands? Learning from human videos has recently emerged as a likely answer to this question. However, difficulties in estimating hand-object interaction and crossing the human-to-robot embodiment gap have hindered the adoption of abundant monocular RGB-only human videos as the primary source of robot manipulation data. In this work, we present DO AS I DO, an algorithm to reconstruct and retarget monocular RGB human videos to multi-fingered dexterous robotic hands. DO AS I DO reconstructs hand-object interactions from various egocentric and exocentric in-the-wild video sources. The algorithm then retargets these hand-object interaction estimates into a sequence of actions executable in the real world, yielding robot-complete manipulation data from disparate human videos. Overall, DO AS I DO outperforms previous state of the art in estimating hand-object interactions and extracting dexterous manipulation trajectories from RGB videos, as we show in experiments on datasets with ground truths and on a dataset of video clips collected online. Our experiments enable us to propose an efficacy playbook for practitioners collecting human data for manipulation.