iMaC: Translating Actions into Motion and Contact Images for Embodied World Models
iMaC translates future robot actions into image controls, significantly improving spatial accuracy in video prediction and policy evaluation.
Key Findings
Methodology
This paper introduces iMaC (Image as Action Control), a framework that converts future robot actions into dense image controls (motion images and contact images) rendered from URDF models and point clouds. Motion images are generated by applying forward kinematics to the robot’s URDF model and rendering the resulting future robot configurations from multiple camera views, so they directly encode the robot’s spatial pose. Contact images are constructed from multi-view point clouds and capture the spatial relationship between the robot and its environment through two streams: scene-to-gripper and robot-to-scene distances. These image controls are integrated into a DiT-based (Diffusion Transformer) image-to-video architecture, where they are added to the latent tokens, so that the model generates future videos conditioned explicitly on spatial geometry. Training uses a progressive rollout strategy in which generated chunks serve as references for subsequent predictions, reducing exposure bias. The model also predicts depth to strengthen its geometric understanding and support accurate long-horizon manipulation predictions.
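The idea of adding image controls to latent tokens can be illustrated with a minimal NumPy sketch. The summary states only that the controls are added to the latent tokens; the patch size, the linear projection, and the names `patchify` and `inject_controls` are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into (num_patches, patch * patch * C) rows."""
    h, w, c = image.shape
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    grid = grid.transpose(0, 2, 1, 3, 4)  # group the two patch axes together
    return grid.reshape(-1, patch * patch * c)

def inject_controls(latent_tokens, control_images, projections, patch):
    """Add linearly projected control-image patches to the latent tokens."""
    conditioned = latent_tokens.copy()
    for image, weight in zip(control_images, projections):
        conditioned = conditioned + patchify(image, patch) @ weight
    return conditioned

# Hypothetical sizes: a 16x16 control image with 4x4 patches gives 16 tokens.
rng = np.random.default_rng(0)
patch, dim = 4, 16
latents = rng.normal(size=(16, dim))
motion = rng.random((16, 16, 3))   # stand-in for one motion image
contact = rng.random((16, 16, 2))  # stand-in for the two contact streams
w_motion = rng.normal(scale=0.02, size=(patch * patch * 3, dim))
w_contact = rng.normal(scale=0.02, size=(patch * patch * 2, dim))

conditioned = inject_controls(latents, [motion, contact],
                              [w_motion, w_contact], patch)
```

Because the controls are added rather than concatenated, the token count and width stay unchanged, which is one plausible reason to prefer this form of conditioning on a pretrained backbone.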
Key Results
- In eight challenging real-world manipulation tasks, iMaC achieved an average MSE of 0.028, FID of 36.96, and superior PSNR and SSIM scores compared to baseline models like Ctrl-World and ABot-PhysWorld. These metrics demonstrate the model’s ability to produce more accurate and realistic future videos with explicit spatial control.
- Correlation analysis between world model success scores and actual robot performance yielded a coefficient of 0.956, indicating that iMaC’s predictions reliably reflect real-world task success, especially in long-horizon scenarios.
- Ablation studies confirmed that removing contact or motion images significantly degrades prediction accuracy and task success rates, underscoring the importance of spatially explicit control signals for complex manipulation tasks.
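For readers unfamiliar with the frame-level metrics quoted above, here is a minimal sketch of MSE and PSNR for frames scaled to [0, 1]. The scaling and the function names are assumptions; the summary does not say how the authors normalised or averaged their scores.

```python
import numpy as np

def mse(pred, target):
    """Mean squared error between two arrays of the same shape."""
    return float(np.mean((pred - target) ** 2))

def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB; higher means closer to the target."""
    err = mse(pred, target)
    if err == 0.0:
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / err))

# Toy frames: a constant offset of 0.1 gives an MSE of about 0.01.
target = np.full((8, 8, 3), 0.1)
pred = np.zeros((8, 8, 3))
frame_mse = mse(pred, target)
frame_psnr = psnr(pred, target)
```

FID and SSIM need a feature extractor and local window statistics respectively, so they are not reproduced here.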
Significance
This work addresses fundamental limitations in robot video prediction by introducing explicit spatial control through dense image representations. Moving beyond traditional low-dimensional action vectors, iMaC provides a more intuitive and precise way to model robot-environment interactions, which is crucial for tasks requiring fine spatial accuracy such as contact-rich manipulation. The approach enhances the reliability of learned world models for policy evaluation, enabling safer and more scalable robot training and testing. Its ability to generalize across diverse scenes and tasks marks a significant step forward in embodied AI, bridging the gap between visual prediction and physical control. The methodology also opens avenues for integrating multi-modal sensory data and improving long-term autonomy in robots.
Technical Contribution
Technically, iMaC innovates by rendering future robot configurations into dense motion images via URDF and forward kinematics, which are then injected into a DiT-based video prediction model. It introduces two-stream point cloud-based contact images to encode robot-environment geometry explicitly. The model employs a progressive training strategy with chunk-wise rollouts, reducing exposure bias and improving long-horizon prediction stability. The integration of depth prediction further enhances spatial understanding, enabling more accurate geometric reasoning. These contributions collectively enable the model to produce high-fidelity, spatially consistent future videos conditioned explicitly on robot actions, surpassing prior methods that relied on latent or sparse action representations.
Novelty
This research is the first to embed dense, image-like control signals—motion images and contact images—into a video prediction framework for robotic manipulation. Unlike previous works that used low-dimensional vectors or sparse projections, iMaC’s dense image controls directly encode spatial pose and interaction geometry, providing a more explicit and interpretable conditioning mechanism. This approach significantly improves the accuracy of long-term predictions and policy evaluation, especially in contact-rich tasks. The combination of URDF-based rendering, multi-view point clouds, and the progressive training strategy constitutes a novel methodology that advances the state-of-the-art in embodied world modeling.
Limitations
- The model’s reliance on accurate depth estimation from multi-view RGB images (via DA3) introduces potential errors, especially in occluded or textureless scenes, affecting the precision of contact control.
- Dependence on robot URDF models limits generalization to robots with unknown or complex kinematics, requiring re-configuration or retraining.
- Long-horizon rollouts, while improved, still incur high computational costs, and error accumulation in highly dynamic or cluttered environments remains a challenge.
Future Work
Future directions include integrating more robust and high-resolution depth sensors to improve geometric accuracy, extending the framework to multi-robot systems for coordinated manipulation, and exploring adaptive control strategies that can handle unknown or changing environments. Additionally, developing more efficient training algorithms and model architectures will be crucial for real-time deployment. Incorporating tactile and auditory modalities could further enrich the spatial understanding, enabling more nuanced manipulation. Ultimately, these advancements aim to realize autonomous robots capable of long-term, reliable operation in unstructured, real-world settings.
AI Executive Summary
Robotics research has long grappled with the challenge of enabling autonomous agents to understand and manipulate complex environments. Traditional control paradigms rely heavily on low-dimensional action vectors, such as joint angles or end-effector poses, which, while computationally convenient, lack the spatial expressiveness needed for precise manipulation. These abstractions often hinder the robot’s ability to generalize across diverse embodiments and environments, especially when subtle physical interactions like contact and collision are involved.
Recent advances in video prediction and embodied world models have opened new avenues for simulating and evaluating robot policies without physical trials. However, existing models typically encode actions as compact vectors, which are injected into the generative process through learned conditioning modules. This indirect encoding makes it difficult for the model to accurately predict the spatial consequences of actions, particularly in contact-rich tasks where centimeter-level precision determines success or failure.
To address this, the authors propose iMaC (Image as Action Control), a framework that transforms future robot actions into dense, image-like controls. The core idea is to render future robot configurations directly as images using URDF models and forward kinematics, producing motion images that depict the robot’s future pose from multiple camera viewpoints. These images serve as spatially explicit control signals that guide the video prediction model toward more accurate and geometrically consistent future states.
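The rendering step can be sketched with a toy example: forward kinematics for a planar two-link arm, followed by a pinhole projection into one camera view. Everything here is an illustrative assumption (the planar arm, the camera matrices `K`, `R`, `t`, and the binary rasterisation); a real pipeline would load the robot's URDF and render its meshes.

```python
import numpy as np

def forward_kinematics(joint_angles, link_lengths):
    """Return (n + 1, 3) joint positions of a planar arm in the z = 0 plane."""
    points = [np.zeros(3)]
    angle = 0.0
    for theta, length in zip(joint_angles, link_lengths):
        angle += theta  # joint angles accumulate along the chain
        step = length * np.array([np.cos(angle), np.sin(angle), 0.0])
        points.append(points[-1] + step)
    return np.stack(points)

def project(points_world, K, R, t):
    """Pinhole projection of (N, 3) world points to (N, 2) pixel coordinates."""
    cam = points_world @ R.T + t
    uv = cam @ K.T
    return uv[:, :2] / uv[:, 2:3]

def render_motion_image(joint_angles, link_lengths, K, R, t, size=64, samples=50):
    """Rasterise the arm's links into a binary (size, size) image."""
    joints = forward_kinematics(joint_angles, link_lengths)
    image = np.zeros((size, size), dtype=np.float32)
    s = np.linspace(0.0, 1.0, samples)[:, None]
    for a, b in zip(joints[:-1], joints[1:]):
        pixels = np.rint(project(a + s * (b - a), K, R, t)).astype(int)
        for u, v in pixels:
            if 0 <= u < size and 0 <= v < size:
                image[v, u] = 1.0
    return image

# Hypothetical camera looking along +z, 2 units away from the arm's plane.
K = np.array([[32.0, 0.0, 32.0], [0.0, 32.0, 32.0], [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.array([0.0, 0.0, 2.0])
angles, lengths = [0.0, np.pi / 2], [1.0, 1.0]
joints = forward_kinematics(angles, lengths)
image = render_motion_image(angles, lengths, K, R, t)
```

Calling `render_motion_image` once per future action and per camera would give a stack of motion images aligned with the frames to be predicted.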
In addition to motion images, iMaC constructs contact images derived from multi-view point clouds. These images encode the spatial proximity between the robot and environment, capturing contact-relevant geometry through two streams: scene-to-gripper and robot-to-scene distances. By injecting these dense geometric cues into the generative process, the model can better predict contact interactions and scene dynamics, crucial for manipulation tasks.
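One way to read the two streams is as nearest-neighbour distance maps: each scene point stores its distance to the gripper, and each robot point stores its distance to the scene. The sketch below follows that reading. The proximity normalisation, the `max_dist` cutoff, and the precomputed pixel coordinates are all assumptions, since the summary does not give the exact construction.

```python
import numpy as np

def nearest_distance(query, reference):
    """Distance from each (N, 3) query point to its nearest (M, 3) reference point."""
    diff = query[:, None, :] - reference[None, :, :]
    return np.linalg.norm(diff, axis=-1).min(axis=1)

def splat(pixels, values, size):
    """Write per-point values into a (size, size) image at integer pixels (u, v)."""
    image = np.zeros((size, size), dtype=np.float32)
    for (u, v), value in zip(pixels, values):
        if 0 <= u < size and 0 <= v < size:
            image[v, u] = value
    return image

def contact_images(scene_pts, scene_px, gripper_pts, robot_pts, robot_px,
                   size, max_dist=0.2):
    """Return two proximity maps in [0, 1]: 1 is touching, 0 is far or empty."""
    def proximity(dist):
        return 1.0 - np.clip(dist / max_dist, 0.0, 1.0)
    scene_to_gripper = splat(
        scene_px, proximity(nearest_distance(scene_pts, gripper_pts)), size)
    robot_to_scene = splat(
        robot_px, proximity(nearest_distance(robot_pts, scene_pts)), size)
    return scene_to_gripper, robot_to_scene

# Toy point clouds with hypothetical, already-projected pixel coordinates.
scene_pts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
scene_px = np.array([[1, 1], [5, 1]])
gripper_pts = np.array([[0.0, 0.0, 0.1]])
robot_pts = np.array([[0.0, 0.0, 0.1], [0.0, 0.0, 0.5]])
robot_px = np.array([[1, 2], [1, 6]])

s2g, r2s = contact_images(scene_pts, scene_px, gripper_pts,
                          robot_pts, robot_px, size=8)
```

The brute-force distance computation is quadratic in the number of points; a k-d tree would be the usual replacement for real point clouds.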
The architecture builds on a DiT (Diffusion Transformer) image-to-video backbone, modified to incorporate the control images at the latent level. During training, a progressive rollout strategy is employed in which each generated chunk of video serves as the reference for subsequent predictions, reducing exposure bias and improving long-term stability. The model also integrates depth prediction, further enhancing spatial reasoning.
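The chunk-wise rollout loop can be sketched as follows. The predictor is a toy stand-in, and using the last generated frame as the next reference is an assumption; the summary says only that generated chunks serve as references for later predictions.

```python
import numpy as np

def rollout(predict_chunk, first_frame, controls, chunk_len):
    """Generate a long video chunk by chunk, feeding each chunk back as reference."""
    reference = first_frame
    frames = []
    for start in range(0, len(controls), chunk_len):
        chunk = predict_chunk(reference, controls[start:start + chunk_len])
        frames.extend(chunk)
        reference = chunk[-1]  # the model's own output, not a ground-truth frame
    return np.stack(frames)

def toy_predictor(reference, chunk_controls):
    """Stand-in model: each frame is the previous frame plus its control image."""
    out, frame = [], reference
    for control in chunk_controls:
        frame = frame + control
        out.append(frame)
    return out

first = np.zeros((2, 2))
controls = np.ones((6, 2, 2))  # six future control images
video = rollout(toy_predictor, first, controls, chunk_len=2)
```

Training on references produced this way exposes the model to its own errors, which is the usual argument for why such a scheme reduces exposure bias compared with always conditioning on ground-truth frames.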
Experimental results across eight real-world manipulation tasks demonstrate that iMaC outperforms baseline methods in both video prediction quality and policy evaluation accuracy. Quantitative metrics such as FID, PSNR, and SSIM show consistent improvements, while correlation analysis confirms that world model scores strongly predict actual robot performance (correlation coefficient up to 0.956). Ablation studies highlight the importance of explicit spatial controls, with the absence of contact or motion images leading to degraded performance.
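A correlation of this kind can be computed as below. The sketch assumes a Pearson coefficient, which the summary does not confirm, and the success rates are made-up placeholders rather than values from the paper.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

# Hypothetical per-policy success rates, for illustration only.
world_model_success = [0.2, 0.4, 0.6, 0.8]
real_robot_success = [0.1, 0.5, 0.5, 0.9]
r = pearson(world_model_success, real_robot_success)
```

A value near 1 means that policies ranked highly inside the world model also tend to succeed on the real robot, which is the property that matters for policy evaluation.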
This work marks a significant step forward in embodied AI, providing a scalable, interpretable, and spatially explicit approach to robot world modeling. By bridging the gap between visual prediction and physical control, iMaC paves the way for more reliable autonomous robots capable of complex manipulation in unstructured environments. Future research will focus on integrating higher-fidelity sensors, multi-modal data, and real-time deployment strategies to bring these advances closer to practical, everyday robotic applications.
Plain Language: Accessible to non-experts
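Imagine you are cooking in a kitchen. At every step you have to decide what to do next, such as "stir with a spoon" or "pour the vegetables into the pot." The traditional approach is like pointing with a finger and telling yourself "add some salt": it cannot show the concrete action directly. Now suppose you have a smart kitchen assistant that can draw the action you are about to perform as a picture, for example "a scene of stirring with a spoon." The picture shows you at a glance how to carry out the next step and lets you anticipate the result in advance.
This is what the iMaC model does: it turns the robot's future actions into an "action scene picture" so that the robot better understands what it is about to do. Instead of only telling the robot to "rotate a joint," it draws the action as an image so that every step is precise. As a result, the robot becomes more reliable and flexible in complex tasks, much as you do when navigating with a map that shows every turn and distance.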
ELI14: Explained like you're 14
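Imagine you are playing a robot game in which you can tell the robot to "go get that ball" or "put the book on the table." Earlier robots could only follow simple instructions such as "move the arm to a certain position," and they struggled to understand the spatial relationships behind an action. It is like pointing at a spot with your finger when the robot does not know which corner you mean or how to get there accurately.
Now suppose there is a new method that draws the robot's upcoming actions as a picture, for example the trajectory of the robot's arm, or even the distance between the arm and an object. It is like making a drawing that tells the robot "I want the arm to move like this," and the robot can use that drawing to understand the spatial details of the action.
This helps the robot understand what you want it to do, especially in tasks that need precise handling, such as assembling a toy or picking up small objects. It no longer just vaguely "rotates a joint"; it uses an "action picture" to guide itself so that every step is accurate. Just as you navigate with a map that shows every turn and distance, the robot can rely on the drawn action to complete tasks more reliably.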
Abstract
Embodied world models have emerged as a pivotal paradigm for visual robotic decision-making and interactive environment simulation. However, conventional embodied frameworks rely on low-dimensional structured action vectors (e.g., joint angles and end-effector poses), which suffer from limited expressive capacity, poor generalization across diverse embodiments, and unnatural dynamic modeling for complex physical interactions. To address these limitations, this paper proposes iMaC (Image as Action Control), a novel unified control paradigm that treats raw visual images as native action representations for embodied world models. Departing from traditional explicit kinematic action encoding, iMaC formulates continuous visual manipulation as image-based action tokens, which inherently encapsulate spatial motion intentions, interactive geometric constraints, and subtle physical dynamics. We construct a dual-branch embodied architecture consisting of an image-action encoder and a dynamic world predictor: the encoder compresses target-driven visual images into compact action embeddings, while the predictor learns environment transition rules conditioned on image actions to achieve high-fidelity future state prediction and closed-loop embodied control. Extensive experiments are conducted on public embodied manipulation benchmarks and real-world robotic scenarios. The results demonstrate that iMaC outperforms vector-based action control baselines in prediction accuracy, task success rate, and cross-scene generalization ability. Moreover, our image-action design eliminates the reliance on manually defined action spaces, realizing flexible and universal control for heterogeneous embodied agents. This work provides an innovative visual-action perspective for embodied world models, offering a simple yet effective paradigm for scalable robotic perception and manipulation.