Qwen-RobotWorld Technical Report: Unifying Embodied World Modeling through Language-Conditioned Video Generation

TL;DR

Proposes Qwen-RobotWorld, a language-conditioned video world model built on a double-stream MMDiT and trained on 8.6M embodied video-text pairs; it ranks first overall on EWMBench and DreamGen Bench.

cs.CV Advanced 2026-06-16

Jie Zhang Xiaoyue Chen Anzhe Chen Chenxu Lv Deqing Li Gengze Zhou Hang Yin Haoqi Yuan Haoyang Li Jiahao Li Jiazhao Zhang Jingren Zhou Kaiyuan Gao Kun Yan Lihan Jiang Ningyuan Tang Pei Lin Qihang Peng Shengming Yin Tianhe Wu Tianyi Yan Xiao Xu Yan Shu Yanran Zhang Ye Wang Yi Wang Yilei Chen Yixian Xu Yiyang Huang Yuxiang Chen Zekai Zhang Zhendong Wang Zhixing Lei Zhixuan Liang Zihao Liu Zikai Zhou Xiong-Hui Chen Chenfei Wu

robot vision multimodal learning video generation language understanding cross-domain simulation

Key Findings

Methodology

Qwen-RobotWorld uses a double-stream multimodal diffusion transformer (MMDiT) that couples a frozen Qwen2.5-VL semantic encoder with video-VAE latent features through layer-wise joint attention. It is trained on Embodied World Knowledge (EWK), a corpus of 8.6 million embodied video-text pairs covering more than 20 robot embodiments and more than 500 action categories, whose action-language mapping framework standardizes diverse action representations into natural language commands. Training follows a two-stage curriculum: pretraining on general visual data to establish broad visual priors, followed by fine-tuning on embodied data to reinforce physical realism. The model ranks first overall on EWMBench and DreamGen Bench and outperforms all open-source models on WorldModelBench, with particular strength in physics adherence and multi-view consistency.
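The joint attention step described above can be illustrated with a small sketch. This is not the paper's implementation: it assumes the common MMDiT pattern in which each stream keeps its own projection weights while attention runs over the concatenated token sequence. It uses a single head, random stand-in weights, and plain Python lists, and it omits normalization, residual connections, and timestep conditioning. The names `StreamWeights` and `joint_attention` are hypothetical.

```python
import math
import random


def rand_matrix(rows, cols, rng):
    """Small random matrix as a list of rows (stand-in for learned weights)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def matmul(a, b):
    """Multiply matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


class StreamWeights:
    """Hypothetical container: each stream owns its Q/K/V/output projections."""

    def __init__(self, dim, rng):
        self.wq = rand_matrix(dim, dim, rng)
        self.wk = rand_matrix(dim, dim, rng)
        self.wv = rand_matrix(dim, dim, rng)
        self.wo = rand_matrix(dim, dim, rng)


def joint_attention(text_tokens, video_tokens, text_w, video_w):
    """Single-head joint attention over the concatenated text+video sequence."""
    # Stream-specific projections; list "+" concatenates along the sequence axis.
    q = matmul(text_tokens, text_w.wq) + matmul(video_tokens, video_w.wq)
    k = matmul(text_tokens, text_w.wk) + matmul(video_tokens, video_w.wk)
    v = matmul(text_tokens, text_w.wv) + matmul(video_tokens, video_w.wv)

    scale = 1.0 / math.sqrt(len(q[0]))
    # Every token scores against every token, so information flows both ways.
    weights = [
        softmax([scale * sum(a * b for a, b in zip(qi, kj)) for kj in k])
        for qi in q
    ]
    mixed = matmul(weights, v)

    # Split back into the two streams and apply stream-specific output projections.
    n_text = len(text_tokens)
    text_out = matmul(mixed[:n_text], text_w.wo)
    video_out = matmul(mixed[n_text:], video_w.wo)
    return text_out, video_out, weights
```

"Layer-wise" would then mean repeating a step like this inside every transformer block, so the language condition can influence the video latents at each depth rather than only at the input.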

Key Results

Achieved a total score of 4.60 on EWMBench, with a motion fidelity score of 0.566 that exceeds the next-best open-source model by 33%, indicating more realistic motion prediction.
On WorldModelBench, the model achieved a perfect physics-adherence score across the Newtonian mechanics, mass conservation, fluid dynamics, and gravity criteria, outperforming all open-source models.
Ranked first overall on DreamGen, demonstrating excellent object compositional generalization across multiple robotic scenarios, with robust zero-shot transfer capabilities on RoboTwin-IF.
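The 33% margin quoted for motion fidelity is easiest to read as a relative improvement over a baseline score. The sketch below shows that arithmetic under this assumption. The summary does not state the baseline, so the value computed here is only what the two reported numbers would imply; it is not a figure from the paper, and it does not apply if the 33% is measured some other way.

```python
def relative_gain(score, baseline):
    """Relative improvement of `score` over `baseline` (0.33 means 33%)."""
    return (score - baseline) / baseline


def implied_baseline(score, gain):
    """Baseline that would make `score` a `gain` relative improvement."""
    return score / (1.0 + gain)


reported_score = 0.566  # motion fidelity from the results above
assumed_gain = 0.33     # ASSUMPTION: the 33% is a relative margin

baseline = implied_baseline(reported_score, assumed_gain)
print(f"implied baseline motion fidelity: {baseline:.3f}")
print(f"check: relative gain = {relative_gain(reported_score, baseline):.2%}")
```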

Significance

This work addresses longstanding challenges in creating universal, physically plausible world models capable of generalizing across diverse embodied scenarios. By integrating large-scale multimodal datasets and innovative architecture, it bridges the gap between generic video generation and physics-aware simulation. The resulting model not only advances academic understanding of multimodal embodied AI but also offers practical tools for policy training, virtual evaluation, and natural language-guided robot control, significantly accelerating progress toward autonomous, adaptable robotic systems.

Technical Contribution

The key technical contributions include:
• Designing the double-stream MMDiT architecture with layer-wise joint attention for fine-grained multimodal fusion;
• Constructing and utilizing the 8.6M embodied video-text dataset with a standardized action-language mapping;
• Developing a two-stage training curriculum that combines general priors with embodied specialization;
• Achieving multi-view synchronization to enhance spatial consistency;
• Demonstrating competitive performance across multiple benchmarks, validating the framework’s robustness and scalability.

Novelty

This study is the first to unify large-scale embodied action data with a multimodal diffusion transformer under a natural language interface, enabling cross-scenario, multi-task simulation. Unlike prior works limited to single domains or relying on robot-specific control interfaces, this approach leverages language as a universal action medium, facilitating zero-shot generalization and multi-robot collaboration. The integration of extensive embodied datasets with a sophisticated architecture marks a significant step forward in generalizable embodied AI.

Limitations

The model still struggles with complex, long-horizon reasoning in highly dynamic or multi-agent environments, indicating room for improvement in temporal modeling.
High-quality data collection and annotation are resource-intensive, posing scalability challenges for further expansion.
Generalization to unseen robot morphologies or extreme physical conditions remains limited, necessitating adaptive learning mechanisms such as reinforcement learning or online fine-tuning.

Future Work

Future research will focus on enhancing long-term temporal reasoning, integrating reinforcement learning for autonomous policy refinement, and expanding multimodal perception (e.g., tactile, auditory) to improve environment understanding. Additionally, efforts will be made to develop more efficient data collection and annotation pipelines, possibly leveraging self-supervised learning, to reduce costs. Exploring online adaptation and continual learning will further improve the model’s robustness and applicability in real-world scenarios.

AI Executive Summary

The field of embodied AI has long grappled with the challenge of creating versatile, physically consistent world models that can operate across diverse scenarios. Traditional approaches relied heavily on scenario-specific control interfaces and physics engines, limiting their scalability and generalization. Recent advances in deep learning, especially in large-scale video and language modeling, have opened new avenues, but these models often lack the physical fidelity necessary for realistic simulation of robotic behaviors.

This paper introduces Qwen-RobotWorld, a framework that unifies embodied world modeling through language-conditioned video generation. At its core is the double-stream multimodal diffusion transformer (MMDiT), which fuses semantic understanding with visual generation. The model is trained on 8.6 million video-text pairs, collectively called the Embodied World Knowledge (EWK) dataset, which spans more than 20 robot embodiments, more than 500 action categories, and a range of environments. The dataset is built with an action-language mapping framework that standardizes diverse action representations into a unified natural language interface, enabling cross-scenario and cross-task learning.
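To make the idea of an action-language mapping concrete, here is a minimal sketch of how heterogeneous control signals might be rendered as text. The summary does not describe the actual mapping rules, so everything below is assumed for illustration: the `ArmAction` and `DriveAction` record types, the axis conventions, the thresholds, and the sentence templates are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ArmAction:
    """Hypothetical end-effector command; axis conventions are assumed."""
    dx: float  # metres, positive = forward
    dy: float  # metres, positive = left
    dz: float  # metres, positive = up
    gripper: str = "hold"  # "open", "close", or "hold"


@dataclass
class DriveAction:
    """Hypothetical vehicle command."""
    steering: float  # radians, positive = left
    accel: float     # m/s^2, negative = braking


def describe_arm(action, min_move=0.01):
    """Turn an end-effector delta into a sentence, skipping sub-centimetre motion."""
    axes = (
        (action.dx, "forward", "backward"),
        (action.dy, "left", "right"),
        (action.dz, "up", "down"),
    )
    parts = [
        f"move the gripper {pos if value > 0 else neg} {abs(value) * 100:.0f} cm"
        for value, pos, neg in axes
        if abs(value) >= min_move
    ]
    if action.gripper != "hold":
        parts.append(f"{action.gripper} the gripper")
    return (" and ".join(parts) or "hold still") + "."


def describe_drive(action, steer_deadband=0.05, accel_deadband=0.2):
    """Turn steering and acceleration into a sentence using simple deadbands."""
    if action.steering > steer_deadband:
        turn = "steer left"
    elif action.steering < -steer_deadband:
        turn = "steer right"
    else:
        turn = "keep straight"
    if action.accel > accel_deadband:
        speed = "accelerate"
    elif action.accel < -accel_deadband:
        speed = "brake"
    else:
        speed = "hold speed"
    return f"{turn} and {speed}."


def to_command(action):
    """Single entry point: any supported action type becomes one text command."""
    if isinstance(action, ArmAction):
        return describe_arm(action)
    if isinstance(action, DriveAction):
        return describe_drive(action)
    raise TypeError(f"unsupported action type: {type(action).__name__}")
```

The point of the sketch is the shape of the interface: once every embodiment's actions pass through something like `to_command`, the downstream model sees only text, whatever robot produced the data.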

The training strategy is carefully designed in two stages: initial pretraining on general visual data to establish broad priors, followed by a fine-tuning phase incorporating rich embodied data. This curriculum ensures the model captures both universal visual and physical priors, as well as task-specific embodied knowledge. The architecture leverages layer-wise joint attention mechanisms, allowing bidirectional information flow between semantic and visual modalities, which enhances the model’s ability to predict physically plausible future states conditioned on language instructions.
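The two-stage curriculum can be pictured as a data-source schedule. The sketch below is only an illustration under stated assumptions: the step counts are placeholders, the corpus names are labels rather than real datasets, and the `general_replay` knob (mixing some general data back into stage two) is a hypothetical option that the summary does not mention. With its default of 0.0 the schedule matches the description above: general data first, embodied data afterwards.

```python
import random


def pick_corpus(step, stage1_steps, general_replay=0.0, rng=random):
    """Choose the data source for one training step.

    Stage 1 (step < stage1_steps): general visual data only.
    Stage 2: embodied data, optionally mixed with replayed general data.
    `general_replay` is a hypothetical knob; 0.0 means pure embodied fine-tuning.
    """
    if step < stage1_steps:
        return "general"
    return "general" if rng.random() < general_replay else "embodied"


def schedule(total_steps, stage1_steps, general_replay=0.0, seed=0):
    """Materialize the per-step corpus choice for a whole run."""
    rng = random.Random(seed)
    return [
        pick_corpus(step, stage1_steps, general_replay, rng)
        for step in range(total_steps)
    ]


# Placeholder step counts, chosen only to keep the printout short.
plan = schedule(total_steps=10, stage1_steps=4)
print(plan)
```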

Experimental evaluations demonstrate that Qwen-RobotWorld achieves top-tier performance across multiple benchmarks, including EWMBench, DreamGen, and WorldModelBench. Notably, it attains a total score of 4.60 on EWMBench, with motion fidelity 33% above the next-best open-source model. The model also excels in multi-view scene consistency and object-level compositional generalization, validating its robustness and scalability.

The significance of this work lies in its potential to revolutionize how robots are trained and evaluated. By providing a scalable, physics-aware, and language-conditioned simulation platform, it paves the way for more autonomous, adaptable, and intelligent robotic systems. The approach reduces reliance on costly real-world data collection, accelerates policy development, and enhances human-robot interaction capabilities.

Looking ahead, future efforts will aim to improve long-term reasoning, incorporate multimodal perception such as tactile and auditory signals, and develop online learning mechanisms for continual adaptation. Despite current limitations in complex multi-agent scenarios and data costs, this research marks a pivotal step toward realizing truly general-purpose embodied AI systems that can seamlessly operate across diverse environments and tasks.

Deep Dive

Abstract

We introduce Qwen-RobotWorld, a language-conditioned video world model for embodied intelligence. With natural language as a unified action interface, it predicts physically grounded future visual trajectories from current observations across robotic manipulation, autonomous driving, indoor navigation, and human-to-robot transfer. This unified formulation provides three promising application directions: synthetic data generation for policy training augmentation, scalable virtual environments for policy evaluation, and language-guided planning signals for downstream robot control. This is achieved through a three-part design: a) Double-Stream MMDiT with MLLM Action Encoding, where a 60-layer double-stream diffusion transformer couples frozen Qwen2.5-VL semantics with video-VAE latents through layer-wise joint attention; b) Embodied World Knowledge (EWK), an 8.6M video-text corpus (200M+ frames) with action-language mapping over 20+ embodiments and 500+ action categories; and c) General+Expert Progressive Curriculum, a two-stage training strategy that first learns general visual priors and then injects embodied specialization under a shared language interface. Extensive results show strong competitiveness: the model ranks 1st overall on EWMBench and DreamGen Bench and outperforms all open-source models on WorldModelBench and PBench. Additional zero-shot analyses on the RoboTwin-IF benchmark further support robust generalization and multi-view consistency.
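One of the application directions named in the abstract, using the world model as a virtual environment for policy evaluation, can be sketched as a closed loop. This is a toy under loud assumptions: the function signatures are hypothetical, a "frame" here is just an integer block position on a 1-D track standing in for an image, and `toy_world_model` is a hand-written rule, not a learned video model.

```python
def evaluate_policy(world_model, policy, first_frame, is_success, max_steps=10):
    """Roll a policy forward inside a world model instead of a real environment.

    Hypothetical signatures:
      world_model(frame, instruction) -> list of predicted future frames
      policy(frame) -> natural-language instruction
    """
    frames = [first_frame]
    for step in range(1, max_steps + 1):
        instruction = policy(frames[-1])
        frames.extend(world_model(frames[-1], instruction))
        if is_success(frames[-1]):
            return True, step, frames
    return False, max_steps, frames


def toy_world_model(frame, instruction):
    """Stand-in predictor: a 'frame' is a block position on a 1-D track."""
    if instruction == "move right":
        return [frame + 1]
    if instruction == "move left":
        return [frame - 1]
    return [frame]


def make_toy_policy(goal):
    """Stand-in policy that issues text commands steering toward `goal`."""
    def policy(frame):
        if frame < goal:
            return "move right"
        if frame > goal:
            return "move left"
        return "stay"
    return policy


ok, steps, frames = evaluate_policy(
    toy_world_model, make_toy_policy(goal=3), first_frame=0,
    is_success=lambda frame: frame == 3,
)
print(ok, steps, frames)
```

The loop never touches a simulator or a robot: the only channel between policy and environment is the text instruction, which is the property the abstract attributes to the unified language interface.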
