Mask World Model: Predicting What Matters for Robust Robot Policy Learning

TL;DR

The Mask World Model predicts semantic masks instead of pixels, yielding more robust robot policy learning and strong results on the LIBERO and RLBench benchmarks.

cs.RO Β· Advanced Β· 2026-04-22
Yunfan Lou Xiaowei Chi Xiaojie Zhang Zezhong Qian Chengxuan Li Rongyu Zhang Yaoxu Lyu Guoyu Song Chuyao Fu Haoxuan Xu Pengwei Wang Shanghang Zhang
robotics policy learning semantic mask video diffusion robustness

Key Findings

Methodology

The Mask World Model (MWM) employs video diffusion architectures to predict the evolution of semantic masks rather than pixels. This approach introduces a geometric information bottleneck, forcing the model to capture essential physical dynamics and contact relations while filtering out visual noise. MWM seamlessly integrates this mask dynamics backbone with a diffusion-based policy head to enable robust end-to-end control. The training is conducted in two stages: first, a mask-centric predictive model is learned via conditional diffusion objectives, and then a diffusion policy head is trained based on mask-centric predictive features.
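
To make the two-stage recipe concrete, below is a minimal sketch of the stage-1 objective: a denoiser learns to recover the noise added to future semantic-mask frames, conditioned on features from the current observation. The module names, tensor shapes, and noise schedule are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class MaskDenoiser(nn.Module):
    """Toy stand-in for the video-diffusion backbone: predicts the noise
    added to future semantic-mask frames, conditioned on current context."""
    def __init__(self, num_classes=16, ctx_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(num_classes + 1, 64, 3, padding=1), nn.SiLU(),
            nn.Conv3d(64, num_classes, 3, padding=1),
        )
        self.ctx_proj = nn.Linear(ctx_dim, 1)

    def forward(self, noisy_masks, t, ctx):
        # Broadcast a scalar timestep/context channel over the mask volume.
        b, _, f, h, w = noisy_masks.shape
        cond = (self.ctx_proj(ctx) + t.view(b, 1)).view(b, 1, 1, 1, 1)
        cond = cond.expand(b, 1, f, h, w)
        return self.net(torch.cat([noisy_masks, cond], dim=1))

def stage1_loss(model, masks, ctx, T=1000):
    """Stage 1: conditional denoising objective on one-hot mask frames.
    masks: (B, C, F, H, W) one-hot semantic masks for F future frames."""
    b = masks.shape[0]
    t = torch.randint(0, T, (b,), device=masks.device)
    alpha_bar = torch.cos(t.float() / T * torch.pi / 2) ** 2  # toy schedule
    noise = torch.randn_like(masks)
    ab = alpha_bar.view(b, 1, 1, 1, 1)
    noisy = ab.sqrt() * masks + (1 - ab).sqrt() * noise
    pred = model(noisy, t.float() / T, ctx)
    return nn.functional.mse_loss(pred, noise)

model = MaskDenoiser()
masks = torch.randn(2, 16, 4, 32, 32)  # stand-in for one-hot mask frames
ctx = torch.randn(2, 128)              # stand-in for RGB-derived context
loss = stage1_loss(model, masks, ctx)
loss.backward()
```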

Key Results

  • In the LIBERO benchmark, MWM achieved a 98.3% average success rate, significantly outperforming RGB-based world models. In RLBench, MWM's average success rate was 68.3%, also surpassing existing RGB models.
  • In real-world experiments, MWM achieved an average success rate of 67.5% across four tasks, far exceeding GE-ACT's 23.8% and Ο€'s 38.8%. These tasks involved complex goal constraints and high sensitivity to error accumulation.
  • Robustness evaluation via random visual token pruning showed that MWM generalizes better than RGB-based baselines and remains resilient to the loss of texture information.

Significance

This research strengthens the robustness of robot policy learning by introducing semantic mask prediction, addressing the overfitting to dynamic backgrounds and illumination changes that afflicts traditional RGB video prediction. MWM's performance across multiple benchmarks demonstrates clear advantages in handling visual variability and capturing decision-relevant geometric information. The approach not only opens new research directions in academia but also offers more robust solutions for industrial robot control.

Technical Contribution

The technical contributions include shifting the predictive space from RGB frames to semantic masks, providing a geometric bottleneck that preserves object identity, spatial layout, and interaction-relevant structure. MWM does not require an external segmentation model during inference; semantic labels are used only during training. Strong performance across multiple benchmarks confirms the value of this design for handling visual variability and capturing decision-relevant geometric information.

Novelty

MWM is the first to apply video diffusion architectures to semantic mask prediction instead of traditional RGB video prediction. This innovation significantly reduces the impact of visual noise by introducing a geometric information bottleneck, enhancing the model's generalization capabilities and robustness.

Limitations

  • MWM may still face limitations in handling extreme lighting changes and complex backgrounds, especially when lacking sufficient training data.
  • The model's computational resource requirements are relatively high, potentially unsuitable for resource-constrained real-time applications.
  • In certain specific tasks, mask prediction may be less accurate than direct RGB prediction.

Future Work

Future research directions include optimizing MWM's computational efficiency for resource-constrained applications; exploring more semantic mask generation methods to improve model generalization; and validating MWM's effectiveness in more real-world tasks.

AI Executive Summary

In the field of robot policy learning, maintaining reliability under visual variability remains a central challenge. Traditional approaches often rely on high-fidelity RGB video prediction, which can lead to overfitting to irrelevant factors such as dynamic backgrounds and illumination changes, ultimately reducing the model's ability to generalize. To address this issue, researchers have introduced the Mask World Model (MWM), a novel approach that leverages video diffusion architectures to predict the evolution of semantic masks instead of pixels.

MWM introduces a geometric information bottleneck, forcing the model to capture essential physical dynamics and contact relations while filtering out visual noise. This method seamlessly integrates the mask dynamics backbone with a diffusion-based policy head to enable robust end-to-end control. The training process is conducted in two stages: first, a mask-centric predictive model is learned via conditional diffusion objectives, and then a diffusion policy head is trained based on mask-centric predictive features.

In experiments, MWM demonstrated superior performance in the LIBERO and RLBench benchmarks, achieving average success rates of 98.3% and 68.3%, respectively, significantly outperforming existing RGB-based world models. Additionally, in real-world experiments, MWM achieved an average success rate of 67.5% across four tasks, far exceeding other baseline models.

The significance of this research lies in strengthening the robustness of robot policy learning through semantic mask prediction, which addresses the overfitting to dynamic backgrounds and illumination changes that undermines traditional RGB video prediction. MWM's results across multiple benchmarks demonstrate clear advantages in handling visual variability and capturing decision-relevant geometric information.

However, MWM may still face limitations in handling extreme lighting changes and complex backgrounds, especially when lacking sufficient training data. Future research directions include optimizing MWM's computational efficiency for resource-constrained applications; exploring more semantic mask generation methods to improve model generalization; and validating MWM's effectiveness in more real-world tasks.

Deep Analysis

Background

In the field of robot policy learning, maintaining reliability under visual variability remains a central challenge. Traditional approaches often rely on high-fidelity RGB video prediction, which can lead to overfitting to irrelevant factors such as dynamic backgrounds and illumination changes, ultimately reducing the model's ability to generalize. Recently, with the development of video generative pre-training techniques, world models derived from large-scale video generative pre-training have emerged as a promising paradigm for generalist robot policy learning. However, these methods often focus on predicting RGB pixels, and this photometric objective is often misaligned with control tasks. RGB frames contain substantial nuisance variation, including texture, lighting, reflections, and dynamic backgrounds, which are weakly related to action selection. Pixel prediction compels a model to allocate capacity to these factors and to entangle appearance with dynamics, treating changes in illumination or background as comparable to contact-relevant motion. In closed-loop execution, this misallocation becomes more damaging: small appearance-driven errors can compound over time, causing predictive drift and brittle policies under modest distribution shifts.

Core Problem

Traditional RGB video prediction methods in robot policy learning face overfitting issues, particularly to irrelevant factors such as dynamic backgrounds and illumination changes. This reduces the model's ability to generalize, ultimately leading to unreliable and fragile control policies. To achieve robust robot policy learning, there is an urgent need for a method that can effectively filter out visual noise and capture decision-relevant geometric information.

Innovation

The core innovation of the Mask World Model (MWM) lies in shifting the predictive space from RGB frames to future semantic masks. Semantic masks impose a geometric bottleneck that preserves object identity, spatial layout, and interaction-relevant structure while discarding redundant appearance. MWM does not require an external segmentation model at inference: semantic labels are used only offline during training, while deployment uses only raw multi-view RGB. The training pipeline adopts a two-stage strategy, first learning a mask-centric predictive model via conditional diffusion objectives, and then training a diffusion policy head based on mask-centric predictive features.
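
The train/inference asymmetry is worth spelling out: masks are produced offline during data preparation and never at deployment. A toy sketch, with `segment_offline` as a hypothetical stand-in for whatever segmenter produces the training labels (the summary does not name one):

```python
import numpy as np

def segment_offline(rgb_frame):
    """Hypothetical stand-in for the offline segmenter that produces
    training labels (the summary does not name a specific model)."""
    h, w, _ = rgb_frame.shape
    return np.random.randint(0, 16, size=(h, w))  # per-pixel class ids

# Offline data preparation: pair every RGB frame with a mask label.
rgb_frames = [np.random.rand(64, 64, 3) for _ in range(8)]
training_pairs = [(frame, segment_offline(frame)) for frame in rgb_frames]

# Deployment: the trained model consumes raw multi-view RGB only;
# no segmenter appears anywhere in the inference loop.
def act(policy, multi_view_rgb):
    return policy(multi_view_rgb)
```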

Methodology

  • MWM employs video diffusion architectures to predict the evolution of semantic masks rather than pixels.
  • Training is conducted in two stages: first, a mask-centric predictive model is learned via conditional diffusion objectives, and then a diffusion policy head is trained on mask-centric predictive features (a sketch of this head follows the list).
  • The mask dynamics backbone is integrated with a diffusion-based policy head to enable robust end-to-end control.
  • A geometric information bottleneck forces the model to capture essential physical dynamics and contact relations while filtering out visual noise.
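
Here is a minimal sketch of the second stage: a diffusion policy head that iteratively denoises an action chunk conditioned on features from the mask-dynamics backbone. The network, sampler, and shapes are simplified assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class DiffusionPolicyHead(nn.Module):
    """Toy diffusion policy head: denoises an action chunk conditioned on
    features from the (frozen) mask-dynamics backbone."""
    def __init__(self, act_dim=7, horizon=8, feat_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(act_dim * horizon + feat_dim + 1, 256), nn.SiLU(),
            nn.Linear(256, act_dim * horizon),
        )
        self.act_dim, self.horizon = act_dim, horizon

    def forward(self, noisy_actions, t, feats):
        x = torch.cat([noisy_actions.flatten(1), feats, t[:, None]], dim=-1)
        return self.net(x).view(-1, self.horizon, self.act_dim)

@torch.no_grad()
def sample_actions(head, feats, steps=10):
    """Simple iterative denoising loop (DDPM-style, heavily simplified)."""
    b = feats.shape[0]
    a = torch.randn(b, head.horizon, head.act_dim)
    for i in reversed(range(steps)):
        t = torch.full((b,), i / steps)
        eps = head(a, t, feats)
        a = a - eps / steps  # crude update; real samplers use noise schedules
    return a

head = DiffusionPolicyHead()
feats = torch.randn(2, 128)  # stand-in for mask-centric predictive features
actions = sample_actions(head, feats)
print(actions.shape)         # torch.Size([2, 8, 7])
```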

Experiments

The experimental design includes evaluations on the LIBERO and RLBench benchmarks. LIBERO consists of 130 simulated manipulation tasks evaluated using templated language instructions. RLBench contains 100 tabletop manipulation tasks evaluated using standardized multi-view observations and natural language goals. In RLBench, 20 evaluation episodes per task are conducted using randomized seeds and initializations. Baseline models include OpenVLA, CogACT, Ο€, Cosmos+IDM, Cosmos+LatentIDM, and GE-ACT.
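
A short harness illustrating the stated protocol (20 randomized episodes per task, averaged success rates); `run_episode` and the task names are stubs standing in for an actual RLBench or LIBERO rollout:

```python
import random

def run_episode(policy, task, seed):
    """Stub rollout; a real harness would reset and step the simulator
    (RLBench / LIBERO) here and return 1 on success, 0 on failure."""
    random.seed(seed)
    return int(random.random() < 0.5)  # placeholder outcome

def evaluate(policy, tasks, episodes_per_task=20, seed=0):
    """Run N randomized episodes per task and report per-task and
    average success rates, mirroring the 20-episode protocol above."""
    rng = random.Random(seed)
    rates = {}
    for task in tasks:
        wins = sum(run_episode(policy, task, rng.randrange(2**31))
                   for _ in range(episodes_per_task))
        rates[task] = wins / episodes_per_task
    rates["average"] = sum(rates.values()) / len(tasks)
    return rates

tasks = ["close_jar", "open_drawer"]      # hypothetical task names
print(evaluate(lambda obs: None, tasks))  # e.g. {'close_jar': 0.55, ...}
```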

Results

In the LIBERO benchmark, MWM achieved a 98.3% average success rate, significantly outperforming RGB-based world models. In RLBench, MWM's average success rate was 68.3%, also surpassing existing RGB models. In real-world experiments, MWM achieved an average success rate of 67.5% across four tasks, far exceeding GE-ACT's 23.8% and Ο€'s 38.8%. Robustness evaluation via random visual token pruning showed that MWM generalizes well and remains resilient to the loss of texture information; a sketch of the pruning procedure follows.
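
A plausible implementation of the random visual token pruning used in the robustness evaluation; the exact procedure is not detailed in the summary, so treat this as an assumption-laden sketch:

```python
import torch

def prune_tokens(tokens, keep_ratio, generator=None):
    """Randomly drop a fraction of visual tokens, as in the robustness
    evaluation: tokens is (B, N, D); keep_ratio in (0, 1]."""
    b, n, d = tokens.shape
    keep = max(1, int(n * keep_ratio))
    idx = torch.stack([torch.randperm(n, generator=generator)[:keep]
                       for _ in range(b)])                    # (B, keep)
    return torch.gather(tokens, 1, idx.unsqueeze(-1).expand(b, keep, d))

tokens = torch.randn(2, 196, 64)     # e.g. 14x14 patch tokens per view
pruned = prune_tokens(tokens, keep_ratio=0.5)
print(pruned.shape)                  # torch.Size([2, 98, 64])
```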

Applications

MWM has broad application potential in robot policy learning, particularly in scenarios requiring handling visual variability and improving decision-relevant geometric information capture. Its application scenarios include automated manufacturing, autonomous vehicles, and smart home robots. MWM's robustness and generalization capabilities make it suitable for various complex real-world tasks.

Limitations & Outlook

Despite MWM's outstanding performance across multiple benchmarks, it may still face limitations in handling extreme lighting changes and complex backgrounds, especially when lacking sufficient training data. Additionally, the model's computational resource requirements are relatively high, potentially unsuitable for resource-constrained real-time applications. In certain specific tasks, the accuracy of mask prediction may not be as high as direct RGB prediction. Future research directions include optimizing MWM's computational efficiency for resource-constrained applications; exploring more semantic mask generation methods to improve model generalization; and validating MWM's effectiveness in more real-world tasks.

Plain Language (accessible to non-experts)

Imagine you're in a kitchen cooking. Traditional methods are like using a high-definition camera to record every detail, including the color of the tiles on the wall and the changes in sunlight outside. While this information is rich, it doesn't really help with the actual cooking. Instead, the Mask World Model is like a smart assistant that only focuses on the ingredients in your hand, the cookware, and the heat, ignoring those unimportant background details. This way, you can concentrate more on the cooking itself without being distracted by irrelevant information. Through this approach, MWM helps robots make better decisions in complex visual environments, just like an experienced chef can cook delicious dishes in any kitchen setting.

ELI14 (explained like you're 14)

Hey there! Imagine you're playing a super cool robot game where your mission is to make the robot complete tasks in various environments, like moving things, opening doors, or even cooking! But here's the catch: the game's environment keeps changing, like the lights dimming, backgrounds shifting, or objects changing color. Traditional robots are like newbies who only look at the surface and get confused by these changes. But our Mask World Model is different! It's like a smart detective that only focuses on the important clues, like the shape and position of objects, while ignoring those unimportant changes. This way, the robot can perform great in any environment, just like a superhero! Isn't that awesome?

Glossary

Mask World Model

A model that uses video diffusion architectures to predict the evolution of semantic masks, enhancing the robustness of robot policy learning by introducing a geometric information bottleneck.

In the paper, MWM is used to replace traditional RGB video prediction methods.

Semantic Mask

A mask used to represent the geometric information of different objects in an image, preserving object identity and spatial layout.

MWM predicts semantic masks to capture decision-relevant geometric information.

Video Diffusion Architecture

An architecture used for video generation and prediction, generating video frames or masks through a diffusion process.

MWM uses video diffusion architectures to predict the evolution of semantic masks.

LIBERO Benchmark

A benchmark consisting of 130 simulated manipulation tasks used to evaluate the performance of robot policy learning.

MWM achieved a 98.3% average success rate in the LIBERO benchmark.

RLBench Benchmark

A benchmark consisting of 100 tabletop manipulation tasks evaluated using standardized multi-view observations and natural language goals.

MWM's average success rate in the RLBench benchmark was 68.3%.

Generalization

The ability of a model to maintain good performance on unseen data or environments.

MWM significantly improves generalization by introducing a geometric information bottleneck.

Robustness

The ability of a model to maintain stable performance when faced with input variations or noise.

MWM exhibits superior robustness in handling texture information loss.

Diffusion Policy Head

A policy head used to generate actions, producing action sequences through a diffusion process.

MWM integrates the mask dynamics backbone with a diffusion-based policy head.

Geometric Information Bottleneck

A mechanism that preserves decision-relevant geometric information by limiting information flow, filtering out visual noise.

MWM introduces a geometric information bottleneck to filter out visual noise.

Random Token Pruning

A technique to evaluate model robustness by randomly removing visual tokens.

MWM shows excellent performance in robustness evaluation via random token pruning.

Open Questions (unanswered questions from this research)

  1. How can MWM's robustness be further improved under extreme lighting changes and complex backgrounds? Current methods may still fall short in these situations, and new techniques are needed to enhance adaptability.
  2. How can MWM's computational efficiency be optimized for resource-constrained real-time applications? The model's current computational requirements are relatively high.
  3. How can MWM's generalization be improved when training data is scarce? New data augmentation or transfer learning methods are worth exploring.
  4. How can MWM's effectiveness be validated in more real-world tasks? Experiments across more application scenarios are needed to verify its robustness and generalization.
  5. In certain tasks, mask prediction may be less accurate than direct RGB prediction; how can this gap be closed? New mask generation and prediction methods are worth exploring.

Applications

Immediate Applications

Automated Manufacturing

MWM can be used in industrial robots operating in complex environments, enhancing the level of automation and efficiency on production lines.

Autonomous Vehicles

By improving adaptability to visual changes, MWM can enhance the navigation capabilities of autonomous vehicles in different environments.

Smart Home Robots

MWM can help home robots perform complex tasks in dynamic home environments, such as cleaning and moving objects.

Long-term Vision

Fully Automated Factories

By integrating MWM, future factories can achieve fully automated production processes, reducing the need for human intervention.

Smart City Management

MWM can be used in urban management for automated monitoring and maintenance, improving the management efficiency of urban infrastructure.

Abstract

World models derived from large-scale video generative pre-training have emerged as a promising paradigm for generalist robot policy learning. However, standard approaches often focus on high-fidelity RGB video prediction, which can result in overfitting to irrelevant factors such as dynamic backgrounds and illumination changes. These distractions reduce the model's ability to generalize, ultimately leading to unreliable and fragile control policies. To address this, we introduce the Mask World Model (MWM), which leverages video diffusion architectures to predict the evolution of semantic masks instead of pixels. This shift imposes a geometric information bottleneck, forcing the model to capture essential physical dynamics and contact relations while filtering out visual noise. We seamlessly integrate this mask dynamics backbone with a diffusion-based policy head to enable robust end-to-end control. Extensive evaluations demonstrate the superiority of MWM on the LIBERO and RLBench simulation benchmarks, significantly outperforming state-of-the-art RGB-based world models. Furthermore, real-world experiments and robustness evaluation (via random token pruning) reveal that MWM exhibits superior generalization capabilities and robust resilience to texture information loss.

cs.RO

References (20)

  • LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning. Bo Liu, Yifeng Zhu, Chongkai Gao et al. 2023.
  • mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs. Jonas Pai, Liam Achenbach, Victoriano Montesinos et al. 2025.
  • World Models. David R. Ha, J. Schmidhuber. 2018.
  • WoW: Towards a World omniscient World model Through Embodied Interaction. Xiaowei Chi, Peidong Jia, Chunkai Fan et al. 2025.
  • MoWM: Mixture-of-World-Models for Embodied Planning via Latent-to-Pixel Feature Modulation. Yu Shang, Yangcheng Yu, Xin Zhang et al. 2025.
  • MONet: Unsupervised Scene Decomposition and Representation. Christopher P. Burgess, L. Matthey, Nicholas Watters et al. 2019.
  • ReconVLA: Reconstructive Vision-Language-Action Model as Effective Robot Perceiver. Wenxuan Song, Ziyang Zhou, Han Zhao et al. 2025.
  • SemanticVLA: Semantic-Aligned Sparsification and Enhancement for Efficient Robotic Manipulation. Wei Li, Renshan Zhang, Rui Shao et al. 2025.
  • Arbitrary Style Transfer in Real-Time with Adaptive Instance Normalization. Xun Huang, Serge J. Belongie. 2017.
  • V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning. Mahmoud Assran, Adrien Bardes, David Fan et al. 2025.
  • Multi-Object Representation Learning with Iterative Variational Inference. Klaus Greff, Raphael Lopez Kaufman, Rishabh Kabra et al. 2019.
  • 3D-RPE: Enhancing Long-Context Modeling Through 3D Rotary Position Encoding. Xindian Ma, Wenyuan Liu, Peng Zhang et al. 2024.
  • TTF-VLA: Temporal Token Fusion via Pixel-Attention Integration for Vision-Language-Action Models. Chenghao Liu, Jiachen Zhang, Chengxuan Li et al. 2025.
  • Masked Autoencoders Are Scalable Vision Learners. Kaiming He, Xinlei Chen, Saining Xie et al. 2021.
  • GENESIS-V2: Inferring Unordered Object Representations without Iterative Refinement. Martin Engelcke, Oiwi Parker Jones, I. Posner. 2021.
  • MoLe-VLA: Dynamic Layer-skipping Vision Language Action Model via Mixture-of-Layers for Efficient Robot Manipulation. Rongyu Zhang, Menghang Dong, Yuan Zhang et al. 2025.
  • Object-Centric Learning with Slot Attention. Francesco Locatello, Dirk Weissenborn, Thomas Unterthiner et al. 2020.
  • Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets. Chuning Zhu, Raymond Yu, Siyuan Feng et al. 2025.
  • VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training. Zhan Tong, Yibing Song, Jue Wang et al. 2022.
  • RoboEngine: Plug-and-Play Robot Data Augmentation with Semantic Robot Segmentation and Background Generation. Chengbo Yuan, Suraj Joshi, Shaoting Zhu et al. 2025.