Towards Generalizable Robotic Manipulation in Dynamic Environments

TL;DR

PUMA improves manipulation success rate by 6.3% (absolute) in dynamic environments by combining historical optical flow with world queries.

cs.CV · Advanced · 2026-03-17
Heng Fang, Shangru Li, Shuhan Wang, Xuanyang Xi, Dingkang Liang, Xiang Bai
robotic manipulation · dynamic environments · vision-language models · dataset · spatiotemporal reasoning

Key Findings

Methodology

This study introduces PUMA, a dynamics-aware Vision-Language-Action (VLA) architecture. PUMA integrates scene-centric historical optical flow and specialized world queries to implicitly forecast object-centric future states. Its core components include a history-aware perception module and a short-horizon prediction module. PUMA is trained and evaluated using the DOMINO dataset, which features 35 tasks, over 110K expert trajectories, and a multi-dimensional evaluation suite.

Key Results

  • PUMA achieves a 6.3% absolute improvement in success rate over baseline models on dynamic tasks, outperforming existing VLA models particularly in scenarios with complex moving targets.
  • Training on dynamic data enables PUMA to generate robust spatiotemporal representations that effectively transfer to static tasks, showcasing its generalization capabilities across different tasks.
  • Ablation studies reveal that the combination of historical optical flow and world queries is crucial for PUMA's performance enhancement, especially in predicting future object states.

Significance

This research is significant for the field of robotic manipulation in dynamic environments. By introducing the DOMINO dataset and PUMA architecture, the study addresses the gap in dynamic manipulation datasets and demonstrates the potential of dynamics-awareness in enhancing VLA models' spatiotemporal reasoning capabilities. These findings not only advance academic understanding of robotic manipulation in dynamic settings but also provide new insights for the industry to develop smarter robotic systems in dynamic scenarios.

Technical Contribution

Technical contributions include: 1) Introducing the PUMA architecture, which combines historical optical flow and world queries to enhance dynamics-awareness; 2) Developing the DOMINO dataset, offering a rich set of dynamic manipulation tasks and evaluation standards; 3) Systematically validating the effectiveness of dynamic data training in improving model generalization. These contributions offer new theoretical and engineering possibilities for robotic manipulation in dynamic environments.

Novelty

PUMA is the first VLA architecture to combine historical optical flow and world queries, achieving higher manipulation success rates in dynamic environments. Compared to existing methods, PUMA significantly enhances spatiotemporal reasoning capabilities, particularly in handling complex dynamic scenarios.

Limitations

  • PUMA faces challenges in handling extremely fast-moving targets, especially when target trajectories are irregular, which may reduce prediction accuracy.
  • Although the DOMINO dataset is extensive, it may not cover all possible dynamic scenarios, limiting the model's generalization in certain environments.
  • PUMA's computational complexity is relatively high, potentially posing performance bottlenecks for real-time applications.

Future Work

Future research directions include: 1) Expanding the DOMINO dataset to cover more dynamic scenarios, enhancing model generalization; 2) Optimizing PUMA's computational efficiency for real-time applications; 3) Exploring other dynamics-aware mechanisms, such as multimodal fusion, to further improve the model's spatiotemporal reasoning capabilities.

AI Executive Summary

In the field of robotic manipulation, existing Vision-Language-Action (VLA) models excel in static environments but struggle in dynamic settings. This challenge primarily arises from the lack of dynamic manipulation datasets and the reliance of mainstream VLA models on single-frame observations, which limits their spatiotemporal reasoning capabilities.

To address this issue, the research team introduces DOMINO, a large-scale dynamic manipulation dataset and benchmark, featuring 35 tasks, over 110K expert trajectories, and a multi-dimensional evaluation suite. Through comprehensive experiments, the researchers evaluate existing VLA models on dynamic tasks, explore effective training strategies for dynamic awareness, and validate the generalizability of dynamic data.

Furthermore, the study proposes PUMA, a dynamics-aware VLA architecture. PUMA integrates scene-centric historical optical flow and specialized world queries to implicitly forecast object-centric future states. By coupling history-aware perception with short-horizon prediction, PUMA achieves state-of-the-art performance in dynamic environments.

Experimental results show that PUMA improves the success rate by 6.3% (absolute) over baseline models on dynamic tasks. Additionally, training on dynamic data enables PUMA to generate robust spatiotemporal representations that effectively transfer to static tasks, demonstrating its generalization capabilities across different tasks.

The significance of this research lies in addressing the gap in dynamic manipulation datasets and demonstrating the potential of dynamics-awareness in enhancing VLA models' spatiotemporal reasoning capabilities. These findings not only advance academic understanding of robotic manipulation in dynamic settings but also provide new insights for the industry to develop smarter robotic systems in dynamic scenarios.

However, PUMA faces challenges in handling extremely fast-moving targets, especially when target trajectories are irregular, which may reduce prediction accuracy. Future research directions include expanding the DOMINO dataset to cover more dynamic scenarios, optimizing PUMA's computational efficiency, and exploring other dynamics-aware mechanisms.

Deep Analysis

Background

In recent years, the field of robotic manipulation has made significant progress, particularly in object manipulation tasks within static environments. Vision-Language-Action (VLA) models, which integrate visual information, language instructions, and action planning, have enabled the automation of complex tasks. However, as application scenarios diversify, robots are increasingly required to interact with moving targets in dynamic environments, posing new challenges to existing models. Dynamic environments demand not only real-time perception capabilities but also sophisticated spatiotemporal reasoning. Yet, most current VLA models rely on single-frame observations, lacking comprehensive understanding of dynamic scenes. Additionally, the scarcity of dynamic manipulation datasets limits model performance in such tasks.

Core Problem

The core problem is achieving generalizable robotic manipulation in dynamic environments. While existing VLA models perform well in static tasks, they often underperform in dynamic scenarios. Specific bottlenecks include: 1) The lack of large-scale dynamic manipulation datasets, preventing models from adequately learning dynamic scene characteristics during training; 2) Mainstream models' reliance on single-frame observations, lacking the ability to predict target trajectories; 3) The increased uncertainty in dynamic scenes complicates spatiotemporal reasoning. Solving these issues is crucial for enhancing robotic manipulation capabilities in complex dynamic environments.

Innovation

Core innovations include: 1) Introducing the DOMINO dataset, filling the gap in dynamic manipulation datasets and providing a rich set of tasks and trajectories for model training and evaluation; 2) Proposing the PUMA architecture, which enhances dynamics-awareness by integrating historical optical flow and world queries; 3) Systematically validating the effectiveness of dynamic data training in improving model generalization. Compared to existing methods, PUMA significantly enhances spatiotemporal reasoning capabilities, particularly in handling complex dynamic scenarios.

Methodology

Method details:

  • DOMINO Dataset: 35 tasks, over 110K expert trajectories, and a multi-dimensional evaluation suite.
  • PUMA Architecture (a minimal sketch follows this list):
    • History-aware Perception Module: integrates scene-centric historical optical flow to capture target motion trajectories.
    • Short-horizon Prediction Module: uses world queries to implicitly forecast object-centric future states.
  • Dynamic Awareness Training Strategy: trains on the DOMINO dataset to produce robust spatiotemporal representations.
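
The summary stays at the component level, so to make the data flow concrete, here is a minimal sketch of how a history-aware flow encoder and a world-query prediction head could be wired together. All class names, dimensions, and the attention layout are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of a dynamics-aware perception/prediction stack in the
# spirit of PUMA. ASSUMPTIONS: module names, dimensions, and the fusion
# layout are illustrative; the actual implementation may differ.
import torch
import torch.nn as nn

class HistoryAwarePerception(nn.Module):
    """Encodes a short window of optical-flow fields into motion tokens."""
    def __init__(self, flow_channels=2, dim=256):
        super().__init__()
        # One conv encoder shared across the flow history window.
        self.encoder = nn.Sequential(
            nn.Conv2d(flow_channels, dim // 4, 8, stride=8), nn.GELU(),
            nn.Conv2d(dim // 4, dim, 4, stride=4),
        )
        self.temporal = nn.GRU(dim, dim, batch_first=True)

    def forward(self, flow_history):                       # (B, T, 2, H, W)
        b, t = flow_history.shape[:2]
        feats = self.encoder(flow_history.flatten(0, 1))   # (B*T, D, h, w)
        feats = feats.flatten(2).mean(-1).view(b, t, -1)   # (B, T, D)
        out, _ = self.temporal(feats)                      # (B, T, D)
        return out

class WorldQueryPredictor(nn.Module):
    """Learned 'world queries' attend to motion tokens to implicitly
    summarize short-horizon future object states."""
    def __init__(self, dim=256, num_queries=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, motion_tokens):                      # (B, T, D)
        q = self.queries.unsqueeze(0).expand(motion_tokens.size(0), -1, -1)
        future, _ = self.attn(q, motion_tokens, motion_tokens)
        return future                                      # (B, Q, D)

# Usage: the future tokens would be fused with the usual single-frame VLA
# features before the action head.
perception = HistoryAwarePerception()
predictor = WorldQueryPredictor()
flow = torch.randn(2, 4, 2, 224, 224)        # fake 4-step flow history
future_tokens = predictor(perception(flow))  # (2, 8, 256)
```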

Experiments

Experimental design includes:

  • Datasets: training and evaluation on the DOMINO dataset, covering 35 dynamic tasks.
  • Baseline Models: comparison against existing VLA baselines to quantify PUMA's improvements.
  • Evaluation Metrics: success rate, spatiotemporal reasoning accuracy, and related measures (a minimal success-rate sketch follows this list).
  • Ablation Studies: analyze the impact of historical optical flow and world queries on model performance.
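
Since success rate is the headline metric, here is a minimal sketch of how it, and the reported absolute improvement, is computed. The episode-record structure is hypothetical, not DOMINO's actual format.

```python
# Minimal sketch of the headline metric: success rate over evaluation
# episodes, plus the absolute improvement reported against a baseline.
# The episode-record structure here is hypothetical, not DOMINO's format.
from dataclasses import dataclass

@dataclass
class Episode:
    task: str
    success: bool

def success_rate(episodes):
    """Fraction of episodes whose rollout reached the goal condition."""
    return sum(e.success for e in episodes) / len(episodes)

baseline = [Episode("track_and_grasp", s) for s in [True, False, False, True]]
puma     = [Episode("track_and_grasp", s) for s in [True, False, True,  True]]

# "Absolute improvement" means a difference in percentage points,
# e.g. 50% -> 75% is a 25-point absolute gain.
gain = (success_rate(puma) - success_rate(baseline)) * 100
print(f"absolute improvement: {gain:.1f} points")
```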

Results

Results analysis:

  • PUMA achieves a 6.3% absolute improvement in success rate over baseline models on dynamic tasks, demonstrating its superior performance in dynamic environments.
  • Ablation studies reveal that the combination of historical optical flow and world queries is crucial for PUMA's performance, especially in predicting future object states.
  • Training on dynamic data enables PUMA to generate robust spatiotemporal representations that transfer effectively to static tasks.

Applications

Application scenarios include:

  • Industrial Robotics: performing complex manipulation tasks on dynamic production lines, improving efficiency and flexibility.
  • Service Robotics: interacting with moving targets in homes or public spaces, enabling smarter services such as cleaning and delivery robots.
  • Autonomous Driving: making real-time decisions and plans in dynamic traffic environments, enhancing safety and the driving experience.

Limitations & Outlook

Limitations & outlook:

  • PUMA faces challenges with extremely fast-moving targets, especially when target trajectories are irregular, which can reduce prediction accuracy.
  • Although the DOMINO dataset is extensive, it may not cover all possible dynamic scenarios, limiting the model's generalization in certain environments.
  • PUMA's computational complexity is relatively high, potentially posing performance bottlenecks for real-time applications. Future directions include expanding the dataset, optimizing computational efficiency, and exploring other dynamics-aware mechanisms.

Plain Language (Accessible to non-experts)

Imagine you're cooking in a kitchen. You have a recipe (language instructions) and need to find the right ingredients (visual information) to cook (action). In a static environment, this process is relatively straightforward because the ingredients don't move. But if there are kittens running around the kitchen, you need to keep an eye on their positions while cooking (dynamic awareness). PUMA is like a smart kitchen assistant that not only helps you find ingredients but also predicts the kittens' movements, ensuring you're not disturbed while cooking. By combining past observations (historical optical flow) and future predictions (world queries), PUMA helps robots perform tasks better in dynamic environments.

ELI14 (Explained like you're 14)

Hey there! Imagine you're playing a game with lots of moving targets. You need to catch these targets with your character, but they're always on the move! It's like playing hide and seek, where you need to guess where the targets will go next. PUMA is like a super helper that predicts where these targets are going, making it easier for you to catch them. It's like having a friend with a magic crystal ball who can always tell you what's going to happen next. By watching past actions (historical optical flow) and predicting future changes (world queries), PUMA makes robots smarter and more flexible in dynamic environments. Cool, right?

Glossary

Vision-Language-Action (VLA)

A model that combines visual information, language instructions, and action planning to automate complex tasks.

Used for robotic manipulation in static and dynamic environments.

Dynamic Environment

Refers to scenarios where the state of the environment is constantly changing, such as those with moving targets.

PUMA achieves higher manipulation success rates in dynamic environments.

Optical Flow

A technique for estimating object motion in image sequences by analyzing changes between consecutive frames.

PUMA uses historical optical flow to capture target motion trajectories.
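
For readers who want to see the primitive itself, here is a minimal dense-flow computation using OpenCV's Farneback method. This only illustrates the input modality; the flow estimator PUMA actually uses is not specified in this summary.

```python
# Dense optical flow between two consecutive grayscale frames using
# OpenCV's Farneback method (requires the opencv-python package).
import numpy as np
import cv2

# Two synthetic frames: a bright square that shifts 5 px to the right.
prev_frame = np.zeros((128, 128), dtype=np.uint8)
next_frame = np.zeros((128, 128), dtype=np.uint8)
prev_frame[40:60, 40:60] = 255
next_frame[40:60, 45:65] = 255

flow = cv2.calcOpticalFlowFarneback(
    prev_frame, next_frame, None,
    pyr_scale=0.5, levels=3, winsize=15,
    iterations=3, poly_n=5, poly_sigma=1.2, flags=0,
)  # shape (H, W, 2): per-pixel (dx, dy) displacement

dx = flow[40:60, 40:60, 0].mean()
print(f"estimated horizontal motion: {dx:.1f} px (expected ~5)")
```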

World Queries

Specialized query inputs through which a model implicitly forecasts object-centric future states, rather than explicitly predicting trajectories.

PUMA uses world queries to implicitly forecast object-centric future states.

DOMINO Dataset

A large-scale dynamic manipulation dataset featuring 35 tasks and over 110K expert trajectories.

Used to train and evaluate PUMA's dynamics-awareness.

Spatiotemporal Reasoning

The ability to reason using both time and spatial information, crucial in dynamic environments.

PUMA enhances spatiotemporal reasoning capabilities through dynamics-awareness.

Generalization

The ability of a model to maintain high performance across different tasks and environments.

PUMA demonstrates generalization capabilities across dynamic and static tasks.

Ablation Study

A method of evaluating the impact of specific components on overall performance by removing them.

Used to analyze the impact of historical optical flow and world queries on PUMA's performance.
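
As a schematic of how such a component ablation can be organized, consider the grid below; the variant names and the stub loop are illustrative, not the paper's protocol.

```python
# Schematic ablation grid over PUMA's two dynamics components.
# Variant names and the train/evaluate step are placeholders.
variants = {
    "full":         {"optical_flow": True,  "world_queries": True},
    "no_flow":      {"optical_flow": False, "world_queries": True},
    "no_queries":   {"optical_flow": True,  "world_queries": False},
    "single_frame": {"optical_flow": False, "world_queries": False},
}
for name, cfg in variants.items():
    # In practice: train one model per config, then measure success rate.
    print(name, cfg)
```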

Baseline Model

A reference model used to compare the performance of new models, typically the current best method.

PUMA improves success rate by 6.3% over baseline models.

Success Rate

A metric that measures the proportion of tasks a model successfully completes.

PUMA significantly improves success rates in dynamic tasks.

Open Questions (Unanswered questions from this research)

  1. How can model prediction accuracy be improved in extremely dynamic environments? Current methods underperform on fast-moving and irregularly moving targets, calling for more advanced dynamics-aware mechanisms.
  2. How can PUMA's computational complexity be reduced for real-time applications? Current computational demands may create bottlenecks, necessitating algorithm- and hardware-level optimization.
  3. Does the DOMINO dataset cover all relevant dynamic scenarios? While extensive, it may still miss certain scenarios, limiting generalization.
  4. How can the model's spatiotemporal reasoning be further enhanced? Current methods remain limited in complex dynamic scenarios, motivating new reasoning mechanisms.
  5. Can the generalization benefits of dynamic-data training transfer to other fields? Its applicability across domains and tasks remains to be studied.

Applications

Immediate Applications

Industrial Robotics

Performing complex manipulation tasks on dynamic production lines, improving efficiency and flexibility, applicable to manufacturing and assembly lines.

Service Robotics

Interacting with moving targets in homes or public spaces, providing smarter services such as cleaning and delivery robots.

Autonomous Driving

Making real-time decisions and planning in dynamic traffic environments, enhancing safety and driving experience, applicable to self-driving cars.

Long-term Vision

Smart Cities

Optimizing urban infrastructure management through dynamics-aware technologies, improving resource efficiency and quality of life for residents.

Human-Robot Collaboration

Achieving more efficient human-robot collaboration in dynamic work environments, driving transformation in smart manufacturing and service industries.

Abstract

Vision-Language-Action (VLA) models excel in static manipulation but struggle in dynamic environments with moving targets. This performance gap primarily stems from a scarcity of dynamic manipulation datasets and the reliance of mainstream VLAs on single-frame observations, restricting their spatiotemporal reasoning capabilities. To address this, we introduce DOMINO, a large-scale dataset and benchmark for generalizable dynamic manipulation, featuring 35 tasks with hierarchical complexities, over 110K expert trajectories, and a multi-dimensional evaluation suite. Through comprehensive experiments, we systematically evaluate existing VLAs on dynamic tasks, explore effective training strategies for dynamic awareness, and validate the generalizability of dynamic data. Furthermore, we propose PUMA, a dynamics-aware VLA architecture. By integrating scene-centric historical optical flow and specialized world queries to implicitly forecast object-centric future states, PUMA couples history-aware perception with short-horizon prediction. Results demonstrate that PUMA achieves state-of-the-art performance, yielding a 6.3% absolute improvement in success rate over baselines. Moreover, we show that training on dynamic data fosters robust spatiotemporal representations that transfer to static tasks. All code and data are available at https://github.com/H-EmbodVis/DOMINO.

cs.CV cs.RO
