OmniVTA: Visuo-Tactile World Modeling for Contact-Rich Robotic Manipulation

TL;DR

OmniVTA combines predictive contact modeling with high-frequency tactile feedback to advance contact-rich robotic manipulation.

cs.RO · 2026-03-20
Yuhang Zheng, Songen Gu, Weize Li, Yupeng Zheng, Yujie Zang, Shuai Tian, Xiang Li, Ruihai Wu, Ce Hao, Chen Gao, Si Liu, Haoran Li, Yilun Chen, Shuicheng Yan, Wenchao Ding
robotic manipulation · visuo-tactile world modeling · dataset · closed-loop control

Key Findings

Methodology

OmniVTA is a world-model-based visuo-tactile manipulation framework integrating four tightly coupled modules: a self-supervised tactile encoder, a two-stream visuo-tactile world model, a contact-aware fusion policy, and a 60Hz reflexive controller. The tactile encoder extracts features from tactile signals, the world model predicts short-horizon contact evolution, the fusion policy generates actions, and the reflexive controller corrects deviations between predicted and observed tactile signals in a closed loop.
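The dataflow through the four modules can be sketched as a simple pipeline. Everything below is an illustrative stand-in, not the paper's architecture: the function bodies, feature dimensions, and the 6-DoF action size are assumptions.

```python
import numpy as np

# Hypothetical sketch of the OmniVTA dataflow; module internals and
# dimensions are illustrative, not the paper's actual architecture.

def tactile_encoder(raw_tactile: np.ndarray) -> np.ndarray:
    """Self-supervised encoder: raw tactile frame -> feature vector (stub)."""
    return raw_tactile.reshape(-1)[:64]

def world_model(vis_feat: np.ndarray, tac_feat: np.ndarray, horizon: int = 5):
    """Two-stream world model: predict short-horizon contact evolution (stub)."""
    fused = np.concatenate([vis_feat, tac_feat])
    return [fused * 0.9 ** t for t in range(horizon)]  # decaying rollout stand-in

def fusion_policy(predicted_contacts) -> np.ndarray:
    """Contact-aware policy: map predicted contact states to an action (stub)."""
    return np.mean(predicted_contacts, axis=0)[:6]  # assumed 6-DoF command

raw_tactile = np.random.rand(16, 16)   # e.g. a 16x16 taxel array (assumed shape)
vis_feat = np.random.rand(64)
tac_feat = tactile_encoder(raw_tactile)
action = fusion_policy(world_model(vis_feat, tac_feat))
print(action.shape)  # (6,)
```

The point of the sketch is only the interface: tactile features and visual features feed the world model, whose rollout conditions the policy; the reflexive controller then corrects the resulting action at 60Hz.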

Key Results

  • OmniVTA outperforms existing methods in real-robot experiments across all six interaction categories and generalizes strongly to unseen objects and geometric configurations, highlighting the value of combining predictive contact modeling with high-frequency tactile feedback.
  • Trained on the large-scale OmniViTac dataset with 21,000+ trajectories, 86 tasks, and 100+ objects, OmniVTA achieves significant performance improvements across multiple tasks.
  • Ablation studies confirm the contribution of each module to overall performance, particularly the critical role of the 60Hz reflexive controller in enhancing manipulation precision.

Significance

OmniVTA addresses long-standing limitations in visuo-tactile manipulation, such as small dataset sizes and narrow task coverage, by explicitly modeling contact dynamics and enabling closed-loop control. This framework advances academic research in robotic manipulation and offers more efficient automation solutions for industry, especially in precision-demanding scenarios.

Technical Contribution

OmniVTA differs from state-of-the-art methods by pairing a large-scale visuo-tactile dataset with a world model, opening new engineering possibilities for contact-rich manipulation. Its self-supervised tactile encoder and two-stream world model offer a new perspective on contact dynamics modeling, while the 60Hz reflexive controller substantially improves manipulation precision.

Novelty

OmniVTA is the first framework to combine a large-scale visuo-tactile dataset with a world model, distinguishing itself from previous works that treat tactile signals as passive observations. Its innovation lies in achieving closed-loop control through high-frequency feedback, enhancing manipulation precision.

Limitations

  • OmniVTA may struggle in extremely complex contact scenarios, particularly those involving rapidly changing friction and forces, due to current limitations in multi-modal data fusion and high-frequency feedback control.
  • The system's computational cost is high due to the hardware requirements for high-frequency feedback, potentially limiting its application in resource-constrained environments.
  • The current framework still has room for improvement in multi-modal data fusion, especially when dealing with noise and uncertainty.

Future Work

Future work could include optimizing the system's computational efficiency for more resource-constrained environments, expanding the dataset to cover more complex contact scenarios, and further exploring multi-modal data fusion methods to enhance system robustness and adaptability.

AI Executive Summary

In robotic manipulation, contact-rich tasks such as wiping and assembly require accurate perception of contact forces, friction changes, and state transitions, which cannot be reliably inferred from vision alone. Despite growing interest in visuo-tactile manipulation, progress has been constrained by small datasets with narrow task coverage, and by methods that treat tactile signals as passive observations rather than explicitly modeling contact dynamics or enabling closed-loop control.

The OmniVTA framework addresses these issues by building on a large-scale visuo-tactile-action dataset, OmniViTac, which comprises over 21,000 trajectories across 86 tasks and more than 100 objects, organized into six physics-grounded interaction patterns. OmniVTA integrates four tightly coupled modules: a self-supervised tactile encoder, a two-stream visuo-tactile world model, a contact-aware fusion policy, and a 60Hz reflexive controller.

The self-supervised tactile encoder extracts features from tactile signals, the two-stream world model predicts short-horizon contact evolution, the fusion policy generates actions, and the reflexive controller corrects deviations between predicted and observed tactile signals in a closed loop. This design enables OmniVTA to achieve higher precision and stability in contact-rich manipulation tasks.

In real-robot experiments, OmniVTA outperforms existing methods across all six interaction categories, demonstrating strong generalization to unseen objects and geometric configurations. This result highlights the value of combining predictive contact modeling with high-frequency tactile feedback, significantly enhancing the performance of robots in complex manipulation tasks.

However, OmniVTA also has limitations, such as potential struggles in extremely complex contact scenarios and high computational costs. Future research directions include optimizing the system's computational efficiency, expanding the dataset to cover more complex contact scenarios, and further exploring multi-modal data fusion methods.

Deep Analysis

Background

The development of robotic manipulation technology has evolved from simple visual perception to multi-modal perception. Initially, robots relied primarily on visual information for environmental perception and task execution. However, as task complexity increased, visual information alone became insufficient. In recent years, visuo-tactile manipulation has become a research hotspot, with representative efforts including the TACTO tactile simulator and GelSight vision-based tactile sensors, which improve manipulation precision by combining visual and tactile information. However, these methods generally suffer from small dataset sizes and narrow task coverage, limiting their application in complex manipulation tasks.

Core Problem

Contact-rich manipulation tasks require accurate perception of contact forces, friction changes, and state transitions, which cannot be reliably inferred from vision alone. Existing visuo-tactile manipulation methods often treat tactile signals as passive observations, failing to fully utilize them for explicitly modeling contact dynamics or achieving closed-loop control. Additionally, existing datasets are small and task coverage is narrow, limiting the model's generalization ability and applicability.

Innovation

The core innovations of OmniVTA include:

1) The first integration of a large-scale visuo-tactile dataset with a world model, establishing a data and modeling foundation for contact-rich manipulation.

2) The design of a self-supervised tactile encoder and a two-stream visuo-tactile world model, offering new perspectives for contact dynamics modeling.

3) The introduction of a 60Hz reflexive controller that achieves closed-loop control through high-frequency feedback, significantly enhancing manipulation precision.

These innovations enable OmniVTA to achieve higher precision and stability in complex manipulation tasks.

Methodology

The implementation of OmniVTA involves the following key steps:

  • Self-supervised tactile encoder: extracts features from tactile signals; input is raw tactile data, output is encoded features.
  • Two-stream visuo-tactile world model: predicts short-horizon contact evolution; input is the visual and tactile encoded features, output is a contact state prediction.
  • Contact-aware fusion policy: generates actions; input is the contact state prediction, output is control commands.
  • 60Hz reflexive controller: corrects deviations between predicted and observed tactile signals; input is the current and predicted tactile signals, output is corrected control commands.
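The reflexive correction step above can be sketched as a simple rule that subtracts a gain-scaled tactile prediction error from the policy action. The gain, signal shapes, and the subtraction form are assumptions for illustration, not the paper's controller.

```python
import numpy as np

# Illustrative 60 Hz reflexive correction loop. DT and K_REFLEX are
# hypothetical; a real controller would run this every control tick.

DT = 1.0 / 60.0          # control period for a 60 Hz loop
K_REFLEX = 0.5           # hypothetical correction gain

def reflex_correct(action, tactile_pred, tactile_obs, gain=K_REFLEX):
    """Correct the policy action using the predicted-vs-observed tactile error."""
    error = tactile_obs - tactile_pred
    return action - gain * error

action = np.zeros(3)
pred = np.array([0.2, 0.2, 0.2])
obs = np.array([0.4, 0.2, 0.0])   # stronger contact on one axis, weaker on another
corrected = reflex_correct(action, pred, obs)
print(corrected)  # pushes away from over-contact, toward lost contact
```

The design intuition is that the world model supplies the expected tactile signal; any mismatch is treated as an immediate error to react to, without waiting for the slower policy.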

Experiments

The experimental design includes training and evaluation on the OmniViTac dataset, which contains over 21,000 trajectories covering 86 tasks and more than 100 objects. Baselines include TACTO- and GelSight-based methods. Evaluation metrics include manipulation precision, task completion rate, and generalization ability. Key hyperparameters include the learning rate of the tactile encoder and the feedback frequency of the reflexive controller. Ablation studies verify the contribution of each module to overall performance.

Results

Experimental results show that OmniVTA outperforms existing methods across all six interaction categories, particularly in terms of manipulation precision and task completion rate. Specifically, OmniVTA demonstrates significantly better generalization to unseen objects and geometric configurations compared to baseline methods. Ablation studies reveal the critical role of the 60Hz reflexive controller in enhancing manipulation precision, with a significant performance drop observed when this module is removed.

Applications

OmniVTA can be applied in precision-demanding scenarios such as industrial assembly, medical robotics, and service robots. Its high precision and stability make it suitable for complex contact tasks. The industry can leverage OmniVTA to improve the efficiency of automated production lines, while medical robots can achieve more precise operations during surgeries.

Limitations & Outlook

OmniVTA may struggle in extremely complex contact scenarios, particularly those involving rapidly changing friction and forces. Additionally, the system's computational cost is high due to the hardware requirements for high-frequency feedback, potentially limiting its application in resource-constrained environments. Future improvements could include optimizing the system's computational efficiency, expanding the dataset to cover more complex contact scenarios, and further exploring multi-modal data fusion methods.

Plain Language (accessible to non-experts)

Imagine you're in a kitchen cooking. OmniVTA is like a super-intelligent kitchen assistant that not only sees what you're doing but also feels every move you make. For example, when you're chopping vegetables, it can sense the contact force between the knife and the cutting board, knowing when to apply pressure and when to be gentle. It's like having a tactile-aware robotic assistant that helps you complete various complex tasks in the kitchen. By combining visual and tactile information, it can better understand the kitchen environment, ensuring every action is precise and accurate. Just like an experienced chef, it can provide assistance when needed, making your cooking process smoother and more efficient.

ELI14 (explained like you're 14)

Hey there! Imagine you're playing a super cool game where your character is a robot that needs to complete various tasks, like wiping a table or assembling toys. OmniVTA is like your game's ultimate cheat code. It not only sees what you're doing but also feels every move you make. For instance, when you press a button hard in the game, it can sense the pressure, knowing when to press hard and when to be gentle. This way, you can complete tasks more accurately and score higher! OmniVTA is like your secret weapon, making you unbeatable in the game. Isn't that awesome?

Glossary

OmniVTA

OmniVTA is a world-model-based visuo-tactile manipulation framework that integrates a self-supervised tactile encoder, a two-stream visuo-tactile world model, a contact-aware fusion policy, and a 60Hz reflexive controller.

OmniVTA is used to perform contact-rich manipulation tasks.

OmniViTac

OmniViTac is a large-scale visuo-tactile-action dataset comprising over 21,000 trajectories, 86 tasks, and more than 100 objects, organized into six physics-grounded interaction patterns.

OmniViTac is used for training and evaluating the OmniVTA framework.

Self-supervised tactile encoder

The self-supervised tactile encoder extracts features from tactile signals, with input as raw tactile data and output as encoded features.

In the OmniVTA framework, the tactile encoder is a key module.
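One common self-supervised objective for such an encoder is masked reconstruction: hide part of a tactile frame and score how well the hidden taxels are recovered. The sketch below shows only the objective; the masking ratio, frame shape, and the trivial "reconstruction" are assumptions, since the paper's actual pretraining recipe is not given here.

```python
import numpy as np

# Minimal sketch of a masked-reconstruction objective for tactile signals.
# A learned encoder/decoder would produce `recon`; here the corrupted frame
# itself stands in, so the loss is simply the error on the hidden taxels.

rng = np.random.default_rng(0)
frame = rng.random((16, 16))              # one tactile frame (assumed taxel grid)
mask = rng.random(frame.shape) < 0.25     # hide ~25% of taxels
corrupted = np.where(mask, 0.0, frame)

recon = corrupted.copy()                  # stand-in for a learned reconstruction
loss = float(np.mean((recon[mask] - frame[mask]) ** 2))
print(round(loss, 3))
```

Training would minimize this loss over many frames, forcing the encoder to capture contact structure well enough to fill in missing taxels.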

Two-stream visuo-tactile world model

The two-stream visuo-tactile world model predicts short-horizon contact evolution, with input as visual and tactile encoded features and output as contact state prediction.

This model is an important component of the OmniVTA framework.
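A two-stream predictor of this kind can be caricatured as keeping separate visual and tactile latents and rolling the tactile stream forward under visual conditioning. The linear transition matrices below are random stand-ins for learned weights; the latent size and horizon are arbitrary.

```python
import numpy as np

# Hedged sketch: roll a tactile latent forward with a linear transition,
# conditioned on a (frozen) visual latent. Not the paper's architecture.

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 8)) * 0.1   # tactile transition (stand-in weights)
B = rng.standard_normal((8, 8)) * 0.1   # cross-stream visual influence

def rollout(tac_z, vis_z, steps=4):
    """Predict a short horizon of tactile latents conditioned on vision."""
    out = []
    for _ in range(steps):
        tac_z = np.tanh(A @ tac_z + B @ vis_z)
        out.append(tac_z)
    return out

preds = rollout(rng.standard_normal(8), rng.standard_normal(8))
print(len(preds), preds[0].shape)  # 4 (8,)
```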

Contact-aware fusion policy

The contact-aware fusion policy generates actions, with input as contact state prediction and output as control commands.

In the OmniVTA framework, the fusion policy is crucial for achieving closed-loop control.

60Hz reflexive controller

The 60Hz reflexive controller corrects deviations between predicted and observed tactile signals, with input as current tactile signal and predicted signal, and output as corrected control commands.

This controller significantly enhances OmniVTA's manipulation precision.

Visuo-tactile manipulation

Visuo-tactile manipulation refers to methods that combine visual and tactile information for robotic manipulation, aiming to improve precision and stability.

OmniVTA is a recent advancement in visuo-tactile manipulation.

Closed-loop control

Closed-loop control is a method that corrects system output through feedback signals, ensuring stability and precision in dynamic environments.

OmniVTA achieves closed-loop control through the 60Hz reflexive controller.
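The idea of closed-loop control in general, independent of any OmniVTA specifics, can be shown with a one-line proportional controller: measure the error between output and target each step and apply a corrective action.

```python
# A minimal closed-loop (feedback) control example: a proportional
# controller drives a 1-D state toward a setpoint using the measured
# error at every step. Gain kp = 0.5 is arbitrary.

def step(state, setpoint, kp=0.5):
    error = setpoint - state          # feedback: compare output to target
    return state + kp * error         # corrective action

state = 0.0
for _ in range(20):
    state = step(state, setpoint=1.0)
print(round(state, 4))  # converges to ~1.0
```

An open-loop system would instead execute a precomputed action sequence and drift under disturbance; the feedback term is what keeps the output pinned to the target.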

Contact dynamics

Contact dynamics is the study of force and motion changes during object contact, involving friction, force transmission, etc.

OmniVTA explicitly models contact dynamics through the two-stream world model.

Multi-modal data fusion

Multi-modal data fusion combines data from different sensors to enhance system perception and decision-making accuracy.

OmniVTA achieves more efficient manipulation through the fusion of visual and tactile information.
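Two common fusion schemes make the idea concrete: early fusion concatenates features from each modality, while late fusion combines per-modality outputs with reliability weights. The vectors and weights below are purely illustrative.

```python
import numpy as np

# Sketch of early vs. late multi-modal fusion; the feature values and
# reliability weights are made up for illustration.

vis = np.array([0.9, 0.1])                   # visual features (assumed)
tac = np.array([0.2, 0.8])                   # tactile features (assumed)

early = np.concatenate([vis, tac])           # early fusion: stack features
w_vis, w_tac = 0.3, 0.7                      # hypothetical reliability weights
late = w_vis * vis + w_tac * tac             # late fusion: weighted combine

print(early.shape, late)
```

A learned system typically replaces the fixed weights with an attention or gating network so the reliability of each modality is estimated per time step.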

Open Questions (unanswered questions from this research)

  1. Current visuo-tactile manipulation methods struggle in extremely complex contact scenarios, particularly those involving rapidly changing friction and forces, due to limitations in multi-modal data fusion and high-frequency feedback control. Future advancements require developing more sophisticated algorithms to address these challenges.
  2. OmniVTA's application in resource-constrained environments is limited by its computational cost. Although its high-frequency feedback control significantly enhances manipulation precision, it also increases the system's computational burden. Future research needs to explore more efficient computational methods to reduce the system's resource requirements.
  3. The scale and diversity of existing datasets remain insufficient, limiting model generalization. Although OmniViTac is a large-scale dataset, it still needs expansion to cover more complex contact scenarios and tasks. Future efforts should focus on building larger and more diverse datasets.
  4. Methods for multi-modal data fusion still require improvement, especially when dealing with noise and uncertainty. Existing methods have limited robustness in these areas, and future work should develop more advanced fusion strategies to enhance system adaptability and stability.
  5. OmniVTA's framework design may lack flexibility in certain extreme cases, particularly when dealing with nonlinear and non-stationary contact dynamics. Future research could explore more flexible model structures to accommodate more complex contact scenarios.

Applications

Immediate Applications

Industrial Assembly

OmniVTA can be applied to industrial assembly lines, improving the precision and efficiency of automated production. By combining visual and tactile information, the system can better adapt to complex assembly tasks, reducing human intervention.

Medical Robotics

In the medical field, OmniVTA can be used in surgical robots to provide more precise operational support. Its high-frequency feedback control ensures that every action during surgery is accurate, reducing surgical risks.

Service Robots

OmniVTA can be used in service robots, such as home assistants, to help complete daily tasks. Its high precision and stability enable it to provide reliable service in complex home environments.

Long-term Vision

Smart Manufacturing

OmniVTA has great potential in smart manufacturing. By improving robot precision and stability in complex tasks, it can enable more efficient and flexible production processes, driving the intelligent transformation of manufacturing.

Human-Robot Collaboration

OmniVTA can promote the development of human-robot collaboration. By enhancing robots' environmental perception and decision-making precision, humans and robots can work more closely together to complete more complex tasks.

Abstract

Contact-rich manipulation tasks, such as wiping and assembly, require accurate perception of contact forces, friction changes, and state transitions that cannot be reliably inferred from vision alone. Despite growing interest in visuo-tactile manipulation, progress is constrained by two persistent limitations: existing datasets are small in scale and narrow in task coverage, and current methods treat tactile signals as passive observations rather than using them to model contact dynamics or enable closed-loop control explicitly. In this paper, we present OmniViTac, a large-scale visuo-tactile-action dataset comprising 21,000+ trajectories across 86 tasks and 100+ objects, organized into six physics-grounded interaction patterns. Building on this dataset, we propose OmniVTA, a world-model-based visuo-tactile manipulation framework that integrates four tightly coupled modules: a self-supervised tactile encoder, a two-stream visuo-tactile world model for predicting short-horizon contact evolution, a contact-aware fusion policy for action generation, and a 60Hz reflexive controller that corrects deviations between predicted and observed tactile signals in a closed loop. Real-robot experiments across all six interaction categories show that OmniVTA outperforms existing methods and generalizes well to unseen objects and geometric configurations, confirming the value of combining predictive contact modeling with high-frequency tactile feedback for contact-rich manipulation. All data, models, and code will be made publicly available on the project website at https://mrsecant.github.io/OmniVTA.
