OmniStream: Mastering Perception, Reconstruction and Action in Continuous Streams
OmniStream achieves perception, reconstruction, and action in visual streams using causal spatiotemporal attention and 3D-RoPE, excelling across 29 datasets.
Key Findings
Methodology
OmniStream employs a unified streaming visual backbone using causal spatiotemporal attention and 3D Rotary Positional Embeddings (3D-RoPE) for efficient frame-by-frame online processing of video streams. Pre-trained on 29 datasets, it integrates static and temporal representation learning, streaming geometric reconstruction, and vision-language alignment. Key components include a persistent KV-cache, a lightweight autoregressive language decoder, and dual DPT modules for predicting depth maps, ray maps, and camera poses.
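The streaming design described above can be sketched as a per-frame attention step over a persistent key/value cache. The function name, shapes, and single-head layout below are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def stream_frame(q, k, v, cache):
    """One streaming step: append the new frame's keys/values to the
    persistent cache, then attend from the new frame's queries over the
    full cached history, so past frames are never re-encoded."""
    cache["k"] = k if cache["k"] is None else np.concatenate([cache["k"], k])
    cache["v"] = v if cache["v"] is None else np.concatenate([cache["v"], v])
    scores = q @ cache["k"].T / np.sqrt(q.shape[-1])          # (new, total)
    weights = np.exp(scores - scores.max(-1, keepdims=True))  # stable softmax
    weights /= weights.sum(-1, keepdims=True)
    return weights @ cache["v"]                               # (new, dim)
```

Because each incoming frame only attends to the accumulated cache, per-frame cost grows with the length of the history rather than requiring attention to be re-run over the entire clip.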
Key Results
- OmniStream excels in image and video probing, streaming geometric reconstruction, complex video and spatial reasoning, and robotic manipulation tasks, even with a frozen backbone. For instance, it achieves 68.5% accuracy on the SSv2 dataset, significantly outperforming DINOv3's 54.0%.
- In online 3D reconstruction tasks, OmniStream performs exceptionally well, achieving absolute relative errors of 0.314, 0.072, and 0.136 on the Sintel, BONN, and KITTI datasets, respectively.
- In VLM and VLA tasks, OmniStream demonstrates strong spatial reasoning capabilities, achieving a leading score of 70.6% on the VSI-Bench benchmark, surpassing many specialized baselines equipped with additional geometry encoders.
Significance
OmniStream's significance lies in its ability to unify perception, reconstruction, and action in visual streams, overcoming the fragmentation of current vision foundation models. By employing causal spatiotemporal attention and 3D-RoPE, OmniStream enables efficient online inference without modifying the backbone. This capability is crucial for general-purpose visual understanding in interactive and embodied agents, providing consistent representations across image, video, geometric, and language tasks and advancing the broader field of computer vision.
Technical Contribution
OmniStream's technical contributions include a unified streaming visual backbone capable of generalizing across semantic, spatial, and temporal reasoning without fine-tuning the backbone. By introducing causal spatiotemporal attention and 3D-RoPE, OmniStream achieves strict temporal causality while preserving spatial priors. Additionally, the multi-task pre-training framework produces synergistic gains, allowing the model to excel on diverse objectives rather than trading them off against one another.
Novelty
OmniStream is the first to apply causal spatiotemporal attention and 3D Rotary Positional Embeddings to a visual streaming backbone, addressing the fragmentation in semantic, temporal, and spatial geometry. Unlike existing work, OmniStream demonstrates its versatility and efficiency without relying on benchmark-specific dominance, providing a more meaningful path toward general-purpose visual understanding.
Limitations
- OmniStream may experience performance degradation when handling very long video sequences, as its pre-training temporal window is fixed at 16 frames.
- In certain complex geometric reasoning tasks, OmniStream may not fully replace specialized geometric expert models.
- Due to the model's complexity, training and inference are computationally expensive, potentially unsuitable for resource-constrained environments.
Future Work
Future research directions include optimizing OmniStream's performance on long sequences, exploring more efficient causal spatiotemporal attention mechanisms, and adapting the model to resource-constrained environments. Further investigation into enhancing the model's geometric reasoning capabilities without increasing computational costs is also needed.
AI Executive Summary
Modern visual agents require representations that are general, causal, and physically structured to operate in real-time streaming environments. However, current vision foundation models remain fragmented, specializing narrowly in image semantic perception, offline temporal modeling, or spatial geometry. OmniStream introduces a unified streaming visual backbone that effectively perceives, reconstructs, and acts from diverse visual inputs by incorporating causal spatiotemporal attention and 3D Rotary Positional Embeddings (3D-RoPE).
OmniStream is pre-trained on 29 datasets, coupling static and temporal representation learning, streaming geometric reconstruction, and vision-language alignment. Its design includes a persistent KV-cache and a lightweight autoregressive language decoder, supporting efficient frame-by-frame online processing of video streams. Extensive evaluations show that, even with a strictly frozen backbone, OmniStream achieves consistently competitive performance across image and video probing, streaming geometric reconstruction, complex video and spatial reasoning, and robotic manipulation tasks.
The core technical principles of OmniStream are causal spatiotemporal attention and 3D-RoPE, which together enable efficient online inference without modifying the backbone. The multi-task pre-training framework produces synergistic gains, allowing the model to excel on diverse objectives rather than trading them off against one another.
In experiments, OmniStream achieves 68.5% accuracy on the SSv2 dataset, significantly outperforming DINOv3's 54.0%. In online 3D reconstruction tasks, OmniStream achieves absolute relative errors of 0.314, 0.072, and 0.136 on the Sintel, BONN, and KITTI datasets, respectively. Additionally, in VLM and VLA tasks, OmniStream demonstrates strong spatial reasoning capabilities, achieving a leading score of 70.6% on the VSI-Bench benchmark.
OmniStream's significance lies in its ability to unify perception, reconstruction, and action in visual streams, overcoming the fragmentation of current vision foundation models. This capability is crucial for general-purpose visual understanding in interactive and embodied agents, providing consistent representations across image, video, geometric, and language tasks and advancing the broader field of computer vision.
Despite its versatility and efficiency, OmniStream may experience performance degradation when handling very long video sequences. Additionally, due to the model's complexity, training and inference are computationally expensive, potentially unsuitable for resource-constrained environments. Future research directions include optimizing OmniStream's performance on long sequences, exploring more efficient causal spatiotemporal attention mechanisms, and adapting the model to resource-constrained environments.
Deep Analysis
Background
Visual agents are increasingly deployed in real-time streaming environments, from camera surveillance to augmented reality devices, which require them to update their beliefs online under tight latency and memory budgets. Traditional vision foundation models often focus on image semantic perception, offline temporal modeling, or spatial geometry, leading to fragmentation within the field. In recent years, language models have achieved generality with a single autoregressive backbone trained under next-token prediction. However, vision tasks differ not only in supervision but also in the nature of their outputs, such as discrete labels, segmentation masks, dense depth, 3D geometry, and temporally evolving predictions. This has led to the emergence of specialized foundation models, such as image encoders, video models, and geometric experts. While effective in-domain, these models typically learn representations tailored to a narrow objective, making it difficult to transfer across tasks.
Core Problem
The core problem is how to unify perception, reconstruction, and action in the vision domain to achieve general-purpose visual understanding. Current vision foundation models are fragmented in semantic, temporal, and spatial geometry, making it challenging to generalize across tasks without fine-tuning the backbone. Additionally, existing models often rely on expensive re-training, re-tokenization of outputs, or architectural adjustments to the generative head, complicating the realization of unified visual representations. Therefore, the key research question is whether a representation can be learned that supports multiple downstream tasks without modifying or fine-tuning the backbone.
Innovation
OmniStream's core innovations include:
- A unified streaming visual backbone capable of generalizing across semantic, spatial, and temporal reasoning without fine-tuning the backbone.
- Causal spatiotemporal attention, which enforces strict temporal causality and enables efficient frame-by-frame inference via a persistent KV-cache, avoiding re-computation over past frames.
- 3D Rotary Positional Embeddings (3D-RoPE), which extend 2D RoPE to the spatiotemporal domain and enhance the model's ability to handle long video streams.
- A unified multi-task pre-training framework that couples static and temporal representation learning, streaming geometric reconstruction, and vision-language alignment, encouraging representations that are temporally coherent, geometrically grounded, and language-aligned.
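The strict-causality constraint can be expressed as a block-causal attention mask: full spatial attention among tokens within a frame, causal attention across frames. A minimal sketch, assuming a flat token layout where each frame contributes a fixed number of consecutive tokens (the function name is hypothetical):

```python
import numpy as np

def causal_spatiotemporal_mask(num_frames: int, tokens_per_frame: int) -> np.ndarray:
    """Boolean mask where entry [i, j] is True when token i may attend to
    token j: tokens see every token in their own frame and in earlier
    frames, but never in future frames."""
    n = num_frames * tokens_per_frame
    frame_idx = np.arange(n) // tokens_per_frame  # frame index of each token
    return frame_idx[:, None] >= frame_idx[None, :]
```

Note the mask is block-triangular rather than strictly triangular: within a frame, attention is bidirectional, so spatial priors are preserved while temporal causality still holds.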
Methodology
OmniStream's methodology includes the following key steps:
- Causal spatiotemporal attention with a persistent KV-cache for efficient frame-by-frame inference, avoiding re-computation over past frames. Input: current frame and historical context; Output: composite output state.
- 3D Rotary Positional Embeddings (3D-RoPE), extending 2D RoPE to the spatiotemporal domain to enhance the model's ability to handle long video streams. Input: non-overlapping patches per frame; Output: dense spatiotemporal feature maps and global semantics.
- A unified multi-task pre-training framework that couples static and temporal representation learning, streaming geometric reconstruction, and vision-language alignment. Input: set of views (global/local crops from an image or video clip); Output: global semantic consistency and patch-level discriminative features.
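The 3D-RoPE step can be sketched by splitting each feature vector into three groups and rotating each group by one of the token's (t, h, w) coordinates. The equal three-way split and base frequency below are illustrative assumptions, not the paper's exact parameterization:

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Standard 1D rotary embedding: rotate feature pairs by angles
    proportional to the position `pos`."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def rope_3d(x, t, h, w):
    """Sketch of extending RoPE to space-time: one third of the feature
    dim encodes the frame index t, the other thirds the spatial h, w."""
    g = x.shape[-1] // 3
    return np.concatenate([
        rope_rotate(x[..., :g], t),
        rope_rotate(x[..., g:2 * g], h),
        rope_rotate(x[..., 2 * g:], w),
    ], axis=-1)
```

As with ordinary RoPE, the transform is a pure rotation, so feature norms are preserved and relative offsets in time and space appear as phase differences between tokens.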
Experiments
The experimental design includes multi-task pre-training on 29 datasets covering static images, dynamic videos, and geometric 3D/4D scenes. Baselines include DINOv3, V-JEPA, and CUT3R, among others. Evaluation metrics include image classification accuracy, video action recognition accuracy, and absolute relative error for video depth estimation. Key hyperparameters include a sequence length of T=16, the Adam optimizer, and a learning rate of 1e-4. Ablation studies analyze the impact of causal spatiotemporal attention and 3D-RoPE.
Results
Results analysis shows that OmniStream excels in multiple tasks. It achieves 68.5% accuracy on the SSv2 dataset, significantly outperforming DINOv3's 54.0%. In online 3D reconstruction tasks, OmniStream achieves absolute relative errors of 0.314, 0.072, and 0.136 on the Sintel, BONN, and KITTI datasets, respectively. Additionally, in VLM and VLA tasks, OmniStream demonstrates strong spatial reasoning capabilities, achieving a leading score of 70.6% on the VSI-Bench benchmark.
Applications
OmniStream's application scenarios include:
- Real-time video surveillance: efficient frame-by-frame processing for real-time monitoring and analysis of dynamic scenes.
- Augmented reality devices: real-time updates and interactions from the user's perspective, enhancing user experience.
- Robotic manipulation: a unified visual stream representation supporting complex manipulation tasks for more efficient task execution.
Limitations & Outlook
OmniStream's limitations and outlook include:
- Performance degradation when handling very long video sequences, as the pre-training temporal window is fixed at 16 frames.
- In certain complex geometric reasoning tasks, OmniStream may not fully replace specialized geometric expert models.
- Due to the model's complexity, training and inference are computationally expensive, potentially unsuitable for resource-constrained environments.
Future research directions include optimizing performance on long sequences, exploring more efficient causal spatiotemporal attention mechanisms, and adapting the model to resource-constrained environments.
Plain Language (Accessible to non-experts)
Imagine you're in a kitchen cooking a meal. OmniStream is like a multi-functional kitchen assistant that not only helps you chop vegetables but also monitors the cooking process and even adjusts the heat when needed. Traditional kitchen assistants might focus on one task, like chopping or stirring, but OmniStream can handle multiple tasks simultaneously, like an all-in-one chef. It uses something called causal spatiotemporal attention to ensure each step is based on previous ones, not predicting future steps. It's like cooking where you decide the next step based on what's already done, not guessing what's next. OmniStream also uses 3D Rotary Positional Embeddings to keep track of the kitchen's spatial layout, like an assistant who always knows where each ingredient and pan is. With these technologies, OmniStream can help you complete the entire cooking process, from preparing ingredients to finishing the meal, without needing to be retrained.
ELI14 (Explained like you're 14)
Hey there! Imagine you're playing a super cool game where you control multiple characters, each with different tasks. OmniStream is like a super-smart game assistant that helps you manage all the characters' tasks at once, so you don't have to worry about each one individually. It has a skill called causal spatiotemporal attention, ensuring each character's actions are based on past situations, not guessing what will happen next. Just like in a game, you decide the next move based on what's already happened, not imagining things. OmniStream also uses 3D Rotary Positional Embeddings to help you better understand the game's spatial layout, like a smart assistant planning the game map. This way, you can easily tackle various challenges in the game without spending too much time adjusting each character's tasks. Isn't that cool?
Glossary
OmniStream
A unified streaming visual backbone capable of generalizing across semantic, spatial, and temporal reasoning without fine-tuning the backbone.
OmniStream achieves perception, reconstruction, and action in visual streams using causal spatiotemporal attention and 3D-RoPE.
Causal Spatiotemporal Attention
An attention mechanism that ensures the model relies only on past and present frames during inference, avoiding future frame predictions.
OmniStream employs causal spatiotemporal attention for efficient frame-by-frame online processing.
3D Rotary Positional Embeddings (3D-RoPE)
A technique that extends 2D RoPE to the spatiotemporal domain, enhancing the model's ability to handle long video streams.
OmniStream uses 3D-RoPE to enhance its handling of long video streams.
KV-cache
A persistent cache mechanism used to store keys and values from past frames, avoiding redundant computations.
OmniStream uses a persistent KV-cache for efficient frame-by-frame inference.
Multi-task Pre-training Framework
A framework that integrates static and temporal representation learning, streaming geometric reconstruction, and vision-language alignment.
OmniStream is pre-trained on 29 datasets using a multi-task pre-training framework.
Vision-Language Alignment
A technique that aligns visual representations with linguistic concepts, enhancing the model's semantic understanding.
OmniStream achieves better semantic understanding through vision-language alignment.
Dual DPT Modules
Modules used for predicting depth maps, ray maps, and camera poses.
OmniStream uses dual DPT modules for streaming geometric reconstruction.
Autoregressive Language Decoder
A lightweight decoder that connects visual tokens with linguistic concepts.
OmniStream uses an autoregressive language decoder for vision-language alignment.
Ablation Study
A method of analyzing the impact of model components on overall performance by gradually removing them.
OmniStream's ablation studies analyze the impact of causal spatiotemporal attention and 3D-RoPE.
VSI-Bench Benchmark
A benchmark used to evaluate a model's spatial intelligence.
OmniStream achieves a leading score of 70.6% on the VSI-Bench benchmark.
Open Questions (Unanswered questions from this research)
1. OmniStream may experience performance degradation when handling very long video sequences, as its pre-training temporal window is fixed at 16 frames. Future research needs to explore how to optimize the model's performance on long sequences without increasing computational costs.
2. In certain complex geometric reasoning tasks, OmniStream may not fully replace specialized geometric expert models. This indicates a need for further research into enhancing the model's geometric reasoning capabilities.
3. Due to the model's complexity, training and inference are computationally expensive, potentially unsuitable for resource-constrained environments. Future research can explore how to reduce computational costs without compromising performance.
4. OmniStream may underperform in certain specific vision-language tasks, particularly those requiring highly fine-grained semantic understanding. Further research is needed to enhance the model's semantic understanding capabilities.
5. Despite OmniStream's strong performance across multiple tasks, it may still require fine-tuning in certain specific application scenarios. This indicates a need for further research into improving the model's generalization capabilities.
Applications
Immediate Applications
Real-time Video Surveillance
OmniStream enables efficient frame-by-frame processing for real-time monitoring and analysis of dynamic scenes, suitable for security and traffic surveillance.
Augmented Reality Devices
OmniStream supports real-time updates and interactions from the user's perspective, enhancing user experience in AR glasses and mobile devices.
Robotic Manipulation
OmniStream supports complex robotic manipulation tasks through a unified visual stream representation, achieving more efficient task execution in industrial automation and household robots.
Long-term Vision
Smart Cities
OmniStream's real-time monitoring capabilities enable intelligent management and optimization of urban infrastructure, enhancing city operational efficiency.
Autonomous Driving
OmniStream can be used in the perception systems of autonomous vehicles, enhancing their understanding and decision-making capabilities in complex environments for safer autonomous driving.
Abstract
Modern visual agents require representations that are general, causal, and physically structured to operate in real-time streaming environments. However, current vision foundation models remain fragmented, specializing narrowly in image semantic perception, offline temporal modeling, or spatial geometry. This paper introduces OmniStream, a unified streaming visual backbone that effectively perceives, reconstructs, and acts from diverse visual inputs. By incorporating causal spatiotemporal attention and 3D rotary positional embeddings (3D-RoPE), our model supports efficient, frame-by-frame online processing of video streams via a persistent KV-cache. We pre-train OmniStream using a synergistic multi-task framework coupling static and temporal representation learning, streaming geometric reconstruction, and vision-language alignment on 29 datasets. Extensive evaluations show that, even with a strictly frozen backbone, OmniStream achieves consistently competitive performance with specialized experts across image and video probing, streaming geometric reconstruction, complex video and spatial reasoning, as well as robotic manipulation (unseen at training). Rather than pursuing benchmark-specific dominance, our work demonstrates the viability of training a single, versatile vision backbone that generalizes across semantic, spatial, and temporal reasoning, i.e., a more meaningful step toward general-purpose visual understanding for interactive and embodied agents.
References (20)
DINOv3
Oriane Siméoni, Huy V. Vo, Maximilian Seitzer et al.
V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
Mahmoud Assran, Adrien Bardes, David Fan et al.
LLaVA-OneVision: Easy Visual Task Transfer
Bo Li, Yuanhan Zhang, Dong Guo et al.
VGGT: Visual Geometry Grounded Transformer
Jianyuan Wang, Minghao Chen, Nikita Karaev et al.
LLaVA-Video: Video Instruction Tuning With Synthetic Data
Yuanhan Zhang, Jinming Wu, Wei Li et al.
Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks
Jiasen Lu, Christopher Clark, Rowan Zellers et al.
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
Shengbang Tong, Ellis Brown, Penghao Wu et al.
How far are we to GPT-4V? Closing the gap to commercial multimodal models with open-source suites
Zhe Chen, Weiyun Wang, Hao Tian et al.
Towards Visuospatial Cognition via Hierarchical Fusion of Visual Experts
Qi Feng
SpaceMind: Camera-Guided Modality Fusion for Spatial Reasoning in Vision-Language Models
Ruosen Zhao, Zhikang Zhang, Jialei Xu et al.
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Tri Dao, Daniel Y. Fu, Stefano Ermon et al.
ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning
Chi-Pin Huang, Yueh-Hua Wu, Min-Hung Chen et al.
Perception Encoder: The best visual embeddings are not at the output of the network
Daniel Bolya, Po-Yao Huang, Peize Sun et al.
VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking
Limin Wang, Bingkun Huang, Zhiyu Zhao et al.
Visual Instruction Tuning
Haotian Liu, Chunyuan Li, Qingyang Wu et al.
VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction
Zhiwen Fan, Jian Zhang, Renjie Li et al.
DL3DV-10K: A Large-Scale Scene Dataset for Deep Learning-based 3D Vision
Lu Ling, Yichen Sheng, Zhi Tu et al.
Emerging Properties in Self-Supervised Vision Transformers
Mathilde Caron, Hugo Touvron, Ishan Misra et al.
Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data
Lihe Yang, Bingyi Kang, Zilong Huang et al.
DeepSeek-VL: Towards Real-World Vision-Language Understanding
Haoyu Lu, Wen Liu, Bo Zhang et al.