OmniStream: Mastering Perception, Reconstruction and Action in Continuous Streams

TL;DR

OmniStream achieves perception, reconstruction, and action in visual streams using causal spatiotemporal attention and 3D-RoPE, excelling across 29 datasets.

cs.CV · Advanced · 2026-03-13
Yibin Yan, Jilan Xu, Shangzhe Di, Haoning Wu, Weidi Xie
visual streams · causal attention · 3D reconstruction · multi-task learning · vision-language alignment

Key Findings

Methodology

OmniStream employs a unified streaming visual backbone using causal spatiotemporal attention and 3D Rotary Positional Embeddings (3D-RoPE) for efficient frame-by-frame online processing of video streams. Pre-trained on 29 datasets, it integrates static and temporal representation learning, streaming geometric reconstruction, and vision-language alignment. Key components include a persistent KV-cache, a lightweight autoregressive language decoder, and dual DPT modules for predicting depth maps, ray maps, and camera poses.
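The frame-by-frame processing pattern described above can be sketched in a few lines. This is a minimal, single-head toy illustration with hypothetical names (not the paper's code): each incoming frame appends its keys and values to a persistent cache and attends over the full cached history, so past frames are never re-encoded.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class StreamingCausalAttention:
    """Toy single-head causal spatiotemporal attention with a persistent
    KV-cache: each new frame attends to itself and all cached past frames,
    so history is never re-computed (illustrative sketch only)."""

    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.Wq = rng.normal(size=(dim, dim)) / np.sqrt(dim)
        self.Wk = rng.normal(size=(dim, dim)) / np.sqrt(dim)
        self.Wv = rng.normal(size=(dim, dim)) / np.sqrt(dim)
        self.k_cache = []  # persistent keys, one entry per processed frame
        self.v_cache = []  # persistent values

    def step(self, frame_tokens):
        # frame_tokens: (num_patches, dim) for the current frame only
        q = frame_tokens @ self.Wq
        self.k_cache.append(frame_tokens @ self.Wk)
        self.v_cache.append(frame_tokens @ self.Wv)
        K = np.concatenate(self.k_cache)   # past + present keys
        V = np.concatenate(self.v_cache)
        attn = softmax(q @ K.T / np.sqrt(q.shape[-1]))
        return attn @ V                    # (num_patches, dim)

attn = StreamingCausalAttention(dim=8)
out1 = attn.step(np.ones((4, 8)))   # frame 1 attends only to itself
out2 = attn.step(np.ones((4, 8)))   # frame 2 attends to frames 1 and 2
```

The key property is that `step` only encodes the current frame; the cost of each step grows with cache length, but past frames are never re-processed.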

Key Results

  • OmniStream excels in image and video probing, streaming geometric reconstruction, complex video and spatial reasoning, and robotic manipulation tasks, even with a frozen backbone. For instance, it achieves 68.5% accuracy on the SSv2 dataset, significantly outperforming DINOv3's 54.0%.
  • In online 3D reconstruction tasks, OmniStream performs exceptionally well, achieving absolute relative errors of 0.314, 0.072, and 0.136 on the Sintel, BONN, and KITTI datasets, respectively.
  • In VLM and VLA tasks, OmniStream demonstrates strong spatial reasoning capabilities, achieving a leading score of 70.6% on the VSI-Bench benchmark, surpassing many specialized baselines equipped with additional geometry encoders.

Significance

OmniStream's significance lies in its ability to unify perception, reconstruction, and action in visual streams, overcoming the fragmentation of current vision foundation models. By employing causal spatiotemporal attention and 3D-RoPE, OmniStream enables efficient online inference without modifying the backbone. This capability is crucial for general-purpose visual understanding in interactive and embodied agents, providing consistent representations across image, video, geometric, and language tasks, thus advancing the field of vision.

Technical Contribution

OmniStream's technical contributions include a unified streaming visual backbone capable of generalizing across semantic, spatial, and temporal reasoning without fine-tuning the backbone. By introducing causal spatiotemporal attention and 3D-RoPE, OmniStream achieves strict temporal causality while preserving spatial priors. Additionally, the multi-task pre-training objectives reinforce one another, allowing a single model to excel on diverse objectives rather than trading them off.

Novelty

OmniStream is the first to apply causal spatiotemporal attention and 3D Rotary Positional Embeddings to a streaming visual backbone, addressing the field's fragmentation across semantic, temporal, and spatial-geometric modeling. Unlike existing work, OmniStream demonstrates versatility and efficiency without pursuing benchmark-specific dominance, offering a more meaningful path toward general-purpose visual understanding.

Limitations

  • OmniStream may experience performance degradation when handling very long video sequences, as its pre-training temporal window is fixed at 16 frames.
  • In certain complex geometric reasoning tasks, OmniStream may not fully replace specialized geometric expert models.
  • Due to the model's complexity, training and inference are computationally expensive, potentially unsuitable for resource-constrained environments.

Future Work

Future research directions include optimizing OmniStream's performance on long sequences, exploring more efficient causal spatiotemporal attention mechanisms, and extending the model to resource-constrained environments. Enhancing its geometric reasoning capabilities without increasing computational cost is also a key direction.

AI Executive Summary

Modern visual agents require representations that are general, causal, and physically structured to operate in real-time streaming environments. However, current vision foundation models remain fragmented, specializing narrowly in image semantic perception, offline temporal modeling, or spatial geometry. OmniStream introduces a unified streaming visual backbone that effectively perceives, reconstructs, and acts from diverse visual inputs by incorporating causal spatiotemporal attention and 3D Rotary Positional Embeddings (3D-RoPE).

OmniStream is pre-trained on 29 datasets, coupling static and temporal representation learning, streaming geometric reconstruction, and vision-language alignment. Its design includes a persistent KV-cache and a lightweight autoregressive language decoder, supporting efficient frame-by-frame online processing of video streams. Extensive evaluations show that, even with a strictly frozen backbone, OmniStream achieves consistently competitive performance across image and video probing, streaming geometric reconstruction, complex video and spatial reasoning, and robotic manipulation tasks.

The core technical principles of OmniStream are causal spatiotemporal attention and 3D-RoPE, which enable efficient online inference without modifying the backbone. The multi-task pre-training objectives reinforce one another, so a single backbone excels on diverse tasks rather than trading them off.

In experiments, OmniStream achieves 68.5% accuracy on the SSv2 dataset, significantly outperforming DINOv3's 54.0%. In online 3D reconstruction tasks, OmniStream achieves absolute relative errors of 0.314, 0.072, and 0.136 on the Sintel, BONN, and KITTI datasets, respectively. Additionally, in VLM and VLA tasks, OmniStream demonstrates strong spatial reasoning capabilities, achieving a leading score of 70.6% on the VSI-Bench benchmark.

OmniStream's significance lies in its ability to unify perception, reconstruction, and action in visual streams, overcoming the fragmentation of current vision foundation models. This capability is crucial for general-purpose visual understanding in interactive and embodied agents, providing consistent representations across image, video, geometric, and language tasks, thus advancing the field of vision.

Despite its versatility and efficiency, OmniStream may experience performance degradation when handling very long video sequences. Additionally, due to the model's complexity, training and inference are computationally expensive, potentially unsuitable for resource-constrained environments. Future research directions include optimizing OmniStream's performance on long sequences, exploring more efficient causal spatiotemporal attention mechanisms, and applications in resource-constrained environments.

Deep Analysis

Background

Visual agents are increasingly deployed in real-time streaming environments, from camera surveillance to augmented reality devices, requiring agents to update their beliefs online under tight latency and memory budgets. Traditional vision foundation models often focus on image semantic perception, offline temporal modeling, or spatial geometry, leading to fragmentation within the field. In recent years, language models have achieved generality with a single autoregressive backbone trained under next-token prediction. However, vision tasks differ not only in supervision but also in the nature of their outputs, such as discrete labels, segmentation masks, dense depth, 3D geometry, and temporally evolving predictions. This has led to the emergence of specialized foundation models, such as image encoders, video models, and geometric experts. While effective in-domain, these models typically learn representations tailored to a narrow objective, making it difficult to transfer across tasks.

Core Problem

The core problem is how to unify perception, reconstruction, and action in the vision domain to achieve general-purpose visual understanding. Current vision foundation models are fragmented in semantic, temporal, and spatial geometry, making it challenging to generalize across tasks without fine-tuning the backbone. Additionally, existing models often rely on expensive re-training, re-tokenization of outputs, or architectural adjustments to the generative head, complicating the realization of unified visual representations. Therefore, the key research question is whether a representation can be learned that supports multiple downstream tasks without modifying or fine-tuning the backbone.

Innovation

OmniStream's core innovations include:

  • A unified streaming visual backbone that generalizes across semantic, spatial, and temporal reasoning without fine-tuning the backbone.
  • Causal spatiotemporal attention, which enforces strict temporal causality and enables efficient frame-by-frame inference via a persistent KV-cache, avoiding re-computation over past frames.
  • 3D Rotary Positional Embeddings (3D-RoPE), which extend 2D RoPE to the spatiotemporal domain and improve the model's handling of long video streams.
  • A unified multi-task pre-training framework coupling static and temporal representation learning, streaming geometric reconstruction, and vision-language alignment, encouraging representations that are temporally coherent, geometrically grounded, and language-aligned.
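The strict temporal causality mentioned above can be pictured as a block-causal attention mask: patches within a frame may attend to one another, but across frames attention only looks backwards. A toy sketch (hypothetical helper, not from the paper):

```python
import numpy as np

def block_causal_mask(num_frames, patches_per_frame):
    """Block-causal mask: a token attends to all tokens in its own frame
    and in earlier frames, never to future frames (illustrative sketch)."""
    n = num_frames * patches_per_frame
    frame_of = np.arange(n) // patches_per_frame   # frame index per token
    return frame_of[:, None] >= frame_of[None, :]  # True = attention allowed

m = block_causal_mask(num_frames=3, patches_per_frame=2)
# token 0 (frame 0) cannot see token 5 (frame 2), but token 5 sees token 0:
print(m[0, 5], m[5, 0])  # False True
```

Spatial attention within a frame remains bidirectional, which is how the design preserves spatial priors while staying temporally causal.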

Methodology

OmniStream's methodology comprises the following key steps:

  • Causal spatiotemporal attention with a persistent KV-cache for efficient frame-by-frame inference, avoiding re-computation over past frames. Input: current frame and historical context; output: composite output state.
  • 3D Rotary Positional Embeddings (3D-RoPE), extending 2D RoPE to the spatiotemporal domain to improve handling of long video streams. Input: non-overlapping patches per frame; output: dense spatiotemporal feature maps and global semantics.
  • A unified multi-task pre-training framework coupling static and temporal representation learning, streaming geometric reconstruction, and vision-language alignment. Input: a set of views (global/local crops from an image or video clip); output: global semantic consistency and patch-level discriminative features.
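A toy sketch of how a 3D rotary embedding might factor across time and space, assuming the channel dimension is split evenly across the t, h, and w axes (an illustrative construction; the paper's exact parameterization may differ):

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Standard 1D RoPE on one axis: rotate channel pairs of x by angles
    proportional to the integer position pos."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)
    ang = pos * freqs
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * np.cos(ang) - x2 * np.sin(ang),
                           x1 * np.sin(ang) + x2 * np.cos(ang)], axis=-1)

def rope_3d(x, t, h, w):
    """Toy 3D-RoPE: split the channel dimension into three equal groups
    and apply 1D RoPE with the frame index t, row h, and column w."""
    d = x.shape[-1] // 3
    return np.concatenate([rope_1d(x[..., :d], t),
                           rope_1d(x[..., d:2*d], h),
                           rope_1d(x[..., 2*d:], w)], axis=-1)

tok = np.ones(12)
same_pos = rope_3d(tok, t=0, h=0, w=0)  # zero offsets leave tokens unchanged
moved = rope_3d(tok, t=3, h=1, w=2)     # rotations preserve token norm
```

Because each group is a pure rotation, positions at the origin leave the token unchanged and every embedding preserves the token's norm, which is what makes RoPE encode *relative* offsets in the attention dot product.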

Experiments

The experimental design includes multi-task pre-training on 29 datasets, covering static images, dynamic videos, and geometric 3D/4D scenes. Baselines used include DINOv3, V-JEPA, CUT3R, etc. Evaluation metrics include image classification accuracy, video action recognition accuracy, video depth estimation absolute relative error, etc. Key hyperparameters include sequence length T=16, optimizer Adam, learning rate 1e-4. Ablation studies analyze the impact of causal spatiotemporal attention and 3D-RoPE.
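The reported training setup can be collected in a small configuration sketch (field names here are hypothetical; the values T=16, Adam, learning rate 1e-4, and 29 datasets come from the text above):

```python
from dataclasses import dataclass

@dataclass
class PretrainConfig:
    # Values quoted in the experimental description; field names are illustrative.
    sequence_length: int = 16      # temporal window T (frames per clip)
    optimizer: str = "adam"
    learning_rate: float = 1e-4
    num_datasets: int = 29         # static images, videos, 3D/4D scenes

cfg = PretrainConfig()
print(cfg.sequence_length, cfg.learning_rate)  # 16 0.0001
```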

Results

The results show that OmniStream excels across tasks. It achieves 68.5% accuracy on the SSv2 dataset, significantly outperforming DINOv3's 54.0%. In online 3D reconstruction tasks, OmniStream achieves absolute relative errors of 0.314, 0.072, and 0.136 on the Sintel, BONN, and KITTI datasets, respectively. Additionally, in VLM and VLA tasks, OmniStream demonstrates strong spatial reasoning capabilities, achieving a leading score of 70.6% on the VSI-Bench benchmark.
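The absolute relative error quoted for Sintel, BONN, and KITTI is the standard depth metric, mean(|pred − gt| / gt) over valid pixels; a minimal sketch:

```python
import numpy as np

def abs_rel(pred_depth, gt_depth):
    """Absolute relative error, the depth metric quoted above:
    mean(|pred - gt| / gt) over pixels with valid ground truth."""
    valid = gt_depth > 0
    return np.mean(np.abs(pred_depth[valid] - gt_depth[valid]) / gt_depth[valid])

gt = np.array([1.0, 2.0, 4.0])
pred = np.array([1.1, 1.8, 4.0])
err = abs_rel(pred, gt)  # ≈ 0.0667
```

Lower is better, so the 0.072 figure on BONN means predicted depths deviate from ground truth by about 7% on average.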

Applications

OmniStream's application scenarios include:

  • Real-time video surveillance: efficient frame-by-frame processing for real-time monitoring and analysis of dynamic scenes.
  • Augmented reality devices: real-time updates and interactions from the user's perspective, enhancing user experience.
  • Robotic manipulation: a unified visual-stream representation supports complex manipulation tasks and more efficient task execution.

Limitations & Outlook

OmniStream's limitations and outlook include:

  • Performance may degrade on very long video sequences, as the pre-training temporal window is fixed at 16 frames.
  • In certain complex geometric reasoning tasks, OmniStream may not fully replace specialized geometric expert models.
  • Due to the model's complexity, training and inference are computationally expensive, potentially unsuitable for resource-constrained environments.

Future research directions include optimizing performance on long sequences, exploring more efficient causal spatiotemporal attention mechanisms, and extending the model to resource-constrained environments.

Plain Language (accessible to non-experts)

Imagine you're in a kitchen cooking a meal. OmniStream is like a multi-functional kitchen assistant that not only helps you chop vegetables but also monitors the cooking process and even adjusts the heat when needed. Traditional kitchen assistants might focus on one task, like chopping or stirring, but OmniStream can handle multiple tasks simultaneously, like an all-in-one chef. It uses something called causal spatiotemporal attention to ensure each step is based on previous ones, not predicting future steps. It's like cooking where you decide the next step based on what's already done, not guessing what's next. OmniStream also uses 3D Rotary Positional Embeddings to keep track of the kitchen's spatial layout, like a smart assistant that has memorized where everything is. With these technologies, OmniStream can help you complete the entire cooking process, from preparing ingredients to finishing the meal, without needing to be retrained.

ELI14 (explained like you're 14)

Hey there! Imagine you're playing a super cool game where you control multiple characters, each with different tasks. OmniStream is like a super-smart game assistant that helps you manage all the characters' tasks at once, so you don't have to worry about each one individually. It has a skill called causal spatiotemporal attention, ensuring each character's actions are based on past situations, not guessing what will happen next. Just like in a game, you decide the next move based on what's already happened, not imagining things. OmniStream also uses 3D Rotary Positional Embeddings to keep track of the game's spatial layout, like a smart assistant that knows the whole game map. This way, you can easily tackle various challenges in the game without spending too much time adjusting each character's tasks. Isn't that cool?

Glossary

OmniStream

A unified streaming visual backbone capable of generalizing across semantic, spatial, and temporal reasoning without fine-tuning the backbone.

OmniStream achieves perception, reconstruction, and action in visual streams using causal spatiotemporal attention and 3D-RoPE.

Causal Spatiotemporal Attention

An attention mechanism that ensures the model relies only on past and present frames during inference, avoiding future frame predictions.

OmniStream employs causal spatiotemporal attention for efficient frame-by-frame online processing.

3D Rotary Positional Embeddings (3D-RoPE)

A technique that extends 2D RoPE to the spatiotemporal domain, enhancing the model's ability to handle long video streams.

OmniStream uses 3D-RoPE to enhance its handling of long video streams.

KV-cache

A persistent cache mechanism used to store keys and values from past frames, avoiding redundant computations.

OmniStream uses a persistent KV-cache for efficient frame-by-frame inference.

Multi-task Pre-training Framework

A framework that integrates static and temporal representation learning, streaming geometric reconstruction, and vision-language alignment.

OmniStream is pre-trained on 29 datasets using a multi-task pre-training framework.

Vision-Language Alignment

A technique that aligns visual representations with linguistic concepts, enhancing the model's semantic understanding.

OmniStream achieves better semantic understanding through vision-language alignment.

Dual DPT Modules

Modules used for predicting depth maps, ray maps, and camera poses.

OmniStream uses dual DPT modules for streaming geometric reconstruction.

Autoregressive Language Decoder

A lightweight decoder that connects visual tokens with linguistic concepts.

OmniStream uses an autoregressive language decoder for vision-language alignment.

Ablation Study

A method of analyzing the impact of model components on overall performance by gradually removing them.

OmniStream's ablation studies analyze the impact of causal spatiotemporal attention and 3D-RoPE.

VSI-Bench Benchmark

A benchmark used to evaluate a model's spatial intelligence.

OmniStream achieves a leading score of 70.6% on the VSI-Bench benchmark.

Open Questions (unanswered questions from this research)

  • OmniStream may experience performance degradation when handling very long video sequences, as its pre-training temporal window is fixed at 16 frames. Future research needs to explore how to optimize the model's performance on long sequences without increasing computational costs.
  • In certain complex geometric reasoning tasks, OmniStream may not fully replace specialized geometric expert models. This indicates a need for further research into enhancing the model's geometric reasoning capabilities.
  • Due to the model's complexity, training and inference are computationally expensive, potentially unsuitable for resource-constrained environments. Future research can explore how to reduce computational costs without compromising performance.
  • OmniStream may underperform in certain specific vision-language tasks, particularly those requiring highly fine-grained semantic understanding. Further research is needed to enhance the model's semantic understanding capabilities.
  • Despite OmniStream's strong performance across multiple tasks, it may still require fine-tuning in certain specific application scenarios. This indicates a need for further research into improving the model's generalization capabilities.

Applications

Immediate Applications

Real-time Video Surveillance

OmniStream enables efficient frame-by-frame processing for real-time monitoring and analysis of dynamic scenes, suitable for security and traffic surveillance.

Augmented Reality Devices

OmniStream supports real-time updates and interactions from the user's perspective, enhancing user experience in AR glasses and mobile devices.

Robotic Manipulation

OmniStream supports complex robotic manipulation tasks through a unified visual stream representation, achieving more efficient task execution in industrial automation and household robots.

Long-term Vision

Smart Cities

OmniStream's real-time monitoring capabilities enable intelligent management and optimization of urban infrastructure, enhancing city operational efficiency.

Autonomous Driving

OmniStream can be used in the perception systems of autonomous vehicles, enhancing their understanding and decision-making capabilities in complex environments for safer autonomous driving.

Abstract

Modern visual agents require representations that are general, causal, and physically structured to operate in real-time streaming environments. However, current vision foundation models remain fragmented, specializing narrowly in image semantic perception, offline temporal modeling, or spatial geometry. This paper introduces OmniStream, a unified streaming visual backbone that effectively perceives, reconstructs, and acts from diverse visual inputs. By incorporating causal spatiotemporal attention and 3D rotary positional embeddings (3D-RoPE), our model supports efficient, frame-by-frame online processing of video streams via a persistent KV-cache. We pre-train OmniStream using a synergistic multi-task framework coupling static and temporal representation learning, streaming geometric reconstruction, and vision-language alignment on 29 datasets. Extensive evaluations show that, even with a strictly frozen backbone, OmniStream achieves consistently competitive performance with specialized experts across image and video probing, streaming geometric reconstruction, complex video and spatial reasoning, as well as robotic manipulation (unseen at training). Rather than pursuing benchmark-specific dominance, our work demonstrates the viability of training a single, versatile vision backbone that generalizes across semantic, spatial, and temporal reasoning, i.e., a more meaningful step toward general-purpose visual understanding for interactive and embodied agents.

References (20)

  • DINOv3. Oriane Siméoni, Huy V. Vo, Maximilian Seitzer et al., 2025.
  • V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning. Mahmoud Assran, Adrien Bardes, David Fan et al., 2025.
  • LLaVA-OneVision: Easy Visual Task Transfer. Bo Li, Yuanhan Zhang, Dong Guo et al., 2024.
  • VGGT: Visual Geometry Grounded Transformer. Jianyuan Wang, Minghao Chen, Nikita Karaev et al., 2025.
  • LLaVA-Video: Video Instruction Tuning With Synthetic Data. Yuanhan Zhang, Jinming Wu, Wei Li et al., 2024.
  • Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks. Jiasen Lu, Christopher Clark, Rowan Zellers et al., 2022.
  • Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs. Shengbang Tong, Ellis Brown, Penghao Wu et al., 2024.
  • How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites. Zhe Chen, Weiyun Wang, Hao Tian et al., 2024.
  • Towards Visuospatial Cognition via Hierarchical Fusion of Visual Experts. Qi Feng, 2025.
  • SpaceMind: Camera-Guided Modality Fusion for Spatial Reasoning in Vision-Language Models. Ruosen Zhao, Zhikang Zhang, Jialei Xu et al., 2025.
  • FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. Tri Dao, Daniel Y. Fu, Stefano Ermon et al., 2022.
  • ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning. Chi-Pin Huang, Yueh-Hua Wu, Min-Hung Chen et al., 2025.
  • Perception Encoder: The Best Visual Embeddings Are Not at the Output of the Network. Daniel Bolya, Po-Yao Huang, Peize Sun et al., 2025.
  • VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking. Limin Wang, Bingkun Huang, Zhiyu Zhao et al., 2023.
  • Visual Instruction Tuning. Haotian Liu, Chunyuan Li, Qingyang Wu et al., 2023.
  • VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction. Zhiwen Fan, Jian Zhang, Renjie Li et al., 2025.
  • DL3DV-10K: A Large-Scale Scene Dataset for Deep Learning-Based 3D Vision. Lu Ling, Yichen Sheng, Zhi Tu et al., 2023.
  • Emerging Properties in Self-Supervised Vision Transformers. Mathilde Caron, Hugo Touvron, Ishan Misra et al., 2021.
  • Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data. Lihe Yang, Bingyi Kang, Zilong Huang et al., 2024.
  • DeepSeek-VL: Towards Real-World Vision-Language Understanding. Haoyu Lu, Wen Liu, Bo Zhang et al., 2024.