O3N: Omnidirectional Open-Vocabulary Occupancy Prediction

TL;DR

The O3N framework achieves state-of-the-art performance on the QuadOcc and Human360Occ benchmarks, using a polar-spiral topology for 360° spatial representation.

cs.CV · Advanced · 2026-03-13
Mengfei Duan, Hao Shi, Fei Teng, Guoqiang Zhao, Yuheng Zhang, Zhiyong Li, Kailun Yang
omnidirectional perception, open vocabulary, occupancy prediction, polar-spiral, semantic alignment

Key Findings

Methodology

The O3N framework employs a polar-spiral topology for 360° spatial representation, integrating the Polar-spiral Mamba (PsM) module with the Occupancy Cost Aggregation (OCA) and Natural Modality Alignment (NMA) modules to provide a consistent pixel-voxel-text representation. The PsM module captures long-range context through polar-spiral scanning, the OCA module unifies geometric and semantic supervision within the voxel space, and the NMA module achieves gradient-free alignment of visual features, voxel embeddings, and text semantics.
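The paper's code is not yet released, so the exact serialization is not public. As a rough sketch of the idea behind polar-spiral scanning, the hypothetical function below orders the cells of a polar grid (radial rings × angular bins) into one continuous sequence, offsetting each ring's start angle so the traversal winds outward like a spiral, the kind of ordering a long-range sequence model could consume. The function name and traversal rule are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def polar_spiral_order(n_rings: int, n_angles: int) -> list[tuple[int, int]]:
    """Hypothetical serialization: visit each (ring, angle) cell exactly once,
    sweeping the full 360° of every ring and rotating each ring's start angle
    so the seam between rings spirals outward."""
    order = []
    for r in range(n_rings):
        start = r % n_angles              # rotate the start so the seam spirals
        for k in range(n_angles):
            order.append((r, (start + k) % n_angles))
    return order

# Serialize a polar feature grid into a single sequence for a sequence model.
feats = np.random.rand(4, 8, 16)          # (rings, angles, channels)
order = polar_spiral_order(4, 8)
seq = np.stack([feats[r, a] for r, a in order])   # (4*8, 16)

assert seq.shape == (32, 16)
assert len(set(order)) == 32              # every cell visited exactly once
```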

Key Results

  • On the QuadOcc benchmark, O3N improves on both known and novel classes, reaching an overall mIoU of 16.54 versus OVO's 14.33. It is especially strong on novel classes, where its mIoU of 21.16 outperforms several fully supervised methods.
  • On the Human360Occ dataset, O3N achieves an overall mIoU of 24.25 under the open-vocabulary setting, outperforming all open-vocabulary counterparts and achieving results comparable to several fully supervised methods.
  • Ablation studies demonstrate that the combination of PsM, OCA, and NMA modules significantly enhances the model's generalization ability and semantic learning effectiveness, especially on unseen semantics.

Significance

The O3N framework pioneers a new direction in omnidirectional open-vocabulary occupancy prediction, addressing the limitations of traditional methods in recognizing complex dynamic objects during open-world exploration. By introducing polar-spiral topology and natural modality alignment, O3N not only advances the universality of 3D world modeling in academia but also provides a safer and more comprehensive scene perception solution for the industry.

Technical Contribution

O3N fundamentally differentiates itself from existing methods through polar-spiral topology and gradient-free alignment mechanisms, offering new theoretical guarantees and engineering possibilities. The PsM module effectively captures spatial geometric and semantic details of omnidirectional images, the OCA module enhances robustness in the open-vocabulary space through voxel-text cost volume construction, and the NMA module effectively narrows the semantic gap across modalities.
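To make the voxel-text cost volume idea concrete, here is a minimal sketch (not the paper's OCA implementation) that scores every voxel embedding against every class-prompt embedding by cosine similarity, yielding an (N voxels × C classes) volume from which a per-voxel open-vocabulary label can be read off. The function name and shapes are assumptions for illustration.

```python
import numpy as np

def voxel_text_cost_volume(voxel_emb, text_emb):
    """Cosine-similarity cost volume between voxel and text embeddings.
    voxel_emb: (N, D) embeddings for N voxels; text_emb: (C, D) embeddings
    for C class prompts. Returns an (N, C) similarity volume."""
    v = voxel_emb / np.linalg.norm(voxel_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    return v @ t.T

rng = np.random.default_rng(0)
voxels = rng.normal(size=(100, 64))
texts = rng.normal(size=(5, 64))          # e.g., embeddings of 5 category names
cost = voxel_text_cost_volume(voxels, texts)
labels = cost.argmax(axis=1)              # per-voxel open-vocabulary label
assert cost.shape == (100, 5)
```

Because the class set enters only through `text_emb`, swapping in prompts for unseen categories requires no retraining, which is the core of the open-vocabulary setting.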

Novelty

O3N is the first framework to achieve purely visual, end-to-end omnidirectional open-vocabulary occupancy prediction. Compared to existing methods, O3N introduces a novel approach to spatial representation and semantic alignment through polar-spiral topology and natural modality alignment, breaking the limitations of fixed perspective inputs and predefined training categories.

Limitations

  • The polar-spiral topology adds computational cost, so O3N may not be suitable for resource-constrained devices.
  • In extremely complex dynamic scenes, O3N's semantic alignment may face challenges, especially in generalizing to unseen semantics.
  • While O3N excels in open-vocabulary prediction, further optimization is needed for specific semantic categories in certain domains.

Future Work

Future work could include further optimizing O3N's performance on resource-constrained devices and validating its generalization capabilities across a wider range of scenarios and datasets. Additionally, exploring integration with other perception modalities, such as LiDAR, could further enhance O3N's scene understanding capabilities.

AI Executive Summary

In the fields of autonomous driving and intelligent robotics, omnidirectional perception has become an inevitable trend. However, existing 3D occupancy prediction methods are limited by perspective inputs and predefined training distributions, making it challenging to achieve comprehensive and safe scene perception during open-world exploration.

To address this issue, the O3N framework is proposed as the first purely visual, end-to-end omnidirectional open-vocabulary occupancy prediction framework. O3N utilizes a polar-spiral topology for 360° spatial representation and integrates the Occupancy Cost Aggregation (OCA) module and Natural Modality Alignment (NMA) module to provide a consistent pixel-voxel-text representation.

The core technical principles of O3N include: the PsM module captures long-range context through polar-spiral scanning, the OCA module unifies geometric and semantic supervision within the voxel space, and the NMA module achieves gradient-free alignment of visual features, voxel embeddings, and text semantics.
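The summary does not spell out NMA's mechanics. As one illustration of what "gradient-free" alignment can mean, the sketch below fits a linear map between two embedding spaces in closed form via least squares, with no backpropagation involved; the function and setup are hypothetical, not the paper's method.

```python
import numpy as np

def gradient_free_align(source, target):
    """Fit a linear map W (source space -> target space) in closed form via
    least squares, i.e. without any gradient-based optimization.
    source: (N, Ds) paired with target: (N, Dt); returns W of shape (Ds, Dt)."""
    W, *_ = np.linalg.lstsq(source, target, rcond=None)
    return W

rng = np.random.default_rng(1)
voxel_emb = rng.normal(size=(200, 32))
true_map = rng.normal(size=(32, 16))
text_emb = voxel_emb @ true_map            # stand-in for paired text features
W = gradient_free_align(voxel_emb, text_emb)
aligned = voxel_emb @ W
assert np.allclose(aligned, text_emb, atol=1e-6)
```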

In experiments, O3N achieves state-of-the-art performance on the QuadOcc and Human360Occ benchmarks, particularly excelling in novel classes, demonstrating significant advantages in cross-scene generalization and semantic scalability.

This research not only advances the universality of 3D world modeling in academia but also provides a safer and more comprehensive scene perception solution for the industry. However, O3N has certain computational resource requirements, and future work could include further optimizing its performance on resource-constrained devices.

Deep Analysis

Background

With the rapid development of autonomous driving and intelligent robotics, omnidirectional perception has become a key to achieving autonomous agents and embodied intelligence. However, existing 3D occupancy prediction methods often rely on limited perspective inputs and predefined training distributions, making it difficult to meet the demand for recognizing complex dynamic objects during open-world exploration. In recent years, researchers have attempted to enhance semantic understanding and spatial geometry modeling capabilities through multi-sensor, multi-view approaches, but these methods are typically limited to fixed vocabularies and cannot recognize unknown semantic categories.

Core Problem

Existing 3D occupancy prediction methods face numerous challenges during open-world exploration, particularly in recognizing complex dynamic objects. Due to the limitations of perspective inputs and predefined training distributions, these methods struggle to provide comprehensive and safe scene perception. Additionally, traditional methods often assume that scene understanding is the recognition of a limited set of labels, which restricts the model's ability to handle unknown object categories in open-world environments.

Innovation

The O3N framework introduces several innovations in the field of omnidirectional open-vocabulary occupancy prediction:

  • Polar-Spiral Topology: Achieves 360° spatial representation through polar-spiral scanning, capturing long-range context.

  • Occupancy Cost Aggregation Module: Unifies geometric and semantic supervision within the voxel space, enhancing robustness in the open-vocabulary space.

  • Natural Modality Alignment Module: Achieves gradient-free alignment of visual features, voxel embeddings, and text semantics, narrowing the semantic gap across modalities.

Methodology

The implementation of the O3N framework includes the following key steps:

  • Polar-Spiral Embedding: The PsM module embeds omnidirectional voxels in a polar-spiral topology, enabling continuous spatial representation and long-range context modeling across 360°.

  • Occupancy Cost Aggregation: The OCA module unifies geometric and semantic supervision within the voxel space, keeping the reconstructed geometry consistent with the underlying semantic structure.

  • Natural Modality Alignment: The NMA module establishes a gradient-free alignment pathway that harmonizes visual features, voxel embeddings, and text semantics into a consistent pixel-voxel-text representation triad.

Experiments

The experimental design includes testing on the QuadOcc and Human360Occ datasets, with comparisons to various baseline models. The primary performance metric is the mean Intersection over Union (mIoU), with separate evaluations for novel and known classes. Ablation studies are conducted to validate the contributions of the PsM, OCA, and NMA modules.
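The metric used throughout, mean Intersection over Union, can be sketched in a few lines. This is a generic implementation of the standard definition (per-class intersection over union, averaged over classes present in the ground truth), not the benchmarks' official evaluation code.

```python
import numpy as np

def miou(pred, gt, num_classes, ignore_index=255):
    """Mean IoU over classes present in the ground truth."""
    ious = []
    valid = gt != ignore_index
    for c in range(num_classes):
        p, g = (pred == c) & valid, (gt == c) & valid
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue                      # class absent everywhere: skip it
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious)) if ious else 0.0

gt   = np.array([0, 0, 1, 1, 2, 2])
pred = np.array([0, 1, 1, 1, 2, 0])
# class 0: 1/3, class 1: 2/3, class 2: 1/2 -> mean 0.5
assert abs(miou(pred, gt, 3) - 0.5) < 1e-9
```

Reporting mIoU separately over known and novel classes, as the experiments do, amounts to running this computation on the two class subsets independently.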

Results

Experimental results show that O3N achieves state-of-the-art performance on the QuadOcc and Human360Occ benchmarks, particularly excelling in novel classes, demonstrating significant advantages in cross-scene generalization and semantic scalability. Ablation studies demonstrate that the combination of PsM, OCA, and NMA modules significantly enhances the model's generalization ability and semantic learning effectiveness.

Applications

The O3N framework has broad application prospects in fields such as autonomous driving, intelligent robotics, and virtual reality. By providing more comprehensive and safe scene perception, O3N can significantly enhance the intelligence level in these fields.

Limitations & Outlook

O3N has certain computational resource requirements, which may not be suitable for resource-constrained devices. Additionally, in extremely complex dynamic scenes, O3N's semantic alignment may face challenges, especially in generalizing to unseen semantics. Future work could include further optimizing its performance on resource-constrained devices.

Plain Language (accessible to non-experts)

Imagine you're in a giant maze with high walls all around, and you need to know what's around every corner to find the exit. Traditional methods are like using a flashlight to illuminate a small part of the maze, while O3N is like a panoramic camera that can see the entire maze at once. It can not only see the position of the walls but also recognize patterns painted on the walls, like arrows or marks. This way, you can find the exit faster without taking the wrong path. O3N uses a method called polar-spiral to divide the maze into many small sections, each of which can be carefully observed. It's like using a giant magnifying glass to enlarge the details of every corner, allowing you to see more clearly.

ELI14 (explained like you're 14)

Hey there! Have you ever played a game where you need to find hidden objects? Imagine if you had a super cool panoramic camera that could see every corner of the room—that would be awesome! That's what O3N does. It's like a panoramic camera that can see everything at once, not only spotting the position of objects but also recognizing what they are, like chairs, tables, or even a cute kitty. This way, you can find all the hidden objects faster without having to look for them one by one. O3N uses a method called polar-spiral to divide the room into many small sections, each of which can be carefully observed. It's like using a super magnifying glass to enlarge the details of every corner, allowing you to see more clearly. Isn't that cool?

Glossary

O3N Framework

O3N is a purely visual, end-to-end omnidirectional open-vocabulary occupancy prediction framework that utilizes polar-spiral topology for 360° spatial representation.

Used for achieving omnidirectional scene perception and semantic alignment.

Polar-Spiral Topology

A method for capturing long-range context through polar-spiral scanning, achieving 360° spatial representation.

Used in the O3N framework for spatial representation.

Occupancy Cost Aggregation Module

A module that unifies geometric and semantic supervision within the voxel space, enhancing robustness in the open-vocabulary space.

Used in the O3N framework for semantic alignment.

Natural Modality Alignment Module

Achieves gradient-free alignment of visual features, voxel embeddings, and text semantics, narrowing the semantic gap across modalities.

Used in the O3N framework for modality alignment.

QuadOcc Dataset

A real-world dataset for omnidirectional occupancy prediction, containing data from a quadruped robot in a campus environment.

Used for experimental validation of the O3N framework.

Human360Occ Dataset

A CARLA-based simulated human-ego occupancy dataset for omnidirectional occupancy prediction.

Used for experimental validation of the O3N framework.

Mean Intersection over Union (mIoU)

A metric that averages, over classes, the overlap between predicted and ground-truth regions (intersection divided by union).

Used to evaluate the performance of the O3N framework.

Ablation Study

A method for evaluating the contribution of individual model components by progressively removing them.

Used to validate the contributions of modules in the O3N framework.

Open Vocabulary

Refers to the ability of a model to recognize and predict unseen object categories without prior labeling.

Describes the semantic scalability of the O3N framework.

Omnidirectional Perception

Refers to achieving comprehensive scene perception and understanding through a 360° view.

Describes the core capability of the O3N framework.

Open Questions (unanswered questions from this research)

  1. How can O3N's performance be optimized on resource-constrained devices? Currently, O3N has certain computational resource requirements, limiting its application on mobile devices or embedded systems. Further research is needed to reduce computational complexity without sacrificing performance.
  2. In extremely complex dynamic scenes, O3N's semantic alignment may face challenges. How can its generalization ability on unseen semantics be enhanced? This requires exploring new semantic alignment mechanisms and richer training data.
  3. How can O3N's generalization capabilities be validated across a wider range of scenarios and datasets? Current experiments focus primarily on the QuadOcc and Human360Occ datasets, and further expansion to other domains is needed.
  4. O3N still needs optimization for specific semantic categories in certain domains. How can its performance in these areas be improved? This requires targeted adjustments to the model architecture and training strategy.
  5. How can integration with other perception modalities, such as LiDAR, further enhance O3N's scene understanding capabilities? This requires exploring methods and techniques for multimodal fusion.

Applications

Immediate Applications

Autonomous Driving

O3N can be used for omnidirectional perception in autonomous vehicles, providing more comprehensive and safe scene understanding to help vehicles make more accurate decisions in complex environments.

Intelligent Robotics

O3N can be used for navigation and task execution in intelligent robots, helping robots complete tasks more efficiently by recognizing and predicting objects in the surrounding environment.

Virtual Reality

O3N can be used for scene modeling and interaction in virtual reality systems, providing a more realistic and immersive environment to enhance user experience.

Long-term Vision

Smart Cities

O3N can be used for omnidirectional monitoring and management in smart cities, providing more intelligent urban planning and management solutions by sensing changes in the urban environment in real-time.

Human-Computer Interaction

O3N can be used for intelligent human-computer interaction systems, providing more natural and efficient interaction experiences by recognizing and understanding user behaviors and intentions.

Abstract

Understanding and reconstructing the 3D world through omnidirectional perception is an inevitable trend in the development of autonomous agents and embodied intelligence. However, existing 3D occupancy prediction methods are constrained by limited perspective inputs and predefined training distribution, making them difficult to apply to embodied agents that require comprehensive and safe perception of scenes in open world exploration. To address this, we present O3N, the first purely visual, end-to-end Omnidirectional Open-vocabulary Occupancy predictioN framework. O3N embeds omnidirectional voxels in a polar-spiral topology via the Polar-spiral Mamba (PsM) module, enabling continuous spatial representation and long-range context modeling across 360°. The Occupancy Cost Aggregation (OCA) module introduces a principled mechanism for unifying geometric and semantic supervision within the voxel space, ensuring consistency between the reconstructed geometry and the underlying semantic structure. Moreover, Natural Modality Alignment (NMA) establishes a gradient-free alignment pathway that harmonizes visual features, voxel embeddings, and text semantics, forming a consistent "pixel-voxel-text" representation triad. Extensive experiments on multiple models demonstrate that our method not only achieves state-of-the-art performance on QuadOcc and Human360Occ benchmarks but also exhibits remarkable cross-scene generalization and semantic scalability, paving the way toward universal 3D world modeling. The source code will be made publicly available at https://github.com/MengfeiD/O3N.

cs.CV cs.RO eess.IV

References (20)

  • MonoScene: Monocular 3D Semantic Scene Completion. Anh-Quan Cao, Raoul de Charette (2021).
  • CAT-Seg: Cost Aggregation for Open-Vocabulary Semantic Segmentation. Seokju Cho, Heeseong Shin, Sung-Jin Hong et al. (2023).
  • OVO: Open-Vocabulary Occupancy. Zhiyu Tan, Zichao Dong, Cheng-Jun Zhang et al. (2023).
  • OneOcc: Semantic Occupancy Prediction for Legged Robots with a Single Panoramic Camera. Hao Shi, Ze Wang, Shangwei Guo et al. (2025).
  • One Flight Over the Gap: A Survey from Perspective to Panoramic Vision. Xin Lin, Xian Ge, Dizhe Zhang et al. (2025).
  • A Survey of Representation Learning, Optimization Strategies, and Applications for Omnidirectional Vision. Hao Ai, Zidong Cao, Lin Wang (2025).
  • SGAT4PASS: Spherical Geometry-Aware Transformer for PAnoramic Semantic Segmentation. Xuewei Li, Tao Wu, Zhongang Qi et al. (2023).
  • RoboOcc: Enhancing the Geometric and Semantic Scene Understanding for Robots. Zhang Zhang, Qiang Zhang, Wei Cui et al. (2025).
  • QuadricFormer: Scene as Superquadrics for 3D Semantic Occupancy Prediction. Sicheng Zuo, Wenzhao Zheng, Han Xiao et al. (2025).
  • SparseOcc: Rethinking Sparse Latent Representation for Vision-Based Semantic Occupancy Prediction. Pin Tang, Zhongdao Wang, Guoqing Wang et al. (2024).
  • Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. Liang-Chieh Chen, Yukun Zhu, G. Papandreou et al. (2018).
  • GoodSAM: Bridging Domain and Capacity Gaps via Segment Anything Model for Distortion-Aware Panoramic Semantic Segmentation. Weiming Zhang, Yexin Liu, Xueye Zheng et al. (2024).
  • FishBEV: Distortion-Resilient Bird's Eye View Segmentation with Surround-View Fisheye Cameras. Hang Li, Dianmo Sheng, Qiankun Dong et al. (2025).
  • ArticuBEVSeg: Road Semantic Understanding and its Application in Bird's Eye View From Panoramic Vision System of Long Combination Vehicles. Weimin Liu, Wenjun Wang (2025).
  • Monocular Open Vocabulary Occupancy Prediction for Indoor Scenes. Changqing Zhou, Yueru Luo, Han Zhang et al. (2026).
  • DPSeg: Dual-Prompt Cost Volume Learning for Open-Vocabulary Semantic Segmentation. Ziyu Zhao, Xiaoguang Li, Lin Shi et al. (2025).
  • POP-3D: Open-Vocabulary 3D Occupancy Prediction from Images. Antonín Vobecký, Oriane Siméoni, David Hurych et al. (2024).
  • GoodSAM++: Bridging Domain and Capacity Gaps via Segment Anything Model for Panoramic Semantic Segmentation. Weiming Zhang, Yexin Liu, Xueye Zheng et al. (2024).
  • OneBEV: Using One Panoramic Image for Bird's-Eye-View Semantic Mapping. Jiale Wei, Junwei Zheng, Ruiping Liu et al. (2024).
  • Spatial-Mamba: Effective Visual State Space Models via Structure-Aware State Fusion. Chaodong Xiao, Ming-hui Li, Zhengqiang Zhang et al. (2024).