SENSE: Stereo OpEN Vocabulary SEmantic Segmentation

TL;DR

SENSE leverages stereo vision and vision-language models for open-vocabulary semantic segmentation, achieving a +2.9% Average Precision improvement on PhraseStereo.

cs.CV · Advanced · 2026-04-17
Thomas Campagnolo, Ezio Malis, Philippe Martinet, Gaétan Bahl
stereo vision · open vocabulary · semantic segmentation · vision-language model · autonomous driving

Key Findings

Methodology

SENSE is an innovative stereo open-vocabulary semantic segmentation method. By combining stereo image pairs and vision-language models like CLIP, SENSE introduces geometric cues to improve spatial reasoning and segmentation accuracy. Its architecture builds upon frozen CLIP features and the CLIPSeg framework, adding a stereo fusion module and a lightweight decoder that processes intermediate representations, enabling natural language queries without retraining the backbone.
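The paper does not publish the exact fusion module, but the idea of fusing frozen left/right backbone activations with a small learned projection can be sketched as follows. All names, shapes, and the concat-plus-1x1-projection fusion choice here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def fuse_stereo_features(f_left, f_right, w_fuse):
    """Fuse left/right feature maps of shape (C, H, W).

    Channel-concatenation followed by a learned 1x1 projection back to C
    channels is one simple fusion choice; SENSE's actual module may differ.
    """
    stacked = np.concatenate([f_left, f_right], axis=0)   # (2C, H, W)
    c2, h, w = stacked.shape
    flat = stacked.reshape(c2, h * w)                     # (2C, H*W)
    return (w_fuse @ flat).reshape(-1, h, w)              # (C, H, W)

# Toy intermediate activations standing in for a frozen CLIP backbone.
C, H, W = 64, 16, 16
f_l = rng.standard_normal((C, H, W))
f_r = rng.standard_normal((C, H, W))
w = rng.standard_normal((C, 2 * C)) / np.sqrt(2 * C)      # learned in practice

fused = fuse_stereo_features(f_l, f_r, w)
print(fused.shape)  # (64, 16, 16)
```

Because the backbone stays frozen, only the small fusion projection and the lightweight decoder would need gradients, which is what makes training without retraining CLIP feasible.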

Key Results

  • On the PhraseStereo dataset, SENSE shows a +2.9% improvement in Average Precision over the baseline method and +0.76% over the best competing method.
  • SENSE provides a relative improvement of +3.5% mIoU on Cityscapes compared to the baseline work.
  • SENSE offers an +18% relative mIoU improvement on KITTI compared to the baseline work.

Significance

By jointly reasoning over semantics and geometry, SENSE supports accurate scene understanding from natural language, essential for autonomous robots and Intelligent Transportation Systems. Its innovative application in open-vocabulary semantic segmentation overcomes the spatial precision limitations of traditional methods, especially under occlusions and near object boundaries.

Technical Contribution

SENSE's technical contribution lies in being the first to combine stereo vision with open-vocabulary semantic segmentation, utilizing geometric information from stereo image pairs to enhance spatial reasoning. Its architecture uses intermediate CLIP activations for stereo fusion and lightweight decoding without retraining the CLIP backbone.

Novelty

SENSE is the first method to apply stereo vision to open-vocabulary semantic segmentation. Compared to existing single-view methods, SENSE significantly improves segmentation accuracy and spatial reasoning by introducing geometric cues from stereo image pairs.

Limitations

  • SENSE may perform poorly under extreme lighting conditions or when stereo matching fails.
  • Reliance on CLIP's pretrained features may limit SENSE's performance on visual concepts underrepresented in CLIP's pretraining data.

Future Work

Future research directions include exploring more efficient stereo fusion methods and training on larger and more diverse datasets to improve the model's generalization and robustness.

AI Executive Summary

In autonomous driving and Intelligent Transportation Systems, scene understanding is a crucial task. However, existing semantic segmentation models often rely on fixed class sets, making them inflexible in dynamic environments. SENSE proposes a novel approach by combining stereo vision and vision-language models, overcoming the spatial precision limitations of traditional methods.

The core of SENSE lies in leveraging geometric information provided by stereo image pairs to enhance spatial reasoning. Its architecture builds upon frozen CLIP features and the CLIPSeg framework, adding a stereo fusion module and a lightweight decoder that processes intermediate representations, enabling natural language queries without retraining the backbone. The method performs strongly on the PhraseStereo dataset and generalizes well in zero-shot settings.

In experiments, SENSE shows a 2.9% improvement in Average Precision over the baseline method on the PhraseStereo dataset and provides a relative improvement of 3.5% mIoU on Cityscapes and 18% on KITTI compared to the baseline work. These results indicate that SENSE has significant advantages in handling complex scenes and unseen categories.

The innovation of SENSE lies in being the first to apply stereo vision to open-vocabulary semantic segmentation, utilizing geometric cues from stereo image pairs to significantly improve segmentation accuracy and spatial reasoning. This opens up new possibilities for scene understanding in autonomous driving and Intelligent Transportation Systems.

However, SENSE may perform poorly under extreme lighting conditions or when stereo matching fails. Additionally, relying on CLIP's pretrained features may limit SENSE's performance when dealing with unseen visual features. Future research directions include exploring more efficient stereo fusion methods and training on larger and more diverse datasets to improve the model's generalization and robustness.

Deep Analysis

Background

Semantic segmentation is a fundamental task in computer vision, aiming to assign class labels to every pixel in an image. Traditional semantic segmentation models typically rely on dense annotations and operate on a fixed, closed set of categories, making them inflexible in dynamic environments. Recently, open-vocabulary semantic segmentation has emerged as a promising alternative, enabling models to segment images based on arbitrary class names or natural language expressions. However, existing methods primarily rely on single-view images, struggling with spatial precision, especially under occlusions and near object boundaries.

Core Problem

Existing open-vocabulary semantic segmentation methods face limitations in spatial precision, particularly when dealing with occlusions and object boundaries. This is because these methods typically rely on single-view images, ignoring the geometric cues available in stereo vision. Moreover, current vision-language models are primarily designed for image-level classification, lacking the spatial granularity needed for pixel-wise segmentation.

Innovation

The core innovation of SENSE lies in being the first to apply stereo vision to open-vocabulary semantic segmentation. By combining stereo image pairs and vision-language models, SENSE introduces geometric cues that significantly improve spatial reasoning and segmentation accuracy. Its architecture builds upon frozen CLIP features and the CLIPSeg framework, adding a stereo fusion module and a lightweight decoder that processes intermediate representations, enabling natural language queries without retraining the backbone.

Methodology

  • SENSE leverages geometric information from stereo image pairs to enhance spatial reasoning.
  • Its architecture builds upon frozen CLIP features and the CLIPSeg framework, adding a stereo fusion module and a lightweight decoder.
  • Intermediate CLIP activations feed the stereo fusion and lightweight decoding stages, enabling natural language queries without retraining the backbone.
  • A sliding-window strategy addresses the resolution limits of CLIP encoders on large-scale datasets, generating fine-grained predictions while preserving global context.
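The sliding-window bullet above can be made concrete with a minimal sketch: run the model on overlapping crops and average the per-pixel scores where crops overlap. The window size, stride, and `score_fn` stand-in below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def sliding_window_scores(image, window, stride, score_fn):
    """Average per-pixel scores over overlapping square windows.

    score_fn maps a (window, window) crop to per-pixel scores; here it is a
    stand-in for a full encoder-decoder forward pass at native resolution.
    """
    h, w = image.shape
    acc = np.zeros((h, w))
    cnt = np.zeros((h, w))
    for y in range(0, h - window + 1, stride):
        for x in range(0, w - window + 1, stride):
            crop = image[y:y + window, x:x + window]
            acc[y:y + window, x:x + window] += score_fn(crop)
            cnt[y:y + window, x:x + window] += 1
    return acc / np.maximum(cnt, 1)   # avoid division by zero at uncovered pixels

img = np.arange(64, dtype=float).reshape(8, 8)
# A dummy scorer returning a constant confirms overlap-averaging is seamless.
out = sliding_window_scores(img, window=4, stride=2, score_fn=lambda c: np.ones_like(c))
print(out.shape)  # (8, 8)
```

Averaging (rather than overwriting) in overlap regions is what suppresses seams at window boundaries while each window still sees enough context for the encoder.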

Experiments

SENSE was trained and evaluated on the PhraseStereo dataset, which is specifically designed for phrase-grounded tasks. In experiments, SENSE shows a 2.9% improvement in Average Precision over the baseline method on the PhraseStereo dataset and provides a relative improvement of 3.5% mIoU on Cityscapes and 18% on KITTI compared to the baseline work. The experimental setup includes a sliding-window strategy and CRF refinement to handle multi-label segmentation tasks.

Results

Experimental results show that SENSE achieves a 2.9% improvement in Average Precision over the baseline method on the PhraseStereo dataset and provides a relative improvement of 3.5% mIoU on Cityscapes and 18% on KITTI compared to the baseline work. These results indicate that SENSE has significant advantages in handling complex scenes and unseen categories.

Applications

SENSE has broad application prospects in autonomous driving and Intelligent Transportation Systems. It can be flexibly applied in dynamic environments, supporting accurate scene understanding from natural language. This provides new possibilities for decision-making in complex environments for autonomous vehicles.

Limitations & Outlook

SENSE may perform poorly under extreme lighting conditions or when stereo matching fails. Additionally, relying on CLIP's pretrained features may limit SENSE's performance when dealing with unseen visual features. Future research directions include exploring more efficient stereo fusion methods and training on larger and more diverse datasets to improve the model's generalization and robustness.

Plain Language (Accessible to non-experts)

Imagine you're in a kitchen preparing a big meal. You have an assistant who can understand your instructions and help you find the ingredients and tools you need. This assistant is like the vision-language model in SENSE, capable of understanding natural language and finding corresponding objects in images.

Now, you need to find a specific spice bottle in the kitchen, but it's blocked by other bottles. You put on a special pair of glasses that allow you to see the depth and position of the bottles. This is like the stereo vision in SENSE, providing additional geometric information to help you locate the target more accurately.

By combining the language understanding ability of the assistant and the geometric information from the glasses, you can quickly and accurately find the spice bottle you need. This is how SENSE works: by combining vision-language models and stereo vision, SENSE can perform semantic segmentation accurately in complex scenes.

This method is particularly suitable for autonomous driving and Intelligent Transportation Systems because it can be flexibly applied in dynamic environments, supporting accurate scene understanding from natural language.

ELI14 (Explained like you're 14)

Hey there! Imagine you're playing a super cool game where your task is to find hidden treasures on a map. But the problem is, there are lots of obstacles blocking your view!

That's when you have a magical assistant who not only understands what you say but also helps you find the treasure's location. This assistant is like the vision-language model in SENSE, capable of understanding natural language and finding corresponding objects in images.

But sometimes, the treasure might be hidden in a very tricky spot, and even your assistant gets confused. So, you put on a special pair of glasses that let you see what's behind the obstacles. This is like the stereo vision in SENSE, providing additional geometric information to help you locate the target more accurately.

By combining the language understanding ability of the assistant and the geometric information from the glasses, you can quickly and accurately find the treasure. That's how SENSE works: by combining vision-language models and stereo vision, SENSE can perform semantic segmentation accurately in complex scenes. Isn't that cool?

Glossary

SENSE (Stereo Open Vocabulary Semantic Segmentation)

SENSE is a method combining stereo vision and vision-language models for semantic segmentation, enabling flexible application in dynamic environments.

In the paper, SENSE is used to improve spatial precision in open-vocabulary semantic segmentation.

Stereo Vision

Stereo vision combines images from two perspectives to provide geometric information about the depth and position of objects.

In SENSE, stereo vision is used to provide geometric cues to enhance spatial reasoning.
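The geometric cue a calibrated stereo pair provides is depth from disparity: Z = f·B/d, where f is the focal length in pixels, B the baseline, and d the disparity. The constants below are illustrative, roughly KITTI-like values, not parameters from the paper.

```python
FOCAL_PX = 721.5   # focal length in pixels (illustrative, roughly KITTI-like)
BASELINE_M = 0.54  # stereo baseline in meters (illustrative)

def depth_from_disparity(d_px: float) -> float:
    """Depth Z = f * B / d for a rectified stereo pair."""
    if d_px <= 0:
        raise ValueError("disparity must be positive")
    return FOCAL_PX * BASELINE_M / d_px

# A 30-pixel disparity corresponds to a point roughly 13 m away.
print(round(depth_from_disparity(30.0), 3))  # 12.987
```

Larger disparities mean closer objects, which is exactly the per-pixel depth ordering that helps disambiguate occlusions and object boundaries.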

Vision-Language Model

A vision-language model can understand natural language and find corresponding objects in images.

In SENSE, CLIP is used as the vision-language model for natural language queries.

CLIP

CLIP is a vision-language model that aligns visual and textual modalities in a shared embedding space.

In SENSE, CLIP is used to provide visual and textual features.
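The shared embedding space works by cosine similarity: an image embedding is compared against embeddings of text prompts, and the closest prompt wins. The toy 3-d vectors below stand in for real CLIP embeddings (which are high-dimensional and produced by the trained encoders).

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings standing in for CLIP image/text features.
img_emb = np.array([0.9, 0.1, 0.0])
txt_embs = {
    "a photo of a car": np.array([1.0, 0.0, 0.0]),
    "a photo of a dog": np.array([0.0, 1.0, 0.0]),
}

sims = {prompt: cosine_sim(img_emb, e) for prompt, e in txt_embs.items()}
best = max(sims, key=sims.get)
print(best)  # a photo of a car
```

Open-vocabulary segmentation extends this image-level matching to pixels: dense visual features are compared against the text embedding of an arbitrary query phrase.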

CLIPSeg

CLIPSeg is a semantic segmentation framework based on CLIP features, capable of dense prediction.

In SENSE, CLIPSeg is used for lightweight decoding.

PhraseStereo

PhraseStereo is a dataset designed for phrase-grounded tasks, containing rich object, attribute, and spatial queries.

In SENSE, PhraseStereo is used for training and evaluation.

mIoU (Mean Intersection over Union)

mIoU measures semantic segmentation quality as the intersection-over-union between predicted and ground-truth masks, averaged over classes.

In SENSE experiments, mIoU is used to evaluate model performance on Cityscapes and KITTI datasets.
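The metric is straightforward to compute from label maps: per class, count intersection and union pixels, then average the ratios. A minimal implementation (skipping classes absent from both prediction and ground truth, one common convention):

```python
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    """Mean IoU over classes present in prediction or ground truth."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

pred = np.array([[0, 0],
                 [1, 1]])
gt   = np.array([[0, 1],
                 [1, 1]])
# Class 0: IoU 1/2; class 1: IoU 2/3; mean ≈ 0.583.
print(round(mean_iou(pred, gt, num_classes=2), 3))  # 0.583
```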

CRF (Conditional Random Field)

A CRF is a probabilistic graphical model; in segmentation it is commonly applied as a post-processing step that refines masks by encouraging nearby, similar-looking pixels to share a label.

In SENSE experiments, CRF is used to refine multi-label segmentation masks.
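The smoothness effect of CRF refinement can be illustrated with a much simpler stand-in: majority voting over 4-neighborhoods removes isolated mislabeled pixels, which is a crude analogue of the pairwise term a dense CRF enforces. This is explicitly not the paper's CRF, only an intuition-building sketch.

```python
import numpy as np

def smooth_labels(labels: np.ndarray) -> np.ndarray:
    """Majority vote over each pixel's 4-neighborhood (plus itself).

    A crude stand-in for the pairwise smoothness term of a dense CRF;
    real CRF inference also uses appearance and position kernels.
    """
    h, w = labels.shape
    out = labels.copy()
    for y in range(h):
        for x in range(w):
            neigh = [labels[y, x]]
            if y > 0:     neigh.append(labels[y - 1, x])
            if y < h - 1: neigh.append(labels[y + 1, x])
            if x > 0:     neigh.append(labels[y, x - 1])
            if x < w - 1: neigh.append(labels[y, x + 1])
            vals, counts = np.unique(neigh, return_counts=True)
            out[y, x] = vals[np.argmax(counts)]
    return out

noisy = np.zeros((5, 5), dtype=int)
noisy[2, 2] = 1                      # isolated mislabeled pixel
print(smooth_labels(noisy)[2, 2])    # 0 (speckle removed)
```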

Lightweight Decoder

A lightweight decoder processes intermediate feature representations and generates the final segmentation output.

In SENSE, the lightweight decoder processes stereo-fused features.

Sliding-Window Strategy

The sliding-window strategy processes large images as overlapping crops, working around the input-resolution limits of the encoder.

In SENSE experiments, the sliding-window strategy is used to generate fine-grained predictions.

Open Questions (Unanswered questions from this research)

  1. Although SENSE performs well in open-vocabulary semantic segmentation, it may perform poorly under extreme lighting conditions or when stereo matching fails. Further research is needed to improve the model's robustness in these scenarios.
  2. SENSE relies on CLIP's pretrained features, which may limit its performance when dealing with unseen visual features. Future research could explore how to improve the model's generalization without relying on pretrained features.
  3. Training and evaluating SENSE on large-scale datasets is currently limited by computational resources. Future research could explore more efficient training methods to improve performance and efficiency.
  4. The stereo fusion module and lightweight decoder may face computational bottlenecks in complex scenes. Future research could explore more efficient module designs.
  5. Effectively handling multi-label segmentation in the open-vocabulary setting remains an open question. Future research could explore more effective multi-label segmentation strategies.

Applications

Immediate Applications

Autonomous Driving

SENSE can be used for scene understanding in autonomous vehicles, aiding decision-making in complex environments.

Intelligent Transportation Systems

SENSE can be used for scene understanding in Intelligent Transportation Systems, supporting accurate scene recognition from natural language.

Robotic Navigation

SENSE can be used for scene understanding in robotic navigation, helping robots plan paths in dynamic environments.

Long-term Vision

Smart Cities

SENSE can be used for scene understanding in smart cities, supporting intelligent management of urban infrastructure.

Augmented Reality

SENSE can be used for scene understanding in augmented reality, supporting more natural human-computer interaction.

Abstract

Open-vocabulary semantic segmentation enables models to segment objects or image regions beyond fixed class sets, offering flexibility in dynamic environments. However, existing methods often rely on single-view images and struggle with spatial precision, especially under occlusions and near object boundaries. We propose SENSE, the first work on Stereo OpEN Vocabulary SEmantic Segmentation, which leverages stereo vision and vision-language models to enhance open-vocabulary semantic segmentation. By incorporating stereo image pairs, we introduce geometric cues that improve spatial reasoning and segmentation accuracy. Trained on the PhraseStereo dataset, our approach achieves strong performance in phrase-grounded tasks and demonstrates generalization in zero-shot settings. On PhraseStereo, we show a +2.9% improvement in Average Precision over the baseline method and +0.76% over the best competing method. SENSE also provides a relative improvement of +3.5% mIoU on Cityscapes and +18% on KITTI compared to the baseline work. By jointly reasoning over semantics and geometry, SENSE supports accurate scene understanding from natural language, essential for autonomous robots and Intelligent Transportation Systems.

cs.CV cs.RO

References (20)

  1. PhraseStereo: The First Open-Vocabulary Stereo Image Segmentation Dataset. Thomas Campagnolo, E. Malis, Philippe Martinet et al., 2025.
  2. Learning Transferable Visual Models From Natural Language Supervision. Alec Radford, Jong Wook Kim, Chris Hallacy et al., 2021.
  3. HITNet: Hierarchical Iterative Tile Refinement Network for Real-time Stereo Matching. V. Tankovich, Christian Häne, S. Fanello et al., 2020.
  4. Selective-Stereo: Adaptive Frequency Information Selection for Stereo Matching. Xianqi Wang, Gangwei Xu, Hao Jia et al., 2024.
  5. The Cityscapes Dataset for Semantic Urban Scene Understanding. Marius Cordts, Mohamed Omran, Sebastian Ramos et al., 2016.
  6. Image Segmentation Using Text and Image Prompts. Timo Lüddecke, Alexander S. Ecker, 2021.
  7. Scaling Open-Vocabulary Image Segmentation with Image-Level Labels. Golnaz Ghiasi, Xiuye Gu, Yin Cui et al., 2021.
  8. PhraseCut: Language-Based Image Segmentation in the Wild. Chenyun Wu, Zhe Lin, Scott D. Cohen et al., 2020.
  9. MobileStereoNet: Towards Lightweight Deep Networks for Stereo Matching. Faranak Shamsafar, Samuel Woerz, Rafia Rahim et al., 2021.
  10. Image Segmentation with Large Language Models: A Survey with Perspectives for Intelligent Transportation Systems. Sanjeda Akter, Ibne Farabi Shihab, Anuj Sharma, 2025.
  11. Augmented Reality Meets Computer Vision: Efficient Data Generation for Urban Driving Scenes. Hassan Abu Alhaija, Siva Karthik Mustikovela, L. Mescheder et al., 2017.
  12. Extract Free Dense Labels from CLIP. Chong Zhou, Chen Change Loy, Bo Dai, 2021.
  13. Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection. Shilong Liu, Zhaoyang Zeng, Tianhe Ren et al., 2023.
  14. DeCLIP: Decoupled Learning for Open-Vocabulary Dense Perception. Junjie Wang, Bin Chen, Yulin Li et al., 2025.
  15. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. Liang-Chieh Chen, Yukun Zhu, G. Papandreou et al., 2018.
  16. One-Stage Deep Stereo Network. Ziming Liu, E. Malis, Philippe Martinet, 2024.
  17. Automatic differentiation in PyTorch. Adam Paszke, Sam Gross, Soumith Chintala et al., 2017.
  18. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. J. Lafferty, A. McCallum, Fernando Pereira, 2001.
  19. OpenWorldSAM: Extending SAM2 for Universal Image Segmentation with Language Prompts. Shiting Xiao, Rishabh Kabra, Yuhang Li et al., 2025.
  20. Feature-wise transformations. Vincent Dumoulin, Ethan Perez, Nathan Schucher et al., 2018.