Bidirectional Cross-Modal Prompting for Event-Frame Asymmetric Stereo

TL;DR

The Bi-CMPStereo framework significantly improves accuracy and generalization in event-frame asymmetric stereo matching.

cs.CV 2026-04-17
Ninghui Xu, Fabio Tosi, Lihui Wang, Jiawei Han, Luca Bartolomei, Zhiting Yao, Matteo Poggi, Stefano Mattoccia
event camera, stereo matching, deep learning, cross-modal, computer vision

Key Findings

Methodology

The paper introduces a novel framework called Bi-CMPStereo, which utilizes bidirectional cross-modal prompting to bridge the modality gap between event and frame cameras. The approach involves learning finely aligned stereo representations within a target canonical space and projecting each modality into both event and frame domains to integrate complementary representations. Key components include the Stereo Canonicalization Constraint (SCC) and Cross-Domain Embedding Adapter (CDEA), which enhance target domain features and achieve high-fidelity cross-modal alignment.

Key Results

  • On the DSEC dataset, Bi-CMPStereo achieved a mean absolute error (MAE) of 0.532, significantly outperforming state-of-the-art methods like ZEST and SEVFI-Net.
  • In cross-dataset generalization tests on the MVSEC dataset, Bi-CMPStereo excelled across all test scenarios, demonstrating its robust generalization capabilities.
  • Ablation studies showed that removing the CDEA and SCC modules led to significant performance drops, underscoring their importance in the framework.

Significance

This research holds significant implications for 3D perception under fast motion and challenging illumination conditions. By effectively combining the strengths of event and frame cameras, the Bi-CMPStereo framework achieves higher accuracy and better generalization in stereo matching. This approach offers reliable 3D perception solutions for fields such as robotics, autonomous driving, and augmented reality.

Technical Contribution

The Bi-CMPStereo framework addresses the modality gap between events and frames through bidirectional cross-modal prompting. Its technical contributions include: 1) introducing the Stereo Canonicalization Constraint (SCC) for high-fidelity cross-modal alignment; 2) designing the Cross-Domain Embedding Adapter (CDEA) to enhance target domain features; 3) achieving robust asymmetric stereo matching through bidirectional cost volumes.

Novelty

Bi-CMPStereo is the first framework to employ bidirectional cross-modal prompting in event-frame asymmetric stereo matching. Compared to existing methods, it not only significantly improves accuracy but also excels in generalization, addressing long-standing issues of information loss in cross-modal alignment.

Limitations

  • The sparsity of events in static or low-texture regions may lead to insufficiently dense depth estimation.
  • The method's high computational cost may limit its applicability in real-time applications.
  • In extreme lighting conditions, frame cameras may still suffer from blurring issues.

Future Work

Future research directions include: 1) further optimizing the algorithm to reduce computational costs; 2) exploring applications in more complex scenarios, such as nighttime driving; 3) investigating integration with other sensors, like LiDAR, to enhance depth perception capabilities.

AI Executive Summary

In the field of computer vision, stereo matching is a crucial technique widely used in robotics, autonomous driving, and augmented reality. However, traditional frame-based cameras often suffer from limited temporal resolution and motion blur in dynamic scenes. Event cameras, as a novel type of visual sensor, can detect per-pixel illumination changes with microsecond temporal resolution, offering higher dynamic range and extremely low latency.

This paper introduces a novel framework called Bi-CMPStereo, designed to address the modality gap between event and frame cameras. The framework learns finely aligned stereo representations within a target canonical space and projects each modality into both event and frame domains to integrate complementary representations. Its core components include the Stereo Canonicalization Constraint (SCC) and Cross-Domain Embedding Adapter (CDEA), which enhance target domain features and achieve high-fidelity cross-modal alignment.

In experiments, Bi-CMPStereo demonstrated outstanding performance on both the DSEC and MVSEC datasets, significantly outperforming existing state-of-the-art methods. Notably, on the DSEC dataset, Bi-CMPStereo achieved a mean absolute error (MAE) of 0.532, showcasing its strong performance under challenging illumination conditions. Additionally, the framework excelled in cross-dataset generalization tests, validating its robust generalization capabilities.

The successful application of the Bi-CMPStereo framework opens new possibilities for 3D perception, especially under fast motion and challenging illumination conditions. By effectively combining the strengths of event and frame cameras, this framework offers reliable 3D perception solutions for fields such as robotics, autonomous driving, and augmented reality.

However, the method may face challenges in static or low-texture regions due to the sparsity of events, leading to insufficiently dense depth estimation. Moreover, although Bi-CMPStereo significantly improves accuracy, its high computational cost may limit its applicability in real-time applications. Future research directions include further optimizing the algorithm to reduce computational costs and exploring integration with other sensors to enhance depth perception capabilities.

Deep Analysis

Background

Stereo matching holds a significant position in computer vision, primarily tasked with establishing pixel-wise correspondences between stereo images to compute dense disparity maps for depth estimation. In recent years, deep learning has driven remarkable progress in stereo matching for conventional RGB cameras, with iterative refinement-based methods demonstrating particular prominence. However, traditional frame-based cameras often suffer from limited temporal resolution and motion blur in dynamic scenes. Event cameras, as a novel bio-inspired neuromorphic sensor, can asynchronously detect per-pixel illumination changes with microsecond temporal resolution, offering higher dynamic range and extremely low latency. These properties make event cameras compelling for stereo matching in highly dynamic scenarios with challenging illumination.

Core Problem

Despite the complementary characteristics of event and frame cameras, the significant modality gap often leads to the marginalization of domain-specific cues essential for cross-modal stereo matching. Existing methods attempt to mitigate this discrepancy through domain-level or feature-level alignment, but they often overlook discriminative domain-specific features, leading to information loss. The key challenge, therefore, is to learn expressive representations without information-lossy marginalization while achieving high-fidelity cross-modal alignment.

Innovation

This paper proposes a novel framework called Bi-CMPStereo, designed to address the modality gap between event and frame cameras. Its key innovations are:

  • Stereo Canonicalization Constraint (SCC): enhances target-domain discriminative features by learning finely aligned stereo representations within a target canonical space.
  • Cross-Domain Embedding Adapter (CDEA): achieves fine-grained feature alignment by explicitly activating discriminative target-domain cues latent in source-domain representations.
  • Bidirectional cost volumes: achieve robust asymmetric stereo matching by simultaneously exploiting cost volumes across both domains.

Methodology

The core methodology of the Bi-CMPStereo framework includes:

  • the Stereo Canonicalization Constraint (SCC), which learns finely aligned stereo representations within a target canonical space;
  • the Cross-Domain Embedding Adapter (CDEA), which explicitly activates discriminative target-domain cues latent in source-domain representations;
  • bidirectional cost volumes for robust asymmetric stereo matching;
  • Hierarchical Visual Transformation (HVT) to enhance the robustness and generalization of context features;
  • cascaded ConvGRUs for iterative refinement of disparity.
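The paper's network is not reproduced in this summary, but the cost-volume construction underlying the matching step can be sketched generically. The function below builds a standard correlation cost volume between two feature maps; in a bidirectional setup, one such volume would be built in each projected domain (event-space and frame-space) and the two fused. All names and shapes here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def correlation_cost_volume(feat_left, feat_right, max_disp):
    """Build a (max_disp, H, W) correlation cost volume from two feature maps.

    feat_left, feat_right: arrays of shape (C, H, W), one per view.
    The cost at disparity d matches feat_left[:, y, x] against
    feat_right[:, y, x - d]; out-of-range positions are left at zero.
    """
    C, H, W = feat_left.shape
    cost = np.zeros((max_disp, H, W))
    for d in range(max_disp):
        # channel-averaged dot product between left features and
        # right features shifted d pixels to the right
        cost[d, :, d:] = (feat_left[:, :, d:] * feat_right[:, :, :W - d]).sum(axis=0) / C
    return cost
```

An iterative-refinement head (such as the cascaded ConvGRUs mentioned above) would then repeatedly index this volume to update its disparity estimate.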

Experiments

Experiments were conducted on the DSEC and MVSEC datasets, using metrics such as mean absolute error (MAE), root mean square error (RMSE), and n-pixel error (nPE) for evaluation. Baseline methods included ZEST, SEVFI-Net, SE-CFF, and DTC-SPADE. Ablation studies validated the importance of the Stereo Canonicalization Constraint (SCC) and Cross-Domain Embedding Adapter (CDEA) in the framework.
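As a concrete reference for the evaluation protocol, the three reported metrics can be computed as follows. The validity masking convention and the default threshold n are assumptions here, not details taken from the paper.

```python
import numpy as np

def disparity_metrics(pred, gt, n=1.0):
    """Return (MAE, RMSE, nPE%) over valid ground-truth pixels.

    pred, gt: disparity maps of equal shape; pixels without ground
    truth are marked NaN in gt and excluded from every metric.
    """
    valid = ~np.isnan(gt)
    err = np.abs(pred[valid] - gt[valid])
    mae = err.mean()                      # mean absolute error
    rmse = np.sqrt((err ** 2).mean())     # root mean square error
    npe = (err > n).mean() * 100.0        # % of pixels with error > n px
    return mae, rmse, npe
```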

Results

On the DSEC dataset, Bi-CMPStereo achieved a mean absolute error (MAE) of 0.532, significantly outperforming existing state-of-the-art methods. In cross-dataset generalization tests on the MVSEC dataset, Bi-CMPStereo excelled across all test scenarios, demonstrating its robust generalization capabilities. Ablation studies showed that removing the CDEA and SCC modules led to significant performance drops, underscoring their importance in the framework.

Applications

The Bi-CMPStereo framework has broad applications in fields such as robotics, autonomous driving, and augmented reality. Its high accuracy and strong generalization capabilities make it significant for 3D perception under fast motion and challenging illumination conditions. Future work could explore integration with other sensors, like LiDAR, to enhance depth perception capabilities.

Limitations & Outlook

Despite significant improvements in accuracy, the high computational cost of Bi-CMPStereo may limit its applicability in real-time applications. Additionally, the sparsity of events in static or low-texture regions may lead to insufficiently dense depth estimation. Future research directions include further optimizing the algorithm to reduce computational costs and exploring integration with other sensors.

Plain Language: Accessible to non-experts

Imagine you're in a kitchen cooking. A traditional frame-based camera is like a regular camera that can take very clear pictures, but it might get blurry when things move quickly. An event camera is like a high-speed camera that can capture every tiny change, even when you're stirring the pot. The Bi-CMPStereo framework is like a smart kitchen assistant that combines the strengths of both cameras, helping you keep track of every step in a fast-changing kitchen environment. It uses a technique called bidirectional cross-modal prompting to make sure you don't miss any critical details while cooking. Even when the lighting changes dramatically, it helps you maintain high precision. It's like having a super assistant that makes you a master chef in the kitchen.

ELI14: Explained like you're 14

Hey there! Did you know there's a super cool camera called an event camera? It's like a superhero that can capture every quick change, like when you're playing a video game and every move you make. A regular camera is like a normal camera, which takes clear pictures but might get a bit blurry when things move fast. Now, there's a super technology called Bi-CMPStereo that combines the best of both cameras, like having Spider-Man and Iron Man team up to solve those fast-changing and tricky lighting problems. It's like a smart assistant that helps you see the clearest picture in any situation. Even in dark places, it helps you see everything clearly. Isn't that amazing?

Glossary

Event Camera

An event camera is a sensor that detects per-pixel illumination changes with microsecond temporal resolution, offering high dynamic range and extremely low latency.

In this paper, event cameras are used to capture dynamic scene changes.
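To make the sensor model concrete: before reaching a CNN, an asynchronous event stream is commonly binned into a voxel grid. The bilinear temporal splatting below is one standard choice, not necessarily the representation used in this work; all parameter names are illustrative.

```python
import numpy as np

def events_to_voxel_grid(xs, ys, ts, ps, num_bins, height, width):
    """Accumulate polarity-signed events into a (num_bins, H, W) grid.

    xs, ys: pixel coordinates; ts: timestamps; ps: polarities (+1/-1).
    Each event's timestamp is normalized to [0, num_bins - 1] and its
    contribution is split linearly between the two nearest time bins.
    """
    grid = np.zeros((num_bins, height, width))
    span = max(ts.max() - ts.min(), 1e-9)        # avoid divide-by-zero
    t = (ts - ts.min()) / span * (num_bins - 1)
    t0 = np.floor(t).astype(int)
    frac = t - t0
    for bins, weights in ((t0, 1.0 - frac),
                          (np.minimum(t0 + 1, num_bins - 1), frac)):
        # scatter-add each event's weighted polarity into its time bin
        np.add.at(grid, (bins, ys, xs), ps * weights)
    return grid
```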

Stereo Matching

Stereo matching is a method of establishing pixel-wise correspondences between stereo images to compute disparity maps for depth estimation.

In this paper, stereo matching is used for depth perception between events and frames.
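The link between the estimated disparity and metric depth follows the usual pinhole stereo relation Z = f * B / d, with focal length f in pixels and baseline B in meters. The helper below is a generic sketch; the example numbers are illustrative, not taken from the paper.

```python
def disparity_to_depth(disparity_px, focal_length_px, baseline_m):
    """Pinhole stereo depth in meters: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_length_px * baseline_m / disparity_px

# e.g. a 10 px disparity with f = 500 px and a 10 cm baseline gives Z = 5 m
```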

Cross-Modal

Cross-modal refers to the process of integrating and aligning information between different types of data sources.

In this paper, cross-modal is used to combine the strengths of event and frame cameras.

Bidirectional Cross-Modal Prompting

Bidirectional cross-modal prompting is a technique for prompting and aligning information between different modalities to achieve high-fidelity cross-modal alignment.

In this paper, this technique addresses the modality gap between events and frames.

Stereo Canonicalization Constraint

The Stereo Canonicalization Constraint is a method of enhancing target domain discriminative features by learning finely aligned stereo representations within a target canonical space.

In this paper, this constraint is used for high-fidelity cross-modal alignment.

Cross-Domain Embedding Adapter

The Cross-Domain Embedding Adapter is a technique for explicitly activating discriminative target-domain cues latent in source-domain representations to achieve fine-grained feature alignment.

In this paper, this adapter enhances target domain features.

Hierarchical Visual Transformation

Hierarchical Visual Transformation is a technique for learning context features by generating multi-level visual transformations, enhancing robustness.

In this paper, this technique prevents shortcut learning of context features.

Cascaded ConvGRU

Cascaded ConvGRU is a technique for iterative refinement of disparity, achieving fine alignment of multi-scale features through a cascaded structure.

In this paper, this technique is used for iterative refinement of disparity.
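A minimal sketch of a single ConvGRU update may help. For brevity it uses 1x1 (pointwise) mixing in place of spatial convolutions, so it illustrates the gating math only; it is an assumption-laden toy, not the paper's module.

```python
import numpy as np

def _sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class ConvGRUCell:
    """ConvGRU update with 1x1 (pointwise) mixing for brevity.

    Real ConvGRUs use spatial kernels (e.g. 3x3); the gate equations
    below are the same: z (update), r (reset), q (candidate state).
    """
    def __init__(self, in_ch, hid_ch, seed=0):
        rng = np.random.default_rng(seed)
        shape = (hid_ch, hid_ch + in_ch)
        self.Wz = rng.standard_normal(shape) * 0.1
        self.Wr = rng.standard_normal(shape) * 0.1
        self.Wq = rng.standard_normal(shape) * 0.1

    def __call__(self, h, x):
        hx = np.concatenate([h, x], axis=0)                  # (hid+in, H, W)
        z = _sigmoid(np.einsum('oc,chw->ohw', self.Wz, hx))  # update gate
        r = _sigmoid(np.einsum('oc,chw->ohw', self.Wr, hx))  # reset gate
        q = np.tanh(np.einsum('oc,chw->ohw', self.Wq,
                              np.concatenate([r * h, x], axis=0)))
        return (1.0 - z) * h + z * q                         # new hidden state
```

Cascading several such cells at different resolutions, each refining the disparity left by the previous one, gives the iterative structure the glossary entry describes.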

DSEC Dataset

The DSEC dataset is a high-quality event stereo dataset capturing event streams and synchronized intensity frames in outdoor driving scenarios.

In this paper, this dataset is used to evaluate the performance of Bi-CMPStereo.

MVSEC Dataset

The MVSEC dataset is a standard dataset for event stereo matching, containing event and frame data for indoor and outdoor scenes.

In this paper, this dataset is used for cross-dataset generalization testing.

Open Questions: Unanswered questions from this research

  • 1 The sparsity of events in static or low-texture regions remains a challenge, and how to achieve dense depth estimation in these areas is an open question.
  • 2 How to improve the real-time performance of Bi-CMPStereo without increasing computational costs is a crucial direction for future research.
  • 3 In extreme lighting conditions, frame cameras still suffer from blurring issues, and maintaining high accuracy in these conditions is challenging.
  • 4 Existing cross-modal alignment methods still have room for improvement in terms of information loss, and achieving higher fidelity alignment without losing information is a research hotspot.
  • 5 Integration with other sensors, such as LiDAR, needs further exploration to enhance depth perception capabilities.

Applications

Immediate Applications

Autonomous Driving

Bi-CMPStereo can be used in autonomous driving for 3D perception, helping vehicles achieve accurate environmental perception under fast motion and challenging illumination conditions.

Robotic Navigation

In robotic navigation, this framework can provide high-precision depth information, helping robots move safely in dynamic environments.

Augmented Reality

In augmented reality applications, Bi-CMPStereo can provide more accurate depth perception, enhancing user experience.

Long-term Vision

Smart City Surveillance

By integrating Bi-CMPStereo, future smart city surveillance systems can achieve more efficient dynamic scene monitoring and event detection.

Drone Navigation

In drone navigation, this technology can help drones achieve autonomous flight and obstacle avoidance in complex environments.

Abstract

Conventional frame-based cameras capture rich contextual information but suffer from limited temporal resolution and motion blur in dynamic scenes. Event cameras offer an alternative visual representation with higher dynamic range free from such limitations. The complementary characteristics of the two modalities make event-frame asymmetric stereo promising for reliable 3D perception under fast motion and challenging illumination. However, the modality gap often leads to marginalization of domain-specific cues essential for cross-modal stereo matching. In this paper, we introduce Bi-CMPStereo, a novel bidirectional cross-modal prompting framework that fully exploits semantic and structural features from both domains for robust matching. Our approach learns finely aligned stereo representations within a target canonical space and integrates complementary representations by projecting each modality into both event and frame domains. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art methods in accuracy and generalization.


References (20)

DSEC: A Stereo Event Camera Dataset for Driving Scenarios. Mathias Gehrig, Willem Aarents, Daniel Gehrig et al., 2021.

Stereo Depth from Events Cameras: Concentrate and Focus on the Future. Yeongwoo Nam, Mohammad Mostafavi, Kuk-Jin Yoon et al., 2022.

Video Frame Interpolation With Stereo Event and Intensity Cameras. Chao Ding, Mingyuan Lin, Haijian Zhang et al., 2023.

Zero-Shot Event-Intensity Asymmetric Stereo via Visual Prompting from Image Domain. Hanyue Lou, Jinxiu Liang, Minggui Teng et al., 2024.

Discrete time convolution for fast event-based stereo. Kai Zhang, Kaiwei Che, Jianguo Zhang et al., 2022.

NeRF-Supervised Deep Stereo. Fabio Tosi, A. Tonioni, Daniele De Gregorio et al., 2023.

Adam: A Method for Stochastic Optimization. Diederik P. Kingma, Jimmy Ba, 2014.

Depth AnyEvent: A Cross-Modal Distillation Paradigm for Event-Based Monocular Depth Estimation. Luca Bartolomei, Enrico Mannocci, Fabio Tosi et al., 2025.

GA-Net: Guided Aggregation Net for End-To-End Stereo Matching. Feihu Zhang, V. Prisacariu, Ruigang Yang et al., 2019.

Stereo Anywhere: Robust Zero-Shot Deep Stereo Matching Even Where Either Stereo or Mono Fail. Luca Bartolomei, Fabio Tosi, Matteo Poggi et al., 2024.

On the Synergies Between Machine Learning and Binocular Stereo for Depth Estimation From Images: A Survey. Matteo Poggi, Fabio Tosi, Konstantinos Batsos et al., 2021.

Event-Based Stereo Depth Estimation: A Survey. Suman Ghosh, Guillermo Gallego, 2024.

GraftNet: Towards Domain Generalized Stereo Matching with a Broad-Spectrum and Task-Oriented Feature. Biyang Liu, Huimin Yu, Guodong Qi, 2022.

Learning to Reconstruct HDR Images from Events, with Applications to Depth and Flow Prediction. Mohammad Mostafavi, Lin Wang, Kuk-Jin Yoon, 2021.

ITSA: An Information-Theoretic Approach to Automatic Shortcut Avoidance and Domain Generalization in Stereo Matching Networks. Weiqin Chuah, Ruwan Tennakoon, R. Hoseinnezhad et al., 2022.

AANet: Adaptive Aggregation Network for Efficient Stereo Matching. Haofei Xu, Juyong Zhang, 2020.

Enhanced Event-based Dense Stereo via Cross-Sensor Knowledge Distillation. Haihao Zhang, Yunjian Zhang, Jianing Li et al.

Practical Stereo Matching via Cascaded Recurrent Network with Adaptive Correlation. Jiankun Li, Peisen Wang, Pengfei Xiong et al., 2022.

MonSter: Marry Monodepth to Stereo Unleashes Power. Junda Cheng, Longliang Liu, Gangwei Xu et al., 2025.

BridgeDepth: Bridging Monocular and Stereo Reasoning with Latent Alignment. Tongfan Guan, Jiaxin Guo, Chen Wang et al., 2025.