AnyMod-LLVE: Low-Light Video Enhancement with Modality-Agnostic Inference
AMNet introduces modality-agnostic inference for low-light video enhancement: it maintains high performance even when auxiliary modalities are missing and outperforms prior state-of-the-art methods.
Key Findings
Methodology
The proposed AMNet framework employs a Spatial-Spectral Dual-Gated (S2DG) Translator to implicitly generate auxiliary modality representations from low-light RGB inputs. During training, large-scale synthetic multimodal video datasets are used to pretrain the model, enabling it to learn cross-modal correspondences. The architecture consists of an RGB encoder, S2DG Translator, temporal modeling modules, and a decoder, supporting flexible inference with any combination of available modalities. The S2DG Translator combines an Illumination-Aware Detail Selector (IADS) and a Frequency-Band Selector (FBS) to extract and emphasize robust high-frequency details from degraded RGB features. The training objective includes reconstruction loss, modality absence simulation loss, and feature distillation loss, ensuring robustness across different modality configurations.
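The flexible-inference idea and the three-term objective described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the names `resolve_aux_features`, `total_loss`, and `toy_translator`, the modality keys, and the loss weights are all assumptions introduced here, and the translator is a fixed placeholder rather than a learned S2DG module.

```python
from typing import Callable, Dict, List, Optional

Feature = List[float]


def resolve_aux_features(
    rgb_feat: Feature,
    aux_feats: Dict[str, Optional[Feature]],
    translator: Callable[[Feature, str], Feature],
) -> Dict[str, Feature]:
    """Use the real auxiliary feature when present, else one translated from RGB."""
    return {
        name: feat if feat is not None else translator(rgb_feat, name)
        for name, feat in aux_feats.items()
    }


def total_loss(
    recon: float,
    absence: float,
    distill: float,
    w_absence: float = 1.0,  # hypothetical weight, not taken from the paper
    w_distill: float = 1.0,  # hypothetical weight, not taken from the paper
) -> float:
    """Weighted sum of reconstruction, absence-simulation, and distillation terms."""
    return recon + w_absence * absence + w_distill * distill


def toy_translator(rgb_feat: Feature, modality: str) -> Feature:
    # Placeholder for a learned translator: a fixed per-modality scaling.
    scale = {"event": 0.5, "infrared": 2.0}[modality]
    return [scale * v for v in rgb_feat]


rgb = [0.1, 0.2, 0.3]
# The event stream is missing (None); the infrared feature is available.
resolved = resolve_aux_features(
    rgb, {"event": None, "infrared": [1.0, 1.0, 1.0]}, toy_translator
)
```

In this sketch the downstream fusion always receives one feature per modality, so the decoder does not need a separate code path for each combination of missing sensors.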
Key Results
- On the RGB-only datasets DID and SDSD, AMNet achieves PSNR of 31.57 dB and 29.03 dB respectively, surpassing previous methods (e.g., +1.47 dB PSNR on DID). When auxiliary modalities are available, performance improves further, with PSNR reaching up to 33.2 dB and SSIM close to 0.95. Under modality-missing scenarios, the performance drop is minimal, confirming the model’s robustness. Ablation studies validate the effectiveness of the S2DG Translator components and the pretraining strategy. The model maintains high-quality detail restoration even under severe low-light and noisy conditions, outperforming existing RGB-only and multimodal approaches.
- The experimental results highlight that AMNet not only excels in standard enhancement metrics but also exhibits remarkable stability when auxiliary modalities are absent, making it highly practical for real-world applications. Its ability to generate implicit auxiliary representations from RGB inputs underpins its robustness, enabling consistent performance across diverse scenarios, including extreme low-light environments and sensor failures. The large-scale synthetic pretraining further enhances its generalization, allowing it to adapt to unseen data distributions effectively.
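To make the reported PSNR figures easier to interpret, the following sketch computes PSNR from mean squared error and converts a PSNR gain in dB into the corresponding MSE reduction. The function names and the assumption that pixel values lie in [0, 1] (`peak=1.0`) are illustrative choices, not details taken from the paper.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], estimate: Sequence[float], peak: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for flat sequences of pixel values."""
    if len(reference) != len(estimate):
        raise ValueError("inputs must have the same number of pixels")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(peak ** 2 / mse)


def mse_ratio_for_gain(gain_db: float) -> float:
    """Factor by which MSE shrinks for a given PSNR gain in dB."""
    return 10.0 ** (-gain_db / 10.0)


# A uniform error of 0.1 on a [0, 1] scale gives an MSE of about 0.01.
example_psnr = psnr([0.5] * 4, [0.6] * 4)
# MSE factor implied by the +1.47 dB margin quoted for DID.
did_ratio = mse_ratio_for_gain(1.47)
```

Because PSNR is logarithmic, a +1.47 dB gain corresponds to an MSE roughly 29% lower than the baseline's, all else being equal.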
Significance
This research addresses a critical bottleneck in low-light video enhancement: dependence on auxiliary modalities. By enabling modality-agnostic inference, AMNet broadens the practical deployment of multimodal enhancement systems and reduces hardware and synchronization costs. Its robustness when modalities are absent makes it suitable for autonomous driving, surveillance, and consumer electronics, where sensor failures or environmental constraints are common. The approach also advances the understanding of cross-modal correspondence learning and opens new avenues for research in multimodal fusion under challenging conditions. Overall, it points toward more resilient, flexible, and scalable low-light imaging solutions for both academia and industry.
Technical Contribution
AMNet introduces a novel S2DG Translator that models the correspondence between RGB features and auxiliary modalities, enabling implicit modality generation solely from RGB inputs. This mechanism leverages spectral analysis and adaptive gating to extract reliable high-frequency cues, even under severe degradation. The integration of illumination-aware detail selection and frequency-band modulation helps preserve critical structural details. The large-scale synthetic pretraining strategy, which uses generative models to produce pseudo auxiliary modalities, enhances the cross-modal learning capability. The overall architecture supports arbitrary modality combinations during inference, representing a significant step forward in robust multimodal video enhancement. Feature distillation and joint optimization are used to reinforce cross-modal correspondence learning.
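The idea of frequency-band gating can be illustrated on a one-dimensional signal. The sketch below is an assumption-laden toy, not the paper's Frequency-Band Selector: it uses a plain discrete Fourier transform, splits the spectrum into equal-width bands, and applies fixed gate weights, whereas a learned module would predict the gates from the features. The names `dft`, `idft`, and `band_gate` are hypothetical.

```python
import cmath
import math
from typing import List, Sequence


def dft(signal: Sequence[float]) -> List[complex]:
    """Naive O(n^2) discrete Fourier transform."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum: Sequence[complex]) -> List[float]:
    """Inverse transform; returns the real part of each sample."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


def band_gate(signal: Sequence[float], gates: Sequence[float]) -> List[float]:
    """Scale each frequency band by its gate weight (gates[0] is the lowest band)."""
    n = len(signal)
    num_bands = len(gates)
    max_freq = n // 2
    gated = []
    for k, coeff in enumerate(dft(signal)):
        freq = min(k, n - k)  # same gate for +f and -f keeps the output real
        band = freq * num_bands // (max_freq + 1)
        gated.append(gates[band] * coeff)
    return idft(gated)


# Constant level 1.0 plus an alternating (highest-frequency) detail of amplitude 0.5.
signal = [1.0 + 0.5 * (-1) ** t for t in range(8)]
low_only = band_gate(signal, [1.0, 0.0])   # suppress the high band
high_only = band_gate(signal, [0.0, 1.0])  # keep only the high band
```

With gates `[0.0, 1.0]` the output retains only the alternating detail, which mirrors the stated goal of emphasizing high-frequency cues.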
Novelty
This work is the first to propose a truly modality-agnostic framework for low-light video enhancement, allowing high-quality results regardless of auxiliary modality availability. Unlike prior methods that rely on complete multimodal inputs, AMNet can generate implicit auxiliary representations from RGB data, addressing the practical challenge of modality absence. The S2DG Translator’s spectral gating and frequency-band selection introduce a new paradigm for extracting and emphasizing informative details from degraded inputs. Additionally, the large-scale synthetic pretraining approach to learn cross-modal correspondence is a novel contribution, significantly improving robustness and generalization in real-world scenarios.
Limitations
- Despite its robustness, the model's performance may still degrade under extremely noisy or occluded scenes where critical details are entirely lost or heavily corrupted.
- The reliance on synthetic data for pretraining introduces domain gaps, and real-world auxiliary modalities may differ from generated counterparts, potentially affecting accuracy.
- The computational complexity of spectral analysis and dual-gating mechanisms may limit real-time deployment in resource-constrained environments.
Future Work
Future research will focus on reducing computational overhead to enable real-time processing, possibly through model compression or more efficient spectral gating. Additionally, integrating self-supervised learning could further improve the model's adaptability to diverse real-world conditions. Expanding the framework to incorporate other modalities such as depth or audio signals could enhance robustness and scene understanding. Moreover, developing domain adaptation techniques to bridge the gap between synthetic pretraining data and real-world data remains an important direction.
AI Executive Summary
Enhancing videos captured under low-light conditions is a longstanding challenge in computer vision, with significant implications for safety, security, and consumer applications. Traditional approaches relying solely on RGB data often struggle to recover fine details and structures when illumination is severely limited. Recent multimodal methods have introduced auxiliary sensors such as infrared cameras and event-based sensors to supplement visual information, leading to notable improvements. However, these methods typically assume that auxiliary modalities are always available during inference, which is rarely the case in practical scenarios due to hardware costs, calibration complexity, and synchronization issues.
This dependency limits the deployment of multimodal systems in real-world environments, where sensor failures or environmental constraints frequently cause modality absence. Addressing this gap, the authors propose AMNet, a novel framework that supports modality-agnostic inference for low-light video enhancement. The core innovation lies in the Spatial-Spectral Dual-Gated (S2DG) Translator, which learns to implicitly generate auxiliary modality representations from low-light RGB inputs. This mechanism leverages spectral analysis and adaptive gating to extract and emphasize robust high-frequency details, even under severe degradation.
During training, the model is pretrained on large-scale synthetic multimodal datasets, where auxiliary modalities are simulated using generative models conditioned on RGB videos. This pretraining enables AMNet to learn cross-modal correspondences, which are then used during inference to produce high-quality enhanced videos regardless of auxiliary modality availability. Extensive experiments on datasets such as DID, SDSD, and SDE demonstrate that AMNet outperforms existing state-of-the-art methods, achieving PSNR of 31.57 dB on DID and 29.03 dB on SDSD in RGB-only settings, with further improvements when auxiliary modalities are present.
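As a rough illustration of how an auxiliary modality can be synthesized from RGB-only video, the sketch below derives event polarities from two consecutive intensity frames. It deliberately swaps in the standard contrast-threshold event model (an event fires when the log-intensity change exceeds a threshold) in place of the generative models mentioned above; the function name, threshold, and `eps` value are assumptions chosen for the example.

```python
import math
from typing import List, Sequence


def simulate_events(
    prev_frame: Sequence[float],
    next_frame: Sequence[float],
    threshold: float = 0.2,  # hypothetical contrast threshold
    eps: float = 1e-3,       # avoids log(0) on fully dark pixels
) -> List[int]:
    """Per-pixel event polarity: +1 brighter, -1 darker, 0 no event."""
    events = []
    for p, q in zip(prev_frame, next_frame):
        delta = math.log(q + eps) - math.log(p + eps)
        if delta >= threshold:
            events.append(1)
        elif delta <= -threshold:
            events.append(-1)
        else:
            events.append(0)
    return events


# First pixel doubles in brightness, second is unchanged, third halves.
polarities = simulate_events([0.1, 0.5, 0.5], [0.2, 0.5, 0.25])
```

Pairing such synthetic signals with the original RGB frames yields the kind of multimodal training tuples the pretraining stage relies on, though the paper's generated modalities are presumably richer than this toy model.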
Remarkably, even when auxiliary modalities are missing at inference, AMNet maintains high enhancement quality with minimal performance drop, showcasing its robustness and practical value. The approach not only advances the theoretical understanding of cross-modal correspondence learning but also offers a scalable, flexible solution for real-world low-light video enhancement tasks. Future work aims to optimize computational efficiency, incorporate additional modalities, and enhance real-time capabilities, promising broader impact across autonomous systems, surveillance, and consumer electronics.
Deep Dive
Abstract
Low-light video enhancement (LLVE) remains a challenging task due to severe information degradation under low-illumination conditions. Recent multimodal approaches have significantly improved enhancement performance by incorporating auxiliary modalities, such as event streams and infrared images. However, these methods typically assume the availability of these modalities at inference, which is often not feasible in real-world scenarios. To solve this problem, we propose AMNet, a unified multimodal framework for LLVE that supports flexible modality-agnostic inference, where auxiliary modalities may be unavailable. To address the issue of modality absence, we introduce a Spatial-Spectral Dual-Gated Translator that learns the correspondence between auxiliary modalities and RGB inputs, producing implicit auxiliary representations to support robust enhancement. Additionally, to facilitate the learning of cross-modal correspondence, we conduct large-scale multimodal pretraining based on the RGB-only dataset with synthetic auxiliary modalities. Extensive experiments demonstrate that AMNet can handle arbitrary inference-time modality combinations and exhibits superior performance for LLVE under modality absence conditions. Code and models are available on the project page.