AnyMod-LLVE: Low-Light Video Enhancement with Modality-Agnostic Inference

TL;DR

AMNet brings modality-agnostic inference to low-light video enhancement: it keeps enhancement quality high when auxiliary modalities such as event streams or infrared images are missing, and is reported to outperform prior state-of-the-art methods.

cs.CV Advanced 2026-06-10

Hangfeng Liang Yutao Hu Yanhan Hu Xiaohan Wu Wenqi Shao Ying Fu


Low-Light Video Enhancement Multimodal Learning Modality-Agnostic Inference Deep Learning Synthetic Data

Key Findings

Methodology

The proposed AMNet framework employs a Spatial-Spectral Dual-Gated (S2DG) Translator to implicitly generate auxiliary modality representations from low-light RGB inputs. During training, large-scale synthetic multimodal video datasets are used to pretrain the model, enabling it to learn cross-modal correspondences. The architecture consists of an RGB encoder, S2DG Translator, temporal modeling modules, and a decoder, supporting flexible inference with any combination of available modalities. The S2DG Translator combines an Illumination-Aware Detail Selector (IADS) and a Frequency-Band Selector (FBS) to extract and emphasize robust high-frequency details from degraded RGB features. The training objective includes reconstruction loss, modality absence simulation loss, and feature distillation loss, ensuring robustness across different modality configurations.
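The flexible inference path described above can be sketched as a simple fallback rule: use a real auxiliary feature when the sensor provides one, otherwise substitute a feature translated from RGB. This is a minimal sketch under stated assumptions, not the authors' implementation. The function names (`translate_from_rgb`, `gather_aux_features`, `fuse`), the fixed scaling factors, and the additive fusion are all hypothetical placeholders standing in for the learned S2DG Translator and the real fusion modules.

```python
from typing import Dict, List, Optional, Sequence

Feature = List[float]


def translate_from_rgb(rgb_feat: Feature, modality: str) -> Feature:
    """Hypothetical stand-in for the S2DG Translator.

    A real translator would be a learned network; here a fixed per-modality
    scale is used only so the example runs end to end.
    """
    scale = {"event": 0.5, "infrared": 0.25}.get(modality, 0.1)
    return [scale * v for v in rgb_feat]


def gather_aux_features(
    rgb_feat: Feature,
    available: Dict[str, Optional[Feature]],
    modalities: Sequence[str] = ("event", "infrared"),
) -> Dict[str, Feature]:
    """Return one feature per modality, real if present, translated otherwise."""
    out: Dict[str, Feature] = {}
    for name in modalities:
        real = available.get(name)
        out[name] = real if real is not None else translate_from_rgb(rgb_feat, name)
    return out


def fuse(rgb_feat: Feature, aux: Dict[str, Feature]) -> Feature:
    """Assumed fusion: element-wise sum of RGB and all auxiliary features."""
    fused = list(rgb_feat)
    for feat in aux.values():
        fused = [a + b for a, b in zip(fused, feat)]
    return fused


rgb = [1.0, 2.0]
# Only the event feature is available; infrared falls back to translation.
aux = gather_aux_features(rgb, {"event": [0.0, 1.0]})
fused = fuse(rgb, aux)
```

Because every modality slot is always filled, the downstream temporal modules and decoder see the same input structure for any combination of available sensors.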

Key Results

On the RGB-only datasets DID and SDSD, AMNet achieves PSNR scores of 31.57 dB and 29.03 dB respectively, surpassing previous methods (e.g., by +1.47 dB PSNR on DID). When auxiliary modalities are available, performance improves further, with PSNR reaching up to 33.2 dB and SSIM close to 0.95. Under modality-missing scenarios, the performance drop is minimal, confirming the model’s robustness. Ablation studies validate the effectiveness of the S2DG Translator components and the pretraining strategy. The model maintains high-quality detail restoration even under severe low-light and noisy conditions, outperforming existing RGB-only and multimodal approaches.
The experimental results highlight that AMNet not only excels in standard enhancement metrics but also exhibits remarkable stability when auxiliary modalities are absent, making it highly practical for real-world applications. Its ability to generate implicit auxiliary representations from RGB inputs underpins its robustness, enabling consistent performance across diverse scenarios, including extreme low-light environments and sensor failures. The large-scale synthetic pretraining further enhances its generalization, allowing it to adapt to unseen data distributions effectively.
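For readers less familiar with the metric, PSNR is defined as 10·log10(MAX² / MSE), so a gain quoted in dB maps directly to a ratio of mean squared errors. The sketch below computes PSNR for flat lists of pixel values in [0, 1] and converts a dB gain into the implied MSE ratio. The function names are illustrative, and the evaluation protocol of the paper (color space, per-frame versus per-video averaging) is not specified in this summary.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], restored: Sequence[float], max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized signals."""
    if len(reference) != len(restored) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - x) ** 2 for r, x in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db: float) -> float:
    """MSE of the better method divided by MSE of the baseline for a given dB gain."""
    return 10.0 ** (-gain_db / 10.0)


example_psnr = psnr([0.0, 1.0], [0.1, 0.9])  # MSE is about 0.01, so about 20 dB
ratio = mse_ratio_for_gain(1.47)             # roughly 0.71
```

Under this definition, the +1.47 dB gain reported on DID corresponds to a mean squared error roughly 29% lower than that of the compared method.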

Significance

This research addresses a critical bottleneck in low-light video enhancement: dependence on auxiliary modalities. By enabling modality-agnostic inference, AMNet broadens the practical deployment of multimodal enhancement systems and reduces hardware and synchronization costs. Its robustness when modalities are absent makes it suitable for autonomous driving, surveillance, and consumer electronics, where sensor failures or environmental constraints are common. The approach also contributes to the understanding of cross-modal correspondence learning and opens avenues for research in multimodal fusion under challenging conditions. Overall, it points toward more resilient, flexible, and scalable low-light imaging solutions for both academia and industry.

Technical Contribution

AMNet introduces a novel S2DG Translator that explicitly models the correspondence between RGB features and auxiliary modalities, enabling implicit modality generation solely from RGB inputs. This mechanism leverages spectral analysis and adaptive gating to extract reliable high-frequency cues, even under severe degradation. The integration of illumination-aware detail selection and frequency-band modulation helps preserve critical structural details. The large-scale synthetic pretraining strategy, which uses generative models to produce pseudo auxiliary modalities, enhances the cross-modal learning capability. The overall architecture supports arbitrary modality combinations during inference, representing a significant step forward in robust multimodal video enhancement. Feature distillation and joint optimization are used to reinforce cross-modal correspondence learning.
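The idea of frequency-band gating mentioned above can be illustrated on a one-dimensional signal: transform to the frequency domain, scale each band by a gate in [0, 1], and transform back. This is only a toy sketch of the general technique. The summary does not describe how the Frequency-Band Selector partitions the spectrum or how its gates are predicted, so the equal-width bands, the hand-set gate values, and the names `dft`, `idft`, and `band_gate` are all assumptions.

```python
import cmath
from typing import List, Sequence


def dft(signal: Sequence[float]) -> List[complex]:
    """Naive discrete Fourier transform (O(n^2), fine for a small demo)."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum: Sequence[complex]) -> List[float]:
    """Inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


def band_gate(signal: Sequence[float], gates: Sequence[float]) -> List[float]:
    """Scale each frequency band by its gate; gates[0] is the lowest band."""
    n = len(signal)
    max_freq = n // 2
    num_bands = len(gates)
    gated = []
    for k, coeff in enumerate(dft(signal)):
        freq = min(k, n - k)  # conjugate pairs share the same band and gate
        band = min(freq * num_bands // (max_freq + 1), num_bands - 1)
        gated.append(coeff * gates[band])
    return idft(gated)


# Signal = constant offset (low frequency) + alternating detail (high frequency).
signal = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0]
# Suppress the low band and keep the high band: only the alternating detail remains.
detail = band_gate(signal, gates=[0.0, 1.0])
```

In a learned model the gates would presumably be predicted from the input features rather than fixed by hand, which is what would make the selection adaptive to the degradation in each frame.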

Novelty

This work is the first to propose a truly modality-agnostic framework for low-light video enhancement, allowing high-quality results regardless of auxiliary modality availability. Unlike prior methods that rely on complete multimodal inputs, AMNet can generate implicit auxiliary representations from RGB data, addressing the practical challenge of modality absence. The S2DG Translator’s spectral gating and frequency-band selection introduce a new paradigm for extracting and emphasizing informative details from degraded inputs. Additionally, the large-scale synthetic pretraining approach to learn cross-modal correspondence is a novel contribution, significantly improving robustness and generalization in real-world scenarios.

Limitations

Despite its robustness, the model's performance may still degrade under extremely noisy or occluded scenes where critical details are entirely lost or heavily corrupted.
The reliance on synthetic data for pretraining introduces domain gaps, and real-world auxiliary modalities may differ from generated counterparts, potentially affecting accuracy.
The computational complexity of spectral analysis and dual-gating mechanisms may limit real-time deployment in resource-constrained environments.

Future Work

Future research will focus on reducing computational overhead to enable real-time processing, possibly through model compression or more efficient spectral gating. Additionally, integrating self-supervised learning could further improve the model's adaptability to diverse real-world conditions. Expanding the framework to incorporate other modalities such as depth or audio signals could enhance robustness and scene understanding. Moreover, developing domain adaptation techniques to bridge the gap between synthetic pretraining data and real-world data remains an important direction.

AI Executive Summary

Enhancing videos captured under low-light conditions is a longstanding challenge in computer vision, with significant implications for safety, security, and consumer applications. Traditional approaches relying solely on RGB data often struggle to recover fine details and structures when illumination is severely limited. Recent multimodal methods have introduced auxiliary sensors such as infrared cameras and event-based sensors to supplement visual information, leading to notable improvements. However, these methods typically assume that auxiliary modalities are always available during inference, which is rarely the case in practical scenarios due to hardware costs, calibration complexity, and synchronization issues.

This dependency limits the deployment of multimodal systems in real-world environments, where sensor failures or environmental constraints frequently cause modality absence. Addressing this gap, the authors propose AMNet, a novel framework that supports modality-agnostic inference for low-light video enhancement. The core innovation lies in the Spatial-Spectral Dual-Gated (S2DG) Translator, which learns to implicitly generate auxiliary modality representations from low-light RGB inputs. This mechanism leverages spectral analysis and adaptive gating to extract and emphasize robust high-frequency details, even under severe degradation.

During training, the model is pretrained on large-scale synthetic multimodal datasets, where auxiliary modalities are simulated using generative models conditioned on RGB videos. This pretraining enables AMNet to learn cross-modal correspondences, which are then used during inference to produce high-quality enhanced videos regardless of auxiliary modality availability. Extensive experiments on datasets such as DID, SDSD, and SDE demonstrate that AMNet outperforms existing state-of-the-art methods, achieving PSNR scores of 31.57 dB on DID and 29.03 dB on SDSD in RGB-only settings, with further improvements when auxiliary modalities are present.
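One common way to prepare a model for missing modalities is to randomly drop auxiliary inputs during training so that every availability pattern is seen. Whether AMNet simulates modality absence in exactly this way is not stated in this summary, so the sketch below is an assumption that illustrates the general idea only. The helper names and the `drop_prob` parameter are hypothetical.

```python
import itertools
import random
from typing import List, Sequence, Tuple


def all_modality_subsets(modalities: Sequence[str]) -> List[Tuple[str, ...]]:
    """Enumerate every availability pattern, from RGB-only to all modalities."""
    subsets: List[Tuple[str, ...]] = []
    for r in range(len(modalities) + 1):
        subsets.extend(itertools.combinations(modalities, r))
    return subsets


def sample_available(
    modalities: Sequence[str], drop_prob: float, rng: random.Random
) -> Tuple[str, ...]:
    """Keep each auxiliary modality independently with probability 1 - drop_prob."""
    return tuple(m for m in modalities if rng.random() >= drop_prob)


aux_modalities = ("event", "infrared")
patterns = all_modality_subsets(aux_modalities)  # 2**2 = 4 patterns, () means RGB-only

rng = random.Random(0)  # seeded for reproducibility
for step in range(3):
    kept = sample_available(aux_modalities, drop_prob=0.5, rng=rng)
    # A training step would feed RGB plus only the modalities in `kept`.
```

Enumerating the patterns explicitly is also useful at evaluation time, since it lets each availability configuration be scored separately.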

Remarkably, even when auxiliary modalities are missing at inference, AMNet maintains high enhancement quality with minimal performance drop, showcasing its robustness and practical value. The approach not only advances the theoretical understanding of cross-modal correspondence learning but also offers a scalable, flexible solution for real-world low-light video enhancement tasks. Future work aims to optimize computational efficiency, incorporate additional modalities, and enhance real-time capabilities, promising broader impact across autonomous systems, surveillance, and consumer electronics.

Deep Dive

Abstract

Low-light video enhancement (LLVE) remains a challenging task due to severe information degradation under low-illumination conditions. Recent multimodal approaches have significantly improved enhancement performance by incorporating auxiliary modalities, such as event streams and infrared images. However, these methods typically assume the availability of these modalities at inference, which is often not feasible in real-world scenarios. To solve this problem, in this work, we propose AMNet, a unified multimodal framework for LLVE, to support flexible modality-agnostic inference, where auxiliary modalities may be unavailable. To address the issue of modality absence, we introduce a Spatial-Spectral Dual-Gated Translator that learns the correspondence between auxiliary modalities and RGB inputs, producing implicit auxiliary representations to support robust enhancement. Additionally, to fully facilitate the learning of cross-modal correspondence, we conduct large-scale multimodal pretraining based on the RGB-only dataset with synthetic auxiliary modalities. Extensive experiments demonstrate that AMNet can handle arbitrary inference-time modality combinations and exhibits superior performance for LLVE under modality absence conditions. Code and models are available on the project page.


