OmniNFT: Modality-wise Omni Diffusion Reinforcement for Joint Audio-Video Generation

TL;DR

OmniNFT enhances audio-video generation quality and synchronization through a modality-aware online diffusion RL framework.

cs.CV 2026-05-13

Guohui Zhang, XiaoXiao Ma, Jie Huang, Hang Xu, Hu Yu, Siming Fu, Yuming Li, Zeyue Xue, Lin Song, Haoyang Huang, Nan Duan, Feng Zhao

audio-video generation, reinforcement learning, modality-aware, synchronization, diffusion model

Key Findings

Methodology

OmniNFT introduces a modality-aware online diffusion RL framework with three key innovations: modality-wise advantage routing, layer-wise gradient surgery, and region-wise loss reweighting. Modality-wise advantage routing allocates independent reward advantages to respective modality generation branches; layer-wise gradient surgery selectively detaches video-branch gradients on shallow audio layers; region-wise loss reweighting modulates policy optimization towards critical regions related to audio-video synchronization and fine-grained alignment.
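To make modality-wise advantage routing concrete, here is a minimal sketch. It assumes a GRPO-style group normalization of rewards, which this summary does not confirm as the paper's exact formulation; the helper names `group_advantages` and `route_advantages` and the reward values are hypothetical.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampling group: (r - mean) / (std + eps).

    Assumption: group-relative normalization, chosen for illustration only.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def route_advantages(video_rewards, audio_rewards):
    """Hypothetical helper: compute one advantage per reward and hand each
    to its own generation branch instead of merging them into one global value."""
    return {
        "video_branch": group_advantages(video_rewards),
        "audio_branch": group_advantages(audio_rewards),
    }


# One group of four samples generated from the same prompt (made-up scores).
video_rewards = [0.9, 0.2, 0.5, 0.4]
audio_rewards = [0.1, 0.8, 0.5, 0.6]

routed = route_advantages(video_rewards, audio_rewards)
for branch, advantages in routed.items():
    print(branch, [round(a, 2) for a in advantages])
```

In this toy group, sample 0 scores highest on video and lowest on audio, so its two branches receive advantages of opposite sign. A single summed advantage would blur that distinction.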

Key Results

Extensive experiments on JavisBench and VBench demonstrate that OmniNFT achieves comprehensive improvements in audio and video perceptual quality, cross-modal alignment, and audio-video synchronization. Specifically, visual quality improved from 2.038 to 3.326 (+63.2%), and audio quality from 5.197 to 5.715 (+10.0%).
Compared with LTX-2 and GDPO, OmniNFT shows superior cross-modal consistency and temporal synchronization. The synchronization metric DeSync (lower is better) fell from 0.569 to 0.269 (-52.7%), also well below GDPO's 0.412.
Ablation studies reveal that modality-wise advantage routing, layer-wise gradient surgery, and region-wise loss reweighting each have significant impacts on cross-modal consistency, audio fidelity, and synchronization.
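The percentage changes quoted in these results can be checked with a few lines of arithmetic. The sketch below uses only the before and after values stated in the text; the function name `relative_change` is an illustrative choice.

```python
def relative_change(before, after):
    """Percentage change from `before` to `after`."""
    return (after - before) / before * 100.0


# Before/after values as reported in the summary.
visual = relative_change(2.038, 3.326)
audio = relative_change(5.197, 5.715)
desync = relative_change(0.569, 0.269)

print(f"visual quality: {visual:+.1f}%")
print(f"audio quality:  {audio:+.1f}%")
print(f"DeSync:         {desync:+.1f}%")
```

Rounded to one decimal place, the three values agree with the reported +63.2%, +10.0%, and -52.7%.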

Significance

OmniNFT matters for audio-video generation because it targets long-standing pain points: modality consistency and synchronization. By extending reinforcement learning to multi-objective, multi-modal generation, it improves generation quality and offers a post-training approach that both research and production systems can build on.

Technical Contribution

OmniNFT's technical contribution is to address advantage inconsistency and gradient imbalance through modality-aware policy optimization. Compared with existing SOTA methods, it opens new engineering possibilities, particularly for optimizing complex objectives in multi-modal generation.

Novelty

OmniNFT is the first framework to extend reinforcement learning to multi-objective and multi-modal audio-video generation. Compared to the most related work, OmniNFT achieves finer reward allocation and gradient management through modality-wise advantage routing and gradient surgery.

Limitations

OmniNFT may encounter performance bottlenecks when handling very complex audio-video scenarios, as the model needs to process extensive modality interactions.
Due to computational complexity, OmniNFT's performance in real-time applications may be limited and require further optimization.
In some extreme cases, modality-wise advantage routing may not fully capture all cross-modal interaction details.

Future Work

Future research directions include optimizing OmniNFT's computational efficiency to support real-time applications, exploring more modality interaction mechanisms, and validating its performance in more complex audio-video scenarios.

AI Executive Summary

Deep Analysis

Background

Recent years have witnessed significant advancements in joint audio-video generation, particularly in modality fidelity and cross-modal consistency. However, existing generative models still struggle to simultaneously satisfy these multifaceted objectives. Reinforcement learning, as a powerful post-training paradigm, can optimize complex and highly subjective objectives, but its application in multi-objective and multi-modal generation remains under-explored.

Core Problem

The core problem in joint audio-video generation is achieving high modality fidelity, robust cross-modal semantic consistency, and fine-grained audio-video synchronization. These issues are not only important but also challenging due to the complexity of modality interactions and multi-objective optimization.

Innovation

OmniNFT addresses key challenges in audio-video generation through three core innovations. First, modality-wise advantage routing allocates independent reward advantages to respective modality generation branches, addressing advantage inconsistency. Second, layer-wise gradient surgery selectively detaches video-branch gradients on shallow audio layers, resolving gradient imbalance. Finally, region-wise loss reweighting modulates policy optimization towards critical regions related to audio-video synchronization and fine-grained alignment.
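The idea behind layer-wise gradient surgery can be sketched without an autodiff library. The example below represents each layer's gradient as a single number and drops the video-branch contribution on shallow audio layers, which mimics detaching it. The layer names, the gradient values, and the choice of which layers count as shallow are all assumptions made for illustration.

```python
def apply_gradient_surgery(audio_grads, video_grads, shallow_audio_layers):
    """Combine per-layer gradients from the two branches.

    On layers listed in `shallow_audio_layers` the video-branch gradient is
    dropped (the analogue of detaching it); every other layer keeps both.
    """
    combined = {}
    for layer, g_audio in audio_grads.items():
        g_video = video_grads.get(layer, 0.0)
        if layer in shallow_audio_layers:
            g_video = 0.0  # video gradient is not allowed to reach this layer
        combined[layer] = g_audio + g_video
    return combined


# Hypothetical audio-tower layers with scalar stand-ins for gradients.
audio_grads = {"audio.layer0": 0.25, "audio.layer1": 0.5, "audio.cross_modal": 0.125}
video_grads = {"audio.layer0": 0.5, "audio.layer1": 0.25, "audio.cross_modal": 0.5}
shallow_audio_layers = {"audio.layer0", "audio.layer1"}

combined = apply_gradient_surgery(audio_grads, video_grads, shallow_audio_layers)
print(combined)
```

The shallow layers end up with only their audio gradient, while the cross-modal interaction layer still receives the sum of both, matching the "detach on shallow layers, retain on interaction layers" description.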

Methodology

• Modality-wise advantage routing: allocates independent reward advantages to respective modality generation branches.
• Layer-wise gradient surgery: selectively detaches video-branch gradients on shallow audio layers.
• Region-wise loss reweighting: modulates policy optimization towards critical regions related to audio-video synchronization and fine-grained alignment.
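Region-wise loss reweighting, the third item above, can be illustrated as a weighted mean over per-region losses. This is a minimal sketch under stated assumptions: the summary does not say how critical regions are identified or how strongly they are weighted, so the boolean `critical_mask` and the `boost` factor are hypothetical.

```python
def reweighted_loss(region_losses, critical_mask, boost=2.0):
    """Weighted mean of per-region losses.

    Regions flagged in `critical_mask` get weight `boost`; the rest get 1.0.
    Weights are normalized so the result stays on the scale of a plain mean.
    """
    if len(region_losses) != len(critical_mask):
        raise ValueError("region_losses and critical_mask must have equal length")
    weights = [boost if is_critical else 1.0 for is_critical in critical_mask]
    total_weight = sum(weights)
    return sum(w * loss for w, loss in zip(weights, region_losses)) / total_weight


# Four regions (for example, time segments); the third is flagged as
# synchronization-critical. All values are made up.
region_losses = [0.25, 0.5, 1.0, 0.25]
critical_mask = [False, False, True, False]

uniform = reweighted_loss(region_losses, critical_mask, boost=1.0)
focused = reweighted_loss(region_losses, critical_mask, boost=3.0)
print(uniform, focused)
```

With `boost=1.0` the function reduces to a plain mean. A larger boost pulls the objective toward the flagged region, so errors there contribute more to the update.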

Experiments

The experiments evaluate on JavisBench and VBench, with LTX-2 serving as both the backbone and the baseline. Evaluation metrics cover visual quality, audio quality, cross-modal consistency, and audio-video synchronization. Ablation studies analyze the contribution of each component.

Results

Experimental results demonstrate that OmniNFT achieves comprehensive improvements in audio and video perceptual quality, cross-modal alignment, and audio-video synchronization. Visual quality improved from 2.038 to 3.326 (+63.2%), and audio quality from 5.197 to 5.715 (+10.0%). The synchronization metric DeSync (lower is better) fell from 0.569 to 0.269 (-52.7%).

Applications

OmniNFT holds broad potential in practical applications, including film production, virtual reality, and augmented reality. Its high modality fidelity and synchronization make it suitable for scenarios requiring high-quality audio-video generation.

Limitations & Outlook

Despite OmniNFT's outstanding performance in audio-video generation, it may encounter performance bottlenecks when handling very complex scenarios. Additionally, due to computational complexity, its performance in real-time applications may be limited. Future research will focus on optimizing computational efficiency and exploring more modality interaction mechanisms.

Plain Language (accessible to non-experts)

Imagine you are in a kitchen cooking multiple dishes at once. OmniNFT is like a smart chef who can cook several dishes simultaneously and ensure each dish complements the others perfectly. This chef uses a special technique to adjust the heat and seasoning according to each dish's needs. For example, when making soup, he ensures the flavor and color are just right, while for roasting meat, he ensures the meat is crispy on the outside and juicy on the inside. OmniNFT, like this chef, ensures each part of the audio-video generation process is optimized and perfectly combined through modality-wise advantage routing, layer-wise gradient surgery, and region-wise loss reweighting.

ELI14 (explained like you're 14)

Hey, imagine you're playing a super cool game that lets you make your own movie! OmniNFT is like a super smart assistant that helps you create both the visuals and sounds for the movie at the same time, making sure they sync perfectly. For example, when a character is talking, it ensures the mouth movements match the sound exactly. It's like a magician that makes all the elements come together perfectly, making your movie look super professional! Imagine making your own blockbuster, isn't that awesome?

Glossary

Modality-aware

Refers to a system's ability to recognize and process information from different modalities, such as audio and video.

Used in OmniNFT to optimize audio-video generation strategies.

Diffusion Model

A generative model that produces data by progressively denoising.

OmniNFT uses diffusion models for audio-video generation.

Reinforcement Learning

A machine learning method that guides model learning through rewards and penalties.

Used to optimize OmniNFT's generation quality.

Gradient Surgery

A technique that optimizes model training by selectively detaching gradients.

Used to address gradient imbalance in OmniNFT.

Loss Reweighting

A strategy that optimizes model training by adjusting the weights of the loss function.

Used to focus on critical regions for audio-video synchronization.

Modality-wise Advantage Routing

A technique that allocates reward advantages to respective modality generation branches to optimize the model.

Used to address advantage inconsistency in OmniNFT.

Audio-Video Synchronization

Refers to the temporal consistency between audio and video.

A key optimization target for OmniNFT.

Cross-modal Alignment

Refers to the semantic consistency between different modalities.

An important optimization target for OmniNFT.

Ablation Study

An experimental method that analyzes the impact of removing certain components on the overall system.

Used to analyze the contribution of each component in OmniNFT.

JavisBench

A benchmark used to evaluate audio-video generation quality.

OmniNFT is tested on this benchmark.

Open Questions (unanswered questions from this research)

1. How can OmniNFT's computational efficiency be improved enough to support real-time applications? The current method may hit performance bottlenecks in complex scenarios, so new optimization strategies need to be explored.
2. How does OmniNFT perform in more complex audio-video scenarios? More experiments are needed to validate it across different scenarios.
3. How can modality-wise advantage routing be made robust in extreme cases? The current method may not fully capture all cross-modal interaction details.
4. What further improvements can OmniNFT bring to long-standing pain points in multi-modal generation? New modality interaction mechanisms need to be explored.
5. How can OmniNFT's cross-modal consistency be further enhanced? The current method may not fully achieve semantic consistency in certain cases.

Applications

Immediate Applications

Film Production

OmniNFT can be used for audio-video generation in film production, ensuring high-quality modality fidelity and synchronization.

Virtual Reality

In virtual reality, OmniNFT can be used to generate realistic audio-video content, enhancing user experience.

Augmented Reality

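OmniNFT can be used for audio-video generation in augmented reality applications, where generated audio and video must stay consistent with each other. Real-time use would first require the efficiency improvements noted under Limitations.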

Long-term Vision

Intelligent Media Generation

OmniNFT can be used to develop intelligent media generation systems, enabling automated high-quality audio-video content creation.

Multi-modal Interaction Systems

OmniNFT can be used to develop multi-modal interaction systems, enhancing the naturalness and fluidity of human-computer interaction.

Abstract

Recent advances in joint audio-video generation have been remarkable, yet real-world applications demand strong per-modality fidelity, cross-modal alignment, and fine-grained synchronization. Reinforcement Learning (RL) offers a promising paradigm, but its extension to multi-objective and multi-modal joint audio-video generation remains unexplored. Notably, our in-depth analysis reveals that the primary obstacles to applying RL in this setting stem from: (i) multi-objective advantage inconsistency, where the advantages of multimodal outputs are not always consistent within a group; (ii) multi-modal gradient imbalance, where video-branch gradients leak into shallow audio layers responsible for intra-modal generation; (iii) uniform credit assignment, where fine-grained cross-modal alignment regions are not explored efficiently. These shortcomings suggest that a vanilla RL fine-tuning strategy with a single global advantage often leads to suboptimal results. To address these challenges, we propose OmniNFT, a novel modality-aware online diffusion RL framework with three key innovations: (1) Modality-wise advantage routing, which routes independent per-reward advantages to their respective modality generation branches. (2) Layer-wise gradient surgery, which selectively detaches video-branch gradients on shallow audio layers while retaining those for cross-modal interaction layers. (3) Region-wise loss reweighting, which modulates policy optimization toward critical regions related to audio-video synchronization and fine-grained alignment. Extensive experiments on JavisBench and VBench with the LTX-2 backbone demonstrate that OmniNFT achieves comprehensive improvements in audio and video perceptual quality, cross-modal alignment, and audio-video synchronization.
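Obstacle (i) in the abstract, advantage inconsistency within a group, can be made tangible with a small diagnostic. The sketch below flags samples whose video reward sits above the group mean while the audio reward sits below it, or the reverse. The function name `inconsistent_samples` and the reward values are hypothetical, and comparing against the group mean is an assumed stand-in for whatever advantage estimate the paper uses.

```python
from statistics import mean


def inconsistent_samples(video_rewards, audio_rewards):
    """Indices where one modality is above its group mean and the other below."""
    v_mean = mean(video_rewards)
    a_mean = mean(audio_rewards)
    return [
        i
        for i, (v, a) in enumerate(zip(video_rewards, audio_rewards))
        if (v - v_mean) * (a - a_mean) < 0
    ]


# One group of four samples from the same prompt (made-up scores).
video_rewards = [0.9, 0.2, 0.6, 0.3]
audio_rewards = [0.1, 0.8, 0.7, 0.2]

conflicts = inconsistent_samples(video_rewards, audio_rewards)
print(conflicts)
```

For the samples it flags, a single global advantage would push both modalities in the same direction even though only one of them deserved the credit.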

cs.CV cs.AI
