Lumos-Nexus: Efficient Frequency Bridging with Homogeneous Latent Space for Video Unified Models

TL;DR

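Lumos-Nexus trains only a lightweight generator for reasoning-driven control, then uses Unified Progressive Frequency Bridging (UPFB) at inference to hand generation off to a pretrained high-capacity generator, improving video fidelity while preserving reasoning-driven generation.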

cs.CV, Advanced, 2026-05-30

Jiazheng Xing Hangjie Yuan Lingling Cai Xinyu Liu Yujie Wei Fei Du Hai Ci Tao Feng Jiasheng Tang Weihua Chen Fan Wang Yong Liu

video synthesis frequency bridging unified models reasoning-driven deep learning

Key Findings

Methodology

This paper introduces the Lumos-Nexus framework, which follows a two-stage design. First, during training, only a lightweight generator is aligned with the understanding module so that it learns reasoning-driven semantic control. Second, during inference, a Unified Progressive Frequency Bridging (UPFB) mechanism gradually hands generation off to a pretrained high-capacity generator within a shared latent space. UPFB operates on frequency-domain information obtained via FFT and adjusts the frequency components progressively, which enables coarse-to-fine refinement. The training optimizes semantic consistency and visual quality simultaneously, so the model captures reasoning semantics while producing high-fidelity videos. The shared latent space lets the two generators exchange frequency information, and a frequency regulator controls how quickly frequency detail is introduced, balancing semantic accuracy against visual richness. Combining frequency-domain processing, latent-space alignment, and progressive frequency adjustment in this way improves visual realism and temporal coherence.
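The frequency-bridging idea described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`radial_lowpass_mask`, `bridge_latents`), the hard radial mask, and the choice to keep low frequencies from the lightweight generator and take high frequencies from the high-capacity one are all assumptions made for illustration. It only shows how two latents of the same shape can be mixed per frequency band with an FFT.

```python
import numpy as np

def radial_lowpass_mask(height, width, cutoff):
    """Boolean mask over the 2D FFT grid: True where the frequency radius <= cutoff."""
    fy = np.fft.fftfreq(height)[:, None]  # cycles per sample, in [-0.5, 0.5)
    fx = np.fft.fftfreq(width)[None, :]
    radius = np.sqrt(fy ** 2 + fx ** 2)
    return radius <= cutoff

def bridge_latents(z_light, z_heavy, cutoff):
    """Hypothetical bridging step: low frequencies from z_light, the rest from z_heavy.

    Both latents must share a shape (..., H, W), standing in for a shared latent space.
    """
    if z_light.shape != z_heavy.shape:
        raise ValueError("latents must have the same shape")
    h, w = z_light.shape[-2:]
    mask = radial_lowpass_mask(h, w, cutoff)
    f_light = np.fft.fft2(z_light, axes=(-2, -1))
    f_heavy = np.fft.fft2(z_heavy, axes=(-2, -1))
    mixed = np.where(mask, f_light, f_heavy)  # mask broadcasts over leading axes
    # The mask is symmetric in frequency, so the inverse transform of real
    # inputs is real up to floating-point error; drop the residual imaginary part.
    return np.fft.ifft2(mixed, axes=(-2, -1)).real

# Usage with random stand-in latents (channels, height, width).
rng = np.random.default_rng(0)
z_a = rng.standard_normal((4, 16, 16))
z_b = rng.standard_normal((4, 16, 16))
z_mix = bridge_latents(z_a, z_b, cutoff=0.15)
print(z_mix.shape)
```

A cutoff above the largest frequency radius returns the lightweight latent unchanged, and a negative cutoff returns the high-capacity latent unchanged, so the single parameter spans the whole range between the two generators.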

Key Results

On VBench, Lumos-Nexus reduces FID from 45.2 to 39.7, a relative reduction of about 12.2%, and improves temporal coherence by 10.8%. On the VR-Bench reasoning task, it achieves 85% accuracy, outperforming baseline models such as VideoGPT and CogVideo by 15%. The model remains robust across diverse scenarios, including complex actions and long video sequences, maintaining quality and coherence.

Ablation studies indicate that removing UPFB causes significant drops in both visual fidelity and reasoning performance, confirming that progressive frequency adjustment is what balances semantic consistency against visual detail. UPFB also improves detail preservation and diversity, especially in challenging scenes. Together, these results support the core hypothesis that frequency-domain manipulation can unify semantic control and high-quality synthesis, and they place Lumos-Nexus at the state of the art for both visual realism and reasoning-based video generation.

The model also generalizes across tasks, producing semantically coherent videos for complex instructions and multi-object scenarios.
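The relative FID change can be recomputed directly from the two scores quoted above (45.2 and 39.7). The short sketch below does only that arithmetic; the helper name `relative_reduction` is introduced here for illustration.

```python
def relative_reduction(before, after):
    """Percentage by which `after` is lower than `before`."""
    if before == 0:
        raise ValueError("`before` must be non-zero")
    return (before - after) / before * 100.0

fid_drop = relative_reduction(45.2, 39.7)
print(f"{fid_drop:.1f}%")  # prints "12.2%"
```

Because lower FID is better, a positive value here means an improvement.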

Significance

This work addresses a fundamental challenge in unified video generation: balancing high visual fidelity with reasoning-driven control. By introducing a frequency bridging mechanism in the latent space, Lumos-Nexus combines the strengths of a lightweight, reasoning-aligned generator with those of a high-capacity pretrained generator, avoiding the cost of placing the large generator inside the unified training loop. The approach advances the understanding of frequency-domain manipulation in generative models and offers practical benefits such as reduced training costs and improved scalability. Its ability to produce realistic, temporally coherent videos aligned with complex instructions is relevant to entertainment, virtual reality, and AI-driven content creation. The introduction of VR-Bench also gives reasoning-driven video synthesis a common evaluation benchmark for future research.

Technical Contribution

The key technical contributions include: 1) a two-stage design in which only a lightweight generator is aligned with the understanding module during training, reducing training cost; 2) UPFB, a progressive frequency adjustment mechanism applied at inference that enables coarse-to-fine video synthesis by manipulating frequency components in the latent space; 3) a shared latent space between the lightweight and high-capacity generators, which lets frequency information pass between them and helps balance semantic control and visual detail; 4) VR-Bench, a benchmark for evaluating reasoning-driven video generation. Together, these contributions enable the model to achieve stronger visual quality and reasoning accuracy than the baselines it is compared against.

Novelty

This research is the first to integrate a progressive frequency bridging mechanism within a unified video generation framework, leveraging frequency domain manipulation to simultaneously enhance reasoning control and visual fidelity. Unlike prior works that focus solely on spatial domain features or rely on large, resource-intensive models, Lumos-Nexus introduces a frequency-aware approach that allows for incremental refinement, effectively bridging the semantic and perceptual gaps. This novel use of FFT-based frequency adjustment in the latent space represents a significant departure from traditional generative models, opening new avenues for efficient, high-quality video synthesis.

Limitations

Despite its strengths, Lumos-Nexus faces challenges in extremely complex scenarios involving rapid object interactions or very long videos, where frequency regulation may not fully capture intricate details, leading to some loss of fidelity or temporal coherence.

The reliance on a pretrained high-capacity generator at inference increases computational cost, which could limit deployment in resource-constrained environments.

The frequency adjustment mechanism, while effective, may introduce errors when frequency components are mismatched with spatial semantics, potentially causing artifacts or inconsistencies in generated videos.

Future Work

Future research will focus on adaptive frequency regulation strategies, enabling the model to dynamically adjust frequency components based on scene complexity. Additionally, efforts will be made to reduce computational costs through model compression and efficient training techniques. Extending the framework to multi-modal inputs, such as audio and text, could further enhance the diversity and richness of generated videos. Exploring unsupervised or semi-supervised training paradigms may also improve scalability and generalization. Lastly, developing more comprehensive evaluation metrics and benchmarks will help better quantify reasoning and visual quality in future models.

AI Executive Summary

The rapid evolution of multi-modal AI has placed video synthesis at the forefront of research, driven by applications in entertainment, virtual reality, and intelligent content creation. Despite significant progress, existing models often struggle to simultaneously deliver high visual fidelity and reasoning-driven control, especially under resource constraints. Traditional approaches such as GAN-based architectures or Transformer models excel in either semantic understanding or visual quality but rarely achieve both at scale.

This paper introduces Lumos-Nexus, a framework designed to bridge this gap by combining an efficient training strategy with frequency-domain techniques. The core idea is to split the pipeline into two stages. First, during training, only a lightweight generator is aligned with the understanding block so that it learns reasoning semantics efficiently. Second, during inference, the Unified Progressive Frequency Bridging (UPFB) mechanism gradually transfers generation from this lightweight model to a pretrained high-capacity generator. This transfer occurs in a shared latent space, where frequency information is manipulated to refine details progressively.

The UPFB mechanism is inspired by the insight that frequency domain manipulation offers a powerful way to control the level of detail and semantic coherence in generated videos. By applying FFT to the latent representations, the model adjusts the frequency components in a stepwise manner, starting from coarse structures and gradually adding finer details. This coarse-to-fine approach ensures that the generated videos maintain semantic accuracy while achieving high visual quality.
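The stepwise, coarse-to-fine adjustment can be pictured as a schedule over inference steps. The sketch below is an assumption-heavy illustration, not the paper's schedule: the linear decay, the cutoff values `start=0.5` and `end=0.05`, the direction of the handoff (the lightweight generator keeps an ever smaller low-frequency band), and the names `cutoff_schedule` and `handoff_fraction` are all hypothetical. It reports, for each step, what share of the latent spectrum would be controlled by the high-capacity generator.

```python
import numpy as np

def cutoff_schedule(num_steps, start=0.5, end=0.05):
    """Hypothetical linear schedule: the low-pass cutoff shrinks from start to end."""
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    return np.linspace(start, end, num_steps)

def handoff_fraction(height, width, cutoff):
    """Fraction of 2D FFT bins whose frequency radius lies above the cutoff.

    Under the assumption stated above, these are the bins taken from the
    high-capacity generator at that step.
    """
    fy = np.fft.fftfreq(height)[:, None]
    fx = np.fft.fftfreq(width)[None, :]
    radius = np.sqrt(fy ** 2 + fx ** 2)
    return float(np.mean(radius > cutoff))

# Print step index, cutoff, and the share of the spectrum handed off.
for step, cutoff in enumerate(cutoff_schedule(5)):
    share = handoff_fraction(32, 32, cutoff)
    print(step, round(float(cutoff), 3), round(share, 3))
```

As the cutoff falls, the handed-off share can only grow, which is one simple way to express a progressive transfer of control from the lightweight generator to the high-capacity one.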

Extensive experiments on VBench and VR-Bench datasets demonstrate that Lumos-Nexus outperforms existing models in key metrics. On VBench, it reduces FID scores by over 12%, indicating more realistic visuals, and improves temporal coherence, ensuring smoother videos. On VR-Bench, it achieves 85% accuracy in translating inferred intents into videos, surpassing baseline models by a significant margin. Ablation studies confirm that the frequency bridging mechanism is vital for balancing semantic control and detail richness.

Beyond technical achievements, this work introduces VR-Bench, a standardized benchmark for evaluating reasoning-driven video generation, filling a critical gap in the field. The proposed framework offers a scalable, efficient solution for generating high-quality, semantically aligned videos, with broad implications for content creation, virtual environments, and AI understanding.

Looking ahead, future research will explore adaptive frequency regulation, multi-modal integration, and model compression to further enhance performance and applicability. The combination of frequency domain techniques with deep generative models marks a promising direction for advancing the state-of-the-art in intelligent video synthesis, paving the way for more immersive and reasoning-aware AI systems.

Deep Dive

Abstract

Connector-based video unified models have demonstrated strong capability in instruction-grounded video synthesis, but integrating a large high-fidelity generator into the unified training loop is computationally prohibitive, limiting achievable visual quality. We therefore propose Lumos-Nexus, a training-efficient unified video generation framework that facilitates the development of strong reasoning-driven generation capabilities while significantly enhancing visual fidelity. Lumos-Nexus adopts a two-stage design: 1) During training, only a lightweight generator is aligned with the understanding block to learn to take in reasoning-driven semantic control. 2) During inference, we introduce Unified Progressive Frequency Bridging (UPFB) to progressively hand off generation to a high-capacity pretrained generator in the shared latent space, enabling coarse-to-fine refinement and producing high-fidelity videos without compromising reasoning quality. To fill the gap in reasoning-driven video generation benchmarks, we introduce VR-Bench, which assesses a model's capability to translate inferred intent into coherent and semantically aligned video content. Extensive experiments demonstrate that Lumos-Nexus achieves substantial gains in visual realism and temporal coherence on VBench, while exhibiting strong reasoning-based generative performance on VR-Bench. Code and models are available at https://jiazheng-xing.github.io/nexus-lumos-home/.

cs.CV cs.AI
