VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion
VideoMLA introduces a low-rank latent KV cache that cuts per-token KV memory by 92.7% for minute-scale autoregressive video diffusion while maintaining quality.
Key Findings
Methodology
This paper presents the first study of Multi-Head Latent Attention (MLA) in autoregressive video diffusion models. The core idea is to replace the dense per-head key-value (KV) matrices with a shared low-rank content latent vector (c_KV) and a shared, decoupled positional key (k_R) that carries 3D Rotary Position Embedding (3D-RoPE). Each video token is projected into a low-dimensional latent space by a down-projection matrix (W_KV↓), only this compact representation is cached, and per-head keys and values are reconstructed from it through up-projections (W_K↑, W_V↑). Positional information is encoded separately: k_R is shared across heads and rotated by 3D-RoPE, so position does not pass through the content latent. Training optimizes these projections so that the latent preserves content fidelity over long sequences, which reduces per-token KV memory by 92.7% at every cached layer without a reported loss in quality. The approach relies on the architectural bottleneck rather than on a spectral low-rank assumption, enabling efficient long-horizon video generation.
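The caching scheme described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: all sizes are hypothetical, the weights are random stand-ins for trained projections, the query-side projections (`W_q`, `W_q_rope`) are assumed for completeness, and the 3D-RoPE rotation and causal masking are omitted to keep the data flow visible.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes chosen for illustration only (not taken from the paper).
d_model, n_heads, d_head = 64, 4, 16
d_c, d_r = 8, 4  # content-latent width and shared positional-key width

# Random projections standing in for trained weights.
W_kv_down = rng.standard_normal((d_model, d_c)) / np.sqrt(d_model)
W_k_up = rng.standard_normal((d_c, n_heads * d_head)) / np.sqrt(d_c)
W_v_up = rng.standard_normal((d_c, n_heads * d_head)) / np.sqrt(d_c)
W_k_rope = rng.standard_normal((d_model, d_r)) / np.sqrt(d_model)
# Query-side projections are an assumption of this sketch.
W_q = rng.standard_normal((d_model, n_heads * d_head)) / np.sqrt(d_model)
W_q_rope = rng.standard_normal((d_model, n_heads * d_r)) / np.sqrt(d_model)


def cache_entries(h):
    """Compress hidden states (T, d_model) into the tensors that get cached."""
    c_kv = h @ W_kv_down  # (T, d_c) shared content latent
    k_r = h @ W_k_rope    # (T, d_r) shared positional key (rotation omitted here)
    return c_kv, k_r


def attend(h_q, c_kv, k_r):
    """Attention for queries (Tq, d_model) over the cached latents (no mask)."""
    Tq, T = h_q.shape[0], c_kv.shape[0]
    # Rebuild per-head keys and values from the shared latent.
    k_c = (c_kv @ W_k_up).reshape(T, n_heads, d_head)
    v = (c_kv @ W_v_up).reshape(T, n_heads, d_head)
    q_c = (h_q @ W_q).reshape(Tq, n_heads, d_head)
    q_r = (h_q @ W_q_rope).reshape(Tq, n_heads, d_r)
    # Content score plus positional score; k_r is shared by all heads.
    scores = np.einsum("qhd,khd->hqk", q_c, k_c)
    scores = scores + np.einsum("qhr,kr->hqk", q_r, k_r)
    scores = scores / np.sqrt(d_head + d_r)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    out = np.einsum("hqk,khd->qhd", weights, v)
    return out.reshape(Tq, n_heads * d_head)


h = rng.standard_normal((10, d_model))  # 10 cached tokens
c_kv, k_r = cache_entries(h)
out = attend(h[-2:], c_kv, k_r)         # the last 2 tokens act as queries

dense_floats = 2 * n_heads * d_head     # per-token dense keys plus values
latent_floats = d_c + d_r               # per-token latent cache
reduction = 1 - latent_floats / dense_floats
```

With these toy sizes the cache stores 12 numbers per token instead of 128. The saving comes from caching one latent and one positional key per token rather than a key and a value per head.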
Key Results
- On the VBench benchmark, VideoMLA achieved the highest overall scores among the evaluated methods, 0.859 at 30 seconds and 0.713 at 60 seconds, ahead of baselines with dense KV caches. It maintained visual fidelity, motion coherence, and scene consistency while cutting per-token KV memory by 92.7%, and it raised inference throughput by 1.23x on a single B200. Ablation studies indicate that the bottleneck dimension (dc) is the primary factor limiting performance and that training preserves the effective rank of the latent representations. The model also handled complex motion and scene variation with better stability and detail retention than the other compression-based methods compared.
- The low-rank latent KV cache does not rely on spectral low-rank structure in the pretrained attention weights: their 99%-energy effective rank is far above any practical latent dimension. Instead, the bottleneck dimension (dc) caps the rank, and training adapts the model within that cap. The architecture-induced bottleneck, rather than the pretrained spectrum, therefore determines the effective rank. That the model still generates high-quality, long-duration video with a reduced memory footprint supports the practical viability of the approach.
- Overall, VideoMLA achieves the best long-horizon overall score among the evaluated streaming video diffusion methods, with gains in visual quality, motion dynamics, and computational efficiency. It offers a scalable response to the KV memory bottleneck for minute-scale autoregressive video generation (the reported throughput figure is measured on a single B200), with potential applications in content creation, virtual reality, gaming, and multimedia.
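The rank argument in the results above can be illustrated with a small NumPy sketch. It computes a 99%-energy effective rank, here taken to mean the smallest number of singular values whose squared sum reaches 99% of the total (an assumed definition; the paper's exact one may differ). The matrices are random stand-ins with hypothetical sizes, not pretrained weights.

```python
import numpy as np


def effective_rank(matrix, energy=0.99):
    """Smallest count of singular values whose squared sum reaches `energy`."""
    s = np.linalg.svd(matrix, compute_uv=False)
    cumulative = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(cumulative, energy) + 1)


rng = np.random.default_rng(0)
d_model, d_out, d_c = 256, 256, 32  # hypothetical sizes

# A dense random projection stands in for a far-from-low-rank weight.
W_dense = rng.standard_normal((d_model, d_out))

# A randomly initialised bottleneck: down-projection then up-projection.
W_down = rng.standard_normal((d_model, d_c))
W_up = rng.standard_normal((d_c, d_out))
W_bottleneck = W_down @ W_up

rank_dense = effective_rank(W_dense)
rank_bottleneck = effective_rank(W_bottleneck)
```

In this toy setting the dense matrix has an effective rank well above `d_c`, while the composed map can never exceed `d_c` yet uses most of that budget even with random weights. This mirrors the claim that the bottleneck, not the spectrum, sets the rank.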
Significance
This research addresses the challenge of long-duration video generation by sharply reducing memory requirements through low-rank latent KV caching. Its significance lies in making minute-scale, high-quality video synthesis more practical, since the memory footprint of dense KV caches was previously a major obstacle. The work also revises the usual low-rank motivation for latent attention: in pretrained video attention, the architectural bottleneck, not the spectrum, sets the effective rank. This opens a path to deploying large video diffusion models in real-world settings such as entertainment, education, and virtual environments. Its rank analysis also gives future compression strategies an empirical basis for balancing efficiency and quality in long-horizon generative tasks.
Technical Contribution
The paper's main technical contributions are:
- Introducing a low-rank shared latent vector (c_KV) to replace dense per-head KV matrices in video diffusion models, achieving 92.7% memory reduction.
- Designing a decoupled 3D-RoPE positional encoding that separates spatial-temporal position information from content representations.
- Demonstrating that the effective rank is determined by the architectural bottleneck (dc), not the spectral properties of pretrained weights.
- Showing that training preserves the near-full rank utilization of the latent space, enabling high-fidelity long-horizon generation.
- Providing empirical evidence that the low-rank bottleneck suffices for maintaining visual quality and motion dynamics over minute-scale sequences, challenging traditional spectral low-rank assumptions.
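To make the decoupled 3D-RoPE contribution more concrete, the sketch below applies rotary embedding along three axes (time, height, width) to a positional vector. It is an assumption-laden illustration: the equal three-way channel split, the frequency base of 10000, and the function names `rope_1d` and `rope_3d` are choices made here, not details confirmed by the source.

```python
import numpy as np


def rope_1d(x, pos, base=10000.0):
    """Rotate consecutive channel pairs of x (..., d) by angles pos * frequency."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)                  # (d/2,)
    angles = np.asarray(pos, dtype=float)[..., None] * freqs   # (..., d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x, dtype=float)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out


def rope_3d(x, t, h, w):
    """Apply RoPE per axis over three equal channel groups (assumed split)."""
    d = x.shape[-1]
    assert d % 6 == 0, "this sketch needs three equal, even-sized channel groups"
    g = d // 3
    parts = [
        rope_1d(x[..., :g], t),
        rope_1d(x[..., g:2 * g], h),
        rope_1d(x[..., 2 * g:], w),
    ]
    return np.concatenate(parts, axis=-1)


rng = np.random.default_rng(0)
q = rng.standard_normal(12)
k_r = rng.standard_normal(12)  # one shared positional key per token, not per head

# The query/key offset is (4, 2, 2) in both cases; only absolute positions differ.
score_a = rope_3d(q, 5, 2, 3) @ rope_3d(k_r, 1, 0, 1)
score_b = rope_3d(q, 9, 7, 4) @ rope_3d(k_r, 5, 5, 2)
```

The two scores agree up to floating-point error because the rotation makes the dot product depend only on the relative offset along each axis. Keeping this rotation on a separate key leaves the content latent free of position.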
Novelty
This work is the first to incorporate MLA-style latent KV caching into autoregressive video diffusion, fundamentally altering the memory architecture from dense per-head storage to shared low-rank representations. Unlike prior methods that rely on spectral low-rank assumptions, this approach reveals that the bottleneck dimension (dc) constrains the effective rank, regardless of the spectral energy distribution. The insight that the architectural bottleneck, rather than spectral low-rankness, governs the model's capacity to compress and generate long videos is a key novelty, opening new perspectives on model design and efficiency.
Limitations
- The current implementation's effectiveness diminishes when the latent dimension (dc) is too small (e.g., 64), leading to loss of fine details and motion fidelity. Further research is needed to optimize the trade-off between compression and quality.
- While the approach significantly reduces memory, the computational cost during training and inference remains high, especially for higher resolutions or longer sequences. Hardware constraints may limit deployment in resource-limited environments.
- The method's robustness across diverse video genres, complex scenes, and multi-actor scenarios requires further validation. Potential failure in highly dynamic or cluttered scenes may limit its applicability without additional enhancements.
Future Work
Future directions include exploring adaptive latent dimension strategies to dynamically balance quality and efficiency, extending the approach to higher resolutions and longer sequences, and integrating multi-modal inputs for richer content generation. Additionally, developing more hardware-efficient training algorithms and investigating the theoretical underpinnings of the architectural bottleneck could further enhance the scalability and robustness of long-horizon video diffusion models.
AI Executive Summary
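Long-duration video generation has long been one of the core challenges in artificial intelligence. Conventional diffusion models have made notable advances in short-video and still-image generation, but in long continuous generation, storage and compute costs become the main bottleneck. In particular, storing a dense KV cache for every token makes memory demand grow linearly with video length, which severely limits practical use.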
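To address this, the paper proposes VideoMLA, a low-rank KV cache scheme based on Multi-Head Latent Attention (MLA). The core idea is to replace each head's dense KV with a shared low-rank latent vector (c_KV), which greatly reduces storage. Concretely, the design introduces a decoupled 3D rotary position embedding (3D-RoPE) as a shared positional key, uses a down-projection (W_KV↓) to map each video token into the latent space, and reconstructs each head's keys and values through up-projections (W_K↑, W_V↑). This lowers per-token KV storage from several thousand dimensions to about 224, a 92.7% reduction.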
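The two figures quoted above (about 224 dimensions and a 92.7% reduction) can be checked against each other with simple arithmetic. The sketch below first derives the dense width those figures imply, then tests one hypothetical dense layout. The 12 heads of width 128 are an assumption introduced here for illustration and are not stated in the source.

```python
def kv_reduction(dense_dims: int, latent_dims: int) -> float:
    """Fraction of per-token, per-layer KV storage saved by caching the latent."""
    return 1.0 - latent_dims / dense_dims


latent_dims = 224           # figure quoted in the summary
reported_reduction = 0.927  # figure quoted in the summary

# Dense width implied by the two reported numbers (roughly 3068).
implied_dense = latent_dims / (1.0 - reported_reduction)

# ASSUMPTION (not from the source): 12 heads of width 128, keys plus values.
assumed_dense = 2 * 12 * 128
assumed_reduction = kv_reduction(assumed_dense, latent_dims)
```

Under that assumed layout the dense cache is 3072 numbers per token and the reduction rounds to 92.7%, which is consistent with the "several thousand dimensions" wording. Other layouts could match equally well.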
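Experiments show that on VBench, VideoMLA attains the highest overall scores, 0.859 and 0.713, on the 30-second and 60-second long-video generation tasks respectively, ahead of conventional dense KV cache methods. While preserving visual quality and motion coherence, the model delivers a 1.23x gain in inference throughput and markedly lowers GPU memory use. Through ablation analysis, the authors find that performance is limited mainly by the capacity of the latent space (dc) rather than by the pretrained spectral structure, which confirms the key role of the latent bottleneck.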
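This design not only eases the memory bottleneck of long-duration video generation but also offers a technical path toward future large-scale, multimodal, multi-task video AI. Although detail loss remains a risk under extreme compression, VideoMLA overall shows strong potential and extends what video generation can do. Combined with more efficient latent encodings and dynamic adjustment mechanisms, it may in future enable seamless video generation at higher resolutions and longer durations for intelligent content creation.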
Deep Dive
Abstract
Long-rollout causal video diffusion has converged on a fixed-size sliding-window KV cache, with recent progress innovating within this layout by changing which tokens occupy the window or how their positions are encoded. The per-head KV layout itself, a dominant contributor to streaming memory and latency, has been mostly left unchanged. In this paper, we present the first study of Multi-Head Latent Attention (MLA) in video diffusion. VideoMLA replaces per-head keys and values with a shared low-rank content latent and a shared decoupled 3D-RoPE positional key, reducing per-token KV memory by 92.7% at every cached layer. We further investigate why MLA succeeds in video diffusion even though the spectral assumption often used to motivate it in language models does not hold: pretrained video attention is not low-rank, with 99%-energy effective rank far above any practical latent dimension. VideoMLA retains quality at compression ratios where direct spectral approximation would predict large reconstruction error. We show that the MLA bottleneck, rather than the pretrained spectrum, determines the effective rank: both spectral and random initialization occupy nearly the full rank budget from initialization, and training preserves this budget while adapting within it. On VBench, VideoMLA matches short-horizon streaming video diffusion baselines, achieves the best overall score at long horizons among evaluated methods, and improves throughput by 1.23x on a single B200.