VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion

TL;DR

VideoMLA introduces a low-rank latent KV cache that cuts per-token KV memory by 92.7% for minute-scale autoregressive video diffusion while retaining generation quality.

cs.CV 2026-05-29
Hidir Yesiltepe, Jiazhen Hu, Tuna Han Salih Meral, Adil Kaan Akan, Kaan Oktay, Hoda Eldardiry, Pinar Yanardag
video diffusion, long-horizon generation, low-rank representation, attention mechanism, model compression

Key Findings

Methodology

This paper integrates Multi-Head Latent Attention (MLA) into autoregressive video diffusion models, which the authors describe as the first such study. The core idea is to replace the dense per-head key-value (KV) matrices with two shared components: a low-rank content latent (c_KV) and a decoupled positional key (k_R) encoded with 3D Rotary Position Embedding (3D-RoPE). Each video token is projected into a low-dimensional latent space by a down-projection matrix (W_KV↓); only this compact latent is cached, and per-head keys and values are reconstructed from it through up-projections (W_K↑, W_V↑). Positional information is carried separately by k_R, which is shared across heads and rotated by 3D-RoPE, keeping position out of the compressed content path. Training adapts the model within the rank budget set by the latent dimension (dc), and the resulting cache needs 92.7% less per-token KV memory at every cached layer while retaining quality. The compression therefore comes from an architectural bottleneck rather than from spectral low-rank structure in the pretrained weights.
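The compress-then-reconstruct path described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy sizes, the random weights, the helper names (`matvec`, `rand_matrix`, `compress`, `reconstruct`), and the choice to down-project the token's hidden state (as in standard MLA) are all assumptions, and the positional key k_R is left out.

```python
import random

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def rand_matrix(rows, cols, rng):
    """Random Gaussian matrix, scaled by 1/sqrt(cols)."""
    return [[rng.gauss(0.0, 1.0 / cols ** 0.5) for _ in range(cols)]
            for _ in range(rows)]

# Toy sizes chosen only for illustration (not the paper's configuration).
D_MODEL, N_HEADS, D_HEAD, D_C = 16, 2, 4, 3

rng = random.Random(0)
W_kv_down = rand_matrix(D_C, D_MODEL, rng)                        # W_KV↓, shared
W_k_up = [rand_matrix(D_HEAD, D_C, rng) for _ in range(N_HEADS)]  # W_K↑, per head
W_v_up = [rand_matrix(D_HEAD, D_C, rng) for _ in range(N_HEADS)]  # W_V↑, per head

def compress(token):
    """Down-project a token's hidden state to the latent c_KV that is cached."""
    return matvec(W_kv_down, token)

def reconstruct(c_kv):
    """Up-project a cached latent back to per-head keys and values."""
    keys = [matvec(W, c_kv) for W in W_k_up]
    values = [matvec(W, c_kv) for W in W_v_up]
    return keys, values

token = [rng.gauss(0.0, 1.0) for _ in range(D_MODEL)]
c_kv = compress(token)
keys, values = reconstruct(c_kv)

dense_floats = 2 * N_HEADS * D_HEAD   # what a dense per-head cache would store
latent_floats = len(c_kv)             # what the latent cache stores instead
```

With these toy sizes the cache holds 3 numbers per token instead of 16; the saving grows with the number of heads because the latent is shared across all of them.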

Key Results

  • On the VBench benchmark, VideoMLA achieved the highest overall scores among the evaluated methods at long horizons: 0.859 at 30 seconds and 0.713 at 60 seconds, ahead of baselines that use dense KV caches. It retained visual fidelity, motion coherence, and scene consistency while using 92.7% less per-token KV memory, and it raised inference throughput by 1.23x. Ablation studies identified the bottleneck dimension (dc) as the primary factor limiting performance, with training preserving the effective rank of the latent representations. The model is also reported to handle complex motion and scene variation, with better stability and detail retention than other compression-based methods.
  • The experiments showed that the low-rank latent KV cache does not rely on spectral low-rank structure in the pretrained attention weights, which are far from low-rank when measured by spectral energy. Instead, the bottleneck dimension (dc) caps the rank, and training adapts the model within that cap, so the effective rank is set by the architecture rather than by the spectrum. The model's ability to generate high-quality, long-duration videos with a reduced memory footprint demonstrates the practical viability of the approach.
  • Overall, VideoMLA surpasses the existing streaming video diffusion methods it was compared against on long-horizon tasks in visual quality, motion dynamics, and computational efficiency. It offers a scalable answer to the KV memory bottleneck for minute-scale autoregressive video generation, with potential applications in content creation, virtual reality, gaming, and multimedia.
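The "effective rank" used in these results can be made concrete with a small helper. This is a sketch under an assumption: it takes "energy" to mean the sum of squared singular values, which is a common convention but is not spelled out in this summary. The function name and the two example spectra are hypothetical.

```python
def effective_rank(singular_values, energy=0.99):
    """Smallest r whose top-r squared singular values reach `energy` of the total."""
    squared = sorted((s * s for s in singular_values), reverse=True)
    total = sum(squared)
    if total == 0.0:
        return 0
    running = 0.0
    for r, e in enumerate(squared, start=1):
        running += e
        if running >= energy * total:
            return r
    return len(squared)

flat_spectrum = [1.0] * 10              # energy spread evenly over 10 directions
peaked_spectrum = [10.0] + [0.1] * 9    # almost all energy in one direction

flat_rank = effective_rank(flat_spectrum)
peaked_rank = effective_rank(peaked_spectrum)
```

The flat spectrum needs all 10 directions to reach 99% of the energy, while the peaked one needs only 1. Pretrained video attention is reported to behave like the flat case, which is why a direct spectral truncation to a small rank would be expected to lose a lot.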

Significance

This research addresses the challenge of long-duration video generation by sharply reducing memory requirements through low-rank latent KV caching. Its significance lies in making high-quality, minute-scale video synthesis more practical, since the memory footprint of dense KV caches grows with every cached token and layer. The work also revises a common assumption about low rank in pretrained attention: the architectural bottleneck, not the spectrum of the pretrained weights, is what matters. This opens avenues for deploying large video diffusion models in real-world applications, including entertainment, education, and virtual environments. It also gives future model compression work an empirical basis for balancing efficiency and quality in long-horizon generative tasks.

Technical Contribution

The paper's main technical contributions are:

  • Introducing a low-rank shared latent vector (c_KV) to replace dense per-head KV matrices in video diffusion models, achieving 92.7% memory reduction.
  • Designing a decoupled 3D-RoPE positional encoding that separates spatial-temporal position information from content representations.
  • Demonstrating that the effective rank is determined by the architectural bottleneck (dc), not the spectral properties of pretrained weights.
  • Showing that training preserves the near-full rank utilization of the latent space, enabling high-fidelity long-horizon generation.
  • Providing empirical evidence that the low-rank bottleneck suffices for maintaining visual quality and motion dynamics over minute-scale sequences, challenging traditional spectral low-rank assumptions.
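To illustrate what a 3D rotary encoding on a positional key does, here is a minimal sketch. It assumes the vector is split into three equal chunks, one each for time, height, and width, and that each chunk gets standard 1D RoPE with base 10000; the paper's actual split and frequencies are not given in this summary, and all names and values below are hypothetical.

```python
import math

def rope_1d(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles proportional to `pos`."""
    out = []
    half = len(vec) // 2
    for i in range(half):
        theta = pos * base ** (-i / half)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out

def rope_3d(vec, t, h, w):
    """Apply 1D RoPE to three equal chunks, one per axis (time, height, width)."""
    chunk = len(vec) // 3
    assert chunk * 3 == len(vec) and chunk % 2 == 0, "need three even-sized chunks"
    out = []
    for axis, pos in enumerate((t, h, w)):
        out += rope_1d(vec[axis * chunk:(axis + 1) * chunk], pos)
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Arbitrary example vectors: a positional query and a shared positional key.
q = [0.3, -1.2, 0.7, 0.5, -0.4, 1.1, 0.9, -0.6, 0.2, 0.8, -1.0, 0.1]
k_r = [1.0, 0.4, -0.3, 0.6, 0.5, -0.9, 0.2, 0.7, -0.8, 0.3, 0.1, -0.5]

# Same relative offset (2, 1, 3) at two different absolute positions.
score_a = dot(rope_3d(q, 5, 4, 6), rope_3d(k_r, 3, 3, 3))
score_b = dot(rope_3d(q, 12, 9, 10), rope_3d(k_r, 10, 8, 7))
```

The two scores agree up to floating-point error, because the rotated dot product depends only on the relative offset along each axis. Keeping this rotation on a separate key means the content latent never has to encode position.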

Novelty

This work is the first to incorporate MLA-style latent KV caching into autoregressive video diffusion, fundamentally altering the memory architecture from dense per-head storage to shared low-rank representations. Unlike prior methods that rely on spectral low-rank assumptions, this approach reveals that the bottleneck dimension (dc) constrains the effective rank, regardless of the spectral energy distribution. The insight that the architectural bottleneck, rather than spectral low-rankness, governs the model's capacity to compress and generate long videos is a key novelty, opening new perspectives on model design and efficiency.
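The claim that the bottleneck dimension caps the rank follows from a standard linear-algebra fact, stated here for the key path. The symbols d (model width) and d_h (per-head width) are introduced only for this illustration and do not appear in the summary.

```latex
W_{KV}^{\downarrow} \in \mathbb{R}^{d_c \times d}, \qquad
W_{K}^{\uparrow} \in \mathbb{R}^{d_h \times d_c}

\operatorname{rank}\!\left(W_{K}^{\uparrow} W_{KV}^{\downarrow}\right)
\;\le\; \min\!\left(\operatorname{rank} W_{K}^{\uparrow},\ \operatorname{rank} W_{KV}^{\downarrow}\right)
\;\le\; d_c
```

The same bound holds for the value path with W_V↑. The bound is independent of how the weights were initialized, which is consistent with the observation that the spectrum of the pretrained weights does not determine the effective rank.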

Limitations

  • The current implementation's effectiveness diminishes when the latent dimension (dc) is too small (e.g., 64), leading to loss of fine details and motion fidelity. Further research is needed to optimize the trade-off between compression and quality.
  • While the approach significantly reduces memory, the computational cost during training and inference remains high, especially for higher resolutions or longer sequences. Hardware constraints may limit deployment in resource-limited environments.
  • The method's robustness across diverse video genres, complex scenes, and multi-actor scenarios requires further validation. Potential failure in highly dynamic or cluttered scenes may limit its applicability without additional enhancements.

Future Work

Future directions include exploring adaptive latent dimension strategies to dynamically balance quality and efficiency, extending the approach to higher resolutions and longer sequences, and integrating multi-modal inputs for richer content generation. Additionally, developing more hardware-efficient training algorithms and investigating the theoretical underpinnings of the architectural bottleneck could further enhance the scalability and robustness of long-horizon video diffusion models.

AI Executive Summary

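Long-duration video generation has long been one of the core challenges in artificial intelligence. Conventional diffusion models have made notable progress on short videos and still images, but for long continuous generation, storage and compute costs become the main bottleneck. In particular, a dense KV cache stores entries for every token of every frame, so memory demand grows linearly with video length and severely limits practical use.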

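To address this, the paper proposes VideoMLA, a low-rank KV cache scheme based on Multi-Head Latent Attention (MLA). The core idea is to replace each head's dense KV with a shared low-rank latent vector (c_KV), which sharply reduces storage. Concretely, the design introduces a decoupled 3D rotary position embedding (3D-RoPE) as a shared positional key, maps each video token's dense KV into the latent space with a down-projection (W_KV↓), and reconstructs each head's keys and values with up-projections (W_K↑, W_V↑). This reduces per-token KV storage from several thousand dimensions to roughly 224, a 92.7% reduction.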
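The two figures quoted above (about 224 cached dimensions per token and a 92.7% reduction) can be checked against each other with a short calculation. The dense layout used here, 12 heads of 128 dimensions for both keys and values, is an assumption chosen because it reproduces the quoted percentage; this summary does not state the base model's head configuration.

```python
LATENT_DIMS = 224            # per-token cached size quoted in the summary
N_HEADS, D_HEAD = 12, 128    # ASSUMED dense layout, not given in the summary

dense_dims = 2 * N_HEADS * D_HEAD              # keys and values for every head
reduction = 1.0 - LATENT_DIMS / dense_dims     # fraction of storage saved
reduction_pct = round(100 * reduction, 1)
```

Under this assumption the dense cache stores 3072 values per token per layer, and replacing them with 224 gives a 92.7% reduction, matching the reported figure.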

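Experiments show that on the VBench benchmark VideoMLA reached the highest overall scores of 0.859 and 0.713 on the 30-second and 60-second long-video generation tasks, respectively, ahead of conventional dense-KV-cache methods. While preserving visual quality and motion coherence, the model achieved a 1.23x gain in inference throughput and a marked reduction in GPU memory use. Through ablation analysis, the authors found that performance is limited mainly by the capacity of the latent space (dc) rather than by the pretrained spectral structure, confirming the key role of the latent bottleneck.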

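This work not only eases the memory bottleneck of long-duration video generation but also suggests a technical path toward larger-scale, multimodal, multi-task video AI. Fine detail can still be lost under extreme compression, but overall VideoMLA shows strong potential and extends what video generation can do. Combined with more efficient latent encodings and dynamic adjustment mechanisms, future work may achieve seamless generation at higher resolutions and longer durations, opening new possibilities for AI-assisted content creation.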

Deep Dive

Abstract

Long-rollout causal video diffusion has converged on a fixed-size sliding-window KV cache, with recent progress innovating within this layout by changing which tokens occupy the window or how their positions are encoded. The per-head KV layout itself, a dominant contributor to streaming memory and latency, has been mostly left unchanged. In this paper, we present the first study of Multi-Head Latent Attention (MLA) in video diffusion. VideoMLA replaces per-head keys and values with a shared low-rank content latent and a shared decoupled 3D-RoPE positional key, reducing per-token KV memory by 92.7% at every cached layer. We further investigate why MLA succeeds in video diffusion even though the spectral assumption often used to motivate it in language models does not hold: pretrained video attention is not low-rank, with 99%-energy effective rank far above any practical latent dimension. VideoMLA retains quality at compression ratios where direct spectral approximation would predict large reconstruction error. We show that the MLA bottleneck, rather than the pretrained spectrum, determines the effective rank: both spectral and random initialization occupy nearly the full rank budget from initialization, and training preserves this budget while adapting within it. On VBench, VideoMLA matches short-horizon streaming video diffusion baselines, achieves the best overall score at long horizons among evaluated methods, and improves throughput by 1.23x on a single B200.
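The fixed-size sliding-window cache mentioned in the abstract can be sketched with a bounded queue. This is an illustration only: the class name, the per-token layout (one content latent plus one positional key), and the toy sizes are assumptions, and the paper's eviction and position-handling details are not described here.

```python
from collections import deque

class SlidingLatentCache:
    """Fixed-size window of per-token latents; oldest entries are evicted first."""

    def __init__(self, window):
        self.entries = deque(maxlen=window)

    def append(self, c_kv, k_r):
        # Store the content latent and the shared positional key for one token.
        self.entries.append((c_kv, k_r))

    def floats_stored(self):
        # Total number of cached values across the current window.
        return sum(len(c) + len(k) for c, k in self.entries)

cache = SlidingLatentCache(window=4)
for step in range(10):
    cache.append([float(step)] * 6, [0.0] * 2)

oldest_step = cache.entries[0][0][0]   # content latent of the oldest kept token
```

After ten appends only the last four tokens remain, so memory stays constant over the rollout; the latent layout shrinks the cost of each slot rather than the number of slots.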

cs.CV cs.AI
