WorldKV: Efficient World Memory with World Retrieval and Compression
WorldKV enables efficient world memory via KV cache retrieval and compression, roughly doubling throughput relative to full KV-cache attention while maintaining revisit fidelity.
Key Findings
Methodology
WorldKV is a training-free framework designed to enable efficient long-term memory in autoregressive video diffusion world models. It consists of two core components: World Retrieval and World Compression. World Retrieval stores evicted KV-cache chunks in GPU/CPU memory and selectively retrieves scene-relevant chunks based on camera pose and action correspondence, reinserting them into the native attention window without re-encoding. World Compression prunes redundant tokens within each 3-frame chunk by computing cosine similarity between keys of non-anchor frames and an anchor frame, effectively halving per-chunk storage and allowing twice the history under fixed memory constraints. This modular framework supports multiple retrieval strategies and requires no fine-tuning of the backbone model.
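The compression step described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the summary does not say how a non-anchor key is matched against anchor keys, so the sketch assumes a token's redundancy is its maximum cosine similarity to any anchor key. The names `cosine`, `compress_chunk`, and `keep_ratio` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def compress_chunk(anchor_keys, other_keys, keep_ratio=0.25):
    """Return sorted indices of the non-anchor tokens to keep.

    Assumption (not from the source): a token is scored by its maximum
    cosine similarity to any anchor key; the least similar tokens are kept.
    """
    redundancy = [max(cosine(k, a) for a in anchor_keys) for k in other_keys]
    n_keep = max(1, round(keep_ratio * len(other_keys)))
    # Ascending order: least redundant (most novel) tokens come first.
    order = sorted(range(len(other_keys)), key=lambda i: redundancy[i])
    return sorted(order[:n_keep])

# Toy 2-D keys: tokens 0 and 1 nearly duplicate the anchor, token 3 is novel.
anchor = [[1.0, 0.0], [0.0, 1.0]]
others = [[1.0, 0.01], [0.02, 1.0], [1.0, 1.0], [-1.0, 0.0]]
print(compress_chunk(anchor, others, keep_ratio=0.25))  # [3]
print(compress_chunk(anchor, others, keep_ratio=0.5))   # [2, 3]
```

In a real model the same selection would presumably be applied to the matching value vectors and repeated per transformer layer; the toy example only ranks keys.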
Key Results
- On Matrix-Game-2.0 and LingBot-World-Fast, WorldKV matches or surpasses full KV-cache attention in revisit fidelity while achieving roughly 2× its throughput. For instance, on LingBot-World-Fast, WorldKV attains 4.78 FPS, compared to 2.36 FPS for full KV attention and 5.05 FPS for the sliding-window baseline.
- WorldKV outperforms both sliding-window and full KV attention on Matrix-Game-2.0, a model trained only on short sequences, where full KV attention suffers from compounded errors due to out-of-distribution cached states. Selective retrieval avoids this degradation.
- Ablation studies reveal that retaining 25% of non-anchor tokens during compression balances memory savings and fidelity, enabling broader historical coverage that improves revisit consistency compared to uncompressed retrieval.
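The 25% retention figure lines up with the "halving" claim by simple arithmetic. The sketch below assumes the anchor frame is stored whole and that every frame in a chunk contributes the same number of tokens; the function name `chunk_storage_fraction` is hypothetical.

```python
def chunk_storage_fraction(frames_per_chunk=3, keep_ratio=0.25):
    """Fraction of a chunk's tokens that remain after compression.

    Assumes one anchor frame kept in full and equal token counts per frame.
    """
    non_anchor_frames = frames_per_chunk - 1
    return (1 + non_anchor_frames * keep_ratio) / frames_per_chunk

fraction = chunk_storage_fraction()
print(fraction)      # 0.5  (one full frame plus 2 x 25% out of 3 frames)
print(1 / fraction)  # 2.0  (chunks that fit in the space of one uncompressed chunk)
```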
Significance
This work addresses the critical challenge in autoregressive video diffusion models of balancing long-term memory consistency with real-time inference constraints. By leveraging the model's intrinsic KV cache as a latent world memory and managing it through training-free retrieval and compression, WorldKV circumvents the linear memory and computation growth bottleneck of full-history attention. This advances both academic understanding and practical deployment of interactive video world models, with implications for gaming, embodied AI, and robotic simulation.
Technical Contribution
WorldKV fundamentally reinterprets the KV cache as an emergent world memory and introduces a novel, training-free cache management framework combining camera/action-based retrieval and key similarity-based compression. Unlike prior methods relying on external memory modules or 3D scene reconstructions requiring additional training or inference overhead, WorldKV operates directly on the frozen backbone's KV cache, preserving or improving memory fidelity while doubling throughput. This enables new engineering possibilities for scalable, real-time world generation.
Novelty
WorldKV is the first to systematically exploit the intrinsic KV cache of autoregressive video diffusion models as a persistent world memory through training-free retrieval and compression. It departs from prior work that either discards long-term caches or requires dedicated memory training, offering a new paradigm for efficient long-horizon video generation without architectural changes or fine-tuning.
Limitations
- WorldKV’s visual fidelity is bounded by the underlying pretrained model and does not mitigate error accumulation over very long rollouts, limiting stability for multi-minute generation.
- While CPU offloading reduces GPU memory usage, host-device transfer latency currently prevents real-time multi-minute inference, posing a practical bottleneck.
- Camera/action-based retrieval may degrade in scenarios with imprecise or ambiguous action-to-viewpoint mappings, reducing retrieval accuracy and revisit consistency.
Future Work
Future directions include integrating WorldKV with training strategies to enhance stability and visual quality over extended rollouts, optimizing CPU offloading pipelines to minimize transfer latency for real-time multi-minute inference, and exploring more robust and adaptive retrieval signals to improve cache selection in complex dynamic scenes.
AI Executive Summary
Autoregressive video diffusion models have recently enabled real-time, action-conditioned world generation, opening new possibilities in gaming, embodied AI, and robotic simulation. However, sustaining a persistent world—where revisiting a previously seen viewpoint yields consistent content—remains a formidable challenge. Existing full KV-cache attention methods preserve consistency but suffer from linearly growing memory and computational costs, breaking real-time constraints. Conversely, sliding-window inference maintains throughput but sacrifices long-term consistency, causing content drift upon revisits. Addressing this fundamental trade-off, the authors propose WorldKV, a training-free framework comprising two complementary components: World Retrieval and World Compression. World Retrieval stores evicted KV-cache chunks in GPU/CPU memory and selectively retrieves scene-relevant chunks based on camera pose and action correspondence, reinserting them into the native attention window without re-encoding. World Compression prunes redundant tokens within each chunk by measuring key-key cosine similarity to an anchor frame, halving per-chunk storage and enabling twice the history under fixed memory budgets. Evaluated on two state-of-the-art autoregressive video world models—Matrix-Game-2.0 and LingBot-World-Fast—WorldKV matches or exceeds full KV-cache attention in revisit fidelity while roughly doubling inference throughput. Notably, on Matrix-Game-2.0, which was trained on short sequences, WorldKV outperforms full KV attention by avoiding compounding errors from out-of-distribution cached states. Ablation studies confirm that moderate compression balances memory savings and fidelity, and broader historical coverage improves revisit consistency. This approach circumvents the need for additional memory training or architectural modifications, offering a scalable and practical solution for long-horizon video world generation. 
Despite its advantages, WorldKV inherits limitations from the underlying pretrained models, including error accumulation over very long rollouts and latency challenges in CPU offloading. Future work aims to integrate training-based stability improvements, optimize hardware pipelines, and develop more robust retrieval mechanisms. Overall, WorldKV represents a significant advance in efficient, consistent world memory management, with broad implications for interactive video generation and virtual environment construction.
Deep Analysis
Background
Autoregressive video diffusion models have emerged as a powerful paradigm for real-time, action-conditioned video generation. By sequentially generating frames conditioned on previous frames and actions, these models enable interactive applications such as gaming, embodied AI agents, and robotic simulation. Representative works include LingBot-World, which scales to minute-level videos with full KV-cache attention, and Matrix-Game-2.0, which operates on shorter sequences with sliding-window inference. Despite advances in generation quality, maintaining long-term spatial and temporal consistency—critical for persistent, explorable worlds—remains challenging. Full-history KV-cache attention preserves consistency by attending to all past latent states but incurs linearly growing memory and computational costs, limiting real-time feasibility. Sliding-window inference bounds these costs but sacrifices access to long-term context, causing hallucination and drift. Prior solutions involve external memory banks, learnable compression, or explicit 3D scene representations, which require additional training or incur inference latency. This paper builds on the insight that the model's own KV cache functions as an emergent memory, proposing a training-free framework to efficiently manage this cache for long-term consistency.
Core Problem
The core problem addressed is how to enable persistent, consistent world memory in autoregressive video diffusion models without sacrificing real-time inference speed or exceeding memory constraints. Specifically, the challenges include: 1) Full KV-cache attention’s memory footprint and attention cost grow linearly with rollout length, leading to GPU VRAM exhaustion and degraded throughput; 2) Sliding-window inference discards historical KV caches, losing long-term memory and causing content drift upon revisits; 3) Existing memory-augmented architectures require dedicated training and architectural modifications, increasing complexity; 4) 3D scene reconstruction methods introduce inference latency incompatible with real-time applications. Efficiently managing the KV cache to selectively retrieve relevant historical context and compress redundant information is essential to balancing memory fidelity and computational efficiency.
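To make the trade-off concrete, the sketch below counts how many chunk-equivalents a step attends to under each regime. The window size, retrieval budget, and compressed chunk size are illustrative values chosen for the example, not figures from the paper, and the function names are hypothetical.

```python
def full_history_chunks(step):
    """Full KV-cache attention: every past chunk stays in the cache."""
    return step

def attended_chunks(step, window=6, retrieved=0, retrieved_size=1.0):
    """Chunk-equivalents attended at a 1-indexed rollout step.

    window: chunks kept in the sliding window (illustrative value).
    retrieved: evicted chunks reinserted per step (illustrative value).
    retrieved_size: relative size of a stored chunk (0.5 if halved).
    """
    in_window = min(step, window)
    evicted = step - in_window
    return in_window + min(retrieved, evicted) * retrieved_size

for step in (10, 100, 1000):
    print(
        step,
        full_history_chunks(step),                                 # grows linearly
        attended_chunks(step),                                     # bounded, no long-term memory
        attended_chunks(step, retrieved=4, retrieved_size=0.5),    # bounded, with retrieval
    )
```

Under these assumed settings the full-history cost keeps growing with the rollout, while the window plus a fixed retrieval budget stays constant once the window is full.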
Innovation
This work introduces several key innovations: 1) World Retrieval: a training-free mechanism that stores evicted KV-cache chunks indexed by camera pose and action state, and selectively retrieves the most relevant chunks based on similarity metrics, reinserting them into the attention window without re-encoding. This leverages the model’s intrinsic memory without architectural changes. 2) World Compression: a novel pruning strategy that uses key-key cosine similarity within each 3-frame chunk to identify and remove redundant tokens relative to an anchor frame, roughly halving per-chunk storage and enabling twice the historical coverage under fixed memory budgets. 3) A modular framework supporting multiple retrieval strategies (camera/action-based and query-based), demonstrating generality. 4) Empirical validation on two distinct world models with different training regimes, showing that WorldKV matches or surpasses full KV attention and memory-trained baselines in revisit fidelity while maintaining real-time throughput.
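The evict, store, and reinsert cycle behind World Retrieval can be illustrated with a small data structure. This is a sketch under assumptions, not the authors' code: `ChunkMemory`, its method names, and the choice to place retrieved chunks ahead of the recent window are all hypothetical, and chunks are stood in for by strings rather than KV tensors.

```python
from collections import deque

class ChunkMemory:
    """Sliding attention window plus a store of evicted chunks (hypothetical API)."""

    def __init__(self, window_size, top_k, score_fn):
        self.window = deque()      # (index_key, chunk) pairs currently attended
        self.store = []            # evicted (index_key, chunk) pairs
        self.window_size = window_size
        self.top_k = top_k
        self.score_fn = score_fn   # lower score means more relevant

    def append(self, index_key, chunk):
        """Add a new chunk; anything pushed out of the window is stored, not dropped."""
        self.window.append((index_key, chunk))
        while len(self.window) > self.window_size:
            self.store.append(self.window.popleft())

    def context(self, query_key):
        """Chunks to attend over: top-k stored chunks, then the recent window.

        The ordering of retrieved versus recent chunks is an assumption.
        """
        ranked = sorted(self.store, key=lambda item: self.score_fn(query_key, item[0]))
        retrieved = [chunk for _, chunk in ranked[: self.top_k]]
        recent = [chunk for _, chunk in self.window]
        return retrieved + recent

# Toy usage: scalar "poses" and absolute difference as the relevance score.
mem = ChunkMemory(window_size=2, top_k=1, score_fn=lambda q, k: abs(q - k))
for i in range(5):
    mem.append(float(i), f"c{i}")
print(mem.context(0.9))  # ['c1', 'c3', 'c4']
```

Passing the scoring function in as a parameter mirrors the modularity claim: a camera/action-based or query-based score could be swapped in without touching the storage logic.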
Methodology
- World Retrieval: During sliding-window inference, KV-cache chunks evicted from the active attention window are stored in GPU/CPU memory, each indexed by the camera pose or cumulative action state at generation time. At each generation step, given the current camera/action state, a similarity function combining normalized squared L2 translation distance and geodesic rotation distance ranks the stored chunks. The top-k most relevant chunks are retrieved and inserted into the attention window, preserving native attention computation without re-encoding.
- World Compression: For each 3-frame chunk, the first frame is designated as the anchor. Key vectors from non-anchor frames are compared via cosine similarity to anchor keys. Tokens with high similarity are pruned as redundant, while low-similarity tokens (representing newly revealed or dynamic regions) are retained. Compression is applied independently per transformer layer, resulting in compressed chunks approximately half the original size, enabling storage of roughly twice the number of chunks under fixed memory.
- The framework is agnostic to retrieval strategy; camera/action-based retrieval is used in experiments, but query-based importance scoring is also supported.
- Experiments involve evaluation on Matrix-Game-2.0 and LingBot-World-Fast, using metrics including LPIPS, PSNR, SSIM, FID, and throughput (FPS). Ablations explore compression ratios and retrieval budgets.
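The pose-based ranking in the World Retrieval step can be sketched as follows. The summary names the two ingredients (squared L2 translation distance and geodesic rotation distance) but not how they are normalized or combined, so the weighted sum, the `trans_scale` normalizer, and the division of the angle by π are assumptions, as are all function names. Poses are represented as a translation tuple plus a 3×3 rotation matrix.

```python
import math

def translation_dist_sq(t1, t2):
    """Squared L2 distance between two translation vectors."""
    return sum((a - b) ** 2 for a, b in zip(t1, t2))

def geodesic_angle(r1, r2):
    """Rotation angle in radians between two 3x3 rotation matrices."""
    # trace(r1^T r2) equals the sum of elementwise products.
    trace = sum(r1[i][j] * r2[i][j] for i in range(3) for j in range(3))
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp rounding error
    return math.acos(cos_theta)

def pose_distance(pose_a, pose_b, trans_scale=1.0, w_trans=1.0, w_rot=1.0):
    """Assumed combination: weighted sum of the two normalized distances."""
    (t_a, r_a), (t_b, r_b) = pose_a, pose_b
    d_t = translation_dist_sq(t_a, t_b) / (trans_scale ** 2)
    d_r = geodesic_angle(r_a, r_b) / math.pi  # maps the angle into [0, 1]
    return w_trans * d_t + w_rot * d_r

def top_k_chunks(current_pose, stored_poses, k):
    """Indices of the k stored chunks whose poses are closest to the current pose."""
    ranked = sorted(range(len(stored_poses)),
                    key=lambda i: pose_distance(current_pose, stored_poses[i]))
    return ranked[:k]

def yaw(theta):
    """Rotation about the z-axis, used only to build the toy example."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

identity = yaw(0.0)
stored = [((0.0, 0.0, 0.0), identity),          # same place, same heading
          ((5.0, 0.0, 0.0), identity),          # far away
          ((0.0, 0.0, 0.0), yaw(math.pi / 2))]  # same place, turned 90 degrees
current = ((0.1, 0.0, 0.0), identity)
print(top_k_chunks(current, stored, k=2))  # [0, 2]
```

For models driven by discrete actions rather than explicit poses, the stored index would presumably be a pose integrated from the cumulative action state; that mapping is outside this sketch.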
Experiments
Experiments are conducted on two autoregressive video world models: LingBot-World-Fast (14B parameters, distilled from a long-video teacher, natively using full KV attention) and Matrix-Game-2.0 (1.3B parameters, trained on short sequences, natively using sliding-window inference). The benchmark consists of 60 scene-trajectory pairs spanning diverse domains (indoor, outdoor, urban, natural), with manually designed long-horizon trajectories featuring repeated revisits and loop closures for direct evaluation of revisit consistency. Baselines include sliding-window and full KV attention, as well as the memory-trained models WorldPlay and Yume-1.5. Metrics include LPIPS, PSNR, SSIM, and FID for visual fidelity, and FPS for throughput. Ablation studies vary intra-chunk compression ratios and inter-chunk retrieval budgets to assess trade-offs between memory savings and revisit fidelity.
Results
WorldKV achieves approximately 4.78 FPS on LingBot-World-Fast, close to sliding-window’s 5.05 FPS and significantly higher than full KV attention’s 2.36 FPS, while matching or surpassing full KV in LPIPS, PSNR, and FID. On Matrix-Game-2.0, WorldKV outperforms both sliding-window and full KV attention across all metrics, avoiding the compounded errors seen in full KV due to short-sequence training. Ablations show that retaining 25% of non-anchor tokens balances compression and fidelity, and increasing retrieval budget to cover more chunks improves revisit consistency. These results demonstrate that WorldKV effectively balances memory fidelity and computational efficiency without additional training.
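The throughput claims can be checked directly from the reported LingBot-World-Fast numbers. The snippet below uses only the three FPS values quoted above; the dictionary keys are labels chosen for this example.

```python
# Reported LingBot-World-Fast throughput in frames per second.
fps = {"full_kv": 2.36, "sliding_window": 5.05, "worldkv": 4.78}

speedup_vs_full = fps["worldkv"] / fps["full_kv"]
fraction_of_sliding = fps["worldkv"] / fps["sliding_window"]

print(round(speedup_vs_full, 2))      # 2.03
print(round(fraction_of_sliding, 2))  # 0.95
```

So "roughly 2×" refers to the comparison against full KV attention, while WorldKV retains about 95% of the sliding-window baseline's speed.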
Applications
WorldKV is applicable to real-time interactive video generation scenarios such as gaming, where consistent world states enhance player immersion; robotic simulation, enabling agents to explore and interact with stable virtual environments; and virtual reality systems requiring persistent scene memory for user navigation. Its training-free, modular design lowers deployment barriers and supports diverse domains and modalities. By improving long-term memory fidelity and inference speed, WorldKV facilitates more responsive and realistic interactive experiences, advancing applications in digital twins, intelligent agents, and immersive media.
Limitations & Outlook
WorldKV’s performance is inherently limited by the quality of the underlying pretrained world model, and does not address error accumulation over very long rollouts, which can degrade visual fidelity. CPU offloading reduces GPU memory usage but introduces host-device transfer latency that currently impedes real-time multi-minute inference. Additionally, camera/action-based retrieval may be less effective in scenarios with ambiguous or imprecise mappings between actions and viewpoints, potentially reducing memory recall accuracy. Addressing these limitations requires integration with training-based stabilization methods, hardware pipeline optimizations, and more robust retrieval strategies.
Abstract
Autoregressive video diffusion models have enabled real-time, action-conditioned world generation. However, sustaining a persistent world, where revisiting a previously seen viewpoint yields consistent content, remains an open problem. Full KV-cache attention preserves this consistency but breaks real-time constraints: memory footprint and attention cost grow linearly with rollout length. Sliding window inference restores throughput but discards long-term consistency. We propose WorldKV, a training-free framework with two components: World Retrieval and World Compression. World Retrieval stores evicted KV-cache chunks in GPU/CPU memory and selectively retrieves scene-relevant chunks via camera/action correspondence, inserting them back into the native attention window without re-encoding. World Compression prunes redundant tokens within each chunk via key-key similarity to an anchor frame, halving per-chunk storage to fit 2x more history under a fixed budget. On Matrix-Game-2.0 and LingBot-World-Fast, WorldKV matches or exceeds full-KV memory fidelity at roughly 2x the throughput, and is competitive with memory-trained baselines without any fine-tuning. Project Page: https://cvlab-kaist.github.io/WorldKV/
References (20)
Advancing Open-source World Models
R. Gao, Qiuyu Wang, Yanhong Zeng et al.
RELIC: Interactive Video World Model with Long-Horizon Memory
Yicong Hong, Yiqun Mei, Chongjian Ge et al.
Yume-1.5: A Text-Controlled Interactive World Generation Model
Xiaofeng Mao, Zhen Li, Chuanhao Li et al.
Deep Forcing: Training-Free Long Video Generation with Deep Sink and Participative Compression
Jung Yi, Wooseok Jang, Paul Hyunbin Cho et al.
WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling
Wenqiang Sun, Haiyu Zhang, Haoyuan Wang et al.
Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion
Xun Huang, Zhengqi Li, Guande He et al.
Matrix-Game 2.0: An Open-Source, Real-Time, and Streaming Interactive World Model
Xianglong He, Chunli Peng, Zexiang Liu et al.
SnapKV: LLM Knows What You are Looking for Before Generation
Yuhong Li, Yingbing Huang, Bowen Yang et al.
Solaris: Building a Multiplayer Video World Model in Minecraft
George Savva, Oscar Michel, Daohan Lu et al.
Efficient Streaming Language Models with Attention Sinks
Guangxuan Xiao, Yuandong Tian, Beidi Chen et al.
LongLive: Real-time Interactive Long Video Generation
Shuai Yang, Wei Huang, Ruihang Chu et al.
Grounding World Simulation Models in a Real-World Metropolis
Junyoung Seo, Hyunwook Choi, Min-Joon Kwon et al.
Image quality assessment: from error visibility to structural similarity
Zhou Wang, A. Bovik, H. Sheikh et al.
SkyReels-V2: Infinite-length Film Generative Model
Guibin Chen, Dixuan Lin, Jiangping Yang et al.
H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
Zhenyu (Allen) Zhang, Ying Sheng, Tianyi Zhou et al.
VMem: Consistent Interactive Video Scene Generation with Surfel-Indexed View Memory
Runjia Li, Philip H. S. Torr, Andrea Vedaldi et al.
From Slow Bidirectional to Fast Autoregressive Video Diffusion Models
Tianwei Yin, Qiang Zhang, Richard Zhang et al.
Lost in the Middle: How Language Models Use Long Contexts
Nelson F. Liu, Kevin Lin, John Hewitt et al.
DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos
Shenyuan Gao, William Liang, Kaiyuan Zheng et al.
WORLDMEM: Long-term Consistent World Simulation with Memory
Zeqi Xiao, Yushi Lan, Yifan Zhou et al.