CausalCine: Real-Time Autoregressive Generation for Multi-Shot Video Narratives

TL;DR

CausalCine achieves real-time multi-shot video generation using a causal autoregressive framework, significantly enhancing cross-shot coherence and interactivity.

cs.CV, 2026-05-13

Yihao Meng, Zichen Liu, Hao Ouyang, Qiuyu Wang, Ka Leong Cheng, Yue Yu, Hanlin Wang, Haobo Li, Jiapeng Zhu, Yanhong Zeng, Xing Zhu, Yujun Shen, Qifeng Chen, Huamin Qu

Keywords: causal autoregressive, multi-shot video generation, content-aware memory routing, real-time interaction, video narratives

Key Findings

Key Results

CausalCine significantly outperforms autoregressive baselines in shot-level quality, prompt alignment, identity preservation, and transition structure, approaching the visual quality of bidirectional models.
In a 100-prompt multi-shot benchmark, CausalCine excels in visual quality, prompt following, temporal consistency, long-range consistency, and shot structure.
Ablation studies confirm the effectiveness of multi-shot causal tuning and content-aware memory routing, significantly improving cross-shot coherence.

Significance

CausalCine addresses the limitations of existing autoregressive models in long-sequence generation, such as motion stagnation and semantic drift, significantly enhancing the interactivity and efficiency of video generation. Its causal generation framework allows users to dynamically add new prompts during generation, supporting real-time online directing, which holds substantial academic and industrial value.

Technical Contribution

CausalCine addresses the limitations of prior autoregressive models in two ways: content-aware memory routing maintains cross-shot coherence, and distilling the causal base model into a few-step generator enables real-time interactive generation. According to the authors, this improves generation efficiency and interactivity while approaching the visual quality of bidirectional models.

Novelty

CausalCine applies a causal autoregressive framework to multi-shot video generation. It achieves cross-shot coherence through content-aware memory routing and real-time interactive generation through few-step distillation, offering higher generation efficiency than existing bidirectional models.

Limitations

CausalCine may still face memory capacity limitations when handling extremely long sequences, affecting generation quality.
Handling complex scene transitions may require higher computational resources.
In some cases, the generated content may lack detail.

Future Work

Future research directions include optimizing memory routing mechanisms to handle longer sequences, exploring more efficient computational methods to support complex scene transitions, and enhancing the detail of generated content.

AI Executive Summary

In the field of video generation, existing autoregressive models often face issues of motion stagnation and semantic drift when dealing with long sequences. This is because these models are primarily trained for short-horizon continuation, treating long sequences as extensions of a single shot, leading to a decline in generation quality.

To address this issue, CausalCine introduces a causal autoregressive framework that transforms multi-shot video generation into an online directing process. This framework can generate causally across shot changes, accept dynamic prompts on the fly, and reuse context without regenerating previous shots.
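The online directing loop described above can be sketched roughly as follows. This is an illustrative outline only, not the authors' code: `direct_online`, `generate_chunk`, `prompt_events`, and `max_active` are hypothetical names, the generator is a stub, and the bounded context here is a plain recency window used as a placeholder for the routed memory.

```python
from collections import deque

def direct_online(generate_chunk, prompt_events, num_chunks, max_active=4):
    """Roll out chunks causally while accepting new prompts mid-generation.

    generate_chunk: callable(prompt, context) -> chunk (hypothetical interface)
    prompt_events:  dict mapping a chunk index to the prompt that starts there
    max_active:     size of the bounded active context (assumed, recency-based)
    """
    memory = deque(maxlen=max_active)  # earlier chunks are reused, never regenerated
    prompt = None
    video = []
    for i in range(num_chunks):
        # A new prompt takes effect from this chunk onward; otherwise keep the old one.
        prompt = prompt_events.get(i, prompt)
        chunk = generate_chunk(prompt, list(memory))
        video.append(chunk)
        memory.append(chunk)
    return video

def stub_generator(prompt, context):
    """Stand-in for a real model: records the prompt and the context size."""
    return (prompt, len(context))

video = direct_online(stub_generator, {0: "A", 3: "B"}, num_chunks=6, max_active=2)
```

The prompt switch at chunk 3 affects only chunks generated afterwards, which is the property that distinguishes causal rollout from regenerating the whole sequence.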

The core technology of CausalCine includes Content-Aware Memory Routing (CAMR), a mechanism that dynamically retrieves historical KV entries based on attention-based relevance scores rather than temporal proximity, maintaining cross-shot coherence under bounded active memory. Additionally, by distilling the causal base model into a few-step generator, real-time interactive generation is achieved.
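A minimal sketch of the routing idea, assuming chunk-level selection. It is not the authors' implementation: the function name `route_memory`, the mean-over-queries aggregation, and the `top_k` budget are assumptions made for illustration. The point it shows is that history is kept by attention relevance, so an old but relevant chunk can outrank more recent ones.

```python
import numpy as np

def route_memory(query, keys, values, chunk_ids, top_k):
    """Keep the top_k history chunks that receive the most attention mass.

    query:     (q, d) queries of the chunk being generated
    keys:      (n, d) cached keys of all history tokens
    values:    (n, d) cached values of all history tokens
    chunk_ids: (n,)   index of the history chunk each cached token belongs to
    """
    d = query.shape[-1]
    logits = query @ keys.T / np.sqrt(d)            # (q, n) attention logits
    logits -= logits.max(axis=-1, keepdims=True)    # numerically stable softmax
    attn = np.exp(logits)
    attn /= attn.sum(axis=-1, keepdims=True)
    token_mass = attn.mean(axis=0)                  # (n,) averaged over queries
    chunks = np.unique(chunk_ids)
    # Relevance of a chunk = total attention mass on its tokens (assumed scoring).
    scores = np.array([token_mass[chunk_ids == c].sum() for c in chunks])
    keep = chunks[np.argsort(-scores)[:top_k]]
    mask = np.isin(chunk_ids, keep)
    return keys[mask], values[mask], np.sort(keep)

# Toy history: 4 chunks of 2 tokens each; chunk 0 is the oldest, chunk 3 the newest.
keys = np.array([[5, 0, 0, 0], [5, 0, 0, 0],
                 [0, 5, 0, 0], [0, 5, 0, 0],
                 [0, 0, 5, 0], [0, 0, 5, 0],
                 [1, 0, 0, 5], [1, 0, 0, 5]], dtype=float)
values = np.arange(32, dtype=float).reshape(8, 4)
chunk_ids = np.repeat(np.arange(4), 2)
query = np.array([[5.0, 0.0, 0.0, 0.0]])

k_sel, v_sel, kept = route_memory(query, keys, values, chunk_ids, top_k=2)
```

In this toy case the oldest chunk is retained because it matches the query, whereas a pure recency window of two chunks would have discarded it.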

However, CausalCine may still face memory capacity limitations when handling extremely long sequences, affecting generation quality. Handling complex scene transitions may require higher computational resources. Future research directions include optimizing memory routing mechanisms to handle longer sequences, exploring more efficient computational methods to support complex scene transitions, and enhancing the detail of generated content.

Deep Analysis

Background

Video generation technology has made significant progress in recent years, particularly in terms of visual fidelity. However, existing bidirectional attention models are computationally expensive for long-sequence generation, limiting their interactivity. Autoregressive generation with KV caching offers a natural alternative for streaming video synthesis, but existing causal video models are still largely trained and evaluated as short-horizon continuation systems, leading to stagnation, looping, or semantic drift in long-sequence generation. Multi-shot video generation is not merely an extended single shot; it requires evolving events, viewpoint changes, discrete shot boundaries, and persistent story context.
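To make the KV caching argument concrete, here is a small single-head sketch in NumPy. It only illustrates the general mechanism (nothing here is taken from CausalCine itself, and all names are illustrative): attending over cached keys and values step by step gives the same outputs as recomputing full causal attention, while each step only processes the new token.

```python
import numpy as np

def attend(q, k, v):
    """Scaled dot-product attention of queries q over keys k and values v."""
    logits = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ v

def full_causal(q, k, v):
    """Reference: one pass over the whole sequence with a causal mask."""
    n, d = q.shape
    logits = q @ k.T / np.sqrt(d)
    allowed = np.tril(np.ones((n, n), dtype=bool))   # token t sees tokens <= t
    logits = np.where(allowed, logits, -np.inf)
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ v

def streaming(q, k, v):
    """Streaming: append each new key/value to a cache and attend over it."""
    d = q.shape[-1]
    cache_k = np.empty((0, d))
    cache_v = np.empty((0, d))
    outs = []
    for t in range(q.shape[0]):
        cache_k = np.vstack([cache_k, k[t:t + 1]])   # reuse all earlier entries
        cache_v = np.vstack([cache_v, v[t:t + 1]])
        outs.append(attend(q[t:t + 1], cache_k, cache_v)[0])
    return np.stack(outs)

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(6, 8)) for _ in range(3))
same = np.allclose(full_causal(q, k, v), streaming(q, k, v))
```

The cache grows with sequence length, which is why long rollouts need some policy for bounding the active memory.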

Core Problem

Existing autoregressive models often face issues of motion stagnation and semantic drift when dealing with long sequences. This is because these models are primarily trained for short-horizon continuation, treating long sequences as extensions of a single shot, leading to a decline in generation quality. Additionally, multi-shot video generation requires evolving events, viewpoint changes, discrete shot boundaries, and persistent story context, posing higher demands on existing models.

Innovation

CausalCine employs a causal autoregressive framework for multi-shot video generation, using Content-Aware Memory Routing (CAMR) to dynamically retrieve historical KV entries and maintain cross-shot coherence.
• First, a causal base model is trained on native multi-shot sequences to learn complex shot transitions.
• CAMR dynamically retrieves historical KV entries based on attention-based relevance scores rather than temporal proximity, maintaining cross-shot coherence under bounded active memory.
• The causal base model is then distilled into a few-step generator for real-time interactive generation.
• Distribution Matching Distillation (DMD) and an adversarial objective are used to distill the multi-step flow-matching teacher into a four-step autoregressive generator.

Methodology

• First, a causal base model is trained on native multi-shot sequences to learn complex shot transitions.
• Content-Aware Memory Routing (CAMR) dynamically retrieves historical KV entries based on attention-based relevance scores rather than temporal proximity, maintaining cross-shot coherence under bounded active memory.
• The causal base model is then distilled into a few-step generator for real-time interactive generation.
• Distribution Matching Distillation (DMD) and an adversarial objective are used to distill the multi-step flow-matching teacher into a four-step autoregressive generator.

Experiments

The experimental design includes chunk-wise teacher forcing training on 100k long multi-shot videos, with each chunk containing three latent frames. A 100-prompt multi-shot benchmark is constructed using Gemini 2.5 Pro. Evaluation metrics include visual quality, prompt following, temporal consistency, long-range consistency, and shot structure. Ablation studies confirm the effectiveness of multi-shot causal tuning and content-aware memory routing.
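One plausible way to realize chunk-wise causal training is a block-causal attention mask over latent frames, sketched below. The three-frames-per-chunk figure comes from the text; the assumption that frames attend bidirectionally within a chunk and causally across chunks is mine, as is the function name `block_causal_mask`.

```python
import numpy as np

def block_causal_mask(num_frames, frames_per_chunk=3):
    """Boolean mask where mask[i, j] is True if frame i may attend to frame j.

    Frames in the same chunk see each other; a frame also sees every frame
    in earlier chunks, but never a frame in a later chunk (assumed layout).
    """
    chunk = np.arange(num_frames) // frames_per_chunk   # chunk index per frame
    return chunk[:, None] >= chunk[None, :]

mask = block_causal_mask(6, frames_per_chunk=3)
```

Under teacher forcing, the earlier chunks that a frame attends to would be taken from the ground-truth video rather than from the model's own outputs.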

Results

Experimental results demonstrate that CausalCine significantly outperforms autoregressive baselines in shot-level quality, prompt alignment, identity preservation, and transition structure, approaching the visual quality of bidirectional models. In a 100-prompt multi-shot benchmark, CausalCine excels in visual quality, prompt following, temporal consistency, long-range consistency, and shot structure. Ablation studies confirm the effectiveness of multi-shot causal tuning and content-aware memory routing, significantly improving cross-shot coherence.

Applications

CausalCine targets film production, advertising creation, and game development. Because generation is causal, users can add new prompts while a video is being generated, which supports real-time online directing.

Limitations & Outlook

CausalCine may still face memory capacity limitations when handling extremely long sequences, affecting generation quality. Handling complex scene transitions may require higher computational resources. Future research directions include optimizing memory routing mechanisms to handle longer sequences, exploring more efficient computational methods to support complex scene transitions, and enhancing the detail of generated content.

Plain Language (accessible to non-experts)

Imagine you are in a kitchen cooking, and CausalCine is like a smart kitchen assistant. It not only remembers the dishes you've cooked before but can quickly adjust recipes based on your new requests. For instance, you're making a complex multi-step dish, with each step like a video shot. CausalCine can maintain consistency in each step without forgetting previous steps over time. It can also quickly adapt to your immediate requests, such as adding new ingredients or changing cooking methods. This is like being able to change the recipe at any time during cooking, and CausalCine can quickly adapt to these changes, maintaining the overall consistency and deliciousness of the dish.

ELI14 (explained like you're 14)

Hey there! Imagine you're playing a super cool game that lets you create your own movie. CausalCine is like your game assistant, helping you break the movie into different scenes, each with its own story. You can change scenes anytime, like switching from a forest to a city, and CausalCine helps keep the story consistent, just like a smart director's assistant. It also remembers previous scenes, so when you want to go back to an earlier story, it can quickly find and continue it. Isn't that awesome? It's like in a game where you can change your character's outfit anytime, and the game assistant helps keep the character's personality and style intact!

Glossary

Causal Autoregressive

A generative model that produces a sequence step by step, with each step conditioned only on previously generated content and never on future content.

Used in CausalCine to achieve multi-shot video generation.

Content-Aware Memory Routing

A mechanism that dynamically retrieves historical KV entries based on attention-based relevance scores rather than temporal proximity.

Used to maintain cross-shot coherence.

KV Caching

A technique that stores the attention keys and values computed for earlier tokens so they can be reused, rather than recomputed, at each generation step.

Used in CausalCine for real-time interactive generation.

Teacher Forcing

A training strategy that conditions each prediction on ground-truth previous outputs rather than on the model's own earlier predictions.

Used to train the causal base model.

Distribution Matching Distillation

A technique that compresses a pretrained teacher model into a few-step student model, maintaining generation quality.

Used to distill the causal base model into a few-step generator.
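For orientation, the core distribution-matching gradient as it is commonly written in the DMD literature is given below. This is the general formulation, stated here as background; the exact objective used in CausalCine, which also includes an adversarial term, may differ. The symbols are: G_θ is the few-step student generator, z is input noise, and s_real and s_fake are the score functions (gradients of the log-density) of the teacher's distribution and the student's current output distribution.

```latex
\nabla_\theta D_{\mathrm{KL}}\left(p_{\text{fake}} \,\|\, p_{\text{real}}\right)
  = \mathbb{E}_{z \sim \mathcal{N}(0, I),\; x = G_\theta(z)}
    \left[ -\left( s_{\text{real}}(x) - s_{\text{fake}}(x) \right)
           \frac{\partial G_\theta(z)}{\partial \theta} \right],
\qquad s(x) = \nabla_x \log p(x)
```

In practice both scores are estimated by diffusion models evaluated on noised versions of x, so the update pushes student samples toward regions the teacher considers likely and away from regions the student over-produces.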

Bidirectional Model

A generative model that generates data by considering both past and future context, with higher computational cost.

Compared with CausalCine.

Visual Fidelity

The visual quality and realism of generated videos, measuring the performance of generative models.

CausalCine approaches bidirectional models in visual fidelity.

Shot-Level Quality

The quality and consistency of each shot in the generated video, reflecting the model's attention to detail.

CausalCine outperforms autoregressive baselines in shot-level quality.

Prompt Alignment

The consistency of generated content with input prompts, measuring the model's responsiveness.

CausalCine excels in prompt alignment.

Identity Preservation

The consistency of character identity in generated videos, reflecting the model's memory capability.

CausalCine outperforms autoregressive baselines in identity preservation.

Open Questions (unanswered questions from this research)

1 Existing autoregressive models often face memory capacity limitations when handling extremely long sequences, affecting generation quality. Further optimization of memory routing mechanisms is needed to handle longer sequences.
2 Handling complex scene transitions may require higher computational resources. More efficient computational methods are needed to support complex scene transitions.
3 Generated content may lack fine detail in some cases, and improving it remains open.
4 Current models may experience response delays when handling dynamic prompt updates, requiring further optimization of generation efficiency.
5 In handling multi-character interactions, existing models may struggle to maintain consistency in character relationships, necessitating further research into character relationship modeling.

Applications

Immediate Applications

Film Production

CausalCine can be used in real-time online directing in film production, supporting dynamic prompt updates, enhancing production efficiency and creative freedom.

Advertising Creation

In advertising creation, CausalCine can help creators quickly generate multi-shot ad segments, enhancing the visual appeal and storytelling of advertisements.

Game Development

Game developers can use CausalCine to generate dynamic game scenes, allowing players to freely explore and interact within the game, enhancing the gaming experience.

Long-term Vision

Virtual Reality

CausalCine can be applied to the generation of virtual reality content, supporting users in freely exploring and interacting within virtual environments, providing an immersive experience.

Automated Video Editing

In the future, CausalCine is expected to achieve automated video editing, helping users quickly generate and edit video content, enhancing video creation efficiency.

Abstract

Autoregressive video generation aims at real-time, open-ended synthesis. Yet, cinematic storytelling is not merely the endless extension of a single scene; it requires progressing through evolving events, viewpoint shifts, and discrete shot boundaries. Existing autoregressive models often struggle in this setting. Trained primarily for short-horizon continuation, they treat long sequences as extended single shots, inevitably suffering from motion stagnation and semantic drift during long rollouts. To bridge this gap, we introduce CausalCine, an interactive autoregressive framework that transforms multi-shot video generation into an online directing process. CausalCine generates causally across shot changes, accepts dynamic prompts on the fly, and reuses context without regenerating previous shots. To achieve this, we first train a causal base model on native multi-shot sequences to learn complex shot transitions prior to acceleration. We then propose Content-Aware Memory Routing (CAMR), which dynamically retrieves historical KV entries according to attention-based relevance scores rather than temporal proximity, preserving cross-shot coherence under bounded active memory. Finally, we distill the causal base model into a few-step generator for real-time interactive generation. Extensive experiments demonstrate that CausalCine significantly outperforms autoregressive baselines and approaches the capability of bidirectional models while unlocking the streaming interactivity of causal generation. Demo available at https://yihao-meng.github.io/CausalCine/
