Demystifying Video Reasoning

TL;DR

Video models reason via a Chain-of-Steps (CoS) mechanism that unfolds across diffusion denoising steps rather than across video frames.

cs.CV 🔴 Advanced 2026-03-18
Ruisi Wang, Zhongang Cai, Fanyi Pu, Junxiang Xu, Wanqi Yin, Maijunxian Wang, Ran Ji, Chenyang Gu, Bo Li, Ziqi Huang, Hokin Deng, Dahua Lin, Ziwei Liu, Lei Yang
video generation diffusion models reasoning capabilities machine learning artificial intelligence

Key Findings

Methodology

This study employs qualitative analysis and targeted probing experiments to show that reasoning capabilities in video generation models emerge primarily during diffusion denoising steps, rather than across video frames. It proposes the Chain-of-Steps (CoS) mechanism, in which models explore multiple candidate solutions in early denoising steps and progressively converge to a final answer. It also identifies several key reasoning behaviors, including working memory, self-correction and enhancement, and a perception-before-action strategy.

Key Results

  • The study finds that video generation models explore multiple potential solutions in early denoising steps and progressively converge to a final answer in later steps, a process termed Chain-of-Steps (CoS). Noise perturbation analysis shows that disruptions at specific denoising steps significantly degrade performance, while frame-wise perturbations have a much weaker impact.
  • Experiments on the VBVR-Wan2.2 model demonstrate that reasoning capabilities are significantly enhanced by ensembling latent trajectories generated with different random seeds, improving the final score from 0.685 to 0.716.
  • Fine-grained analysis of the Diffusion Transformer reveals that early layers encode dense perceptual structures, middle layers execute reasoning, and later layers consolidate latent representations. This self-evolved functional specialization plays a critical role in the model's reasoning process.
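One way to picture the step-wise vs. frame-wise perturbation comparison is with a toy denoising loop: inject noise either at a single denoising step (touching all frames) or into a single frame (spread across steps), and measure how far the final latent drifts from the clean run. The denoiser stub, perturbation magnitudes, and latent shapes below are illustrative assumptions, not the paper's actual protocol:

```python
import numpy as np

def denoise_step(x, t):
    # Stand-in for one denoiser update; a real model would predict and
    # subtract noise here.
    return 0.9 * x

def run(steps=20, frames=8, perturb_step=None, perturb_frame=None, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((frames, 16))  # latent video: (frames, dim)
    for t in range(steps):
        x = denoise_step(x, t)
        if perturb_step == t:              # step-wise: hit every frame once
            x += 0.5 * rng.standard_normal(x.shape)
        if perturb_frame is not None:      # frame-wise: hit one frame each step
            x[perturb_frame] += (0.5 / steps) * rng.standard_normal(16)
    return x

clean = run()
step_hit = run(perturb_step=10)     # disrupt one specific denoising step
frame_hit = run(perturb_frame=3)    # disrupt one specific frame
print(np.linalg.norm(step_hit - clean), np.linalg.norm(frame_hit - clean))
```

Comparing the two deviation norms for matched noise budgets is the shape of the probe; the paper's finding is that the step-wise disruption is the damaging one.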

Significance

This study systematically uncovers the emergent reasoning mechanism in video generation models, challenging the traditional Chain-of-Frames hypothesis and proposing the Chain-of-Steps (CoS) mechanism. This finding provides a theoretical foundation for better exploiting the inherent reasoning dynamics of video models, potentially impacting academic research and industrial applications, especially in scenarios requiring complex reasoning capabilities, such as autonomous driving and intelligent surveillance.

Technical Contribution

The technical contribution of this study lies in revealing the emergent reasoning mechanism in video generation models and proposing the Chain-of-Steps (CoS) mechanism, challenging the traditional Chain-of-Frames hypothesis. Through fine-grained analysis of the Diffusion Transformer, the study uncovers self-evolved functional specialization during denoising steps. Additionally, a simple training-free strategy is proposed to enhance reasoning capabilities by ensembling latent trajectories, offering new engineering possibilities for video generation models.

Novelty

This study is the first to systematically uncover the emergent reasoning mechanism in video generation models, proposing the Chain-of-Steps (CoS) mechanism and challenging the traditional Chain-of-Frames hypothesis. Unlike existing research, it shows, through qualitative analysis and targeted probing experiments, that reasoning capabilities primarily emerge during diffusion denoising steps rather than across video frames.

Limitations

  • The study primarily bases its experiments on the VBVR-Wan2.2 model, which may limit the generalizability of the results to other model architectures and training datasets.
  • Although a training-free strategy is proposed to enhance reasoning capabilities, its effectiveness may depend on the model's initialization and the choice of random seeds.
  • The study mainly focuses on the reasoning capabilities of video generation models, without delving into other factors that may affect model performance, such as dataset diversity and complexity.

Future Work

Future research could further explore how different architectures and datasets affect the emergence of reasoning capabilities. The Chain-of-Steps (CoS) mechanism could also be applied to other types of generative models, such as text and image generation models. Further work on optimizing model initialization and random-seed selection to enhance reasoning is also warranted.

AI Executive Summary

Recent advances in video generation models have significantly transformed the landscape of the movie, gaming, and entertainment industries. However, most research has primarily focused on their ability to produce high-fidelity, realistic, and visually appealing videos. Recent studies have revealed an unexpected phenomenon: diffusion-based video models exhibit non-trivial reasoning capabilities in spatiotemporally consistent visual environments. Prior work attributes this behavior to a Chain-of-Frames (CoF) mechanism, suggesting that reasoning unfolds sequentially across video frames. Despite this intriguing discovery, the underlying mechanisms of video reasoning remain largely unexplored. With the recent release of large-scale video reasoning datasets and open-source foundation models, we now have the opportunity to systematically investigate this capability. Leveraging these resources, we conduct the first comprehensive dissection of video reasoning and uncover a fundamentally different mechanism: reasoning in diffusion-based video models primarily emerges along the denoising process rather than across frames.

Our key discovery challenges the prevailing Chain-of-Frames (CoF) hypothesis, which assumes that video reasoning unfolds sequentially across frames. Instead, we find that reasoning does not primarily operate along the temporal dimension. Rather, it emerges along the diffusion denoising steps, progressing throughout generation. We term this mechanism Chain-of-Steps (CoS). This finding suggests a fundamentally different view of how diffusion-based video models reason. Due to bidirectional attention over the entire sequence, reasoning is performed across all frames simultaneously at each denoising step, with intermediate hypotheses progressively refined as the process unfolds. Qualitative analysis reveals intriguing dynamics. In early denoising steps, the model often entertains multiple possibilities (populating alternative trajectories or superimposing candidate outcomes) before gradually converging to a final solution in later steps. Moreover, noise perturbation analysis shows that disruptions at specific denoising steps significantly degrade performance, whereas frame-wise perturbations have a much weaker impact. Further information propagation analysis identifies that the conclusion primarily solidifies during the middle diffusion steps.
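The information propagation analysis can be sketched with a toy trajectory probe: track the cosine similarity between each step's latent and the final latent, and mark the first step at which the two are nearly aligned. The contracting denoiser stub, the fixed "answer" state, and the 0.95 threshold are assumptions for illustration only:

```python
import numpy as np

STEPS, DIM = 20, 32

def denoise_step(x, t):
    # Stand-in denoiser that pulls the latent toward a fixed "answer" state,
    # more strongly at later steps (purely illustrative dynamics).
    target = np.ones_like(x)
    w = 0.3 * (t + 1) / STEPS
    return (1 - w) * x + w * target

rng = np.random.default_rng(0)
x = rng.standard_normal(DIM)
traj = []
for t in range(STEPS):
    x = denoise_step(x, t)
    traj.append(x.copy())

final = traj[-1]
sims = [float(z @ final / (np.linalg.norm(z) * np.linalg.norm(final)))
        for z in traj]
# First step whose latent already aligns with the final state (cos > 0.95):
solidify = next(i for i, s in enumerate(sims) if s > 0.95)
print(solidify)
```

In this toy the alignment threshold is crossed well before the last step; the paper's analysis locates the analogous solidification point in the middle diffusion steps.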

Furthermore, we uncover several surprising emergent behaviors in video reasoning models that are strikingly similar to those observed in early studies of Large Language Models (LLMs). First, these models exhibit a form of working memory, enabling persistent reference. Second, video models can self-correct and refine their outputs throughout generation. Third, they exhibit a perception-before-action strategy, where early steps establish semantic grounding and later steps perform structured manipulation.

We further conduct a fine-grained analysis of the Diffusion Transformer by examining token representations within a single diffusion step. This reveals self-evolved, task-agnostic functional specialization across the network's layers. Within a diffusion step, early layers focus on dense perceptual understanding (e.g., separating foreground from background and identifying basic geometric structures), while a set of critical middle layers performs the bulk of the reasoning. The final layers then consolidate the latent representation to produce the video state for the next step.
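A common way to localize where representations change most across layers, in the spirit of this analysis, is linear CKA (Kornblith et al., 2019, which appears in the references) between adjacent layers' token activations. The synthetic activations below, with middle layers deliberately made to rewrite their inputs, are a stand-in for real DiT features:

```python
import numpy as np

def linear_cka(X, Y):
    # Linear CKA (Kornblith et al., 2019) between two (tokens, dim) matrices.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

rng = np.random.default_rng(0)
# Synthetic per-layer token activations for a 12-layer network (toy stand-in).
layers = [rng.standard_normal((256, 64))]
for i in range(1, 12):
    keep = 0.2 if 4 <= i <= 7 else 0.9  # middle layers rewrite their inputs
    layers.append(keep * layers[-1] + (1 - keep) * rng.standard_normal((256, 64)))

# Representation "drift" between adjacent layers: high where reasoning-like
# rewriting happens, low where a layer mostly passes features through.
drift = [1 - linear_cka(layers[i], layers[i + 1]) for i in range(11)]
print(int(np.argmax(drift)))  # boundary with the largest representational change
```

On real models one would feed the same latent through the DiT and collect hidden states per layer; the drift curve then peaks at the layers doing the heaviest representational work.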

Motivated by these insights, we present a simple training-free strategy as a proof-of-concept, demonstrating how reasoning can be improved by ensembling latent trajectories from identical models with different random seeds. This approach encourages the model to retain a richer set of candidate reasoning trajectories during generation. As a result, the model explores more diverse reasoning paths and is more likely to converge to the correct solution, illustrating a way to utilize our findings to design more effective video reasoning systems.
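A minimal sketch of seed ensembling, under the assumption that per-seed latents are averaged once at a chosen merge step and then denoised onward (the exact fusion rule, merge step, and denoiser are placeholders, not the paper's specification):

```python
import numpy as np

STEPS, DIM, MERGE_AT = 20, 16, 10

def denoise_step(x):
    # Stand-in for the model's denoiser; a real DiT update would go here.
    return 0.9 * x + 0.1 * np.tanh(x)

def run_ensemble(seeds, merge_at=None):
    # One latent trajectory per random seed, from the same (stub) model.
    xs = [np.random.default_rng(s).standard_normal(DIM) for s in seeds]
    for t in range(STEPS):
        xs = [denoise_step(x) for x in xs]
        if t == merge_at:  # fuse the candidate trajectories once, then continue
            mean = np.mean(xs, axis=0)
            xs = [mean.copy() for _ in xs]
    return xs[0]

single = run_ensemble([0])                        # one seed, no fusion
fused = run_ensemble([0, 1, 2, 3], merge_at=MERGE_AT)
print(single.shape, fused.shape)
```

The design intuition from the paper carries over: fusing trajectories keeps multiple candidate reasoning paths alive through the early steps, so the merged latent is more likely to converge on the correct solution.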

Deep Dive

Abstract

Recent advances in video generation have revealed an unexpected phenomenon: diffusion-based video models exhibit non-trivial reasoning capabilities. Prior work attributes this to a Chain-of-Frames (CoF) mechanism, where reasoning is assumed to unfold sequentially across video frames. In this work, we challenge this assumption and uncover a fundamentally different mechanism. We show that reasoning in video models instead primarily emerges along the diffusion denoising steps. Through qualitative analysis and targeted probing experiments, we find that models explore multiple candidate solutions in early denoising steps and progressively converge to a final answer, a process we term Chain-of-Steps (CoS). Beyond this core mechanism, we identify several emergent reasoning behaviors critical to model performance: (1) working memory, enabling persistent reference; (2) self-correction and enhancement, allowing recovery from incorrect intermediate solutions; and (3) perception before action, where early steps establish semantic grounding and later steps perform structured manipulation. During a diffusion step, we further uncover self-evolved functional specialization within Diffusion Transformers, where early layers encode dense perceptual structure, middle layers execute reasoning, and later layers consolidate latent representations. Motivated by these insights, we present a simple training-free strategy as a proof-of-concept, demonstrating how reasoning can be improved by ensembling latent trajectories from identical models with different random seeds. Overall, our work provides a systematic understanding of how reasoning emerges in video generation models, offering a foundation to guide future research in better exploiting the inherent reasoning dynamics of video models as a new substrate for intelligence.

cs.CV cs.AI

References (20)

  • Simon Kornblith, Mohammad Norouzi, Honglak Lee et al. Similarity of Neural Network Representations Revisited (2019). 1,921 citations.
  • Jonathan Ho, Ajay Jain, P. Abbeel. Denoising Diffusion Probabilistic Models (2020). 28,306 citations.
  • Thaddaus Wiedemer, Yuxuan Li, Paul Vicol et al. Video models are zero-shot learners and reasoners (2025). 96 citations.
  • Maijunxian Wang, Ruisi Wang, Juyi Lin et al. A Very Big Video Reasoning Suite (2026). 2 citations.
  • Zhuoyi Yang, Jiayan Teng, Wendi Zheng et al. CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer (2024). 1,559 citations.
  • T. Behrens, Timothy H. Muller, James C. R. Whittington et al. What Is a Cognitive Map? Organizing Knowledge for Flexible Behavior (2018). 872 citations.
  • Shuang Zeng, Xinyuan Chang, Mengwei Xie et al. FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving (2025). 114 citations.
  • Shunyu Yao, Jeffrey Zhao, Dian Yu et al. ReAct: Synergizing Reasoning and Acting in Language Models (2022). 6,412 citations.
  • Ang Wang, Baole Ai, Bin Wen et al. Wan: Open and Advanced Large-Scale Video Generative Models (2025). 1,175 citations.
  • Sijie Zhao, Yong Zhang, Xiaodong Cun et al. CV-VAE: A Compatible Video VAE for Latent Generative Video Models (2024). 56 citations.
  • Jing Yu Koh, Daniel Fried, R. Salakhutdinov. Generating Images with Multimodal Language Models (2023). 352 citations.
  • Chengqi Duan, Rongyao Fang, Yuqing Wang et al. GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning (2025). 36 citations.
  • William S. Peebles, Saining Xie. Scalable Diffusion Models with Transformers (2022). 5,014 citations.
  • M. Mattar, M. Lengyel. Planning in the brain (2022). 92 citations.
  • Luozheng Qin, Jia Gong, Yuqing Sun et al. Uni-cot: Towards Unified Chain-of-Thought Reasoning Across Text and Vision (2025). 19 citations.
  • Xiangyu Fan, Zesong Qiu, Zhuguanyu Wu et al. Phased DMD: Few-step Distribution Matching Distillation via Score Matching within Subintervals (2025). 2 citations.
  • Weijia Shi, Xiaochuang Han, Chunting Zhou et al. LMFusion: Adapting Pretrained Language Models for Multimodal Generation (2024). 97 citations.
  • Weichen Fan, Chenyang Si, Junhao Song et al. Vchitect-2.0: Parallel Transformer for Scaling Up Video Diffusion Models (2025). 50 citations.
  • Xuming He, Zehao Fan, Hengjia Li et al. RULER-Bench: Probing Rule-based Reasoning Abilities of Next-level Video Generation Models for Vision Foundation Intelligence (2025). 2 citations.
  • Shengbang Tong, David Fan, Jiachen Zhu et al. MetaMorph: Multimodal Understanding and Generation via Instruction Tuning (2024). 159 citations.