YoCausal: How Far is Video Generation from World Model? A Causality Perspective

TL;DR

YoCausal is a two-level causality benchmark that uses temporally reversed real-world videos as natural counterfactuals, and it evaluates the causal understanding of 13 SOTA video diffusion models via the RSI and CCI metrics.

cs.CV 🔴 Advanced 2026-05-29 71 views

You-Zhe Xie Yu-Hsuan Li Jie-Ying Lee Kaipeng Zhang Yu-Lun Liu Zhixiang Wang


causal reasoning video generation cognitive science model evaluation diffusion models

Key Findings

Methodology

This study introduces YoCausal, a causality evaluation framework inspired by the Violation of Expectation (VoE) paradigm from cognitive science. It leverages zero-cost temporal reversal of real-world videos to generate natural counterfactuals, establishing an extensible evaluation protocol. The first level, the Reverse Surprise Index (RSI), quantifies arrow-of-time perception via denoising loss differences between forward and reversed videos. The second level, the Causality Cognition Index (CCI), employs a vision-language model (VLM) to automatically partition datasets into causal and non-causal subsets, disentangling genuine causal reasoning from temporal biases. This approach circumvents the limitations of synthetic data and scales to large real-world video collections. The framework assesses 13 state-of-the-art diffusion models, revealing that perceiving temporal direction does not equate to causal understanding, with significant gaps compared to human performance.
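As a rough illustration of the first level, the sketch below computes an RSI-style score from per-clip denoising losses. It assumes RSI is the fraction of forward/reversed pairs in which the reversed clip receives the higher loss, which is how this summary describes the metric; the function name and the loss values are hypothetical and not taken from the paper.

```python
def reverse_surprise_index(forward_losses, reversed_losses):
    """Fraction of clips whose reversed version has the higher denoising loss.

    Assumption: ties count as "not surprised" (strict comparison).
    """
    if len(forward_losses) != len(reversed_losses):
        raise ValueError("need one reversed loss per forward loss")
    if not forward_losses:
        raise ValueError("need at least one clip")
    surprised = sum(
        1 for fwd, rev in zip(forward_losses, reversed_losses) if rev > fwd
    )
    return surprised / len(forward_losses)


# Hypothetical per-clip losses, made up for illustration only.
forward = [0.21, 0.35, 0.18, 0.40]
reverse = [0.30, 0.33, 0.25, 0.47]

rsi = reverse_surprise_index(forward, reverse)
print(f"RSI = {rsi:.2f}")  # 3 of 4 reversed clips have higher loss -> 0.75
```

Under this reading, a model with no sense of temporal direction would be expected to score near 0.5, which is why 50% serves as the chance level.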

Key Results

In Level 1 RSI evaluation, several models (e.g., Wan2.2-A14B) surpass 50% accuracy, indicating some sensitivity to temporal arrow, but overall performance remains well below human upper bounds (human RSI ~87.3%).
In Level 2 CCI analysis, only a few models (e.g., Wan2.2-A14B, CogVideoX1.5-5B) show signs of causal cognition, with CCI scores around 0.45, still far from human levels (~0.78), highlighting the gap in deep causal understanding.
Scaling parameters and architectural improvements (e.g., from UNet to DiT) correlate positively with causal cognition metrics, suggesting that larger, more complex models tend to better grasp causal relationships.

Significance

This work pioneers a real-world, scalable causality benchmark for video diffusion models, moving beyond synthetic or controlled datasets. It provides a rigorous, quantitative measure of models’ causal perception and reasoning, revealing that current models mainly capture superficial temporal patterns rather than true causal understanding. The methodology bridges the gap between cognitive science and AI, offering a new lens to evaluate and improve models’ scene comprehension. The insights gained can guide future architecture design, training strategies, and the development of models capable of genuine causal reasoning, which is essential for autonomous decision-making, scene understanding, and AI alignment.

Technical Contribution

The paper introduces a novel dual-metric framework: RSI quantifies arrow-of-time perception via denoising loss differences, while CCI leverages a VLM to distinguish causal from non-causal videos automatically. The zero-cost temporal reversal technique allows large-scale, real-world data utilization without synthetic scene generation. The combination of these metrics disentangles temporal bias from causal reasoning, providing a nuanced evaluation of models’ causal cognition. Additionally, the study demonstrates that model scale and architecture evolution (e.g., from UNet to DiT) enhance causal understanding, establishing scaling laws in this higher-order reasoning domain. This framework sets a new standard for causality assessment in generative models.
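The way the two metrics combine can be sketched as follows. This is a minimal illustration that assumes CCI is the difference between RSI computed on the causal subset and RSI computed on the non-causal subset, as this summary describes it. The `is_causal` flag stands in for the VLM's label, and the record layout and loss values are hypothetical.

```python
def rsi(pairs):
    """Fraction of (forward_loss, reversed_loss) pairs with reversed > forward."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("subset is empty")
    return sum(1 for fwd, rev in pairs if rev > fwd) / len(pairs)


def causality_cognition_index(records):
    """RSI on the causal subset minus RSI on the non-causal subset."""
    causal = [(r["fwd"], r["rev"]) for r in records if r["is_causal"]]
    non_causal = [(r["fwd"], r["rev"]) for r in records if not r["is_causal"]]
    return rsi(causal) - rsi(non_causal)


# Hypothetical records: losses are invented, and is_causal mimics a VLM label.
records = [
    {"fwd": 0.20, "rev": 0.35, "is_causal": True},
    {"fwd": 0.25, "rev": 0.31, "is_causal": True},
    {"fwd": 0.30, "rev": 0.28, "is_causal": True},
    {"fwd": 0.22, "rev": 0.27, "is_causal": True},
    {"fwd": 0.40, "rev": 0.42, "is_causal": False},
    {"fwd": 0.33, "rev": 0.30, "is_causal": False},
    {"fwd": 0.29, "rev": 0.31, "is_causal": False},
    {"fwd": 0.36, "rev": 0.35, "is_causal": False},
]

cci = causality_cognition_index(records)
print(f"CCI = {cci:.2f}")  # 0.75 on causal clips minus 0.50 on non-causal clips
```

The subtraction is what removes generic temporal bias: a model that merely prefers forward playback everywhere scores similarly on both subsets, so its difference stays near zero.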

Novelty

This research is the first to adapt the VoE paradigm for evaluating causal understanding in real-world video diffusion models, utilizing natural, zero-cost temporal reversal instead of synthetic scenes. It introduces a two-level, scalable benchmark that disentangles arrow-of-time perception from genuine causal reasoning, addressing the limitations of prior physics-based or synthetic benchmarks. The integration of a VLM for dataset partitioning and the design of RSI and CCI metrics constitute significant innovations, enabling large-scale, real-world applicability. These contributions collectively push forward the frontier of causal cognition evaluation in generative AI, bridging cognitive science principles with modern deep learning techniques.

Limitations

In scenarios where forward and reversed videos are visually indistinguishable (e.g., Newton’s cradle), RSI cannot measure a model’s sensitivity to the arrow of time, which limits its universality.
The reliance on model weights for denoising loss computation restricts external evaluation to open-source models, hindering assessment of proprietary or closed models.
Current models show limited performance in complex, long-term causal chains, indicating the need for more sophisticated causal inference mechanisms and multi-modal integration in future work.
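The first limitation can be seen in an idealized toy case. The sketch below represents a clip as a list of per-frame values (a hypothetical stand-in for real frames) and checks whether reversing it changes anything. Real footage of a Newton's cradle is only approximately symmetric, so this is an illustration of the idea rather than a test the paper describes.

```python
def reverse_clip(frames):
    """Temporal reversal: same frames, opposite order."""
    return frames[::-1]


def is_time_symmetric(frames):
    """True when the reversed clip is identical to the forward clip."""
    return reverse_clip(frames) == frames


# Hypothetical 1-D "clips": each number stands in for one frame's content.
swinging = [0, 1, 2, 1, 0]   # out and back, like an idealized pendulum
falling = [0, 1, 2, 3, 4]    # one-way motion

print(is_time_symmetric(swinging))  # True
print(is_time_symmetric(falling))   # False
```

When the two orderings are identical, any deterministic loss assigns them the same value, so the forward/reversed comparison carries no information for that clip.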

Future Work

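Future work will explore fusing multimodal information (such as audio and text) to strengthen causal reasoning, and will extend the benchmark to more complex scenes and longer sequences. It will also combine causal reasoning with self-supervised training strategies to deepen models' causal understanding. The authors further plan to introduce interpretability mechanisms that reveal a model's internal causal reasoning paths, supporting more autonomous and trustworthy intelligent systems. In addition, the research will examine model behavior in long-horizon scenarios with multiple causal relations and improve algorithmic efficiency and generalization, providing technical support for causal understanding closer to that of humans.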

AI Executive Summary

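Building models with deep causal understanding has long been a core goal in the development of artificial intelligence. Video generation has made major advances in recent years, and diffusion-based models (e.g., Stable Diffusion, Denoising Diffusion Probabilistic Models) in particular produce realistic, temporally coherent video. Whether they truly understand the causal relations behind events remains an open question. Conventional evaluation methods rely mostly on synthetic data or controlled scenes, and so struggle to reflect a model's causal reasoning in complex real-world environments.

This paper proposes YoCausal, a two-level causal cognition evaluation framework based on the Violation of Expectation (VoE) paradigm from cognitive science. It temporally reverses real videos at zero cost to produce natural counterfactual samples, establishing an extensible evaluation protocol. This lets models be tested for sensitivity to the arrow of time and to causal relations across diverse real-world scenes. The first-level metric is the Reverse Surprise Index (RSI), which uses denoising loss to quantify a model's perception of temporal direction. The second-level metric is the Causality Cognition Index (CCI), which uses a vision-language model (VLM) to automatically partition the videos into causal and non-causal subsets, separating genuine causal understanding from statistical bias.

A systematic evaluation of 13 mainstream diffusion models shows that some models (e.g., Wan2.2-A14B) exceed the 50% chance level on RSI and so exhibit some perception of the arrow of time, but overall they remain far below human level (human RSI reaches up to 87.3%). On CCI, a few models show preliminary causal cognition, but the gap remains clear. This indicates that current models mainly capture statistical temporal biases and lack deep causal understanding. Growth in parameter count and architectural evolution (e.g., from UNet to DiT) both have a positive effect on causal cognition, which supports the view that greater model capacity helps causal reasoning.

The significance of this study is that it proposes the first causal cognition evaluation framework aimed at real-world scenes, giving researchers a principled tool for understanding and improving the causal reasoning of video generation models. It moves past the earlier reliance on synthetic or controlled data and greatly broadens the range of scenes that can be evaluated, offering a theoretical basis and a practical path toward autonomous intelligent systems with deep causal understanding. The method may help advance core AI capabilities such as causal reasoning, scene understanding, and autonomous decision-making, and has both academic value and practical application prospects.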

Deep Analysis

Background

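With the development of deep learning, video generation models (such as diffusion models) have achieved great success in areas like image synthesis and animation. Diffusion-based models in particular (e.g., Stable Diffusion, Denoising Diffusion Probabilistic Models) perform well at generating high-quality, temporally coherent video. However, their core capability remains at the level of statistical modeling; whether they truly understand causality and grasp the causal logic between events has not been systematically verified. Prior work has focused mostly on physical consistency (e.g., physics benchmarks such as Physion and PhysWorld), but these methods rely largely on synthetic or controlled experimental scenes and struggle to reflect causal reasoning in real, complex environments. The VoE paradigm from cognitive science offers an effective way to probe causal cognition: it judges whether an individual has causal understanding by observing how "surprised" the individual is by counterfactual events. Bringing this idea into the evaluation of video generation models is the core innovation of this study.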

Core Problem

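Current video generation models perform acceptably at temporal perception but fall clearly short in deeper causal understanding. Conventional evaluation methods rely mostly on synthetic data or controlled scenes and do not test causal reasoning in diverse real-world environments. There is still no unified, principled evaluation framework for whether a model can tell causal events from random ones, or whether it understands the causal chain linking events. In addition, existing metrics struggle to separate a model's perception of the arrow of time from its understanding of causality, which biases evaluation results. These problems limit the models' potential in autonomous scene understanding and decision-making, and they hinder deeper research into causal reasoning mechanisms.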

Innovation

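The core innovations of this study are:
• Zero-cost temporal reversal generates natural counterfactual video samples, moving beyond the limits of synthetic scenes.
• The Reverse Surprise Index (RSI) uses denoising loss to quantify a model's sensitivity to the arrow of time, providing a quantitative measure of temporal perception.
• The Causality Cognition Index (CCI) uses a vision-language model (VLM) to automatically partition the videos into causal and non-causal subsets, separating genuine causal understanding from statistical bias.
• A multi-domain dataset of real videos covers daily life, physical scenes, animal behavior, and more, broadening the benchmark's applicability.
Together, these innovations allow a more comprehensive and principled test of models' causal reasoning.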

Methodology

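• Data collection: sample multi-domain scenes from existing real-video collections (e.g., Moments in Time, Physics IQ, Kinetics) to ensure scene diversity.
• Reversed video generation: temporally reverse each video to produce a counterfactual sample at zero cost.
• Denoising loss computation: feed the forward and reversed videos into a pretrained diffusion model (e.g., DiT, Stable Diffusion) and compute the denoising loss for each sample as the measure of the model's arrow-of-time perception.
• RSI metric: over all forward/reversed pairs, the proportion in which the reversed video's denoising loss is higher than the forward video's, reflecting the model's sensitivity to temporal direction.
• Dataset partitioning: use a vision-language model (e.g., CLIP) to automatically detect causal relations in the videos and split them into a causal subset (Dc) and a non-causal subset (Dnc).
• CCI metric: compute RSI separately on the two subsets; the difference is the Causality Cognition Index, which measures the depth of the model's causal understanding.
• Evaluation: combine the two metrics to rank models and analyze performance, validating their causal reasoning ability.
• Experimental validation: systematically test 13 mainstream diffusion models and analyze how parameter count and architecture affect causal cognition.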
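The reversal and denoising-loss steps above can be sketched with a toy model. Everything here is an assumption made for illustration: `toy_denoiser` is a hypothetical stand-in for a pretrained video diffusion model, the loss is a plain mean squared error between true and predicted noise at a single noise level, and the clip is a tiny list of two-pixel frames. The paper's actual noise schedule and loss weighting are not specified in this summary.

```python
import random


def reverse_clip(frames):
    """Temporal reversal: same frames, opposite order."""
    return frames[::-1]


def toy_denoiser(noisy, sigma, drift=1.0):
    """Hypothetical stand-in for a video diffusion model.

    It believes every pixel grows by `drift` per frame, estimates each clean
    frame from the previous noisy frame, and returns the noise that estimate
    implies. Frame 0 has no temporal context, so it predicts zero noise.
    """
    predicted = [[0.0] * len(noisy[0])]
    for prev, cur in zip(noisy, noisy[1:]):
        predicted.append([(c - (p + drift)) / sigma for p, c in zip(prev, cur)])
    return predicted


def denoising_loss(denoiser, frames, sigma=0.5, n_samples=32, seed=0):
    """Mean squared error between sampled noise and the denoiser's prediction."""
    rng = random.Random(seed)  # fixed seed: both orderings see the same noise
    total, count = 0.0, 0
    for _ in range(n_samples):
        noise = [[rng.gauss(0.0, 1.0) for _ in frame] for frame in frames]
        noisy = [
            [x + sigma * e for x, e in zip(frame, eps)]
            for frame, eps in zip(frames, noise)
        ]
        predicted = denoiser(noisy, sigma)
        for eps, pred in zip(noise, predicted):
            for e, p in zip(eps, pred):
                total += (e - p) ** 2
                count += 1
    return total / count


# Hypothetical clip: 4 frames of 2 pixels, each pixel rising by 1 per frame.
clip = [[float(t), float(t) + 10.0] for t in range(4)]

loss_fwd = denoising_loss(toy_denoiser, clip)
loss_rev = denoising_loss(toy_denoiser, reverse_clip(clip))
print("reversed clip is more surprising:", loss_rev > loss_fwd)
```

Because the toy model's temporal prior matches the forward clip and contradicts the reversed one, the reversed ordering yields the larger loss; that comparison is the per-clip signal an RSI-style score aggregates.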

Experiments

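• Datasets: multi-domain real video data, including daily life, physical scenes, and animal behavior, to ensure scene diversity.
• Models: 13 mainstream video diffusion models (e.g., Wan2.2-A14B, CogVideoX1.5-5B, AnimateDiff-SDXL), covering different scales and architectures.
• Metrics: Level 1 RSI measures arrow-of-time perception; Level 2 CCI measures causal cognition.
• Procedure: for each model and each subset, draw multiple random samples and average the denoising loss to obtain the metric values.
• Evaluation criteria: compare performance across scenes and analyze how parameter count and architecture affect causal understanding.
• Human baseline: human raters made causal judgments on 1,200 videos, serving as the upper reference level.
• Statistical analysis: Bootstrap confidence intervals are computed for the metrics to check the statistical significance of model performance.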
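The statistical-analysis step can be illustrated with a percentile bootstrap over per-clip outcomes. This is a sketch under assumptions: the summary only says that Bootstrap confidence intervals are used, so the percentile variant, the number of resamples, and the 60-out-of-100 example data are all hypothetical choices.

```python
import random


def bootstrap_rsi_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for an RSI-style proportion.

    `outcomes` holds one 0/1 value per clip: 1 if the reversed clip had the
    higher denoising loss, 0 otherwise.
    """
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample clips with replacement and recompute the proportion each time.
    stats = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical outcomes: 60 of 100 clips were "surprising" when reversed.
outcomes = [1] * 60 + [0] * 40
point = sum(outcomes) / len(outcomes)
lower, upper = bootstrap_rsi_ci(outcomes)

print(f"RSI = {point:.2f}, 95% CI = [{lower:.2f}, {upper:.2f}]")
```

An interval that excludes 0.5 would suggest the model's arrow-of-time sensitivity differs from chance on that sample of clips.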

Results

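• Most models exceed the 50% chance level on RSI but remain far below humans overall (human RSI reaches up to 87.3%), showing that their sensitivity to the arrow of time is still insufficient.
• On CCI, a few models show some causal cognition (e.g., Wan2.2-A14B, with a CCI of 0.45), but most models remain below human level (CCI up to 0.78).
• Parameter count and architecture (such as the introduction of DiT) clearly improve causal understanding; larger models perform better.
• In multi-domain testing, models do relatively well on everyday scenes but still fall clearly short on complex causal relations.
• The combined ranking across both metrics shows that overall causal cognition still has substantial room to improve, especially for deeper causal relations.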

Applications

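• Immediate applications: the evaluation framework can be used when training generative models with stronger causal understanding, improving scene understanding in areas such as autonomous driving, robot navigation, and video content moderation.
• Long-term vision: combined with causal reasoning mechanisms, it could help autonomous agents learn causal relations on their own in complex environments, reaching higher levels of scene understanding and decision-making and making intelligent systems more trustworthy and autonomous.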

Limitations & Outlook

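• In some scenes (such as a Newton's cradle), the forward and reversed sequences are nearly identical, so RSI cannot measure a model's sensitivity to the arrow of time, which limits the benchmark's generality.
• Computing the denoising loss requires access to model weights, so closed-source models are hard to evaluate externally, which limits the method's generality and scalability.
• Current models remain limited on complex causal relations and long sequences; future work needs richer causal reasoning mechanisms and multimodal information fusion.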

Plain Language Accessible to non-experts

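Imagine you work in a factory that turns out all kinds of goods every day. One day you notice that a machine has suddenly started producing something different, and you want to know why. Events in a video work the same way: an action (such as knocking over a tower of blocks) triggers the changes that follow. Now suppose you have a special camera that records the order and cause of every event. Once you have filmed a clip, you can play it backwards and check whether the events still make causal sense. Scientists use a method called VoE, which checks whether you are surprised by the reversed video, to judge whether you understand the causal relations between events. The method proposed in this paper has AI models do something similar: it looks at how a model responds to a video's temporal direction to judge whether the model really understands how events cause one another. By having models analyze real-world videos, the study finds that although some models can sense the direction of time, they are still far from truly understanding causality. A factory worker knows why a machine broke down, whereas most AI systems still only imitate surface patterns without grasping the causal logic underneath. This research offers a new tool for helping AI understand the world the way people do, and it is a reminder that imitation alone is not enough for real intelligence.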

ELI14 Explained like you're 14

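Did you know? At school we learn lots of things, like why a basketball shot scores points or why the sky clears up. These are cause-and-effect relationships: one thing makes another thing happen. Scientists want computers to understand cause and effect too, but that has been hard. Just as you know which button makes your character jump or attack in a game, a computer needs to learn the rules. Researchers came up with a special approach: show the computer a video played backwards and see whether it finds it strange or confusing. It is like watching a movie in reverse; it feels odd because you know how the story's causes and effects fit together. The method is called VoE, and it checks whether you are surprised by the reversed video to tell whether you understand cause and effect. Scientists tested many AI models this way and found that although they can sense that time is flowing, they cannot yet truly understand how one event causes another. It is like knowing that blocks fall when you push them over, without knowing why they fall. In the future, this research could help computers get smarter and learn to understand cause and effect the way people do, making them better at things like self-driving cars, robots, and smart assistants!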

Abstract

As video diffusion models (VDMs) advance toward world models, a key question arises: do they truly understand causality, or merely overfit to statistical temporal patterns? Existing benchmarks mostly rely on synthetic data, limiting real-world generalization due to the sim-to-real gap. We present YoCausal, a two-level benchmark inspired by the Violation of Expectation (VoE) paradigm from cognitive science. By temporally reversing real-world videos at zero cost as natural counterfactual samples, YoCausal establishes an arbitrarily extensible evaluation protocol. Level 1 introduces the Reverse Surprise Index (RSI), quantifying arrow-of-time perception via denoising loss. Level 2 introduces the Causality Cognition Index (CCI), which leverages a VLM to stratify datasets into causal and non-causal subsets, disentangling genuine causal reasoning from temporal bias. Evaluation of 13 state-of-the-art VDMs reveals that perceiving the arrow of time does not imply understanding causality, and a significant gap persists relative to human-level causal cognition.


