Next Forcing: Causal World Modeling with Multi-Chunk Prediction

TL;DR

Next Forcing introduces multi-chunk prediction to accelerate training and improve accuracy in high-frame-rate video generation, achieving 94.1% success on RoboTwin.

cs.CV Advanced 2026-06-10

Gangwei Xu Qihang Zhang Jiaming Zhou Xing Zhu Yujun Shen Xin Yang Yinghao Xu

video generation, causal world modeling, multi-chunk prediction, training acceleration, physical consistency

Key Findings

Methodology

This paper proposes Next Forcing, a framework built on Multi-Chunk Prediction (MCP): the main autoregressive model is augmented with lightweight auxiliary modules that denoise multiple future video chunks simultaneously. The MCP modules form a causal chain across prediction depths, so near-future predictions inform farther-future ones, and they take as input features fused from several layers of the main model. During training, the MCP modules are shifted to higher timesteps (higher noise levels), which encourages reliance on the main model's representations. Multi-layer feature fusion lets the supervision signal reach different depths of the main model and promotes trajectory-level temporal reasoning. The approach accelerates convergence and improves accuracy, especially at high frame rates such as 50 fps. Experiments on the RoboTwin and PhyWorld benchmarks show state-of-the-art results, with improvements in success rate, inference speed, and physical law adherence.

Key Results

On RoboTwin, Next Forcing reaches a success rate of 94.1% (Clean; 93.5% on Random). At 50 fps and 5k training steps it outperforms LingBot-VA by 29.7 points, a 93.1% relative improvement, and it converges 2.3× faster. Retaining the MCP modules at inference gives a 2× inference speedup.
On PhyWorld, Next Forcing reduces FVD to 4.7, compared with 5.3 for LingBot-VA, and lowers the Abnormal Ratio to 8%, indicating better physical law adherence.
Pretraining on large-scale general videos (~3.5 million clips) shows over 50% reduction in FVD, confirming strong generalization beyond robotic data. Ablation studies highlight the importance of multi-layer feature fusion and causal chain design for performance gains.

Significance

This work addresses fundamental limitations of existing autoregressive video models, notably slow training and short-sighted predictions at high frame rates. By integrating multi-scale, multi-step future predictions, Next Forcing enables models to learn deeper causal dynamics, leading to more coherent, physically consistent videos. Its ability to accelerate training and inference while improving physical understanding has broad implications for robotics, simulation, and virtual environment generation. This approach paves the way for more intelligent visual systems capable of long-term scene reasoning, essential for autonomous agents and immersive media applications.

Technical Contribution

The core technical innovation lies in the design of a multi-chunk prediction framework that forms a causal chain across prediction depths, fused with multi-layer features from the main Transformer backbone. The approach introduces a shifted noise schedule for auxiliary MCP modules, promoting reliance on the main model’s representations. It also proposes a parallel inference mechanism where auxiliary modules are retained, enabling 2× inference speedup without retraining. This combination of multi-scale supervision, causal chaining, and parallel inference constitutes a novel paradigm for efficient, long-range video modeling, extending the capabilities of existing autoregressive models with minimal additional complexity.

Novelty

This is the first systematic application of multi-chunk prediction with causal chaining in continuous video generation, inspired by multi-token prediction in language models. Unlike prior works that focus solely on single-step denoising or short-term prediction, Next Forcing explicitly models multiple future horizons simultaneously, leveraging multi-layer feature fusion and a causal chain to enforce long-term temporal consistency. Its integration into a Transformer-based architecture and the ability to retain auxiliary modules during inference for speedup are key innovations that set it apart from existing methods.

Limitations

The added MCP modules increase model complexity and computational cost, which might be challenging for deployment in resource-constrained environments. Further optimization is needed for lightweight applications.
While effective at high frame rates, the model’s performance at low or variable frame rates remains less explored, and adaptation mechanisms for diverse temporal resolutions are necessary.
The reliance on large-scale pretraining and extensive data may limit applicability in domains with limited data availability. Additionally, the current causal chain design may face challenges in highly nonlinear or chaotic scenes, where long-term prediction errors accumulate.

Future Work

Future research could focus on optimizing the causal chain structure, exploring adaptive noise schedules, and integrating reinforcement learning to improve long-term scene understanding. Extending the framework to multi-modal data, such as combining vision with language or tactile inputs, could further enhance scene comprehension. Additionally, developing more efficient architectures for MCP modules to reduce computational overhead will be crucial for real-time applications. Investigating robustness in highly complex or unpredictable environments remains an open challenge, guiding future efforts toward more resilient models.

AI Executive Summary

Video is a fundamental medium for capturing and understanding the dynamic world around us. From robotics to virtual reality, the ability to generate realistic, coherent videos remains a central challenge in artificial intelligence. Traditional autoregressive models, which predict one frame or chunk at a time, often struggle with slow training convergence and limited accuracy, especially at high frame rates where adjacent frames are nearly identical. This short-sightedness, known as myopic supervision, causes models to rely heavily on appearance shortcuts, hindering their ability to learn meaningful scene dynamics.

To address these limitations, the present work introduces Next Forcing, a novel framework inspired by multi-token prediction strategies in large language models. By extending the prediction target from a single current chunk to multiple future chunks, Next Forcing forces the model to learn the deeper causal relationships that govern scene evolution. This is achieved through auxiliary MCP modules that predict future video segments at different horizons, forming a causal chain across prediction depths. These modules are lightweight and fused with intermediate features from the main Transformer backbone, ensuring rich multi-scale supervision.

During training, the MCP modules are shifted to higher timesteps and exposed to higher noise levels, which encourages the model to rely on its internal representations rather than superficial appearance cues. This design accelerates convergence: on the RoboTwin benchmark, Next Forcing reaches a success rate of 94.1% within 50k steps and converges 2.3× faster than LingBot-VA while also exceeding its accuracy. The framework also shows better physical consistency on the PhyWorld benchmark, indicating a better grasp of physical laws.

An important aspect of Next Forcing is its inference flexibility. The auxiliary MCP modules can be retained during inference to predict the next video chunk in parallel with the current one, giving a 2× inference speedup. This parallel inference capability makes the approach well suited to real-time applications that require high throughput.
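
As a rough illustration of where the 2× figure comes from, the sketch below counts main-model denoising rounds for a toy autoregressive loop. It is a minimal sketch under stated assumptions, not the paper's inference code: the function `generate` is hypothetical, each round is assumed to cost the same with or without the MCP module, and the retained module is assumed to emit exactly one extra chunk per round.

```python
def generate(num_chunks, use_mcp):
    """Toy autoregressive loop that counts main-model denoising rounds.

    Each round denoises the current chunk with the main model. If use_mcp is
    True, the retained MCP module also emits the next chunk in the same round.
    """
    chunks, rounds = [], 0
    while len(chunks) < num_chunks:
        rounds += 1
        chunks.append(("main", len(chunks)))
        if use_mcp and len(chunks) < num_chunks:
            chunks.append(("mcp", len(chunks)))
    return chunks, rounds


_, seq_rounds = generate(16, use_mcp=False)   # MCP modules dropped
_, par_rounds = generate(16, use_mcp=True)    # MCP module retained
print(seq_rounds, par_rounds, seq_rounds / par_rounds)  # prints: 16 8 2.0
```

The ratio is an upper bound in this idealized setting; any extra cost of running the auxiliary module would reduce it.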

Beyond robotics, the method shows promising results in large-scale general video pretraining, reducing FVD scores by over 50%, which underscores its broad applicability. The combination of improved training efficiency, enhanced long-term modeling, and inference acceleration marks a significant step forward in video generation technology. Future work will focus on further optimizing the causal chain structure, reducing computational costs, and extending the framework to multi-modal and multi-task scenarios, ultimately enabling more intelligent and physically grounded visual systems.

Deep Analysis

Background

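Video is a core tool for simulating and understanding real-world dynamics, and video modeling has evolved over the years from simple frame-by-frame prediction toward more sophisticated causal modeling. Early approaches built on generative adversarial networks (GANs) and variational autoencoders (VAEs) achieved breakthroughs in static image generation, but GAN-based video models such as DVD-GAN still faced unstable training and limited quality in continuous video generation. More recently, autoregressive models such as VideoGPT have gained traction. They achieve good continuity through sequential prediction but become short-sighted at high frame rates: the model leans too heavily on the visual similarity of neighboring frames and fails to understand how the scene evolves.

Meanwhile, Transformer architectures such as Video Transformer have greatly increased model expressiveness, yet training remains slow, and long-horizon temporal modeling is a particular bottleneck. To address these problems, researchers have begun exploring multi-chunk prediction, multi-scale fusion, and related strategies that strengthen causal reasoning. Works such as LingBot-VA and DreamZero have made progress on robot manipulation and physical simulation tasks, but they have not fully solved the problems of training efficiency and prediction accuracy at high frame rates.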

Core Problem

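Current video generation models hit a clear bottleneck at high frame rates. The main cause is short-sightedness, or "myopic supervision": the model is supervised only on the current chunk and ignores how the dynamics evolve further into the future. During training the model therefore falls back on the shortcut of copying appearance, struggles to learn deeper causal relationships, and generalizes poorly to complex scenes. The problem is most severe in fast settings such as 50 fps, where neighboring frames are nearly identical: the model falls into the myopic trap more easily, training converges slowly, and generation quality is limited. The key to solving it is to introduce future prediction targets at multiple scales and multiple temporal offsets, which forces the model to learn the underlying dynamics of the scene.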

Innovation

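The paper's main innovation is a multi-chunk prediction (MCP) mechanism that forms a causal chain and markedly improves how the model captures future dynamics. The specific contributions are:

- Multi-chunk prediction: during training the model simultaneously predicts the next 1, 2, and 3 video chunks, strengthening long-horizon memory and causal reasoning.
- Multi-layer feature fusion: intermediate features are extracted from different depths of the model and fused as the prediction input, so information at different scales is put to use.
- Causal chain design: each predicted chunk depends on the output for the previous one, forming a causal chain that guides the model toward the causal regularities of scene evolution.
- Parallel inference: at inference time the auxiliary modules can be retained to predict multiple video chunks in parallel, substantially increasing inference speed.
- Training strategy: the auxiliary modules use a higher timestep shift and higher noise levels than the main model, which strengthens reliance on future dynamics and avoids myopia.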

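Methodology

- Base model: a 30-layer Transformer encodes the video latent representations.
- Multi-chunk prediction: three auxiliary MCP modules are added during training to predict the next 1, 2, and 3 video chunks respectively.
- Feature fusion: hidden states are extracted at different depths of the model (layers 4, 12, 20, and 30), concatenated, and compressed by an MLP to form the multi-scale fused input.
- Timestep shift and noise injection: the target video latents are assigned shifted timesteps, and noise at different levels is added following Flow Matching, which makes the model more robust when predicting future dynamics.
- Causal chain: each prediction depth depends on the output of the previous depth, and a lightweight Transformer block at each depth predicts the future velocity.
- Loss function: the main model's Flow Matching loss is combined with an auxiliary loss for each predicted chunk, and the whole model is optimized jointly.
- Training: the model is trained on large-scale, diverse data so that it learns to predict at multiple scales and multiple temporal offsets.
- Inference: the auxiliary modules can either be dropped, leaving the main model alone, or retained for parallel multi-chunk prediction that speeds up inference.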
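
The feature-fusion and causal-chain steps above can be sketched as a toy forward pass. This is a minimal sketch under stated assumptions rather than the paper's architecture: the tapped layers (4, 12, 20, 30), the depth of 30, and the three prediction depths come from the steps above, while the hidden width, the token count, the random weights, and the use of a single `tanh` layer in place of each lightweight Transformer block are illustrative stand-ins. All function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_LAYERS = 30                 # main-model depth, from the steps above
TAP_LAYERS = (4, 12, 20, 30)    # layers whose hidden states are fused
NUM_MCP = 3                     # auxiliary heads for the next 1, 2, 3 chunks
D = 16                          # hidden width (illustrative assumption)
TOKENS = 8                      # tokens per chunk (illustrative assumption)


def linear(d_in, d_out):
    """Random matrix standing in for a trained linear layer."""
    return rng.normal(scale=d_in ** -0.5, size=(d_in, d_out))


def main_model(x, layers):
    """Toy residual stack; returns the output and the tapped hidden states."""
    taps = []
    for idx, w in enumerate(layers, start=1):
        x = x + np.tanh(x @ w)
        if idx in TAP_LAYERS:
            taps.append(x)
    return x, taps


def fuse(taps, w1, w2):
    """Concatenate tapped features, then compress back to width D with an MLP."""
    h = np.concatenate(taps, axis=-1)           # (TOKENS, 4 * D)
    return np.tanh(h @ w1) @ w2                 # (TOKENS, D)


def mcp_chain(fused, noisy_future, heads):
    """Causal chain: depth k consumes the state produced by depth k - 1."""
    state, velocities = fused, []
    for k, (w_in, w_out) in enumerate(heads):
        h = np.concatenate([state, noisy_future[k]], axis=-1)  # (TOKENS, 2 * D)
        state = np.tanh(h @ w_in)               # handed on to the next depth
        velocities.append(state @ w_out)        # velocity for future chunk k + 1
    return velocities


layers = [linear(D, D) for _ in range(NUM_LAYERS)]
w1, w2 = linear(len(TAP_LAYERS) * D, 2 * D), linear(2 * D, D)
heads = [(linear(2 * D, D), linear(D, D)) for _ in range(NUM_MCP)]

current_chunk = rng.normal(size=(TOKENS, D))
noisy_future = rng.normal(size=(NUM_MCP, TOKENS, D))

_, taps = main_model(current_chunk, layers)
fused = fuse(taps, w1, w2)
velocities = mcp_chain(fused, noisy_future, heads)
print([v.shape for v in velocities])
```

Because each depth reads the state left by the previous one, a change to the first future chunk's input affects the predictions at depths 2 and 3, while a change at depth 3 leaves depths 1 and 2 untouched. That one-way dependence is what "causal chain" means in this sketch.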

Experiments

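Experiments are run on two benchmarks, RoboTwin and PhyWorld: the former covers 50 robot manipulation tasks, and the latter tests how well a model adheres to physical laws. The training data comprise 2,500 robot demonstrations and 25,000 random scenes; training uses 64 GPUs for at most 50k steps. Hyperparameters include a main-model timestep shift of s_main = 5, an auxiliary-module shift of s_mcp = 10, and a maximum chunk size of M = 4. Baselines include LingBot-VA and DreamZero, and the main metrics are success rate (RoboTwin) and FVD (PhyWorld). Comparisons across frame rates (12 fps and 50 fps) and across training steps assess training speed and generation quality. Ablation experiments examine the contributions of multi-layer feature fusion, causal chain depth, and noise level.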
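
The setup above gives the shift values s_main = 5 and s_mcp = 10 but not the mapping they parameterize. The following is a minimal sketch, assuming the shift form commonly used in flow-matching image and video models, t' = s·t / (1 + (s − 1)·t), with t = 1 denoting pure noise. That formula and the function name `shift_timestep` are assumptions for illustration, not details confirmed here.

```python
import numpy as np

S_MAIN = 5.0   # main-model shift, from the experimental setup
S_MCP = 10.0   # MCP-module shift, from the experimental setup


def shift_timestep(t, s):
    """Map t in [0, 1] (1 = pure noise) to s*t / (1 + (s - 1)*t).

    Assumed form of the shift; a larger s pushes timesteps toward 1.
    """
    t = np.asarray(t, dtype=float)
    return s * t / (1.0 + (s - 1.0) * t)


t = np.linspace(0.0, 1.0, 5)
t_main = shift_timestep(t, S_MAIN)
t_mcp = shift_timestep(t, S_MCP)
for raw, a, b in zip(t, t_main, t_mcp):
    print(f"t={raw:.2f}  main={a:.3f}  mcp={b:.3f}")
```

Under this assumed mapping, every interior timestep lands at a higher noise level for the MCP modules than for the main model, which matches the stated intent of exposing the auxiliary modules to noisier targets.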

Results

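On the RoboTwin benchmark, Next Forcing reaches a success rate of 94.1%. At 50 fps and only 5k training steps it already leads LingBot-VA by 29.7 percentage points, and it converges 2.3× faster. Compared with conventional single-chunk prediction models, convergence time is markedly shorter, and the advantage is largest at high frame rates. On PhyWorld, Next Forcing lowers FVD to 4.7, better than LingBot-VA's 5.3, with clearly improved physical consistency. Pretraining experiments on large-scale general video data show an FVD reduction of more than 50%, supporting the method's generalization. Ablation studies indicate that multi-layer feature fusion and the causal chain design are key to the gains, and that the choice of noise level also has a significant effect on long-horizon prediction.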

Applications

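The technique applies to robot manipulation, virtual reality, physical simulation, and similar settings, and it is strongest where high-frame-rate, high-precision dynamics prediction is required. By improving training efficiency and inference speed, it enables more natural and realistic virtual environment generation, which could substantially benefit fields such as autonomous robotics and augmented reality. In the future it could also incorporate multimodal information (such as sound and touch) to widen its range of applications and advance multi-sensory scene understanding.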

Limitations & Outlook

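Although Next Forcing performs well in training speed and generation quality, the multi-chunk prediction mechanism increases model complexity and parameter count, which may make deployment difficult in resource-constrained settings. The model also still shows prediction errors in extremely complex or nonlinear dynamic scenes, and its generalization needs improvement when training data are scarce or lack diversity. Future work should explore more efficient feature fusion strategies and ways to optimize the causal chain, in order to lower computational cost and improve robustness.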

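Plain Language (accessible to non-experts)

Imagine cooking an elaborate dish. Each step depends on what you have already done, but if you only pay attention to the step in front of you, you can easily lose track of how the overall flavor is developing. Traditional video generation models behave like that: they focus on the current step, imitate the scene in front of them, and ignore what happens next, so the generated frames look copy-pasted and lack coherence.

Next Forcing is like a skilled chef who watches the current step and also anticipates how the next few steps will change the dish, so that every step fits the overall flavor. During training the model learns the content of the next few steps at the same time, which makes it better at understanding how a scene changes. The generated video then reads like a coherent story rather than a series of still pictures stitched together. The core idea is to teach the model to foresee the future instead of only seeing the present, which makes the video more realistic and natural.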

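ELI14 (explained like you're 14)

Imagine playing a really complicated jigsaw puzzle where you have to place the piece in front of you and also guess what the next few pieces will look like. The traditional approach focuses only on the current piece: once it is placed, the job is done, but you have no idea what comes next. The new approach is like a clever friend who predicts what the next few pieces look like so you can prepare ahead and finish faster and better.

Video generation works the same way. Older methods look only at the current frame, like looking only at the piece in front of you, so they fall into the trap of copying neighboring frames and never understand how the scene changes. The new method predicts the next few chunks of video, which helps the model learn the rules behind those changes, just like knowing the shape of the upcoming pieces in advance. The generated video becomes more coherent and realistic, and training gets much faster too.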

Glossary

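Multi-Chunk Prediction (MCP)

A video generation training approach that predicts several future temporal chunks at once, improving the model's understanding of long-horizon dynamics.

In this paper, multi-chunk prediction forms a causal chain that improves the model's reasoning about future scenes.

Causal Chain

A sequence of prediction modules connected in order, where the output for one chunk serves as input for the next, mirroring the cause-and-effect structure of scene evolution.

Used in the design to strengthen the model's causal reasoning over time series.

Flow Matching

A training method for generative models that learns a velocity field transforming noise samples into real data.

Used to train the generative model over video latent representations.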
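
For reference, one common way to write a Flow Matching objective uses a straight-line path between data and noise. The notation below is an assumption (linear path, with t = 1 as pure noise), and the weights λ_k on the auxiliary terms are hypothetical; only the fact that a main loss is combined with one auxiliary loss per predicted chunk comes from the methodology described earlier.

```latex
\begin{aligned}
x_t &= (1 - t)\,x_0 + t\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I), \quad t \in [0, 1], \\
\mathcal{L}_{\mathrm{FM}}(\theta) &= \mathbb{E}_{x_0,\,\epsilon,\,t}
  \left\| v_\theta(x_t, t) - (\epsilon - x_0) \right\|_2^2, \\
\mathcal{L}_{\mathrm{total}} &= \mathcal{L}_{\mathrm{FM}}^{\mathrm{main}}
  + \sum_{k=1}^{3} \lambda_k\, \mathcal{L}_{\mathrm{FM}}^{\mathrm{mcp},\,k}.
\end{aligned}
```

Here x_0 is a clean video latent, v_θ is the predicted velocity, and the target ε − x_0 is the constant velocity of the straight path from data to noise.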

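Transformer

An attention-based deep learning architecture that is well suited to sequence data.

This paper uses a 30-layer Transformer as the main model.

Fréchet Video Distance (FVD)

A metric for the gap between generated and real videos; lower is better.

Used to evaluate generation quality.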
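
FVD applies the Fréchet distance between two Gaussians to features of real and generated videos extracted by a pretrained video network (I3D in the original FVD formulation). The symbols below follow the usual notation and are not taken from this summary:

```latex
\mathrm{FVD} = \left\| \mu_r - \mu_g \right\|_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

where (μ_r, Σ_r) and (μ_g, Σ_g) are the mean and covariance of the feature vectors for real and generated videos respectively. The score is zero when the two feature distributions have identical means and covariances.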

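Physical Law Adherence

Whether the videos a model generates obey the physical laws of the natural world.

Used as an evaluation criterion in the PhyWorld benchmark.

Timestep Shift

Shifting the noise level during training to make the model more robust across noise levels.

Used in this paper as the noise injection strategy for multi-chunk prediction.

Multi-Scale Feature Fusion

Combining feature information from layers at different depths to enrich the model's representations.

Improves the accuracy of multi-chunk prediction.

Parallel Inference

Predicting several temporal chunks at the same time during inference, which significantly increases generation speed.

Achieved by retaining the auxiliary modules.

Physical Simulation

Using a model to generate videos that obey physical laws, as a test of the model's understanding.

Evaluated on PhyWorld.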

Abstract

Autoregressive video generation has emerged as a powerful paradigm for World Action Models (WAMs). However, existing approaches suffer from slow training convergence and limited converged accuracy, particularly at high frame rates, as the training supervision is confined to the current chunk without explicit signals about future dynamics; they also suffer from slow inference due to iterative video denoising. In this paper, we present Next Forcing, a multi-chunk prediction (MCP) framework for causal world modeling that enables faster training, higher accuracy, and accelerated inference. Inspired by multi-token prediction in large language models, Next Forcing introduces an MCP training objective that augments the main model with lightweight auxiliary MCP modules to simultaneously denoise video chunks at multiple future temporal horizons (next$^1$, next$^2$, next$^3$ chunks). These MCP modules form a causal chain across prediction depths, where intermediate features fused from multiple layers of the main model are leveraged to predict future dynamics, allowing near-future predictions to inform farther-future ones and providing dense multi-scale temporal supervision back to the main model. During training, the MCP modules significantly accelerate convergence and improve converged accuracy, especially at high frame rates: at 50 fps, Next Forcing achieves a 93.1% relative improvement over LingBot-VA at 5k training steps and 2.3x faster convergence, and establishes new state-of-the-art results on the RoboTwin benchmark (94.1/93.5% on Clean/Random). At inference, the MCP modules can be retained to predict the next video chunk in parallel with the current one, achieving 2x inference acceleration. Next Forcing also demonstrates significant improvements on PhyWorld, a benchmark evaluating adherence to physical laws in video generation, and over 50% FVD reduction on general video pretraining.
