VLMs are Good Teachers for Video Reasoning via Adaptive Test-Time Optimization

TL;DR

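This paper recasts the vision-language model (VLM) as a test-time teacher for video reasoning: the VLM guides online optimization of a video generation model, yielding a 16.7-point average gain, well above the VLM-as-Solver paradigm (+0.4 points) and Best-of-N scaling (+2.2 points).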

cs.CV 🔴 Advanced 2026-06-02 84 views

Junhao Cheng Liang Hou Tianxiong Zhong Xin Tao Pengfei Wan Kun Gai Jing Liao


Video Reasoning Vision-Language Models Test-Time Optimization Generative Models Deep Learning

Key Findings

Methodology

The study proposes a novel framework where vision-language models (VLMs) are repurposed as teachers during inference, rather than just problem solvers. The approach involves analyzing task instructions and visual context to synthesize differentiable reward signals that encode task-specific rules and goals. During inference, a lightweight LoRA module embedded within the video generation model (VGM) is optimized online through gradient-based updates guided by these rewards. The process includes:
• Task analysis: VLM extracts rules and goals from textual and visual inputs, generating reward queries;
• Feedback evaluation: VLM assesses intermediate video outputs, predicting rule satisfaction;
• Parameter adjustment: Gradients from reward signals update LoRA parameters, refining the reasoning trajectory;
• Iterative optimization: The process repeats until the trajectory satisfies success criteria.
This integration leverages VLM perception capabilities to dynamically steer the generative process, overcoming the static limitations of conventional VGMs.
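The four steps above can be sketched as a small optimization loop. This is a toy illustration, not the paper's implementation: the "generator" is a hypothetical 1-D trajectory controlled by a single scalar `theta` (standing in for the LoRA parameters), the "teacher" is a hand-written reward with an analytic gradient (standing in for the VLM), and the names `ToyTask`, `generate`, `reward_and_grad`, and `optimize_at_test_time`, along with the default step size and tolerance, are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class ToyTask:
    goal: float  # position the final "frame" of the trajectory should reach


def generate(theta, steps=8):
    """Stand-in generator: frame t sits at position theta * t."""
    return [theta * t for t in range(steps + 1)]


def reward_and_grad(traj, task, steps=8):
    """Stand-in teacher: reward is the negative squared distance to the goal.

    Because traj[-1] == theta * steps, d(reward)/d(theta) = -2 * err * steps.
    """
    err = traj[-1] - task.goal
    return -err * err, -2.0 * err * steps


def optimize_at_test_time(task, theta=0.0, lr=0.005, max_iters=50, tol=1e-3):
    """Generate, score, update; stop on success or after max_iters rounds."""
    for it in range(max_iters):
        traj = generate(theta)
        reward, grad = reward_and_grad(traj, task)
        if -reward < tol:      # success criterion met
            return theta, it
        theta += lr * grad     # gradient ascent on the reward
    return theta, max_iters


theta, iters = optimize_at_test_time(ToyTask(goal=4.0))
print(f"theta={theta:.4f}, final position={generate(theta)[-1]:.4f}, rounds={iters}")
```

The loop has two exits, mirroring the description: it stops as soon as the teacher's reward clears the success criterion, and otherwise gives up after a fixed budget of rounds.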

Key Results

Experiments on symbolic (VBVR-Bench) and general-purpose (RULER-Bench) video reasoning benchmarks show an average performance increase of 16.7 points, significantly outperforming the VLM-as-Solver paradigm (+0.4 points) and Best-of-N scaling (+2.2 points) at comparable inference costs.
On VBVR-Bench, the proposed method achieves an overall score of 0.82 (out of 1), with notable improvements in tasks involving spatial reasoning, knowledge, and transformation, especially in long-tail and fine-grained rule adherence scenarios.
On RULER-Bench, the average score across diverse reasoning tasks increased from 0.65 to 0.82, demonstrating strong generalization across multiple domains, including physics, biology, and social scenarios.

Significance

This work advances the field by transforming VLMs into active guiding teachers during inference, enabling models to self-correct and adhere to complex rules dynamically. It addresses the longstanding challenge of logical consistency in video generation, bridging perception and reasoning. The approach offers a scalable, flexible solution that can be integrated into existing generative frameworks, opening pathways for more reliable AI systems in virtual environments, autonomous agents, and intelligent content creation. By harnessing perceptual strengths of VLMs for rule evaluation, the method significantly enhances reasoning robustness and generalization, with potential impacts spanning education, robotics, and simulation industries.

Technical Contribution

The core technical innovation lies in the formulation of differentiable reward signals derived from VLMs, enabling gradient-based online optimization of the video generation process. The framework combines task-specific rule extraction, reward synthesis, and lightweight parameter tuning via LoRA, facilitating efficient adaptation during inference. This contrasts with prior approaches that rely solely on static prompts or post-hoc sampling, offering a dynamic, feedback-driven mechanism for rule adherence. The method also introduces a task-adaptive reward synthesis strategy, automatically deriving process and goal rewards from textual instructions, which streamlines the application across diverse tasks without manual reward engineering. Extensive experiments validate the effectiveness of this integrated approach, setting new benchmarks in symbolic and general video reasoning.
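To make "lightweight parameter tuning via LoRA" concrete, here is a minimal sketch of the standard LoRA update, in which a frozen weight matrix W is augmented by a low-rank product: W_eff = W + (alpha / r) · B · A. The matrix sizes, rank, and `alpha` below are illustrative assumptions, and the summary does not say which layers of the video generation model carry the adapter. The zero initialization of B follows the usual LoRA convention, so the adapter starts as a no-op.

```python
import random


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A.

    w: d_out x d_in (frozen), b: d_out x r, a: r x d_in (both trainable).
    """
    r = len(a)
    scale = alpha / r
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


d_out, d_in, r = 6, 8, 2  # illustrative sizes only
rng = random.Random(0)
W = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]
B = [[0.0] * r for _ in range(d_out)]  # zero init: adapter is initially a no-op

W_eff = lora_effective_weight(W, A, B, alpha=4.0)
trainable = r * (d_in + d_out)  # parameters in A and B
full = d_in * d_out             # parameters in W
print(f"trainable adapter params: {trainable} vs full matrix: {full}")
```

Only A and B receive gradient updates, so the number of values tuned at test time grows with the rank r rather than with the full size of each weight matrix.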

Novelty

This research is the first to systematically incorporate VLMs as teachers during inference for video reasoning, utilizing differentiable rewards and online optimization. Unlike previous works that treat VLMs as static problem solvers or rely on textual prompts alone, this approach leverages VLM perception to evaluate and guide visual trajectories in real-time. The combination of task-adaptive reward synthesis, lightweight parameter tuning, and test-time gradient optimization represents a significant departure from existing paradigms, enabling models to surpass their intrinsic reasoning limitations and adapt to complex, rule-based scenarios dynamically.

Limitations

The computational overhead of online optimization remains significant, especially for high-resolution videos or multi-step reasoning tasks, potentially limiting real-time deployment.
Dependence on the accuracy of VLM evaluations means that biases or errors in perception could misguide the optimization process, especially in ambiguous or noisy scenarios.
The current framework primarily addresses static rule-based tasks; extending it to dynamic, multi-agent, or multi-modal environments requires further research.

Future Work

Future directions include developing more efficient optimization algorithms to reduce inference latency, integrating multi-modal cues such as audio and tactile feedback, and exploring reinforcement learning techniques for autonomous rule discovery. Additionally, scaling the framework to handle real-time video streams and multi-agent interactions could broaden its applicability. Investigating robustness against perception biases and extending the reward synthesis to more complex, hierarchical rules are also promising avenues. Ultimately, the goal is to create adaptive, explainable, and scalable reasoning systems capable of operating reliably in real-world scenarios.

AI Executive Summary

The rapid evolution of video generation models (VGMs) has revolutionized visual content synthesis, achieving unprecedented levels of realism and temporal coherence. However, these models often falter when tasked with complex reasoning that requires adherence to explicit rules or logical constraints. Traditional approaches, such as sampling-based methods like Best-of-N, can mitigate stochastic errors but fail to address systematic logical failures, especially in intricate, rule-based scenarios. This limitation hampers the deployment of VGMs in applications demanding high fidelity and logical consistency, such as autonomous navigation, virtual training, and intelligent content creation.

Recognizing this challenge, the authors propose a paradigm shift: leveraging vision-language models (VLMs) not as static problem solvers but as active teachers during inference. This innovative framework involves analyzing task instructions and visual contexts to synthesize differentiable reward signals that encode specific process constraints and goal conditions. During inference, a lightweight LoRA module embedded within the VGM is optimized online through gradient ascent, guided by these rewards. This dynamic adjustment allows the model to self-correct its reasoning trajectory, aligning it more closely with task requirements.

The core technical mechanism involves three steps: first, the VLM analyzes the task and generates reward queries; second, it evaluates intermediate video outputs, predicting rule satisfaction; third, the resulting feedback is used to update the LoRA parameters via backpropagation. This process iterates until the generated video trajectory satisfies the success criteria or reaches a maximum number of steps. Extensive experiments on symbolic (VBVR-Bench) and general-purpose (RULER-Bench) benchmarks demonstrate that this approach yields a 16.7-point average performance gain, significantly outperforming prior methods. The results highlight the potential of integrating perceptual models as active guides, enabling reasoning beyond the intrinsic capabilities of generative models.
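The summary does not spell out how a VLM's judgment becomes a differentiable signal. One common way to obtain such a signal, shown here purely as an assumption, is to pose each rule as a yes/no question and use the probability the model assigns to "yes" as the reward. With only two candidate answers, the softmax reduces to a sigmoid of the logit difference, whose gradient has a closed form. The function names and the logit values are hypothetical.

```python
import math


def yes_probability(logit_yes, logit_no):
    """Two-way softmax over 'yes'/'no' logits, i.e. sigmoid(logit_yes - logit_no)."""
    return 1.0 / (1.0 + math.exp(-(logit_yes - logit_no)))


def reward_grad_wrt_logits(logit_yes, logit_no):
    """Gradient of the reward p = P(yes) with respect to (logit_yes, logit_no)."""
    p = yes_probability(logit_yes, logit_no)
    return p * (1.0 - p), -p * (1.0 - p)


# Hypothetical logits a teacher might emit for the query
# "Does the trajectory avoid all obstacles?"
p = yes_probability(2.0, -1.0)
g_yes, g_no = reward_grad_wrt_logits(2.0, -1.0)
print(f"reward={p:.3f}, d/d(logit_yes)={g_yes:.3f}, d/d(logit_no)={g_no:.3f}")
```

In a full system these gradients would continue to flow backward through the VLM and the decoded frames into the adapter parameters; that chain is omitted here.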

This work marks a significant advancement in AI reasoning systems, bridging perception and logic in a unified framework. By transforming VLMs into test-time teachers, the method offers a scalable, flexible, and robust solution to the longstanding challenge of logical consistency in video generation. Its implications extend to diverse fields, including robotics, virtual environments, and intelligent content synthesis, where rule adherence and goal accuracy are paramount. Despite the promising results, challenges remain in computational efficiency and perception bias mitigation. Future work aims to optimize the inference process further, incorporate multi-modal cues, and extend the framework to dynamic, real-time scenarios, paving the way for truly intelligent, adaptable AI systems.

Deep Analysis

Background

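Video reasoning, one of the core directions of artificial intelligence research, has evolved rapidly from symbolic logic to deep learning. Early symbolic reasoning methods relied on explicit rules and logical structures; they were highly interpretable but performed poorly in complex scenes and on large-scale data. With the rise of deep learning, Transformer-based models (such as VideoBERT and TimeSformer) achieved notable breakthroughs in video understanding, capturing long-range temporal information and improving content understanding. Meanwhile, high-quality video generation models (such as CogVideo, Veo, and the Wan series) have achieved realistic visual synthesis that meets content-creation needs. However, these models remain weak at logical reasoning, rule following, and causal modeling, and they are especially prone to errors on complex rules and long-tail tasks. In recent years, symbolic and relational reasoning have become active research topics, driving the fusion of symbolic and deep models. Test-time optimization has gradually become an effective way to improve model performance, showing advantages particularly with limited samples and in dynamic scenarios. Vision-language models (VLMs) such as CLIP and ALIGN, with their strong perception capabilities, are regarded as important tools for assisting reasoning. Even so, systematic methods for applying VLMs to video reasoning are still at an exploratory stage, and how to combine the expressive power of generative models with the perceptual strengths of VLMs has become a focus of current research.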

Core Problem

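The core problem is that, although VGMs excel at visual generation, they often show logical errors and inconsistencies when executing complex rules and performing fine-grained reasoning. Specifically:
• Generated trajectories lack logical consistency and are prone to physical conflicts or rule violations;
• Long-tail tasks and detail-level reasoning are hard to satisfy, leading to reasoning failures;
• Traditional sampling methods (such as Best-of-N) can reduce randomness but cannot fundamentally resolve systematic errors.
The key to solving this problem is using the VLM's perception capabilities to dynamically guide the generative model so that its output conforms to the task rules and goals. This involves not only architectural innovation but also an adaptive adjustment mechanism during inference. Given the complexity and diversity of video reasoning, a single static model cannot handle every scenario, so a system design with dynamic adjustment capability is needed.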
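A short calculation illustrates why sampling alone cannot repair systematic errors. Assume, as a simplification, that each sample succeeds independently with probability p and that a perfect verifier picks out any success. Best-of-N then succeeds with probability 1 - (1 - p)^N. The function name and the example values of p and N are chosen only for illustration.

```python
def best_of_n_success(p, n):
    """Probability that at least one of n independent samples succeeds."""
    return 1.0 - (1.0 - p) ** n


for p in (0.3, 0.05, 0.0):
    rates = ", ".join(f"N={n}: {best_of_n_success(p, n):.3f}" for n in (1, 8, 64))
    print(f"p={p}: {rates}")
```

When p is moderate (0.3), eight samples already push the success rate to about 0.94. When the model never produces a rule-abiding trajectory (p = 0), no sampling budget helps, which is the case that changing the model's parameters at test time is meant to address.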

Innovation

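The main innovations of this study are:

1) Turning the VLM into a reasoning teacher: by analyzing the task description, it automatically generates differentiable rewards that guide the generative model toward satisfying rules and goals;

2) Introducing a test-time online optimization mechanism: differentiable reward signals are used to dynamically adjust the generative model's parameters, enabling self-correction of the reasoning trajectory;

3) Designing a task-adaptive reward synthesis strategy: process and goal rewards are extracted automatically from the task description, with no hand-defined reward function;

4) Adopting a lightweight LoRA module: model parameters are adjusted quickly during inference, keeping optimization efficient and fast.

Together, these innovations move beyond the limits of traditional static reasoning models and combine multimodal perception with generation, offering a new approach to complex video reasoning.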

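Methodology

• Task analysis: The VLM teacher parses the textual and visual conditions, identifies the key rules and goals for task success, and generates goal reward queries (for example, the trajectory reaches the target region) and process-constraint queries (for example, avoiding collisions and maintaining continuity);
• Feedback evaluation: During inference, the VLM evaluates the intermediate generated video trajectory, predicts whether it satisfies the rules and goals, and forms a differentiable reward signal;
• Parameter adjustment: The reward signal is backpropagated to adjust the parameters of the LoRA module, optimizing the generative model's reasoning trajectory so that it progressively satisfies the rules and goals;
• Iterative optimization: The process repeats over multiple rounds during inference until the reward meets a preset threshold or the maximum number of rounds is reached, ensuring that the trajectory is logically sound and achieves the goal;
• The method is validated on both symbolic and continuous reasoning scenarios to assess its generality and robustness.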

Experiments

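The experiments validate the method on two main benchmarks: symbolic reasoning tasks (VBVR-Bench) and general-purpose reasoning scenarios (RULER-Bench). The models used are:
• Generative model: a four-step generator distilled from Wan2.2-5B;
• VLM teacher: Qwen3-VL-4B, responsible for task analysis and reward generation.
Test-time optimization uses N=50 optimization rounds, a learning rate of 5e-5, a reward threshold of 0.1, and multi-frame sampling (K=16) for evaluation. Baselines include traditional Best-of-N sampling, VLM-as-Solver, and recent test-time scaling techniques. Multiple rounds of experiments assess the optimization effect, reasoning accuracy, and computational efficiency.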
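The reported hyperparameters can be collected in a small configuration object. The values (N=50, 5e-5, 0.1, K=16) come from the text above; the class name, the field names, and the uniform frame-sampling rule are assumptions for illustration, since the summary does not say how the K frames are chosen. The 81-frame clip length in the usage line is likewise a made-up example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OptimizationConfig:
    max_rounds: int = 50            # N: maximum optimization rounds
    learning_rate: float = 5e-5     # step size for the LoRA update
    reward_threshold: float = 0.1   # stopping threshold reported in the text
    num_frames: int = 16            # K: frames passed to the teacher per evaluation


def sample_frame_indices(total_frames, k):
    """Pick k evenly spaced frame indices, always including first and last.

    Uniform spacing is an assumption, not a detail confirmed by the source.
    """
    if total_frames <= 0 or k <= 0:
        raise ValueError("total_frames and k must be positive")
    if k >= total_frames:
        return list(range(total_frames))
    if k == 1:
        return [0]
    return [(i * (total_frames - 1)) // (k - 1) for i in range(k)]


cfg = OptimizationConfig()
indices = sample_frame_indices(total_frames=81, k=cfg.num_frames)
print(cfg)
print(indices)
```

Integer floor division keeps the indices strictly increasing whenever the clip has more frames than K, so no frame is scored twice.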

Results

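On VBVR-Bench, the improved model reaches an overall score of 0.82 (out of 1), a 16.7-point gain over the baseline, with stronger logical consistency on spatial and knowledge tasks in particular. On RULER-Bench, the average score rises from 0.65 to 0.82 across multiple scenarios, supporting the method's ability to generalize. Compared with traditional methods, the gains are substantial, especially in handling long-tail rules and complex causal relationships. Ablation studies show that the combination of the reward synthesis strategy and LoRA optimization is the key factor behind the improvement. The model generates high-quality reasoning trajectories while keeping computational cost relatively low.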

Applications

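The technique applies broadly to automatic video content generation, virtual simulation, robot navigation, intelligent surveillance, and similar scenarios, and it is especially suited to tasks that require complex rule following and causal reasoning. Dynamically adjusting the reasoning trajectory improves system reliability and interactivity. In the future, combining it with reinforcement learning and multimodal information fusion could enable autonomous learning and self-optimization, advancing the use of intelligent systems in real-world environments.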

Limitations & Outlook

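The current method still consumes substantial computational resources on highly complex, multi-step reasoning tasks; with multi-round optimization and large-scale video generation in particular, real-time performance is limited. The method also depends heavily on the VLM: if the VLM itself is biased on a specific task or in understanding fine details, the reward signal may become inaccurate and degrade the reasoning results. Finally, the current framework mainly targets static rule-based tasks, and its adaptability to dynamic, multi-step, and multimodal scenarios remains to be validated.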

Plain Language Accessible to non-experts

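Imagine you work in a factory with many different machines, each responsible for a different task. Sometimes the machines simply run their preset programs, but when a special situation arises, such as having to operate under a specific rule, they may make mistakes. Now suppose you want these machines not only to follow their programs but also to judge for themselves whether they obeyed the rules and reached the goal. So you bring in a very sharp observer (like the VLM in the paper) who can watch each machine's working state and judge whether it complies with the rules. The observer tells the machine whether it did things right, and the machine uses that feedback to adjust how it operates and become smarter. As a result, the factory's productivity rises sharply and the machines become more reliable. This mirrors how the paper uses the VLM as a teacher that gives real-time feedback, helping the generative model follow the rules and reach the goal in video reasoning. The adjustment continues until the machine's performance meets expectations, making the whole system smarter and more efficient.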

ELI14 Explained like you're 14

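Imagine you are doing a jigsaw puzzle at school, and your task is to assemble the pieces into a complete picture. At first you might place pieces wrongly or in the wrong order. So your friend (like the VLM in the paper) watches your puzzle and tells you which parts are right and which are still wrong. Following your friend's advice, you move the pieces around, and the puzzle gradually looks more and more like the full picture. This repeats until the puzzle is perfect. The paper's method works the same way: a sharp observer (the VLM) checks the generated video trajectory and tells the model what it did right and what needs fixing. The model adjusts based on this feedback and step by step generates a video that follows the rules and reaches the goal. In the end, the video both looks good and makes logical sense, just like a perfectly finished puzzle. This approach makes the machine smarter and able to learn how to do better on its own, just as you get better and better at the game!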

Abstract

The recent "Reasoning with Video" paradigm utilizes Video Generation Models (VGMs) to generate temporally coherent visual trajectories to complete reasoning tasks. Although state-of-the-art VGMs excel at visual quality, they often struggle to understand and follow task-specific rules, leading to logical failures across diverse reasoning scenarios. Existing efforts try to utilize Vision-Language Models (VLMs) as problem pre-solvers to produce or refine textual guidance for the VGM. However, textual descriptions fail to capture intricate spatiotemporal details, and VGMs often struggle to faithfully execute fine-grained or long-tail instructions even with a valid plan. While VLMs struggle as solvers, they possess strong perception capabilities to evaluate process-constraint satisfaction and final-goal achievement. Leveraging this strength, we introduce a paradigm shift that transitions the role of VLMs to "teachers". Specifically, a VLM teacher extracts task-specific rules to formulate differentiable rewards, guiding a VGM Reasoner via test-time online optimization of a lightweight LoRA module. This strategy enables adaptive test-time optimization and extends the reasoning capabilities beyond the VGM's intrinsic boundaries. Evaluations on symbolic (VBVR-Bench) and general-purpose (RULER-Bench) video reasoning benchmarks show that the proposed method yields a 16.7-point average performance gain, outperforming the VLM-as-Solver paradigm (+0.4 points) and Best-of-N scaling (+2.2 points) by a large margin at comparable test-time cost. These findings reveal that integrating VLMs as test-time teachers offers a promising paradigm for achieving generalizable video reasoning. Project Page: https://VLM-as-Teacher.github.io/


