MotiMotion: Motion-Controlled Video Generation with Visual Reasoning

TL;DR

MotiMotion integrates VLM-based reasoning and confidence-aware control for motion-controlled video generation, outperforming baselines on MotiBench.

cs.CV, 2026-05-22

Lee Hsin-Ying, Hanwen Jiang, Yiqun Mei, Jing Shi, Ming-Hsuan Yang, Zhixin Shu

Keywords: motion control, video generation, visual language models, causal reasoning, confidence modulation

Key Findings

Methodology

This paper presents MotiMotion, a framework that reformulates motion-controlled video generation as a reasoning-then-generation pipeline. It has two core components. The first is a training-free Visual Language Model (VLM)-based reasoning module that interprets sparse user trajectories and visual context, then produces refined trajectories that are physically plausible and consistent with commonsense, along with hallucinated secondary motions. The second is a confidence-aware motion control mechanism that modulates guidance strength according to trajectory confidence scores. The video generator is a diffusion transformer trained with a flow-matching objective and conditioned on spatiotemporal trajectory heatmaps that are encoded into the latent space of a 3D VAE. Training uses the OpenVid dataset, with trajectory degradation to simulate input uncertainty. MotiBench, a new benchmark of interaction-centric pre-event physical scenes, supports systematic evaluation of physical realism and causal consistency.

Key Results

On MotiBench, MotiMotion scores 0.285 on physical realism, 0.493 on photorealism, and 0.641 on semantic consistency. It leads both MagicMotion (0.157, 0.550, 0.343) and Wan-Move (0.218, 0.483, 0.511) on physical realism and semantic consistency, indicating more plausible object behaviors and interactions, while MagicMotion scores higher on photorealism (0.550 vs. 0.493).
In both automated evaluation with the Gemini 3.1 Pro VLM and human two-alternative forced-choice (2AFC) studies, MotiMotion wins over 70% of comparisons on object-property and interaction criteria, with human preference rates reaching 97.9%.
Ablation studies confirm the critical roles of VLM-based prompt and motion reasoning as well as confidence-aware control, each contributing substantial improvements in physical realism and semantic consistency.

Significance

This work addresses a fundamental limitation of existing motion-controlled video generation methods that rigidly execute sparse user trajectories without reasoning about physical causality or commonsense. By integrating VLMs as zero-shot visual reasoners, MotiMotion enables understanding of visual context and implicit causal effects, allowing generation of videos with realistic secondary motions and interactions. The confidence-aware control mechanism further enhances robustness to imperfect inputs. This advances the controllability and realism of video generation, reducing user burden in specifying detailed motion while enabling physically plausible and semantically consistent outputs. The approach has broad implications for interactive content creation, VR/AR, robotics, and simulation, tackling long-standing challenges in dynamic visual synthesis.

Technical Contribution

Technically, MotiMotion innovates by leveraging training-free VLMs to perform visual and causal reasoning for refining sparse user motion inputs and hallucinating secondary effects, a departure from prior methods that treat trajectories as ground truth. The confidence-aware control scheme introduces a novel mechanism to modulate the influence of motion conditioning based on confidence scores, balancing strict adherence and generative prior reliance, thus addressing input uncertainty. The use of flow-matching objectives with a diffusion transformer and 3D VAE latent space enables efficient and precise spatiotemporal motion control. Additionally, MotiBench provides a new standardized benchmark focusing on pre-event physical interactions, filling a critical gap in evaluation datasets.

Novelty

MotiMotion is the first to incorporate zero-shot visual language model reasoning into motion-controlled video generation, enabling causally consistent and commonsense-aligned motion planning beyond rigid trajectory execution. The confidence-aware control mechanism innovatively addresses the challenge of imperfect user inputs by dynamically adjusting conditioning strength. The introduction of MotiBench as a dedicated benchmark for causal physical interaction video generation further distinguishes this work from prior studies.

Limitations

The framework heavily relies on the reasoning capabilities of the underlying VLM; inaccuracies or misinterpretations by the VLM can lead to implausible motion planning and degraded generation quality.
The confidence scoring mechanism is based on simulated trajectory degradation during training; accurately estimating confidence for real user inputs remains an open challenge that may affect control effectiveness.
Current evaluations focus on relatively short sequences and limited object complexity; scaling to longer, multi-object interactions and higher-resolution videos may face computational and modeling challenges.

Future Work

Future directions include enhancing the reasoning accuracy and multimodal understanding of VLMs, integrating explicit physics simulation modules to improve causal motion inference, and developing robust confidence estimation methods for real-world user inputs. Extending MotiMotion to handle longer temporal horizons, complex multi-object interactions, and higher-resolution video generation with improved efficiency will be critical for practical deployment.

AI Executive Summary

The field of image-to-video generation has seen remarkable progress with the advent of diffusion models and large-scale foundation models capable of synthesizing high-fidelity temporal dynamics. However, precise and logical controllability of motion remains a critical bottleneck. Existing motion-controlled video generation methods rely heavily on user-provided trajectories, which are often sparse, imprecise, and lack causal completeness. This leads to unnatural or implausible video outputs, especially when secondary causal effects are omitted.

To address these challenges, the authors propose MotiMotion, a novel framework that reframes motion control as a reasoning-then-generation problem. At its core, MotiMotion integrates a training-free Visual Language Model (VLM) as a reasoning engine that interprets sparse user trajectories, visual context, and textual prompts to generate refined, physically plausible motion plans including secondary motions. Complementing this, a confidence-aware control mechanism dynamically modulates the adherence of the video generator to the input trajectories based on their confidence scores, allowing flexible balancing between strict trajectory following and reliance on the model’s generative priors.

Technically, MotiMotion builds upon a flow-matching diffusion transformer architecture with a 3D variational autoencoder latent space for video representation. The VLM, Gemini 3.1 Pro, provides zero-shot visual and causal reasoning without additional training. The authors also curate MotiBench, a new benchmark dataset featuring interaction-centric pre-event physical scenes that require causal and commonsense reasoning for plausible video generation.
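For readers unfamiliar with flow matching, the objective can be sketched as follows. This is the commonly used linear-path (rectified-flow style) form, given here as an assumption: the summary does not state the exact parameterization used by MotiMotion or its base generator. The symbols $x_0$ (noise), $x_1$ (clean video latent), $c$ (all conditioning inputs, such as the reference-image latent, the motion-heatmap latent, and the text prompt), and $v_\theta$ (the diffusion transformer) are illustrative names.

```latex
x_t = (1 - t)\,x_0 + t\,x_1,
\qquad x_0 \sim \mathcal{N}(0, I),\quad x_1 \sim p_{\text{data}},\quad t \sim \mathcal{U}[0, 1]

\mathcal{L}_{\text{FM}}(\theta)
  = \mathbb{E}_{t,\,x_0,\,x_1}
    \left\| v_\theta(x_t, t, c) - (x_1 - x_0) \right\|_2^2
```

Under this form the network regresses the constant velocity $x_1 - x_0$ along the straight path from noise to data, and sampling integrates $\mathrm{d}x_t/\mathrm{d}t = v_\theta(x_t, t, c)$ from $t = 0$ to $t = 1$.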

Experimental results show that MotiMotion outperforms state-of-the-art baselines such as MagicMotion and Wan-Move on MotiBench in physical realism and semantic consistency, while its photorealism score (0.493) falls between those of Wan-Move (0.483) and MagicMotion (0.550). Both automated VLM-based evaluation and human 2AFC preference studies favor MotiMotion for realistic object behaviors and interactions. Ablation studies highlight the contributions of the VLM reasoning module and confidence-aware control.

This work advances the controllability and realism of motion-controlled video generation by bridging the gap between sparse user inputs and physically plausible, causally consistent motion synthesis. It reduces the burden on users to specify detailed trajectories and opens new avenues for interactive video editing, virtual reality, robotics, and simulation. Limitations include reliance on VLM reasoning accuracy, challenges in confidence estimation for real inputs, and scalability to complex, long-horizon scenarios.

Looking forward, enhancing VLM reasoning capabilities, integrating physics simulation, improving confidence estimation, and scaling to more complex video generation tasks constitute promising directions to further elevate the fidelity and applicability of controllable video synthesis.

Deep Analysis

Background

Image-to-video generation has rapidly evolved with diffusion models such as DDPMs (Ho et al.) and large-scale foundation models like DeepMind's Gemini series, enabling high-quality, semantically aligned dynamic content synthesis. Despite these advances, precise temporal control remains challenging. Existing methods enable motion control via user inputs such as drag trajectories (Wu et al.), bounding box sequences (Wang et al.), or optical flow maps (Burgert et al.), bridging static prompts and dynamic outputs. However, these inputs are often sparse, imprecise, and lack causal completeness, making it difficult to generate physically plausible and semantically consistent videos. Users struggle to specify detailed motion dynamics like acceleration or mechanical linkages, increasing interaction complexity. Current models treat user trajectories as ground truth and mechanically execute them, ignoring visual context and physical commonsense, limiting realism and controllability.

Core Problem

The core problem is enabling motion-controlled video generation that accounts for physical causality and commonsense reasoning despite sparse, imprecise user inputs. Key challenges include: 1) sparse and coarse user trajectories lack temporal pacing and secondary causal effects; 2) strict trajectory adherence ignores implicit physical and semantic intent, resulting in unrealistic motions; 3) absence of mechanisms to handle input uncertainty and balance strict following with generative flexibility; 4) lack of dedicated benchmarks for evaluating physical realism and causal consistency. Addressing these is critical for improving video generation controllability, realism, and user experience.

Innovation

This work introduces several key innovations:

- Incorporation of training-free Visual Language Models (VLMs) as zero-shot visual and causal reasoners to interpret sparse trajectories, visual context, and textual prompts, generating refined, physically plausible primary and secondary motion plans.

- Development of a confidence-aware control mechanism that dynamically modulates the influence of motion conditioning based on trajectory confidence scores, enabling adaptive adherence to user inputs.

- Utilization of a flow-matching diffusion transformer architecture with 3D VAE latent space encoding for efficient, precise spatiotemporal motion control.

- Creation of MotiBench, a novel benchmark dataset focusing on pre-event physical interaction scenes, facilitating systematic evaluation of causal and physical plausibility in video generation.

Methodology

- Base Video Generator: Adopts the Wan 2.2 I2V-A14B framework, a diffusion transformer trained with a flow-matching objective to learn vector fields mapping noise to data distributions.

- Motion Representation: Encodes sparse point trajectories as spatiotemporal heatmaps with 2D Gaussian kernels placed at normalized coordinates within video frames, replicated across channels to match the VAE input.

- Motion Conditioning: Projects motion heatmaps into latent space via a pretrained 3D VAE encoder; the result is concatenated with the noisy latent and the reference-image latent and fed to the diffusion transformer.

- VLM-Based Reasoning Module: Uses Gemini 3.1 Pro to jointly process trajectory text, a trajectory visualization overlaid on the input image, and optional textual prompts, generating detailed narrative prompts and refined trajectories that include secondary motions.

- Iterative Refinement: Allows multiple rounds of VLM reasoning and correction until motion plans satisfy physical plausibility and user intent.

- Confidence-Aware Training: Simulates trajectory imperfections by degrading 50% of training samples with affine transformations, temporal linearization, and smoothing, and associates confidence scores with the samples to guide model learning.

- Signal Modulation: Scales Gaussian kernel amplitudes in motion heatmaps based on confidence scores, controlling the strength of conditioning during generation to balance strict adherence and generative prior reliance.
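The motion representation and signal modulation steps above can be illustrated with a minimal NumPy sketch. It renders each trajectory point as a 2D Gaussian at its normalized coordinates, scales the kernel amplitude by the trajectory's confidence, and replicates the map across channels. The function name `render_heatmaps`, the kernel width `sigma`, the use of a per-pixel maximum to combine overlapping kernels, and the three-channel output are assumptions for illustration, not details confirmed by the summary.

```python
import numpy as np


def render_heatmaps(tracks, confidences, height, width, sigma=0.02, channels=3):
    """Render confidence-scaled Gaussian heatmaps (hypothetical helper).

    tracks:      (N, T, 2) array of normalized (x, y) coordinates in [0, 1].
    confidences: (N,) array of per-trajectory confidence scores in [0, 1].
    Returns a float32 array of shape (T, channels, height, width).
    """
    tracks = np.asarray(tracks, dtype=np.float32)
    confidences = np.asarray(confidences, dtype=np.float32)
    n_tracks, n_frames, _ = tracks.shape

    # Normalized pixel-center coordinates of the output grid.
    ys = (np.arange(height, dtype=np.float32) + 0.5) / height
    xs = (np.arange(width, dtype=np.float32) + 0.5) / width
    grid_y, grid_x = np.meshgrid(ys, xs, indexing="ij")  # each (H, W)

    heat = np.zeros((n_frames, height, width), dtype=np.float32)
    for n in range(n_tracks):
        for t in range(n_frames):
            x, y = tracks[n, t]
            d2 = (grid_x - x) ** 2 + (grid_y - y) ** 2
            # Confidence sets the kernel amplitude (assumed modulation rule).
            blob = confidences[n] * np.exp(-d2 / (2.0 * sigma ** 2))
            # Overlapping kernels are combined with a maximum (assumption).
            heat[t] = np.maximum(heat[t], blob)

    # Replicate the single-channel map across channels to match the VAE input.
    return np.repeat(heat[:, None], channels, axis=1)


if __name__ == "__main__":
    demo_tracks = np.array([[[0.25, 0.75], [0.5, 0.5]]])  # one track, two frames
    maps = render_heatmaps(demo_tracks, [0.5], height=32, width=32)
    print(maps.shape, float(maps.max()))
```

Because the amplitude is linear in the confidence score, halving the confidence halves the conditioning signal everywhere, which is one simple way to let the generator lean more on its own prior for uncertain trajectories.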

Experiments

- Datasets: Trains on OpenVid for motion control; constructs MotiBench with pre-event physical interaction scenes, annotated with hand-drawn trajectories and textual prompts.

- Baselines: Compares against MagicMotion (Li et al.) and Wan-Move (Chu et al.) motion-controlled video generation methods.

- Metrics: Evaluates physical realism, photorealism, and semantic consistency using Gemini 3.1 Pro VLM; conducts human 2AFC preference studies.

- Training: Initial training for 5K steps with learning rate 1e-5, batch size 16; confidence-aware fine-tuning for 3K steps with 50% trajectory degradation.

- Ablations: Tests impact of VLM prompt reasoning, motion reasoning, and confidence-aware control modules.

- Trajectory Extraction: Uses CoTracker3 with 64-grid for point trajectory extraction from training videos.
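The confidence-aware fine-tuning stage degrades half of the training trajectories with affine transformations, temporal linearization, and smoothing. The sketch below shows one way such a degradation could look. All parameter ranges, the blend-toward-a-straight-line form of linearization, the moving-average smoother, and the rule that degraded samples get confidence 0.0 and clean samples 1.0 are assumptions; the summary does not specify them. The names `degrade_trajectory` and `sample_training_pair` are hypothetical.

```python
import numpy as np


def degrade_trajectory(track, rng, max_shift=0.05, max_scale=0.1,
                       max_angle=0.2, smooth_window=5):
    """Return a degraded copy of a (T, 2) normalized trajectory (illustrative)."""
    assert smooth_window % 2 == 1, "smooth_window must be odd"
    track = np.asarray(track, dtype=np.float64)
    n_frames = track.shape[0]

    # 1) Random affine transform (rotation, scale, shift) about the centroid.
    angle = rng.uniform(-max_angle, max_angle)
    scale = 1.0 + rng.uniform(-max_scale, max_scale)
    shift = rng.uniform(-max_shift, max_shift, size=2)
    rot = np.array([[np.cos(angle), -np.sin(angle)],
                    [np.sin(angle), np.cos(angle)]])
    center = track.mean(axis=0)
    out = (track - center) @ rot.T * scale + center + shift

    # 2) Temporal linearization: blend toward a constant-speed straight line
    #    between the first and last points, which removes pacing detail.
    alpha = np.linspace(0.0, 1.0, n_frames)[:, None]
    line = (1.0 - alpha) * out[0] + alpha * out[-1]
    mix = rng.uniform(0.0, 1.0)
    out = (1.0 - mix) * out + mix * line

    # 3) Smoothing: moving average with edge padding to keep the length T.
    if smooth_window > 1:
        pad = smooth_window // 2
        padded = np.pad(out, ((pad, pad), (0, 0)), mode="edge")
        kernel = np.ones(smooth_window) / smooth_window
        out = np.stack(
            [np.convolve(padded[:, d], kernel, mode="valid") for d in range(2)],
            axis=1,
        )

    return np.clip(out, 0.0, 1.0)


def sample_training_pair(track, rng, degrade_prob=0.5):
    """Degrade with probability degrade_prob; return (trajectory, confidence)."""
    if rng.random() < degrade_prob:
        return degrade_trajectory(track, rng), 0.0  # low confidence (assumed)
    return np.asarray(track, dtype=np.float64), 1.0  # high confidence (assumed)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    xs = np.linspace(0.2, 0.8, 16)
    demo = np.stack([xs, 0.5 + 0.1 * np.sin(np.pi * xs)], axis=1)
    traj, conf = sample_training_pair(demo, rng)
    print(traj.shape, conf)
```

Pairing each degraded trajectory with a low confidence score gives the model a training signal for when to follow the conditioning closely and when to fall back on its generative prior.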

Results

- MotiMotion achieves a physical realism score of 0.285 on MotiBench, outperforming MagicMotion (0.157) by ~82% and Wan-Move (0.218) by ~31%, with semantic consistency reaching 0.641.

- Automated VLM evaluation and human 2AFC tests show MotiMotion wins over 70% of comparisons on object properties and interactions, with human preference rates up to 97.9%, indicating superior physical and semantic fidelity.

- Ablation studies demonstrate that VLM-based prompt and motion reasoning improve physical realism by ~0.07, with confidence-aware control further enhancing motion naturalness and robustness.

- Applying the reasoning module to other baselines consistently improves their physical realism and semantic consistency.
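The relative improvements in physical realism can be checked directly from the reported scores. The short script below is only a worked calculation; the helper name `relative_gain` is introduced here for illustration.

```python
# Physical realism scores on MotiBench, as reported in this summary.
scores = {"MotiMotion": 0.285, "MagicMotion": 0.157, "Wan-Move": 0.218}


def relative_gain(ours, baseline):
    """Relative improvement of `ours` over `baseline`, in percent."""
    return (ours - baseline) / baseline * 100.0


for name in ("MagicMotion", "Wan-Move"):
    gain = relative_gain(scores["MotiMotion"], scores[name])
    print(f"vs. {name}: +{gain:.1f}%")
```

This yields gains of about 81.5% over MagicMotion and 30.7% over Wan-Move, in line with the approximate percentages quoted for physical realism.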

Applications

�� Interactive Video Editing: Enables users to control complex video motions with sparse inputs, lowering editing difficulty and enhancing creative workflows.

�� Virtual and Augmented Reality: Generates physically and causally consistent dynamic scenes, improving immersion and interaction realism.

�� Robotics and Simulation: Supports vision-based reasoning for environment understanding and motion prediction, aiding autonomous navigation and manipulation.

�� Educational and Entertainment Content Creation: Automates generation of physically plausible animations and effects, enriching media production.

�� Film and Visual Effects Previsualization: Facilitates rapid simulation of complex physical interactions to assist design and decision-making.

Limitations & Outlook

- Heavy reliance on VLM reasoning accuracy; errors in visual or causal inference can degrade motion planning and generation quality.

- Confidence scoring is based on simulated degradation; real-world confidence estimation remains challenging, potentially affecting control precision.

- Current method validated on relatively short sequences and limited complexity; scaling to longer, multi-object, high-resolution videos poses computational and modeling challenges.

Abstract

Current motion-controlled image-to-video generation models rigidly follow user-provided trajectories that are often sparse, imprecise, and causally incomplete. Such reliance often yields unnatural or implausible outcomes, especially by missing secondary causal consequences. To address this, we introduce MotiMotion, a novel framework that reformulates motion control as a reasoning-then-generation problem. To encourage causally grounded and commonsense-consistent interactions, we leverage a training-free vision-language reasoner to refine image-space coordinates of primary trajectories and to hallucinate plausible secondary motions. To further improve motion naturalness, we propose a confidence-aware control scheme that modulates guidance strength, enabling the model to closely follow high-confidence plans while correcting artifacts under low-confidence inputs with its internal generative priors. To support systematic evaluation, we curate a new image-to-video benchmark, MotiBench, consisting of interaction-centric scenes where new events are triggered by motion. Both VLM-based evaluation and a human study on MotiBench demonstrate that MotiMotion produces videos with more plausible object behaviors and interaction, and is preferred over existing approaches.

References (20)

T2I-Copilot: A Training-Free Multi-Agent Text-to-Image System for Enhanced Prompt Interpretation and Interactive Generation

Chieh-yun Chen, Min Shi, Gong Zhang et al.

2025

Auto-Encoding Variational Bayes

Diederik P. Kingma, Max Welling

2013

Image Conductor: Precision Control for Interactive Video Synthesis

Yaowei Li, Xintao Wang, Zhaoyang Zhang et al.

2024

Wan-Move: Motion-controllable Video Generation via Latent Trajectory Guidance

Ruihang Chu, Yefei He, Zhekai Chen et al.

2025

VideoREPA: Learning Physics for Video Generation through Relational Alignment with Foundation Models

Xiangdong Zhang, Jiaqi Liao, Shaofeng Zhang et al.

2025

PhysGen3D: Crafting a Miniature Interactive World from a Single Image

Boyuan Chen, Hanxiao Jiang, Shaowei Liu et al.

2025

Flow Matching for Generative Modeling

Y. Lipman, Ricky T. Q. Chen, Heli Ben-Hamu et al.

2022

Force Prompting: Video Generation Models Can Learn and Generalize Physics-based Control Signals

Nate Gillman, Charles Herrmann, Michael Freeman et al.

2025

CameraCtrl: Enabling Camera Control for Video Diffusion Models

Hao He, Yinghao Xu, Yuwei Guo et al.

2025

MotionCanvas: Cinematic Shot Design with Controllable Image-to-Video Generation

Jinbo Xing, Long Mai, Cusuh Ham et al.

2025

Peekaboo: Interactive Video Generation via Masked-Diffusion

Yash Jain, Anshul Nasery, Vibhav Vineet et al.

2023

Trajectory Attention for Fine-grained Video Motion Control

Zeqi Xiao, Wenqi Ouyang, Yifan Zhou et al.

2024

MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model

Zhongcong Xu, Jianfeng Zhang, J. Liew et al.

2023

LeviTor: 3D Trajectory Oriented Image-to-Video Synthesis

Hanlin Wang, Ouyang Hao, Qiuyu Wang et al.

2024

MetaMorph: Multimodal Understanding and Generation via Instruction Tuning

Shengbang Tong, David Fan, Jiacheng Zhu et al.

2024

Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

Jinheng Xie, Weijia Mao, Zechen Bai et al.

2024

MagicMotion: Controllable Video Generation with Dense-to-Sparse Trajectory Guidance

Quanhao Li, Zhen Xing, Rui Wang et al.

2025

SG-I2V: Self-Guided Trajectory Control in Image-to-Video Generation

Koichi Namekata, Sherwin Bahmani, Ziyi Wu et al.

2024

VideoAgent: Self-Improving Video Generation

Achint Soni, Sreyas Venkataraman, Abhranil Chandra et al.

2024

Generative Video Motion Editing with 3D Point Tracks

Yao-Chih Lee, Zhoutong Zhang, Jiahui Huang et al.

2025
