Physics in 2-Steps: Locking Motion Priors Before Visual Refinement Erases Them

TL;DR

Proposes PhaseLock, a training-free framework that extracts motion priors from 2-step inference, improving physical consistency by 6.2 points on average.

cs.CV 2026-06-05

Woojung Han, Seil Kang, Youngjun Jun, Min-Hung Chen, Fu-En Yang, Seong Jae Hwang


video generation, diffusion models, physical consistency, spectral analysis, motion priors

Key Findings

Methodology

This paper uses spectral analysis to show that, during diffusion denoising, phase information (which encodes structure and motion dynamics) gradually erodes, leading to structural degradation. Few-step inference (e.g., 2 steps) retains more accurate motion priors, while standard multi-step inference (e.g., 50 steps) suffers from phase erosion, causing motion hallucinations and structural inconsistencies. Building on this, the authors introduce PhaseLock, a training-free approach that extracts motion priors from a 2-step inference pass via latent space differences (Latent Delta) and enforces these priors during high-fidelity generation through a guidance mechanism. The method involves:
• spectral decomposition of latent representations into magnitude and phase;
• extraction of motion priors from few-step latent differences;
• application of these priors in the denoising process via Latent Delta Guidance, with a decaying schedule to balance structure and detail.
Experiments on models such as CogVideoX and Wan 2.1 show that PhaseLock improves physical consistency scores by an average of 6.2 points with negligible overhead (1.06× runtime, 1.02× memory), and that it reduces reliance on external guidance methods, which cost roughly 5× the runtime.
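As a rough illustration of the spectral decomposition step, the sketch below splits a latent into per-frame magnitude and phase with a 2D FFT and checks that the two parts recombine into the original. The latent shape `(frames, height, width)`, the function names, and the choice of a spatial 2D FFT are assumptions made for illustration; this summary does not state which axes or transform the authors use.

```python
import numpy as np


def split_magnitude_phase(latent):
    """Return per-frame 2D-FFT magnitude and phase of a (frames, H, W) array."""
    spectrum = np.fft.fft2(latent, axes=(-2, -1))
    return np.abs(spectrum), np.angle(spectrum)


def recombine(magnitude, phase):
    """Rebuild the spatial-domain array from magnitude and phase."""
    spectrum = magnitude * np.exp(1j * phase)
    return np.fft.ifft2(spectrum, axes=(-2, -1)).real


# Random stand-in for a video latent (assumption: no real model is involved).
rng = np.random.default_rng(0)
latent = rng.standard_normal((4, 16, 16))

magnitude, phase = split_magnitude_phase(latent)
reconstructed = recombine(magnitude, phase)

# The decomposition is lossless up to floating-point error.
roundtrip_error = float(np.max(np.abs(reconstructed - latent)))
```

Keeping magnitude and phase as separate arrays makes it possible to track each one across denoising steps, which is the kind of comparison the phase-erosion analysis relies on.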

Key Results

Across multiple models, PhaseLock raises physical consistency scores by an average of 6.2 points over the unguided baselines. For example, CogVideoX improves from 30.8 to 36.0 and Wan 2.1 from 20.9 to 28.7. Spectral analysis shows that PhaseLock mitigates phase erosion: in standard multi-step inference, phase degrades by roughly 18% between step 2 and step 50, while magnitude stays relatively stable. The method preserves motion trajectories more faithfully, reducing hallucinations and structural distortions. Quantitative assessments, including optical flow error measurements, indicate that better phase preservation goes together with improved physical realism.
Further experiments show that motion fidelity is highly sensitive to phase. Controlled perturbation of the phase spectra of ground-truth videos causes large motion errors (up to 8.5× optical flow error), whereas magnitude perturbations have minimal impact. This causal evidence underscores the role of phase in maintaining realistic motion. The spectral analysis also indicates that few-step inference (e.g., 2 steps) inherently retains more phase coherence, which is crucial for physical plausibility. Latent Delta Guidance then enforces this structure during full denoising, which is where the reported improvements come from.
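The link between phase and motion can be illustrated with a toy example that is much simpler than the paper's perturbation study. In the sketch below, a Gaussian blob is circularly shifted by 5 pixels; the shift leaves the Fourier magnitude unchanged, and pairing the first frame's magnitude with the second frame's phase reproduces the shifted frame. The blob, the grid size, and the helper `make_blob` are assumptions chosen for illustration, and the check uses pixel positions rather than optical flow error.

```python
import numpy as np


def make_blob(height, width, center_row, center_col, sigma=2.0):
    """Hypothetical test frame: an isotropic Gaussian blob."""
    rows = np.arange(height)[:, None]
    cols = np.arange(width)[None, :]
    squared_dist = (rows - center_row) ** 2 + (cols - center_col) ** 2
    return np.exp(-squared_dist / (2.0 * sigma ** 2))


frame_a = make_blob(32, 32, 16, 10)
frame_b = np.roll(frame_a, shift=5, axis=1)  # circular shift, 5 px to the right

spec_a = np.fft.fft2(frame_a)
spec_b = np.fft.fft2(frame_b)

# A pure translation changes only the phase, not the magnitude.
magnitude_gap = float(np.max(np.abs(np.abs(spec_a) - np.abs(spec_b))))

# Magnitude from frame A, phase from frame B: the result sits where B sits.
hybrid = np.fft.ifft2(np.abs(spec_a) * np.exp(1j * np.angle(spec_b))).real

peak_col_a = int(np.argmax(frame_a.sum(axis=0)))
peak_col_hybrid = int(np.argmax(hybrid.sum(axis=0)))
```

Here the position of the blob follows whichever frame supplied the phase, which is the intuition behind treating phase as the carrier of motion.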
The proposed framework is model-agnostic and computationally efficient. It leverages the coarse-to-fine nature of diffusion models, where global structure forms early, and high-frequency details emerge later. By constraining the latent space differences, PhaseLock aligns the phase evolution with the physical motion prior, effectively reducing hallucinations without retraining or external physics modules. The approach's simplicity and effectiveness make it suitable for real-world applications requiring physically plausible video synthesis, such as virtual reality, robotics simulation, and animation.
Experimental results across diverse datasets and models support the robustness of PhaseLock. It consistently improves physical consistency with minimal impact on visual fidelity. The method's low overhead (1.06× runtime, 1.02× memory) makes it practical for large-scale deployment. Ablation studies indicate that the gains come from preserving phase information, which is more vulnerable than magnitude during denoising. Because the framework operates only at inference time, it can be integrated with various diffusion architectures, which leaves room for future research on physics-aware generative models.
In summary, this work advances the understanding of diffusion model dynamics by identifying phase erosion as a core factor in physical hallucinations. It offers a novel, training-free solution that exploits early inference insights, providing a significant step toward realistic, physically consistent video synthesis. Future directions include adaptive guidance strategies, multi-scale spectral analysis, and extending the approach to multi-modal content generation, aiming to further bridge the gap between visual realism and physical plausibility.

Significance

This research addresses a fundamental challenge in generative modeling: ensuring physical plausibility in synthesized videos. By uncovering the phase erosion mechanism underlying motion hallucinations, it provides a new perspective and practical solution that does not require retraining or external physics engines. The low-cost, high-impact nature of PhaseLock makes it highly attractive for industry applications such as virtual reality, gaming, and autonomous systems, where realistic motion is critical. Moreover, it opens new avenues for integrating physical constraints directly into the generative process, moving beyond purely semantic or visual fidelity. The approach's generality and efficiency suggest it can be adopted across various diffusion-based frameworks, accelerating progress toward trustworthy AI-generated content. Long-term, this work paves the way for models that inherently understand and respect physical laws, reducing the need for manual correction and increasing user trust in AI-generated videos.

Technical Contribution

The key technical innovation lies in the spectral analysis of diffusion model denoising dynamics, revealing that phase information—crucial for structural and motion coherence—degrades significantly during multi-step inference. The authors introduce a novel, training-free method—Latent Delta Guidance—that leverages early inference (2 steps) to extract a motion prior based on latent space differences. This prior is then enforced during full denoising through a guidance mechanism that adjusts latent representations, with a decaying schedule to preserve details. The approach is grounded in signal processing theory, specifically the relation between latent space differences and inter-frame phase shifts, formalized through Fourier analysis. Unlike existing methods that rely on external physics modules or retraining, PhaseLock operates solely during inference, offering a lightweight yet effective solution for physical consistency enhancement.
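The relation between latent differences and inter-frame phase shifts can be made concrete with the Fourier shift theorem. The derivation below is a sketch under a simplifying assumption that is not taken from the paper: frame $t+1$ of the latent is frame $t$ circularly shifted by a displacement $\mathbf{d}$ on an $N \times N$ grid. The symbols $z_t$, $Z_t$, $\mathbf{k}$, $\mathbf{d}$, and $\theta$ are introduced here for illustration only.

```latex
% Assumption: frame t+1 is frame t circularly shifted by d on an N x N grid.
\begin{align}
z_{t+1}(\mathbf{x}) &= z_t(\mathbf{x} - \mathbf{d}) \\
Z_{t+1}(\mathbf{k}) &= Z_t(\mathbf{k})\, e^{-i\theta(\mathbf{k})},
  \qquad \theta(\mathbf{k}) = \frac{2\pi\, \mathbf{k}\cdot\mathbf{d}}{N} \\
\lvert Z_{t+1}(\mathbf{k}) \rvert &= \lvert Z_t(\mathbf{k}) \rvert \\
\Delta Z(\mathbf{k}) &= Z_{t+1}(\mathbf{k}) - Z_t(\mathbf{k})
  = Z_t(\mathbf{k}) \left( e^{-i\theta(\mathbf{k})} - 1 \right)
  \approx -i\,\theta(\mathbf{k})\, Z_t(\mathbf{k})
  \quad \text{for } \lvert \theta(\mathbf{k}) \rvert \ll 1
\end{align}
```

Under this model the magnitude is unchanged by motion, and the frame-to-frame difference is, to first order, proportional to the phase shift $\theta(\mathbf{k})$. This is why constraining latent differences can act as a constraint on phase evolution, at least for small displacements and low frequencies.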

Novelty

This work is the first to systematically analyze the phase erosion phenomenon in diffusion-based video generation and to exploit early inference steps as a source of reliable motion priors. The introduction of a training-free, spectral-based guidance mechanism that constrains latent space differences is a novel contribution, providing a new paradigm for physics-aware generative modeling. Unlike prior approaches that depend on external physics engines, large-scale data, or retraining, this method leverages intrinsic properties of the diffusion process, making it computationally efficient and broadly applicable. Its core innovation is the insight that phase information, which encodes motion and structure, can be preserved and enforced during high-fidelity generation without additional training or external modules.

Limitations

The method assumes that motion is predominantly represented in low-frequency phase components, which may not hold in highly complex or non-linear dynamic scenes, limiting its effectiveness in such scenarios.
In cases of rapid or highly non-linear motion, the linear approximation between latent differences and phase shifts may break down, reducing the accuracy of the motion prior.
While computationally efficient, the approach may still face challenges at very high resolutions or on long sequences, where computing and enforcing latent differences becomes more expensive and may need further optimization.
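The breakdown of the linear approximation mentioned in the limitations can be checked numerically. The sketch assumes motion is a pure translation, so a Fourier coefficient is multiplied by `exp(-1j * theta)` with `theta = 2 * pi * k * d / N`, and compares the exact change `exp(-1j * theta) - 1` with its first-order approximation `-1j * theta`. The grid size, frequency index, and displacements are hypothetical values picked for illustration.

```python
import numpy as np


def linearization_error(phase_shift):
    """Relative error of approximating exp(-i*theta) - 1 by -i*theta."""
    exact = np.exp(-1j * phase_shift) - 1.0
    linear = -1j * phase_shift
    return float(np.abs(exact - linear) / np.abs(exact))


grid_size = 64        # assumed spatial extent N of the latent
frequency_index = 4   # assumed frequency k along the motion direction

errors = {}
for displacement in (0.25, 1.0, 4.0):  # assumed per-frame shift d, in latent pixels
    theta = 2.0 * np.pi * frequency_index * displacement / grid_size
    errors[displacement] = linearization_error(theta)
```

The relative error grows with the displacement (and, for a fixed displacement, with the frequency index), which is consistent with the limitation that fast motion weakens the extracted prior.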

Future Work

Future research could focus on adaptive spectral analysis, incorporating multi-scale frequency components to better handle complex motions. Integrating learning-based modules to dynamically refine the motion prior extraction process could further enhance robustness. Extending the framework to multi-modal content, such as synchronized audio-visual generation, and real-time applications are promising directions. Additionally, exploring the combination of this spectral guidance with other physical constraints or physics-informed neural networks could lead to even more realistic and reliable content generation, ultimately bridging the gap between semantic fidelity and physical accuracy.

AI Executive Summary

The rapid development of diffusion-based models has revolutionized the field of video synthesis, enabling the creation of highly realistic and detailed content. However, a persistent challenge remains: ensuring that generated videos adhere to fundamental physical laws, especially regarding motion consistency. Existing solutions often rely on external physics engines or extensive retraining, which are computationally expensive and difficult to scale. This paper offers a novel perspective by analyzing the internal dynamics of diffusion models during denoising, revealing that phase information—encoding the structural and motion details—is progressively eroded as the number of denoising steps increases.

This phase erosion leads to hallucinations and unrealistic motion trajectories, undermining the physical plausibility of generated videos. Surprisingly, the authors find that a very limited number of steps (e.g., 2 steps) can produce more physically consistent motion than the traditional 50-step process, because early steps retain more accurate phase information. Building on this insight, they propose PhaseLock, a training-free framework that extracts a motion prior from the early inference trajectory using latent space differences—referred to as Latent Delta—and enforces this prior during full denoising.

The core idea is to constrain the evolution of the latent representation so that it preserves the phase dynamics crucial for realistic motion. This is achieved through a guidance mechanism that adjusts the latents with a decaying strength, balancing structural coherence and visual detail. Extensive experiments across multiple models, including CogVideoX and Wan 2.1, demonstrate that PhaseLock improves physical consistency scores by an average of 6.2 points with minimal computational overhead. The approach reduces hallucinations, largely maintains visual fidelity, and lessens reliance on costly external guidance modules.
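One way such a guidance step could look is sketched below. This is not the authors' implementation: the squared-error objective on frame-to-frame differences, the linear decay, the step size, and all function names are assumptions, and the denoiser itself is omitted so that only the guidance term is shown.

```python
import numpy as np


def latent_delta(latent):
    """Frame-to-frame differences along the time axis, shape (T-1, ...)."""
    return np.diff(latent, axis=0)


def guidance_weight(step, num_steps, initial_weight=1.0):
    """Hypothetical schedule: linear decay from initial_weight to 0."""
    return initial_weight * (1.0 - step / max(num_steps - 1, 1))


def guide(latent, prior_delta, weight, step_size=0.1):
    """One gradient step on 0.5 * ||latent_delta(latent) - prior_delta||^2."""
    residual = latent_delta(latent) - prior_delta
    grad = np.zeros_like(latent)
    grad[1:] += residual   # derivative with respect to the later frame
    grad[:-1] -= residual  # derivative with respect to the earlier frame
    return latent - weight * step_size * grad


def delta_mismatch(latent, prior_delta):
    """Squared distance between current and prior frame differences."""
    return float(np.sum((latent_delta(latent) - prior_delta) ** 2))


rng = np.random.default_rng(0)
few_step_latent = rng.standard_normal((8, 4, 4))  # stand-in for a 2-step result
prior_delta = latent_delta(few_step_latent)
latent = rng.standard_normal((8, 4, 4))           # stand-in for the full-run latent

mismatch_before = delta_mismatch(latent, prior_delta)
num_steps = 50
for step in range(num_steps):
    # A real pipeline would apply the model's denoising update here.
    latent = guide(latent, prior_delta, guidance_weight(step, num_steps))
mismatch_after = delta_mismatch(latent, prior_delta)
```

With a decaying weight, the prior dominates early steps, where global structure forms, and fades later so that the model remains free to add fine detail.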

This work offers a new paradigm for integrating physical priors into diffusion models without additional training or external modules. Its simplicity, efficiency, and broad applicability make it a promising step toward more trustworthy and physically plausible AI-generated videos. Future directions include adaptive spectral analysis, multi-scale phase preservation, and extension to multi-modal content, aiming to further enhance the realism and reliability of generative models in complex dynamic scenarios.

Deep Dive

Abstract

Image-to-Video diffusion models leverage input images to generate visually stunning content, yet frequently produce motion that violates physical laws. We reveal a surprising finding: a 2-step generation often exhibits better physical consistency than a 50-step output from the same model. Through spectral analysis, we trace this to phase erosion during denoising; the phase degrades significantly (dropping by $\approx 18\%$ from step 2 to step 50), whereas the magnitude remains relatively stable. Building on this insight, we propose PhaseLock, a training-free framework that preserves the valid motion priors from few-step inference throughout the denoising trajectory. Rather than relying on full-step inference for physical consistency, PhaseLock extracts a motion prior from just 2 steps and enforces it onto high-fidelity generation via Latent Delta Guidance. Our approach effectively mitigates phase degradation, improving physical consistency by an average of 6.2 points across diverse models while largely maintaining visual fidelity, with negligible overhead ($1.06\times$ time, $1.02\times$ memory) and reduced reliance on expensive external guidance methods ($\sim5\times$ time).
