Flex4DHuman: Flexible Multi-view Video Diffusion for 4D Human Reconstruction

TL;DR

Flex4DHuman uses relative camera-pose encoding within a diffusion framework to synthesize synchronized multi-view videos from monocular or sparse inputs. It needs no explicit geometry priors and outperforms prior methods on DNA-Rendering and ActorsHQ.

cs.CV Advanced 2026-06-12
Jen-Hao Cheng Yipeng Wang Hao Zhang Gengshan Yang Jenq-Neng Hwang
multi-view synthesis diffusion models human reconstruction camera pose encoding 4D dynamic modeling

Key Findings

Methodology

This paper introduces Flex4DHuman, a multi-view video diffusion model built on Wan 2.1’s 1.3B text-to-video architecture. Its core component is a five-axis positional encoding that extends the backbone's spatio-temporal RoPE (height, width, and time) with a view index and continuous SE(3) relative camera geometry, all applied inside self-attention. Unlike prior methods such as Diffuman4D, which rely on skeletons or depth maps, the model conditions generation solely on relative camera poses and needs no explicit geometric priors. Training follows a three-stage curriculum: first pose following from a single reference view, then flexible reference-to-target view extrapolation, and finally long-term temporal rollout using teacher-forced (clean) history tokens. Multi-view captions are added during training to enable text-driven control at inference. At inference, the model extrapolates across views and time via chunked rollout, producing dense synchronized multi-view videos that feed directly into downstream 4D Gaussian splatting reconstruction, so dynamic 3D assets can be created from monocular videos.
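The chunked rollout mentioned above can be pictured with a minimal sketch. Everything here is an assumption made for illustration: frames are plain integers rather than latent tensors, `toy_generator` is a hypothetical stand-in for the diffusion sampler, and the chunk and history sizes are arbitrary, not values from the paper.

```python
from typing import Callable, List

Frame = int  # stand-in for one latent frame; a real model would use tensors


def chunked_rollout(
    total_frames: int,
    chunk_size: int,
    history_size: int,
    generate_chunk: Callable[[List[Frame], int], List[Frame]],
) -> List[Frame]:
    """Generate frames chunk by chunk, conditioning each chunk on recent history."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    frames: List[Frame] = []
    while len(frames) < total_frames:
        # frames[-0:] would return the whole list, so guard the zero case.
        history = frames[-history_size:] if history_size > 0 else []
        n = min(chunk_size, total_frames - len(frames))
        new_frames = generate_chunk(history, n)
        if len(new_frames) != n:
            raise ValueError("generator returned the wrong number of frames")
        frames.extend(new_frames)
    return frames


def toy_generator(history: List[Frame], n: int) -> List[Frame]:
    """Hypothetical stand-in for the diffusion sampler: continues a counter."""
    start = history[-1] + 1 if history else 0
    return list(range(start, start + n))


video = chunked_rollout(total_frames=10, chunk_size=4, history_size=2,
                        generate_chunk=toy_generator)
print(video)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

The point of the sketch is only the control flow: each new chunk sees a short window of already generated frames, which is what lets a fixed-length model produce an arbitrarily long video.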

Key Results

  • On the DNA-Rendering dataset, Flex4DHuman achieves a PSNR of 25.44 dB, outperforming Diffuman4D-GT-skeleton by 1.21 dB and the monocular baselines Diffuman4D-mono-skeleton and MV-Performer by 9.32 dB and 8.00 dB, respectively. It also reports high SSIM (0.9516) and low LPIPS (0.0617), indicating strong image fidelity and perceptual quality without relying on explicit geometry.
  • On the ActorsHQ dataset, the model generalizes well: its generated multi-view videos, when reconstructed and re-rendered with FreeTimeGS, yield usable 4D Gaussian splats, supporting its applicability to real captured footage. The model also generalizes to animal categories after mixed human-animal training.
  • Ablation studies confirm that the five-axis positional encoding and multi-stage curriculum significantly improve multi-view synchronization and long-term temporal stability, especially under sparse or monocular inputs. The inclusion of multi-view captions further enhances controllability and consistency during inference.
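For readers less familiar with the headline metric, the following sketch shows the standard PSNR definition and the baseline scores implied by the gains quoted above. Two assumptions apply: pixel values are taken to lie in [0, 1], and the baseline numbers are derived here by simple subtraction of the reported gains, not quoted from the paper.

```python
import math


def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy check: a uniform error of 0.1 on a [0, 1] signal gives MSE = 0.01, about 20 dB.
toy_psnr = psnr([0.0, 0.5, 1.0], [0.1, 0.6, 0.9])

# Baseline PSNR values implied by the gains listed above (derived, not quoted).
flex4dhuman_psnr = 25.44
reported_gains = {
    "Diffuman4D-GT-skeleton": 1.21,
    "Diffuman4D-mono-skeleton": 9.32,
    "MV-Performer": 8.00,
}
implied_baselines = {
    name: round(flex4dhuman_psnr - gain, 2) for name, gain in reported_gains.items()
}
print(round(toy_psnr, 2), implied_baselines)
```

Because PSNR is logarithmic, a gain of roughly 3 dB corresponds to halving the mean squared error, so the 8 to 9 dB gaps over the monocular baselines represent a large reduction in pixel error.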

Significance

This work addresses a bottleneck in multi-view human and animal video synthesis, the dependence on explicit geometric priors, by proposing a geometry-agnostic framework built on relative camera pose encoding. Because it generates synchronized multi-view videos from monocular or sparse input, it can lower data acquisition costs and broaden the use of 4D content creation in AR/VR, gaming, and film. The approach is a step toward scalable, real-world dynamic 3D scene synthesis and more immersive, interactive virtual experiences.

Technical Contribution

The primary technical contribution is the integration of a five-axis positional encoding combining spatial, temporal, view, and SE(3) camera geometry information within a diffusion-based generative framework. This encoding allows the model to inherently understand relative camera transformations without explicit geometric inputs. The three-stage curriculum training progressively enhances the model’s ability to follow poses, extrapolate views, and perform long-term temporal rollouts. Additionally, the incorporation of multi-view captions and background drop augmentation improves controllability and robustness. The pipeline seamlessly combines multi-view video synthesis with downstream 4D Gaussian splatting, enabling efficient dynamic scene reconstruction from sparse inputs.
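To make the index-based part of the encoding concrete, here is a minimal multi-axis RoPE sketch. It covers only the four integer-indexed axes (time, height, width, view); the continuous SE(3) camera axis is not modeled here. The per-axis channel split, the adjacent-pair rotation convention, and the frequency base are assumptions chosen for illustration, not details taken from the paper or from Wan 2.1.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Standard 1D RoPE on an even-length vector: rotate each consecutive pair."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def multi_axis_rope(vec, positions, dims):
    """Split channels into one group per axis and apply 1D RoPE with that axis's index."""
    if sum(dims) != len(vec) or len(dims) != len(positions):
        raise ValueError("dims must partition vec and match positions")
    out, start = [], 0
    for pos, d in zip(positions, dims):
        out += rope_rotate(vec[start:start + d], pos)
        start += d
    return out


def attention_score(q, q_pos, k, k_pos, dims):
    """Dot product of a rotated query and a rotated key."""
    rq = multi_axis_rope(q, q_pos, dims)
    rk = multi_axis_rope(k, k_pos, dims)
    return sum(a * b for a, b in zip(rq, rk))


dims = [4, 4, 4, 4]  # hypothetical channels for (time, height, width, view)
q = [math.sin(i + 1) for i in range(16)]
k = [math.cos(2 * i) for i in range(16)]

# Positions are (time, height, width, view). Shifting the query and key by the
# same offset on every axis leaves the score unchanged.
score_a = attention_score(q, (3, 5, 7, 0), k, (1, 4, 9, 2), dims)
score_b = attention_score(q, (13, 7, 8, 3), k, (11, 6, 10, 5), dims)
print(abs(score_a - score_b) < 1e-9)  # True
```

The property shown, that scores depend only on index differences along each axis, is what lets a view index be added as another axis without changing the backbone's attention layers.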

Novelty

The paper proposes a geometry-agnostic multi-view video diffusion framework that relies solely on relative camera pose encoding, avoiding skeletons, depth maps, and rendered geometry. The five-axis positional encoding extends RoPE with view indices and continuous SE(3) transformations, giving attention direct access to camera geometry. Unlike prior methods such as Diffuman4D, which depend on explicit geometric inputs such as skeletons, this approach achieves comparable or superior quality with far fewer geometric assumptions, which makes 4D human and animal reconstruction more flexible and scalable.
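One way to see how continuous SE(3) geometry can enter attention as a relative quantity is the sketch below. It follows the general idea of transforming queries and keys by per-camera matrices so that their dot product depends only on the relative pose, in the spirit of the "Cameras as Relative Positional Encoding" entry in the reference list. It is not the paper's exact formulation: the 4-channel query and key, the world-to-camera convention, and the z-axis-only rotation helper are all simplifying assumptions.

```python
import math


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)] for i in range(4)]


def matvec(a, v):
    return [sum(a[i][k] * v[k] for k in range(4)) for i in range(4)]


def transpose(a):
    return [list(row) for row in zip(*a)]


def se3(yaw, t):
    """4x4 rigid transform: rotation about z by `yaw`, then translation `t`."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [[c, -s, 0.0, t[0]], [s, c, 0.0, t[1]], [0.0, 0.0, 1.0, t[2]], [0.0, 0.0, 0.0, 1.0]]


def se3_inverse(m):
    """Inverse of a rigid transform: rotation R^T, translation -R^T t."""
    r = [[m[j][i] for j in range(3)] for i in range(3)]
    t = [-sum(r[i][k] * m[k][3] for k in range(3)) for i in range(3)]
    return [r[0] + [t[0]], r[1] + [t[1]], r[2] + [t[2]], [0.0, 0.0, 0.0, 1.0]]


def relative_pose_score(q, k, cam_q, cam_k):
    """Computes q^T (cam_q @ cam_k^-1) k by transforming q and k separately."""
    q2 = matvec(transpose(cam_q), q)    # cam_q^T q
    k2 = matvec(se3_inverse(cam_k), k)  # cam_k^-1 k
    return sum(a * b for a, b in zip(q2, k2))


q = [0.4, -1.0, 0.7, 0.2]
k = [1.3, 0.5, -0.6, 0.9]
cam_a = se3(0.3, [1.0, 0.0, 2.0])       # world-to-camera for the query's view
cam_b = se3(-1.1, [0.0, 0.5, 1.0])      # world-to-camera for the key's view
new_world = se3(0.7, [3.0, -2.0, 0.5])  # arbitrary change of world frame

score_original = relative_pose_score(q, k, cam_a, cam_b)
score_reframed = relative_pose_score(q, k, matmul(cam_a, new_world), matmul(cam_b, new_world))
print(abs(score_original - score_reframed) < 1e-9)  # True
```

The score is unchanged when both cameras are re-expressed in a different world frame, because only the product `cam_q @ cam_k^-1` appears. That is the sense in which the conditioning uses relative, not absolute, camera poses.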

Limitations

  • Despite strong performance, the model may struggle with highly complex or fast-moving scenes where the training data lacks sufficient diversity, leading to potential artifacts or temporal inconsistencies.
  • The training process requires substantial computational resources (e.g., 32×H100 GPUs), limiting accessibility for smaller research groups or real-time applications.
  • While generalization to animals is demonstrated, the model’s performance on highly non-rigid or non-human objects remains to be fully validated, and further adaptation may be necessary for broader categories.

Future Work

Future directions include optimizing the model for real-time inference, reducing computational costs, and extending the framework to handle more diverse object categories and complex scenes. Incorporating additional modalities such as audio or textual descriptions could further enhance controllability. Exploring unsupervised or weakly supervised training strategies may improve scalability and robustness, enabling broader deployment in industry applications like virtual production, telepresence, and interactive entertainment.

AI Executive Summary

The rapid development of virtual content creation has highlighted the need for scalable, high-fidelity 4D reconstruction of dynamic scenes, particularly humans and animals. Traditional approaches rely heavily on explicit geometric priors such as skeleton models, depth maps, and normal estimations, which impose significant constraints on data acquisition and generalization. These methods often require calibrated multi-camera rigs and scene-specific optimization, limiting their practicality in uncontrolled environments.

Recent advances in generative modeling, especially diffusion-based approaches, have demonstrated promising capabilities in synthesizing novel views and dynamic scenes. However, most existing methods depend on explicit geometric inputs, restricting their flexibility and scalability. To address these limitations, this paper introduces Flex4DHuman, a novel framework that leverages relative camera-pose encoding within a diffusion model to generate synchronized multi-view videos from monocular or sparse multi-view inputs.

Built upon Wan 2.1’s 1.3B text-to-video architecture, Flex4DHuman employs a five-axis positional encoding that integrates spatial, temporal, view, and continuous SE(3) camera geometry information. This encoding is incorporated into the self-attention mechanism, enabling the model to understand relative camera transformations without explicit geometry priors. The training employs a carefully designed three-stage curriculum: starting with pose following in a single-view setting, progressing to dynamic reference-to-target view extrapolation, and finally enabling long-term temporal rollout with teacher-forced history tokens. Multi-view captions are added during training to facilitate text-driven control, enhancing the model’s flexibility.

Experimental results on datasets such as DNA-Rendering and ActorsHQ demonstrate that Flex4DHuman surpasses previous state-of-the-art methods in both quantitative metrics (PSNR, SSIM, LPIPS) and qualitative multi-view consistency. Notably, the model generalizes well to animal categories after mixed training, indicating broad applicability. The generated multi-view videos can be directly used for downstream 4D Gaussian splatting, enabling fast and high-quality dynamic 3D scene reconstruction from minimal inputs.

This work significantly advances the field of scalable 4D content creation, reducing reliance on complex geometric annotations and calibration. Its ability to produce high-fidelity, synchronized multi-view videos from sparse data opens new possibilities for virtual production, AR/VR, gaming, and film industries. Despite current limitations related to scene complexity and computational demands, ongoing research aims to optimize efficiency, extend to more diverse object categories, and incorporate additional modalities for richer scene understanding. Overall, Flex4DHuman marks a pivotal step toward democratizing high-quality 4D scene synthesis, making immersive virtual experiences more accessible and realistic.

Deep Dive

Abstract

We present Flex4DHuman, a multi-view video diffusion model that transforms a monocular or sparse multi-view video of a dynamic subject into synchronized dense multi-view videos using only relative camera-pose conditioning. Unlike prior human-centric methods that rely on skeletons, depth maps, normals, or rendered target-view geometry, Flex4DHuman requires no explicit geometry priors and instead conditions generation through relative camera-pose positional encoding. The generated videos can be directly ingested by downstream reconstruction pipelines to create dynamic 4D Gaussian splats. Built on the Wan 2.1 1.3B text-to-video model, Flex4DHuman preserves the backbone architecture and encodes camera and view information through a five-axis positional encoding that extends spatio-temporal RoPE with view indices and continuous SE(3) relative camera geometry. A three-stage curriculum progressively trains the model for pose following, flexible reference-to-target view generation, and temporal rollout. To support temporal rollout, we train with clean historical target-view tokens. We also add multi-view captions to enable test-time text control. Combined with an off-the-shelf 4D Gaussian Splatting stage, our framework lifts monocular static-camera videos into dynamic 4D Gaussian splats. Experiments on DNA-Rendering and ActorsHQ show that Flex4DHuman surpasses prior state-of-the-art methods, while the same formulation generalizes to animal categories after mixed human-animal training. These capabilities make Flex4DHuman a practical step toward scalable 4D content creation from casual monocular videos for simulation, gaming, AR/VR, and video re-shooting.
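The statement that training uses clean historical target-view tokens can be illustrated with a small sketch of how one training sample might be assembled. This is a simplification under stated assumptions: frames are scalars rather than latent tokens, additive Gaussian noise stands in for the backbone's actual noising process, and the function name and loss mask are hypothetical.

```python
import random
from typing import List, Tuple


def make_rollout_training_sample(
    clean_frames: List[float],
    history_len: int,
    noise_scale: float,
    rng: random.Random,
) -> Tuple[List[float], List[int]]:
    """Keep the first `history_len` frames clean and add noise to the rest.

    Returns the model input and a mask that is 1 where the denoising loss
    applies (the noised target frames) and 0 on the clean history.
    """
    if not 0 <= history_len <= len(clean_frames):
        raise ValueError("history_len out of range")
    history = list(clean_frames[:history_len])  # teacher forcing: ground truth
    target = clean_frames[history_len:]
    noisy_target = [x + noise_scale * rng.gauss(0.0, 1.0) for x in target]
    loss_mask = [0] * len(history) + [1] * len(target)
    return history + noisy_target, loss_mask


rng = random.Random(0)
clean = [0.1 * i for i in range(8)]
model_input, loss_mask = make_rollout_training_sample(clean, 3, 0.5, rng)
print(loss_mask)  # [0, 0, 0, 1, 1, 1, 1, 1]
```

At inference the clean history slot is filled with previously generated frames instead of ground truth, which is what connects this training setup to chunk-by-chunk temporal rollout.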

cs.CV cs.GR

References (20)

FreeTimeGS: Free Gaussian Primitives at Anytime Anywhere for Dynamic Scene Reconstruction

Yifan Wang, Peishan Yang, Zhen Xu et al.

2025 45 citations ⭐ Influential

Diffuman4D: 4D Consistent Human View Synthesis From Sparse-View Videos With Spatio-Temporal Diffusion Models

Yudong Jin, Sida Peng, Xuan Wang et al.

2025 12 citations ⭐ Influential

MV-Performer: Taming Video Diffusion Model for Faithful and Synchronized Multi-view Performer Synthesis

Yihao Zhi, Chenghong Li, Hongjie Liao et al.

2025 5 citations ⭐ Influential

Cameras as Relative Positional Encoding

Ruilong Li, Brent Yi, Junchen Liu et al.

2025 63 citations ⭐ Influential

Wan: Open and Advanced Large-Scale Video Generative Models

Ang Wang, Baole Ai, Bin Wen et al.

2025 1855 citations ⭐ Influential

Artemis: Articulated Neural Pets with Appearance and Motion Synthesis

Haimin Luo, Teng Xu, Yuheng Jiang et al.

2022 41 citations ⭐ Influential

DNA-Rendering: A Diverse Neural Actor Repository for High-Fidelity Human-centric Rendering

W. Cheng, Ruixiang Chen, Wanqi Yin et al.

2023 113 citations ⭐ Influential

ViewDiff: 3D-Consistent Image Generation with Text-to-Image Models

Lukas Höllein, Aljaž Božič, N. Müller et al.

2024 80 citations

GPS-Gaussian: Generalizable Pixel-Wise 3D Gaussian Splatting for Real-Time Human Novel View Synthesis

Shunyuan Zheng, Boyao Zhou, Ruizhi Shao et al.

2023 199 citations

Gen3C: 3D-Informed World-Consistent Video Generation with Precise Camera Control

Xuanchi Ren, Tianchang Shen, Jiahui Huang et al.

2025 258 citations

MVDream: Multi-view Diffusion for 3D Generation

Yichun Shi, Peng Wang, Jianglong Ye et al.

2023 1007 citations

HunyuanWorld 1.0: Generating Immersive, Explorable, and Interactive 3D Worlds from Words or Pixels

HunyuanWorld Team, Zhenwei Wang, Yuhao Liu et al.

2025 77 citations

Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

Xun Huang, Zhengqi Li, Guande He et al.

2025 377 citations

Neural Body: Implicit Neural Representations with Structured Latent Codes for Novel View Synthesis of Dynamic Humans

Sida Peng, Yuanqing Zhang, Yinghao Xu et al.

2020 858 citations

TrajectoryCrafter: Redirecting Camera Trajectory for Monocular Videos via Diffusion Models

Mark Yu, Wenbo Hu, Jinbo Xing et al.

2025 80 citations

ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis

Wangbo Yu, Jinbo Xing, Li Yuan et al.

2024 346 citations

D-NeRF: Neural Radiance Fields for Dynamic Scenes

Albert Pumarola, Enric Corona, Gerard Pons-Moll et al.

2020 1981 citations

Animatable Gaussians: Learning Pose-Dependent Gaussian Maps for High-Fidelity Human Avatar Modeling

Zhe Li, Zerong Zheng, Lizhen Wang et al.

2024 253 citations

SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints

Jianhong Bai, Menghan Xia, Xintao Wang et al.

2024 81 citations

BulletTime: Decoupled Control of Time and Camera Pose for Video Generation

Yiming Wang, Qihang Zhang, Shengqu Cai et al.

2025 5 citations