Flex4DHuman: Flexible Multi-view Video Diffusion for 4D Human Reconstruction
Flex4DHuman employs relative camera-pose encoding within a diffusion framework to synthesize synchronized multi-view videos from monocular or sparse inputs, surpassing prior methods without explicit geometry priors.
Key Findings
Methodology
This paper introduces Flex4DHuman, a multi-view video diffusion model built on Wan 2.1’s 1.3B text-to-video architecture. Its core component is a five-axis positional encoding that injects spatial coordinates, temporal indices, view indices, and continuous SE(3) camera geometry into the self-attention mechanism. Unlike prior methods such as Diffuman4D, which rely on skeletons or depth maps, the model conditions generation solely on relative camera poses and needs no explicit geometric priors. Training follows a three-stage curriculum: pose following with a single reference view, then dynamic reference-to-target view extrapolation, and finally long-term temporal rollout using teacher-forced history tokens. Multi-view captions are added during training to enable text-driven control at inference. At inference, the model extrapolates across views and time via chunked rollout, producing dense synchronized multi-view videos that feed directly into downstream 4D Gaussian splatting reconstruction, so dynamic 3D assets can be created from monocular videos.
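To make the idea of relative camera-pose conditioning concrete, the following is a minimal NumPy sketch of how a relative SE(3) transform between a reference and a target camera can be computed. It is an illustration only: the camera-to-world convention and the helper names `make_pose` and `relative_pose` are assumptions, not details taken from the paper.

```python
import numpy as np

def make_pose(yaw_deg: float, translation) -> np.ndarray:
    """Hypothetical helper: 4x4 camera-to-world pose with a rotation about the y axis."""
    a = np.deg2rad(yaw_deg)
    pose = np.eye(4)
    pose[:3, :3] = [[np.cos(a), 0.0, np.sin(a)],
                    [0.0, 1.0, 0.0],
                    [-np.sin(a), 0.0, np.cos(a)]]
    pose[:3, 3] = translation
    return pose

def relative_pose(c2w_ref: np.ndarray, c2w_tgt: np.ndarray) -> np.ndarray:
    """Transform mapping target-camera coordinates into reference-camera coordinates."""
    R, t = c2w_ref[:3, :3], c2w_ref[:3, 3]
    # Closed-form SE(3) inverse: [R | t]^-1 = [R^T | -R^T t]
    w2c_ref = np.eye(4)
    w2c_ref[:3, :3] = R.T
    w2c_ref[:3, 3] = -R.T @ t
    return w2c_ref @ c2w_tgt

ref = make_pose(0.0, [0.0, 0.0, 2.0])
tgt = make_pose(45.0, [1.0, 0.0, 1.5])
rel = relative_pose(ref, tgt)

# Moving the whole rig by a global rigid transform leaves the relative pose unchanged.
G = make_pose(37.0, [1.0, 2.0, 3.0])
rel_moved = relative_pose(G @ ref, G @ tgt)
```

Because the relative transform is invariant to the choice of world frame, conditioning on it avoids tying the model to any particular capture rig's coordinate system.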
Key Results
- On the DNA-Rendering dataset, Flex4DHuman achieves a PSNR of 25.44 dB, outperforming Diffuman4D-GT-skeleton by 1.21 dB and the monocular baselines Diffuman4D-mono-skeleton and MV-Performer by 9.32 dB and 8.00 dB, respectively. It also reports a high SSIM (0.9516) and a low LPIPS (0.0617), indicating strong multi-view consistency and temporal coherence without explicit geometry.
- On the ActorsHQ dataset, the model generalizes robustly: its multi-view videos, when passed to FreeTimeGS, yield accurate 4D Gaussian splats, supporting its use in real-world scenarios. After mixed human-animal training, the model also generalizes to animal categories.
- Ablation studies confirm that the five-axis positional encoding and multi-stage curriculum significantly improve multi-view synchronization and long-term temporal stability, especially under sparse or monocular inputs. The inclusion of multi-view captions further enhances controllability and consistency during inference.
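For readers less familiar with the PSNR figures quoted above, here is a short sketch of the standard definition, assuming images scaled to [0, 1]. The function names are illustrative, and the paper's exact evaluation protocol (masking, resolution, averaging) is not described in this summary.

```python
import numpy as np

def psnr(pred: np.ndarray, target: np.ndarray, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two images of the same shape."""
    diff = pred.astype(np.float64) - target.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))

def mse_ratio_from_gain(gain_db: float) -> float:
    """MSE ratio implied by a PSNR gain of gain_db on a single image pair."""
    return 10.0 ** (-gain_db / 10.0)

target = np.zeros((4, 4, 3))
pred = np.full((4, 4, 3), 0.1)
score = psnr(pred, target)  # MSE is 0.01, so this is 20 dB
```

Under this definition, a 1.21 dB gain on a single image pair corresponds to an MSE ratio of roughly 0.76; averaged dataset-level PSNR does not translate quite so directly.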
Significance
This work addresses a critical bottleneck in multi-view human and animal video synthesis, namely the dependence on explicit geometric priors, by proposing a geometry-agnostic framework built on relative camera-pose encoding. Because it generates high-quality, synchronized multi-view videos from minimal input, it reduces data acquisition costs and broadens the use of 4D content creation in AR/VR, gaming, and film. The approach is a step toward scalable, real-world deployment of dynamic 3D scene synthesis.
Technical Contribution
The primary technical contribution is the integration of a five-axis positional encoding combining spatial, temporal, view, and SE(3) camera geometry information within a diffusion-based generative framework. This encoding allows the model to inherently understand relative camera transformations without explicit geometric inputs. The three-stage curriculum training progressively enhances the model’s ability to follow poses, extrapolate views, and perform long-term temporal rollouts. Additionally, the incorporation of multi-view captions and background drop augmentation improves controllability and robustness. The pipeline seamlessly combines multi-view video synthesis with downstream 4D Gaussian splatting, enabling efficient dynamic scene reconstruction from sparse inputs.
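As a rough illustration of how rotary position encoding (RoPE) can carry several index axes at once, the sketch below splits the channel dimension evenly across four discrete axes (frame, height, width, view) and rotates each chunk by its own index. This is an assumed, simplified layout: the continuous SE(3) camera axis is omitted because its exact form is not given in this summary, and the channel split and function names are hypothetical.

```python
import numpy as np

def rope_1d(x: np.ndarray, pos: float, base: float = 10000.0) -> np.ndarray:
    """Rotate consecutive channel pairs of x by angles proportional to pos."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)
    ang = pos * freqs
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * np.cos(ang) - x2 * np.sin(ang)
    out[..., 1::2] = x1 * np.sin(ang) + x2 * np.cos(ang)
    return out

def rope_multi_axis(x: np.ndarray, positions) -> np.ndarray:
    """Split channels evenly across axes and apply 1D RoPE to each chunk."""
    chunks = np.split(x, len(positions), axis=-1)
    rotated = [rope_1d(c, p) for c, p in zip(chunks, positions)]
    return np.concatenate(rotated, axis=-1)

rng = np.random.default_rng(0)
q, k = rng.normal(size=32), rng.normal(size=32)

# (frame, height, width, view) indices for a query token and a key token
pos_q, pos_k = (3, 5, 7, 1), (1, 2, 9, 0)
score = rope_multi_axis(q, pos_q) @ rope_multi_axis(k, pos_k)

# Shift both tokens by the same offset on every axis
shift = (10, 4, 2, 3)
pos_q2 = tuple(a + s for a, s in zip(pos_q, shift))
pos_k2 = tuple(a + s for a, s in zip(pos_k, shift))
score_shifted = rope_multi_axis(q, pos_q2) @ rope_multi_axis(k, pos_k2)
```

The two attention scores agree, which shows the property the encoding relies on: attention depends only on index differences along each axis, not on absolute positions.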
Novelty
This research is the first to propose a geometry-agnostic multi-view video diffusion framework that relies solely on relative camera pose encoding, avoiding the need for skeletons, depth maps, or rendered geometry. The five-axis positional encoding, which extends RoPE with view indices and continuous SE(3) transformations, provides a new mechanism for integrating camera geometry into attention. Unlike prior methods such as Diffuman4D, which depend on explicit human priors such as skeletons, this approach achieves comparable or superior quality with significantly fewer geometric assumptions, making 4D human and animal reconstruction more flexible and scalable.
Limitations
- Despite strong performance, the model may struggle with highly complex or fast-moving scenes where the training data lacks sufficient diversity, leading to potential artifacts or temporal inconsistencies.
- The training process requires substantial computational resources (e.g., 32×H100 GPUs), limiting accessibility for smaller research groups or real-time applications.
- While generalization to animals is demonstrated, the model’s performance on highly non-rigid or non-human objects remains to be fully validated, and further adaptation may be necessary for broader categories.
Future Work
Future directions include optimizing the model for real-time inference, reducing computational costs, and extending the framework to handle more diverse object categories and complex scenes. Incorporating additional modalities such as audio or textual descriptions could further enhance controllability. Exploring unsupervised or weakly supervised training strategies may improve scalability and robustness, enabling broader deployment in industry applications like virtual production, telepresence, and interactive entertainment.
AI Executive Summary
The rapid development of virtual content creation has highlighted the need for scalable, high-fidelity 4D reconstruction of dynamic scenes, particularly humans and animals. Traditional approaches rely heavily on explicit geometric priors such as skeleton models, depth maps, and normal estimations, which impose significant constraints on data acquisition and generalization. These methods often require calibrated multi-camera rigs and scene-specific optimization, limiting their practicality in uncontrolled environments.
Recent advances in generative modeling, especially diffusion-based approaches, have demonstrated promising capabilities in synthesizing novel views and dynamic scenes. However, most existing methods depend on explicit geometric inputs, restricting their flexibility and scalability. To address these limitations, this paper introduces Flex4DHuman, a novel framework that leverages relative camera-pose encoding within a diffusion model to generate synchronized multi-view videos from monocular or sparse multi-view inputs.
Built upon Wan 2.1’s 1.3B text-to-video architecture, Flex4DHuman employs a five-axis positional encoding that integrates spatial, temporal, view, and continuous SE(3) camera geometry information. This encoding is incorporated into the self-attention mechanism, enabling the model to understand relative camera transformations without explicit geometry priors. The training employs a carefully designed three-stage curriculum: starting with pose following in a single-view setting, progressing to dynamic reference-to-target view extrapolation, and finally enabling long-term temporal rollout with teacher-forced history tokens. Multi-view captions are added during training to facilitate text-driven control, enhancing the model’s flexibility.
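The chunked temporal rollout mentioned above can be pictured as a simple scheduling loop: each new chunk of frames is generated conditioned on the last few frames already available. The sketch below is a hedged illustration; `plan_chunks`, `rollout`, and the `generate_chunk` callable are hypothetical stand-ins for the diffusion sampler, and the chunk and history lengths are arbitrary. During training, the source describes the history as clean (teacher-forced) tokens; in this sketch the history is taken from previously generated output, as an inference-time rollout would do.

```python
from typing import Callable, List, Tuple

def plan_chunks(total_frames: int, chunk_len: int, history_len: int) -> List[Tuple[range, range]]:
    """Return (history_range, target_range) pairs covering all frames in order."""
    plans = []
    start = 0
    while start < total_frames:
        end = min(start + chunk_len, total_frames)
        hist_start = max(0, start - history_len)
        plans.append((range(hist_start, start), range(start, end)))
        start = end
    return plans

def rollout(total_frames: int, chunk_len: int, history_len: int,
            generate_chunk: Callable[[list, int], list]) -> list:
    """Generate frames chunk by chunk, feeding recent frames back in as history."""
    frames: list = []
    for hist, tgt in plan_chunks(total_frames, chunk_len, history_len):
        history = [frames[i] for i in hist]
        frames.extend(generate_chunk(history, len(tgt)))
    return frames

def dummy_generator(history: list, n: int) -> list:
    """Stand-in for the sampler: continues counting from the last history frame."""
    last = history[-1] if history else 0
    return [last + i + 1 for i in range(n)]

plans = [(list(h), list(t)) for h, t in plan_chunks(10, 4, 2)]
frames = rollout(10, 4, 2, dummy_generator)
```

With 10 frames, a chunk length of 4, and a history of 2, the first chunk has no history and each later chunk sees the two most recent frames, so the dummy generator produces an unbroken sequence.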
Experimental results on datasets such as DNA-Rendering and ActorsHQ demonstrate that Flex4DHuman surpasses previous state-of-the-art methods in both quantitative metrics (PSNR, SSIM, LPIPS) and qualitative multi-view consistency. Notably, the model generalizes well to animal categories after mixed training, indicating broad applicability. The generated multi-view videos can be directly used for downstream 4D Gaussian splatting, enabling fast and high-quality dynamic 3D scene reconstruction from minimal inputs.
This work significantly advances the field of scalable 4D content creation, reducing reliance on complex geometric annotations and calibration. Its ability to produce high-fidelity, synchronized multi-view videos from sparse data opens new possibilities for virtual production, AR/VR, gaming, and film industries. Despite current limitations related to scene complexity and computational demands, ongoing research aims to optimize efficiency, extend to more diverse object categories, and incorporate additional modalities for richer scene understanding. Overall, Flex4DHuman marks a pivotal step toward democratizing high-quality 4D scene synthesis, making immersive virtual experiences more accessible and realistic.
Deep Dive
Abstract
We present Flex4DHuman, a multi-view video diffusion model that transforms a monocular or sparse multi-view video of a dynamic subject into synchronized dense multi-view videos using only relative camera-pose conditioning. Unlike prior human-centric methods that rely on skeletons, depth maps, normals, or rendered target-view geometry, Flex4DHuman requires no explicit geometry priors and instead conditions generation through relative camera-pose positional encoding. The generated videos can be directly ingested by downstream reconstruction pipelines to create dynamic 4D Gaussian splats. Built on the Wan 2.1 1.3B text-to-video model, Flex4DHuman preserves the backbone architecture and encodes camera and view information through a five-axis positional encoding that extends spatio-temporal RoPE with view indices and continuous SE(3) relative camera geometry. A three-stage curriculum progressively trains the model for pose following, flexible reference-to-target view generation, and temporal rollout. To support temporal rollout, we train with clean historical target-view tokens. We also add multi-view captions to enable test-time text control. Combined with an off-the-shelf 4D Gaussian Splatting stage, our framework lifts monocular static-camera videos into dynamic 4D Gaussian splats. Experiments on DNA-Rendering and ActorsHQ show that Flex4DHuman surpasses prior state-of-the-art methods, while the same formulation generalizes to animal categories after mixed human-animal training. These capabilities make Flex4DHuman a practical step toward scalable 4D content creation from casual monocular videos for simulation, gaming, AR/VR, and video re-shooting.
References (20)
FreeTimeGS: Free Gaussian Primitives at Anytime Anywhere for Dynamic Scene Reconstruction
Yifan Wang, Peishan Yang, Zhen Xu et al.
Diffuman4D: 4D Consistent Human View Synthesis From Sparse-View Videos With Spatio-Temporal Diffusion Models
Yudong Jin, Sida Peng, Xuan Wang et al.
MV-Performer: Taming Video Diffusion Model for Faithful and Synchronized Multi-view Performer Synthesis
Yihao Zhi, Chenghong Li, Hongjie Liao et al.
Cameras as Relative Positional Encoding
Ruilong Li, Brent Yi, Junchen Liu et al.
Wan: Open and Advanced Large-Scale Video Generative Models
Ang Wang, Baole Ai, Bin Wen et al.
Artemis: Articulated Neural Pets with Appearance and Motion Synthesis
Wei Yang, Lan Xu et al.
DNA-Rendering: A Diverse Neural Actor Repository for High-Fidelity Human-centric Rendering
Wei Cheng, Ruixiang Chen, Wanqi Yin et al.
ViewDiff: 3D-Consistent Image Generation with Text-to-Image Models
Lukas Höllein, Aljaž Božič, Norman Müller et al.
GPS-Gaussian: Generalizable Pixel-Wise 3D Gaussian Splatting for Real-Time Human Novel View Synthesis
Shunyuan Zheng, Boyao Zhou, Ruizhi Shao et al.
Gen3C: 3D-Informed World-Consistent Video Generation with Precise Camera Control
Xuanchi Ren, Tianchang Shen, Jiahui Huang et al.
MVDream: Multi-view Diffusion for 3D Generation
Yichun Shi, Peng Wang, Jianglong Ye et al.
HunyuanWorld 1.0: Generating Immersive, Explorable, and Interactive 3D Worlds from Words or Pixels
HunyuanWorld Team, Zhenwei Wang, Yuhao Liu et al.
Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion
Xun Huang, Zhengqi Li, Guande He et al.
Neural Body: Implicit Neural Representations with Structured Latent Codes for Novel View Synthesis of Dynamic Humans
Sida Peng, Yuanqing Zhang, Yinghao Xu et al.
TrajectoryCrafter: Redirecting Camera Trajectory for Monocular Videos via Diffusion Models
Mark Yu, Wenbo Hu, Jinbo Xing et al.
ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis
Wangbo Yu, Jinbo Xing, Li Yuan et al.
D-NeRF: Neural Radiance Fields for Dynamic Scenes
Albert Pumarola, Enric Corona, Gerard Pons-Moll et al.
Animatable Gaussians: Learning Pose-Dependent Gaussian Maps for High-Fidelity Human Avatar Modeling
Zhe Li, Zerong Zheng, Lizhen Wang et al.
SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints
Jianhong Bai, Menghan Xia, Xintao Wang et al.
BulletTime: Decoupled Control of Time and Camera Pose for Video Generation
Yiming Wang, Qihang Zhang, Shengqu Cai et al.