Learning Global Motion with Compact Gaussians for Feed-Forward 4D Reconstruction
C4G introduces timestamp-conditioned learnable Gaussian tokens with transformer decoding, enabling efficient 4D scene reconstruction from monocular video without scene-specific optimization, reducing Gaussian count by orders of magnitude.
Key Findings
Methodology
The proposed C4G framework employs a set of learnable Gaussian query tokens conditioned on timestamps, decoded via a transformer decoder with full self-attention. Visual features are extracted from multi-frame videos using a pretrained VGGT backbone, then embedded with temporal positional encodings. These features and timestamp-conditioned queries interact within the transformer, producing a compact set of Gaussian parameters representing the scene at arbitrary times. To enhance rendering quality, a video diffusion model-based refinement module is integrated, which refines the generated images conditioned on input views. Additionally, a feature lifting mechanism maps 2D foundation model features into a 4D feature field, supporting point tracking and scene understanding. The entire pipeline is trained end-to-end, avoiding scene-specific optimization, and achieves high-fidelity dynamic scene reconstruction with significantly fewer Gaussians.
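The decoding step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the sinusoidal timestamp embedding, the identity attention projections, the single attention layer, and every name (`time_embedding`, `decode_gaussian_centers`, `head`) are assumptions made for illustration, and random vectors stand in for the backbone features. It only shows how adding a timestamp embedding to a fixed set of query tokens lets the same tokens decode different Gaussian centers at different times.

```python
import math
import random

def time_embedding(t, dim):
    """Sinusoidal embedding of a scalar timestamp (assumed form, not from the paper)."""
    emb = []
    for i in range(dim // 2):
        freq = 1.0 / (10000 ** (2 * i / dim))
        emb += [math.sin(t * freq), math.cos(t * freq)]
    return emb

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head full self-attention with identity projections (illustrative only)."""
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, tokens)) for d in range(dim)])
    return out

def decode_gaussian_centers(queries, features, t, head):
    """Condition the queries on t, attend jointly with the features, map to 3D centers."""
    dim = len(queries[0])
    t_emb = time_embedding(t, dim)
    conditioned = [[q + e for q, e in zip(query, t_emb)] for query in queries]
    mixed = self_attention(conditioned + features)  # queries and features in one sequence
    query_out = mixed[:len(queries)]                # keep only the query positions
    # Hypothetical linear head: dim -> (x, y, z)
    return [[sum(w * x for w, x in zip(row, tok)) for row in head] for tok in query_out]

random.seed(0)
DIM, NUM_QUERIES, NUM_FEATURES = 8, 4, 16
queries = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_QUERIES)]
features = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_FEATURES)]
head = [[random.gauss(0, 0.1) for _ in range(DIM)] for _ in range(3)]

centers_t0 = decode_gaussian_centers(queries, features, 0.0, head)
centers_t1 = decode_gaussian_centers(queries, features, 1.0, head)
print(len(centers_t0), "Gaussians; centers moved:", centers_t0 != centers_t1)
```

The number of Gaussians is fixed by the number of query tokens rather than by the number of input pixels, which is the source of the compactness the summary describes.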
Key Results
- On datasets such as DynaCheck, TUM-Dynamics, and NVIDIA, C4G is reported to outperform existing methods such as NeoVerse and MoSca, with PSNR scores of 15.64 dB and 20.59 dB at short temporal gaps and a Gaussian count in the thousands, far fewer than the hundreds of thousands used by pixel-wise methods. Novel view synthesis quality holds up at large temporal gaps (Δt=8), where PSNR drops only slightly, to 19.23 dB.
- Quantitative evaluations on point tracking and 4D feature fields demonstrate that C4G captures scene-wide motion trajectories more accurately than pixel-based Gaussian models, validating its understanding of global scene dynamics.
- Ablation studies confirm that the time-conditioned query design and attention mechanisms are crucial for reducing Gaussian redundancy and improving generalization across diverse scenes.
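To put the reported numbers in perspective, the following sketch converts a PSNR difference into a ratio of mean squared errors and compares primitive counts. It rests on assumptions: 19.23 dB is taken to be the large-gap counterpart of 20.59 dB, and the frame count, resolution, and compact Gaussian count are hypothetical values chosen only to match the "thousands versus hundreds of thousands" description.

```python
import math

def psnr_db(mse, max_value=1.0):
    """PSNR in dB for a mean squared error, with pixel values in [0, max_value]."""
    return 10.0 * math.log10(max_value ** 2 / mse)

def mse_ratio(psnr_high_db, psnr_low_db):
    """Factor by which the MSE grows when PSNR falls from the high to the low value."""
    return 10.0 ** ((psnr_high_db - psnr_low_db) / 10.0)

def pixelwise_gaussian_count(num_frames, height, width):
    """One Gaussian per pixel per frame, as in pixel-wise feed-forward prediction."""
    return num_frames * height * width

ratio = mse_ratio(20.59, 19.23)  # PSNR values quoted in the summary
pixelwise = pixelwise_gaussian_count(num_frames=8, height=256, width=256)  # hypothetical input size
compact = 2048  # hypothetical compact budget, not a figure from the paper

print(f"MSE grows by a factor of {ratio:.2f} for a 1.36 dB drop")
print(f"pixel-wise: {pixelwise} Gaussians, compact: {compact}, ratio: {pixelwise // compact}x")
```

Under these assumptions, a 1.36 dB drop corresponds to roughly 37% higher MSE, and eight 256x256 frames already yield over half a million pixel-wise Gaussians.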
Significance
This work advances the field of dynamic scene reconstruction by providing a scalable, generalizable, and efficient framework that does not require scene-specific optimization. It addresses longstanding issues such as high computational costs, view-dependent biases, and limited temporal interpolation, making real-time or near-real-time 4D scene understanding feasible. The integration of a diffusion-based rendering enhancement further bridges the gap between geometric accuracy and visual fidelity, opening new avenues for applications in AR/VR, robotics, and content creation. Its ability to model scenes with fewer primitives and without camera pose information marks a significant step toward practical deployment.
Technical Contribution
The core technical innovation lies in the design of timestamp-conditioned learnable Gaussian query tokens that, combined with a transformer decoder, enable global motion modeling with a compact primitive set. This approach circumvents pixel-wise prediction's redundancy and view-dependent biases, providing a unified representation across time and views. The use of full self-attention ensures spatial and temporal coherence, while the feature lifting mechanism allows arbitrary foundation model features to be mapped into a 4D scene representation. The integration of a diffusion-based rendering refinement further enhances visual quality, making the entire pipeline both efficient and robust.
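One way to picture feature lifting and point tracking on top of a compact Gaussian set is sketched below. This is an assumed reading rather than the paper's procedure: it supposes that each Gaussian has normalized weights over 2D patches, lifts a patch feature map by weighted averaging, matches a query feature to a Gaussian by cosine similarity, and reads the track off that Gaussian's centers at each timestamp. All names and data are hypothetical.

```python
import math

def lift_features(weights, patch_features):
    """weights[g][p]: weight of patch p for Gaussian g (each row sums to 1)."""
    dim = len(patch_features[0])
    return [
        [sum(w * f[d] for w, f in zip(row, patch_features)) for d in range(dim)]
        for row in weights
    ]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def match_gaussian(query_feature, gaussian_features):
    """Index of the Gaussian whose lifted feature is most similar to the query."""
    return max(range(len(gaussian_features)),
               key=lambda g: cosine(query_feature, gaussian_features[g]))

# Toy data: 3 patches with 2D features, 3 Gaussians with weights over the patches.
patch_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, 0.0]]
lifted = lift_features(weights, patch_features)

# Hypothetical decoded centers of the same 3 Gaussians at two timestamps.
centers_by_time = {
    0.0: [[0.0, 0.0, 1.0], [1.0, 0.0, 1.0], [0.5, 0.5, 1.0]],
    1.0: [[0.1, 0.0, 1.0], [1.0, 0.2, 1.0], [0.5, 0.6, 1.0]],
}

g = match_gaussian([0.9, 0.1], lifted)
track = [centers_by_time[t][g] for t in sorted(centers_by_time)]
print("matched Gaussian:", g, "track:", track)
```

Because the same Gaussian index persists across timestamps, the track comes directly from the representation and needs no frame-to-frame association step.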
Novelty
This is the first work to incorporate timestamp-conditioned learnable Gaussian query tokens within a transformer framework for dynamic 4D scene reconstruction. Unlike prior pixel-wise methods with high Gaussian counts, C4G achieves a globally coherent, compact scene representation that generalizes across scenes and large temporal gaps. Its combination of global feature aggregation, temporal conditioning, and diffusion refinement represents a new paradigm in the field, bridging the gap between efficiency and fidelity.
Limitations
- Despite its strengths, the model struggles with scenes involving extremely rapid motion or complex occlusions, where the feature aggregation may not fully capture fine details or occluded regions.
- The computational cost of full self-attention scales quadratically with the number of tokens, and therefore with feature map size and frame count, limiting scalability to very high-resolution inputs or large scenes without further optimization.
- While camera pose independence is a strength, the geometric accuracy in large-scale scenes without explicit pose information can still be improved, possibly by integrating additional priors or multi-view cues.
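The quadratic scaling noted in the limitations can be made concrete with a back-of-the-envelope count of attention pairs. The patch size of 14, the frame count, the resolutions, and the number of query tokens below are assumptions for illustration, not values taken from the paper.

```python
def attention_cost(num_frames, height, width, patch_size, num_queries):
    """Return (token count, pairwise attention entries) for one full self-attention layer."""
    patches_per_frame = (height // patch_size) * (width // patch_size)
    tokens = num_frames * patches_per_frame + num_queries
    return tokens, tokens * tokens

# Hypothetical settings: 8 frames, patch size 14, 2048 Gaussian query tokens.
low_tokens, low_pairs = attention_cost(8, 224, 224, 14, 2048)
high_tokens, high_pairs = attention_cost(8, 448, 448, 14, 2048)

print(f"224x224: {low_tokens} tokens, {low_pairs} attention entries")
print(f"448x448: {high_tokens} tokens, {high_pairs} attention entries")
print(f"cost ratio: {high_pairs / low_pairs:.2f}x")
```

In this setting, doubling the input resolution multiplies the token count by 2.5 and the attention entries by 6.25, which is why multi-scale or sparse attention is a natural direction for larger inputs.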
Future Work
Future research could explore multi-scale attention mechanisms to improve efficiency and detail capture, incorporate multi-modal cues like depth and flow for better geometric fidelity, and extend the framework to multi-camera setups for large-scale scene understanding. Additionally, unsupervised or weakly supervised training strategies could further reduce reliance on annotated data, broadening real-world applicability.
AI Executive Summary
Reconstructing dynamic scenes in four dimensions from monocular videos has long been a fundamental challenge in computer vision. Traditional approaches relied heavily on scene-specific optimization, which, while capable of high-fidelity results, suffered from high computational costs, limited scalability, and poor generalization to new scenes. These methods often required hours of optimization per scene, making them impractical for large-scale or real-time applications. Recent advances in neural rendering, such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS), introduced powerful representations capable of high-quality static scene reconstruction. Extending these to dynamic scenes, however, posed additional challenges: how to model scene motion coherently across time, how to handle occlusions and view-dependent biases, and how to do so efficiently without scene-specific tuning.
In this context, the paper introduces C4G, a framework built on a set of timestamp-conditioned learnable Gaussian query tokens. These tokens are decoded via a transformer decoder with full self-attention, enabling the model to aggregate global multi-frame features into a compact set of 3D Gaussians representing the scene at an arbitrary time. This design addresses the redundancy and view-dependence issues prevalent in pixel-wise methods with high Gaussian counts. The key innovation is conditioning the Gaussian queries on timestamps, which allows the model to produce temporally coherent scene representations without scene-specific optimization.
The authors further enhance the visual quality of the reconstructed scenes by integrating a video diffusion model-based rendering refinement module. This module refines the rendered images conditioned on input views, filling in high-frequency details and reducing artifacts. Moreover, the model employs a feature lifting mechanism, mapping 2D foundation model features into a 4D feature field, supporting point tracking and scene understanding tasks. The entire system is trained end-to-end, relying solely on photometric and auxiliary supervision signals like depth, normals, and motion tracking, without requiring camera pose information.
Experimental results indicate that C4G achieves state-of-the-art or competitive performance across multiple dynamic scene datasets, including DynaCheck, TUM-Dynamics, and NVIDIA. It is reported to outperform existing methods in novel view synthesis, with PSNR scores (~15.64 dB) obtained using far fewer Gaussians (thousands versus hundreds of thousands). The model exhibits robust temporal interpolation, with PSNR decreasing only slightly even at large temporal gaps (Δt=8). Point tracking and 4D feature field evaluations indicate that C4G captures scene-wide motion trajectories more accurately than pixel-wise approaches, supporting its ability to model global scene dynamics.
This work represents a major step forward in dynamic scene reconstruction, offering a scalable, generalizable, and efficient solution. Its ability to model scenes with fewer primitives, without scene-specific optimization, opens new possibilities for real-time applications in AR/VR, robotics, and content creation. The integration of diffusion-based refinement bridges the gap between geometric accuracy and visual fidelity, setting a new standard for neural dynamic scene modeling. Nonetheless, challenges remain in handling extremely fast motions, occlusions, and large-scale scenes, which motivate future research directions such as multi-scale attention, multi-modal cues, and unsupervised learning strategies.
Deep Dive
Abstract
Dynamic scene reconstruction from monocular video remains a fundamental challenge in computer vision. Existing feed-forward methods predict 3D Gaussians pixel-wise for each frame, suffering from duplicated Gaussians and view-dependent biases that hinder effective learning of scene motion. We present C4G, a feed-forward 4D reconstruction framework built upon a compact set of timestamp-conditioned learnable Gaussian query tokens. Each token aggregates corresponding features across the full temporal context and decodes a 3D Gaussian whose position is modulated by the target timestamp, enabling globally coherent motion modeling without per-scene optimization. To capture fine-grained details, we further introduce a video diffusion model-based rendering enhancement module. Since our framework effectively aggregates features into Gaussians, we extend this capability to feature lifting, producing a 4D feature field that supports point tracking and dynamic scene understanding. C4G achieves strong novel-view synthesis performance using significantly fewer Gaussians and without requiring camera poses, while exhibiting stronger motion modeling and robustness to large temporal gaps.
References (20)
UFO-4D: Unposed Feedforward 4D Reconstruction from Two Images
Junhwa Hur, Charles Herrmann, Songyou Peng et al.
MoVieS: Motion-Aware 4D Dynamic View Synthesis in One Second
Chenguo Lin, Yuchen Lin, Panwang Pan et al.
NeoVerse: Enhancing 4D World Model with in-the-wild Monocular Videos
Yuxue Yang, Lue Fan, Ziqi Shi et al.
GS-LRM: Large Reconstruction Model for 3D Gaussian Splatting
Kai Zhang, Sai Bi, Hao Tan et al.
Feed-Forward Bullet-Time Reconstruction of Dynamic Scenes from Monocular Videos
Hanxue Liang, Jiawei Ren, Ashkan Mirzaei et al.
MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details
Ruicheng Wang, Sicheng Xu, Yue Dong et al.
C3G: Learning Compact 3D Representations with 2K Gaussians
Honggyu An, Jaewoo Jung, Mungyeom Kim et al.
Aria Digital Twin: A New Benchmark Dataset for Egocentric 3D Machine Perception
Xiaqing Pan, Nicholas Charron, Yongqiang Yang et al.
Decoupled Weight Decay Regularization
I. Loshchilov, F. Hutter
PixelSplat: 3D Gaussian Splats from Image Pairs for Scalable Generalizable 3D Reconstruction
David Charatan, Sizhe Li, Andrea Tagliasacchi et al.
3D Gaussian Splatting for Real-Time Radiance Field Rendering
Bernhard Kerbl, Georgios Kopanas, Thomas Leimkuehler et al.
4DGT: Learning a 4D Gaussian Transformer Using Real-World Monocular Videos
Zhen Xu, Zhengqin Li, Zhao Dong et al.
MoSca: Dynamic Gaussian Fusion from Casual Videos via 4D Motion Scaffolds
Jiahui Lei, Yijia Weng, Adam W. Harley et al.
Shape of Motion: 4D Reconstruction From a Single Video
Qianqian Wang, Vickie Ye, Hang Gao et al.
VGGT: Visual Geometry Grounded Transformer
Jianyuan Wang, Minghao Chen, Nikita Karaev et al.
VACE: All-in-One Video Creation and Editing
Zeyinzi Jiang, Zhen Han, Chaojie Mao et al.
Unifying Correspondence, Pose and NeRF for Generalized Pose-Free Novel View Synthesis
Sung-Jin Hong, Jaewoo Jung, Heeseong Shin et al.
Emergent Outlier View Rejection in Visual Geometry Grounded Transformers
Jisang Han, Sung-Jin Hong, Jaewoo Jung et al.
HexPlane: A Fast Representation for Dynamic Scenes
Ang Cao, Justin Johnson
GeCoNeRF: Few-shot Neural Radiance Fields via Geometric Consistency
Minseop Kwak, Jiuhn Song, Seungryong Kim