Learning Global Motion with Compact Gaussians for Feed-Forward 4D Reconstruction

TL;DR

C4G introduces timestamp-conditioned learnable Gaussian tokens decoded by a transformer. This enables feed-forward 4D scene reconstruction from monocular video without scene-specific optimization, and reduces the Gaussian count by orders of magnitude compared with pixel-wise methods.

cs.CV 2026-05-30

Mungyeom Kim, Minkyeong Jeon, Honggyu An, Jaewoo Jung, Hyuna Ko, Jisang Han, Hyeonseo Yu, Donghwan Shin, Sunghwan Hong, Takuya Narihira, Kazumi Fukuda, Yuki Mitsufuji, Seungryong Kim

dynamic scene reconstruction, 4D modeling, neural rendering, Gaussian primitives, video diffusion

Key Findings

Methodology

The proposed C4G framework employs a set of learnable Gaussian query tokens conditioned on timestamps and decoded by a transformer decoder with full self-attention. Visual features are extracted from multi-frame video using a pretrained VGGT backbone, then embedded with temporal positional encodings. These features and the timestamp-conditioned queries interact within the transformer, producing a compact set of Gaussian parameters that represent the scene at an arbitrary time. To enhance rendering quality, a refinement module based on a video diffusion model refines the rendered images conditioned on the input views. Additionally, a feature lifting mechanism maps 2D foundation model features into a 4D feature field, supporting point tracking and scene understanding. The entire pipeline is trained end-to-end, avoids scene-specific optimization, and achieves high-fidelity dynamic scene reconstruction with significantly fewer Gaussians.
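The query-decoding step described above can be pictured with a toy sketch. It is not the authors' implementation: the sinusoidal timestamp embedding, the identity attention projections, the function names, and all sizes are assumptions made for illustration, and the real model would add learned weights and a head that maps each output vector to Gaussian parameters.

```python
import math
import random

def timestamp_embedding(t, dim):
    """Sinusoidal embedding of a scalar timestamp (dim must be even). Assumed form."""
    emb = []
    for i in range(dim // 2):
        freq = 1.0 / (10000 ** (2 * i / dim))
        emb.extend([math.sin(t * freq), math.cos(t * freq)])
    return emb

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head full self-attention; identity projections stand in for learned Q/K/V weights."""
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, tokens)) for d in range(dim)])
    return out

def decode_queries(queries, frame_features, t):
    """Condition each query on t, attend jointly with frame features, keep the query outputs."""
    dim = len(queries[0])
    t_emb = timestamp_embedding(t, dim)
    conditioned = [[x + e for x, e in zip(q, t_emb)] for q in queries]
    attended = self_attention(conditioned + frame_features)
    return attended[:len(queries)]  # one vector per Gaussian query

random.seed(0)
DIM, NUM_QUERIES, NUM_FEATURE_TOKENS = 8, 4, 12  # toy sizes, not the paper's
queries = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_QUERIES)]
features = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_FEATURE_TOKENS)]

at_start = decode_queries(queries, features, t=0.0)
at_end = decode_queries(queries, features, t=1.0)
print(len(at_start), len(at_start[0]))
```

The point of the sketch is that the number of outputs is fixed by the number of queries, not by the number of pixels or frames, and that the same queries give different outputs when the target timestamp changes.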

Key Results

On datasets such as DynaCheck, TUM-Dynamics, and NVIDIA, C4G is reported to outperform existing methods such as NeoVerse and MoSca, with PSNR scores of 15.64 dB and 20.59 dB at short temporal gaps. It does so with a Gaussian count in the thousands, far fewer than the hundreds of thousands used by pixel-wise methods. Novel view synthesis quality holds up at large temporal gaps (∆t=8), where PSNR drops only slightly, to 19.23 dB.
Quantitative evaluations on point tracking and 4D feature fields demonstrate that C4G captures scene-wide motion trajectories more accurately than pixel-based Gaussian models, validating its understanding of global scene dynamics.
Ablation studies confirm that the time-conditioned query design and attention mechanisms are crucial for reducing Gaussian redundancy and improving generalization across diverse scenes.
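To put the reported PSNR figures in context, the short sketch below uses the standard PSNR definition to convert a drop in dB into a change in mean squared error. It assumes the 20.59 dB and 19.23 dB figures refer to the same benchmark, which the summary implies but does not state outright. The function names are illustrative.

```python
import math

def psnr_from_mse(mse, max_value=1.0):
    """PSNR in dB for a given mean squared error and peak signal value."""
    return 10.0 * math.log10(max_value ** 2 / mse)

def mse_ratio(psnr_high, psnr_low):
    """Factor by which MSE grows when PSNR falls from psnr_high to psnr_low (both in dB)."""
    return 10.0 ** ((psnr_high - psnr_low) / 10.0)

# Assumption: both figures come from the same dataset and protocol.
ratio = mse_ratio(20.59, 19.23)
print(f"MSE grows by a factor of about {ratio:.2f}")
```

Under that assumption, the 1.36 dB drop at ∆t=8 corresponds to roughly a 1.37x increase in mean squared error.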

Significance

This work advances the field of dynamic scene reconstruction by providing a scalable, generalizable, and efficient framework that does not require scene-specific optimization. It addresses longstanding issues such as high computational costs, view-dependent biases, and limited temporal interpolation, making real-time or near-real-time 4D scene understanding feasible. The integration of a diffusion-based rendering enhancement further bridges the gap between geometric accuracy and visual fidelity, opening new avenues for applications in AR/VR, robotics, and content creation. Its ability to model scenes with fewer primitives and without camera pose information marks a significant step toward practical deployment.

Technical Contribution

The core technical innovation lies in the design of timestamp-conditioned learnable Gaussian query tokens that, combined with a transformer decoder, enable global motion modeling with a compact primitive set. This approach circumvents pixel-wise prediction's redundancy and view-dependent biases, providing a unified representation across time and views. The use of full self-attention ensures spatial and temporal coherence, while the feature lifting mechanism allows arbitrary foundation model features to be mapped into a 4D scene representation. The integration of a diffusion-based rendering refinement further enhances visual quality, making the entire pipeline both efficient and robust.
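The compactness argument can be made concrete with simple arithmetic. The sketch below compares a pixel-wise predictor, which emits one Gaussian per pixel per frame, with a fixed query budget. The frame count, resolution, and query budget are hypothetical values chosen for illustration, not settings taken from the paper.

```python
def pixelwise_gaussian_count(num_frames, height, width):
    """One Gaussian per pixel per input frame."""
    return num_frames * height * width

def query_gaussian_count(num_queries):
    """One Gaussian per learnable query, independent of resolution and frame count."""
    return num_queries

# Hypothetical settings chosen only for illustration.
pixelwise = pixelwise_gaussian_count(num_frames=8, height=256, width=256)
compact = query_gaussian_count(num_queries=2048)
print(pixelwise, compact, pixelwise // compact)
```

With these illustrative numbers the pixel-wise count lands in the hundreds of thousands while the query-based count stays in the thousands, which matches the orders-of-magnitude gap the summary describes.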

Novelty

C4G is presented as the first work to incorporate timestamp-conditioned learnable Gaussian query tokens within a transformer framework for dynamic 4D scene reconstruction. Unlike prior pixel-wise methods with high Gaussian counts, C4G produces a globally coherent, compact scene representation that generalizes across scenes and large temporal gaps. Its combination of global feature aggregation, temporal conditioning, and diffusion refinement narrows the gap between efficiency and fidelity.

Limitations

Despite its strengths, the model struggles with scenes involving extremely rapid motion or complex occlusions, where the feature aggregation may not fully capture fine details or occluded regions.
The computational cost of full self-attention scales quadratically with the number of tokens, and therefore with feature map size, limiting scalability to very high-resolution inputs or large scenes without further optimization.
While camera pose independence is a strength, the geometric accuracy in large-scale scenes without explicit pose information can still be improved, possibly by integrating additional priors or multi-view cues.
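The quadratic scaling noted in the limitations can be illustrated by counting attention score entries. The sketch assumes a ViT-style patch size of 14, eight input frames, and 2048 queries. All of these are hypothetical values for illustration, and the helper names are not from the paper.

```python
def patch_tokens(num_frames, height, width, patch=14):
    """Number of image feature tokens when each frame is split into patch x patch tiles."""
    return num_frames * (height // patch) * (width // patch)

def attention_pairs(num_tokens):
    """Entries in one full self-attention score matrix."""
    return num_tokens * num_tokens

NUM_QUERIES = 2048  # hypothetical query budget

for side in (224, 448):
    n = patch_tokens(8, side, side) + NUM_QUERIES
    print(side, n, attention_pairs(n))
```

Doubling the image side quadruples the image tokens, and the score matrix grows with the square of the total token count, which is why high-resolution inputs become expensive quickly.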

Future Work

Future research could explore multi-scale attention mechanisms to improve efficiency and detail capture, incorporate multi-modal cues like depth and flow for better geometric fidelity, and extend the framework to multi-camera setups for large-scale scene understanding. Additionally, unsupervised or weakly supervised training strategies could further reduce reliance on annotated data, broadening real-world applicability.

AI Executive Summary

Reconstructing dynamic scenes in four dimensions from monocular videos has long been a fundamental challenge in computer vision. Traditional approaches relied heavily on scene-specific optimization, which, while capable of high-fidelity results, suffered from high computational costs, limited scalability, and poor generalization to new scenes. These methods often required hours of optimization per scene, making them impractical for large-scale or real-time applications. Recent advances in neural rendering, such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS), introduced powerful representations capable of high-quality static scene reconstruction. Extending these to dynamic scenes, however, posed additional challenges: how to model scene motion coherently across time, how to handle occlusions and view-dependent biases, and how to do so efficiently without scene-specific tuning.

In this context, the paper introduces C4G, a framework built on a set of timestamp-conditioned learnable Gaussian query tokens. These tokens are decoded by a transformer decoder with full self-attention, enabling the model to aggregate global multi-frame features into a compact set of 3D Gaussians representing the scene at an arbitrary time. This design addresses the redundancy and view-dependence issues prevalent in pixel-wise methods with high Gaussian counts. The key innovation is conditioning the Gaussian queries on timestamps, which allows the model to produce temporally coherent scene representations without scene-specific optimization.

The authors further enhance the visual quality of the reconstructed scenes by integrating a rendering refinement module based on a video diffusion model. This module refines the rendered images conditioned on the input views, filling in high-frequency details and reducing artifacts. Moreover, the model employs a feature lifting mechanism that maps 2D foundation model features into a 4D feature field, supporting point tracking and scene understanding tasks. The entire system is trained end-to-end with photometric supervision plus auxiliary signals such as depth, normals, and motion tracking, and it does not require camera pose information.
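One simple way to picture feature lifting is as a weighted average: each Gaussian collects the 2D features of the pixels it is associated with. The summary does not specify how C4G computes the association, so the weight matrix below is a hypothetical stand-in (for example, attention or rendering weights), and `lift_features` is an illustrative name.

```python
def lift_features(weights, pixel_features):
    """Lift 2D features onto Gaussians.

    weights[g][p] is a non-negative association between Gaussian g and pixel p
    (hypothetical; the source does not define it). Returns one feature vector
    per Gaussian as a normalized weighted average of the pixel features.
    """
    dim = len(pixel_features[0])
    lifted = []
    for row in weights:
        total = sum(row)
        if total == 0:
            lifted.append([0.0] * dim)  # Gaussian not associated with any pixel
            continue
        lifted.append(
            [sum(w * f[d] for w, f in zip(row, pixel_features)) / total for d in range(dim)]
        )
    return lifted

pixel_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three pixels, 2-D features
weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]           # two Gaussians
lifted = lift_features(weights, pixel_features)
print(lifted)
```

Because the lifted features live on the Gaussians rather than on pixels, they move with the Gaussians over time, which is what makes a 4D feature field usable for tracking.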

Experimental results indicate that C4G achieves state-of-the-art or competitive performance across multiple dynamic scene datasets, including DynaCheck, TUM-Dynamics, and NVIDIA. In novel view synthesis it outperforms existing methods, with reported PSNR scores such as 15.64 dB and 20.59 dB, while using far fewer Gaussians (thousands versus hundreds of thousands). The model exhibits robust temporal interpolation, with PSNR decreasing only slightly even at large temporal gaps (∆t=8). Point tracking and 4D feature field evaluations indicate that C4G captures scene-wide motion trajectories more accurately than pixel-wise approaches, supporting the claim that it models global scene dynamics.

This work represents a major step forward in dynamic scene reconstruction, offering a scalable, generalizable, and efficient solution. Its ability to model scenes with fewer primitives, without scene-specific optimization, opens new possibilities for real-time applications in AR/VR, robotics, and content creation. The integration of diffusion-based refinement bridges the gap between geometric accuracy and visual fidelity, setting a new standard for neural dynamic scene modeling. Nonetheless, challenges remain in handling extremely fast motions, occlusions, and large-scale scenes, which motivate future research directions such as multi-scale attention, multi-modal cues, and unsupervised learning strategies.

Deep Dive

Limitations & Outlook

What gaps remain?

While C4G demonstrates impressive capabilities, it faces limitations in scenes with extremely rapid motion or complex occlusions, where feature aggregation may not fully capture fine details or hidden regions. The quadratic complexity of full self-attention constrains scalability to very high resolutions or large scenes, necessitating further architectural optimizations. Additionally, the absence of explicit camera pose information can lead to geometric inaccuracies in large-scale environments, indicating a need for integrating pose estimation or multi-view cues. Future work should focus on multi-scale attention mechanisms, efficient attention variants, and multi-modal data fusion to address these issues.

Abstract

Dynamic scene reconstruction from monocular video remains a fundamental challenge in computer vision. Existing feed-forward methods predict 3D Gaussians pixel-wise for each frame, suffering from duplicated Gaussians and view-dependent biases that hinder effective learning of scene motion. We present C4G, a feed-forward 4D reconstruction framework built upon a compact set of timestamp-conditioned learnable Gaussian query tokens. Each token aggregates corresponding features across the full temporal context and decodes a 3D Gaussian whose position is modulated by the target timestamp, enabling globally coherent motion modeling without per-scene optimization. To capture fine-grained details, we further introduce a video diffusion model-based rendering enhancement module. Since our framework effectively aggregates features into Gaussians, we extend this capability to feature lifting, producing a 4D feature field that supports point tracking and dynamic scene understanding. C4G achieves strong novel-view synthesis performance using significantly fewer Gaussians and without requiring camera poses, while exhibiting stronger motion modeling and robustness to large temporal gaps.
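The idea of a Gaussian whose position depends on the target timestamp, and how that yields point tracks, can be sketched as follows. The linear velocity term is a deliberately simple stand-in and an assumption of this sketch: in C4G the position is decoded by the network from the timestamp-conditioned query, not from a closed-form motion model.

```python
from dataclasses import dataclass

@dataclass
class TimedGaussian:
    base: tuple      # position at t = 0
    velocity: tuple  # hypothetical linear motion term, used only for illustration

    def position(self, t):
        """Position of this Gaussian at timestamp t."""
        return tuple(b + v * t for b, v in zip(self.base, self.velocity))

def trajectory(gaussian, timestamps):
    """Query the same Gaussian at several timestamps to obtain a 3D track."""
    return [gaussian.position(t) for t in timestamps]

g = TimedGaussian(base=(0.0, 0.0, 0.0), velocity=(1.0, 0.0, 0.0))
track = trajectory(g, [0.0, 0.5, 1.0])
print(track)
```

Because the same primitive is evaluated at every timestamp, its successive positions form a trajectory directly. A per-frame pixel-wise predictor has no such persistent identity across frames.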

References (20)

UFO-4D: Unposed Feedforward 4D Reconstruction from Two Images. Junhwa Hur, Charles Herrmann, Songyou Peng et al. 2026, 2 citations.

MoVieS: Motion-Aware 4D Dynamic View Synthesis in One Second. Chenguo Lin, Yuchen Lin, Panwang Pan et al. 2025, 24 citations.

NeoVerse: Enhancing 4D World Model with in-the-wild Monocular Videos. Yuxue Yang, Lue Fan, Ziqi Shi et al. 2026, 12 citations.

GS-LRM: Large Reconstruction Model for 3D Gaussian Splatting. Kai Zhang, Sai Bi, Hao Tan et al. 2024, 319 citations.

Feed-Forward Bullet-Time Reconstruction of Dynamic Scenes from Monocular Videos. Hanxue Liang, Jiawei Ren, Ashkan Mirzaei et al. 2024, 45 citations.

MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details. Ruicheng Wang, Sicheng Xu, Yue Dong et al. 2025, 157 citations.

C3G: Learning Compact 3D Representations with 2K Gaussians. Honggyu An, Jaewoo Jung, Mungyeom Kim et al. 2025, 8 citations.

Aria Digital Twin: A New Benchmark Dataset for Egocentric 3D Machine Perception. Xiaqing Pan, Nicholas Charron, Yongqiang Yang et al. 2023, 153 citations.

Decoupled Weight Decay Regularization. I. Loshchilov, F. Hutter. 2017, 34449 citations.

PixelSplat: 3D Gaussian Splats from Image Pairs for Scalable Generalizable 3D Reconstruction. David Charatan, Sizhe Li, Andrea Tagliasacchi et al. 2023, 670 citations.

3D Gaussian Splatting for Real-Time Radiance Field Rendering. Bernhard Kerbl, Georgios Kopanas, Thomas Leimkuehler et al. 2023, 8611 citations.

4DGT: Learning a 4D Gaussian Transformer Using Real-World Monocular Videos. Zhen Xu, Zhengqin Li, Zhao Dong et al. 2025, 32 citations.

MoSca: Dynamic Gaussian Fusion from Casual Videos via 4D Motion Scaffolds. Jiahui Lei, Yijia Weng, Adam W. Harley et al. 2024, 164 citations.

Shape of Motion: 4D Reconstruction From a Single Video. Qianqian Wang, Vickie Ye, Hang Gao et al. 2024, 234 citations.

VGGT: Visual Geometry Grounded Transformer. Jianyuan Wang, Minghao Chen, Nikita Karaev et al. 2025, 1215 citations.

VACE: All-in-One Video Creation and Editing. Zeyinzi Jiang, Zhen Han, Chaojie Mao et al. 2025, 329 citations.

Unifying Correspondence, Pose and NeRF for Generalized Pose-Free Novel View Synthesis. Sung‐Jin Hong, Jaewoo Jung, Heeseong Shin et al. 2023, 39 citations.

Emergent Outlier View Rejection in Visual Geometry Grounded Transformers. Jisang Han, Sung‐Jin Hong, Jaewoo Jung et al. 2025, 16 citations.

HexPlane: A Fast Representation for Dynamic Scenes. Ang Cao, Justin Johnson. 2023, 740 citations.

GeCoNeRF: Few-shot Neural Radiance Fields via Geometric Consistency. Minseop Kwak, Jiuhn Song, Seungryong Kim. 2023, 67 citations.
