Modality Forcing for Scalable Spatial Generation

TL;DR

Proposes Modality Forcing, a post-training recipe that lets a single DiT jointly generate images and depth while training on sparse depth data. The strongest model reduces AbsRel by 57% relative to existing joint image-depth generative models, and depth accuracy improves with model size.

cs.CV, Advanced, 2026-06-12

Bardienus Pieter Duisterhof, Deva Ramanan, Jeffrey Ichnowski, Justin Johnson, Keunhong Park

multimodal generation, diffusion models, depth estimation, spatial perception, large-scale pretraining

Key Findings

Methodology

This paper introduces Modality Forcing, a simple, scalable post-training recipe that adapts a pretrained T2I diffusion transformer (DiT) for joint image-depth generation. The core idea is to assign a separate noise level to each modality (RGB and depth) within a single diffusion process, which enables conditional and joint generation in any permutation. Depth is tokenized in pixel space and handled by a per-modality decoder, which lets the model learn from sparse real-world depth annotations. During training, the model sees varying noise levels for each modality, so the same weights support unconditional joint generation, image-to-depth (I2D), and depth-to-image (D2I). To preserve the pretrained spatial prior, the authors add a self-distillation loss that penalizes deviation from the original T2I checkpoint, so the model keeps its generative prior while learning to predict depth. To study scaling, they also train a family of T2I models from scratch, ranging from 370 million to 3.3 billion parameters, and find that larger models trained on more image data predict depth more accurately after Modality Forcing.
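
As a rough illustration of the per-modality noising described above, the sketch below draws an independent noise level for the image and for the depth map and maps each task to a choice of levels. It is not the authors' implementation: the function names, the uniform sampling of levels, and the linear blend toward Gaussian noise (a rectified-flow-style corruption) are all assumptions made for the example.

```python
import numpy as np

def sample_noise_levels(rng, task):
    """Return (t_image, t_depth); 0.0 means clean, 1.0 means pure noise."""
    if task == "joint":   # both modalities are noised independently
        return float(rng.uniform()), float(rng.uniform())
    if task == "i2d":     # a clean image conditions the noisy depth
        return 0.0, float(rng.uniform())
    if task == "d2i":     # a clean depth map conditions the noisy image
        return float(rng.uniform()), 0.0
    raise ValueError(f"unknown task: {task}")

def add_noise(x, t, rng):
    """Assumed corruption: linear blend between data and Gaussian noise."""
    eps = rng.standard_normal(x.shape)
    return (1.0 - t) * x + t * eps

rng = np.random.default_rng(0)
image = rng.uniform(size=(3, 8, 8))   # toy RGB tensor
depth = rng.uniform(size=(1, 8, 8))   # toy depth tensor

t_img, t_dep = sample_noise_levels(rng, "i2d")
noisy_image = add_noise(image, t_img, rng)   # unchanged, since t_img is 0.0
noisy_depth = add_noise(depth, t_dep, rng)
# A single network would receive both tensors together with (t_img, t_dep).
```

In this sketch, setting one level to zero turns the same model into a conditional generator for the other modality, which is how one set of weights can cover joint generation, I2D, and D2I.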

Key Results

The largest model (3.3B parameters) achieves an AbsRel of 2.52% on NYUv2, a 57% reduction relative to existing joint image-depth generative models, and is competitive with state-of-the-art monocular depth estimators such as MoGe-2 (3.14%; lower is better). Larger models trained on more image data produce more accurate depth predictions, which supports the scalability of the method.
In the depth-to-image (D2I) task, the model achieves an FID of 11.41 on OpenImages, outperforming baselines such as ControlNet and UniCon in depth-conditioned image synthesis. In image-to-depth (I2D), it outperforms existing joint image-depth generative models on depth accuracy across multiple benchmarks.
Depth prediction improves consistently with model size and training data volume, which supports the claim that T2I pretraining provides a strong spatial prior. The ablation studies indicate that pixel-space depth tokenization and per-modality noise levels contribute substantially to performance and robustness.
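
For reference, AbsRel is the mean absolute relative error between predicted and ground-truth depth, and lower is better. The sketch below computes it over annotated pixels only, as sparse ground truth requires, and shows how a relative reduction such as the 57% figure is calculated. The helper names and toy values are illustrative and are not taken from the paper; any scale or shift alignment that an evaluation protocol might apply beforehand is omitted.

```python
import numpy as np

def abs_rel(pred, gt, valid):
    """Mean of |pred - gt| / gt over pixels where ground truth is valid."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    valid = np.asarray(valid, dtype=bool)
    if not valid.any():
        raise ValueError("no valid ground-truth pixels")
    return float(np.mean(np.abs(pred[valid] - gt[valid]) / gt[valid]))

def relative_reduction(baseline, ours):
    """Fractional reduction of an error metric relative to a baseline."""
    return (baseline - ours) / baseline

gt = np.array([[2.0, 0.0], [4.0, 5.0]])     # 0.0 marks a missing measurement
pred = np.array([[2.2, 9.0], [3.6, 5.0]])
score = abs_rel(pred, gt, gt > 0)            # the unannotated pixel is ignored

# Toy numbers only: an error dropping from 10.0 to 4.3 is a 57% reduction.
reduction = relative_reduction(10.0, 4.3)
```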

Significance

This work addresses a fundamental challenge in spatial perception: how to leverage large-scale image generation models for accurate depth estimation with sparse data. By demonstrating that T2I models inherently contain rich spatial priors, and that these priors can be extracted via a simple post-training recipe, the authors open new avenues for scalable, data-efficient 3D scene understanding. The approach significantly reduces the dependency on dense depth annotations, lowering costs and enabling broader deployment in real-world applications such as robotics, AR/VR, and content creation. Furthermore, the validation of model scalability underscores the potential of large pretraining in advancing spatial perception tasks, bridging the gap between generative modeling and geometric reasoning.

Technical Contribution

The main technical contribution is the Modality Forcing diffusion algorithm, which assigns an independent noise level to each modality within a unified diffusion process. This design lets a single set of weights support joint generation, image-to-depth, and depth-to-image. The pixel-space depth tokenizer and per-modality decoders enable learning from sparse, real-world depth annotations, where previous methods depend on dense supervision. Self-distillation preserves the pretrained spatial prior during post-training, so the model keeps its generative capabilities while adapting to the new modality. Finally, a series of models of increasing size shows that the approach scales: larger models consistently produce more accurate depth predictions.
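
One way to picture how sparse supervision and self-distillation could be combined is sketched below. The depth term is averaged over annotated pixels only, and the distillation term penalizes the post-trained model for drifting from the output of the frozen original checkpoint. The squared-error form, the `distill_weight` parameter, and all function names are assumptions for illustration; this summary does not specify the actual losses.

```python
import numpy as np

def masked_depth_loss(pred, target, valid):
    """Squared error averaged over annotated pixels only."""
    valid = np.asarray(valid, dtype=float)
    n = valid.sum()
    if n == 0:
        return 0.0                        # no annotations in this sample
    return float((valid * (pred - target) ** 2).sum() / n)

def self_distillation_loss(student_out, teacher_out):
    """Penalize deviation from the frozen pretrained model's output."""
    return float(np.mean((student_out - teacher_out) ** 2))

def total_loss(depth_pred, depth_target, valid,
               student_img, teacher_img, distill_weight=0.1):
    """Hypothetical weighted sum of the two terms."""
    return (masked_depth_loss(depth_pred, depth_target, valid)
            + distill_weight * self_distillation_loss(student_img, teacher_img))

depth_target = np.array([[1.0, 0.0], [0.0, 3.0]])   # 0.0 marks no measurement
depth_pred = np.array([[1.5, 7.0], [-2.0, 3.0]])
valid = depth_target > 0

teacher_img = np.zeros((3, 2, 2))        # stand-in for the frozen model's output
student_img = teacher_img + 0.2          # stand-in for the post-trained output

loss = total_loss(depth_pred, depth_target, valid, student_img, teacher_img)
```

Because the mask zeroes out unannotated pixels, predictions there (7.0 and -2.0 in the toy data) do not affect the depth term.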

Novelty

Prior work has adapted T2I models for depth prediction, but, according to the abstract, those methods require dense depth data and involve complex recipes. The novelty here is a simple post-training recipe: Modality Forcing uses independent per-modality noise levels and pixel-space depth tokenization to turn a single DiT into a flexible joint image-depth generator trained on sparse depth. The approach demonstrates that the spatial priors embedded in T2I models transfer effectively to depth prediction, linking generative modeling and spatial perception. The scaling study further shows that model size and training data volume directly influence depth accuracy.

Limitations

Despite its strengths, the model’s performance degrades in scenarios with extremely sparse or noisy depth data, especially at long distances or in complex geometries, due to limitations in the depth tokenizer’s expressiveness.
The training process requires substantial computational resources, especially for larger models, which may limit accessibility for smaller research groups or deployment in resource-constrained environments.
The current framework primarily focuses on static scenes; extending it to dynamic or multi-view scenarios remains an open challenge, requiring additional temporal or multi-view modeling.

Future Work

Future research could explore adaptive noise scheduling mechanisms that dynamically optimize modality-specific denoising trajectories, enhancing performance in diverse conditions. Integrating temporal information and multi-view data could extend the model’s applicability to dynamic scenes and multi-camera setups. Developing more efficient architectures to reduce computational costs without sacrificing accuracy is another promising direction. Additionally, combining this approach with self-supervised learning or reinforcement learning could further improve the model’s understanding of complex spatial relationships, pushing the boundaries of scalable spatial perception.

AI Executive Summary

Accurate spatial understanding remains a central challenge in computer vision. Traditional methods rely heavily on dense depth annotations, which are costly and difficult to scale across diverse environments. Large-scale text-to-image (T2I) diffusion models such as Stable Diffusion and FLUX encode rich spatial priors. However, prior adaptations of these models for depth prediction require dense depth data and involve complex recipes.

This paper introduces Modality Forcing, a post-training strategy that transforms a pretrained DiT-based T2I model into a joint image-depth generator that can be trained on sparse real-world depth data. The core idea is to assign a separate noise level to each modality within the diffusion process, enabling flexible conditional and joint generation. With pixel-space depth tokenization, the model learns from sparse depth annotations and does not need dense supervision. A single set of weights supports unconditional joint generation, image-to-depth, and depth-to-image, which simplifies the pipeline.

A key aspect of the methodology is the training of a family of models ranging from 370 million to 3.3 billion parameters. The experiments reveal a clear trend: larger models trained on more image data produce more accurate depth predictions, supporting the scalability of the approach. Evaluation covers benchmarks such as NYUv2, ETH3D, and ScanNet. On NYUv2, the largest model achieves an AbsRel of 2.52%, a 57% reduction relative to existing joint image-depth generative models, and is competitive with dedicated monocular depth estimators such as MoGe-2.

The results demonstrate that the spatial priors embedded in large-scale T2I models are highly transferable to depth prediction tasks. The model excels not only in depth estimation but also in depth-conditioned image synthesis, achieving an FID of 11.41 on OpenImages, surpassing baselines. These findings suggest that image generation can serve as a scalable pretraining objective for spatial perception, reducing reliance on costly dense annotations.

Beyond academic validation, this work has practical implications for robotics, AR/VR, and content creation, where rapid, high-quality scene understanding is essential. The ability to generate consistent 3D point clouds and accurate depth maps from sparse data opens new avenues for real-time scene reconstruction and virtual environment generation. Despite its strengths, challenges remain in handling highly sparse or noisy data, and in extending the framework to dynamic scenes. Future directions include adaptive denoising strategies, multi-view integration, and efficiency improvements.

Overall, this research connects generative modeling and spatial perception, establishing a scalable, data-efficient approach that builds on large-scale T2I pretraining. It points toward systems that can understand and generate complex 3D environments with limited depth supervision.

Deep Dive

Abstract

Text-to-image (T2I) models contain rich spatial priors. Synthesizing photorealistic, cluttered scenes requires an understanding of geometry, including perspective and relative scale. Prior works adapt T2I models to leverage this prior for depth prediction, but they require dense depth data and involve complex recipes. We propose Modality Forcing, a simple, scalable post-training recipe for joint image-depth generation using a single DiT trained on sparse depth data. Modality Forcing enables conditional and joint generation of image and depth in any permutation by assigning separate noise levels per modality. Per-modality decoders let us train on sparse, real-world depth and achieve strong, generalizable depth prediction. We further show that Modality Forcing inherits the scalability of T2I pre-training: by training a set of T2I models from scratch (370M to 3.3B parameters), we find that larger models trained on more image data produce more accurate depth. Our strongest model is competitive with state-of-the-art monocular depth estimators and reduces AbsRel by 57% relative to existing joint image-depth generative models. These results provide strong evidence that image generation is a scalable pre-training objective for spatial perception. https://modality-forcing.github.io/

References (20)

Lihe Yang, Bingyi Kang, Zilong Huang et al. Depth Anything V2. 2024.

Byung-Ki Kwon, Qi Dai, Hyoseok Lee et al. JointDiT: Enhancing RGB-Depth Joint Modeling with Diffusion Transformers. 2025.

Thomas Schöps, Johannes L. Schönberger, Silvano Galliani et al. A Multi-view Stereo Benchmark with High-Resolution Images and Multi-camera Videos. 2017.

Patrick Esser, Sumith Kulal, A. Blattmann et al. Scaling Rectified Flow Transformers for High-Resolution Image Synthesis. 2024.

Ruicheng Wang, Sicheng Xu, Yue Dong et al. MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details. 2025.

Angela Dai, Angel X. Chang, M. Savva et al. ScanNet: Richly-Annotated 3D Reconstructions of Indoor Scenes. 2017.

Alan Baade, E. Chan, Kyle Sargent et al. Latent Forcing: Reordering the Diffusion Trajectory for Pixel-Space Image Generation. 2026.

N. Silberman, Derek Hoiem, Pushmeet Kohli et al. Indoor Segmentation and Support Inference from RGBD Images. 2012.

Igor Vasiljevic, Nicholas I. Kolkin, Shanyi Zhang et al. DIODE: A Dense Indoor and Outdoor DEpth Dataset. 2019.

Zhizhong Li, Derek Hoiem. Learning without Forgetting. 2016.

Valentin Gabeur, Shangbang Long, Songyou Peng et al. Image Generators are Generalist Vision Learners. 2026.

Tom B. Brown, Benjamin Mann, Nick Ryder et al. Language Models are Few-Shot Learners. 2020.

Lihe Yang, Bingyi Kang, Zilong Huang et al. Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data. 2024.

Hyung Won Chung, Noah Constant, Xavier García et al. UniMax: Fairer and more Effective Language Sampling for Large-Scale Multilingual Pretraining. 2023.

Robin Rombach, A. Blattmann, Dominik Lorenz et al. High-Resolution Image Synthesis with Latent Diffusion Models. 2021.

Gangwei Xu, Haotong Lin, Hongcheng Luo et al. Pixel-Perfect Depth with Semantics-Prompted Diffusion Transformers. 2025.

Bingxin Ke, Anton Obukhov, Shengyu Huang et al. Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation. 2023.

Ronald J. Williams, D. Zipser. A Learning Algorithm for Continually Running Fully Recurrent Neural Networks. 1989.

Shariq Farooq Bhat, R. Birkl, Diana Wofk et al. ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth. 2023.

J. Barron, B. Mildenhall, Dor Verbin et al. Mip-NeRF 360: Unbounded Anti-Aliased Neural Radiance Fields. 2021.
