DecQ: Detail-Condensing Queries for Enhanced Reconstruction and Generation in Representation Autoencoders
DecQ introduces detail-condensing queries to RAEs, boosting reconstruction PSNR to 22.76 dB and reducing generation FID to 1.41 with only 3.9% extra computation.
Key Findings
Methodology
DecQ is a novel framework enhancing Representation Autoencoders (RAEs) by introducing lightweight detail-condensing queries that extract fine-grained information from intermediate layers of frozen Vision Foundation Models (VFMs) via condenser modules. These queries complement the semantic patch tokens from the VFM encoder and are jointly input into a ViT-based decoder. During generation, both query and patch tokens are jointly denoised in the latent diffusion model, preserving the original semantic latent space while enriching low-level visual details. This design mitigates the inherent trade-off between reconstruction fidelity and generative quality in RAEs by aggregating multi-level features without fine-tuning the VFM, thus maintaining semantic consistency and improving both reconstruction and generation.
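The one-way flow from patch tokens to queries can be illustrated with a minimal sketch. It assumes single-head scaled dot-product cross-attention with a residual update and omits the learned projections, normalization, and other layers a real condenser would contain. The helper names `softmax` and `condense` and the toy sizes are hypothetical and are not taken from the paper.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def condense(queries, patch_tokens):
    """One simplified condenser step (hypothetical helper, not the paper's code).

    Each query attends over the patch tokens (used as keys and values) and adds
    the attended summary to itself as a residual. The patch tokens are only
    read, never written, so information flows one way: patches -> queries.
    """
    dim = len(patch_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    updated = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in patch_tokens]
        weights = softmax(scores)
        summary = [sum(w * v[d] for w, v in zip(weights, patch_tokens)) for d in range(dim)]
        updated.append([qi + si for qi, si in zip(q, summary)])
    return updated

# Toy sizes for illustration only: 4 patch tokens, 2 queries, width 3.
patches = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
queries = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
patches_before = [row[:] for row in patches]
new_queries = condense(queries, patches)
```

Because `condense` returns new query vectors and never mutates `patches`, the frozen encoder's patch tokens stay exactly as the VFM produced them, which is the property the method relies on to keep the semantic latent space intact.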
Key Results
- On ImageNet 256×256, DecQ adds only 8 queries and 3.9% computational overhead to a frozen DINOv2-based RAE, improving reconstruction PSNR from 19.13 dB to 22.76 dB and reducing reconstruction FID (rFID), which indicates better detail recovery.
- In generative modeling, DecQ achieves an FID of 1.80 at 80 epochs without guidance, outperforming RAE's 2.16 and converging 3.3× faster; at 800 epochs with guidance, FID improves further to 1.05, setting a new state of the art for generation in high-dimensional VFM latent spaces.
- Ablation studies reveal that 8 queries and condenser modules attached at VFM layers 0, 3, 6, and 9 yield optimal reconstruction-generation balance. Notably, predicting detail-condensing queries benefits generation quality even when query tokens are discarded at inference.
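To put the reported PSNR gain in perspective, the short calculation below converts a PSNR difference into the corresponding ratio of mean squared errors, using the standard definition PSNR = 10·log10(MAX² / MSE). The function names are illustrative, and the calculation assumes both PSNR figures use the same peak value.

```python
import math

def psnr(mse, max_value=1.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(max_value ** 2 / mse)

def mse_ratio_from_psnr_gain(gain_db):
    """Ratio new_mse / old_mse implied by a PSNR gain of gain_db decibels."""
    return 10.0 ** (-gain_db / 10.0)

gain_db = 22.76 - 19.13          # PSNR values reported in the summary
ratio = mse_ratio_from_psnr_gain(gain_db)
print(f"PSNR gain: {gain_db:.2f} dB, implied MSE ratio: {ratio:.2f}")
```

Under that assumption, a gain of 3.63 dB corresponds to an MSE ratio of about 0.43, meaning the mean squared reconstruction error is reduced by more than half.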
Significance
DecQ addresses a critical bottleneck in RAEs: the frozen VFM encoder's limited spatial reconstruction capacity versus the semantic space disruption caused by fine-tuning. By introducing detail-condensing queries that extract complementary low-level features without modifying the VFM, DecQ significantly enhances image reconstruction fidelity and generative quality simultaneously. This advances the state-of-the-art in high-dimensional semantic latent space generation, offering a practical solution to the longstanding trade-off between reconstruction and generation. The framework's lightweight and modular design facilitates integration with existing pretrained VFMs, promising broad impact in both academic research and industrial applications requiring high-fidelity image synthesis and editing.
Technical Contribution
The key technical innovation of DecQ lies in its cross-attention-based detail-condensing queries that serve as an information bridge between frozen VFMs and the generative decoder. Unlike prior methods that fine-tune or concatenate features—often disrupting semantic consistency—DecQ preserves the original VFM latent space by unidirectional information flow from patch tokens to queries. The multi-layer condenser modules aggregate shallow and deep features, enhancing both low-level detail recovery and high-level semantic coherence. Additionally, DecQ's joint denoising of query and patch tokens in the diffusion process improves training efficiency and generation quality, extending the capabilities of RAEs with minimal computational overhead.
Novelty
DecQ is the first framework to resolve the reconstruction-generation trade-off in RAEs by introducing learnable detail-condensing queries that extract fine-grained features from intermediate VFM layers without fine-tuning the encoder. This contrasts with existing approaches that either fine-tune VFMs—risking semantic space distortion—or concatenate features, which may misalign latent spaces. DecQ's joint generation of query and patch tokens further distinguishes it by enhancing generative modeling directly, achieving simultaneous improvements in reconstruction fidelity and generation quality.
Limitations
- DecQ's performance depends on the representational capacity of the underlying pretrained VFM; in scenarios with extremely complex or highly detailed images, the fixed query capacity may limit detail recovery.
- While increasing the number of queries improves reconstruction, excessive queries introduce redundant low-level information that can degrade generation quality, indicating sensitivity to hyperparameter tuning.
- The framework has been primarily validated on DINOv2 and SigLIP2 VFMs; its generalization to other architectures and modalities remains to be thoroughly explored.
Future Work
Future research could explore adaptive mechanisms to dynamically adjust the number of detail-condensing queries and condenser layer selections to better balance reconstruction and generation across diverse datasets and resolutions. Extending DecQ to multimodal pretrained models could enhance cross-modal detail representation. Additionally, designing more efficient condenser architectures and lightweight query structures may reduce computational costs, enabling application to higher-resolution images and real-time generation scenarios.
AI Executive Summary
In the rapidly advancing field of visual generation, diffusion models have become the dominant paradigm, typically relying on a two-stage training process: first learning a tokenizer to encode images into a latent space, then training a generative model within that space. Representation Autoencoders (RAEs) have recently innovated by replacing the tokenizer encoder with a frozen pretrained Vision Foundation Model (VFM), leveraging its rich semantic features to accelerate diffusion model convergence and improve generation quality. However, freezing the VFM limits its spatial reconstruction capabilities, resulting in loss of fine-grained details such as textures and colors, which hampers detailed image generation and editing. Conversely, fine-tuning the VFM to improve reconstruction disrupts the pretrained semantic space, degrading generative fidelity and training stability.
To overcome this fundamental trade-off, the authors propose DecQ, a simple yet effective framework that introduces detail-condensing queries—lightweight learnable tokens that attend to intermediate VFM features via condenser modules. These queries extract complementary low-level visual details progressively lost in the frozen VFM's semantic latent space. During generation, both query and patch tokens are jointly denoised by a latent diffusion model and decoded together, enriching the final image with fine-grained information while preserving semantic consistency.
Technically, DecQ attaches condenser modules at multiple VFM layers (e.g., layers 0, 3, 6, and 9) to aggregate multi-level features into a small set of query tokens (typically 8). Cross-attention mechanisms ensure unidirectional information flow from patch tokens to queries, preventing any alteration of the original VFM representations. This modular design maintains the frozen VFM's semantic latent space intact while supplementing it with detailed visual cues. The approach is computationally efficient, incurring only a 3.9% increase in overhead.
Extensive experiments on ImageNet 256×256 demonstrate that DecQ improves reconstruction PSNR from 19.13 dB to 22.76 dB and reduces reconstruction FID, indicating superior detail recovery. In generative tasks, DecQ achieves an FID of 1.80 at 80 epochs without guidance, surpassing the baseline RAE's 2.16, and converges 3.3 times faster. With guidance at 800 epochs, FID further improves to 1.05, setting new state-of-the-art results for high-dimensional VFM latent space generation. Ablation studies confirm the importance of query number and condenser layer selection for balancing reconstruction and generation performance. Moreover, DecQ generalizes well across different VFMs such as SigLIP2.
This work advances the state-of-the-art in representation autoencoding by resolving the longstanding conflict between reconstruction fidelity and generative quality in frozen VFM latent spaces. Its lightweight, modular design facilitates integration with existing pretrained models, enabling richer, more detailed image synthesis without sacrificing semantic coherence. DecQ paves the way for future research in adaptive query mechanisms, multimodal extensions, and efficient architectures, with promising applications in high-fidelity image generation, editing, and beyond.
Deep Analysis
Background
Visual generation has witnessed remarkable progress with the advent of diffusion models, which generate high-quality images by iteratively denoising latent representations. Traditionally, these models rely on Variational Autoencoders (VAEs) to learn compressed latent spaces; however, VAEs often produce latent representations lacking strong semantic structure due to their reconstruction-centric objectives. To address this, Representation Autoencoders (RAEs) leverage frozen pretrained Vision Foundation Models (VFMs) such as DINOv2 and SigLIP2 as tokenizers, capitalizing on their rich semantic features learned via self-supervised or multimodal training. This approach accelerates diffusion model convergence and enhances generation quality by operating in a semantically meaningful high-dimensional latent space. Despite these advantages, VFMs are typically trained with objectives emphasizing semantic invariance rather than pixel-level fidelity, resulting in latent representations that inadequately preserve low-level visual details like color and texture. Consequently, RAEs built on frozen VFMs exhibit limited spatial reconstruction capacity, hindering fine-grained image generation and editing. Prior attempts to fine-tune VFMs or augment latent spaces with reconstruction-oriented features often disrupt the semantic latent space, leading to degraded generative performance and slower convergence. Thus, balancing semantic consistency with reconstruction fidelity remains a critical challenge in RAE design.
Core Problem
The core challenge addressed in this work is the inherent trade-off in RAEs between preserving the semantic latent space of frozen VFMs and achieving high-fidelity image reconstruction. Frozen VFMs provide stable, semantically rich representations that facilitate fast and high-quality generation but lack sensitivity to low-level details, causing reconstruction artifacts such as texture loss and color shifts. Fine-tuning VFMs to enhance reconstruction introduces conflicting objectives, perturbing the pretrained semantic space and impairing generative fidelity. Alternative methods that concatenate reconstruction features with semantic tokens risk misaligning the latent space, hindering downstream diffusion training. This trade-off limits the practical utility of RAEs in applications requiring both detailed reconstruction and high-quality generation. Therefore, a mechanism is needed to supplement frozen VFM representations with complementary low-level information without modifying the encoder parameters or disrupting semantic consistency, enabling simultaneous improvements in reconstruction and generation.
Innovation
The principal innovation of DecQ lies in its introduction of detail-condensing queries—learnable tokens that attend to intermediate features of a frozen VFM via cross-attention condenser modules. This design achieves several key advances: (1) It preserves the frozen VFM's semantic latent space by ensuring unidirectional information flow from patch tokens to queries, preventing any modification of pretrained representations. (2) It aggregates multi-level features from shallow to deep VFM layers, capturing both low-level details and high-level semantics, thus addressing the reconstruction-generation trade-off. (3) The joint denoising of query and patch tokens during diffusion training integrates fine-grained detail recovery directly into the generative process, enhancing convergence speed and output quality. Unlike prior approaches that rely on fine-tuning or feature concatenation, DecQ maintains semantic consistency while enriching detail representation, enabling simultaneous improvements in reconstruction fidelity and generative performance with minimal computational overhead.
Methodology
- Utilize a frozen Vision Foundation Model (e.g., DINOv2-B) as the encoder, producing semantic patch tokens representing image regions.
- Attach condenser modules at selected intermediate VFM layers (default: layers 0, 3, 6, 9). Each condenser comprises a cross-attention block and a feed-forward network (FFN).
- Initialize a small set of learnable detail-condensing query tokens (typically 8) that serve as queries in the cross-attention, attending to the intermediate patch tokens (keys and values) to extract complementary low-level features.
- Ensure unidirectional information flow from patch tokens to queries to preserve the frozen VFM's latent space integrity.
- Project query and patch tokens separately with positional embeddings (2D sinusoidal for patches, learnable for queries), then concatenate and input into a ViT-based decoder for image reconstruction.
- During training, add noise to both query and patch tokens and jointly optimize a flow matching diffusion objective to denoise and reconstruct images.
- During generation, jointly sample and denoise query and patch tokens in the latent diffusion model, then decode to produce high-fidelity images.
- Evaluate on ImageNet 256×256 with metrics including PSNR, SSIM, reconstruction FID (rFID), and generation FID, IS, Precision, Recall.
- Conduct ablation studies on query number, condenser layer selection, and training paradigms to optimize the reconstruction-generation trade-off.
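The decoder-input step in the list above (separate positional embeddings for the two token streams, then concatenation) can be sketched as follows. This is an illustration under stated assumptions: it uses one common 2D sinusoidal variant (half the channels encode the row, half the column), omits the separate linear projections, places query tokens before patch tokens, and stands in fixed constants for the learned query embeddings. All function names are hypothetical.

```python
import math

def sincos_1d(position, dim):
    """1D sinusoidal embedding of a scalar position (dim must be even)."""
    half = dim // 2
    freqs = [1.0 / (10000.0 ** (i / half)) for i in range(half)]
    return [math.sin(position * f) for f in freqs] + [math.cos(position * f) for f in freqs]

def sincos_2d(row, col, dim):
    """2D variant: first half of the channels encodes the row, second half the column."""
    return sincos_1d(row, dim // 2) + sincos_1d(col, dim // 2)

def build_decoder_input(patch_tokens, grid_size, query_tokens, query_pos):
    """Add positions to each stream, then concatenate (queries first, by assumption)."""
    dim = len(patch_tokens[0])
    patches = []
    for idx, tok in enumerate(patch_tokens):
        row, col = divmod(idx, grid_size)      # row-major patch order
        pos = sincos_2d(row, col, dim)
        patches.append([t + p for t, p in zip(tok, pos)])
    queries = [[t + p for t, p in zip(tok, pos)]
               for tok, pos in zip(query_tokens, query_pos)]
    return queries + patches

# Toy sizes for illustration only: a 2x2 patch grid, 2 queries, width 4.
patch_tokens = [[0.0] * 4 for _ in range(4)]
query_tokens = [[0.0] * 4 for _ in range(2)]
query_pos = [[0.1] * 4, [0.2] * 4]             # stand-in for learned parameters
sequence = build_decoder_input(patch_tokens, 2, query_tokens, query_pos)
```

Queries have no spatial location, which is a plausible reason for giving them learned rather than sinusoidal positions; the sketch reflects that by passing `query_pos` in as free parameters.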
Experiments
Experiments are conducted on the ImageNet dataset at 256×256 resolution, using DINOv2-B as the default frozen VFM encoder and a ViT-XL decoder with approximately 500 million parameters. Baselines include the original RAE with frozen VFM, fine-tuned VFM variants with and without distillation losses, and feature concatenation methods. Reconstruction quality is assessed via PSNR, SSIM, and rFID, while generative performance is measured using FID, Inception Score (IS), Precision, and Recall. The diffusion model employs 50 sampling steps, with query-token loss weight set to 1. Ablation studies explore the impact of varying the number of detail-condensing queries, condenser module placement across VFM layers, and the effect of discarding query tokens at inference. Additional experiments validate DecQ's generalization on alternative VFMs such as SigLIP2-B. Results demonstrate DecQ's superior reconstruction and generation performance with minimal computational overhead and faster convergence compared to baselines.
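The joint noising objective and the 50-step sampling mentioned here can be sketched in a few lines. The summary does not specify the exact flow matching formulation or the sampler, so this example assumes a linear interpolation path between data (t = 0) and Gaussian noise (t = 1), a velocity target of noise minus data, and a plain Euler integrator. The function names are hypothetical, and the oracle velocity function stands in for a trained network.

```python
import random

def interpolate(x0, noise, t):
    """Point on the linear path between data (t=0) and noise (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, noise)]

def velocity_target(x0, noise):
    """Velocity of the linear path; constant in t under this assumption."""
    return [b - a for a, b in zip(x0, noise)]

def joint_loss(pred, target, num_queries, query_weight=1.0):
    """Weighted squared error over all tokens, query tokens listed first.

    pred and target are lists of tokens (each a list of floats). Query tokens
    are scaled by query_weight; the sum is divided by the element count.
    """
    total, count = 0.0, 0
    for i, (p_tok, t_tok) in enumerate(zip(pred, target)):
        w = query_weight if i < num_queries else 1.0
        for p, t in zip(p_tok, t_tok):
            total += w * (p - t) ** 2
            count += 1
    return total / count

def euler_sample(velocity_fn, noise, num_steps=50):
    """Integrate from t=1 (noise) down to t=0 (data) with fixed Euler steps."""
    x = list(noise)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = 1.0 - step * dt
        v = velocity_fn(x, t)
        x = [xi - dt * vi for xi, vi in zip(x, v)]
    return x

# Toy check with an oracle velocity instead of a trained model.
rng = random.Random(0)
x0 = [0.5, -1.0, 2.0, 0.0]                     # flattened query + patch latents
noise = [rng.gauss(0.0, 1.0) for _ in x0]
oracle = lambda x, t: velocity_target(x0, noise)
recovered = euler_sample(oracle, noise, num_steps=50)
```

With the oracle velocity the sampler recovers `x0` up to floating-point error, which confirms that the sign and time conventions are consistent. In practice the velocity would come from the diffusion model, which predicts it for query and patch tokens together.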
Results
DecQ improves reconstruction PSNR from 19.13 dB to 22.76 dB and reduces reconstruction FID compared to the frozen VFM RAE baseline, indicating enhanced recovery of low-level details. In generative tasks, DecQ achieves an FID of 1.80 at 80 epochs without guidance, outperforming RAE's 2.16 and converging 3.3 times faster. With guidance at 800 epochs, FID further improves to 1.05, setting new state-of-the-art results for high-dimensional VFM latent space generation. Ablation studies confirm that 8 detail-condensing queries and condenser modules at layers 0, 3, 6, and 9 provide the best balance between reconstruction fidelity and generation quality. Notably, even when query tokens are discarded at inference, predicting them during training benefits the generation of patch tokens, highlighting the queries' auxiliary role. DecQ also generalizes effectively across different VFMs such as SigLIP2-B.
Applications
DecQ is well-suited for applications demanding high-fidelity image reconstruction and generation, such as medical imaging, satellite image analysis, and artistic content creation. Its ability to preserve semantic consistency while enhancing fine-grained details benefits image editing and virtual reality scenarios requiring precise texture and color reproduction. The framework's modular design facilitates integration with existing pretrained VFMs, reducing training costs and enabling rapid deployment. Furthermore, DecQ's approach can inform multi-task vision systems that require simultaneous semantic understanding and detailed reconstruction, advancing both academic research and practical deployments in computer vision and graphics.
Limitations & Outlook
DecQ's reliance on the representational capacity of pretrained VFMs may limit its effectiveness in extremely complex or detail-rich image domains where fixed query capacity constrains detail extraction. The number of detail-condensing queries and condenser layer selection require careful tuning; excessive queries can introduce redundant low-level information that hampers generative modeling. Additionally, while validated on DINOv2 and SigLIP2, the framework's adaptability to other VFM architectures and modalities remains to be fully explored. Computational overhead, though modest, may increase with higher-resolution or larger-scale applications, necessitating further efficiency improvements.
Abstract
Representation Autoencoders (RAEs) leverage frozen vision foundation models (VFMs) as tokenizer encoders, providing robust high-level representations that facilitate fast convergence and high-quality generation in latent diffusion models. However, freezing the VFM inherently constrains its spatial reconstruction capacity, limiting fine-grained generation and image editing; in contrast, incorporating reconstruction-oriented signals via fine-tuning disrupts the pretrained semantic space and degrades generative fidelity. To address this trade-off, we propose DecQ, a simple yet effective framework for RAEs. Specifically, DecQ introduces lightweight detail-condensing queries that extract fine-grained information from intermediate VFM features through condenser modules. These queries are incorporated into the decoder to support reconstruction and are jointly generated with patch tokens during generative modeling. By aggregating information from both shallow and deep layers, DecQ effectively mitigates the reconstruction-generation trade-off, improving both reconstruction quality and generative performance. Our experiments demonstrate that: (1) with only 8 additional queries and 3.9% extra computation, DecQ improves reconstruction over the frozen DINOv2-based RAE, increasing PSNR from 19.13 dB to 22.76 dB; and (2) for generative modeling, DecQ achieves 3.3× faster convergence than RAE, attaining an FID of 1.41 without guidance and 1.05 with guidance.