DecQ: Detail-Condensing Queries for Enhanced Reconstruction and Generation in Representation Autoencoders

TL;DR

DecQ adds detail-condensing queries to Representation Autoencoders (RAEs), raising reconstruction PSNR from 19.13 dB to 22.76 dB and lowering generation FID to 1.41 without guidance (1.05 with guidance) at only 3.9% extra computation.

cs.CV 2026-05-22

Tianhang Wang Yitong Chen Wei Song Zuxuan Wu Min Li Jiaqi Wang


Representation Autoencoders Vision Foundation Models Detail-Condensing Queries Image Generation Diffusion Models

Key Findings

Methodology

DecQ is a novel framework enhancing Representation Autoencoders (RAEs) by introducing lightweight detail-condensing queries that extract fine-grained information from intermediate layers of frozen Vision Foundation Models (VFMs) via condenser modules. These queries complement the semantic patch tokens from the VFM encoder and are jointly input into a ViT-based decoder. During generation, both query and patch tokens are jointly denoised in the latent diffusion model, preserving the original semantic latent space while enriching low-level visual details. This design mitigates the inherent trade-off between reconstruction fidelity and generative quality in RAEs by aggregating multi-level features without fine-tuning the VFM, thus maintaining semantic consistency and improving both reconstruction and generation.

Key Results

On ImageNet 256×256, DecQ adds only 8 queries and 3.9% computational overhead to a frozen DINOv2-based RAE, improving reconstruction PSNR from 19.13dB to 22.76dB and significantly reducing reconstruction FID (rFID), demonstrating enhanced detail recovery.
In generative modeling, DecQ achieves an FID of 1.80 at 80 epochs without guidance, outperforming RAE's 2.16 and converging 3.3× faster; at 800 epochs with guidance, FID improves further to 1.05, setting a new state of the art for generation in high-dimensional VFM latent spaces.
Ablation studies reveal that 8 queries and condenser modules attached at VFM layers 0, 3, 6, and 9 yield optimal reconstruction-generation balance. Notably, predicting detail-condensing queries benefits generation quality even when query tokens are discarded at inference.

Significance

DecQ addresses a critical bottleneck in RAEs: the frozen VFM encoder's limited spatial reconstruction capacity versus the semantic space disruption caused by fine-tuning. By introducing detail-condensing queries that extract complementary low-level features without modifying the VFM, DecQ significantly enhances image reconstruction fidelity and generative quality simultaneously. This advances the state-of-the-art in high-dimensional semantic latent space generation, offering a practical solution to the longstanding trade-off between reconstruction and generation. The framework's lightweight and modular design facilitates integration with existing pretrained VFMs, promising broad impact in both academic research and industrial applications requiring high-fidelity image synthesis and editing.

Technical Contribution

The key technical innovation of DecQ lies in its cross-attention-based detail-condensing queries that serve as an information bridge between frozen VFMs and the generative decoder. Unlike prior methods that fine-tune or concatenate features—often disrupting semantic consistency—DecQ preserves the original VFM latent space by unidirectional information flow from patch tokens to queries. The multi-layer condenser modules aggregate shallow and deep features, enhancing both low-level detail recovery and high-level semantic coherence. Additionally, DecQ's joint denoising of query and patch tokens in the diffusion process improves training efficiency and generation quality, extending the capabilities of RAEs with minimal computational overhead.

Novelty

DecQ addresses the reconstruction-generation trade-off in RAEs by introducing learnable detail-condensing queries that extract fine-grained features from intermediate VFM layers without fine-tuning the encoder. Existing approaches instead either fine-tune the VFM, which risks distorting the semantic space, or concatenate features, which may misalign latent spaces. DecQ is further distinguished by generating query and patch tokens jointly, which improves generative modeling directly and yields simultaneous gains in reconstruction fidelity and generation quality.

Limitations

DecQ's performance depends on the representational capacity of the underlying pretrained VFM; in scenarios with extremely complex or highly detailed images, the fixed query capacity may limit detail recovery.
While increasing the number of queries improves reconstruction, excessive queries introduce redundant low-level information that can degrade generation quality, indicating sensitivity to hyperparameter tuning.
The framework has been primarily validated on DINOv2 and SigLIP2 VFMs; its generalization to other architectures and modalities remains to be thoroughly explored.

Future Work

Future research could explore adaptive mechanisms to dynamically adjust the number of detail-condensing queries and condenser layer selections to better balance reconstruction and generation across diverse datasets and resolutions. Extending DecQ to multimodal pretrained models could enhance cross-modal detail representation. Additionally, designing more efficient condenser architectures and lightweight query structures may reduce computational costs, enabling application to higher-resolution images and real-time generation scenarios.

AI Executive Summary

In the rapidly advancing field of visual generation, diffusion models have become the dominant paradigm, typically relying on a two-stage training process: first learning a tokenizer to encode images into a latent space, then training a generative model within that space. Representation Autoencoders (RAEs) have recently innovated by replacing the tokenizer encoder with a frozen pretrained Vision Foundation Model (VFM), leveraging its rich semantic features to accelerate diffusion model convergence and improve generation quality. However, freezing the VFM limits its spatial reconstruction capabilities, resulting in loss of fine-grained details such as textures and colors, which hampers detailed image generation and editing. Conversely, fine-tuning the VFM to improve reconstruction disrupts the pretrained semantic space, degrading generative fidelity and training stability.

To overcome this fundamental trade-off, the authors propose DecQ, a simple yet effective framework that introduces detail-condensing queries—lightweight learnable tokens that attend to intermediate VFM features via condenser modules. These queries extract complementary low-level visual details progressively lost in the frozen VFM's semantic latent space. During generation, both query and patch tokens are jointly denoised by a latent diffusion model and decoded together, enriching the final image with fine-grained information while preserving semantic consistency.

Technically, DecQ attaches condenser modules at multiple VFM layers (e.g., layers 0, 3, 6, and 9) to aggregate multi-level features into a small set of query tokens (typically 8). Cross-attention mechanisms ensure unidirectional information flow from patch tokens to queries, preventing any alteration of the original VFM representations. This modular design maintains the frozen VFM's semantic latent space intact while supplementing it with detailed visual cues. The approach is computationally efficient, incurring only a 3.9% increase in overhead.
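The condenser mechanism described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes single-head attention with identity projections (a real condenser would have learned query, key, and value projections plus the feed-forward network), a simple residual update, and random stand-in features for a hypothetical 12-layer frozen VFM. The names `cross_attend`, `condense`, and `layer_features` are illustrative only.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, patches):
    """Single-head cross-attention: queries read from patches.

    Assumption: identity projections, keys = values = patch tokens.
    Only the queries are updated; the patch tokens are never written to.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    updated = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, p)) for p in patches]
        weights = softmax(scores)
        context = [sum(w * p[j] for w, p in zip(weights, patches))
                   for j in range(dim)]
        # Residual update of the query with the attended patch content.
        updated.append([qi + ci for qi, ci in zip(q, context)])
    return updated


def condense(queries, layer_features, tap_layers=(0, 3, 6, 9)):
    """Pass the same query set through one condenser per tapped layer."""
    for layer in tap_layers:
        queries = cross_attend(queries, layer_features[layer])
    return queries


random.seed(0)
DIM, NUM_PATCHES, NUM_QUERIES, NUM_LAYERS = 4, 6, 8, 12

# Stand-in for intermediate outputs of a frozen VFM (random toy data).
layer_features = [[[random.gauss(0, 1) for _ in range(DIM)]
                   for _ in range(NUM_PATCHES)]
                  for _ in range(NUM_LAYERS)]
snapshot = [[list(tok) for tok in layer] for layer in layer_features]

# Learnable query tokens would be trained; here they are small random values.
queries = [[random.gauss(0, 0.02) for _ in range(DIM)]
           for _ in range(NUM_QUERIES)]

detail_queries = condense(queries, layer_features)
print(len(detail_queries), "query tokens of width", len(detail_queries[0]))
print("VFM features unchanged:", layer_features == snapshot)
```

Because the patch tokens appear only as keys and values, nothing flows back into the VFM features, which is the property the unidirectional design relies on.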

Extensive experiments on ImageNet 256×256 demonstrate that DecQ significantly improves reconstruction PSNR from 19.13dB to 22.76dB and reduces reconstruction FID, indicating superior detail recovery. In generative tasks, DecQ achieves an FID of 1.80 at 80 epochs without guidance—surpassing the baseline RAE's 2.16—and converges 3.3 times faster. With guidance at 800 epochs, FID further improves to 1.05, setting new state-of-the-art results for high-dimensional VFM latent space generation. Ablation studies confirm the importance of query number and condenser layer selection for balancing reconstruction and generation performance. Moreover, DecQ generalizes well across different VFMs such as SigLIP2.

This work advances the state-of-the-art in representation autoencoding by resolving the longstanding conflict between reconstruction fidelity and generative quality in frozen VFM latent spaces. Its lightweight, modular design facilitates integration with existing pretrained models, enabling richer, more detailed image synthesis without sacrificing semantic coherence. DecQ paves the way for future research in adaptive query mechanisms, multimodal extensions, and efficient architectures, with promising applications in high-fidelity image generation, editing, and beyond.

Deep Analysis

Background

Visual generation has witnessed remarkable progress with the advent of diffusion models, which generate high-quality images by iteratively denoising latent representations. Traditionally, these models rely on Variational Autoencoders (VAEs) to learn compressed latent spaces; however, VAEs often produce latent representations lacking strong semantic structure due to their reconstruction-centric objectives. To address this, Representation Autoencoders (RAEs) leverage frozen pretrained Vision Foundation Models (VFMs) such as DINOv2 and SigLIP2 as tokenizers, capitalizing on their rich semantic features learned via self-supervised or multimodal training. This approach accelerates diffusion model convergence and enhances generation quality by operating in a semantically meaningful high-dimensional latent space. Despite these advantages, VFMs are typically trained with objectives emphasizing semantic invariance rather than pixel-level fidelity, resulting in latent representations that inadequately preserve low-level visual details like color and texture. Consequently, RAEs built on frozen VFMs exhibit limited spatial reconstruction capacity, hindering fine-grained image generation and editing. Prior attempts to fine-tune VFMs or augment latent spaces with reconstruction-oriented features often disrupt the semantic latent space, leading to degraded generative performance and slower convergence. Thus, balancing semantic consistency with reconstruction fidelity remains a critical challenge in RAE design.

Core Problem

The core challenge addressed in this work is the inherent trade-off in RAEs between preserving the semantic latent space of frozen VFMs and achieving high-fidelity image reconstruction. Frozen VFMs provide stable, semantically rich representations that facilitate fast and high-quality generation but lack sensitivity to low-level details, causing reconstruction artifacts such as texture loss and color shifts. Fine-tuning VFMs to enhance reconstruction introduces conflicting objectives, perturbing the pretrained semantic space and impairing generative fidelity. Alternative methods that concatenate reconstruction features with semantic tokens risk misaligning the latent space, hindering downstream diffusion training. This trade-off limits the practical utility of RAEs in applications requiring both detailed reconstruction and high-quality generation. Therefore, a mechanism is needed to supplement frozen VFM representations with complementary low-level information without modifying the encoder parameters or disrupting semantic consistency, enabling simultaneous improvements in reconstruction and generation.

Innovation

The principal innovation of DecQ lies in its introduction of detail-condensing queries—learnable tokens that attend to intermediate features of a frozen VFM via cross-attention condenser modules. This design achieves several key advances: (1) It preserves the frozen VFM's semantic latent space by ensuring unidirectional information flow from patch tokens to queries, preventing any modification of pretrained representations. (2) It aggregates multi-level features from shallow to deep VFM layers, capturing both low-level details and high-level semantics, thus addressing the reconstruction-generation trade-off. (3) The joint denoising of query and patch tokens during diffusion training integrates fine-grained detail recovery directly into the generative process, enhancing convergence speed and output quality. Unlike prior approaches that rely on fine-tuning or feature concatenation, DecQ maintains semantic consistency while enriching detail representation, enabling simultaneous improvements in reconstruction fidelity and generative performance with minimal computational overhead.
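Point (3), joint denoising of query and patch tokens, can be sketched as a flow-matching training target over one concatenated token sequence. This is a hedged illustration, not the paper's code: it assumes the linear interpolation x_t = (1 - t) x_0 + t ε with velocity target ε - x_0 (one common convention), places the queries first in the sequence, and uses a hypothetical `query_weight` argument for the query-token loss term.

```python
import random


def flow_matching_pair(clean, noise, t):
    """Return (x_t, velocity target) for x_t = (1 - t) * x0 + t * eps."""
    noisy = [[(1 - t) * c + t * n for c, n in zip(ct, nt)]
             for ct, nt in zip(clean, noise)]
    target = [[n - c for c, n in zip(ct, nt)]
              for ct, nt in zip(clean, noise)]
    return noisy, target


def mse(pred, target):
    """Mean squared error over all token entries."""
    sq = [(p - q) ** 2 for pt, tt in zip(pred, target) for p, q in zip(pt, tt)]
    return sum(sq) / len(sq)


def joint_loss(pred, target, num_queries, query_weight=1.0):
    """Patch-token loss plus a weighted query-token loss.

    Assumption: queries occupy the first `num_queries` positions.
    """
    query_term = mse(pred[:num_queries], target[:num_queries])
    patch_term = mse(pred[num_queries:], target[num_queries:])
    return patch_term + query_weight * query_term


random.seed(0)
NUM_QUERIES, NUM_PATCHES, DIM = 8, 16, 4
total = NUM_QUERIES + NUM_PATCHES

# Toy latents: 8 query tokens followed by 16 patch tokens.
clean = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(total)]
noise = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(total)]

noisy, target = flow_matching_pair(clean, noise, t=0.5)

# A model that outputs zeros incurs a positive loss; a perfect one incurs none.
zero_pred = [[0.0] * DIM for _ in range(total)]
print("loss of zero prediction:", joint_loss(zero_pred, target, NUM_QUERIES))
print("loss of perfect prediction:", joint_loss(target, target, NUM_QUERIES))
```

Both token types receive noise at the same timestep and are scored in the same objective, so the generative model learns the detail queries alongside the semantic patch tokens.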

Methodology

Utilize a frozen Vision Foundation Model (e.g., DINOv2-B) as the encoder, producing semantic patch tokens representing image regions.

Attach condenser modules at selected intermediate VFM layers (default: layers 0, 3, 6, 9). Each condenser comprises a cross-attention block and a feed-forward network (FFN).

Initialize a small set of learnable detail-condensing query tokens (typically 8) that serve as queries in the cross-attention, attending to the intermediate patch tokens (keys and values) to extract complementary low-level features.

Ensure unidirectional information flow from patch tokens to queries to preserve the frozen VFM's latent space integrity.

Project query and patch tokens separately with positional embeddings (2D sinusoidal for patches, learnable for queries), then concatenate and input into a ViT-based decoder for image reconstruction.

During training, add noise to both query and patch tokens and jointly optimize a flow matching diffusion objective to denoise and reconstruct images.

During generation, jointly sample and denoise query and patch tokens in the latent diffusion model, then decode to produce high-fidelity images.

Evaluate on ImageNet 256×256 with metrics including PSNR, SSIM, reconstruction FID (rFID), and generation FID, IS, Precision, Recall.

Conduct ablation studies on query number, condenser layer selection, and training paradigms to optimize the reconstruction-generation trade-off.
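The token-assembly step before the decoder (positional embeddings, then concatenation) can be sketched as follows. This is an illustration under stated assumptions: the 2D sinusoidal embedding uses one common formulation (half the channels encode the row, half the column), the separate projection layers are omitted, the learnable query embeddings are replaced by constant placeholders, and queries are placed before patches. The helper names are hypothetical.

```python
import math


def sincos_1d(pos, dim):
    """1D sinusoidal embedding of length `dim` (dim must be even)."""
    half = dim // 2
    freqs = [1.0 / (10000 ** (i / half)) for i in range(half)]
    return ([math.sin(pos * f) for f in freqs]
            + [math.cos(pos * f) for f in freqs])


def sincos_2d(grid_h, grid_w, dim):
    """One embedding per patch in row-major order.

    First half of the channels encodes the row, second half the column.
    """
    assert dim % 4 == 0, "dim must be divisible by 4"
    return [sincos_1d(r, dim // 2) + sincos_1d(c, dim // 2)
            for r in range(grid_h) for c in range(grid_w)]


def add_positions(tokens, positions):
    """Element-wise sum of tokens and their positional embeddings."""
    return [[x + p for x, p in zip(tok, pos)]
            for tok, pos in zip(tokens, positions)]


def build_decoder_input(query_tokens, patch_tokens, query_pos, grid_h, grid_w):
    """Concatenate position-encoded queries and patches into one sequence."""
    dim = len(patch_tokens[0])
    patches = add_positions(patch_tokens, sincos_2d(grid_h, grid_w, dim))
    queries = add_positions(query_tokens, query_pos)
    return queries + patches  # assumption: queries come first


DIM, GRID, NUM_QUERIES = 8, 2, 8
patch_tokens = [[0.1] * DIM for _ in range(GRID * GRID)]
query_tokens = [[0.2] * DIM for _ in range(NUM_QUERIES)]
# Placeholder for learnable query position embeddings (would be trained).
query_pos = [[0.01 * i] * DIM for i in range(NUM_QUERIES)]

sequence = build_decoder_input(query_tokens, patch_tokens, query_pos, GRID, GRID)
print("decoder sequence length:", len(sequence))
```

The decoder then sees a single sequence whose length is the patch count plus the number of queries, so 8 queries add only a handful of tokens to the input.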

Experiments

Experiments are conducted on the ImageNet dataset at 256×256 resolution, using DINOv2-B as the default frozen VFM encoder and a ViT-XL decoder with approximately 500 million parameters. Baselines include the original RAE with frozen VFM, fine-tuned VFM variants with and without distillation losses, and feature concatenation methods. Reconstruction quality is assessed via PSNR, SSIM, and rFID, while generative performance is measured using FID, Inception Score (IS), Precision, and Recall. The diffusion model employs 50 sampling steps, with query-token loss weight set to 1. Ablation studies explore the impact of varying the number of detail-condensing queries, condenser module placement across VFM layers, and the effect of discarding query tokens at inference. Additional experiments validate DecQ's generalization on alternative VFMs such as SigLIP2-B. Results demonstrate DecQ's superior reconstruction and generation performance with minimal computational overhead and faster convergence compared to baselines.
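To make the 50-step joint sampling concrete, here is a small sketch of an Euler sampler over the concatenated query and patch tokens. The text above does not say which sampler is used, so the Euler scheme is an assumption, as is the convention that t = 1 is pure noise and t = 0 is data. The trained network is replaced by `toy_velocity`, a hypothetical closed-form velocity field whose data distribution is a single known point, so that the result can be checked.

```python
import random


def euler_sample(velocity_fn, noise, num_steps=50):
    """Integrate dx/dt = v(x, t) from t = 1 (noise) down to t = 0 (data)."""
    x = [list(tok) for tok in noise]
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 - i * dt
        v = velocity_fn(x, t)
        x = [[xi - dt * vi for xi, vi in zip(xt, vt)]
             for xt, vt in zip(x, v)]
    return x


random.seed(0)
NUM_QUERIES, NUM_PATCHES, DIM = 8, 16, 4
total = NUM_QUERIES + NUM_PATCHES

# Toy "data": one fixed joint latent (queries first, then patches).
target = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(total)]


def toy_velocity(x, t):
    """Exact velocity when all data mass sits at `target`.

    Under x_t = (1 - t) * x0 + t * eps, the velocity eps - x0 equals
    (x_t - x0) / t. A trained diffusion model would replace this function.
    """
    return [[(xi - ti) / t for xi, ti in zip(xt, tt)]
            for xt, tt in zip(x, target)]


noise = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(total)]
sample = euler_sample(toy_velocity, noise, num_steps=50)

# Split the jointly denoised sequence back into the two token groups.
query_tokens, patch_tokens = sample[:NUM_QUERIES], sample[NUM_QUERIES:]
max_err = max(abs(a - b) for s, g in zip(sample, target) for a, b in zip(s, g))
print(len(query_tokens), len(patch_tokens), max_err)
```

The query and patch tokens share one state and one timestep throughout sampling and are only separated afterwards, when both groups are handed to the decoder.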

Results

DecQ significantly improves reconstruction PSNR from 19.13dB to 22.76dB and reduces reconstruction FID compared to the frozen VFM RAE baseline, indicating enhanced recovery of low-level details. In generative tasks, DecQ achieves an FID of 1.80 at 80 epochs without guidance, outperforming RAE's 2.16 and converging 3.3 times faster. With guidance at 800 epochs, FID further improves to 1.05, setting new state-of-the-art results for high-dimensional VFM latent space generation. Ablation studies confirm that 8 detail-condensing queries and condenser modules at layers 0, 3, 6, and 9 provide the best balance between reconstruction fidelity and generation quality. Notably, even when query tokens are discarded at inference, predicting them during training benefits the generation of patch tokens, highlighting the queries' auxiliary role. DecQ also generalizes effectively across different VFMs such as SigLIP2-B.
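For intuition about the PSNR gain, the short calculation below converts the two reported PSNR values into implied mean squared errors using the standard definition PSNR = 10 log10(MAX² / MSE). It treats each reported figure as if it were a single-image value with pixel range [0, 1]; since benchmark PSNR is usually averaged over images, the resulting ratio is only an approximate reading.

```python
import math


def psnr(mse, max_value=1.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_from_psnr(psnr_db, max_value=1.0):
    """Invert the PSNR definition to recover the implied MSE."""
    return max_value ** 2 / (10.0 ** (psnr_db / 10.0))


baseline_mse = mse_from_psnr(19.13)   # frozen DINOv2-based RAE
decq_mse = mse_from_psnr(22.76)       # DecQ

mse_reduction = baseline_mse / decq_mse
rmse_reduction = math.sqrt(mse_reduction)

print(f"gain: {22.76 - 19.13:.2f} dB")
print(f"implied MSE reduction: {mse_reduction:.2f}x")
print(f"implied RMSE reduction: {rmse_reduction:.2f}x")
```

Under these assumptions, the 3.63 dB gain corresponds to roughly a 2.3× lower mean squared error, or about a 1.5× lower root-mean-square pixel error.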

Applications

DecQ is well-suited for applications demanding high-fidelity image reconstruction and generation, such as medical imaging, satellite image analysis, and artistic content creation. Its ability to preserve semantic consistency while enhancing fine-grained details benefits image editing and virtual reality scenarios requiring precise texture and color reproduction. The framework's modular design facilitates integration with existing pretrained VFMs, reducing training costs and enabling rapid deployment. Furthermore, DecQ's approach can inform multi-task vision systems that require simultaneous semantic understanding and detailed reconstruction, advancing both academic research and practical deployments in computer vision and graphics.

Limitations & Outlook

DecQ's reliance on the representational capacity of pretrained VFMs may limit its effectiveness in extremely complex or detail-rich image domains where fixed query capacity constrains detail extraction. The number of detail-condensing queries and condenser layer selection require careful tuning; excessive queries can introduce redundant low-level information that hampers generative modeling. Additionally, while validated on DINOv2 and SigLIP2, the framework's adaptability to other VFM architectures and modalities remains to be fully explored. Computational overhead, though modest, may increase with higher-resolution or larger-scale applications, necessitating further efficiency improvements.

Abstract

Representation Autoencoders (RAEs) leverage frozen vision foundation models (VFMs) as tokenizer encoders, providing robust high-level representations that facilitate fast convergence and high-quality generation in latent diffusion models. However, freezing the VFM inherently constrains its spatial reconstruction capacity, limiting fine-grained generation and image editing; in contrast, incorporating reconstruction-oriented signals via fine-tuning disrupts the pretrained semantic space and degrades generative fidelity. To address this trade-off, we propose DecQ, a simple yet effective framework for RAEs. Specifically, DecQ introduces lightweight detail-condensing queries that extract fine-grained information from intermediate VFM features through condenser modules. These queries are incorporated into the decoder to support reconstruction and are jointly generated with patch tokens during generative modeling. By aggregating information from both shallow and deep layers, DecQ effectively mitigates the reconstruction-generation trade-off, improving both reconstruction quality and generative performance. Our experiments demonstrate that: (1) with only 8 additional queries and 3.9% extra computation, DecQ improves reconstruction over the frozen DINOv2-based RAE, increasing PSNR from 19.13 dB to 22.76 dB; and (2) for generative modeling, DecQ achieves 3.3× faster convergence than RAE, attaining an FID of 1.41 without guidance and 1.05 with guidance.


