DiT-IC: Aligned Diffusion Transformer for Efficient Image Compression

TL;DR

DiT-IC performs diffusion entirely in a 32x downscaled latent space with an aligned diffusion transformer, achieving efficient image compression and up to 30x faster decoding than existing diffusion codecs.

eess.IV Β· 2026-03-14
Junqi Shi, Ming Lu, Xingchen Li, Anle Ke, Ruiqi Zhang, Zhan Ma
image compression Β· diffusion model Β· transformer Β· deep learning Β· computational efficiency

Key Findings

Methodology

DiT-IC employs an aligned diffusion transformer framework with three key alignment mechanisms for efficient image compression:

  β€’ Variance-guided reconstruction flow adjusts denoising strength based on latent uncertainty.
  β€’ Self-distillation alignment enforces consistency with encoder-defined latent geometry, enabling one-step diffusion.
  β€’ Latent-conditioned guidance replaces text prompts with latent conditions, enabling text-free inference.

Together, these mechanisms allow DiT-IC to perform diffusion in a 32x downscaled latent space, greatly improving computational efficiency.
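The paper's implementation is not included in this summary. As a rough illustration of the single-step idea, the toy NumPy sketch below maps latent variance to a pseudo-timestep and applies one denoising transformation; the function names, the variance-to-timestep mapping, and the stand-in model are all assumptions for illustration, not DiT-IC's actual design.

```python
import numpy as np

np.random.seed(0)

def pseudo_timestep(latent_var, t_max=1000.0):
    """Map per-element latent variance to a pseudo-timestep:
    higher uncertainty -> stronger denoising (illustrative mapping)."""
    v = np.clip(latent_var / (latent_var.max() + 1e-8), 0.0, 1.0)
    return v * t_max

def one_step_denoise(latent, latent_var, model):
    """Collapse iterative denoising into a single transformation:
    the model predicts the clean latent directly, conditioned on a
    variance-derived pseudo-timestep instead of a sampled schedule."""
    t = pseudo_timestep(latent_var)
    return model(latent, t)

# Stand-in "model": attenuates elements in proportion to their pseudo-timestep.
toy_model = lambda z, t: z * (1.0 - 0.5 * t / 1000.0)

z = np.random.randn(1, 16, 64, 64)            # 32x-downscaled latent of a 2048x2048 image
var = np.abs(np.random.randn(1, 16, 64, 64))  # per-element uncertainty estimate
z_hat = one_step_denoise(z, var, toy_model)
print(z_hat.shape)  # (1, 16, 64, 64)
```

Note the latent grid: a 2048x2048 image at 32x downscaling leaves only a 64x64 spatial grid for the transformer to process, which is where the decoding speedup comes from.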

Key Results

  • DiT-IC achieves up to 30x faster decoding for 2048x2048 image reconstruction compared to existing diffusion codecs, with significantly reduced memory usage.
  • On multiple benchmark datasets, DiT-IC achieves state-of-the-art perceptual quality, especially excelling in low-bitrate scenarios.
  • Ablation studies confirm the contribution of variance-guided reconstruction flow, self-distillation alignment, and latent-conditioned guidance to overall performance.

Significance

DiT-IC has significant implications for both academia and industry. It addresses the computational inefficiency of diffusion models in image compression, making high-quality image reconstruction feasible on standard hardware. This not only provides new insights for image compression but also serves as a reference for other visual tasks that require efficient computation.

Technical Contribution

DiT-IC's technical contributions include adapting a pretrained text-to-image multi-step diffusion transformer into a single-step reconstruction model, performing efficient diffusion in deeply compressed latent spaces. Through alignment mechanisms, DiT-IC significantly reduces computational complexity and memory requirements without compromising reconstruction quality.

Novelty

DiT-IC is the first to achieve efficient diffusion operations in a 32x downscaled latent space, offering significant advantages in computational efficiency and memory usage compared to traditional U-Net architectures.
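The efficiency argument can be made concrete with a back-of-the-envelope token count. The 8x and 32x factors come from the abstract; the quadratic-attention reasoning below is standard transformer arithmetic, not a figure reported by the paper.

```python
# Spatial token counts for a 2048x2048 image at different latent downscalings.
# Self-attention cost grows quadratically with token count, so moving from the
# 8x latents of typical U-Net diffusion codecs to 32x latents shrinks the
# attention workload by roughly (65536 / 4096)^2 = 256x.
for ds in (8, 16, 32):
    side = 2048 // ds
    print(f"{ds:>2}x downscale: {side}x{side} grid -> {side * side} tokens")
```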

Limitations

  • At extremely low bitrates, latent conditions may not provide sufficient semantic information, potentially requiring auxiliary text priors to enhance perceptual quality.
  • Despite significant improvements in decoding speed, high-resolution scenarios may still face hardware limitations.

Future Work

Future research directions include exploring performance optimization at lower bitrates and extending this approach to other visual tasks such as video compression and 3D reconstruction.

AI Executive Summary

Recent advances in diffusion-based generative models have achieved remarkable progress in visual synthesis, yet their application in fundamental tasks like image compression remains constrained by computational inefficiency. Traditional diffusion codecs typically employ U-Net architectures, operating in relatively shallow latent spaces, leading to excessive computational and memory burdens. In contrast, DiT-IC introduces an aligned diffusion transformer framework, enabling efficient image compression in a 32x downscaled deep latent space.

The core of DiT-IC lies in three key alignment mechanisms: variance-guided reconstruction flow, self-distillation alignment, and latent-conditioned guidance. These mechanisms collectively allow DiT-IC to significantly enhance decoding speed and reduce memory usage while maintaining perceptual quality. Experimental results demonstrate that DiT-IC achieves state-of-the-art perceptual quality across multiple benchmark datasets, particularly excelling in low-bitrate scenarios.

Ablation studies validate the contribution of each alignment mechanism to overall performance. The variance-guided reconstruction flow maps latent variance to pseudo-timesteps, collapsing iterative denoising into a single transformation. Self-distillation alignment ensures consistency between the denoised output and the encoder's latent representation, enabling single-step diffusion. Latent-conditioned guidance aligns latent and text embeddings, eliminating the need for text input.
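As a rough sketch of the third mechanism, the toy code below projects pooled latent features into the embedding slot that a text-to-image DiT would normally receive from a text encoder. The pooling, the linear projection, and all dimensions are assumptions for illustration; in the actual model the conditioning would be learned.

```python
import numpy as np

np.random.seed(0)

d_latent, n_tokens, d_embed = 16, 4, 32                   # toy dimensions
W = 0.02 * np.random.randn(d_latent, n_tokens * d_embed)  # learned in practice

def latent_condition(latent):
    """Replace the text-prompt embeddings with embeddings derived
    from the compressed latent itself (text-free inference)."""
    pooled = latent.mean(axis=(2, 3))             # (B, d_latent) global pooling
    cond = pooled @ W                             # project into embedding space
    return cond.reshape(-1, n_tokens, d_embed)    # shaped like text embeddings

z = np.random.randn(2, d_latent, 8, 8)            # batch of compressed latents
print(latent_condition(z).shape)  # (2, 4, 32)
```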

The introduction of DiT-IC has garnered significant attention in academia and offers new solutions for the industry. It effectively addresses the computational efficiency issues of diffusion models in image compression, making high-quality image reconstruction feasible on standard hardware. This breakthrough provides new insights for the field of image compression and offers a reference for other visual tasks requiring efficient computation.

However, DiT-IC faces challenges at extremely low bitrates, where latent conditions may not provide sufficient semantic information, potentially requiring auxiliary text priors to enhance perceptual quality. Additionally, despite significant improvements in decoding speed, high-resolution scenarios may still face hardware limitations. Future research directions include exploring performance optimization at lower bitrates and extending this approach to other visual tasks such as video compression and 3D reconstruction.

Deep Analysis

Background

Recent years have seen significant advances in diffusion models for generating high-quality, semantically controllable images. However, in the fundamental task of image compression, the practical application of diffusion models has been limited by computational efficiency and memory usage constraints. Traditional diffusion codecs typically employ U-Net architectures, operating in relatively shallow latent spaces, leading to excessive computational and memory burdens. In contrast, modern learned codecs operate in much deeper latent domains, motivating researchers to explore the possibility of performing diffusion in deeply compressed latent spaces.

Core Problem

The application of diffusion models in image compression faces dual challenges of computational efficiency and memory usage. Traditional U-Net architectures operate in relatively shallow latent spaces, leading to excessive computational and memory burdens. Additionally, diffusion models typically require multi-step denoising, further increasing computational complexity. In this context, improving the computational efficiency of diffusion models without compromising reconstruction quality is a pressing issue.

Innovation

DiT-IC introduces an aligned diffusion transformer framework for efficient image compression in a 32x downscaled deep latent space:

  β€’ Variance-guided reconstruction flow: maps latent variance to pseudo-timesteps, collapsing iterative denoising into a single transformation.
  β€’ Self-distillation alignment: enforces consistency between the denoised output and the encoder's latent representation, enabling single-step diffusion.
  β€’ Latent-conditioned guidance: aligns latent and text embeddings, eliminating the need for text input.

Methodology

The implementation of DiT-IC includes the following key steps:

  β€’ Start from a pretrained text-to-image multi-step diffusion transformer as the base model.
  β€’ Variance-guided reconstruction flow: map latent variance to pseudo-timesteps, collapsing iterative denoising into a single transformation.
  β€’ Self-distillation alignment: enforce consistency between the denoised output and the encoder's latent representation, enabling single-step diffusion.
  β€’ Latent-conditioned guidance: align latent and text embeddings, eliminating the need for text input.
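The self-distillation step can be caricatured as a simple consistency loss between the one-step denoised latent and the encoder's latent. MSE is used below purely as a stand-in, since this summary does not specify the paper's actual loss.

```python
import numpy as np

np.random.seed(0)

def self_distillation_loss(denoised, encoder_latent):
    """Illustrative consistency objective: pull the one-step output
    toward the encoder-defined latent geometry (MSE stand-in)."""
    return float(np.mean((denoised - encoder_latent) ** 2))

enc = np.random.randn(4, 16, 8, 8)              # encoder-defined latents
out = enc + 0.1 * np.random.randn(4, 16, 8, 8)  # imperfect one-step output
loss = self_distillation_loss(out, enc)
print(f"{loss:.4f}")  # close to 0.01, the variance of the injected noise
```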

Experiments

The experimental design includes performance evaluation on multiple benchmark datasets such as CLIC 2020 Professional, DIV2K, and Kodak datasets. Metrics used include PSNR, MS-SSIM, LPIPS, and DISTS. The experiments also include ablation studies to validate the contribution of each alignment mechanism to overall performance.
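Of the metrics listed, PSNR is the simplest to state exactly (MS-SSIM, LPIPS, and DISTS require learned or multi-scale models). The sketch below computes PSNR for an 8-bit image pair using the textbook formula; this is not code from the paper.

```python
import numpy as np

def psnr(ref, test, max_val=255.0):
    """Peak signal-to-noise ratio in dB for 8-bit images:
    10 * log10(max_val^2 / MSE)."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")   # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

ref = np.full((64, 64), 128, dtype=np.uint8)  # flat gray reference
deg = ref.copy()
deg[::2, ::2] += 4                            # perturb a quarter of the pixels by +4
print(round(psnr(ref, deg), 2))  # 42.11  (MSE = 4)
```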

Results

Experimental results demonstrate that DiT-IC achieves state-of-the-art perceptual quality across multiple benchmark datasets, particularly excelling in low-bitrate scenarios. Compared to existing diffusion codecs, DiT-IC achieves up to 30x faster decoding for 2048x2048 image reconstruction, with significantly reduced memory usage.

Applications

DiT-IC's application scenarios include efficient image compression, particularly in situations requiring fast decoding and low memory usage. It can also be applied to other visual tasks requiring efficient computation, such as video compression and 3D reconstruction.

Limitations & Outlook

Despite significant improvements in decoding speed and memory usage, DiT-IC faces challenges at extremely low bitrates, where latent conditions may not provide sufficient semantic information, potentially requiring auxiliary text priors to enhance perceptual quality. Additionally, high-resolution scenarios may still face hardware limitations.

Plain Language (accessible to non-experts)

Imagine you are in a kitchen cooking a meal. A traditional diffusion model is like a chef who has to try again and again to get a dish right, constantly adjusting the seasoning and cooking method until satisfied. That is the multi-step denoising process: every step costs computation and time. DiT-IC is more like an experienced chef who has mastered every technique and can produce a delicious dish in a single step, because they know how to adjust the cooking to the ingredients at hand, just as DiT-IC adjusts its denoising strength based on latent variance. The result is less time spent and less mess in the kitchen (memory usage), so you can enjoy high-quality food (images) on ordinary home equipment.

ELI14 (explained like you're 14)

Hey there, friends! Did you know that scientists have invented a super cool technology called DiT-IC that lets us quickly see high-definition pictures on our computers? Imagine you're playing a game and suddenly need to load a huge map. Traditional methods are like slowly putting together a puzzle, taking a lot of time. But DiT-IC is like a super-fast puzzle master who can place all the pieces in the right spots in an instant! This is because DiT-IC has a special skill that allows it to quickly find the right place for each puzzle piece based on its characteristics. This way, you can jump into the game faster and enjoy exciting adventures! Isn't that awesome?

Glossary

Diffusion Transformer

A model combining diffusion models and transformer architectures for efficient image generation and compression.

Used in DiT-IC for performing diffusion operations in deep latent spaces.

Variance-Guided Reconstruction Flow

A mechanism that adjusts denoising strength based on latent uncertainty to aid efficient reconstruction.

Used in DiT-IC to collapse multi-step denoising into a single transformation.

Self-Distillation Alignment

A mechanism ensuring consistency between the denoised output and the encoder's latent representation for single-step diffusion.

Used in DiT-IC to enhance computational efficiency.

Latent-Conditioned Guidance

A mechanism that aligns latent and text embeddings, eliminating the need for text input.

Used in DiT-IC for text-free inference.

U-Net Architecture

A neural network architecture commonly used in image generation and compression, known for its multi-scale encoder-decoder structure.

Traditionally used in diffusion codecs.

Latent Space

The internal representation of data within a model, often used to capture high-dimensional features of the data.

In DiT-IC, diffusion operations are performed in a 32x downscaled latent space.

Perceptual Quality

The degree to which an image visually aligns with human perception, typically evaluated through subjective and objective metrics.

DiT-IC achieves state-of-the-art perceptual quality across multiple benchmark datasets.

Bitrate

The amount of data transmitted or processed per unit time, often used to measure compression efficiency.

DiT-IC excels in low-bitrate scenarios.

Ablation Study

An experimental method that evaluates the impact of removing certain parts of a model on overall performance.

Used to validate the contribution of each alignment mechanism in DiT-IC.

Efficient Computation

The ability to achieve fast and accurate computation with limited resources.

DiT-IC achieves efficient computation through alignment mechanisms.

Open Questions (unanswered questions from this research)

  • 1 How can perceptual quality be further improved at extremely low bitrates? Current methods may not provide sufficient semantic information, especially when latent conditions are insufficient. New priors or guidance mechanisms need to be explored to enhance model performance.
  • 2 How can hardware limitations be overcome in high-resolution scenarios to achieve more efficient decoding? Although DiT-IC performs well on standard hardware, it may still face limitations in certain high-resolution scenarios.
  • 3 How can DiT-IC's approach be extended to other visual tasks such as video compression and 3D reconstruction? This requires adapting existing methods to meet the specific needs of different tasks.
  • 4 In multimodal generation tasks, how can information from different modalities be effectively combined to improve generation quality? This involves aligning and fusing information across modalities.
  • 5 How can computational complexity and memory usage be further reduced without compromising reconstruction quality? This requires optimizing and improving existing model architectures.

Applications

Immediate Applications

Efficient Image Compression

DiT-IC can be used in scenarios requiring fast decoding and low memory usage, such as online image transmission and storage.

Real-Time Video Streaming

By extending DiT-IC's approach to video compression, more efficient real-time video streaming can be achieved.

Image Processing on Mobile Devices

DiT-IC's low memory usage makes it suitable for high-quality image processing on resource-constrained mobile devices.

Long-term Vision

3D Reconstruction

Applying DiT-IC's approach to 3D reconstruction tasks can improve reconstruction efficiency and quality.

Multimodal Generation

By combining information from different modalities, DiT-IC can be used to generate higher-quality multimodal content, such as text-image integrated virtual reality experiences.

Abstract

Diffusion-based image compression has recently shown outstanding perceptual fidelity, yet its practicality is hindered by prohibitive sampling overhead and high memory usage. Most existing diffusion codecs employ U-Net architectures, where hierarchical downsampling forces diffusion to operate in shallow latent spaces (typically with only 8x spatial downscaling), resulting in excessive computation. In contrast, conventional VAE-based codecs work in much deeper latent domains (16x - 64x downscaled), motivating a key question: Can diffusion operate effectively in such compact latent spaces without compromising reconstruction quality? To address this, we introduce DiT-IC, an Aligned Diffusion Transformer for Image Compression, which replaces the U-Net with a Diffusion Transformer capable of performing diffusion in latent space entirely at 32x downscaled resolution. DiT-IC adapts a pretrained text-to-image multi-step DiT into a single-step reconstruction model through three key alignment mechanisms: (1) a variance-guided reconstruction flow that adapts denoising strength to latent uncertainty for efficient reconstruction; (2) a self-distillation alignment that enforces consistency with encoder-defined latent geometry to enable one-step diffusion; and (3) a latent-conditioned guidance that replaces text prompts with semantically aligned latent conditions, enabling text-free inference. With these designs, DiT-IC achieves state-of-the-art perceptual quality while offering up to 30x faster decoding and drastically lower memory usage than existing diffusion-based codecs. Remarkably, it can reconstruct 2048x2048 images on a 16 GB laptop GPU.

eess.IV cs.CV
