BiGain: Unified Token Compression for Joint Generation and Classification
BiGain accelerates diffusion models via frequency separation, improving classification accuracy by 7.15% and FID by 0.34.
Key Findings
Methodology
BiGain is a training-free, plug-and-play framework that leverages frequency separation to optimize both generation and classification performance in diffusion models. It employs two frequency-aware operators: Laplacian-gated token merging, which retains edges and textures by encouraging merges among spectrally smooth tokens, and Interpolate-Extrapolate KV Downsampling, which downsamples keys/values while maintaining query precision.
Key Results
- On the ImageNet-1K dataset, with 70% token merging in Stable Diffusion 2.0, BiGain increases classification accuracy by 7.15% and improves FID by 0.34 (1.85%).
- Across COCO-2017 and ImageNet-100 datasets, BiGain maintains or enhances generation quality while significantly improving the speed-accuracy trade-off in diffusion models.
- Ablation studies confirm the advantage of frequency-aware token compression in preserving high-frequency details and low/mid-frequency semantic content.
Significance
BiGain is the first framework to achieve joint optimization of generation and classification in diffusion models, addressing the traditional oversight of classification performance in acceleration methods. Its innovative frequency separation approach maintains generation quality while significantly enhancing classification performance, enabling dual-purpose generative systems for low-cost deployment. This research holds significant academic value and provides new technological pathways for industry, particularly in applications requiring both generation and classification.
Technical Contribution
BiGain's technical contributions lie in its frequency separation strategy, realized through Laplacian-gated token merging and Interpolate-Extrapolate KV Downsampling. Unlike existing methods, BiGain requires no retraining and can be directly inserted at inference time, demonstrating its effectiveness across various datasets and model architectures. This framework offers new theoretical insights and engineering possibilities for token compression in diffusion models.
Novelty
BiGain is the first framework to simultaneously study and enhance both generation and classification under accelerated diffusion. Its novelty lies in introducing frequency separation, addressing the traditional neglect of classification performance during acceleration, and providing a balanced approach to token compression.
Limitations
- BiGain may experience a decline in classification performance under extreme sparsity, particularly on the COCO-2017 dataset.
- The applicability of this method across different model architectures requires further validation, especially beyond U-Net and DiT architectures.
- Although BiGain requires no retraining, its computational complexity may still be high in some scenarios.
Future Work
Future research directions include exploring BiGain's applicability across more model architectures and datasets, further optimizing its computational efficiency, and validating its performance in practical applications. Additionally, integrating BiGain with other acceleration techniques could achieve even more efficient generation and classification performance.
AI Executive Summary
Diffusion models have become the backbone of modern generative systems, yet their computational footprint during sampling has motivated a surge of training-free acceleration techniques. However, these methods often focus solely on generation quality, neglecting the model's latent discriminative capacity. The BiGain framework uses a frequency separation strategy to achieve, for the first time, joint optimization of generation and classification.
At the core of BiGain are two frequency-aware operators: Laplacian-gated token merging and Interpolate-Extrapolate KV Downsampling. The former retains edges and textures by encouraging merges among spectrally smooth tokens; the latter downsamples keys/values while keeping queries intact, maintaining attention precision.
On the ImageNet-1K dataset, with 70% token merging in Stable Diffusion 2.0, BiGain increases classification accuracy by 7.15% and improves FID by 0.34 (1.85%). These results indicate that BiGain maintains or enhances generation quality while significantly improving the speed-accuracy trade-off in diffusion models.
BiGain's novelty lies in introducing frequency separation, addressing the traditional neglect of classification performance during acceleration, and providing a balanced approach to token compression. This research holds significant academic value and provides new technological pathways for industry, particularly in applications requiring both generation and classification.
However, BiGain may experience a decline in classification performance under extreme sparsity, particularly on the COCO-2017 dataset. Additionally, the applicability of this method across different model architectures requires further validation, especially beyond U-Net and DiT architectures. Future research directions include exploring BiGain's applicability across more model architectures and datasets, further optimizing its computational efficiency, and validating its performance in practical applications.
Deep Analysis
Background
Diffusion models have recently emerged as a core technology in generative AI, excelling in image and text generation. However, the high computational complexity during sampling poses a significant challenge for practical deployment. To address this, researchers have proposed various acceleration methods, such as token merging and downsampling, which aim to optimize generation quality by reducing computational load. These methods, however, often overlook the model's discriminative capabilities. As the demand for combined generation and classification tasks increases, the challenge of maintaining or enhancing classification performance during acceleration becomes critical.
Core Problem
Traditional acceleration methods for diffusion models primarily focus on optimizing generation quality, often neglecting the preservation and enhancement of classification performance. This single-objective optimization strategy falls short in applications where both generation and classification are required. In fields such as medical imaging and industrial inspection, the joint use of generation and classification is becoming increasingly prevalent. Therefore, achieving dual optimization of generation and classification in accelerated diffusion models is a pressing research problem.
Innovation
The BiGain framework achieves joint optimization of generation and classification in diffusion models through a frequency separation strategy. Its innovations include:
1. Introducing frequency-aware token compression, utilizing Laplacian-gated token merging and Interpolate-Extrapolate KV Downsampling to preserve high-frequency details and low/mid-frequency semantic content.
2. The framework requires no retraining and can be directly inserted at inference time, applicable to various model architectures and datasets.
3. By addressing the traditional neglect of classification performance during acceleration, BiGain provides a balanced approach to token compression, enabling dual-purpose generative systems for low-cost deployment.
Methodology
The implementation of the BiGain framework involves the following key steps:
- Laplacian-gated token merging: computes local Laplacian magnitudes to guide merging, retaining edges and textured micro-structures.
- Interpolate-Extrapolate KV Downsampling: downsamples keys/values via controllable interpolation and extrapolation while keeping queries intact, reducing computational load.
- Frequency separation strategy: maps intermediate feature signals into a frequency-aware representation, disentangling high-frequency details from low/mid-frequency semantic content to achieve dual optimization of generation and classification.
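The gating step above can be sketched in a toy form. The snippet below is a minimal illustration, not the paper's exact algorithm: the 4-neighbour Laplacian, the smoothness quantile, and the greedy ToMe-style pairing are all illustrative assumptions.

```python
import numpy as np

def laplacian_gated_merge(tokens, h, w, smooth_quantile=0.5):
    """Toy sketch of Laplacian-gated token merging.

    tokens: (h*w, d) token features laid out on an h x w grid.
    Spectrally smooth tokens (low Laplacian magnitude) are merged in
    pairs; high-Laplacian edge/texture tokens are kept intact.
    """
    d = tokens.shape[1]
    grid = tokens.reshape(h, w, d)

    # 4-neighbour Laplacian magnitude per token: high value = edge/texture.
    lap = 4.0 * grid.copy()
    lap[1:, :] -= grid[:-1, :]
    lap[:-1, :] -= grid[1:, :]
    lap[:, 1:] -= grid[:, :-1]
    lap[:, :-1] -= grid[:, 1:]
    lap_mag = np.linalg.norm(lap.reshape(h * w, d), axis=1)

    # Gate: only tokens below the smoothness quantile are merge candidates.
    thresh = np.quantile(lap_mag, smooth_quantile)
    smooth_idx = np.where(lap_mag <= thresh)[0]
    sharp_idx = np.where(lap_mag > thresh)[0]

    # Greedily pair smooth tokens by cosine similarity (ToMe-style).
    feats = tokens[smooth_idx]
    feats_n = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    sim = feats_n @ feats_n.T
    np.fill_diagonal(sim, -np.inf)
    merged, used = [], set()
    for i in np.argsort(-sim.max(axis=1)):
        i = int(i)
        if i in used:
            continue
        j = int(np.argmax(sim[i]))
        if j in used or j == i:
            merged.append(feats[i])
            used.add(i)
            continue
        merged.append((feats[i] + feats[j]) / 2)  # average the matched pair
        used.update((i, j))
    return np.concatenate([tokens[sharp_idx], np.array(merged)], axis=0)
```

The key design choice the sketch mirrors is that the merge *candidates* are chosen by spectral smoothness, while the merge *pairs* are chosen by feature similarity, so edges and textures never get averaged away.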
Experiments
The experimental design includes testing on datasets such as ImageNet-1K, ImageNet-100, Oxford-IIIT Pets, and COCO-2017. The model architectures used include DiT and U-Net, with comparisons to various baseline methods such as ToMe and DiP-GO. Key evaluation metrics include classification accuracy and generation quality (FID). Additionally, ablation studies were conducted to verify the effectiveness of the frequency-aware strategy in token compression.
Results
Experimental results demonstrate significant performance improvements across multiple datasets. On the ImageNet-1K dataset, with 70% token merging in Stable Diffusion 2.0, BiGain increases classification accuracy by 7.15% and improves FID by 0.34 (1.85%). Across COCO-2017 and ImageNet-100 datasets, BiGain maintains or enhances generation quality while significantly improving the speed-accuracy trade-off in diffusion models. Ablation studies further confirm the advantage of frequency-aware token compression in preserving high-frequency details and low/mid-frequency semantic content.
Applications
The BiGain framework has broad application potential in fields requiring both generation and classification, such as medical imaging for diagnostic prediction and uncertainty analysis, industrial inspection for defect identification and reconstruction, and remote sensing for cloud removal and super-resolution synthesis.
Limitations & Outlook
Despite its impressive performance across multiple datasets, BiGain may experience a decline in classification performance under extreme sparsity, particularly on the COCO-2017 dataset. Additionally, the applicability of this method across different model architectures requires further validation, especially beyond U-Net and DiT architectures. Although BiGain requires no retraining, its computational complexity may still be high in some scenarios. Future research directions include exploring BiGain's applicability across more model architectures and datasets, further optimizing its computational efficiency, and validating its performance in practical applications.
Plain Language (Accessible to non-experts)
Imagine you're in a kitchen preparing a meal. You have various ingredients like vegetables, meat, and spices. To save time, you need to quickly decide which ingredients can be cooked together and which need separate attention. BiGain acts like a smart chef, deciding how to combine ingredients based on their characteristics (like taste and texture) to ensure the final dish is both delicious and nutritious.
In this process, BiGain uses a method called 'frequency separation.' Just as a chef decides cooking methods based on the taste and texture of ingredients, BiGain decides how to compress and process information based on the 'frequency' characteristics of data. This allows it to speed up data processing without losing important information.
BiGain's two key steps are like two kitchen tools. The first tool is 'Laplacian-gated token merging,' which helps the chef decide which ingredients can be cooked together to retain the dish's flavor and texture. The second tool is 'Interpolate-Extrapolate KV Downsampling,' which reduces unnecessary steps without affecting the overall flavor.
Through this approach, BiGain not only processes data quickly but also ensures accuracy and quality of results. Like an experienced chef, BiGain significantly improves processing efficiency while ensuring the dish remains tasty.
ELI14 (Explained like you're 14)
Hey there! Did you know that scientists have invented a super tool called BiGain that makes computers process images faster and more accurately? Imagine you're playing a fast-paced game, and BiGain is your secret weapon, helping you win effortlessly!
BiGain works like the experiments we do at school. It breaks down an image into many small pieces, just like we divide materials into small parts for experiments. Then, it uses a method called 'frequency separation' to analyze these pieces and decide which ones are important and can be merged.
Next, BiGain uses two super tools to handle these pieces. The first tool, 'Laplacian-gated token merging,' acts like a smart referee, helping us decide which pieces can be combined. The second tool, 'Interpolate-Extrapolate KV Downsampling,' is like a magician, reducing unnecessary steps without affecting the overall result.
With this method, BiGain not only speeds up image processing but also ensures accuracy and quality. It's like using a superpower in a game, making you the champion of the competition!
Glossary
Diffusion Model
A generative model that generates data by gradually denoising. It is commonly used in image generation due to its high-quality outputs.
In this paper, diffusion models are the core subject, with BiGain optimizing their acceleration performance to enhance generation and classification.
Token Compression
A method to reduce computational load by merging or deleting redundant tokens. It is often used to accelerate model inference.
BiGain uses frequency-aware token compression to achieve dual optimization of generation and classification.
Frequency Separation
A method that decomposes signals into different frequency components, helping to identify and retain important information. Commonly used in image processing.
BiGain employs frequency separation to retain high-frequency details and low/mid-frequency semantic content.
Laplacian-Gated Token Merging
A token merging method based on Laplacian filtering, guiding merges by computing local frequency to retain important details.
In BiGain, this method is used to retain edges and textures, optimizing classification performance.
Interpolate-Extrapolate KV Downsampling
A method for downsampling keys/values through interpolation and extrapolation, reducing computational load while maintaining query integrity.
BiGain uses this method to optimize computational efficiency while maintaining generation quality.
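Per the abstract, this operator blends nearest and average pooling via a controllable "interextrapolation". A minimal sketch of that blend, assuming 2x2 pooling and an illustrative knob `alpha` (these specifics are not given in the source):

```python
import numpy as np

def interextrapolate_kv_downsample(kv, h, w, alpha=0.5):
    """Sketch of KV downsampling as a controllable blend of nearest and
    average 2x2 pooling. alpha in [0, 1] interpolates between the two;
    alpha outside [0, 1] extrapolates beyond them. Queries are untouched.

    kv: (h*w, d) keys or values on an h x w grid. Returns (h//2 * w//2, d).
    """
    d = kv.shape[1]
    blocks = kv.reshape(h // 2, 2, w // 2, 2, d)
    avg = blocks.mean(axis=(1, 3))           # average pooling per 2x2 block
    nearest = blocks[:, 0, :, 0]             # nearest (top-left) pooling
    out = nearest + alpha * (avg - nearest)  # alpha=0 -> nearest, 1 -> average
    return out.reshape(-1, d)
```

Because only keys/values shrink while queries stay at full resolution, the attention map stays per-query precise; `alpha` trades off sharpness (nearest) against smoothing (average).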
Generation Quality
A measure of the quality of outputs from a generative model, typically evaluated using metrics like FID.
BiGain improves generation quality while significantly enhancing classification performance.
Classification Performance
A measure of a model's performance in classification tasks, typically evaluated using accuracy metrics.
BiGain significantly enhances classification performance through frequency-aware strategies.
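Classification with a diffusion model typically follows the zero-shot scheme of Li et al. ("Your Diffusion Model is Secretly a Zero-Shot Classifier", cited in the references): pick the class whose conditioning best predicts the injected noise. A toy sketch, where `denoise_fn` is a hypothetical stand-in for a conditional noise predictor:

```python
import numpy as np

def diffusion_classify(x, noise, denoise_fn, class_labels):
    """Toy sketch of diffusion-based zero-shot classification.

    Noises the input, asks the model to predict the noise under each
    class conditioning, and returns the class with the lowest error.
    """
    errors = []
    for c in class_labels:
        pred_noise = denoise_fn(x + noise, c)  # class-conditional prediction
        errors.append(np.mean((pred_noise - noise) ** 2))
    return class_labels[int(np.argmin(errors))]
```

This is why token compression can hurt accuracy: any detail lost from the compressed features degrades the per-class noise predictions that the argmin compares.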
FID (Fréchet Inception Distance)
A metric used to evaluate the quality of generative models, with lower values indicating higher quality.
In experiments, BiGain improves FID, indicating enhanced generation quality.
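The standard FID formula is ||mu1 - mu2||^2 + Tr(Sigma1 + Sigma2 - 2*(Sigma1*Sigma2)^(1/2)) over Inception feature statistics. As a sketch only, here is the diagonal-covariance simplification (the real metric uses full covariance matrices and a matrix square root):

```python
import numpy as np

def fid_diagonal(mu1, var1, mu2, var2):
    """FID between two Gaussians under a diagonal-covariance simplification.

    Reduces to ||mu1 - mu2||^2 + sum((sqrt(var1) - sqrt(var2))^2).
    Lower is better; identical Gaussians give 0.
    """
    mu1, var1, mu2, var2 = map(np.asarray, (mu1, var1, mu2, var2))
    mean_term = np.sum((mu1 - mu2) ** 2)
    cov_term = np.sum((np.sqrt(var1) - np.sqrt(var2)) ** 2)
    return float(mean_term + cov_term)
```

So BiGain's reported FID improvement of 0.34 means the generated-image feature distribution moved measurably closer to the real-image distribution under this distance.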
U-Net
A convolutional neural network architecture commonly used for image segmentation, named for its symmetric encoder-decoder structure.
BiGain was validated on U-Net architecture, demonstrating its effectiveness.
DiT (Diffusion Transformer)
A generative model combining diffusion models and Transformer architecture, known for its strong generative capabilities.
BiGain was tested on DiT architecture, verifying its applicability across different models.
Open Questions (Unanswered questions from this research)
1. The applicability of BiGain beyond U-Net and DiT architectures remains to be further validated. While it performs well on these architectures, its performance on other models is unclear and requires more experimentation to confirm its generalizability.
2. BiGain may experience a decline in classification performance under extreme sparsity. This indicates potential limitations in handling certain specific datasets, necessitating further research to improve its robustness.
3. BiGain's computational complexity may still be high in some scenarios. Although it requires no retraining, its efficiency on large-scale datasets needs optimization for broader application.
4. How to integrate BiGain with other acceleration techniques to achieve more efficient generation and classification performance remains an open question. This requires exploring the synergistic effects of different technologies.
5. The performance of BiGain in practical applications still needs further research. Although it performs well in experiments, its performance in real-world scenarios is unclear and requires more practical validation.
Applications
Immediate Applications
Medical Imaging Analysis
BiGain can be used for diagnostic prediction and uncertainty analysis in medical imaging, aiding doctors in making faster and more accurate decisions.
Industrial Visual Inspection
In industrial inspection, BiGain can be used for defect identification and reconstruction, improving detection efficiency and accuracy on production lines.
Remote Sensing Image Processing
In remote sensing, BiGain can be used for cloud removal and super-resolution synthesis, enhancing image quality and classification performance.
Long-term Vision
Intelligent Transportation Systems
BiGain can be used in intelligent transportation systems for real-time monitoring and anomaly detection, enhancing the intelligence level of traffic management.
Autonomous Driving Technology
In autonomous driving, BiGain can be used for environmental perception and decision support, enhancing the safety and reliability of autonomous driving systems.
Abstract
Acceleration methods for diffusion models (e.g., token merging or downsampling) typically optimize synthesis quality under reduced compute, yet often ignore discriminative capacity. We revisit token compression with a joint objective and present BiGain, a training-free, plug-and-play framework that preserves generation quality while improving classification in accelerated diffusion models. Our key insight is frequency separation: mapping feature-space signals into a frequency-aware representation disentangles fine detail from global semantics, enabling compression that respects both generative fidelity and discriminative utility. BiGain reflects this principle with two frequency-aware operators: (1) Laplacian-gated token merging, which encourages merges among spectrally smooth tokens while discouraging merges of high-contrast tokens, thereby retaining edges and textures; and (2) Interpolate-Extrapolate KV Downsampling, which downsamples keys/values via a controllable interextrapolation between nearest and average pooling while keeping queries intact, thereby conserving attention precision. Across DiT- and U-Net-based backbones and ImageNet-1K, ImageNet-100, Oxford-IIIT Pets, and COCO-2017, our operators consistently improve the speed-accuracy trade-off for diffusion-based classification, while maintaining or enhancing generation quality under comparable acceleration. For instance, on ImageNet-1K, with 70% token merging on Stable Diffusion 2.0, BiGain increases classification accuracy by 7.15% while improving FID by 0.34 (1.85%). Our analyses indicate that balanced spectral retention, preserving high-frequency detail and low/mid-frequency semantics, is a reliable design rule for token compression in diffusion models. To our knowledge, BiGain is the first framework to jointly study and advance both generation and classification under accelerated diffusion, supporting lower-cost deployment.
References (20)
High-Resolution Image Synthesis with Latent Diffusion Models
Robin Rombach, A. Blattmann, Dominik Lorenz et al.
Your Diffusion Model is Secretly a Zero-Shot Classifier
Alexander C. Li, Mihir Prabhudesai, Shivam Duggal et al.
Microsoft COCO: Common Objects in Context
Tsung-Yi Lin, M. Maire, Serge J. Belongie et al.
Data Augmentation in Earth Observation: A Diffusion Model Approach
Tiago Sousa, B. Ries, N. Guelfi
Emergent Correspondence from Image Diffusion
Luming Tang, Menglin Jia, Qianqian Wang et al.
TokenLearner: Adaptive Space-Time Tokenization for Videos
M. Ryoo, A. Piergiovanni, Anurag Arnab et al.
ImageNet Large Scale Visual Recognition Challenge
Olga Russakovsky, Jia Deng, Hao Su et al.
Training-Free and Hardware-Friendly Acceleration for Diffusion Models via Similarity-based Token Pruning
Evelyn Zhang, Jiayi Tang, Xuefei Ning et al.
LD-Pruner: Efficient Pruning of Latent Diffusion Models using Task-Agnostic Insights
Thibault Castells, Hyoung-Kyu Song, Bo-Kyeong Kim et al.
A Diffusion-Based Framework for Multi-Class Anomaly Detection
Haoyang He, Jiangning Zhang, Hongxu Chen et al.
DiP-GO: A Diffusion Pruner via Few-step Gradient Optimization
Haowei Zhu, Dehua Tang, Ji Liu et al.
FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space
Black Forest Labs, Stephen Batifol, A. Blattmann et al.
Robust Classification via a Single Diffusion Model
Huanran Chen, Yinpeng Dong, Zhengyi Wang et al.
Conditional Diffusion Models are Medical Image Classifiers that Provide Explainability and Uncertainty for Free
G. Favero, Parham Saremi, E. Kaczmarek et al.
Scalable Diffusion Models with Transformers
William S. Peebles, Saining Xie
Denoising Diffusion Implicit Models
Jiaming Song, Chenlin Meng, Stefano Ermon
Not All Diffusion Model Activations Have Been Evaluated as Discriminative Features
Benyuan Meng, Qianqian Xu, Zitai Wang et al.
Structural Pruning for Diffusion Models
Gongfan Fang, Xinyin Ma, Xinchao Wang
Token Merging for Fast Stable Diffusion
Daniel Bolya, Judy Hoffman
Denoising Diffusion Probabilistic Models
Jonathan Ho, Ajay Jain, P. Abbeel