End-to-End Training for Unified Tokenization and Latent Denoising
UNITE achieves unified tokenization and latent diffusion with a single autoencoder, reaching FID scores of 2.12 (Base) and 1.73 (Large) on ImageNet 256 x 256.
Key Findings
Methodology
The paper introduces UNITE, an autoencoder architecture for unified tokenization and latent diffusion. The core component is a Generative Encoder that serves as both an image tokenizer and a latent generator via weight sharing. The key insight is that tokenization and generation can be viewed as the same latent inference problem under different conditioning regimes: tokenization infers latents from fully observed images, whereas generation infers them from noise together with text or class conditioning. This approach enables a single-stage training process that jointly optimizes both tasks through two forward passes using the same Generative Encoder.
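The weight-sharing idea can be illustrated with a minimal sketch. Everything here is an illustrative assumption, not the paper's actual architecture: the encoder is a single linear map, and conditioning is modeled as simple addition rather than a real mechanism such as cross-attention.

```python
import numpy as np

rng = np.random.default_rng(0)

# One set of shared weights standing in for the Generative Encoder.
# The dimensions and the linear map are illustrative, not from the paper.
W = rng.standard_normal((16, 8)) * 0.1  # input features -> latent dim

def generative_encoder(x, condition=None):
    """Map an input to a latent under either conditioning regime.

    Tokenization: x is a fully observed image (condition is None).
    Generation:   x is noise, and `condition` (e.g., a class embedding)
    is added to the input -- a crude stand-in for real conditioning.
    """
    if condition is not None:
        x = x + condition
    return x @ W  # the same parameters W serve both tasks

image = rng.standard_normal(16)       # tokenization input
noise = rng.standard_normal(16)       # generation input
class_emb = rng.standard_normal(16)   # class conditioning

z_token = generative_encoder(image)            # tokenization pass
z_gen = generative_encoder(noise, class_emb)   # generation pass
assert z_token.shape == z_gen.shape == (8,)    # one latent space for both
```

Both passes land in the same latent space because they run through the same parameters; this is the sense in which tokenization and generation become one latent inference problem.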
Key Results
- On the ImageNet 256 x 256 dataset, UNITE's Base and Large models achieved FID scores of 2.12 and 1.73, respectively, approaching state-of-the-art levels.
- UNITE achieves near state-of-the-art performance across image and molecule modalities without adversarial losses or pretrained encoders such as DINO.
- Analysis of the Generative Encoder through representation alignment and compression validates the feasibility of single-stage joint training from scratch.
Significance
UNITE's introduction holds significant implications for both academia and industry. Academically, it simplifies the training process of latent diffusion models by eliminating the need for complex staged training, thus advancing the development of generative models. Industrially, UNITE's efficient training process and outstanding performance make it highly applicable in fields such as image synthesis and molecular generation. Moreover, the method's independence from adversarial losses or pretrained encoders reduces implementation complexity and computational cost.
Technical Contribution
UNITE's technical contributions lie in its innovative autoencoder architecture and single-stage training method. Unlike existing latent diffusion models, UNITE achieves unified tokenization and generation through weight sharing, simplifying the training process. Additionally, UNITE demonstrates the ability to reach near state-of-the-art levels without adversarial losses, offering new insights and possibilities for generative model design.
Novelty
UNITE's novelty lies in its unified autoencoder architecture and single-stage training method. This approach is the first to view tokenization and generation as the same latent inference problem and achieves joint optimization through shared parameters, offering significant simplification and efficiency improvements over existing methods.
Limitations
- UNITE may underperform on especially complex image or molecule generation tasks, and such tasks may demand substantially more computational resources.
- Without adversarial losses, UNITE might not achieve the same level of detail as adversarial generative models in some cases.
- Further fine-tuning may be necessary in specific application scenarios to achieve optimal performance.
Future Work
Future research directions include exploring UNITE's application to more modalities and further optimizing its generation quality and efficiency. Additionally, investigating how to apply UNITE to larger-scale datasets and exploring its potential in real-time generation tasks are promising areas. The community might also consider combining UNITE with other generative models to achieve more complex generation tasks.
AI Executive Summary
Latent diffusion models (LDMs) operate in learned latent spaces to enable high-fidelity synthesis. However, training state-of-the-art LDMs requires complex staging: a tokenizer must be trained first before the diffusion model can be trained in the frozen latent space. We propose UNITE, an autoencoder architecture for unified tokenization and latent diffusion. UNITE consists of a Generative Encoder that serves as both an image tokenizer and a latent generator via weight sharing. Our key insight is that tokenization and generation can be viewed as the same latent inference problem under different conditioning regimes: tokenization infers latents from fully observed images, whereas generation infers them from noise together with text or class conditioning. Motivated by this, we introduce a single-stage training procedure that jointly optimizes both tasks via two forward passes through the same Generative Encoder. The shared parameters enable gradients to jointly shape the latent space, encouraging a 'common latent language'.
Across image and molecule modalities, UNITE achieves near state-of-the-art performance without adversarial losses or pretrained encoders (e.g., DINO), reaching FID 2.12 and 1.73 for Base and Large models on ImageNet 256 x 256. We further analyze the Generative Encoder through the lenses of representation alignment and compression. These results show that single-stage joint training of tokenization and generation from scratch is feasible.
Deep Analysis
Background
Latent diffusion models (LDMs) have gained significant attention in the field of generative models in recent years. LDMs enable high-fidelity image and data synthesis by operating in learned latent spaces. Traditional LDM training processes typically require complex staged methods: first, a tokenizer is trained, and then the diffusion model is trained in the frozen latent space. While effective, this approach involves multiple training stages, resulting in high computational costs and complexity. Additionally, existing methods often rely on adversarial losses or pretrained encoders (e.g., DINO), further increasing the difficulty of implementation. Thus, simplifying the training process of LDMs, reducing computational costs, and maintaining or improving generation quality have become pressing issues.
Core Problem
The core problem lies in the training complexity and computational cost of existing latent diffusion models. Traditional methods require training a tokenizer first before the diffusion model can be trained in the frozen latent space. This staged training approach is not only time-consuming and complex but may also lead to suboptimal latent space representations. Furthermore, methods that rely on adversarial losses or pretrained encoders add to the difficulty and computational cost. Therefore, achieving efficient tokenization and generation without relying on these complex mechanisms has become a crucial research challenge.
Innovation
The core innovations of UNITE are its unified autoencoder architecture and single-stage training method:

- Generative Encoder: serves as both an image tokenizer and a latent generator through weight sharing, simplifying the training process.
- Single-stage training: jointly optimizes the tokenization and generation tasks through two forward passes using the same Generative Encoder, eliminating the need for complex staged training.
- Latent inference: views tokenization and generation as the same latent inference problem under different conditioning regimes, encouraging the formation of a 'common latent language'.

Together, these choices significantly simplify the training of latent diffusion models, reduce computational costs, and achieve high-quality generation without adversarial losses.
Methodology
UNITE's methodology comprises the following components:

- Generative Encoder: serves as both an image tokenizer and a latent generator through weight sharing. Inputs are either images or noise; outputs are latent representations.
- Single-stage training: jointly optimizes tokenization and generation through two forward passes using the same Generative Encoder. The first pass performs tokenization, inferring latents from fully observed images; the second performs generation, inferring latents from noise together with text or class conditioning.
- Parameter sharing: shared parameters let gradients from both tasks jointly shape the latent space, encouraging the formation of a 'common latent language'.
- Optimization objective: the tokenization and generation losses are optimized jointly so the two tasks evolve in concert.
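As a toy illustration of the single-stage procedure, the sketch below runs both forward passes through one shared weight matrix and descends a joint loss. The reconstruction and latent-matching objectives, the linear maps, and the numerical gradients are all simplifications standing in for the paper's actual diffusion-style losses and architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
D, Z = 8, 4                              # illustrative data / latent dims
W = rng.standard_normal((D, Z)) * 0.1    # shared Generative Encoder weights
V = rng.standard_normal((Z, D)) * 0.1    # decoder weights (reconstruction)

x = rng.standard_normal((32, D))         # a batch of "images"
cond = rng.standard_normal((32, D))      # class/text conditioning stand-in
noise = rng.standard_normal((32, D))     # generation-time noise

def joint_loss():
    # Pass 1 (tokenization): latents from fully observed inputs,
    # trained with a reconstruction objective.
    z_tok = x @ W
    loss_tok = np.mean((z_tok @ V - x) ** 2)
    # Pass 2 (generation): latents from noise + conditioning, pulled
    # toward the tokenized latents -- a crude stand-in for the paper's
    # denoising objective.
    z_gen = (noise + cond) @ W
    loss_gen = np.mean((z_gen - z_tok) ** 2)
    return loss_tok + loss_gen           # gradients of BOTH terms hit W

def num_grad(P, eps=1e-5):
    """Central-difference gradient of joint_loss w.r.t. parameter array P."""
    g = np.zeros_like(P)
    for i in np.ndindex(P.shape):
        old = P[i]
        P[i] = old + eps; hi = joint_loss()
        P[i] = old - eps; lo = joint_loss()
        P[i] = old
        g[i] = (hi - lo) / (2 * eps)
    return g

before = joint_loss()
for _ in range(50):                      # a few steps of gradient descent
    gW, gV = num_grad(W), num_grad(V)
    W -= 0.1 * gW
    V -= 0.1 * gV
after = joint_loss()
assert after < before                    # the joint objective decreases
```

The point of the sketch is structural: both loss terms backpropagate into the same `W`, so gradients from tokenization and generation jointly shape the latent space, as in the paper's single-stage training.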
Experiments
The experimental design includes training and evaluation on the ImageNet 256 x 256 dataset. Baselines used include existing state-of-the-art latent diffusion models. Evaluation metrics primarily consist of FID scores, which measure the quality of generated images. Key hyperparameters include the structure of the Generative Encoder and learning rates during training. Experiments also include ablation studies to verify the representation alignment and compression capabilities of the Generative Encoder.
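For concreteness, FID is the Fréchet distance between Gaussians fitted to feature statistics of real and generated images. The sketch below is a simplified, assumption-laden version: covariances are treated as diagonal so the matrix square root becomes elementwise, and random vectors stand in for the Inception-v3 features used in real evaluations.

```python
import numpy as np

def fid_diagonal(feats_a, feats_b):
    """Fréchet distance between Gaussians fitted to two feature sets.

    Simplified: covariances are treated as diagonal, so the matrix
    square root in Tr(S1 + S2 - 2*(S1 @ S2)^(1/2)) reduces to an
    elementwise sqrt. Real FID uses full covariances of Inception-v3
    features computed over tens of thousands of images.
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    var_a, var_b = feats_a.var(axis=0), feats_b.var(axis=0)
    mean_term = np.sum((mu_a - mu_b) ** 2)
    cov_term = np.sum(var_a + var_b - 2.0 * np.sqrt(var_a * var_b))
    return mean_term + cov_term

rng = np.random.default_rng(0)
real = rng.standard_normal((10_000, 64))             # stand-in "real" features
close = rng.standard_normal((10_000, 64))            # same distribution
far = 2.0 * rng.standard_normal((10_000, 64)) + 1.0  # shifted and scaled

assert fid_diagonal(real, real) < 1e-9               # identical sets score ~0
assert fid_diagonal(real, close) < fid_diagonal(real, far)  # lower = closer
```

Lower FID means the generated feature distribution is closer to the real one, which is why UNITE's scores of 2.12 and 1.73 indicate near state-of-the-art quality.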
Results
Experimental results show that UNITE's Base and Large models achieved FID scores of 2.12 and 1.73 on the ImageNet 256 x 256 dataset, respectively, approaching state-of-the-art levels. Ablation studies reveal significant advantages of the Generative Encoder in representation alignment and compression. Furthermore, UNITE achieves high-quality generation without adversarial losses or pretrained encoders, validating the effectiveness of its single-stage training method.
Applications
UNITE has broad application potential in fields such as image synthesis and molecular generation. Direct use cases include high-quality image generation and molecular structure design. Due to its simplified training process and outstanding performance, UNITE holds significant impact in industries such as image processing and pharmaceutical development.
Limitations & Outlook
Despite UNITE's excellent performance in many aspects, there are still some limitations. For example, in extremely complex image or molecule generation tasks, higher computational resources may be required. Additionally, without adversarial losses, UNITE might not achieve the same level of detail as adversarial generative models in some cases. Further fine-tuning may be necessary in specific application scenarios to achieve optimal performance. Future research can focus on optimizing generation quality and efficiency and exploring more application modalities.
Plain Language (accessible to non-experts)
Imagine you have a machine that can both make and package candy at the same time. Traditionally, you would use one machine to make the candy and another to package it. While effective, this requires two machines and more time. UNITE is like a multifunctional machine that can do both tasks simultaneously. By sharing internal components, it simplifies the entire process. Just like this machine, UNITE uses a shared Generative Encoder to perform both tokenization and generation, simplifying the training process. It doesn't need extra adversarial mechanisms or pretraining steps to produce high-quality results efficiently. Imagine this machine can not only make candy but also adjust the flavor and shape according to your taste preferences. This is akin to UNITE's ability to perform latent inference under different conditions. It can flexibly adjust its output based on different input conditions (like images or noise), ensuring you get the desired result every time.
ELI14 (explained like you're 14)
Hey there, friends! Imagine you have a super cool robot that can do two things at once: it can turn your favorite comic book into a digital version and draw new comic characters based on your description! Traditionally, you'd need two different robots, one for scanning comics and another for drawing. But our UNITE robot is like an all-in-one artist, doing both tasks simultaneously! It has a magical 'Generative Encoder' that's like its brain, sharing its 'thoughts' internally to understand comic content and create new characters at the same time. It's like in a game where your character can both fight monsters and build houses, all in one go! Plus, it doesn't need extra help to do all this, which is super cool, right? So next time you want a new comic character, remember to ask our UNITE robot, it won't let you down!
Glossary
Latent Diffusion Model
A generative model operating in learned latent spaces, enabling high-fidelity synthesis.
Used for generating high-quality images and data.
Autoencoder
A neural network architecture used to learn efficient encodings of data.
UNITE uses an autoencoder architecture for tokenization and latent generation.
Tokenization
The process of converting input data into a set of tokens for easier processing.
In UNITE, tokenization is a function of the Generative Encoder.
Generative Encoder
The core component in UNITE that serves as both an image tokenizer and a latent generator.
Achieves unified tokenization and generation through weight sharing.
FID (Fréchet Inception Distance)
A metric for evaluating the quality of generated images; lower values indicate higher quality.
Used to assess UNITE's performance on the ImageNet dataset.
Weight Sharing
Sharing the same parameters across different tasks or model components to improve efficiency.
UNITE achieves unified tokenization and generation through weight sharing.
Latent Space
The representation space of data after transformation by an encoder, used for generative model operations.
UNITE operates in latent space for tokenization and generation.
Adversarial Loss
A loss function used in generative adversarial networks to train generators and discriminators.
UNITE achieves high-quality generation without relying on adversarial losses.
Pretrained Encoder
An encoder pretrained on large datasets to enhance model performance.
UNITE achieves high-quality generation without using pretrained encoders.
Representation Alignment
The degree to which a model's learned representations align with other representations, such as those of pretrained vision encoders.
Used to analyze the performance of UNITE's Generative Encoder.
Compression
Reducing redundancy in data representations to improve efficiency.
Used to analyze the performance of UNITE's Generative Encoder.
Single-stage Training
A training method that does not require staged processes, simplifying model training.
UNITE achieves joint optimization of tokenization and generation through single-stage training.
Common Latent Language
A unified representation formed in latent space through shared parameters.
UNITE encourages the formation of a common latent language through parameter sharing.
Ablation Study
Evaluating the impact of removing or modifying model components on overall performance.
Used to verify the representation alignment and compression capabilities of UNITE's Generative Encoder.
Text or Class Conditioning
Using text or class labels to guide the generation process.
UNITE performs latent inference with text or class conditioning during generation.
Open Questions (unanswered questions from this research)
1. How can UNITE be applied to larger-scale datasets? Current experiments focus primarily on the ImageNet 256 x 256 dataset. While results are promising, UNITE's performance and efficiency on larger-scale datasets remain to be verified. Further research is needed to address potential computational resource constraints and training time issues.
2. How does UNITE perform in extremely complex image or molecule generation tasks? Although UNITE performs well in regular tasks, its performance and efficiency in more complex generation tasks need further exploration. Higher computational resources or further model optimization may be necessary.
3. How can generation detail levels be further improved without adversarial losses? UNITE achieves high-quality generation without relying on adversarial losses, but it may not reach the same level of detail as adversarial generative models in some cases. New methods need to be explored to enhance generation detail.
4. What is UNITE's potential in real-time generation tasks? Current research focuses primarily on offline generation tasks, and UNITE's performance and efficiency in real-time generation tasks remain to be verified. Further research is needed to address potential latency and computational resource issues.
5. How can UNITE be combined with other generative models to achieve more complex generation tasks? Current research focuses primarily on UNITE's performance alone. Combining it with other generative models may bring new possibilities and challenges. Exploration is needed on how to effectively combine the strengths of different models.
Applications
Immediate Applications
High-Quality Image Generation
UNITE can be used for generating high-quality images, applicable in fields such as advertising and film production. Its simplified training process and outstanding performance make it impactful in these areas.
Molecular Structure Design
In pharmaceutical research, UNITE can be used to generate new molecular structures, aiding scientists in discovering new drugs. Its efficient generation process and flexible conditional inference capabilities offer broad application potential in this field.
Image Processing
UNITE can be applied to image processing tasks such as image restoration and style transfer. Its high-quality generation capabilities and simplified training process make it impactful in this field.
Long-term Vision
Real-Time Generation Applications
UNITE's potential in real-time generation tasks is worth exploring, such as real-time image generation and augmented reality applications. Potential latency and computational resource issues need to be addressed.
Cross-Modal Generation
Combining UNITE with other generative models to explore cross-modal generation possibilities, such as image-to-text generation. Exploration is needed on how to effectively combine the strengths of different models.
Abstract
Latent diffusion models (LDMs) enable high-fidelity synthesis by operating in learned latent spaces. However, training state-of-the-art LDMs requires complex staging: a tokenizer must be trained first, before the diffusion model can be trained in the frozen latent space. We propose UNITE, an autoencoder architecture for unified tokenization and latent diffusion. UNITE consists of a Generative Encoder that serves as both an image tokenizer and a latent generator via weight sharing. Our key insight is that tokenization and generation can be viewed as the same latent inference problem under different conditioning regimes: tokenization infers latents from fully observed images, whereas generation infers them from noise together with text or class conditioning. Motivated by this, we introduce a single-stage training procedure that jointly optimizes both tasks via two forward passes through the same Generative Encoder. The shared parameters enable gradients to jointly shape the latent space, encouraging a "common latent language". Across image and molecule modalities, UNITE achieves near state-of-the-art performance without adversarial losses or pretrained encoders (e.g., DINO), reaching FID 2.12 and 1.73 for Base and Large models on ImageNet 256 x 256. We further analyze the Generative Encoder through the lenses of representation alignment and compression. These results show that single-stage joint training of tokenization and generation from scratch is feasible.
References (20)
Similarity of Neural Network Representations Revisited
Simon Kornblith, Mohammad Norouzi, Honglak Lee et al.
Generative Adversarial Networks
I. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza et al.
DINOv2: Learning Robust Visual Features without Supervision
M. Oquab, Timothée Darcet, Théo Moutakanni et al.
Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think
Sihyun Yu, Sangkyung Kwak, Huiwon Jang et al.
Emerging Properties in Self-Supervised Vision Transformers
Mathilde Caron, Hugo Touvron, Ishan Misra et al.
All-atom Diffusion Transformers: Unified generative modelling of molecules and materials
Chaitanya K. Joshi, Xiang Fu, Yiyi Liao et al.
Self-Supervised Flow Matching for Scalable Multi-Modal Synthesis
Hila Chefer, Patrick Esser, Dominik Lorenz et al.
Autoregressive Image Generation without Vector Quantization
Tianhong Li, Yonglong Tian, He Li et al.
Movie Gen: A Cast of Media Foundation Models
Adam Polyak, Amit Zohar, Andrew Brown et al.
Flow Matching for Generative Modeling
Y. Lipman, Ricky T. Q. Chen, Heli Ben-Hamu et al.
Neural Discrete Representation Learning
Aäron van den Oord, O. Vinyals, K. Kavukcuoglu
Language Models are Few-Shot Learners
Tom B. Brown, Benjamin Mann, Nick Ryder et al.
PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis
Junsong Chen, Jincheng Yu, Chongjian Ge et al.
Python Materials Genomics (pymatgen): A robust, open-source python library for materials analysis
S. Ong, W. Richards, Anubhav Jain et al.
PixNerd: Pixel Neural Field Diffusion
Shuai Wang, Ziteng Gao, Chenhui Zhu et al.
Quantum chemistry structures and properties of 134 kilo molecules
R. Ramakrishnan, Pavlo O. Dral, Pavlo O. Dral et al.
Wan: Open and Advanced Large-Scale Video Generative Models
Ang Wang, Baole Ai, Bin Wen et al.
BERT: A Review of Applications in Natural Language Processing and Understanding
M. V. Koroteev
Adaptive Length Image Tokenization via Recurrent Allocation
Shivam Duggal, Phillip Isola, Antonio Torralba et al.