EVATok: Adaptive Length Video Tokenization for Efficient Visual Autoregressive Generation

TL;DR

EVATok achieves efficient visual autoregressive generation with adaptive video tokenization, saving 24.4% tokens on average.

cs.CV · Advanced · 2026-03-13
Tianwei Xiong Jun Hao Liew Zilong Huang Zhijie Lin Jiashi Feng Xihui Liu
video generation autoregressive models tokenization deep learning computer vision

Key Findings

Methodology

EVATok employs an adaptive video tokenization framework that optimizes token allocation per video: lightweight routers quickly predict token assignments, and tokenizer training is enhanced with video semantic encoders. The method unfolds in four stages: training a proxy tokenizer, curating a dataset for router training, training a lightweight router, and training the final adaptive tokenizer under router-predicted assignments.

Key Results

  • On the UCF-101 dataset, EVATok demonstrates superior performance in video reconstruction and class-to-video generation, with at least 24.4% savings in token usage.
  • Compared to fixed-length baselines, EVATok shows significant improvements in video reconstruction quality and generation efficiency.
  • On the WebVid-10M dataset, EVATok's router-guided tokenizer achieves excellent LPIPS and rFVD metrics, saving 29.6% in token length.

Significance

EVATok is significant in the video generation domain as it addresses the issue of uneven token allocation in traditional video tokenization, improving video reconstruction quality and generation efficiency. Its adaptive tokenization strategy offers a new approach for video generation models, particularly in handling complex dynamic videos by better allocating computational resources.

Technical Contribution

EVATok introduces an adaptive tokenization framework and lightweight routers to optimize token allocation. Compared to existing fixed-length methods, it significantly improves token usage efficiency and video generation quality, opening new engineering possibilities for efficient autoregressive video generation.

Novelty

EVATok achieves content-based adaptive video tokenization, overcoming the limitations of traditional fixed-length tokenization. Unlike earlier adaptive schemes, it dynamically adjusts token allocation based on video content complexity via learned routers, significantly enhancing token usage efficiency.

Limitations

  • EVATok may underperform when dealing with extremely complex or simple videos, as token allocation predictions may not be precise enough.
  • The training process requires substantial computational resources, which may not be suitable for resource-constrained scenarios.
  • The reliance on token allocation may lead to performance fluctuations in certain cases.

Future Work

Future research could focus on further optimizing the precision of token allocation predictions and exploring EVATok's applications in other video generation tasks. Additionally, reducing computational resource requirements could broaden its applicability.

AI Executive Summary

In the field of video generation, traditional autoregressive models rely on fixed-length token sequences, which are inefficient when handling dynamically complex videos. EVATok addresses this issue by introducing an adaptive length video tokenization framework. This framework uses lightweight routers to predict optimal token allocations for each video, balancing token usage efficiency and video generation quality. Experiments on the UCF-101 dataset show that EVATok saves at least 24.4% in token usage compared to the prior state of the art (LARP) and a fixed-length baseline.

The core technology of EVATok includes training a proxy tokenizer, creating datasets, and training routers. The proxy tokenizer evaluates video reconstruction quality under different token allocations, while the router predicts optimal allocations through a classification task. The final adaptive tokenizer is trained under router-predicted allocations, achieving adaptive length video tokenization.
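Because the router is framed as a classification task over a fixed set of budget levels, its prediction step can be sketched in a few lines. The interface below is a hypothetical illustration, not the paper's actual code:

```python
def route(logits, budget_levels):
    """Pick a token budget from router classifier logits.

    Each class corresponds to one allowed budget level, and the
    argmax class selects the budget for this video.
    Hypothetical interface, assumed for illustration only.
    """
    best = max(range(len(logits)), key=lambda i: logits[i])
    return budget_levels[best]

# Logits favouring the middle class select the middle budget level.
print(route([0.2, 1.7, -0.4], [64, 128, 256]))  # → 128
```

Framing routing as classification over a small discrete set of budgets keeps the router lightweight: it only needs a cheap head over video features rather than a full reconstruction pass.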

Experimental results demonstrate that EVATok excels in video reconstruction and generation tasks, particularly in handling complex dynamic videos by better allocating computational resources, improving generation efficiency and quality. Compared to traditional methods, EVATok significantly enhances token usage efficiency and video generation quality.

EVATok's adaptive tokenization strategy offers a new approach for video generation models, addressing the issue of uneven token allocation in traditional video tokenization. Its potential applications in the video generation domain are vast, especially in scenarios requiring efficient handling of complex dynamic videos.

However, EVATok may underperform when dealing with extremely complex or simple videos, as token allocation predictions may not be precise enough. Additionally, the training process requires substantial computational resources, which may not be suitable for resource-constrained scenarios. Future research could focus on further optimizing the precision of token allocation predictions and exploring EVATok's applications in other video generation tasks.

Deep Analysis

Background

Video generation technology has made significant progress in recent years, driven by autoregressive models. These models achieve efficient video generation by compressing video pixels into discrete token sequences. However, traditional video tokenization methods typically use fixed-length token allocations, which are inefficient when handling videos of varying complexity. Existing methods like LARP and AdapTok have achieved some level of adaptive tokenization, but their token allocation strategies still fall short, failing to fully leverage the complexity of video content.

Core Problem

Traditional video tokenization methods are inefficient when handling complex dynamic videos because they typically use fixed-length token allocations. This approach wastes tokens on simple, static, or repetitive video segments while under-allocating tokens for dynamic or complex segments, leading to a decline in reconstruction quality and generation efficiency. The challenge is to dynamically adjust token allocation based on video content complexity, achieving a balance between token usage efficiency and video generation quality.
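To make the imbalance concrete, here is a toy allocator (not EVATok's learned mechanism) that splits a fixed token budget across temporal blocks in proportion to a per-block complexity score, with a minimum floor per block:

```python
def allocate_tokens(complexities, total_budget, min_tokens=4):
    """Split a token budget across temporal blocks in proportion to
    each block's complexity score (a toy stand-in for EVATok's
    learned router). Every block keeps at least `min_tokens`."""
    n = len(complexities)
    base = min_tokens * n
    assert total_budget >= base, "budget too small for the floor"
    spare = total_budget - base
    total_c = sum(complexities) or 1.0
    alloc = [min_tokens + int(spare * c / total_c) for c in complexities]
    # Hand any rounding remainder to the most complex block.
    alloc[max(range(n), key=lambda i: complexities[i])] += total_budget - sum(alloc)
    return alloc

# Static blocks (low scores) get few tokens; the dynamic block gets many.
print(allocate_tokens([0.1, 0.1, 0.9, 0.5], total_budget=64))  # → [7, 7, 31, 19]
```

A fixed-length tokenizer would instead spend 16 tokens on every block here, over-serving the two static segments and under-serving the most dynamic one.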

Innovation

EVATok introduces an adaptive length video tokenization framework, addressing the issue of uneven token allocation in traditional methods. Its core innovations include: 1) the introduction of lightweight routers for fast prediction of optimal token allocations; 2) training a proxy tokenizer to evaluate video reconstruction quality under different token allocations; 3) training an adaptive tokenizer under router-predicted allocations to achieve adaptive length video tokenization. These innovations allow EVATok to dynamically adjust token allocation based on video content complexity, significantly improving token usage efficiency.

Methodology

  • Train a proxy tokenizer: used to evaluate video reconstruction quality under different token allocations.

  • Curate a dataset: create a dataset for training routers by evaluating the quality of different token allocations using the proxy tokenizer.

  • Train a lightweight router: predict optimal token allocations through a classification task.

  • Train the final adaptive tokenizer: train under router-predicted allocations to achieve adaptive length video tokenization.
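The dataset-curation stage above can be sketched as picking, for each clip, the smallest budget whose proxy-tokenizer reconstruction quality clears a floor. The selection criterion and interface below are illustrative assumptions, not the paper's exact rule:

```python
def label_budget(proxy_quality, budgets, quality_floor):
    """Return the router training label for one clip: the smallest
    token budget whose proxy-tokenizer reconstruction quality
    (higher is better) reaches the floor. Illustrative criterion."""
    for b in sorted(budgets):
        if proxy_quality[b] >= quality_floor:
            return b
    return max(budgets)  # no budget suffices: fall back to the largest

# Hypothetical per-budget quality scores from the proxy tokenizer.
scores = {64: 0.71, 128: 0.84, 256: 0.91}
print(label_budget(scores, scores.keys(), quality_floor=0.80))  # → 128
```

The resulting (clip, budget) pairs are what the classification router is then trained on in stage three.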

Experiments

Experiments were conducted on the UCF-101 and WebVid-10M datasets to evaluate EVATok's performance in video reconstruction and generation tasks. The experimental design includes: 1) using the proxy tokenizer to evaluate the quality of different token allocations; 2) predicting optimal token allocations through routers; 3) training the final tokenizer under router-predicted allocations. The results show significant improvements in token usage efficiency and video generation quality.

Results

Experimental results show that EVATok performs excellently in video reconstruction and class-to-video generation tasks on the UCF-101 dataset, saving at least 24.4% in token usage. On the WebVid-10M dataset, EVATok's router-guided tokenizer achieves excellent LPIPS and rFVD metrics, saving 29.6% in token length. Compared to traditional fixed-length tokenization methods, EVATok significantly enhances token usage efficiency and video generation quality.

Applications

EVATok has broad application prospects in the video generation domain, especially in scenarios requiring efficient handling of complex dynamic videos. Its adaptive tokenization strategy can be applied to video reconstruction, class-to-video generation, frame prediction, and other tasks, improving generation efficiency and quality.

Limitations & Outlook

While EVATok performs excellently in video generation tasks, it may underperform when dealing with extremely complex or simple videos, as token allocation predictions may not be precise enough. Additionally, the training process requires substantial computational resources, which may not be suitable for resource-constrained scenarios. Future research could focus on further optimizing the precision of token allocation predictions and exploring EVATok's applications in other video generation tasks.

Plain Language (Accessible to non-experts)

Imagine you're cooking in a kitchen. Traditional video generation is like following a fixed recipe, using the same steps and time regardless of the ingredients' quantity and type. This approach, while simple, may waste resources or result in less tasty dishes. EVATok is like a smart chef who adjusts cooking time and steps based on the ingredients. For complex ingredients, it spends more time and effort, while for simple ones, it finishes quickly. This not only saves resources but also ensures the quality of each dish. The same approach applies to video generation, where dynamic token allocation improves generation efficiency and quality.

ELI14 (Explained like you're 14)

Imagine you're playing a video game. Traditional video generation is like using the same strategy every time to defeat enemies, regardless of their strength, with the same weapons and skills. This method, while simple, may waste resources or fail against strong enemies. EVATok is like a smart player who adjusts strategies and gear based on the enemy. For strong enemies, it uses more powerful weapons and skills, while for weaker ones, it finishes quickly. This not only saves resources but also ensures victory in every battle. The same approach applies to video generation, where dynamic token allocation improves generation efficiency and quality.

Glossary

Autoregressive Model

A generative model that generates data by sequentially predicting each element in a sequence.

Used in video generation to generate video frame sequences.

Tokenization

The process of breaking down data into discrete tokens for model processing.

Used in video generation to compress video pixels into discrete token sequences.

Proxy Tokenizer

A tokenizer used to evaluate video reconstruction quality under different token allocations.

Used in EVATok for dataset creation for router training.

Router

A lightweight model used to predict optimal token allocations for each video.

Used in EVATok to achieve adaptive tokenization.

LPIPS

A metric for evaluating image and video reconstruction quality based on perceptual similarity.

Used in experiments to evaluate EVATok's video reconstruction quality.
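Schematically, LPIPS as defined by Zhang et al. (2018) averages weighted distances between channel-normalized deep features of the two images at several network layers:

```latex
d(x, x_0) = \sum_{l} \frac{1}{H_l W_l} \sum_{h,w}
  \left\lVert w_l \odot \left( \hat{y}^{l}_{hw} - \hat{y}^{l}_{0hw} \right) \right\rVert_2^2
```

where $\hat{y}^{l}$, $\hat{y}^{l}_{0}$ are unit-normalized activations of layer $l$ for the reconstruction and the reference, and $w_l$ are learned per-channel weights; lower values indicate higher perceptual similarity.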

rFVD

A metric for evaluating video generation quality based on the distribution similarity of generated videos.

Used in experiments to evaluate EVATok's video generation quality.

UCF-101

A commonly used video dataset containing 101 classes of action videos.

Used in experiments to evaluate EVATok's video generation performance.

WebVid-10M

A large video dataset containing a variety of video content.

Used in experiments to evaluate EVATok's video reconstruction performance.

VideoMAE

A semantic encoder used in video generation to enhance video tokenizer training.

Used in EVATok's final tokenizer training.

Generative Adversarial Network (GAN)

A generative model that generates data through adversarial training between a generator and a discriminator.

Used in EVATok's training to enhance video reconstruction quality.

Open Questions (Unanswered questions from this research)

  1. EVATok may underperform when dealing with extremely complex or simple videos, as token allocation predictions may not be precise enough. Future research could focus on further optimizing the precision of token allocation predictions.
  2. The training process of EVATok requires substantial computational resources, which may not be suitable for resource-constrained scenarios. Exploring ways to reduce computational resource requirements is a worthwhile direction.
  3. Performance fluctuations in certain cases may be related to the reliance on token allocation. Researching how to improve the stability of token allocation is a future direction.
  4. EVATok has great potential for application in other video generation tasks, and future research could explore its performance in different tasks.
  5. EVATok's adaptive tokenization strategy offers a new approach for video generation models, and future research could explore its application in other generative models.

Applications

Immediate Applications

Video Reconstruction

EVATok can be used to improve the efficiency and quality of video reconstruction, especially when handling complex dynamic videos.

Class-to-Video Generation

EVATok can be used to generate videos based on class labels, improving generation efficiency and quality.

Frame Prediction

EVATok can be used for video frame prediction tasks, improving prediction accuracy and efficiency.

Long-term Vision

Intelligent Video Editing

EVATok can be used for intelligent video editing, improving editing efficiency and quality through adaptive tokenization.

Automated Video Generation

EVATok can be used for automated video generation tasks, improving generation efficiency and quality through adaptive tokenization.

Abstract

Autoregressive (AR) video generative models rely on video tokenizers that compress pixels into discrete token sequences. The length of these token sequences is crucial for balancing reconstruction quality against downstream generation computational cost. Traditional video tokenizers apply a uniform token assignment across temporal blocks of different videos, often wasting tokens on simple, static, or repetitive segments while underserving dynamic or complex ones. To address this inefficiency, we introduce $\textbf{EVATok}$, a framework to produce $\textbf{E}$fficient $\textbf{V}$ideo $\textbf{A}$daptive $\textbf{Tok}$enizers. Our framework estimates optimal token assignments for each video to achieve the best quality-cost trade-off, develops lightweight routers for fast prediction of these optimal assignments, and trains adaptive tokenizers that encode videos based on the assignments predicted by routers. We demonstrate that EVATok delivers substantial improvements in efficiency and overall quality for video reconstruction and downstream AR generation. Enhanced by our advanced training recipe that integrates video semantic encoders, EVATok achieves superior reconstruction and state-of-the-art class-to-video generation on UCF-101, with at least 24.4% savings in average token usage compared to the prior state-of-the-art LARP and our fixed-length baseline.


References (20)

  • Taming Transformers for High-Resolution Image Synthesis. Patrick Esser, Robin Rombach, B. Ommer. 2020.
  • LARP: Tokenizing Videos with a Learned Autoregressive Generative Prior. Hanyu Wang, Saksham Suri, Yixuan Ren et al. 2024.
  • GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for Autoregressive Image Generation. Tianwei Xiong, J. Liew, Zilong Huang et al. 2025.
  • VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training. Zhan Tong, Yibing Song, Jue Wang et al. 2022.
  • Diffusion Models Beat GANs on Image Synthesis. Prafulla Dhariwal, Alex Nichol. 2021.
  • V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning. Mahmoud Assran, Adrien Bardes, David Fan et al. 2025.
  • GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. M. Heusel, Hubert Ramsauer, Thomas Unterthiner et al. 2017.
  • ElasticTok: Adaptive Tokenization for Image and Video. Wilson Yan, Matei Zaharia, Volodymyr Mnih et al. 2024.
  • Learning Adaptive and Temporally Causal Video Tokenization in a 1D Latent Space. Yan Li, Changyao Tian, Renqiu Xia et al. 2025.
  • The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. Richard Zhang, Phillip Isola, Alexei A. Efros et al. 2018.
  • Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation. Lijun Yu, José Lezama, N. B. Gundavarapu et al. 2023.
  • Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval. Max Bain, Arsha Nagrani, Gül Varol et al. 2021.
  • Adaptive Length Image Tokenization via Recurrent Allocation. Shivam Duggal, Phillip Isola, Antonio Torralba et al. 2024.
  • One-D-Piece: Image Tokenizer Meets Quality-Controllable Compression. Keita Miwa, Kento Sasaki, Hidehisa Arai et al. 2025.
  • Autoregressive Image Generation using Residual Quantization. Doyup Lee, Chiheon Kim, Saehoon Kim et al. 2022.
  • Image-to-Image Translation with Conditional Adversarial Networks. Phillip Isola, Jun-Yan Zhu, Tinghui Zhou et al. 2016.
  • OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation. Junke Wang, Yi Jiang, Zehuan Yuan et al. 2024.
  • FlexTok: Resampling Images into 1D Token Sequences of Flexible Length. Roman Bachmann, Jesse Allardice, David Mizrahi et al. 2025.
  • Towards Accurate Image Coding: Improved Autoregressive Image Generation with Dynamic Vector Quantization. Mengqi Huang, Zhendong Mao, Zhuowei Chen et al. 2023.
  • Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation. Peize Sun, Yi Jiang, Shoufa Chen et al. 2024.