AdaCodec: A Predictive Visual Code for Video MLLMs

TL;DR

AdaCodec feeds video MLLMs a predictive visual code: it sends a full reference frame only when a frame is costly to predict and otherwise sends compact inter-frame changes, cutting visual tokens by 84.7% and making long-video understanding more efficient.

cs.CV Advanced 2026-06-02

Haowen Hou, Zhen Huang, Zheming Liang, Qingyi Si, Chenglin Li, Shuai Dong, Kele Shao, Ruilin Li, Dianyi Wang, Nan Duan, Jiaqi Wang


video understanding, multimodal large models, predictive coding, video compression, efficiency

Key Findings

Methodology

AdaCodec follows a predictive coding framework. It segments a video into groups of pictures (GOPs) and inserts a full reference frame (I-frame) only when the prediction error exceeds a threshold; the frames in between (P-frames) are encoded as motion and residual updates by a P-tokenizer aligned with ViT patches. The core components are: 1) an MLLM-oriented predictive codec that aligns macroblocks with ViT patches and encodes frames through motion vectors and residuals; 2) a dual-branch visual token pipeline, made up of a reference frame encoder and a P-frame tokenizer, both pretrained and then fine-tuned for multimodal tasks; 3) a two-stage training process, with feature alignment to optimize P-token representations followed by multimodal training to align visual tokens with the language model. GOP length is adjusted dynamically from the prediction cost, so stable scenes use fewer tokens while accuracy is maintained.
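The adaptive GOP rule described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the prediction cost is approximated by the mean absolute difference to the previous frame (a zero-motion predictor), and the names `plan_gops`, `threshold`, and `max_gop` are hypothetical.

```python
def mean_abs_diff(a, b):
    """Mean absolute pixel difference between two equally sized 2D frames."""
    total = 0.0
    count = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            total += abs(pa - pb)
            count += 1
    return total / count


def plan_gops(frames, threshold, max_gop=None):
    """Label each frame 'I' (full reference) or 'P' (predicted).

    Assumption: the cost of predicting a frame is its mean absolute
    difference to the previous frame; a real codec would use a
    motion-compensated cost instead.
    """
    labels = []
    gop_len = 0
    prev = None
    for frame in frames:
        if prev is None:
            is_ref = True  # the first frame always opens a GOP
        else:
            cost = mean_abs_diff(frame, prev)
            too_long = max_gop is not None and gop_len >= max_gop
            is_ref = cost > threshold or too_long
        if is_ref:
            labels.append("I")
            gop_len = 1
        else:
            labels.append("P")
            gop_len += 1
        prev = frame
    return labels


def flat(value, size=4):
    """Hypothetical helper: a size x size frame filled with one value."""
    return [[float(value)] * size for _ in range(size)]


# Toy clip: three near-identical frames, a scene cut, then a stable frame.
clip = [flat(0), flat(1), flat(2), flat(100), flat(101)]
labels = plan_gops(clip, threshold=10.0)
print(labels)
```

On this toy clip only the first frame and the frame after the scene cut become reference frames. The optional `max_gop` cap is an added safeguard against very long predictive chains and is not something the summary attributes to AdaCodec.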

Key Results

Across eleven benchmarks, AdaCodec surpasses the Qwen3-VL-8B per-frame RGB baseline at the same visual token budget, with an average accuracy gain of 0.5-0.8 points. At only 1/7 of the baseline token count (32k versus 224k tokens), it still outperforms the 224k-token baseline on all long-video tasks.
Latency measurements show that AdaCodec cuts time-to-first-token from 9.26 seconds to 1.62 seconds, a reduction of about 82%, and end-to-end latency from 11.18 seconds to 3.20 seconds, a reduction of about 71%, while also improving accuracy.
Ablation studies confirm that predictive coding, macroblock alignment, and adaptive GOP strategies are critical to performance gains. The system maintains long-term content continuity, with adaptive GOPs extending the predictive chain length in stable scenes, further boosting accuracy.
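The ratios behind these figures can be checked with simple arithmetic. The sketch below uses only the numbers quoted above; the helper name `reduction` is hypothetical, and "32k" and "224k" are treated as 32,000 and 224,000 (the ratio is the same if they denote multiples of 1,024).

```python
def reduction(before, after):
    """Fractional reduction when a quantity goes from `before` to `after`."""
    return (before - after) / before


token_cut = reduction(224_000, 32_000)  # visual tokens: 224k baseline vs 32k
ttft_cut = reduction(9.26, 1.62)        # time-to-first-token, in seconds
e2e_cut = reduction(11.18, 3.20)        # end-to-end latency, in seconds

print(f"tokens: {token_cut:.1%}, TTFT: {ttft_cut:.1%}, end-to-end: {e2e_cut:.1%}")
```

A 1/7 budget corresponds to a token cut of about 85.7%, which is close to, though not identical with, the 84.7% figure quoted in the summary; that figure may have been measured under a different setting.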

Significance

This work targets a central bottleneck in long-video understanding: with per-frame encoding, the number of visual tokens grows with every sampled frame, even when consecutive frames show nearly the same content. By introducing a predictive visual code tailored to multimodal large models, AdaCodec reduces computational cost and latency while preserving or improving accuracy. This makes longer videos tractable in applications such as surveillance, autonomous driving, and content moderation. It also shows how video compression principles can be combined with model-based reasoning, linking traditional codecs with modern deep learning models. Its adaptive balance between predicted frames and reference frames is a step toward more efficient video systems.

Technical Contribution

The primary technical contributions include: 1) the formulation of a predictive visual coding interface optimized for multimodal large models, replacing traditional per-frame RGB input; 2) the design of a macroblock-aligned P-tokenizer that encodes motion and residuals into compact tokens compatible with ViT-based models; 3) an adaptive GOP construction mechanism that triggers reference frame insertion based on prediction error, dynamically balancing token efficiency and accuracy; 4) a two-stage training pipeline that aligns visual tokens with language models via feature and multimodal supervision. These innovations enable a drastic reduction in visual token consumption while maintaining high understanding performance, opening new avenues for scalable video AI.
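To make the idea of patch-aligned motion and residuals concrete, here is a minimal block-matching sketch. It is a classical stand-in, not AdaCodec's learned P-tokenizer: the block size is simply set equal to a hypothetical ViT patch size, and `encode_p_frame`, `patch`, and `search` are illustrative names.

```python
def get_block(frame, top, left, size):
    """Return the size x size block at (top, left), or None if out of bounds."""
    h, w = len(frame), len(frame[0])
    if top < 0 or left < 0 or top + size > h or left + size > w:
        return None
    return [row[left:left + size] for row in frame[top:top + size]]


def sad(a, b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def encode_p_frame(ref, cur, patch=2, search=1):
    """For each patch of `cur`, find the best-matching block in `ref`.

    Returns {(patch_row, patch_col): (dy, dx, residual_sad)}, where the
    current patch at (top, left) is predicted from the reference block at
    (top + dy, left + dx). Frame sides are assumed divisible by `patch`.
    """
    out = {}
    for top in range(0, len(cur), patch):
        for left in range(0, len(cur[0]), patch):
            target = get_block(cur, top, left, patch)
            best = None
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    cand = get_block(ref, top + dy, left + dx, patch)
                    if cand is None:
                        continue
                    # Prefer lower residual, then smaller motion on ties.
                    key = (sad(target, cand), abs(dy) + abs(dx))
                    if best is None or key < best[0]:
                        best = (key, dy, dx)
            out[(top // patch, left // patch)] = (best[1], best[2], best[0][0])
    return out


# Reference frame with unique pixel values; the current frame is the
# reference shifted one pixel to the right (left column duplicated).
ref = [[c + 10 * r for c in range(4)] for r in range(4)]
cur = [[row[0]] + row[:3] for row in ref]
codes = encode_p_frame(ref, cur)
print(codes)
```

In this toy case the right-hand patches are explained entirely by a one-pixel motion vector with zero residual, while the left-hand patches, whose true source lies outside the frame, keep a small residual. That split between "what moved" and "what could not be predicted" is the information a P-token is meant to carry.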

Novelty

This research is the first to systematically integrate predictive coding principles into the visual interface of video multimodal large models. Unlike standard codecs optimized for human perception, AdaCodec’s predictive code is designed explicitly for AI reasoning, focusing on minimizing tokens and maximizing information content relevant for downstream tasks. Its adaptive GOP strategy and macroblock-aligned tokenization differ fundamentally from existing compression methods and prior work that treat codecs as fixed modules. This approach enables the model to retain long-term content continuity, significantly improving efficiency and robustness in long-video understanding, marking a new paradigm in AI-oriented video encoding.

Limitations

The accuracy of motion estimation critically impacts the system; in scenes with complex, fast-moving objects, prediction errors increase, leading to more frequent reference frame insertions and reduced compression gains.
Extreme compression ratios may cause loss of fine details, especially in high-dynamic scenes, affecting downstream task performance.
The training process requires substantial computational resources, including large-scale pretraining and multimodal alignment, which may hinder deployment on resource-constrained devices.

Future Work

Future research could focus on enhancing motion estimation accuracy through learned optical flow models, integrating multi-scale prediction strategies to better handle complex scenes, and reducing computational overhead via model pruning or quantization. Additionally, exploring self-supervised training paradigms could improve generalization across diverse video domains. Extending the predictive coding framework to generative tasks, such as video synthesis and editing, and integrating with real-time streaming systems are promising directions. Further, adapting the approach for multi-camera setups and multi-modal inputs could broaden its applicability in surveillance, robotics, and AR/VR environments.

AI Executive Summary

The rapid growth of multimodal large models has greatly expanded AI's ability to understand and generate complex content, yet video understanding remains difficult because of the sheer volume of visual data involved. Traditional approaches sample frames and encode each one independently as an RGB image, so the visual token count grows linearly with the number of sampled frames, driving up computational cost and latency. This bottleneck hampers the deployment of long-video AI systems in real-world applications such as surveillance, autonomous navigation, and multimedia analysis.

In response to this challenge, the paper introduces AdaCodec, a predictive visual coding framework designed specifically for video multimodal large models. Inspired by principles from biological predictive coding and modern video codecs, AdaCodec dynamically balances the transmission of full reference frames and compact inter-frame change representations. It organizes the video into GOPs in which full reference frames (I-frames) are inserted only when prediction errors exceed a threshold, while intermediate frames (P-frames) are encoded as motion vectors and residuals, collectively termed P-tokens. This adaptive strategy removes much of the redundant information, leading to an 84.7% reduction in visual tokens compared to traditional per-frame RGB encoding.

The core of AdaCodec's architecture is a dual-branch tokenization pipeline: a reference frame encoder aligned with ViT patches, and a P-frame tokenizer that encodes motion and residuals efficiently. Both components are pretrained and fine-tuned through a two-stage training process: first, feature alignment to optimize P-token representations; second, multimodal training to align visual tokens with language models. The system employs an adaptive GOP construction mechanism in which the insertion of reference frames is triggered by prediction cost, so the model extends predictive chains in stable scenes and refreshes reference frames in dynamic scenes.
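The effect of the two branches on the token budget can be sketched with a simple count. The per-frame costs below (256 tokens for a reference frame, 16 for a P-frame) are hypothetical values chosen only for illustration and are not taken from the paper; `visual_token_count` is likewise an illustrative name.

```python
def visual_token_count(labels, tokens_per_i, tokens_per_p):
    """Total visual tokens for a frame labelling such as ['I', 'P', 'P']."""
    n_i = labels.count("I")
    n_p = labels.count("P")
    if n_i + n_p != len(labels):
        raise ValueError("labels must contain only 'I' or 'P'")
    return n_i * tokens_per_i + n_p * tokens_per_p


TOKENS_I = 256  # hypothetical cost of a full reference frame
TOKENS_P = 16   # hypothetical cost of a P-frame

per_frame_rgb = ["I"] * 16              # baseline: every frame is a full frame
stable_scene = ["I"] + ["P"] * 15       # one long predictive chain
dynamic_scene = ["I", "P"] * 8          # frequent reference refreshes

baseline = visual_token_count(per_frame_rgb, TOKENS_I, TOKENS_P)
stable = visual_token_count(stable_scene, TOKENS_I, TOKENS_P)
dynamic = visual_token_count(dynamic_scene, TOKENS_I, TOKENS_P)
print(baseline, stable, dynamic)
```

Under these assumed costs the stable clip needs far fewer tokens than the dynamic one, which illustrates why the savings are content-dependent: the more often a reference frame has to be refreshed, the closer the count moves back toward the per-frame baseline.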

Extensive experiments across eleven benchmarks demonstrate AdaCodec's efficiency and accuracy. At a matched visual token budget, it outperforms the RGB baseline; on long-video tasks, it maintains or improves accuracy while using only 1/7 of the tokens. Time-to-first-token drops by over 80% (from 9.26 s to 1.62 s). Ablation studies confirm the importance of predictive coding, macroblock alignment, and adaptive GOP strategies. The system also exhibits content-dependent GOP behavior, adapting to scene stability and motion complexity.

This work advances video understanding by integrating predictive coding into the visual interface of multimodal models. It addresses the long-standing problems of redundancy and latency and offers a scalable approach to long-duration video analysis. It also opens directions for further research, including more accurate motion modeling, multi-scale prediction, and broader applications in video synthesis and interactive AI systems. Despite limitations in highly dynamic scenes and in training cost, AdaCodec is a strong reference point for efficient long-video AI, with implications for both academia and industry.

Deep Dive

Abstract

Video is temporally redundant: adjacent frames usually share most objects, background, and layout. Yet existing video multimodal large language models (video MLLMs) usually encode each sampled frame as an independent RGB image, causing visual tokens to repeat content already present in earlier frames. This suggests a more direct video interface: send a full reference frame only when the scene cannot be predicted well from prior context, and otherwise transmit a compact description of inter-frame changes. We call this interface a predictive visual code, and instantiate it for video MLLMs as AdaCodec. AdaCodec spends full visual tokens on a reference frame only when its conditional predictive cost is high; otherwise, it encodes inter-frame changes, including motion and prediction residuals, as compact P-tokens. Across all eleven benchmarks, AdaCodec improves over the Qwen3-VL-8B per-frame RGB baseline at a matched visual-token budget. Even at 1/7 the budget, AdaCodec with 32k tokens surpasses the 224k baseline on all long-video benchmarks; on five general-video benchmarks, it raises the average score while substantially cutting time-to-first-token from 9.26s to 1.62s.

cs.CV cs.AI cs.CL


