AdaCodec: A Predictive Visual Code for Video MLLMs
AdaCodec employs predictive visual coding, transmitting full reference frames only when prediction is costly, reducing visual tokens by 84.7% and boosting long-video understanding efficiency.
Key Findings
Methodology
AdaCodec is built on a predictive coding framework that segments a video into groups of pictures (GOPs) and inserts a full reference frame (I-frame) only when the prediction error exceeds a threshold. Intermediate frames (P-frames) are encoded as motion and residual updates by a dedicated P-tokenizer aligned with ViT patches. The core components are: 1) an MLLM-oriented predictive codec that aligns macroblocks with ViT patches and uses motion vectors and residuals for efficient encoding; 2) a dual-branch visual token pipeline comprising a reference frame encoder and a P-frame tokenizer, both pretrained and then fine-tuned for multimodal tasks; 3) a two-stage training process, with feature alignment first to optimize P-token representations, followed by multimodal training to align visual tokens with the language model. GOP lengths adapt to the prediction cost, which keeps token usage low while maintaining accuracy.
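The adaptive GOP rule described above can be sketched as follows. This is a minimal illustration, not the authors' implementation. The source only states that a reference frame is inserted when the prediction error exceeds a threshold; the cost function (mean absolute difference to the previous frame), the `max_gop` cap, and all names here are assumptions made for the example.

```python
from typing import List, Sequence

# Toy stand-in: a frame is a flat sequence of pixel intensities.
Frame = Sequence[float]


def prediction_cost(prev: Frame, cur: Frame) -> float:
    """Hypothetical cost: mean absolute difference between consecutive frames."""
    return sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur)


def assign_frame_types(frames: List[Frame], threshold: float, max_gop: int = 16) -> List[str]:
    """Label each frame 'I' (full reference frame) or 'P' (predicted update)."""
    types: List[str] = []
    gop_len = 0  # number of frames in the GOP currently being built
    for i, frame in enumerate(frames):
        start_new_gop = (
            i == 0                                                # first frame is always a reference
            or gop_len >= max_gop                                 # assumed cap on chain length
            or prediction_cost(frames[i - 1], frame) > threshold  # prediction is too costly
        )
        if start_new_gop:
            types.append("I")
            gop_len = 1
        else:
            types.append("P")
            gop_len += 1
    return types


if __name__ == "__main__":
    static = [0.0, 0.0, 0.0, 0.0]
    cut = [1.0, 1.0, 1.0, 1.0]
    # Three identical frames, then a scene change, then one more stable frame.
    print(assign_frame_types([static, static, static, cut, cut], threshold=0.5))
```

In this toy run the scene change at the fourth frame triggers a new reference frame, while the stable frames around it stay in the predictive chain.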
Key Results
- Across eleven benchmarks, AdaCodec surpasses the Qwen3-VL-8B per-frame RGB baseline at the same visual token budget, with an average accuracy gain of 0.5 to 0.8 points. At only 1/7 of the baseline token count (32k versus 224k tokens), it still outperforms the baseline on all long-video tasks.
- Latency measurements show that AdaCodec cuts time-to-first-token from 9.26 seconds to 1.62 seconds (about 82%) and end-to-end latency from 11.18 seconds to 3.20 seconds (about 71%), while improving accuracy.
- Ablation studies confirm that predictive coding, macroblock alignment, and adaptive GOP strategies are critical to performance gains. The system maintains long-term content continuity, with adaptive GOPs extending the predictive chain length in stable scenes, further boosting accuracy.
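The percentages implied by the figures quoted above can be checked with a few lines of arithmetic. The helper name `relative_reduction` is introduced here for illustration, and "32k" and "224k" are assumed to mean 32,000 and 224,000 tokens (the ratio is the same under any consistent reading of "k").

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction by which `after` is smaller than `before`."""
    return (before - after) / before


# 9.26 s -> 1.62 s (reported in the abstract as time-to-first-token)
ttft = relative_reduction(9.26, 1.62)
# 11.18 s -> 3.20 s (end-to-end latency)
e2e = relative_reduction(11.18, 3.20)
# 32k-token AdaCodec budget versus the 224k-token baseline budget
budget_ratio = 32_000 / 224_000

print(f"Time-to-first-token reduction: {ttft:.1%}")
print(f"End-to-end latency reduction: {e2e:.1%}")
print(f"Token budget ratio: {budget_ratio:.3f}")
```

The first-token figure corresponds to a reduction of roughly 82%, the end-to-end figure to roughly 71%, and the budget ratio is exactly 1/7.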
Significance
This work addresses the fundamental bottleneck in long video understanding: the explosive growth of visual tokens due to per-frame encoding. By introducing a predictive visual coding paradigm tailored for multimodal large models, AdaCodec effectively reduces computational costs and latency while preserving or enhancing understanding accuracy. This innovation paves the way for scalable, real-time video analysis in applications like surveillance, autonomous driving, and content moderation. It also offers a new perspective on how to integrate video compression principles with AI reasoning, bridging the gap between traditional codecs and modern deep learning models. The approach's ability to adaptively balance prediction and reference frames signifies a major step toward efficient, intelligent video systems.
Technical Contribution
The primary technical contributions include: 1) the formulation of a predictive visual coding interface optimized for multimodal large models, replacing traditional per-frame RGB input; 2) the design of a macroblock-aligned P-tokenizer that encodes motion and residuals into compact tokens compatible with ViT-based models; 3) an adaptive GOP construction mechanism that triggers reference frame insertion based on prediction error, dynamically balancing token efficiency and accuracy; 4) a two-stage training pipeline that aligns visual tokens with language models via feature and multimodal supervision. These innovations enable a drastic reduction in visual token consumption while maintaining high understanding performance, opening new avenues for scalable video AI.
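To make the idea of macroblock-to-patch alignment concrete, the sketch below groups codec-style side information on the same grid a ViT would use for patches, so that each patch position receives one small feature vector. Everything here is an assumption for illustration: the source does not specify the P-token features, and the real P-tokenizer is a learned module. The function name `patch_features`, the choice of `[dx, dy, mean_abs_residual]`, and the patch size are hypothetical.

```python
from typing import List, Tuple


def patch_features(
    motion: List[List[Tuple[int, int]]],  # one (dx, dy) motion vector per block, on the patch grid
    residual: List[List[float]],          # per-pixel prediction residual, H x W
    patch: int,                           # block size, assumed equal to the ViT patch size
) -> List[List[float]]:
    """Return one [dx, dy, mean_abs_residual] vector per patch, in raster order."""
    rows, cols = len(motion), len(motion[0])
    if len(residual) != rows * patch or len(residual[0]) != cols * patch:
        raise ValueError("residual size must equal the motion grid size times the patch size")
    feats: List[List[float]] = []
    for r in range(rows):
        for c in range(cols):
            # Gather the residual pixels covered by this patch-aligned block.
            block = [
                abs(residual[y][x])
                for y in range(r * patch, (r + 1) * patch)
                for x in range(c * patch, (c + 1) * patch)
            ]
            dx, dy = motion[r][c]
            feats.append([float(dx), float(dy), sum(block) / len(block)])
    return feats


if __name__ == "__main__":
    # A 2 x 4 residual map split into two 2 x 2 blocks.
    motion = [[(1, 0), (0, 0)]]
    residual = [[0.0, 0.0, 4.0, 4.0],
                [0.0, 0.0, 4.0, 4.0]]
    print(patch_features(motion, residual, patch=2))
```

Because blocks and patches share one grid in this sketch, the per-block vectors line up one-to-one with patch positions, which is the property the alignment is meant to provide.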
Novelty
This research is the first to systematically integrate predictive coding principles into the visual interface of video multimodal large models. Unlike standard codecs optimized for human perception, AdaCodec’s predictive code is designed explicitly for AI reasoning, focusing on minimizing tokens and maximizing information content relevant for downstream tasks. Its adaptive GOP strategy and macroblock-aligned tokenization differ fundamentally from existing compression methods and prior work that treat codecs as fixed modules. This approach enables the model to retain long-term content continuity, significantly improving efficiency and robustness in long-video understanding, marking a new paradigm in AI-oriented video encoding.
Limitations
- The accuracy of motion estimation critically impacts the system; in scenes with complex, fast-moving objects, prediction errors increase, leading to more frequent reference frame insertions and reduced compression gains.
- Extreme compression ratios may cause loss of fine details, especially in high-dynamic scenes, affecting downstream task performance.
- The training process requires substantial computational resources, including large-scale pretraining and multimodal alignment, which may hinder deployment on resource-constrained devices.
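The first limitation can be illustrated with a toy block-matching search. This is a generic sketch of exhaustive motion estimation in one dimension, not the motion estimator used by AdaCodec; the function names and the search range are assumptions. It shows how a displacement larger than the search range leaves a large residual, which in an adaptive scheme would push the prediction cost up.

```python
from typing import List, Tuple


def sad(a: List[float], b: List[float]) -> float:
    """Sum of absolute differences between two equal-length blocks."""
    return sum(abs(x - y) for x, y in zip(a, b))


def match_block(prev: List[float], cur: List[float], start: int, size: int, search: int) -> Tuple[int, float]:
    """Find the shift in [-search, search] whose block in `prev` best predicts cur[start:start+size]."""
    target = cur[start:start + size]
    best = (0, float("inf"))
    for shift in range(-search, search + 1):
        lo = start + shift
        if lo < 0 or lo + size > len(prev):
            continue  # candidate block falls outside the frame
        cost = sad(prev[lo:lo + size], target)
        if cost < best[1]:
            best = (shift, cost)
    return best


if __name__ == "__main__":
    prev = [0.0, 0.0, 9.0, 9.0] + [0.0] * 8          # object at positions 2-3
    slow = [0.0, 0.0, 0.0, 9.0, 9.0] + [0.0] * 7     # object moved by 1
    fast = [0.0] * 8 + [9.0, 9.0, 0.0, 0.0]          # object moved by 6
    print(match_block(prev, slow, start=3, size=2, search=2))  # displacement within range
    print(match_block(prev, fast, start=8, size=2, search=2))  # displacement beyond range
```

With a search range of 2, the slow-moving block is predicted with zero residual, while the fast-moving block cannot be matched and keeps its full residual.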
Future Work
Future research could focus on enhancing motion estimation accuracy through learned optical flow models, integrating multi-scale prediction strategies to better handle complex scenes, and reducing computational overhead via model pruning or quantization. Additionally, exploring self-supervised training paradigms could improve generalization across diverse video domains. Extending the predictive coding framework to generative tasks, such as video synthesis and editing, and integrating with real-time streaming systems are promising directions. Further, adapting the approach for multi-camera setups and multi-modal inputs could broaden its applicability in surveillance, robotics, and AR/VR environments.
AI Executive Summary
The rapid growth of multimodal large models has greatly improved AI's ability to understand and generate complex content, yet video understanding remains a significant challenge because of the sheer volume of visual data involved. Traditional approaches sample frames and encode each one independently as an RGB image, so the visual token count grows linearly with the number of sampled frames, driving up computational cost and latency. This bottleneck hampers the deployment of long-video AI systems in real-world applications such as surveillance, autonomous navigation, and multimedia analysis.
In response to this challenge, the paper introduces AdaCodec, a novel predictive visual coding framework designed specifically for video multimodal large models. Inspired by principles from biological predictive coding and modern video codecs, AdaCodec dynamically balances the transmission of full reference frames and compact inter-frame change representations. It constructs a hierarchical GOP structure where full reference frames (I-frames) are inserted only when prediction errors exceed a certain threshold, while intermediate frames (P-frames) are encoded as motion vectors and residuals—collectively termed P-tokens. This adaptive strategy significantly reduces redundant information, leading to an 84.7% reduction in visual tokens compared to traditional per-frame RGB encoding.
The core of AdaCodec’s architecture involves a dual-branch tokenization pipeline: a reference frame encoder aligned with ViT patches, and a P-frame tokenizer that encodes motion and residuals efficiently. Both components are pretrained and fine-tuned through a two-stage training process—first, feature alignment to optimize P-token representations; second, multimodal training to align visual tokens with language models. The system employs an adaptive GOP construction mechanism, where the insertion of reference frames is triggered by prediction costs, enabling the model to extend predictive chains in stable scenes and refresh reference frames in dynamic scenes.
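The effect of extending predictive chains on the token count can be sketched with simple accounting. The per-frame token counts below (256 for a reference frame, 16 for a P-frame) are hypothetical values chosen for the example and are not taken from the paper; `visual_tokens` is likewise an illustrative helper.

```python
def visual_tokens(frame_types: str, tokens_per_i: int, tokens_per_p: int) -> int:
    """Total visual tokens for a sequence of frame types such as 'IPPPIPPP'."""
    total = 0
    for t in frame_types:
        if t == "I":
            total += tokens_per_i   # full reference frame
        elif t == "P":
            total += tokens_per_p   # compact motion-and-residual update
        else:
            raise ValueError(f"unknown frame type: {t!r}")
    return total


if __name__ == "__main__":
    TOKENS_I, TOKENS_P = 256, 16  # hypothetical per-frame token counts

    per_frame = "I" * 16          # every sampled frame encoded as a full image
    dynamic = "IPPP" * 4          # frequent refreshes in a changing scene
    stable = "I" + "P" * 15       # one long predictive chain in a stable scene

    for name, seq in [("per-frame", per_frame), ("dynamic", dynamic), ("stable", stable)]:
        print(name, visual_tokens(seq, TOKENS_I, TOKENS_P))
```

Under these assumed counts, sixteen frames cost 4096 tokens when encoded independently, 1216 with a refresh every four frames, and 496 with a single reference frame, which shows why longer chains in stable scenes save the most.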
Experiments across eleven benchmarks show gains in both efficiency and accuracy. At a matched visual token budget, AdaCodec outperforms the per-frame RGB baseline; on long-video tasks it still surpasses the baseline while using only 1/7 of the tokens. Time-to-first-token drops by over 80% (9.26 seconds to 1.62 seconds) and end-to-end latency by about 71% (11.18 seconds to 3.20 seconds). Ablation studies confirm the importance of predictive coding, macroblock alignment, and adaptive GOP strategies. The system also exhibits content-dependent GOP behavior, adapting to scene stability and motion complexity.
This work fundamentally advances the state of video understanding by integrating predictive coding into the visual interface of multimodal models. It addresses long-standing challenges of redundancy and latency, offering a scalable solution for real-time, long-duration video analysis. The approach opens new avenues for research, including more accurate motion modeling, multi-scale prediction, and broader applications in video synthesis and interactive AI systems. Despite some limitations in highly dynamic scenes and computational costs, AdaCodec sets a new benchmark for efficient, high-performance long-video AI, with promising implications for both academia and industry.
Deep Dive
Abstract
Video is temporally redundant: adjacent frames usually share most objects, background, and layout. Yet existing video multimodal large language models (video MLLMs) usually encode each sampled frame as an independent RGB image, causing visual tokens to repeat content already present in earlier frames. This suggests a more direct video interface: send a full reference frame only when the scene cannot be predicted well from prior context, and otherwise transmit a compact description of inter-frame changes. We call this interface a predictive visual code, and instantiate it for video MLLMs as AdaCodec. AdaCodec spends full visual tokens on a reference frame only when its conditional predictive cost is high; otherwise, it encodes inter-frame changes, including motion and prediction residuals, as compact P-tokens. Across all eleven benchmarks, AdaCodec improves over the Qwen3-VL-8B per-frame RGB baseline at a matched visual-token budget. Even at 1/7 the budget, AdaCodec with 32k tokens surpasses the 224k baseline on all long-video benchmarks; on five general-video benchmarks, it raises the average score while substantially cutting time-to-first-token from 9.26 s to 1.62 s.