Unified Spatio-Temporal Token Scoring for Efficient Video VLMs
Introduces Spatio-Temporal Token Scoring (STTS), improving video VLM efficiency by 62% with minimal performance drop.
Key Findings
Methodology
This paper introduces a lightweight module called Spatio-Temporal Token Scoring (STTS) designed to prune vision tokens through a unified spatio-temporal scoring mechanism. STTS prunes tokens in both the Vision Transformer (ViT) and the Large Language Model (LLM) without requiring text conditioning or token merging. By learning to score temporally via an auxiliary loss and spatially via LLM downstream gradients, STTS can prune 50% of vision tokens across the entire architecture.
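The spatial side of this scoring-and-pruning idea fits in a few lines. Below is a minimal PyTorch sketch of top-k token pruning with a learned score head; the linear head, sigmoid weighting, and keep ratio are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class SpatialTokenScorer(nn.Module):
    """Score-and-prune sketch: keep the top-k vision tokens by learned score.

    Illustrative only -- the linear score head, sigmoid weighting, and
    top-k selection are assumptions, not the paper's code.
    """

    def __init__(self, dim: int, keep_ratio: float = 0.5):
        super().__init__()
        self.score_head = nn.Linear(dim, 1)  # hypothetical scoring head
        self.keep_ratio = keep_ratio

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim)
        scores = self.score_head(tokens).squeeze(-1)            # (B, N)
        k = max(1, int(tokens.size(1) * self.keep_ratio))
        top = scores.topk(k, dim=1).indices                     # (B, k)
        kept = tokens.gather(1, top.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))
        # Weighting kept tokens by their scores keeps the score head on the
        # gradient path, so a downstream LLM loss can train it.
        weights = torch.sigmoid(scores.gather(1, top)).unsqueeze(-1)
        return kept * weights                                   # (B, k, dim)
```

With `keep_ratio=0.5`, calling this module on the ViT output would halve the token count entering the LLM, which is the regime the paper's headline numbers refer to.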
Key Results
- STTS achieved a 62% efficiency improvement across 13 short and long video QA tasks, with only a 0.7% drop in average performance. Test-time scaling for long-video QA further improved performance by 0.5-1%.
- In experiments, STTS showed a performance drop of only 0.7% at a 50% token pruning rate, demonstrating stability and robustness across diverse tasks.
- With the auxiliary loss of neighboring-frame cosine similarity, STTS effectively identifies and prunes redundant temporal frames, achieving significant computational acceleration in long video understanding.
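On the temporal side, the neighboring-frame cosine-similarity signal can be sketched as follows; the per-frame mean pooling and the MSE regression form are assumptions made for illustration, since the summary does not pin down these details.

```python
import torch
import torch.nn.functional as F

def neighbor_frame_similarity(frame_tokens: torch.Tensor) -> torch.Tensor:
    """Cosine similarity of each frame to its predecessor.

    frame_tokens: (num_frames, tokens_per_frame, dim) ViT patch tokens.
    Mean-pooling each frame to one embedding is an assumption of this sketch.
    A value near 1 marks a frame that adds little new information.
    """
    frame_emb = frame_tokens.mean(dim=1)                               # (T, dim)
    return F.cosine_similarity(frame_emb[1:], frame_emb[:-1], dim=-1)  # (T-1,)

def temporal_aux_loss(pred_scores: torch.Tensor,
                      frame_tokens: torch.Tensor) -> torch.Tensor:
    """Regress predicted frame importance onto 1 - neighbor similarity,
    so highly redundant frames receive low target scores. The MSE form is
    illustrative; the paper's exact loss may differ."""
    targets = 1.0 - neighbor_frame_similarity(frame_tokens)
    return F.mse_loss(pred_scores[1:], targets)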
Significance
The introduction of STTS provides a novel solution for improving computational efficiency in video vision-language models. By substantially reducing computational overhead without notably affecting model performance, STTS gives both research and industry a more efficient tool for long-video tasks. It addresses the shortcomings of existing methods in handling cross-frame temporal redundancy, making video processing more scalable.
Technical Contribution
STTS's technical contribution lies in pruning tokens directly in the ViT and LLM without complex text-conditioned selection or merging algorithms. This not only simplifies the architecture but also achieves effective token scoring through an auxiliary loss and downstream-gradient learning. Additionally, STTS's efficient packing algorithm further optimizes computational resource utilization.
Novelty
STTS is the first to achieve unified vision token pruning across the entire architecture without relying on complex text conditioning or merging algorithms. Compared to existing methods, STTS significantly enhances computational efficiency in video processing tasks through its simple module design and innovative scoring mechanism.
Limitations
- STTS may still face computational bottlenecks when processing extremely long videos, especially when dealing with a large number of frames.
- While STTS performs well in most tasks, it may require further parameter optimization in certain specific tasks to achieve optimal performance.
- STTS's performance may significantly degrade at extreme pruning rates, necessitating trade-offs in practical applications.
Future Work
Future research directions could include optimizing STTS's performance on extremely long videos and exploring its application in other multimodal tasks. Combining STTS with other pruning techniques for further gains in computational efficiency is also a worthwhile direction.
AI Executive Summary
In recent years, vision-language models (VLMs) have made significant strides in video understanding, but this progress comes with substantial computational costs. Processing video requires encoding a large number of frames, each decomposed into hundreds of patch tokens by a Vision Transformer (ViT). As the number of frames increases, the resulting token sequences become quadratically expensive under attention, leading to significant memory usage, reduced training throughput, and increased inference latency.
Existing pruning methods only address part of the problem. Pre-ViT and in-ViT approaches reduce token redundancy before or during ViT encoding, employing strategies such as early exiting, token matching and mixing, and attention-based scoring. While effective for spatial redundancy in unimodal perception tasks, these methods are not explicitly designed for multimodal VLM objectives and do not account for cross-frame temporal redundancy in video inputs.
This paper introduces a lightweight module called Spatio-Temporal Token Scoring (STTS), designed to prune vision tokens through a unified spatio-temporal scoring mechanism. STTS prunes tokens in both the Vision Transformer (ViT) and the Large Language Model (LLM) without requiring text conditioning or token merging. By learning to score temporally via an auxiliary loss and spatially via LLM downstream gradients, STTS can prune 50% of vision tokens across the entire architecture.
In experiments, STTS achieved a 62% efficiency improvement across 13 short and long video QA tasks, with only a 0.7% drop in average performance. Test-time scaling for long-video QA further improved performance by 0.5-1%. The introduction of STTS provides a novel solution for enhancing computational efficiency in video vision-language models.
STTS's technical contribution lies in its ability to prune tokens directly in the ViT and LLM without complex text-conditioned selection or merging algorithms. This approach not only simplifies the architecture but also achieves more efficient token scoring through auxiliary loss and downstream gradient learning. Additionally, STTS's efficient packing algorithm further optimizes computational resource utilization. Future research directions could include optimizing STTS's performance in extremely long videos and exploring its application in other multimodal tasks.
Deep Analysis
Background
Vision-language models (VLMs) have made significant strides in video understanding, but this progress comes with substantial computational costs. Processing video requires encoding a large number of frames, each decomposed into hundreds of patch tokens by a Vision Transformer (ViT). As the number of frames increases, the resulting token sequences become quadratically expensive under attention, leading to significant memory usage, reduced training throughput, and increased inference latency. Existing pruning methods only address part of the problem. Pre-ViT and in-ViT approaches reduce token redundancy before or during ViT encoding, employing strategies such as early exiting, token matching and mixing, and attention-based scoring. While effective for spatial redundancy in unimodal perception tasks, these methods are not explicitly designed for multimodal VLM objectives and do not account for cross-frame temporal redundancy in video inputs.
Core Problem
Video vision-language models face computational-efficiency challenges on long-video tasks. Existing methods fall short in addressing cross-frame temporal redundancy, leading to high computational costs. The core problem is how to reduce this overhead without significantly affecting model performance.
Innovation
This paper introduces a lightweight module called Spatio-Temporal Token Scoring (STTS), designed to prune vision tokens through a unified spatio-temporal scoring mechanism. STTS prunes tokens in both the Vision Transformer (ViT) and the Large Language Model (LLM) without requiring text conditioning or token merging. By learning to score temporally via an auxiliary loss and spatially via LLM downstream gradients, STTS can prune 50% of vision tokens across the entire architecture.
Methodology
- STTS Module Design: Prunes tokens in both the ViT and the LLM without text conditioning or token merging.
- Temporal Scoring: Learned via an auxiliary loss.
- Spatial Scoring: Learned via LLM downstream gradients.
- Efficient Packing Algorithm: Optimizes computational resource utilization (a sketch of one plausible packing layout follows this list).
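The summary does not specify what the packing algorithm does internally. One plausible reading is that, since pruning leaves each sample with a different number of tokens, the ragged sequences are concatenated into one dense buffer with cumulative offsets, the layout that variable-length attention kernels consume. A minimal sketch under that assumption:

```python
import torch

def pack_pruned_tokens(token_lists: list[torch.Tensor]):
    """Pack ragged per-sample token sequences into one dense buffer.

    token_lists holds per-sample tensors of shape (n_i, dim); n_i varies
    because pruning keeps different numbers of tokens per sample. Returns
    the packed (sum(n_i), dim) tensor plus cumulative sequence lengths.
    This is an assumed reading of the paper's packing step, not its code.
    """
    packed = torch.cat(token_lists, dim=0)
    lengths = torch.tensor([t.size(0) for t in token_lists])
    cu_seqlens = torch.zeros(len(token_lists) + 1, dtype=torch.int32)
    cu_seqlens[1:] = lengths.cumsum(0).to(torch.int32)
    return packed, cu_seqlens
```

Packing in this style avoids padding every sample up to the longest pruned sequence, so no attention compute is wasted on placeholder tokens.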
Experiments
Experiments were conducted across 13 short and long video QA tasks to evaluate the efficiency improvement and performance impact of STTS. By comparing model performance at different pruning rates, the stability and robustness of STTS were verified. Results showed that STTS demonstrated minimal performance degradation of only 0.7% at a 50% token pruning rate.
Results
STTS achieved a 62% efficiency improvement across 13 short and long video QA tasks, with only a 0.7% drop in average performance. Test-time scaling for long-video QA further improved performance by 0.5-1%. With the auxiliary loss of neighboring-frame cosine similarity, STTS effectively identifies and prunes redundant temporal frames, achieving significant computational acceleration in long video understanding.
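To make the efficiency and test-time-scaling claims concrete, a back-of-the-envelope token budget helps; the frame and patch-token counts below are assumed values for illustration, not the paper's configuration.

```python
# Illustrative token budget; 196 patch tokens per frame and 64 frames
# are assumed values, not the paper's exact settings.
tokens_per_frame = 196
frames = 64
full_budget = frames * tokens_per_frame                  # 12,544 LLM tokens

keep_ratio = 0.5
tokens_after_prune = int(tokens_per_frame * keep_ratio)  # 98 per frame

# At the same LLM token budget, 50% pruning lets roughly twice as many
# frames be sampled -- one way to read the test-time scaling result.
frames_at_same_budget = full_budget // tokens_after_prune  # 128 frames
```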
Applications
STTS can be directly applied to scenarios requiring efficient video processing, such as video surveillance and video QA systems. By reducing computational overhead, STTS can significantly improve processing efficiency without affecting performance.
Limitations & Outlook
STTS may still face computational bottlenecks when processing extremely long videos, especially when dealing with a large number of frames. While STTS performs well in most tasks, it may require further parameter optimization in certain specific tasks to achieve optimal performance. STTS's performance may significantly degrade at extreme pruning rates, necessitating trade-offs in practical applications.
Plain Language (Accessible to non-experts)
Imagine a busy kitchen where the chefs need to prepare a large meal quickly. Each chef has a pile of ingredients (like frames in a video), but not all of them are necessary. To work efficiently, the chefs must decide which ingredients are essential and which can be skipped. STTS acts like a smart assistant, helping them quickly spot the less important ingredients (redundant frames), saving time and effort. The kitchen finishes the work faster without compromising the quality of the dishes. STTS plays the same role in video processing: its intelligent token-pruning mechanism lets the model handle video data more efficiently without hurting performance.
ELI14 (Explained like you're 14)
Imagine you're playing a game where the goal is to tidy a super messy room. The room is full of stuff, and you have to decide quickly which things matter and which can be set aside for now. STTS is like a helper that spots the less important things for you, so you finish faster. In video processing, STTS does the same job: it helps the model find the important information without wading through every detail, so it finishes tasks faster and more efficiently.
Glossary
Vision-Language Model
A model that combines visual and language information to understand and generate multimodal data.
In this paper, VLMs are used for video understanding tasks.
Vision Transformer
A neural network architecture based on self-attention mechanisms, used for processing visual data.
The paper uses ViT to decompose video frames into patch tokens.
Large Language Model
A deep learning model capable of processing and generating natural language, typically with a large number of parameters.
In this paper, the LLM processes the output of the ViT.
Token Pruning
A method to reduce the computational burden of a model by selectively discarding unimportant tokens.
STTS improves model efficiency through token pruning.
Spatio-Temporal Token Scoring
A scoring mechanism used to evaluate and prune vision tokens in both spatial and temporal dimensions.
The paper introduces STTS to enhance video processing efficiency.
Auxiliary Loss
An additional loss function used in training to help the model learn specific tasks.
In STTS, auxiliary loss is used for learning temporal scoring.
Downstream Gradients
Gradient information propagated back from the final task loss to adjust model parameters.
STTS uses downstream gradients to learn spatial scoring.
Efficient Packing Algorithm
A method to optimize computational resource utilization by reorganizing data to reduce computational burden.
STTS uses an efficient packing algorithm to optimize resource utilization.
Neighboring-Frame Cosine Similarity
A method to measure similarity between adjacent frames, helping identify redundant information.
STTS uses neighboring-frame cosine similarity as an auxiliary loss.
Test-Time Scaling
Spending additional computation at inference time, for example by sampling more frames per video, to improve performance.
In long-video QA, STTS improves performance through test-time scaling.
Open Questions (Unanswered questions from this research)
1. How can STTS's performance on extremely long videos be further optimized? Current methods may still hit computational bottlenecks at very large frame counts, calling for more efficient pruning strategies.
2. What is the potential for applying STTS to other multimodal tasks? Its applicability and performance across tasks need further study.
3. How can STTS be combined with other pruning techniques for more efficient computational resource utilization? The current method is effective, but room for improvement remains.
4. How can model performance be maintained at extreme pruning rates? Smarter pruning strategies are needed to balance efficiency and performance.
5. How can computational overhead be reduced further without affecting model performance? More advanced pruning and optimization techniques warrant investigation.
Applications
Immediate Applications
Video Surveillance
STTS can be used to improve the processing efficiency of video surveillance systems by reducing computational overhead for faster real-time monitoring.
Video QA Systems
In video QA systems, STTS can help models process and understand video content faster, improving response speed.
Video Editing
STTS can be used in video editing software to accelerate video processing and rendering through intelligent pruning.
Long-term Vision
Intelligent Traffic Systems
STTS can be applied in intelligent traffic systems to achieve smarter traffic management and monitoring through efficient video processing.
Virtual Reality
In virtual reality applications, STTS can help improve video rendering efficiency, providing a smoother user experience.
Abstract
Token pruning is essential for enhancing the computational efficiency of vision-language models (VLMs), particularly for video-based tasks where temporal redundancy is prevalent. Prior approaches typically prune tokens either (1) within the vision transformer (ViT) exclusively for unimodal perception tasks such as action recognition and object segmentation, without adapting to downstream vision-language tasks; or (2) only within the LLM while leaving the ViT output intact, often requiring complex text-conditioned token selection mechanisms. In this paper, we introduce Spatio-Temporal Token Scoring (STTS), a simple and lightweight module that prunes vision tokens across both the ViT and the LLM without text conditioning or token merging, and is fully compatible with end-to-end training. By learning how to score temporally via an auxiliary loss and spatially via LLM downstream gradients, aided by our efficient packing algorithm, STTS prunes 50% of vision tokens throughout the entire architecture, resulting in a 62% improvement in efficiency during both training and inference with only a 0.7% drop in average performance across 13 short and long video QA tasks. Efficiency gains increase with more sampled frames per video. Applying test-time scaling for long-video QA further yields performance gains of 0.5-1% compared to the baseline. Overall, STTS represents a novel, simple yet effective technique for unified, architecture-wide vision token pruning.