Unified Spatio-Temporal Token Scoring for Efficient Video VLMs

TL;DR

Introduces Spatio-Temporal Token Scoring (STTS), which improves video VLM efficiency by 62% with only a 0.7% drop in average performance.

cs.CV Β· 2026-03-19
Jianrui Zhang, Yue Yang, Rohun Tripathi, Winson Han, Ranjay Krishna, Christopher Clark, Yong Jae Lee, Sangho Lee
vision-language models Β· video processing Β· token pruning Β· computational efficiency Β· spatio-temporal analysis

Key Findings

Methodology

This paper introduces a lightweight module called Spatio-Temporal Token Scoring (STTS) designed to prune vision tokens through a unified spatio-temporal scoring mechanism. STTS prunes tokens in both the Vision Transformer (ViT) and the Large Language Model (LLM) without requiring text conditioning or token merging. By learning to score temporally via an auxiliary loss and spatially via LLM downstream gradients, STTS can prune 50% of vision tokens across the entire architecture.
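The unified scoring idea can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each vision token already has a scalar temporal score and a scalar spatial score, and that the two are fused by simple multiplication (the fusion rule is an assumption).

```python
import numpy as np

def stts_prune(tokens, temporal_scores, spatial_scores, keep_ratio=0.5):
    """Keep the top keep_ratio fraction of tokens by combined score.

    tokens: (N, D) array of vision token embeddings.
    temporal_scores, spatial_scores: (N,) per-token scores.
    Returns the surviving tokens and their (order-preserving) indices.
    """
    combined = temporal_scores * spatial_scores       # hypothetical fusion rule
    k = max(1, int(len(tokens) * keep_ratio))
    keep_idx = np.argsort(combined)[::-1][:k]         # highest-scored tokens survive
    keep_idx = np.sort(keep_idx)                      # restore original token order
    return tokens[keep_idx], keep_idx
```

At `keep_ratio=0.5` this matches the paper's 50% pruning rate; the real module learns the scores end-to-end rather than receiving them as inputs.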

Key Results

  • STTS achieved a 62% efficiency improvement across 13 short and long video QA tasks, with only a 0.7% drop in average performance. Test-time scaling for long-video QA further improved performance by 0.5-1%.
  • The 0.7% degradation at a 50% token pruning rate held consistently across diverse tasks, indicating that STTS is stable and robust.
  • With the auxiliary loss of neighboring-frame cosine similarity, STTS effectively identifies and prunes redundant temporal frames, achieving significant computational acceleration in long video understanding.
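The neighboring-frame cosine-similarity signal can be sketched as below. The function names and the exact mapping from similarity to a "keep" target are assumptions for illustration; the paper uses this similarity as an auxiliary training loss rather than a hand-coded rule.

```python
import numpy as np

def temporal_keep_targets(frame_feats):
    """frame_feats: (T, D) array, one pooled feature vector per frame.

    Returns (T,) targets in [0, 1]: a frame nearly identical to its
    predecessor (cosine similarity ~1) gets a target near 0 (redundant),
    a dissimilar frame gets a target near 1 (keep).
    """
    normed = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    sim = np.sum(normed[1:] * normed[:-1], axis=1)   # cosine sim with previous frame
    # Map sim in [-1, 1] to keep-target in [1, 0]; first frame is always kept.
    return np.concatenate([[1.0], 1.0 - (sim + 1.0) / 2.0])
```

A static shot thus yields near-zero targets for all but its first frame, which is exactly the cross-frame redundancy the auxiliary loss teaches the scorer to exploit.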

Significance

The introduction of STTS provides a novel solution for enhancing computational efficiency in video vision-language models. By significantly reducing computational overhead without notably affecting model performance, STTS offers a more efficient tool for academia and industry when handling long video tasks. It addresses the shortcomings of existing methods in dealing with cross-frame temporal redundancy, making video processing tasks more scalable.

Technical Contribution

STTS's technical contribution lies in its ability to prune tokens directly in the ViT and LLM without complex text-conditioned selection or merging algorithms. This approach not only simplifies the architecture but also achieves more efficient token scoring through auxiliary loss and downstream gradient learning. Additionally, STTS's efficient packing algorithm further optimizes computational resource utilization.

Novelty

STTS is the first to achieve unified vision token pruning across the entire architecture without relying on complex text conditioning or merging algorithms. Compared to existing methods, STTS significantly enhances computational efficiency in video processing tasks through its simple module design and innovative scoring mechanism.

Limitations

  • STTS may still face computational bottlenecks when processing extremely long videos, especially when dealing with a large number of frames.
  • While STTS performs well in most tasks, it may require further parameter optimization in certain specific tasks to achieve optimal performance.
  • STTS's performance may significantly degrade at extreme pruning rates, necessitating trade-offs in practical applications.

Future Work

Future research directions could include optimizing STTS's performance in extremely long videos and exploring its application in other multimodal tasks. Additionally, further investigation into combining other pruning techniques to achieve more efficient computational resource utilization is a worthwhile direction.

AI Executive Summary

In recent years, vision-language models (VLMs) have made significant strides in video understanding, but this progress comes with substantial computational costs. Processing video requires encoding a large number of frames, each decomposed into hundreds of patch tokens by a Vision Transformer (ViT). As the number of frames increases, the resulting token sequences become quadratically expensive under attention, leading to significant memory usage, reduced training throughput, and increased inference latency.
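The quadratic cost argument can be made concrete with back-of-envelope arithmetic. The frame and patch counts below are illustrative assumptions, not numbers from the paper:

```python
# Attention cost scales with the square of sequence length, so halving the
# vision tokens cuts the attention term to one quarter.
frames, patches_per_frame = 64, 256
tokens = frames * patches_per_frame        # 16384 vision tokens
pruned = tokens // 2                       # 50% pruning, as in STTS
ratio = (pruned ** 2) / (tokens ** 2)
print(tokens, pruned, ratio)               # 16384 8192 0.25
```

This is also why the paper's efficiency gains grow with the number of sampled frames: the quadratic term dominates more as sequences lengthen.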

Existing pruning methods only address part of the problem. Pre-ViT and in-ViT approaches reduce token redundancy before or during ViT encoding, employing strategies such as early exiting, token matching and mixing, and attention-based scoring. While effective for spatial redundancy in unimodal perception tasks, these methods are not explicitly designed for multimodal VLM objectives and do not account for cross-frame temporal redundancy in video inputs.

This paper introduces a lightweight module called Spatio-Temporal Token Scoring (STTS), designed to prune vision tokens through a unified spatio-temporal scoring mechanism. STTS prunes tokens in both the Vision Transformer (ViT) and the Large Language Model (LLM) without requiring text conditioning or token merging. By learning to score temporally via an auxiliary loss and spatially via LLM downstream gradients, STTS can prune 50% of vision tokens across the entire architecture.

In experiments, STTS achieved a 62% efficiency improvement across 13 short and long video QA tasks, with only a 0.7% drop in average performance. Test-time scaling for long-video QA further improved performance by 0.5-1%. The introduction of STTS provides a novel solution for enhancing computational efficiency in video vision-language models.

STTS's technical contribution lies in its ability to prune tokens directly in the ViT and LLM without complex text-conditioned selection or merging algorithms. This approach not only simplifies the architecture but also achieves more efficient token scoring through auxiliary loss and downstream gradient learning. Additionally, STTS's efficient packing algorithm further optimizes computational resource utilization. Future research directions could include optimizing STTS's performance in extremely long videos and exploring its application in other multimodal tasks.

Deep Analysis

Background

Vision-language models (VLMs) have made significant strides in video understanding, but at substantial computational cost: each frame is decomposed into hundreds of patch tokens by a Vision Transformer (ViT), and attention over the resulting sequences grows quadratically with frame count, inflating memory usage, training throughput, and inference latency. Existing pruning methods address only part of the problem. Pre-ViT and in-ViT approaches (early exiting, token matching and mixing, attention-based scoring) reduce spatial redundancy in unimodal perception tasks, but they are not designed for multimodal VLM objectives and do not account for cross-frame temporal redundancy in video inputs.

Core Problem

Video vision-language models face computational efficiency challenges on long video tasks, and existing methods fall short in addressing cross-frame temporal redundancy, leading to high computational costs. The core problem is how to reduce computational overhead without significantly degrading model performance.

Innovation

STTS is a lightweight module that prunes vision tokens through a unified spatio-temporal scoring mechanism, operating in both the ViT and the LLM without text conditioning or token merging. Temporal scores are learned via an auxiliary loss and spatial scores via the LLM's downstream gradients, allowing STTS to prune 50% of vision tokens across the entire architecture.

Methodology

  • STTS Module Design: prunes tokens in both the ViT and the LLM without text conditioning or token merging.
  • Temporal Scoring: learned via an auxiliary loss on neighboring-frame cosine similarity.
  • Spatial Scoring: learned via the LLM's downstream gradients.
  • Efficient Packing Algorithm: repacks the pruned, variable-length token sequences to keep computational resource utilization high.
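The packing step can be sketched with a generic first-fit scheme: after pruning, each video yields a different number of surviving tokens, and grouping samples under a shared token budget avoids padding waste. This is a standard bin-packing heuristic offered as an assumption, not the paper's specific algorithm.

```python
def pack_sequences(lengths, budget):
    """lengths: token count per sample after pruning.
    budget: maximum tokens per packed batch.
    Returns a list of bins, each a list of sample indices (first-fit)."""
    bins, loads = [], []
    for i, n in enumerate(lengths):
        for b, load in enumerate(loads):
            if load + n <= budget:       # sample fits into an existing bin
                bins[b].append(i)
                loads[b] += n
                break
        else:                            # no bin had room: open a new one
            bins.append([i])
            loads.append(n)
    return bins
```

For example, `pack_sequences([500, 300, 400, 200], 700)` packs samples 0 and 3 together and samples 1 and 2 together, filling both bins to exactly the 700-token budget.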

Experiments

Experiments were conducted across 13 short and long video QA tasks to evaluate the efficiency improvement and performance impact of STTS. By comparing model performance at different pruning rates, the stability and robustness of STTS were verified. Results showed that STTS demonstrated minimal performance degradation of only 0.7% at a 50% token pruning rate.

Results

Across 13 short and long video QA tasks, STTS achieved a 62% efficiency improvement with only a 0.7% drop in average performance, and test-time scaling added a further 0.5-1% on long-video QA. Guided by the neighboring-frame cosine-similarity auxiliary loss, STTS identifies and prunes redundant temporal frames, yielding significant computational acceleration on long-video understanding.

Applications

STTS can be directly applied to scenarios requiring efficient video processing, such as video surveillance and video QA systems. By reducing computational overhead, STTS can significantly improve processing efficiency without affecting performance.

Limitations & Outlook

STTS may still face computational bottlenecks when processing extremely long videos, especially when dealing with a large number of frames. While STTS performs well in most tasks, it may require further parameter optimization in certain specific tasks to achieve optimal performance. STTS's performance may significantly degrade at extreme pruning rates, necessitating trade-offs in practical applications.

Plain Language (accessible to non-experts)

Imagine a busy kitchen where the chefs need to quickly prepare a large meal. They have piles of ingredients (like frames in a video), but not all of them are necessary. To work efficiently, the chefs must decide which ingredients are essential and which can be skipped. STTS acts like a smart assistant, helping the chefs quickly identify the less important ingredients (redundant frames), saving time and effort without compromising the quality of the dishes. STTS plays the same role in video processing: through intelligent token pruning, it helps the model handle video data more efficiently without hurting performance.

ELI14 (explained like you're 14)

Hey there, buddy! Imagine you're playing a super cool game where your task is to tidy up a super messy room. The room is full of stuff, and you need to quickly decide which things are important and which can be set aside for now. STTS is like your super helper, helping you quickly spot the less important things, so you can finish the task faster! It's like in video processing, where STTS helps the model quickly find the important information without having to deal with all the details. This way, the model can finish the task faster and more efficiently, just like you tidying up the room in the game!

Glossary

Vision-Language Model

A model that combines visual and language information to understand and generate multimodal data.

In this paper, VLMs are used for video understanding tasks.

Vision Transformer

A neural network architecture based on self-attention mechanisms, used for processing visual data.

The paper uses ViT to decompose video frames into patch tokens.

Large Language Model

A deep learning model capable of processing and generating natural language, typically with a large number of parameters.

In this paper, the LLM processes the output of the ViT.

Token Pruning

A method to reduce the computational burden of a model by selectively discarding unimportant tokens.

STTS improves model efficiency through token pruning.

Spatio-Temporal Token Scoring

A scoring mechanism used to evaluate and prune vision tokens in both spatial and temporal dimensions.

The paper introduces STTS to enhance video processing efficiency.

Auxiliary Loss

An additional loss function used in training to help the model learn specific tasks.

In STTS, auxiliary loss is used for learning temporal scoring.

Downstream Gradients

Gradient information propagated back from the final task loss to adjust model parameters.

STTS uses downstream gradients to learn spatial scoring.

Efficient Packing Algorithm

A method to optimize computational resource utilization by reorganizing data to reduce computational burden.

STTS uses an efficient packing algorithm to optimize resource utilization.

Neighboring-Frame Cosine Similarity

A method to measure similarity between adjacent frames, helping identify redundant information.

STTS uses neighboring-frame cosine similarity as an auxiliary loss.

Test-Time Scaling

A method to adjust the scale of model inputs during inference to improve performance.

In long-video QA, STTS improves performance through test-time scaling.

Open Questions (unanswered questions from this research)

  1. How can STTS's performance on extremely long videos be further optimized? Current methods may still face computational bottlenecks with very large frame counts, requiring more efficient pruning strategies.
  2. What is the potential for applying STTS to other multimodal tasks? Further research is needed to explore its applicability and performance across different tasks.
  3. How can STTS be combined with other pruning techniques for more efficient computational resource utilization? The current method is effective, but there is still room for improvement.
  4. How can model performance be maintained at extreme pruning rates? More intelligent pruning strategies are needed to balance efficiency and performance.
  5. How can computational overhead be reduced further without affecting model performance? More advanced pruning and optimization techniques remain to be explored.

Applications

Immediate Applications

Video Surveillance

STTS can be used to improve the processing efficiency of video surveillance systems by reducing computational overhead for faster real-time monitoring.

Video QA Systems

In video QA systems, STTS can help models process and understand video content faster, improving response speed.

Video Editing

STTS can be used in video editing software to accelerate video processing and rendering through intelligent pruning.

Long-term Vision

Intelligent Traffic Systems

STTS can be applied in intelligent traffic systems to achieve smarter traffic management and monitoring through efficient video processing.

Virtual Reality

In virtual reality applications, STTS can help improve video rendering efficiency, providing a smoother user experience.

Abstract

Token pruning is essential for enhancing the computational efficiency of vision-language models (VLMs), particularly for video-based tasks where temporal redundancy is prevalent. Prior approaches typically prune tokens either (1) within the vision transformer (ViT) exclusively for unimodal perception tasks such as action recognition and object segmentation, without adapting to downstream vision-language tasks; or (2) only within the LLM while leaving the ViT output intact, often requiring complex text-conditioned token selection mechanisms. In this paper, we introduce Spatio-Temporal Token Scoring (STTS), a simple and lightweight module that prunes vision tokens across both the ViT and the LLM without text conditioning or token merging, and is fully compatible with end-to-end training. By learning how to score temporally via an auxiliary loss and spatially via LLM downstream gradients, aided by our efficient packing algorithm, STTS prunes 50% of vision tokens throughout the entire architecture, resulting in a 62% improvement in efficiency during both training and inference with only a 0.7% drop in average performance across 13 short and long video QA tasks. Efficiency gains increase with more sampled frames per video. Applying test-time scaling for long-video QA further yields performance gains of 0.5-1% compared to the baseline. Overall, STTS represents a novel, simple yet effective technique for unified, architecture-wide vision token pruning.

cs.CV cs.AI cs.LG
