EV-CLIP: Efficient Visual Prompt Adaptation for CLIP in Few-shot Action Recognition under Visual Challenges

TL;DR

EV-CLIP efficiently adapts CLIP for few-shot action recognition under visual challenges using visual prompts.

cs.CV · 2026-04-24
Hyo Jin Jon, Longbin Jin, Eun Yi Kim
action recognition · visual prompts · few-shot learning · CLIP · visual challenges

Key Findings

Methodology

EV-CLIP introduces two visual prompts: mask prompts and context prompts. Mask prompts guide the model's attention to action-relevant regions by reweighting pixels, while context prompts perform lightweight temporal modeling by compressing frame-wise features into a compact representation. These prompts adapt the frozen CLIP visual encoder without altering its internal architecture.
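
This maps naturally onto two small trainable modules wrapped around a frozen encoder. Below is a minimal PyTorch sketch of that idea; the module names, shapes, and the stand-in encoder are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the two visual prompts described above (illustrative only).
# Assumptions: the mask prompt predicts a per-pixel weight map broadcast over all
# channels; the context prompt pools frame-wise features and projects them into a
# single compact token. Shapes and the stand-in encoder are hypothetical.
import torch
import torch.nn as nn


class MaskPrompt(nn.Module):
    """Reweights pixels so attention concentrates on action-relevant regions."""
    def __init__(self, in_channels: int = 3):
        super().__init__()
        self.head = nn.Conv2d(in_channels, 1, kernel_size=3, padding=1)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        mask = torch.sigmoid(self.head(frames))   # (B*T, 1, H, W) weight map
        return frames * mask                      # applied to every channel


class ContextPrompt(nn.Module):
    """Compresses frame-wise features into one compact temporal summary token."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        pooled = frame_feats.mean(dim=1, keepdim=True)  # (B, 1, D) temporal pooling
        return self.proj(pooled)


B, T, C, H, W, D = 2, 8, 3, 224, 224, 512
# Stand-in for the frozen CLIP visual encoder; its weights receive no gradients.
encoder = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(C, D))
for p in encoder.parameters():
    p.requires_grad = False

mask_prompt, context_prompt = MaskPrompt(C), ContextPrompt(D)
frames = torch.randn(B * T, C, H, W)
frame_feats = encoder(mask_prompt(frames)).view(B, T, D)                    # per-frame features
video_repr = torch.cat([context_prompt(frame_feats), frame_feats], dim=1)   # (B, 1+T, D)
```

Only the two prompt modules would be trained; the encoder itself stays frozen, consistent with the adaptation strategy described above.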

Key Results

  • On UCF101 under the eight-shot setting, EV-CLIP achieves a favorable balance of accuracy, throughput, and FLOPs, significantly outperforming existing parameter-efficient methods.
  • Experiments across five benchmark datasets demonstrate that EV-CLIP consistently achieves the highest overall performance in few-shot adaptation settings, especially under low-light environments and egocentric viewpoints.
  • By maintaining strong accuracy even with lightweight backbones such as ResNet50, EV-CLIP significantly reduces computational overhead without compromising recognition performance.

Significance

EV-CLIP holds significant implications for both academia and industry. It addresses long-standing pain points in action recognition under visual challenges, particularly in low-light environments and egocentric viewpoints. By introducing visual prompts, EV-CLIP enhances model adaptability and efficiency without increasing computational costs, making it suitable for deployment in resource-constrained real-world environments.

Technical Contribution

EV-CLIP's technical contributions lie in its modular visual prompt adaptation framework, which enhances visual adaptability while preserving efficiency across diverse backbone encoders. Unlike existing methods, EV-CLIP improves spatial perception and temporal modeling capabilities through lightweight mask and context prompts without altering CLIP's internal architecture.

Novelty

EV-CLIP is the first to efficiently adapt CLIP for few-shot action recognition under visual challenges using visual prompts. Compared to existing CLIP adaptation methods, EV-CLIP improves adaptability and efficiency through its modular prompt design without relying on specific backbones.

Limitations

  • EV-CLIP may perform suboptimally in extreme low-light or complex background scenarios, as mask prompts may not fully eliminate background noise.
  • Its efficiency advantage may not be as pronounced in scenarios requiring extensive computational resources.
  • EV-CLIP may require further fine-tuning on certain domain-specific datasets to achieve optimal performance.

Future Work

Future research directions include further optimizing the design of visual prompts to enhance adaptability in more complex scenarios. Additionally, exploring the application of EV-CLIP to other visual tasks such as object detection and semantic segmentation could validate its generality and extensibility.

AI Executive Summary

In today's computer vision landscape, action recognition is a crucial step toward understanding human behavior, with wide-ranging applications. However, existing methods often fall short when faced with visual challenges such as low-light environments or egocentric viewpoints. In particular, most methods that adapt CLIP for action recognition focus primarily on temporal modeling and overlook spatial perception, a major shortcoming in practical applications.

EV-CLIP addresses this issue by introducing two types of visual prompts: mask prompts and context prompts. Mask prompts guide the model's attention to action-relevant regions by reweighting pixels, while context prompts perform lightweight temporal modeling by compressing frame-wise features into a compact representation. These prompts adapt the frozen CLIP visual encoder without altering its internal architecture.

In experiments, EV-CLIP was comprehensively evaluated across five benchmark datasets, including UCF101, HMDB51, SSv2, ARID, and EK100Verb. Results show that EV-CLIP consistently achieves the highest overall performance in few-shot adaptation settings, particularly excelling under low-light environments and egocentric viewpoints.

The significance of EV-CLIP lies in its ability to enhance model adaptability and efficiency without increasing computational costs, providing a viable solution for deployment in resource-constrained real-world environments. This research brings a new perspective to the field of action recognition, addressing long-standing pain points.

However, EV-CLIP also has its limitations. In some extreme low-light or complex background scenarios, mask prompts may not fully eliminate background noise. Additionally, its efficiency advantage may not be as pronounced in scenarios requiring extensive computational resources. Future research directions include further optimizing the design of visual prompts to enhance adaptability in more complex scenarios.

Deep Analysis

Background

Action recognition is a significant research direction in computer vision, aiming to understand human actions by analyzing video sequences. In recent years, deep neural networks, particularly convolutional neural networks (CNNs) and transformers, have made remarkable progress in video action recognition. In practical applications, however, video data often presents visual challenges, such as variations in viewpoint and illumination, that degrade model performance. Moreover, existing methods rely primarily on temporal modeling and neglect spatial perception, a major shortcoming in practical applications. To address these issues, researchers have begun exploring how to enhance models' spatial perception through visual prompts.

Core Problem

In real-world video action recognition, models often encounter significant domain shifts caused by variations in illumination, viewpoint, background, and camera perspective. These domain shifts can significantly degrade performance when models are deployed outside controlled training environments. While training large-scale video models to handle such diverse visual conditions is theoretically feasible, it remains impractical in real-world deployments due to the substantial data, annotation, and computational resources required. Improving model adaptability and efficiency without increasing computational cost has therefore become an important research problem.

Innovation

The core innovations of EV-CLIP lie in its modular visual prompt adaptation framework, which enhances visual adaptability while preserving efficiency across diverse backbone encoders. Specifically:


  • Mask Prompts: Guide the model's attention to action-relevant regions by reweighting pixels, reducing background noise interference.

  • Context Prompts: Perform lightweight temporal modeling by compressing frame-wise features into a compact representation, enhancing the model's temporal perception capabilities.

  • Modular Design: Adapt the frozen CLIP visual encoder without altering its internal architecture, improving model adaptability and efficiency.

Methodology

The methodology of EV-CLIP is detailed as follows:


  • Mask Prompt Generation: Latent features are extracted using a pretrained video model and processed through the decoder architecture of Swin-Unet to generate mask prompts that emphasize action-relevant regions.

  • Context Prompt Generation: Video knowledge is compressed into a prompt through pooling and linear projection, providing a global temporal flow.

  • Prompt Integration: Mask prompts are applied to every channel of the video frames, and context prompts are integrated with frame features to enhance video-level understanding.

  • Consistency Loss: A consistency loss ensures coherent representations across frames, reducing unnecessary variations (a hedged sketch of one possible formulation follows this list).
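
The consistency loss is described only at a high level, so the sketch below shows one plausible formulation, assuming each frame feature is pulled toward its clip-level mean; the paper's exact loss may differ.

```python
# Hedged sketch of a frame-consistency objective: each frame feature is encouraged to
# align with the clip-level summary, discouraging unnecessary frame-to-frame variation.
# This is one plausible formulation, not necessarily the loss used in the paper.
import torch
import torch.nn.functional as F


def consistency_loss(frame_feats: torch.Tensor) -> torch.Tensor:
    """frame_feats: (B, T, D) per-frame features from the prompted encoder."""
    feats = F.normalize(frame_feats, dim=-1)                       # compare directions only
    anchor = F.normalize(feats.mean(dim=1, keepdim=True), dim=-1)  # (B, 1, D) clip summary
    cosine = (feats * anchor).sum(dim=-1)                          # (B, T) agreement per frame
    return (1.0 - cosine).mean()                                   # zero when frames agree


frame_feats = torch.randn(4, 8, 512, requires_grad=True)
loss = consistency_loss(frame_feats)   # would be added to the classification objective
loss.backward()
```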

Experiments

The experimental design evaluates EV-CLIP across five benchmark datasets: UCF101, HMDB51, SSv2, ARID, and EK100Verb. These datasets cover a range of visual variations, ensuring a robust analysis of model performance under different real-world conditions. The experiments use ViT-B/16 as the CLIP visual encoder, paired with Omnivore-small as the video model. Each clip is sampled as 8 frames, with randomly selected starting frames for training and center clips for testing. Frames are resized to 224×224, with random cropping for training and center cropping for testing.
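
As a concrete illustration of this clip sampling protocol, the sketch below draws 8-frame clips with a random start for training and a centered clip for testing; the temporal stride is an assumed value not stated above.

```python
# Hedged sketch of the 8-frame clip sampling described above. The stride is an assumption;
# the paper may use a different sampling strategy (e.g., uniform intervals across the video).
import random
from typing import List


def sample_clip(num_frames: int, clip_len: int = 8, stride: int = 4, train: bool = True) -> List[int]:
    span = (clip_len - 1) * stride + 1
    if num_frames < span:                                # very short video: wrap around
        return [(i * stride) % num_frames for i in range(clip_len)]
    if train:
        start = random.randint(0, num_frames - span)     # randomly selected starting frame
    else:
        start = (num_frames - span) // 2                 # center clip for testing
    return [start + i * stride for i in range(clip_len)]


print(sample_clip(300, train=True))    # e.g. [57, 61, 65, 69, 73, 77, 81, 85]
print(sample_clip(300, train=False))   # centered: [135, 139, 143, 147, 151, 155, 159, 163]
```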

Results

Experimental results show that EV-CLIP consistently achieves the highest overall performance in few-shot adaptation settings, particularly excelling under low-light environments and egocentric viewpoints. On UCF101 under the eight-shot setting, EV-CLIP achieves a favorable balance of accuracy, throughput, and FLOPs, significantly outperforming existing parameter-efficient methods. Additionally, EV-CLIP maintains strong accuracy even with lightweight backbones such as ResNet50, significantly reducing computational overhead without compromising recognition performance.

Applications

Application scenarios for EV-CLIP include:


  • Surveillance Systems: Improve action recognition accuracy in low-light environments, enhancing the effectiveness of security monitoring.

  • Wearable Devices: Achieve efficient action recognition under egocentric viewpoints, enhancing user experience for devices like smart glasses.

  • Robotics: Improve action recognition capabilities in complex environments, enhancing autonomy and intelligence in robots.

Limitations & Outlook

The limitations of EV-CLIP include:


  • In some extreme low-light or complex background scenarios, mask prompts may not fully eliminate background noise.

  • Its efficiency advantage may not be as pronounced in scenarios requiring extensive computational resources.

  • EV-CLIP may require further fine-tuning on certain domain-specific datasets to achieve optimal performance.

Looking ahead, future work includes further optimizing the design of visual prompts to enhance adaptability in more complex scenarios.

Plain Language (Accessible to non-experts)

Imagine you're in a kitchen cooking a meal. The kitchen is full of different things like pots, pans, spice bottles, and various ingredients. To make a delicious dish, you need to focus on what's really important, like the freshness of the ingredients and the right mix of spices, rather than getting distracted by the clutter in the kitchen.

EV-CLIP is like a smart kitchen assistant that helps you focus on what's truly important. In video recognition, EV-CLIP uses 'mask prompts' to highlight areas related to actions, just like telling you which ingredients are fresh and which spices are necessary.

At the same time, EV-CLIP uses 'context prompts' to help you understand the sequence of the cooking process, like reminding you what spice to add at each step and when to stir.

This way, even in a poorly lit kitchen or an unfamiliar cooking environment, EV-CLIP helps you make a delicious dish. It's like having an assistant that can help you cook well under any conditions.

ELI14 (Explained like you're 14)

Hey there! Imagine you're playing a super cool game where you have to find hidden treasures in a dark forest. You need a super helper to guide you on the right path, right?

EV-CLIP is like that helper! It helps you find important actions in videos, like showing you where the treasure is. It uses something called 'mask prompts' to let you see important details, so you're not scared by the darkness around you.

Not only that, EV-CLIP also uses 'context prompts' to help you remember each step of the way, like planning your entire adventure route.

So even in the dark forest, you can easily find the treasure! Isn't that cool? That's the magic of EV-CLIP, helping you complete tasks under any conditions!

Glossary

CLIP (Contrastive Language-Image Pretraining)

CLIP is a visual-language model that embeds images and text into a shared semantic space, enabling recognition of novel categories without access to original training data.

In this paper, CLIP is adapted for video action recognition using visual prompts.
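
To make the shared-embedding idea concrete, the sketch below scores image embeddings against text embeddings of class-name prompts by cosine similarity; the random tensors are stand-ins for the outputs of CLIP's actual encoders.

```python
# Hedged sketch of CLIP-style zero-shot recognition: image and text embeddings live in a
# shared space, and a class is chosen by cosine similarity to class-name prompts.
# Random tensors stand in for CLIP's pretrained visual and text encoder outputs.
import torch
import torch.nn.functional as F

image_embs = F.normalize(torch.randn(4, 512), dim=-1)     # 4 frames / images
class_embs = F.normalize(torch.randn(101, 512), dim=-1)   # e.g. prompts for 101 action classes
logits = 100.0 * image_embs @ class_embs.T                 # scaled cosine similarities
predictions = logits.argmax(dim=-1)                        # zero-shot class indices
```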

Visual Prompt

A visual prompt is a lightweight trainable component used to enhance a model's spatial perception and temporal modeling capabilities.

EV-CLIP introduces mask and context prompts to adapt CLIP.

Mask Prompt

Mask prompts guide the model's attention to action-relevant regions by reweighting pixels, reducing background noise interference.

In EV-CLIP, mask prompts are used to enhance spatial perception.

Context Prompt

Context prompts perform lightweight temporal modeling by compressing frame-wise features into a compact representation, enhancing temporal perception.

In EV-CLIP, context prompts are used to enhance temporal modeling.

Few-shot Learning

Few-shot learning is a method of adapting models using a small number of labeled samples, suitable for data-scarce scenarios.

EV-CLIP performs action recognition under few-shot settings.

Domain Shift

Domain shift refers to the distribution differences between training and testing data, which can degrade model performance.

Domain shift is a significant challenge in video action recognition.

Temporal Modeling

Temporal modeling refers to the process of capturing and analyzing temporal sequence information in video processing.

EV-CLIP performs lightweight temporal modeling through context prompts.

Spatial Perception

Spatial perception refers to a model's ability to identify and attend to the relevant regions within each frame, as opposed to reasoning over time.

EV-CLIP enhances spatial perception through mask prompts.

Parameter-efficient Method

Parameter-efficient methods introduce lightweight components to adapt models without updating most pretrained parameters.

EV-CLIP is a parameter-efficient adaptation method for CLIP.
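
In practice this means freezing the pretrained weights and giving the optimizer only the lightweight components, as in the sketch below; the module names are placeholders rather than EV-CLIP's actual classes.

```python
# Hedged sketch of parameter-efficient adaptation: the pretrained backbone is frozen and
# only lightweight prompt modules receive gradients. Module names are placeholders.
import torch
import torch.nn as nn


def build_optimizer(backbone: nn.Module, prompts: nn.Module, lr: float = 1e-3):
    for p in backbone.parameters():
        p.requires_grad = False                          # pretrained weights stay fixed
    trainable = [p for p in prompts.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr)           # optimizer sees only prompt params


backbone = nn.Linear(512, 512)    # stand-in for the CLIP visual encoder
prompts = nn.Linear(512, 512)     # stand-in for mask/context prompt modules
optimizer = build_optimizer(backbone, prompts)
print(sum(p.numel() for p in prompts.parameters()), "trainable parameters")
```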

Omnivore

Omnivore is a unified model capable of processing diverse input modalities such as images, videos, and RGB-D data.

In EV-CLIP, Omnivore is used as a pretrained video model.

Open Questions (Unanswered questions from this research)

  1. How can EV-CLIP's performance be further improved in extreme low-light or complex background scenarios? Existing mask prompts may not fully eliminate background noise, so more refined prompt designs are needed.
  2. How can EV-CLIP's efficiency advantage be preserved in settings with abundant computational resources, where its gains may be less pronounced?
  3. How can EV-CLIP's performance be further optimized on domain-specific datasets? Existing methods may require additional fine-tuning for specific domains.
  4. How can EV-CLIP be applied to other visual tasks such as object detection and semantic segmentation? Its generality and extensibility still need to be validated.
  5. How can the design of visual prompts be further optimized to improve adaptability in more complex scenarios? More efficient prompt generation mechanisms need to be explored.

Applications

Immediate Applications

Surveillance Systems

Improve action recognition accuracy in low-light environments, enhancing the effectiveness of security monitoring. Suitable for real-time surveillance scenarios like nighttime security.

Wearable Devices

Achieve efficient action recognition under egocentric viewpoints, enhancing user experience for devices like smart glasses. Suitable for augmented reality applications.

Robotics

Improve action recognition capabilities in complex environments, enhancing autonomy and intelligence in robots. Suitable for industrial automation and service robots.

Long-term Vision

Smart Cities

By enhancing the intelligence of surveillance systems, achieve more efficient city management and security assurance. Challenges include large-scale data processing.

Human-Computer Interaction

By enhancing action recognition capabilities in wearable devices, achieve more natural human-computer interaction experiences. Challenges include device computation and energy consumption.

Abstract

CLIP has demonstrated strong generalization in visual domains through natural language supervision, even for video action recognition. However, most existing approaches that adapt CLIP for action recognition have primarily focused on temporal modeling, often overlooking spatial perception. In real-world scenarios, visual challenges such as low-light environments or egocentric viewpoints can severely impair spatial understanding, an essential precursor for effective temporal reasoning. To address this limitation, we propose Efficient Visual Prompting for CLIP (EV-CLIP), an efficient adaptation framework designed for few-shot video action recognition across diverse scenes and viewpoints. EV-CLIP introduces two visual prompts: mask prompts, which guide the model's attention to action-relevant regions by reweighting pixels, and context prompts, which perform lightweight temporal modeling by compressing frame-wise features into a compact representation. For a comprehensive evaluation, we curate five benchmark datasets and analyze domain shifts to quantify the influence of diverse visual and semantic factors on action recognition. Experimental results demonstrate that EV-CLIP outperforms existing parameter-efficient methods in overall performance. Moreover, its efficiency remains independent of the backbone scale, making it well-suited for deployment in real-world, resource-constrained scenarios. The code is available at https://github.com/AI-CV-Lab/EV-CLIP.


References (20)

Learning to Prompt for Vision-Language Models. Kaiyang Zhou, Jingkang Yang, Chen Change Loy et al., 2021.

ActionCLIP: A New Paradigm for Video Action Recognition. Mengmeng Wang, Jiazheng Xing, Yong Liu, 2021.

Dual-Path Adaptation from Image to Video Transformers. Jungin Park, Jiyoung Lee, K. Sohn, 2023.

The “Something Something” Video Database for Learning and Evaluating Visual Common Sense. Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski et al., 2017.

ViViT: A Video Vision Transformer. Anurag Arnab, Mostafa Dehghani, G. Heigold et al., 2021.

Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. João Carreira, Andrew Zisserman, 2017.

Multi-Modal Domain Adaptation for Fine-Grained Action Recognition. Jonathan Munro, D. Damen, 2020.

Visual Prompt Tuning. Menglin Jia, Luming Tang, Bor-Chun Chen et al., 2022.

Learning Cross-Modal Contrastive Features for Video Domain Adaptation. Donghyun Kim, Yi-Hsuan Tsai, Bingbing Zhuang et al., 2021.

Kronecker Mask and Interpretive Prompts are Language-Action Video Learners. Jingyi Yang, Zitong Yu, Xiuming Ni et al., 2025.

SATO: Stable Text-to-Motion Framework. Wenshuo Chen, Hongru Xiao, Erhang Zhang et al., 2024.

Video Swin Transformer. Ze Liu, Jia Ning, Yue Cao et al., 2021.

Video Transformer Network. Daniel Neimark, Omri Bar, Maya Zohar et al., 2021.

Adversarial Cross-Domain Action Recognition with Co-Attention. Boxiao Pan, Zhangjie Cao, Ehsan Adeli et al., 2019.

X3D: Expanding Architectures for Efficient Video Recognition. Christoph Feichtenhofer, 2020.

Anomize: Better Open Vocabulary Video Anomaly Detection. Fei Li, Wenxuan Liu, Jingjing Chen et al., 2025.

Vita-CLIP: Video and Text Adaptive CLIP via Multimodal Prompting. Syed Talal Wasim, Muzammal Naseer, Salman H. Khan et al., 2023.

HMDB: A Large Video Database for Human Motion Recognition. Hilde Kuehne, Hueihan Jhuang, Estíbaliz Garrote et al., 2011.

A Closer Look at Spatiotemporal Convolutions for Action Recognition. Du Tran, Heng Wang, L. Torresani et al., 2017.

Temporal Attentive Alignment for Large-Scale Video Domain Adaptation. Min-Hung Chen, Z. Kira, G. Al-Regib et al., 2019.