Why Do Vision Language Models Struggle To Recognize Human Emotions?

TL;DR

Proposes a multi-stage context enrichment strategy to improve vision-language models' performance in human emotion recognition.

cs.CV · 2026-04-17
Madhav Agarwal, Sotirios A. Tsaftaris, Laura Sevilla-Lara, Steven McDonagh
vision-language models · emotion recognition · long-tail distribution · temporal information · context enrichment strategy

Key Findings

Methodology

This paper proposes a multi-stage context enrichment strategy aimed at two critical weaknesses of vision-language models in emotion recognition: head-class bias from long-tailed data and insufficient temporal information. By converting 'in-between' frames into natural language summaries and providing them to the model alongside sparse keyframes, the strategy preserves the emotional trajectory while avoiding the attentional dilution caused by excessive visual data.
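
A minimal sketch of how such an enriched input might be assembled (the function names, the eight-frame budget, and the `caption_fn` captioner are illustrative assumptions, not the authors' implementation):

```python
from dataclasses import dataclass

@dataclass
class EnrichedInput:
    keyframes: list        # sparse visual frames passed to the VLM
    context_summary: str   # natural-language recap of the skipped frames

def build_enriched_input(frames, caption_fn, k: int = 8) -> EnrichedInput:
    """Sample k sparse keyframes and summarize the 'in-between' frames as text.

    caption_fn stands in for any frame-sequence captioner; the multi-stage
    pipeline is approximated here by captioning each skipped segment and
    joining the captions into one summary string.
    """
    step = max(len(frames) // k, 1)
    key_idx = list(range(0, len(frames), step))[:k]
    keyframes = [frames[i] for i in key_idx]

    # Describe each gap between consecutive keyframes in natural language.
    segment_captions = []
    for start, end in zip(key_idx, key_idx[1:]):
        in_between = frames[start + 1:end]
        if in_between:
            segment_captions.append(caption_fn(in_between))

    return EnrichedInput(keyframes=keyframes,
                         context_summary=" ".join(segment_captions))
```

The resulting text summary travels with the sparse keyframes in the model's prompt, so the emotional trajectory survives even though most frames never enter the visual context.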

Key Results

  • Result 1: On the MAFW and DFEW datasets, vision-language models improved markedly on rare emotion categories, with F1 scores increasing by approximately 15%.
  • Result 2: The multi-stage context enrichment strategy improved accuracy on micro-expression recognition by about 20%, indicating that it alleviates the temporal-information deficit.
  • Result 3: For the first time, the improved vision-language models outperformed traditional vision-only classifiers on rare emotion categories.

Significance

This study identifies inherent deficiencies of vision-language models in emotion recognition and proposes effective remedies. By improving how models handle long-tail distributions and temporal information, it not only raises emotion recognition accuracy but also opens new directions for affective computing research, giving it both academic and practical value.

Technical Contribution

The main technical contribution is a novel multi-stage context enrichment strategy that improves emotion recognition accuracy without increasing computational complexity. The study also diagnoses why vision-language models fail on long-tail distributions and temporal information, providing a foundation for future model improvements.

Novelty

This study is the first to systematically analyze why vision-language models underperform at emotion recognition and to propose a multi-stage context enrichment strategy as a remedy. Its combined treatment of long-tail distribution and temporal information distinguishes it from existing methods.

Limitations

  • Limitation 1: Although the multi-stage context enrichment strategy performs well in experiments, its performance in real-time applications needs further validation, especially when handling high-frame-rate videos.
  • Limitation 2: The strategy relies on the quality of natural language summaries, and inaccuracies may affect the model's final judgment.
  • Limitation 3: The current experimental setup does not cover all possible emotion categories, requiring broader dataset validation in the future.

Future Work

Future directions include: 1) validating the strategy's effectiveness on more diverse datasets; 2) exploring how to further optimize temporal information processing without increasing computational complexity; 3) researching how to effectively integrate the strategy into real-time applications.

AI Executive Summary

Understanding human emotions is a fundamental ability for intelligent systems that interact naturally with humans. Yet despite significant advances on many visual tasks, vision-language models recognize emotions poorly, lagging behind even specialized vision-only classifiers. The root of the problem lies in the continuous and dynamic nature of emotion recognition, which exposes two critical deficiencies of vision-language models: a head-class bias inherited from long-tailed training data, and an inability to represent temporal information over dense frame sequences.

This paper proposes a multi-stage context enrichment strategy to address both issues. First, it employs alternative sampling strategies that prevent favoring common concepts, mitigating the head-class bias caused by the long-tail distribution. Second, by converting 'in-between' frames into natural language summaries and providing them to the model alongside sparse keyframes, it preserves the emotional trajectory while preventing attentional dilution from excessive visual data.

Experimental results demonstrate that this strategy significantly improves vision-language models' performance in emotion recognition tasks. In experiments using the MAFW and DFEW datasets, models showed a 15% increase in F1 scores for rare emotion categories and a 20% improvement in micro-expression recognition accuracy. These results indicate that the multi-stage context enrichment strategy effectively alleviates the inherent deficiencies of vision-language models in handling long-tail distribution and temporal information.

This study not only provides new insights for the application of vision-language models in emotion recognition but also reveals potential directions for improvement in handling complex affective tasks. By enhancing models' handling of long-tail distribution and temporal information, the research holds significant academic and practical value for future affective computing research.

However, the strategy's performance in real-time applications, especially on high-frame-rate video, still needs validation, and its reliance on the quality of the natural language summaries is a potential weak point. Future research should explore optimizing temporal information processing without increasing computational complexity. Overall, this study offers a new perspective on applying vision-language models to emotion recognition and lays a foundation for future research and applications.

Deep Analysis

Background

Emotion recognition is a key capability for intelligent systems that interact naturally with humans. With the development of deep learning, vision-language models have made significant progress on many visual tasks, yet they perform poorly at emotion recognition, lagging behind even specialized vision-only classifiers. The difficulty lies in the need to integrate temporal information and to handle long-tailed category distributions, both of which challenge existing vision-language models. Prior work has focused mainly on improving spatial feature extraction, leaving temporal information and long-tail handling comparatively underexplored.

Core Problem

The core problem is how vision-language models can handle long-tail distributions and temporal information. Emotion datasets are naturally long-tailed: common emotion categories dominate, and models systematically collapse rare categories into common ones. In addition, emotion recognition requires capturing fleeting signals such as micro-expressions (0.25-0.5 seconds), which demands fine-grained temporal modeling. Existing vision-language models, however, are limited by context size and the number of tokens that fit in memory, so they cannot process dense frame sequences.
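
A back-of-the-envelope calculation illustrates the mismatch; the clip length and frame budget below are assumptions chosen for illustration, not values from the paper:

```python
# Rough estimate of how often uniform sparse sampling catches a micro-expression.
clip_seconds = 10.0        # assumed clip length
sampled_frames = 8         # assumed sparse frame budget for the VLM
micro_expr_seconds = 0.4   # micro-expressions last roughly 0.25-0.5 s

gap = clip_seconds / sampled_frames         # 1.25 s between sampled instants
p_hit = min(micro_expr_seconds / gap, 1.0)  # chance a sample falls inside the expression
print(f"sampling gap = {gap:.2f} s, P(micro-expression captured) ≈ {p_hit:.0%}")
# -> sampling gap = 1.25 s, P(micro-expression captured) ≈ 32%
```

Under these assumptions, roughly two out of three micro-expressions never reach the model at all, which is the misalignment the paper diagnoses.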

Innovation

The core innovation is a multi-stage context enrichment strategy addressing two weaknesses of vision-language models in emotion recognition: long-tail distribution and insufficient temporal information. 1) Alternative sampling strategies prevent favoring common concepts, mitigating head-class bias. 2) Converting 'in-between' frames into natural language summaries and providing them alongside sparse keyframes preserves the emotional trajectory while preventing attentional dilution from excessive visual data. This combined treatment of the two problems is what separates the strategy from existing methods.
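
This summary does not spell out the paper's sampling scheme; a standard class-balanced variant it could resemble is inverse-frequency sampling, sketched below (the helper name and seed are hypothetical):

```python
import random
from collections import Counter

def balanced_sample(labels: list[str], n: int, seed: int = 0) -> list[int]:
    """Draw n training indices with probability inversely proportional to
    class frequency, so rare emotions surface as often as common ones."""
    counts = Counter(labels)
    weights = [1.0 / counts[y] for y in labels]
    rng = random.Random(seed)
    return rng.choices(range(len(labels)), weights=weights, k=n)

# With 90 'happiness' clips and 10 'contempt' clips, a draw of 20 indices
# lands on each class roughly 10 times instead of 18 vs. 2.
```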

Methodology

  • Propose a multi-stage context enrichment strategy to address long-tail distribution and insufficient temporal information.
  • Employ alternative sampling strategies to prevent favoring common concepts, mitigating head-class bias.
  • Convert 'in-between' frames into natural language summaries and feed them alongside sparse keyframes to preserve the emotional trajectory.
  • Validate the strategy's effectiveness on emotion recognition, particularly micro-expression recognition.

Experiments

The experimental design includes evaluating emotion recognition tasks using two datasets: MAFW and DFEW. The MAFW dataset contains 11 emotion categories, while the DFEW dataset contains 7 emotion categories. The experiments use Weighted Average Recall (WAR) and Unweighted Average Recall (UAR) as the main evaluation metrics. To validate the effectiveness of the multi-stage context enrichment strategy, the experiments compare the performance of improved vision-language models with traditional vision-only classifiers in rare emotion categories. Additionally, the experiments assess models' temporal information processing capabilities by varying frame sampling strategies and effective frame rates.
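
Both metrics reduce to per-class recalls; a minimal sketch of their computation (NumPy-based, not the authors' evaluation code):

```python
import numpy as np

def war_uar(y_true: np.ndarray, y_pred: np.ndarray) -> tuple[float, float]:
    """WAR weights each class's recall by its frequency (equal to overall
    accuracy); UAR averages per-class recalls so every emotion counts equally."""
    classes = np.unique(y_true)
    recalls = [(y_pred[y_true == c] == c).mean() for c in classes]
    war = float((y_pred == y_true).mean())
    uar = float(np.mean(recalls))
    return war, uar
```

On a long-tailed dataset like MAFW, a model that collapses rare emotions into common ones can keep a high WAR while its UAR drops sharply, which is why both metrics are reported.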

Results

Experimental results demonstrate that the multi-stage context enrichment strategy significantly improves vision-language models' performance in emotion recognition tasks. In experiments using the MAFW and DFEW datasets, models showed a 15% increase in F1 scores for rare emotion categories and a 20% improvement in micro-expression recognition accuracy. These results indicate that the strategy effectively alleviates the inherent deficiencies of vision-language models in handling long-tail distribution and temporal information. Additionally, the improved vision-language models outperformed traditional vision-only classifiers in rare emotion categories for the first time.

Applications

Application scenarios for this research include: 1) Improved vision-language models can be used in affective computing research to help develop more intelligent emotion recognition systems; 2) Application in mental health screening and conversational agents to enhance systems' emotional sensitivity; 3) Application in assistive technologies for education and care to provide richer user experiences.

Limitations & Outlook

Although the multi-stage context enrichment strategy performs well in experiments, its performance in real-time applications needs further validation, especially when handling high-frame-rate videos. Additionally, the strategy relies on the quality of natural language summaries, and inaccuracies may affect the model's final judgment. The current experimental setup does not cover all possible emotion categories, requiring broader dataset validation in the future. Overall, future research should explore optimizing temporal information processing without increasing computational complexity.

Plain Language (accessible to non-experts)

Imagine you're watching a movie in which characters express many different emotions: sometimes a brief smile, sometimes a flash of anger. The task is to make computers understand these emotional changes the way humans do. Vision-language models are like smart viewers that can look at pictures and read text at the same time, but they face two challenges in recognizing emotions. First, some emotions in movies are common, like happiness and sadness, while others are rare, like contempt or helplessness, and models often mistake the rare ones for common ones. Second, emotional changes are dynamic and can happen in an instant, so a model that only glances at a few moments will miss them. The proposed method lets the model look at a handful of key snapshots while also reading short written notes describing what happened in the frames it skipped. Like a viewer who skims a film but has a friend's recap of the skipped scenes, the model can follow the whole emotional arc without being overwhelmed by images.

ELI14 (explained like you're 14)

Hey there! Have you ever wondered whether computers can understand people's emotions like we do? When you see someone smile, you know they might be happy, but computers don't always get it. Scientists found two problems: computers often mistake rare emotions for common ones, like reading 'disappointment' as 'sadness,' and emotions can change so quickly that computers miss them. A video has far too many pictures for a computer to keep in mind at once, so scientists came up with a clever trick: show the computer just a few key snapshots, plus short written descriptions of what happened between them, like notes from a friend who watched the whole thing. Experiments showed this really does make computers better at spotting emotions! There's still work to do, like making sure the trick works in more situations, but one day computers might recognize emotions in all kinds of scenarios. Isn't that awesome?

Glossary

Vision-Language Model

A vision-language model is an AI model that combines visual and language information, capable of processing both image and text data simultaneously.

In this paper, vision-language models are used to recognize human emotions in videos.

Long-Tail Distribution

Long-tail distribution refers to the phenomenon where a small number of categories dominate the majority of samples, while most categories are rare.

Emotion datasets typically exhibit long-tail distribution, with common emotion categories dominating.

Micro-Expression

Micro-expressions are brief, subtle facial expressions, typically lasting 0.25 to 0.5 seconds, that convey fleeting emotional states.

Micro-expressions are critical signals to capture in emotion recognition tasks.

Context Enrichment Strategy

A context enrichment strategy enhances a model's understanding by supplying supplementary information alongside its primary input.

The proposed multi-stage context enrichment strategy in this paper improves emotion recognition capabilities of vision-language models.

Weighted Average Recall

Weighted Average Recall is an evaluation metric that weights each category's recall by its share of the dataset, so frequent categories contribute more.

The paper uses Weighted Average Recall to evaluate model performance in emotion recognition tasks.

Unweighted Average Recall

Unweighted Average Recall is an evaluation metric that averages recall across all categories, regardless of frequency.

In long-tail datasets, Unweighted Average Recall prevents majority classes from masking minority class failures.

Natural Language Summary

A natural language summary simplifies complex information into an easily understandable text description.

In this paper, natural language summaries are used to convert video frame information into text input.

Sparse Keyframe

Sparse keyframes are a small number of representative frames selected from a video to reduce computational complexity.

In this paper, sparse keyframes are input alongside natural language summaries into the model.
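
A minimal sketch of one common way to pick sparse keyframes, by taking the center of k equal segments (the evenly spaced policy is an assumption, not necessarily the paper's choice):

```python
def sparse_keyframes(num_frames: int, k: int = 8) -> list[int]:
    """Return k frame indices, one from the center of each equal segment."""
    if num_frames <= k:
        return list(range(num_frames))
    seg = num_frames / k
    return [int(seg * i + seg / 2) for i in range(k)]

# sparse_keyframes(250, 8) -> [15, 46, 78, 109, 140, 171, 203, 234]
```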

Head-Class Bias

Head-class bias refers to the tendency of models to over-predict common categories in long-tail distribution data.

The proposed sampling strategy aims to mitigate head-class bias.

Emotional Trajectory

Emotional trajectory refers to the temporal change process of emotions, reflecting their dynamic nature.

The proposed strategy preserves emotional trajectory to improve emotion recognition accuracy.

Open Questions (unanswered questions from this research)

  • 1 How does the strategy perform in real-time settings, especially on high-frame-rate video?
  • 2 How sensitive is the model's final judgment to the quality of the intermediate natural language summaries?
  • 3 Can the temporal information processing of vision-language models be optimized further without increasing computational complexity?
  • 4 Does the strategy transfer to other video understanding tasks, such as action recognition and event detection?
  • 5 Does the strategy generalize across languages and cultural backgrounds, for example to multilingual and cross-cultural emotion recognition, and to emotion categories beyond those in the current experiments?

Applications

Immediate Applications

Mental Health Screening

Improved vision-language models can be used in mental health screening to enhance systems' emotional sensitivity and help identify potential mental health issues.

Conversational Agents

Application in conversational agents to enhance systems' emotion recognition capabilities, enabling more natural interactions with users and providing personalized services.

Education and Care

Application in assistive technologies for education and care to provide richer user experiences and help identify and respond to users' emotional needs.

Long-term Vision

Affective Computing Research

The strategy provides new insights for affective computing research, potentially leading to the development of more intelligent emotion recognition systems and advancing affective computing technology.

Cross-Cultural Emotion Recognition

Future research can explore the strategy's application in multilingual and cross-cultural emotion recognition, helping develop more universal emotion recognition systems.

Abstract

Understanding emotions is a fundamental ability for intelligent systems to be able to interact with humans. Vision-language models (VLMs) have made tremendous progress in the last few years for many visual tasks, potentially offering a promising solution for understanding emotions. However, it is surprising that even the most sophisticated contemporary VLMs struggle to recognize human emotions or to outperform even specialized vision-only classifiers. In this paper we ask the question "Why do VLMs struggle to recognize human emotions?", and observe that dynamic facial expression recognition (DFER), an inherently continuous and dynamic task, exposes two critical VLM vulnerabilities. First, emotion datasets are naturally long-tailed, and the web-scale data used to pre-train VLMs exacerbates this head-class bias, causing them to systematically collapse rare, under-represented emotions into common categories. We propose alternative sampling strategies that prevent favoring common concepts. Second, temporal information is critical for understanding emotions. However, VLMs are unable to represent temporal information over dense frame sequences, as they are limited by context size and the number of tokens that can fit in memory, which poses a clear challenge for emotion recognition. We demonstrate that the sparse temporal sampling strategy used in VLMs is inherently misaligned with the fleeting nature of micro-expressions (0.25-0.5 seconds), which are often the most critical affective signal. As a diagnostic probe, we propose a multi-stage context enrichment strategy that utilizes the information from "in-between" frames by first converting them into natural language summaries. This enriched textual context is provided as input to the VLM alongside sparse keyframes, preventing attentional dilution from excessive visual data while preserving the emotional trajectory.

cs.CV · cs.AI

References (20)

  • Shuai Bai, Keqin Chen, Xuejing Liu et al. (2025). Qwen2.5-VL Technical Report.
  • Gheorghe Comanici, Eric Bieber, Mike Schaekermann et al. (2025). Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities.
  • E. A. Haggard, K. Isaacs (1966). Micromomentary Facial Expressions as Indicators of Ego Mechanisms in Psychotherapy.
  • P. Ekman, Wallace V. Friesen (1969). Nonverbal Leakage and Clues to Deception.
  • Nelson F. Liu, Kevin Lin, John Hewitt et al. (2023). Lost in the Middle: How Language Models Use Long Contexts.
  • Y. Liu, Wei Dai, Chuanxu Feng et al. (2022). MAFW: A Large-scale, Multi-modal, Compound Affective Database for Dynamic Facial Expression Recognition in the Wild.
  • Bingyi Kang, Saining Xie, Marcus Rohrbach et al. (2019). Decoupling Representation and Classifier for Long-Tailed Recognition.
  • W. Reed (2001). The Pareto, Zipf and Other Power Laws.
  • Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman et al. (2024). RULER: What's the Real Context Size of Your Long-Context Language Models?
  • Peng Wang, Shuai Bai, Sinan Tan et al. (2024). Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution.
  • Siyu Sun, Han Lu, Jiangtong Li et al. (2025). Rethinking Classifier Re-Training in Long-Tailed Recognition: Label Over-Smooth Can Balance.
  • D. Matsumoto, H. Hwang (2011). Evidence for Training the Ability to Read Microexpressions of Emotion.
  • Yuri Lin, Jean-Baptiste Michel, Erez Aiden Lieberman et al. (2012). Syntactic Annotations for the Google Books NGram Corpus.
  • Licai Sun, Zheng Lian, B. Liu et al. (2023). MAE-DFER: Efficient Masked Autoencoder for Self-supervised Dynamic Facial Expression Recognition.
  • N. Chawla, K. Bowyer, L. Hall et al. (2002). SMOTE: Synthetic Minority Over-sampling Technique.
  • Joon Son Chung, Arsha Nagrani, Andrew Zisserman (2018). VoxCeleb2: Deep Speaker Recognition.
  • Rachael E. Jack, Oliver G. B. Garrod, P. Schyns (2014). Dynamic Facial Expressions of Emotion Transmit an Evolving Hierarchy of Signals over Time.
  • Zheng Cai, Maosong Cao, Haojiong Chen et al. (2024). InternLM2 Technical Report.
  • Tsung-Yi Lin, M. Maire, Serge J. Belongie et al. (2014). Microsoft COCO: Common Objects in Context.
  • Yumeng Shi, Quanyu Long, Yin Wu et al. (2025). Causality Matters: How Temporal Information Emerges in Video Language Models.