Popcorn: A Configurable Benchmark for Visual Evidence in Multimodal Movie Recommendation

TL;DR

The Popcorn benchmark combines title-aligned full-movie/trailer embeddings with VLM-encoded thumbnails to evaluate visual evidence in multimodal movie recommendation.

cs.IR | Advanced | 2026-06-08

Ali Tourani Fatemeh Nazary Yashar Deldjoo Tommaso Di Noia


Keywords: multimodal recommendation, visual evidence, deep learning, vision-language models, benchmark

Key Findings

Methodology

This study introduces Popcorn, a configurable benchmark for evaluating visual evidence in multimodal movie recommendation. It integrates title-aligned embeddings from full movies and trailers with thumbnail features encoded by state-of-the-art visual and vision-language models like CLIP, DINOv2, and SigLIP. The framework standardizes processes such as modality assembly, fusion (e.g., concatenation, PCA, CCA), splitting, evaluation, and LLM-augmented metadata enrichment, all controlled via a unified configuration system. The dataset includes derived frame-level, shot-level, and pooled embeddings for 274 movies, while the thumbnail layer links approximately 65,000 MovieLens titles with over 300,000 visual embeddings. The pipeline supports multiple recommendation algorithms (VBPR, AMR, VMF), enabling systematic comparison of evidence sources, encoder families, and fusion strategies. This setup facilitates reproducibility and detailed ablation studies, providing insights into the role of visual evidence in recommendation performance.
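The idea of a single configuration contract driving ablations can be illustrated with a minimal sketch. Every field name and allowed value below is hypothetical and chosen only for illustration; this is not Popcorn's actual configuration schema. Only the PCA variance (90%), the CCA component count (40), and the recommender names come from this summary.

```python
from dataclasses import asdict, dataclass, replace
from itertools import product
import json

# Hypothetical vocabularies; Popcorn's real option names may differ.
EVIDENCE_SOURCES = ("full_movie", "trailer", "thumbnail")
FUSION_METHODS = ("concat", "pca", "cca")
RECOMMENDERS = ("VBPR", "AMR", "VMF")


@dataclass(frozen=True)
class RunConfig:
    """One experiment, fully described by its configuration (illustrative)."""
    evidence_source: str = "thumbnail"
    encoder: str = "siglip-base"    # hypothetical encoder identifier
    fusion: str = "cca"
    pca_variance: float = 0.90      # variance ratio retained by PCA
    cca_components: int = 40        # number of CCA components
    recommender: str = "VBPR"
    top_k: int = 10
    llm_augmentation: bool = False

    def validate(self) -> None:
        if self.evidence_source not in EVIDENCE_SOURCES:
            raise ValueError(f"unknown evidence source: {self.evidence_source}")
        if self.fusion not in FUSION_METHODS:
            raise ValueError(f"unknown fusion method: {self.fusion}")
        if self.recommender not in RECOMMENDERS:
            raise ValueError(f"unknown recommender: {self.recommender}")
        if not 0.0 < self.pca_variance <= 1.0:
            raise ValueError("pca_variance must be in (0, 1]")
        if self.cca_components < 1:
            raise ValueError("cca_components must be positive")

    def to_json(self) -> str:
        # Logging the serialized config next to the metrics keeps runs traceable.
        return json.dumps(asdict(self), sort_keys=True)


def ablation_grid(base, sources=EVIDENCE_SOURCES, fusions=FUSION_METHODS):
    """Vary evidence source and fusion method while holding everything else fixed."""
    grid = [replace(base, evidence_source=s, fusion=f)
            for s, f in product(sources, fusions)]
    for cfg in grid:
        cfg.validate()
    return grid


if __name__ == "__main__":
    for cfg in ablation_grid(RunConfig()):
        print(cfg.to_json())
```

Holding one frozen base configuration and varying only two fields is what makes the source and fusion comparisons controlled: any metric difference between two runs can be traced to a logged field.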

Key Results

Thumbnail embeddings from modern vision-language models (VLMs) such as SigLIP-base outperform traditional multi-frame CNN features from trailers or full movies, reaching an nDCG@10 of 0.269, a 21.2% relative improvement over the older trailer CNN baseline. These static semantic features scale to the full catalog and remain effective there, especially when combined with fusion strategies like CCA.
Comparative experiments show that trailer visual evidence generally performs better than full movies in visual-only settings (e.g., VBPR trailer nDCG@10=0.433), but after applying CCA fusion, the performance gap narrows or reverses, indicating the complementary nature of different evidence sources. Fusion strategies improve coverage but may reduce diversity, highlighting trade-offs.
Fusion strategies such as CCA significantly enhance coverage (from 0.767 to 0.918) but can lower diversity (from 0.766 to 0.749). LLM-based metadata augmentation improves recommendation quality in some settings but introduces sensitivity to prompts and models. Overall, the results demonstrate the effectiveness of static thumbnail features encoded by VLMs and the importance of fusion and metadata augmentation in optimizing recommendation performance.

Significance

This research advances the understanding of how different visual evidence sources impact multimodal movie recommendation. By explicitly controlling and comparing full movies, trailers, and thumbnails, it clarifies their distinct semantic roles and scalability trade-offs. The Popcorn benchmark provides a standardized platform for systematic evaluation, fostering reproducibility and comparability across studies. The use of modern VLMs to encode sparse but semantically rich thumbnail features offers a scalable solution for large-scale catalogs, addressing practical challenges faced by industry. The insights gained from this work can guide the design of more accurate, diverse, and explainable recommender systems, ultimately enhancing user experience and content discovery in multimedia platforms.

Technical Contribution

The core technical contributions include the development of a unified, configurable pipeline that supports multiple evidence sources, encoder families, fusion strategies, and recommendation algorithms. The framework leverages state-of-the-art VLMs (CLIP, DINOv2, SigLIP) for thumbnail encoding, enabling semantic-rich, sparse visual features suitable for large catalogs. It systematically compares evidence sources (full movies, trailers, thumbnails) under controlled settings, quantifies their impact on recommendation metrics, and introduces hyperparameter tuning for fusion methods (PCA, CCA). Additionally, it integrates LLMs for metadata augmentation, enhancing content understanding and interpretability. The benchmark's design emphasizes reproducibility, with detailed configuration logging and metric reporting, facilitating rigorous ablation studies and cross-model comparisons.

Novelty

Popcorn's novelty lies in its explicit differentiation and systematic comparison of multiple visual evidence sources—full movies, trailers, and thumbnails—within a unified benchmarking framework. Unlike prior works that focus on a single modality or lack controlled evaluation, this study isolates the effect of evidence source and fusion strategy, revealing their distinct influence on recommendation quality. The integration of modern VLMs for sparse thumbnail encoding at catalog scale, combined with configurable fusion and augmentation modules, represents a significant step forward in scalable, interpretable multimodal recommendation research. This comprehensive, source-controlled approach provides new insights into evidence effectiveness and sets a standard for future benchmarking efforts.

Limitations

The dataset relies on derived embeddings rather than raw videos, which may limit the richness of visual features and restrict the analysis to precomputed representations. The aligned full-movie subset is relatively small, constraining the study of long-form video effects.
LLM-based metadata augmentation depends heavily on pretrained models and prompt design, making it sensitive to biases and inconsistencies, which could affect reproducibility and stability.
Offline evaluation cannot fully capture real user interactions and preferences, necessitating future online or user-centered studies to validate practical effectiveness. Additionally, computational costs for encoding large-scale visual features remain high, posing challenges for real-time deployment.

Future Work

Future directions include expanding Popcorn with larger, lawfully accessible full-movie datasets, improving temporal encoding for long videos, and integrating audio modalities. Incorporating Visual RAG for retrieval-augmented reasoning and explanation is planned to enhance interpretability. Further, online evaluation with user feedback will be essential to validate the system's real-world performance. Developing more robust, less prompt-sensitive LLM augmentation techniques and exploring adaptive fusion strategies will also be prioritized. Ultimately, the goal is to build scalable, explainable, and user-centric multimodal recommendation systems that can handle diverse content types and large catalogs, transforming multimedia content discovery and personalization.

AI Executive Summary

In the rapidly evolving landscape of multimedia content, movie recommendation systems face the challenge of effectively leveraging diverse and complex visual information. Traditional approaches primarily rely on sparse metadata, posters, or short trailers, which often fail to capture the full narrative and aesthetic richness of films. As a result, recommendation accuracy, diversity, and explainability remain limited, especially at large catalog scales.

Recognizing these limitations, Ali Tourani and colleagues introduced Popcorn—a comprehensive, configurable benchmark designed to systematically evaluate the role of visual evidence in multimodal movie recommendation. The core idea is to explicitly compare different sources of visual information, including title-aligned full movies, trailers, and static thumbnails, encoded by various modern models. This setup enables researchers to dissect the contribution of each evidence type, their fusion strategies, and the impact of metadata augmentation.

The framework integrates state-of-the-art visual and vision-language models such as CLIP, DINOv2, and SigLIP to encode sparse thumbnail images into semantically rich embeddings. Simultaneously, it utilizes classical CNNs like Inception-v3 and VGG-19 to extract detailed frame-level features from full movies and trailers. These features are then combined through flexible fusion methods—concatenation, PCA, and CCA—allowing for systematic ablation studies. The entire pipeline supports multiple recommendation algorithms, including VBPR, AMR, and VMF, ensuring broad applicability.
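To make concrete how a visual recommender such as VBPR consumes these item embeddings, here is a minimal sketch of the VBPR preference score in its standard published form (He and McAuley). It is written in plain Python for readability. Whether Popcorn's implementation matches this term for term is an assumption, and all variable names are illustrative.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project(E, f):
    """Map a raw visual feature f into the low-dimensional visual space via matrix E."""
    return [dot(row, f) for row in E]


def vbpr_score(alpha, beta_u, beta_i, gamma_u, gamma_i,
               theta_u, E, f_i, beta_visual):
    """Standard VBPR score:
    x_ui = alpha + beta_u + beta_i + <gamma_u, gamma_i>
           + <theta_u, E f_i> + <beta_visual, f_i>
    """
    latent_term = dot(gamma_u, gamma_i)           # collaborative factors
    visual_term = dot(theta_u, project(E, f_i))   # user's visual preference
    visual_bias = dot(beta_visual, f_i)           # global visual bias
    return alpha + beta_u + beta_i + latent_term + visual_term + visual_bias


if __name__ == "__main__":
    score = vbpr_score(
        alpha=0.0, beta_u=0.0, beta_i=0.1,
        gamma_u=[1.0, 0.0], gamma_i=[0.5, 2.0],
        theta_u=[1.0, 1.0],
        E=[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
        f_i=[0.2, 0.4, 0.6],               # stands in for a thumbnail or trailer embedding
        beta_visual=[0.0, 0.0, 0.5],
    )
    print(round(score, 6))
```

In this formulation the item feature `f_i` is the only place the evidence source enters, so swapping a trailer embedding for a thumbnail embedding changes the input while leaving the model unchanged.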

Experimental results demonstrate that modern VLM-encoded thumbnails outperform traditional CNN features in large-scale catalog settings, providing a scalable and effective visual evidence source. Fusing evidence sources can substantially increase coverage, though at some cost in diversity: CCA fusion raises coverage from 0.767 to 0.918 while lowering diversity from 0.766 to 0.749. LLM-based metadata augmentation improves recommendation quality in some settings by enriching sparse textual descriptions, although it introduces sensitivity to prompts and models.

Overall, Popcorn establishes a new standard for controlled, reproducible evaluation of visual evidence in movie recommendation. Its insights reveal that different evidence sources are not interchangeable but complementary, and that fusion strategies and metadata augmentation are crucial for optimizing performance. The framework's modular design and detailed logging facilitate extensive ablation studies, fostering deeper understanding and innovation.

Looking ahead, expanding the dataset to include larger, lawfully accessible full movies, improving temporal encoding, and integrating multimodal retrieval mechanisms will be key. The ultimate goal is to develop scalable, interpretable, and user-centric recommendation systems capable of handling the complexity of modern multimedia content, thereby transforming how users discover and enjoy movies in the digital age.

Deep Analysis

Background

The evolution of movie recommendation systems reflects a gradual shift from simple collaborative filtering to sophisticated multimodal approaches. Early methods relied heavily on user-item interaction data, such as ratings and clicks, with limited content understanding. As deep learning advanced, researchers incorporated visual features extracted from posters, trailers, and short clips, utilizing CNNs like VGG and Inception to capture visual cues related to genre, style, and mood. Notable works include He and McAuley’s VBPR, which integrated visual features into Bayesian personalized ranking, and the MicroLens dataset, which provided micro-video features for recommendation. Despite this progress, these approaches often treat visual evidence as a monolithic input and do not systematically compare different evidence sources. The emergence of vision-language models (VLMs) like CLIP has opened new avenues for encoding sparse but semantically rich visual signals, enabling large-scale catalog applications. However, existing benchmarks and datasets do not explicitly isolate the effect of evidence source, nor do they support comprehensive ablation studies on fusion strategies or metadata augmentation. Consequently, understanding how different visual cues contribute to recommendation performance remains an open challenge, limiting the development of more effective and explainable systems.

Core Problem

The core problem addressed by this work is the lack of a systematic, controlled framework to evaluate the impact of different visual evidence sources—full movies, trailers, and thumbnails—on multimodal recommendation performance. Existing datasets and benchmarks often conflate these sources or focus solely on trailers, making it difficult to disentangle their individual contributions. Moreover, current models do not explicitly compare classical CNN-based multi-frame features with modern VLM-based semantic features at scale, especially in terms of scalability, computational cost, and recommendation quality. This gap hampers the development of optimized fusion strategies and limits understanding of how sparse visual signals can be effectively leveraged in large catalogs. Additionally, the absence of standardized evaluation protocols and detailed logging impedes reproducibility and cross-study comparisons, slowing progress in the field. Addressing these issues requires a comprehensive benchmark that isolates evidence source effects, supports configurable fusion and augmentation strategies, and provides detailed metrics for beyond-accuracy considerations such as coverage, diversity, and fairness.

Innovation

The main innovations of this work include:
1) The creation of Popcorn, a benchmark that explicitly differentiates between visual evidence sources (full movies, trailers, and thumbnails) and supports systematic comparison under controlled settings.
2) The integration of multiple state-of-the-art visual and vision-language models (CLIP, DINOv2, SigLIP) for encoding sparse thumbnail images, enabling semantic-rich features at catalog scale.
3) The design of a flexible, configuration-driven pipeline that supports various fusion strategies (concatenation, PCA, CCA), recommendation algorithms, and LLM-based metadata augmentation, facilitating comprehensive ablation studies.
4) The release of aligned video embeddings for 274 movies and a large thumbnail layer linked to 65,000 MovieLens titles, providing a rich resource for research.
5) The emphasis on reproducibility and detailed logging, including hyperparameters, modality choices, and metrics, setting a new standard for benchmarking in multimodal recommendation.
These innovations collectively enable a nuanced understanding of how different visual evidence sources and fusion strategies influence recommendation outcomes.

Methodology

• Evidence Loading: Extract frame-level, shot-level, and pooled embeddings from full movies and trailers using CNNs like Inception-v3 and VGG-19, sampled at 1 FPS for detailed temporal representation.
• Thumbnail Encoding: Construct a large-scale thumbnail layer linking approximately 65,000 MovieLens titles with visual features encoded by six models (CLIP, OpenCLIP, DINOv2, SigLIP). These features are aggregated into over 300,000 embeddings, organized into 13 image packs.
• Fusion Strategies: Implement configurable fusion methods (concatenation, PCA retaining 90% variance, and CCA with 40 components) as hyperparameters, allowing systematic comparison.
• Recommendation Models: Use classical multi-modal recommenders such as VBPR, AMR, and VMF, trained on fused features, with hyperparameter tuning for each.
• Metadata Augmentation: Employ LLMs (e.g., LLaMA, OpenAI GPT) to generate descriptive summaries from sparse metadata, embedding these texts for fusion with visual features.
• Evaluation Protocol: Conduct offline experiments on MovieLens-1M and a subset of 274 aligned videos, measuring nDCG@10, Recall@10, coverage, diversity, fairness, and calibration. Hyperparameters are tuned via grid search, with detailed logging of configurations and metrics.
• Ablation Studies: Vary evidence sources, encoding backbones, fusion methods, and metadata augmentation to analyze their individual and combined effects on recommendation quality, ensuring reproducibility and comprehensive understanding.
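The three fusion strategies named above can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not Popcorn's implementation. The function names, the ridge term `reg`, and the choice to concatenate the two CCA-projected views into one item vector are all assumptions. Only the 90% variance target and the 40 components come from this summary.

```python
import numpy as np


def concat_fusion(visual, text):
    """Early fusion: stack the two modality matrices side by side."""
    return np.hstack([visual, text])


def pca_fusion(visual, text, variance=0.90):
    """Concatenate, center, and keep the fewest principal components
    whose cumulative explained variance reaches `variance`."""
    X = np.hstack([visual, text])
    X = X - X.mean(axis=0)
    _, S, Vt = np.linalg.svd(X, full_matrices=False)
    explained = S**2 / np.sum(S**2)
    k = int(np.searchsorted(np.cumsum(explained), variance)) + 1
    k = min(k, len(S))
    return X @ Vt[:k].T


def _inv_sqrt(C):
    """Inverse matrix square root of a symmetric positive-definite matrix."""
    w, V = np.linalg.eigh(C)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T


def cca_fusion(visual, text, n_components=40, reg=1e-3):
    """Regularized CCA: whiten each view, then take the SVD of the
    whitened cross-covariance. Returns the fused matrix and the
    canonical correlations."""
    X = visual - visual.mean(axis=0)
    Y = text - text.mean(axis=0)
    n = X.shape[0]
    Cxx = X.T @ X / (n - 1) + reg * np.eye(X.shape[1])  # ridge keeps Cxx invertible
    Cyy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / (n - 1)
    Kx, Ky = _inv_sqrt(Cxx), _inv_sqrt(Cyy)
    U, S, Vt = np.linalg.svd(Kx @ Cxy @ Ky, full_matrices=False)
    k = min(n_components, len(S))
    Xc = X @ Kx @ U[:, :k]      # visual view in the shared space
    Yc = Y @ Ky @ Vt[:k].T      # text view in the shared space
    return np.hstack([Xc, Yc]), S[:k]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    visual = rng.normal(size=(200, 64))   # stand-in for item visual embeddings
    text = rng.normal(size=(200, 48))     # stand-in for item text embeddings
    print(concat_fusion(visual, text).shape)
    print(pca_fusion(visual, text).shape)
    fused, corr = cca_fusion(visual, text)
    print(fused.shape)
```

Concatenation preserves every dimension, PCA compresses the joint space without regard to modality, and CCA keeps only the directions in which the two modalities correlate. That difference may explain why fusion choice shifts coverage and diversity as well as accuracy.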

Experiments

The experiments use two primary datasets: derived full-movie and trailer embeddings for 274 title-aligned movies, and a large-scale thumbnail layer linked to approximately 65,000 MovieLens titles. The evaluation metrics include accuracy measures such as nDCG@10, Recall@10, and precision, alongside beyond-accuracy metrics like coverage, diversity, fairness, and calibration bias. Baselines include traditional CNN features from trailers and older visual models. Hyperparameters such as PCA variance retention (90%) and CCA component count (40) are systematically tuned. Experiments compare visual-only, text-only, and fused modalities across different evidence sources, analyzing the impact of fusion strategies and LLM augmentation. Results show that VLM-encoded thumbnails outperform CNN-based trailer features, with significant improvements in recommendation metrics. Fusion strategies like CCA enhance coverage but may reduce diversity, highlighting trade-offs. The study also examines the effect of model size and storage proxies, finding that larger models do not always guarantee better performance, which matters for deployment. Overall, the experiments validate the effectiveness of static thumbnail features encoded by modern VLMs and demonstrate the importance of fusion and metadata augmentation in optimizing recommendation quality.
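For readers unfamiliar with the headline metrics, the sketch below shows common binary-relevance definitions of Recall@10, nDCG@10, and catalog coverage. These are textbook formulations. The exact variants Popcorn uses (for example, the recall denominator or how coverage is aggregated across users) are not specified in this summary, so treat them as assumptions.

```python
import math


def recall_at_k(ranked, relevant, k=10):
    """Fraction of a user's relevant items that appear in the top-k list."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked, relevant, k=10):
    """Binary-relevance nDCG: discounted gain of hits divided by the
    gain of an ideal ranking that places all relevant items first."""
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, item in enumerate(ranked[:k]) if item in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


def catalog_coverage(all_rankings, catalog_size, k=10):
    """Share of the catalog that appears in at least one user's top-k list."""
    recommended = set()
    for ranked in all_rankings:
        recommended.update(ranked[:k])
    return len(recommended) / catalog_size


if __name__ == "__main__":
    ranked = ["a", "b", "c", "d"]
    relevant = {"a", "c"}
    print(recall_at_k(ranked, relevant))
    print(round(ndcg_at_k(ranked, relevant), 4))
    print(catalog_coverage([["a", "b"], ["b", "c"]], catalog_size=4))
```

Coverage is computed over all users' lists rather than per user, which is why a fusion method can raise coverage while leaving per-user accuracy roughly unchanged.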

Results

The key results demonstrate that VLM-encoded thumbnails, such as SigLIP-base, outperform traditional trailer CNN features, with nDCG@10 reaching 0.269 versus 0.222 baseline, a 21.2% increase. Fusion strategies, particularly CCA, significantly improve coverage (from 0.767 to 0.918) but may slightly reduce diversity (from 0.766 to 0.749). Visual-only evidence from trailers surpasses full movies in some models (e.g., VBPR trailer nDCG@10=0.433), but after fusion, full movies can match or exceed trailer performance, indicating their complementary nature. LLM-based metadata augmentation yields moderate improvements in recommendation metrics, especially when combined with fusion strategies, but introduces sensitivity to prompt design. Cost-performance analysis shows that models like CLIP offer high coverage with smaller size, while DINOv2-large achieves higher diversity at increased computational cost. These findings highlight the importance of evidence source selection, fusion strategy, and model efficiency in practical deployment.
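The percentages quoted here are relative changes, and a short check makes the arithmetic explicit. The input values are those reported in this summary. The relative coverage and diversity changes printed below are derived from those values for illustration and are not figures stated by the authors.

```python
def relative_change(baseline, new):
    """Relative change from baseline to new, in percent."""
    return (new - baseline) / baseline * 100.0


# nDCG@10: trailer CNN baseline vs. SigLIP-base thumbnails (reported values)
ndcg_gain = relative_change(0.222, 0.269)

# Coverage and diversity before vs. after CCA fusion (reported values)
coverage_gain = relative_change(0.767, 0.918)
diversity_change = relative_change(0.766, 0.749)

print(f"nDCG@10:   {ndcg_gain:+.1f}%")
print(f"coverage:  {coverage_gain:+.1f}%")
print(f"diversity: {diversity_change:+.1f}%")
```

The nDCG@10 figure reproduces the stated 21.2% improvement. By the same calculation, the coverage gain is far larger in relative terms than the diversity loss.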

Applications

The immediate application of Popcorn lies in enhancing large-scale movie recommendation platforms by integrating multi-source visual evidence, thereby improving accuracy, coverage, and user satisfaction. The benchmark facilitates systematic evaluation and optimization of multimodal fusion strategies, guiding industry practitioners in model selection and deployment. Additionally, the framework supports content understanding and explainability, enabling platforms to generate transparent recommendations with visual and textual justifications. Long-term applications include developing adaptive, real-time recommendation systems that leverage multimodal cues for personalized content delivery, content creator tools for content analysis and tagging, and cross-media retrieval systems that utilize visual evidence for efficient content discovery. These advancements can transform multimedia content curation, user engagement, and monetization strategies across entertainment industries.

Limitations & Outlook

Despite its strengths, the current Popcorn framework faces limitations such as reliance on derived embeddings rather than raw videos, which may limit the richness of visual features. The aligned full-movie subset is relatively small, restricting long-form video analysis. LLM-based metadata augmentation depends on pretrained models and prompt engineering, which can introduce biases and inconsistencies. Computational costs for encoding and fusion at scale remain high, posing challenges for real-time deployment. Offline evaluation cannot fully capture dynamic user interactions and preferences, necessitating future online validation. Additionally, the framework’s focus on static visual features may overlook temporal dynamics and audio cues, which are also crucial for comprehensive content understanding. Addressing these limitations involves expanding datasets, optimizing encoding pipelines, and integrating multimodal temporal modeling to enhance robustness and applicability in real-world scenarios.

Plain Language Accessible to non-experts

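Imagine you are in an enormous library full of all kinds of books. Each book comes with different clues that can help you find what you like. Some books have the complete story (like a full movie), which takes a long time to get through; some have a short introductory video (like a trailer) that gives you a quick idea of the content; and some have a cover image (like a movie thumbnail) that shows the style and theme at a glance. Every clue can help you find a book you like, but they work differently. The complete story tells you every detail but costs a lot of time; the short video is quick and intuitive but carries limited information; the cover image is simple, yet it lets you sift quickly through many books. The researchers are, in effect, designing a very smart recommender system: they use various techniques to turn these clues into numbers and then let a computer learn which clue best helps you find the book you want. They found that cover images labeled with modern tools (akin to vision-language models) can help you find books you like faster than traditional short videos, and that combining different clues works even better. Such a system can help you find movies you like quickly, accurately, and intelligently in a massive film library, just like a very clever library assistant.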

ELI14 Explained like you're 14

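Imagine you are in a huge movie theater with several different clues to help you pick a film: the complete movie (like a long novel), a short trailer, and the movie poster (the film's cover image). Each clue tells you something, but they work differently. The complete movie gives you the whole story, but it is long and hard to get through; the trailer is short but gives you the general idea; the poster is small, yet it lets you quickly judge the film's style and theme. The researchers use these clues to recommend movies: they have a computer turn the clues into numbers and then analyze which clue helps you find a movie you like fastest. The study found that movie posters encoded by modern vision models (think of pictures taken with a smart camera) help you find movies you like faster than traditional trailers. Combining different clues, such as posters and trailers, can also make recommendations more accurate and more varied, like piecing together a complete picture from separate hints. The takeaway is that using a range of clues and techniques makes movie recommendation smarter, faster, and more attentive to what you want, just as you would use different clues to find your favorite film in the theater. In the future, more clues could be added, such as your interests and habits, to make recommendations more personalized and precise.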

Abstract

Movies are long-form audiovisual works, yet recommender benchmarks often rely on trailers, thumbnails, or metadata. These sources differ in semantics and scalability: full movies preserve consumption-level evidence, trailers concentrate promotional highlights, and thumbnails provide sparse but catalog-scale visual signals. We present Popcorn, a configurable benchmark for visual evidence in multimodal movie recommendation, combining title-aligned full-movie/trailer embeddings with MovieLens-linked thumbnail features encoded by modern visual and vision-language models. Popcorn standardizes modality assembly, fusion, splitting, evaluation, and LLM-augmented metadata through a single configuration contract. Experiments show that thumbnail VLMs provide strong, scalable item-side evidence, while controlled trailer/full-movie comparisons show that visual evidence sources are not interchangeable: the choice of source and fusion strategy affects ranking accuracy, coverage, diversity, and calibration. The framework is available at https://github.com/RecSys-lab/Popcorn.

