ELVA: Exploring Ranking-Driven Universal Multimodal Retrieval

TL;DR

ELVA employs a ranking-driven reinforcement learning framework with rule-based rewards, achieving 13.1% improvement on MRBench for multi-grain retrieval.

cs.IR Advanced 2026-06-18

Yuhan Liu Pei Fu Hang Li Yukun Qi Chao Jiang Jingwen Fu Zhen Liu Bin Qin Zhenbo Luo Jian Luan Jingmin Xin


multimodal retrieval contrastive learning reinforcement learning grain blindness ranking optimization

Key Findings

Methodology

ELVA integrates rule-based verifiable rewards with reinforcement learning (RLVR) to improve the ranking ability of Multimodal Large Language Models (MLLMs). The framework has three parts: 1) it extends RLVR to retrieval tasks without relying on explicit ranking labels, so the model can explore new ranking behaviors; 2) a Ranking Reward incentivizes placing positive samples higher while structuring negative samples hierarchically; 3) a Margin Reward enforces a similarity gap between positive and negative samples, which helps capture multi-grain semantic information. During training, a balanced negative sampling strategy combines hard negatives and random negatives for diversity and stability. The model performs G independent rollouts per query, rewards are computed with the proposed functions, and the policy is optimized with the GRPO algorithm while a KL-divergence term keeps it close to a reference policy. Experiments on benchmarks such as MRBench show a 13.1% improvement, supporting the approach's effectiveness in mitigating grain blindness.
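The summary names a ranking reward and a margin reward but does not give their formulas. The sketch below is one plausible rule-based reading, not the paper's definition: the reciprocal-rank form, the clipping, the `margin` value, and the weights `w_rank` and `w_margin` are all assumptions chosen for illustration. It also leaves out the hierarchical ordering of negatives, which would need graded relevance information that the summary does not specify.

```python
from typing import Sequence


def ranking_reward(pos_sim: float, neg_sims: Sequence[float]) -> float:
    """Hypothetical ranking reward: reciprocal rank of the positive (1.0 = ranked first)."""
    # Ties count against the positive, so equal scores do not earn full reward.
    rank = 1 + sum(1 for s in neg_sims if s >= pos_sim)
    return 1.0 / rank


def margin_reward(pos_sim: float, neg_sims: Sequence[float], margin: float = 0.2) -> float:
    """Hypothetical margin reward: share of the target gap reached against the
    hardest negative, clipped to [0, 1]."""
    gap = pos_sim - max(neg_sims)
    return min(max(gap / margin, 0.0), 1.0)


def total_reward(
    pos_sim: float,
    neg_sims: Sequence[float],
    w_rank: float = 1.0,    # assumed weight, not from the source
    w_margin: float = 1.0,  # assumed weight, not from the source
    margin: float = 0.2,    # assumed target gap, not from the source
) -> float:
    """Weighted sum of the two rule-based rewards for one rollout."""
    return (w_rank * ranking_reward(pos_sim, neg_sims)
            + w_margin * margin_reward(pos_sim, neg_sims, margin))
```

Both functions are computed from similarity scores alone, which is what makes a reward of this kind verifiable without a learned reward model or explicit ranking labels.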

Key Results

On the MRBench benchmark, ELVA achieved a 13.1% increase in retrieval accuracy, significantly outperforming previous contrastive learning-based methods, confirming its robustness in complex multi-grain scenarios.
In standard retrieval tasks like F200K and COCO, ELVA reached state-of-the-art results, with an average Recall@10 improvement of 4.3%, indicating strong generalization across diverse datasets.
Ablation studies revealed that combining ranking and margin rewards yields superior performance compared to using either alone, highlighting the importance of multi-reward design for capturing fine-grained semantic information.

Significance

This research addresses a fundamental challenge in multimodal retrieval, grain blindness, by introducing a reinforcement learning framework that optimizes ranking behavior through rule-based rewards. The approach lets models discover hierarchical structure among negative samples on their own, improving their sensitivity to multi-layered semantic cues. This matters for applications such as content-based search engines, multimedia content management, and intelligent virtual assistants, where complex, multi-label queries are common. The methodology bridges generative pretraining and discriminative retrieval. In the longer term, the work points toward more adaptive, explainable, and efficient retrieval systems for increasingly complex multimodal data.

Technical Contribution

ELVA's core technical innovations include: 1) the formulation of verifiable rule-based rewards that enable unsupervised, continuous optimization of ranking policies without explicit labels; 2) the integration of ranking and margin rewards to simultaneously optimize order and semantic gaps, effectively capturing multi-grain information; 3) the application of the GRPO algorithm for stable policy updates, combined with a balanced negative sampling strategy to ensure training diversity and robustness. These contributions fundamentally differ from prior contrastive learning approaches by enabling dynamic, hierarchical ranking exploration, and providing theoretical guarantees for the preservation of multi-grain semantic structures. The framework also introduces a novel multi-grain benchmark (MRBench) for comprehensive evaluation, setting new standards in the field.
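The balanced negative sampling strategy can be pictured as drawing part of each negative set from a hard-negative pool and the rest at random. The sketch below assumes a 50/50 split and string document IDs; the function name, `hard_fraction`, and `n_total` are hypothetical, and the source does not say how hard negatives are mined or filtered.

```python
import random
from typing import List, Sequence


def sample_balanced_negatives(
    hard_pool: Sequence[str],
    full_pool: Sequence[str],
    positive: str,
    n_total: int = 8,            # assumed number of negatives per query
    hard_fraction: float = 0.5,  # assumed hard/random ratio
    seed: int = 0,
) -> List[str]:
    """Return up to n_total distinct negatives: hard ones first, then random ones."""
    rng = random.Random(seed)

    # Deduplicate while preserving order, and never sample the positive itself.
    hard_candidates = [c for c in dict.fromkeys(hard_pool) if c != positive]
    n_hard = min(int(n_total * hard_fraction), len(hard_candidates))
    hard = rng.sample(hard_candidates, n_hard)

    # Random negatives come from the rest of the corpus.
    excluded = set(hard) | {positive}
    random_candidates = [c for c in dict.fromkeys(full_pool) if c not in excluded]
    n_rand = min(n_total - n_hard, len(random_candidates))
    return hard + rng.sample(random_candidates, n_rand)
```

Mixing the two sources is a common way to keep the training signal informative (hard negatives) without letting a handful of near-duplicates dominate every batch (random negatives).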

Novelty

This work is the first to embed rule-based, verifiable rewards into a reinforcement learning framework tailored for multimodal retrieval, specifically targeting the grain blindness problem. Unlike conventional contrastive methods that treat all negatives equally, ELVA dynamically adjusts negative hierarchies based on relevance, capturing multi-level semantic nuances. The combined use of ranking and margin rewards, along with multi-round generation, represents a significant departure from existing single-objective contrastive models. This innovative approach not only improves retrieval accuracy in complex scenarios but also offers a new paradigm for unsupervised, hierarchical ranking optimization in multimodal AI.

Limitations

The computational cost of multi-round generation and reward calculation is high, potentially limiting real-time deployment in large-scale systems.
Sensitivity to hyperparameters such as reward weights and temperature parameters requires careful tuning, which may hinder scalability across different tasks.
While effective in multi-grain scenarios, the framework's performance on extremely ambiguous or noisy queries remains limited and needs further robustness work.

Future Work

Future research could focus on developing adaptive reward weighting mechanisms to reduce reliance on manual hyperparameter tuning. Integrating more efficient training strategies, such as distillation or pruning, could lower computational costs for industrial deployment. Additionally, expanding the framework to handle more diverse and ambiguous query types, possibly through multi-task learning or multi-modal pretraining, will further enhance its robustness. Exploring explainability and interpretability of the learned ranking policies could also foster greater trust and transparency in AI retrieval systems. Lastly, extending the evaluation to real-world applications like multimedia search engines and virtual assistants will validate its practical impact.

AI Executive Summary

In the rapidly evolving landscape of artificial intelligence, the ability to retrieve relevant information across multiple modalities—such as text, images, and videos—has become a cornerstone of intelligent systems. Traditional approaches, primarily based on contrastive learning, have demonstrated remarkable success in aligning multimodal representations. However, these methods often struggle with complex queries that contain multiple semantic layers or attributes, a phenomenon known as 'grain blindness.' This limitation hampers the retrieval accuracy in real-world scenarios where user queries are inherently multi-faceted and hierarchical.

Addressing this challenge, Yuhan Liu and colleagues introduce ELVA, a novel framework that leverages ranking-driven reinforcement learning with rule-based, verifiable rewards. Unlike conventional models that rely solely on static contrastive objectives, ELVA dynamically explores and optimizes the ranking behavior of multimodal large language models (MLLMs). The core idea is to treat negative samples differently based on their relevance to the positive sample, structuring them hierarchically to better capture multi-grain semantic information.

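The methodology involves a multi-stage training process. Initially, the model undergoes pretraining on natural language inference datasets and instruction tuning on diverse retrieval tasks, enhancing its generative and discriminative capabilities. Subsequently, the reinforcement learning stage uses G rollouts per query and combines a ranking reward (which encourages placing positive samples near the top) with a margin reward (which enforces a similarity gap between positive and negative samples), optimizing the policy with the GRPO algorithm. The reward design lets the model explore effective ranking strategies on its own even without explicit ranking labels, which enables it to perform well on complex multimodal queries.

ELVA delivers significant gains on several standard and newly introduced benchmarks. In particular, on the MRBench multi-grain retrieval task it improves performance by 13.1%, far ahead of existing methods. This both confirms its effectiveness at mitigating grain blindness and offers a new direction for multimodal retrieval. Going forward, ELVA could bring more capable behavior to content search, virtual assistants, and intelligent recommendation, and push multimodal understanding toward deeper integration.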

Deep Analysis

Background

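Multimodal information retrieval, an important direction in AI research, has moved quickly from single-modality to cross-modal settings. Early work such as CLIP (Contrastive Language-Image Pretraining) achieved cross-modal alignment, but showed limited grain-level understanding on complex, multi-layered semantic queries. More recently, with the arrival of Multimodal Large Language Models (MLLMs), researchers have begun to use their broad knowledge and expressive capacity to improve retrieval accuracy and robustness. Representative work, including VLM-R and LamRA, optimizes the embedding space with contrastive learning and has achieved notable results. These methods nonetheless remain grain-blind in multi-grain, multi-label scenarios and struggle to capture fine-grained information, which limits their range of application. Reinforcement learning (RL) has meanwhile drawn growing attention as a tool for optimizing ranking strategies and compensating for the shortcomings of contrastive learning, but a systematic solution for multimodal settings is still lacking.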

Core Problem

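The core difficulty in multimodal retrieval is grain blindness: when faced with complex multi-layered, multi-label queries, the model fails to capture fine-grained information fully, and retrieval quality drops. Traditional contrastive learning divides samples into positive and negative classes and ignores the differences among negatives, which makes it hard to learn features at different grain levels. In addition, without an effective unsupervised mechanism for ranking optimization, the model cannot explore better ranking strategies on its own. These problems limit performance in complex multi-label, multi-attribute scenarios and call for a new approach that improves multi-grain understanding.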
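One way to see why a contrastive objective ignores differences among negatives is to look at the standard InfoNCE loss, used here as an assumed stand-in for the contrastive objectives discussed above. The loss depends only on the set of negative similarities, not on which negative received which score, so nothing in it says that a near-miss negative should outrank an unrelated one. The similarity values and `temperature` below are illustrative.

```python
import math
from typing import Sequence


def info_nce(pos_sim: float, neg_sims: Sequence[float], temperature: float = 0.05) -> float:
    """Standard InfoNCE loss for one query: -log softmax of the positive logit."""
    logits = [pos_sim / temperature] + [s / temperature for s in neg_sims]
    # Log-sum-exp with max subtraction for numerical stability;
    # fsum makes the sum independent of the order of the terms.
    m = max(logits)
    log_denominator = m + math.log(math.fsum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]


# Two negatives: one shares some attributes with the query, one is unrelated.
# Case A scores the partial match higher; case B scores the unrelated item higher.
loss_a = info_nce(0.8, [0.6, 0.1])
loss_b = info_nce(0.8, [0.1, 0.6])

# The two losses agree, so the objective cannot prefer ordering A over ordering B.
orderings_indistinguishable = math.isclose(loss_a, loss_b)
```

A ranking-based reward, by contrast, can be made to depend on the order of the negatives, which is the gap that a ranking-driven framework is meant to fill.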

Innovation

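The main innovations of this work are: 1) rule-based verifiable rewards that enable unsupervised, continuous ranking optimization, removing the usual dependence on explicit labels; 2) a Ranking Reward and a Margin Reward that, combined with multiple rollouts, capture multi-grain semantic information effectively; 3) a balanced negative sampling strategy that keeps gradients stable and samples diverse during training. Together these innovations mitigate grain blindness in multimodal retrieval and open a new path for applying reinforcement learning in this area. In particular, the model can explore better ranking strategies on its own without explicit ranking labels, which greatly improves retrieval performance in multi-grain scenarios.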

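Methodology

Pretraining and instruction tuning: the model is first pretrained on NLI (natural language inference) datasets to strengthen semantic understanding, then instruction-tuned to adapt it to multimodal retrieval tasks and improve generalization.
Generative feature extraction: an autoregressive generation mechanism outputs a textual summary of the input, and the special token [RET] serves as an information bottleneck from which the retrieval embedding is extracted.
Multiple rollouts (G rollouts): for each query the model generates G times, producing multiple candidate embedding sets for the subsequent reward computation.
Reward design: a ranking reward (encouraging positives to rank near the top) and a margin reward (enforcing a similarity gap between positives and negatives) are combined under a continuous reward scheme to optimize the ranking policy.
Negative sampling: a balanced strategy combines filtered hard negatives with random negatives to keep training diverse and stable.
Training procedure: on top of the rollouts and rewards, the policy is optimized with the GRPO algorithm, with a KL-divergence term keeping the model stable.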
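The GRPO step in the training procedure scores each of the G rollouts relative to the others generated for the same query. A minimal sketch of that group-relative advantage follows, assuming the usual mean and standard deviation normalization; the use of the population standard deviation and the `eps` value are assumptions, and the clipped policy-ratio objective and KL penalty are omitted.

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize the rewards of one query's G rollouts to zero mean and unit scale."""
    mean = fmean(rewards)
    std = pstdev(rewards)  # population std; some implementations use the sample std
    # eps avoids division by zero when every rollout earns the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


# Rewards for G = 4 rollouts of a single query (illustrative values).
advantages = group_relative_advantages([2.0, 1.0, 0.5, 0.5])
```

Rollouts with above-average reward get a positive advantage and are reinforced; below-average rollouts are discouraged. When all G rollouts receive identical rewards, every advantage is zero and the query contributes no gradient signal.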

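Experiments

Datasets: pretraining and tuning use the NLI and M-BEIR datasets, and evaluation covers multimodal retrieval tasks.
Evaluation metrics: primarily Recall@K (K=5 or 10), covering datasets including FashionIQ, COCO, and WebQA.
Experimental setup: pretraining runs on 8 GPUs with a batch size of 576 and a learning rate of 4×10^-5 for two epochs; instruction tuning uses 16 GPUs, a batch size of 960, and a learning rate of 1×10^-4 for one epoch; the RL stage runs on 8 GPUs with a smaller batch size and a learning rate of 1×10^-6 for one epoch.
Ablation studies: these verify the contributions of the ranking reward and the margin reward, and analyze how the negative sampling strategy affects training stability.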
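Recall@K, the main metric above, can be computed as the share of queries with at least one relevant candidate in the top K. The sketch below assumes that hit-rate definition, which is common on retrieval benchmarks; the source does not state which variant is used, and the dictionary layout and names are hypothetical.

```python
from typing import Dict, List, Set


def recall_at_k(ranked: Dict[str, List[str]], relevant: Dict[str, Set[str]], k: int) -> float:
    """Fraction of queries whose top-k candidates contain at least one relevant item."""
    hits = sum(1 for query, candidates in ranked.items()
               if relevant[query] & set(candidates[:k]))
    return hits / len(ranked)


# Two queries, each with a ranked candidate list and one relevant item.
ranked = {"q1": ["a", "b", "c"], "q2": ["d", "e", "f"]}
relevant = {"q1": {"b"}, "q2": {"f"}}

r_at_1 = recall_at_k(ranked, relevant, 1)  # neither relevant item is ranked first
r_at_2 = recall_at_k(ranked, relevant, 2)  # only q1 has its relevant item in the top 2
r_at_3 = recall_at_k(ranked, relevant, 3)  # both queries are hits
```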

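Results

On the MRBench multi-grain retrieval task, ELVA improves accuracy by 13.1%, clearly ahead of the compared methods, which supports its advantage in complex multi-grain scenarios.
On standard retrieval tasks, ELVA improves Recall@10 by 4.3% on average and outperforms recent models such as LamRA and PUMA, indicating good generalization.
Ablation experiments show that using the ranking reward or the margin reward alone is less effective than combining them, which supports the multi-reward design.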

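Applications

Immediate applications: ELVA can be deployed in content search engines to improve retrieval precision for multimodal content, and is particularly suited to e-commerce, image libraries, and virtual assistants.
Long-term vision: combining multimodal pretraining with reinforcement learning could improve the model's understanding of extremely complex queries and lead to smarter, more efficient content understanding and recommendation.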

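Limitations & Outlook

Training cost is high: multiple rollouts and reward computation in particular demand substantial compute, which affects real-time use.
The current reward design is sensitive to hyperparameters and tuning is complex, which may affect the model's generalization.
On extremely complex or ambiguous queries the model still captures grain-level information inadequately; future work needs richer mechanisms for modeling grain hierarchies.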

Plain Language Accessible to non-experts

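Imagine you are looking for a book in a large library. Every book carries a lot of information, such as its title, author, subject, and year of publication. When you want a particular book, you might narrow things down using different clues, such as the color of the cover, the author's name, or even a keyword from the text.

A traditional search system is like looking only at cover colors and sorting every book into two piles: match and no match. That is simple, but if your clues are complex, say you want a book about a "fire-breathing Pokémon", matching on color alone is not enough. It may miss important details, such as the keyword "fire-breathing".

ELVA is like a smart library assistant. It looks not only at the cover color but also at keywords, authors, and layers of content, and it even tries out orderings of its own to find the book that best fits all your clues. It keeps learning by trial and error until it finds the best match. In effect, it keeps practicing how to find the book you want, and it does especially well when the clues are complex and multi-layered.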

ELI14 Explained like you're 14

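Imagine you are playing a super complicated jigsaw puzzle. The puzzle not only has lots of different colors and shapes, it also hides lots of details: one piece stands for a fire-breathing Pokémon, and another piece stands for its name. You need to fit these pieces together and find the combination that matches best.

Older puzzle helpers only looked at the color of the pieces and grouped similar colors together. That is simple, but when there are many pieces and the details get complicated, they lose their way and put a lot of pieces in the wrong place.

ELVA is like a clever puzzle master. It looks at the details, shape, and color of every piece and tries out orderings of its own to find the best fit. It keeps experimenting and learns which arrangement gets closer to the finished picture. That way, even when the puzzle is very complicated, it can gradually get it right. It is as if it keeps practicing to get smarter, until it can help you complete the best-looking, most complete picture.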

Abstract

Leveraging Multimodal Large Language Models (MLLMs) via contrastive learning has become a mainstream paradigm for improving the performance of Universal Multimodal Retrieval (UMR). However, previous works have ignored the grain blindness when adapting the contrastive paradigm into retrieval tasks. Grain blindness refers to the tendency of the model to overlook grain-level information contained in the query, which is crucial for effectively handling complex queries. This stems from contrastive learning treating samples as a binary classification (positive/negative), while ignoring the different information carried by each negative sample. To address this, we argue that negatives should be treated differently according to their similarity to the positive sample, enabling the model to learn distinct grain information from each negative. In this paper, we introduce a simple but effective framework, called ELVA, a novel rule-based RL framework that mitigates grain blindness through ranking-driven MLLMs. 1) Instead of relying on reward models, we extend Reinforcement Learning with Verifiable Rewards (RLVR) to retrieval tasks, allowing the model to explore new ranking behaviors without explicit ranking labels. 2) By utilizing rule-based rewards, our approach jointly optimizes the ranking of negative samples while enlarging the similarity gap between positive and negative. To more precisely measure grain blindness, we further introduce MRBench, a new benchmark specifically designed for multi-grain query scenarios. ELVA achieves state-of-the-art results across standard retrieval benchmarks, and its notable 13.1% improvement on MRBench further demonstrates its effectiveness in alleviating grain blindness.

cs.IR cs.AI
