miniReranker: Efficient Multimodal Reranking through Visual Cache Reuse and Interaction Sparsity

TL;DR

miniReranker, built on Qwen3-VL, uses visual cache reuse and interaction sparsity to cut reranking runtime to <1% of the dense implementation while preserving >96% of its performance.

cs.IR Advanced 2026-06-09

Yingqi Fan, Xuan Lu, Anhao Zhao, Junlong Tong, Ping Nie, Kai Zou, Yunpu Ma, Wei Zhang, Xiaoyu Shen


multimodal large models, reranking efficiency, visual cache, interaction sparsity

Key Findings

Methodology

This work introduces miniReranker, a framework that integrates vision-first prompting, early exit, interaction band restriction, and embedder-guided token pruning. The approach begins by reformulating input prompts to prioritize visual inputs, aligning with pretraining formats and enabling visual representation caching. Layer-wise analysis reveals redundancy in deep layers, prompting early exit strategies to truncate unnecessary computation. Cross-segment attention analysis shows that effective query-document interactions are concentrated in intermediate layers, leading to the design of interaction bands that limit attention scope. Additionally, visual token importance is assessed via attention weights from the pretrained embedder, guiding token pruning to reduce input sequence length. These combined strategies drastically cut down computational load while maintaining high relevance accuracy, achieving runtime reductions of over 99% in high-reuse scenarios with performance above 96%. The framework is instantiated on Qwen3-VL, fine-tuned on a new multimodal reranking dataset, and evaluated across 78 tasks on MMEB-v2, demonstrating its broad applicability and efficiency gains.
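The caching benefit of placing the visual segment first can be illustrated with a toy token-accounting sketch. This is not the authors' implementation: `encode_prefix`, `PrefixCache`, `rerank_cost`, and all token counts are hypothetical, and token counts are only a rough proxy for runtime. The point it shows is that in a causal transformer only a prefix can be computed independently of what follows, so a document image placed first can be encoded once and reused across queries.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for running the model over a visual prefix and
# returning its cached states (here just a tuple tagged with the doc id).
def encode_prefix(doc_id: str, n_visual_tokens: int) -> Tuple[str, int]:
    return (doc_id, n_visual_tokens)

class PrefixCache:
    """Caches per-document prefix states so each document is encoded once."""

    def __init__(self) -> None:
        self._store: Dict[str, Tuple[str, int]] = {}
        self.tokens_encoded = 0  # tokens actually processed to fill the cache

    def get(self, doc_id: str, n_visual_tokens: int) -> Tuple[str, int]:
        if doc_id not in self._store:
            self._store[doc_id] = encode_prefix(doc_id, n_visual_tokens)
            self.tokens_encoded += n_visual_tokens
        return self._store[doc_id]

def rerank_cost(queries: List[int], docs: Dict[str, int], use_cache: bool) -> int:
    """Total tokens processed when every query is paired with every document.

    queries: query lengths in tokens; docs: doc id -> visual token count.
    """
    cache = PrefixCache()
    total = 0
    for q_len in queries:
        for doc_id, n_vis in docs.items():
            if use_cache:
                cache.get(doc_id, n_vis)  # visual prefix encoded at most once
                total += q_len            # only the query suffix is new work
            else:
                total += n_vis + q_len    # whole sequence re-encoded per pair
    return total + cache.tokens_encoded

# Illustrative numbers only: 2 documents of 1000 visual tokens, 3 short queries.
docs = {"d1": 1000, "d2": 1000}
queries = [20, 20, 20]
dense = rerank_cost(queries, docs, use_cache=False)
cached = rerank_cost(queries, docs, use_cache=True)
```

The gap between `dense` and `cached` grows with the number of queries that share the same candidate documents, which is the high-reuse regime the summary refers to.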

Key Results

miniReranker, built on Qwen3-VL, achieves over 96% of the dense model’s reranking performance while reducing active parameters to 58% and accelerating training nearly threefold. In reranking Top-100 candidates, it reduces runtime by more than 99%, enabling real-time large-scale applications.
Across 78 tasks covering image, video, and visual document retrieval, miniReranker maintains high accuracy, with performance metrics such as Hit@1 and NDCG@5 closely matching dense models. Ablation studies confirm that early exit, interaction band, and token pruning contribute significantly to efficiency without sacrificing effectiveness.
Layer-wise probing indicates that relevance signals emerge early in the network, validating early-exit strategies. The interaction band analysis shows that cross-modal information exchange is localized, and visual token pruning based on attention weights effectively reduces input size while preserving semantic fidelity.
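For readers unfamiliar with the two metrics named above, the following is a minimal sketch of Hit@1 and NDCG@5 for a single ranked list. It assumes binary relevance labels and linear gain; the benchmark's exact definitions may differ, and the function names are illustrative.

```python
import math
from typing import List

def hit_at_k(ranked_relevance: List[int], k: int = 1) -> float:
    """1.0 if any of the top-k ranked items is relevant, else 0.0."""
    return 1.0 if any(r > 0 for r in ranked_relevance[:k]) else 0.0

def dcg_at_k(ranked_relevance: List[int], k: int) -> float:
    """Discounted cumulative gain with linear gain and log2 position discount."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(ranked_relevance[:k]))

def ndcg_at_k(ranked_relevance: List[int], k: int = 5) -> float:
    """DCG normalized by the DCG of the ideal (sorted) ordering."""
    ideal = dcg_at_k(sorted(ranked_relevance, reverse=True), k)
    return dcg_at_k(ranked_relevance, k) / ideal if ideal > 0 else 0.0

# Relevance labels in the order the reranker returned the candidates:
# the single relevant document was placed second.
ranking = [0, 1, 0, 0, 0]
hit1 = hit_at_k(ranking, k=1)
ndcg5 = ndcg_at_k(ranking, k=5)
```

Here `hit1` is zero because the top result is not relevant, while `ndcg5` still gives partial credit for placing the relevant item at rank two.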

Significance

This research addresses the critical challenge of computational efficiency in multimodal large language models for retrieval tasks. By systematically reducing redundant computation at multiple levels—input formatting, model depth, attention scope, and input size—it enables deployment of high-performance models in real-time systems. The approach not only accelerates inference but also reduces energy consumption, making multimodal retrieval more scalable and accessible. The innovations in prompt design and model compression set a new standard for efficient multimodal reasoning, with broad implications for content retrieval, question answering, and multimedia understanding. It paves the way for large-scale, low-latency multimodal systems capable of handling diverse data types in practical applications.

Technical Contribution

The paper's key technical contributions include: 1) the vision-first prompt reformulation that aligns input with pretraining formats and maximizes visual representation reuse; 2) layer-wise analysis leading to early exit strategies that truncate unnecessary deep layers; 3) the identification of a narrow interaction band for cross-modal attention, with a novel attention masking mechanism to limit scope; 4) the use of pretrained embedder attention weights to guide visual token pruning, significantly reducing input size. These innovations collectively enable a highly efficient reranking framework that preserves most of the dense model's accuracy while drastically reducing computational costs. The integration of these techniques demonstrates a comprehensive approach to model compression and inference optimization in multimodal large language models.
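One way to picture the interaction-band masking in contribution 3 is as a per-layer attention mask. The sketch below is an assumption-laden illustration, not the paper's mechanism: it assumes a document segment followed by a query segment, causal attention everywhere, and query-to-document attention allowed only for layers inside the band. The function name and the band `(8, 16)` are illustrative.

```python
from typing import List, Tuple

def build_attention_mask(
    n_doc: int, n_query: int, layer: int, band: Tuple[int, int]
) -> List[List[bool]]:
    """Return mask[i][j] == True when token i may attend to token j.

    Tokens 0..n_doc-1 are the (cached) document segment, the rest are the query.
    Outside the band, query tokens cannot see document tokens.
    """
    lo, hi = band
    in_band = lo <= layer <= hi
    n = n_doc + n_query
    mask = []
    for i in range(n):
        row = []
        for j in range(n):
            causal = j <= i                      # standard causal constraint
            cross = i >= n_doc and j < n_doc     # query token looking at a doc token
            row.append(causal and (in_band or not cross))
        mask.append(row)
    return mask

# Two document tokens followed by two query tokens.
outside = build_attention_mask(n_doc=2, n_query=2, layer=2, band=(8, 16))
inside = build_attention_mask(n_doc=2, n_query=2, layer=10, band=(8, 16))
```

Under this reading, layers outside the band process the two segments independently, so cross-segment attention cost is paid only for the few layers inside it.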

Novelty

This work is the first to systematically combine vision-first prompt reformulation with multi-layer compression strategies—early exit, interaction band restriction, and visual token pruning—in the context of multimodal reranking. Unlike prior methods that primarily focus on model scaling or sparse attention, this approach leverages detailed layer-wise analysis and pretrained attention cues to optimize both input and model structure. The vision-first input format aligns with pretraining paradigms, enabling more natural multimodal processing and cache reuse. The localized interaction design reduces quadratic attention complexity, while token pruning based on attention weights offers a principled method for input size reduction. These innovations collectively set a new benchmark for efficiency in multimodal retrieval systems.

Limitations

Despite significant efficiency improvements, the framework may still face performance degradation in scenarios with highly complex or noisy visual inputs, where pruning or early exit could discard relevant information. Additionally, reliance on pretrained attention weights assumes high-quality visual representations, which may not hold in domain-specific or low-resource settings.
The approach requires pre-caching visual representations and attention scores, which could pose storage challenges at scale, especially with dynamic or frequently updated corpora. Moreover, the method's effectiveness depends on the quality of the initial retriever and the pretrained embedder, limiting its applicability in less mature systems.
Future work should explore adaptive pruning and dynamic interaction strategies, as well as extending the framework to handle more diverse multimodal data types, including audio and sensor data. Addressing these limitations will be crucial for deploying miniReranker in real-world, large-scale applications.

Future Work

Future research directions include developing adaptive, data-driven pruning mechanisms that dynamically adjust based on input complexity, as well as integrating multi-task pretraining to enhance generalization across diverse multimodal tasks. Exploring real-time, on-device inference with optimized storage and computation strategies will be vital for edge deployment. Additionally, extending the framework to incorporate other modalities like audio or sensor data could broaden its applicability. Investigating robustness in noisy or adversarial scenarios and further automating the design of interaction masks and pruning thresholds are also promising avenues. Ultimately, these efforts aim to realize highly efficient, scalable multimodal reasoning systems capable of supporting next-generation AI applications.

AI Executive Summary

Multimodal large language models (MLLMs) have revolutionized content understanding and retrieval by enabling fine-grained cross-modal reasoning. However, their deployment in real-time systems faces significant challenges due to the high computational cost associated with token-level interactions across query-document pairs. Traditional point-wise reranking approaches, inherited from text retrieval paradigms, often process each query-document pair independently, leading to redundant computation especially when many candidate documents share overlapping visual and textual features.

Recognizing these limitations, Yingqi Fan and colleagues introduced miniReranker, a novel framework designed to dramatically improve the efficiency of multimodal reranking without sacrificing accuracy. The key innovation is the adoption of a vision-first prompt reformulation, which aligns input sequences with the pretraining formats of models like Qwen3-VL. This approach ensures that visual representations can be cached and reused across multiple candidate pairs, significantly reducing repeated visual encoding. Layer-wise analysis revealed that deep transformer layers contain substantial redundancy, prompting the implementation of early exit strategies that truncate computation once relevant signals are detected.

Further, the study uncovered that effective query-document interactions are concentrated within a narrow set of intermediate layers. By designing an interaction band that restricts cross-segment attention to these layers, the framework minimizes unnecessary attention computations. Complementing this, the authors leverage attention weights from the pretrained embedder to identify and prune redundant visual tokens, decreasing input sequence length and further accelerating inference.
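The pruning step described above can be sketched as a top-k selection over per-token importance scores. This is a minimal illustration under assumptions: the scores stand in for attention weights obtained from the pretrained embedder, and `prune_visual_tokens` and `keep_ratio` are hypothetical names, not the paper's API.

```python
import math
from typing import List, Sequence, TypeVar

T = TypeVar("T")

def prune_visual_tokens(
    tokens: Sequence[T], importance: Sequence[float], keep_ratio: float
) -> List[T]:
    """Keep the highest-scoring fraction of tokens, preserving original order."""
    if len(tokens) != len(importance):
        raise ValueError("tokens and importance must have the same length")
    k = max(1, math.ceil(len(tokens) * keep_ratio))
    # Indices of the k most important tokens.
    top = sorted(range(len(tokens)), key=lambda i: importance[i], reverse=True)[:k]
    # Restore positional order so the remaining sequence stays coherent.
    return [tokens[i] for i in sorted(top)]

tokens = ["patch0", "patch1", "patch2", "patch3"]
scores = [0.10, 0.50, 0.05, 0.35]  # assumed embedder attention weights
kept = prune_visual_tokens(tokens, scores, keep_ratio=0.5)
```

Because the scores come from a model that has already encoded the document, they can be computed once offline and stored alongside the cached visual representation.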

Experimental results on the MMEB-v2 benchmark, which covers a diverse set of 78 multimodal tasks including images, videos, and visual documents, demonstrate the effectiveness of miniReranker. The system maintains over 96% of the dense reranker’s performance while reducing active parameters to 58% and achieving nearly threefold training acceleration. In large-scale reranking scenarios, such as processing the top 100 candidates, the runtime is cut by over 99%, enabling real-time deployment in practical systems.

This work addresses a critical bottleneck in multimodal retrieval—computational redundancy—by integrating input reformulation, model depth reduction, attention scope limitation, and input pruning into a cohesive framework. Its implications extend beyond retrieval, offering a blueprint for efficient multimodal reasoning in various AI applications. Future directions include adaptive pruning, multi-modal extension, and deployment on edge devices, promising a new era of scalable, high-performance multimodal AI systems.

Deep Analysis

Background

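With the rise of multimodal large language models (MLLMs) in content understanding, question answering, and retrieval, researchers have kept looking for ways to improve both efficiency and effectiveness. Early work such as CLIP and ALIGN used contrastive learning to obtain global cross-modal representations, but remained limited in fine-grained cross-modal reasoning. More recently, Transformer-based multimodal models such as Florence, Gato, and Qwen3-VL have combined the visual and language modalities and markedly improved multimodal understanding. During pretraining these models adopt multimodal alignment strategies that emphasize a vision-first input format, aiming for more natural modality fusion. As models scale up, however, computational cost rises sharply. This is especially true in point-wise reranking, where the model must perform a full interaction computation for every query-candidate pair, so repeated computation and energy consumption become increasingly serious. Traditional dual-encoder retrieval is computationally efficient but limited in fine-grained interaction, and it struggles to meet high-precision requirements. As the application scenarios of MLLMs keep expanding, achieving efficient inference without giving up performance has become a central research problem.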

Core Problem

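The main bottleneck for MLLMs in retrieval is computational redundancy, which shows up in three ways. First, the deep layers of the model carry a large amount of redundant information, so unnecessary deep computation adds latency and energy cost. Second, cross-segment attention performs global interaction in every layer, which incurs quadratic complexity and makes real-time response over large candidate sets difficult. Third, the large number of visual tokens inflates sequence length and increases the input-processing burden. In high-reuse scenarios in particular, repeatedly encoding the same visual content is extremely costly and severely limits practical deployment of multimodal systems. Resolving these bottlenecks requires not only optimizing model structure but also rethinking the input format and inference strategy, so that multimodal retrieval can be both efficient and accurate.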

Innovation

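The core innovations of this work are: 1) a vision-first input reformulation that places visual information first, keeping the input consistent with the pretraining format and maximizing caching and reuse of visual representations; 2) a layer-wise analysis showing substantial redundancy in deep layers, which motivates an early-exit mechanism that stops inference once the relevance signal has concentrated, reducing deep computation; 3) identification of where cross-segment attention concentrates, which motivates an interaction band that narrows attention to intermediate layers and lowers interaction complexity; 4) use of attention weights from the pretrained embedder to guide visual token pruning, effectively shortening the input sequence. Combined, these techniques greatly reduce computational cost while maintaining a high level of reranking performance, removing a key efficiency bottleneck for multimodal models.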

Methodology

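• Input reformulation: adopt a vision-first prompt format that places the visual input before the text, so that visual content can be pre-cached and reused.
• Layer-wise analysis: probe the reranking signal at different depths; intermediate layers already yield a signal close to final performance.
• Early exit: set a threshold during inference and stop once the intermediate-layer signal meets the preset criterion, cutting deep-layer computation.
• Interaction band restriction: analyze where cross-segment attention is active and confine it to a specific range of intermediate layers (e.g., layers 8-16), avoiding ineffective global interaction.
• Visual token pruning: use the pretrained embedder's attention weights to score visual tokens and prune selectively, shortening the input sequence.
• Experimental validation: build a multimodal reranking dataset, then fine-tune and evaluate on Qwen3-VL to verify each technique.
• Performance evaluation: across 78 multimodal tasks, compare the original and optimized models on accuracy, inference time, and parameter count, confirming large efficiency gains while preserving performance.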
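The early-exit step in the list above can be sketched as a forward pass truncated at a chosen depth. This is a toy illustration under assumptions: layers are plain Python callables, `forward_with_early_exit` and `score_head` are hypothetical names, and a fixed exit depth is used for simplicity (a threshold-based variant would instead check a confidence score after each layer).

```python
from typing import Callable, List, Sequence

Vector = List[float]

def forward_with_early_exit(
    hidden: Vector,
    layers: Sequence[Callable[[Vector], Vector]],
    exit_layer: int,
    score_head: Callable[[Vector], float],
) -> float:
    """Run only the first `exit_layer` layers, then score from that hidden state."""
    if not 1 <= exit_layer <= len(layers):
        raise ValueError("exit_layer must be between 1 and len(layers)")
    for layer in layers[:exit_layer]:
        hidden = layer(hidden)
    # Layers after exit_layer are never executed, so their parameters are inactive.
    return score_head(hidden)

def add_one(h: Vector) -> Vector:
    """Toy layer: adds 1.0 to every element."""
    return [x + 1.0 for x in h]

layers = [add_one] * 10
score = forward_with_early_exit([0.0, 0.0], layers, exit_layer=6, score_head=sum)
active_fraction = 6 / len(layers)
```

With these toy numbers six of ten layers run, so `active_fraction` is 0.6; the real exit depth and the resulting share of active parameters are determined by the paper's layer-wise analysis, not by this sketch.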

Experiments

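Experiments use the MMEB-v2 benchmark, which covers image, video, and visual document tasks. Models at several scales (2B, 4B, and 8B parameters) are fine-tuned with a point-wise reranking binary classification objective, and the reported metrics include Hit@1 and NDCG@5. Baselines include the original dense model, different prompt formats (query-first, document-first, vision-first), and each compression strategy applied alone and in combination. Ablation studies verify the contributions of early exit, interaction band restriction, and visual token pruning, and analyze the effect of layer-level hyperparameters. Training uses LoRA fine-tuning with a learning rate of 1×10^-4 for one epoch. At evaluation time the reranking candidate set is the Top-100, retrieved with Qwen3-VL-Embedding-2B, which keeps the test setting realistic.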
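A point-wise reranker trained with a binary objective is commonly scored by comparing the logits of two answer tokens at the next-token position. The sketch below assumes those tokens are "yes" and "no", which the summary does not state; `relevance_score`, `rerank`, and the logit values are illustrative.

```python
import math
from typing import List, Tuple

def relevance_score(logit_yes: float, logit_no: float) -> float:
    """Softmax probability of the assumed 'yes' token over the {'yes', 'no'} pair."""
    m = max(logit_yes, logit_no)  # subtract the max for numerical stability
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)

def rerank(candidates: List[Tuple[str, float, float]]) -> List[str]:
    """Sort (doc_id, logit_yes, logit_no) triples by descending relevance."""
    scored = [(relevance_score(y, n), doc_id) for doc_id, y, n in candidates]
    return [doc_id for _, doc_id in sorted(scored, key=lambda s: s[0], reverse=True)]

# Made-up logits for three candidates of one query.
candidates = [("a", 0.0, 1.0), ("b", 2.0, 0.0), ("c", 1.0, 1.0)]
order = rerank(candidates)
```

Each candidate is scored independently of the others, which is why point-wise reranking repeats so much computation across pairs and why prefix caching helps.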

Results

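On the multimodal tasks, miniReranker retains more than 96% of the dense model's performance while reducing the parameter count to 58% of the original and speeding up training nearly threefold. On Top-100 candidate sets, reranking time drops by more than 99%, greatly improving real-time responsiveness. Multi-task evaluation shows that accuracy on image, video, and visual document tasks is nearly on par with the dense model, validating the compression strategies. Ablations indicate that early exit removes about 40% of the computation in deep models, the interaction band lowers interaction complexity by about 50%, and visual token pruning removes 50% of input tokens while preserving performance. These results demonstrate that the multi-level compression strategies work well together.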

Applications

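The technique suits large-scale multimodal content retrieval, intelligent question answering, content filtering, and similar scenarios, and is especially advantageous where real-time response over large candidate sets is required. Enterprises can integrate miniReranker into search engines and content recommendation systems for efficient multimodal filtering and ranking. Combined with edge computing and hardware acceleration, the model could later be deployed on mobile or edge devices, broadening access to multimodal AI. In the longer term, the framework may influence the structural design of multimodal models and support wider use of cross-modal understanding and reasoning in areas such as intelligent content management and virtual assistants.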

Limitations & Outlook

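Despite the significant efficiency gains, pruning and early exit may lose information and hurt performance in scenarios with extremely rich visual content or complex relationships between modalities. In addition, because the method relies on attention information from a pretrained model, biased pretraining data or an undertrained model may degrade pruning quality. Storing pre-cached visual representations is also challenging in large-scale, dynamic data environments. Future work should develop smarter dynamic pruning and interaction strategies that adapt to changing application needs, and strengthen robustness across diverse scenarios.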

Plain Language Accessible to non-experts

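Imagine you are preparing a complex dish in a large kitchen. Every time you cook, you need a variety of ingredients, seasonings, and tools. The traditional approach is to prepare every ingredient and seasoning from scratch each time, regardless of what you have used before. That is like a model recomputing all the visual and textual information every time, which wastes a lot of time and effort.

Now the chef realizes that commonly used seasonings and ingredients can be prepared in advance and stored in a set place in the kitchen. When cooking, the chef only has to take out what is needed, which saves a great deal of preparation time. This is like miniReranker's vision-first input format, which caches visual information ahead of time and avoids repeated computation.

The chef also notices that many seasonings only matter in the middle stages of cooking, and that a small amount at the end is enough to make the dish taste good. So the chef uses those seasonings only at specific steps and not at other times. This is similar to the model restricting interaction to intermediate layers, performing cross-modal interaction only where it is needed and cutting unnecessary computation.

Finally, the chef uses experience to judge which ingredients can be left out or reduced, for example getting the same effect from a smaller amount of spice. This is like the model pruning visual tokens based on attention weights: redundant information is removed, the flavor of the dish stays the same, and preparation time goes down.

With these tricks, the chef can keep the dish tasty while working far more efficiently and saving time and resources. miniReranker does the same: through caching in advance, restricting interaction, and pruning, it greatly improves the speed and efficiency of multimodal retrieval, making AI systems faster and smarter.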

ELI14 Explained like you're 14

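Imagine you are looking for a book in your school library. In the past, every time you wanted a book you had to go through every shelf from the beginning, which took a long time. Now the library has a smart system: it organizes the frequently used shelves in advance and stores a catalog of the books. Each time you come in, you only have to say what kind of book you want and it quickly points you to the right shelf. This is like miniReranker, which caches visual information ahead of time instead of recomputing it on every request.

The system has also found that many books are only needed in particular areas or at particular times, so it concentrates its search there and does not waste time elsewhere. This is like the model performing cross-modal interaction only in a few intermediate layers rather than doing complex computation in every layer.

Finally, the system can judge how important each book's content is and automatically trim the less important shelves, keeping only the most essential parts. As a result, searching becomes much faster, and you can find the book you want almost without waiting.

So this smart library system makes finding books fast and easy. miniReranker works the same way: by caching in advance, limiting the scope of interaction, and pruning redundant information, it makes multimodal retrieval faster and more efficient and helps us find what we want sooner.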

Abstract

Multimodal large language models (MLLMs) have recently shown strong potential as point-wise rerankers by directly modeling query--document relevance through next-token prediction. However, point-wise reranking suffers from substantial repeated computation across query--document pairs, while the causal structure of transformers allows only prefix segments to be reused via pre-caching. To address the misalignment of existing query-first and document-first formats with both VQA-style prompting and computation-aware reuse, we propose a \textit{vision-first} formulation that improves both cache reuse efficiency and reranking performance. However, the remaining cost is still considerable and stems from three main sources: (1) \textit{model depth}, for which we reduce active parameters via early exit; (2) \textit{cross-segment attention}, which we restrict to a narrow interaction band across a few layers; and (3) \textit{visual tokens}, where we reduce the number of tokens via embedder-guided pruning. Together, these designs form miniReranker, which reduces reranking runtime to <1% of the dense implementation under high-reuse settings for a single query, while preserving >96% of the dense model performance.


