Aligning Dense Retrievers with LLM Utility via Distillation
UAE improves Recall@1 by 30.59% on the QASPER benchmark.
Key Findings
Methodology
The paper introduces Utility-Aligned Embeddings (UAE), a framework that distills generative utility directly into the bi-encoder's embedding space, avoiding expensive test-time LLM inference. UAE formulates retrieval as a distribution matching problem, training the bi-encoder to mimic a utility distribution derived from perplexity reduction via a Utility-Modulated InfoNCE objective, thereby injecting graded utility signals directly into the embedding space.
Key Results
- On the QASPER benchmark, UAE improves retrieval Recall@1 by 30.59%, MAP by 30.16%, and Token F1 by 17.3% over the strong semantic baseline BGE-Base.
- UAE is over 180x faster than efficient LLM re-ranking methods while preserving competitive performance, demonstrating that aligning retrieval with generative utility yields reliable contexts at scale.
- In experiments, UAE achieves a Recall@1 of 54.90 on the NewsQA dataset, surpassing the computationally expensive RankGPT (49.68), indicating that embeddings aligned with generative utility can achieve reranker-level precision in a single retrieval step.
Significance
The UAE framework significantly reduces semantic distractors and improves generation quality while operating 180x faster than LLM-based re-ranking methods. By maintaining standard ANN compatibility and serving as a high-quality foundation for multi-stage pipelines, UAE provides a practical and scalable solution for utility-driven RAG systems. This research has significant implications for academia and industry as it addresses the long-standing gap between semantic similarity and generative utility.
Technical Contribution
UAE achieves efficient dense retrieval by distilling generative utility directly into the bi-encoder's embedding space, avoiding expensive test-time LLM inference. This method formulates retrieval as a distribution matching problem, using a Utility-Modulated InfoNCE objective to train the bi-encoder to mimic a utility distribution derived from perplexity reduction. This technical contribution provides new theoretical guarantees and engineering possibilities, enabling efficient and high-performance retrieval on large-scale datasets.
Novelty
UAE is the first method to distill generative utility directly into the bi-encoder embedding space via distribution matching, offering a fundamental innovation compared to existing methods that rely on semantic similarity or expensive LLM re-ranking. By avoiding test-time LLM inference, UAE significantly enhances retrieval efficiency and achieves high performance on large-scale datasets.
Limitations
- UAE may struggle with very long contexts as they can introduce more semantic distractors, affecting generation quality.
- The performance of UAE may be limited by the quality of the pre-trained generative model it relies on.
- In certain domains or tasks, UAE may require additional fine-tuning to achieve optimal performance.
Future Work
Future research directions include exploring UAE's application across different domains and tasks, further improving its performance with long contexts, and investigating ways to enhance UAE's generative quality without compromising efficiency. Additionally, exploring how to integrate other advanced retrieval techniques to further enhance UAE's performance and applicability is a promising direction.
AI Executive Summary
Dense vector retrieval is the practical backbone of Retrieval-Augmented Generation (RAG) systems, but similarity search can suffer from precision limitations. Conversely, utility-based approaches leveraging LLM re-ranking often achieve superior performance but are computationally prohibitive and prone to noise inherent in perplexity estimation. We propose Utility-Aligned Embeddings (UAE), a framework designed to merge these advantages into a practical, high-performance retrieval method. We formulate retrieval as a distribution matching problem, training a bi-encoder to imitate a utility distribution derived from perplexity reduction using a Utility-Modulated InfoNCE objective. This approach injects graded utility signals directly into the embedding space without requiring test-time LLM inference.
On the QASPER benchmark, UAE improves retrieval Recall@1 by 30.59%, MAP by 30.16%, and Token F1 by 17.3% over the strong semantic baseline BGE-Base. UAE is over 180x faster than efficient LLM re-ranking methods while preserving competitive performance, demonstrating that aligning retrieval with generative utility yields reliable contexts at scale.
UAE significantly reduces semantic distractors and improves generation quality while operating 180x faster than LLM-based re-ranking methods. By maintaining standard ANN compatibility and serving as a high-quality foundation for multi-stage pipelines, UAE provides a practical and scalable solution for utility-driven RAG systems. This research has significant implications for academia and industry as it addresses the long-standing gap between semantic similarity and generative utility.
UAE achieves efficient dense retrieval by distilling generative utility directly into the bi-encoder's embedding space, avoiding expensive test-time LLM inference. This method formulates retrieval as a distribution matching problem, using a Utility-Modulated InfoNCE objective to train the bi-encoder to mimic a utility distribution derived from perplexity reduction. This technical contribution provides new theoretical guarantees and engineering possibilities, enabling efficient and high-performance retrieval on large-scale datasets.
Future research directions include exploring UAE's application across different domains and tasks, further improving its performance with long contexts, and investigating ways to enhance UAE's generative quality without compromising efficiency. Additionally, exploring how to integrate other advanced retrieval techniques to further enhance UAE's performance and applicability is a promising direction.
Deep Analysis
Background
In the field of information retrieval, dense vector retrieval has become the cornerstone of Retrieval-Augmented Generation (RAG) systems. These systems map queries and candidates into a shared representation space, leveraging efficient Approximate Nearest Neighbor (ANN) search to handle large-scale datasets with minimal latency. However, as technology advances, this paradigm is increasingly criticized for its reliance on semantic similarity as a proxy for generative utility. Research shows that passages with high semantic similarity (topical overlap) often fail to provide answer-critical information and can even introduce semantic distractors that mislead the generator, especially in long-context settings where incorrect but similar passages increase decoding uncertainty. To bridge this gap, current state-of-the-art approaches shift towards utility-based retrieval, where relevance is defined by how effectively a document helps a Large Language Model (LLM) produce a correct response. In practice, this is often measured via perplexity reduction: a document is considered useful if its presence as context makes the ground-truth answer more predictable.
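The perplexity-reduction notion of utility described above can be sketched numerically. This is an illustrative toy, not the paper's implementation: it assumes access to the generator's per-token log-probabilities of the ground-truth answer, and the log-prob values below are invented.

```python
import math

def sequence_nll(token_logprobs):
    """Average negative log-likelihood of the answer tokens (lower = more predictable)."""
    return -sum(token_logprobs) / len(token_logprobs)

def utility(answer_logprobs_with_doc, answer_logprobs_without_doc):
    """Perplexity-reduction utility: how much a document lowers the generator's
    perplexity on the ground-truth answer. Positive values mean the document
    makes the answer more predictable."""
    ppl_with = math.exp(sequence_nll(answer_logprobs_with_doc))
    ppl_without = math.exp(sequence_nll(answer_logprobs_without_doc))
    return ppl_without - ppl_with

# Toy log-probs for a 3-token answer, with and without a context document.
with_doc = [-0.2, -0.1, -0.3]      # answer is easy to predict given the doc
without_doc = [-1.5, -2.0, -1.0]   # much harder without it
print(utility(with_doc, without_doc))  # positive, so the document is useful
```

As the paper notes, raw scores of this kind are noisy and token-sensitive, which is why UAE distills them into a parameterized reward model rather than using them directly as training targets.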
Core Problem
Despite the conceptual soundness of utility-based approaches, they face significant practical challenges. Relying on LLMs for query generation or post-hoc re-ranking is computationally prohibitive for large-scale deployment. Furthermore, utility signals derived from perplexity are notoriously noisy and stochastic, sensitive to token-level variations and decoding dynamics that make them difficult to use as stable training targets. This necessitates complex, multi-stage architectures that improve performance at the cost of extreme inference latency and high computational overhead. Therefore, the challenge lies in enhancing retrieval's generative utility without increasing computational burden.
Innovation
The core innovation of the UAE framework lies in distilling generative utility directly into the bi-encoder's embedding space, thus avoiding expensive test-time LLM inference. Specifically:
- UAE formulates retrieval as a distribution matching problem, using a Utility-Modulated InfoNCE objective to train the bi-encoder to mimic a utility distribution derived from perplexity reduction. This approach injects graded utility signals directly into the embedding space without requiring test-time LLM inference.
- UAE stabilizes noisy utility signals by distilling them into a parameterized reward model during training, aligning the dense retriever with this model through supervised distribution matching. This method maintains standard ANN compatibility, providing a practical and scalable solution for utility-driven RAG systems.
- UAE significantly improves retrieval performance on the QASPER benchmark while being over 180x faster than efficient LLM re-ranking methods, demonstrating that aligning retrieval with generative utility yields reliable contexts at scale.
Methodology
The implementation of UAE involves several key steps:
- Parameterized Utility Approximation: Estimate the utility of context documents via perplexity and distill them into a parameterized reward model to stabilize noisy utility signals.
- Reward-Guided Embeddings Training: Use the reward model as an offline teacher to distill utility preferences into a dense bi-encoder, preserving ANN search efficiency.
- Distribution Matching Objective: Define a teacher distribution and a student distribution, optimizing the retriever by minimizing the KL divergence between the two distributions, reshaping the embedding space to reflect the generator's preferences.
- Utility-Aware Hard Negative Mining: Adopt the Noise Contrastive Estimation (NCE) paradigm to approximate the global distribution using a combination of gold contexts and informative negative samples, ensuring the retriever focuses on resolving semantic distractors.
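The distribution matching step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the utility scores, similarity values, and temperature are invented, and a real system would compute this loss over training batches with learned encoders.

```python
import math

def softmax(scores, temperature=1.0):
    """Turn raw scores into a probability distribution."""
    exps = [math.exp(s / temperature) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def kl_divergence(p, q):
    """KL(p || q): how far the student distribution q is from the teacher p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Teacher: reward-model utility scores for a gold passage plus hard negatives
# (values are illustrative).
teacher_utilities = [5.8, 2.1, 1.9, 0.4]
# Student: query-passage similarities from the bi-encoder for the same candidates.
student_sims = [0.62, 0.55, 0.54, 0.30]

p = softmax(teacher_utilities)               # target utility distribution
q = softmax(student_sims, temperature=0.05)  # student distribution (sharpened)

loss = kl_divergence(p, q)  # minimized during training to align the embedding space
```

Minimizing this KL term over gold contexts and mined hard negatives is what reshapes the embedding space so that nearest-neighbor rank tracks generative utility rather than topical overlap alone.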
Experiments
The experimental design includes evaluation on two distinct RAG benchmarks: QASPER (long-doc scientific QA) and NewsQA (short-doc news extraction). We employ a hard-negative setting where a candidate pool of 50 passages per query is constructed via dense retrieval (BGE-Base) and reward model utility. This setting rigorously tests the model's ability to prioritize true generative utility over high-similarity non-answers. The generation protocol uses Llama-3-8B-Instruct as the fixed generator with greedy decoding (temperature=0) for reproducibility. Dataset-specific system prompts align the generator's output with the ground-truth format: extractive phrases for NewsQA and evidence-based summaries for QASPER. Performance is quantified using Token F1 and ROUGE-L to assess informational accuracy and structural fluency.
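Token F1, one of the metrics mentioned above, can be computed as a bag-of-tokens overlap between the generated answer and the reference. A minimal sketch, using only lowercasing and whitespace tokenization (standard QA evaluation scripts typically also strip punctuation and articles):

```python
from collections import Counter

def token_f1(prediction, reference):
    """Token-level F1 between a generated answer and the ground truth."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)  # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("the model uses attention", "the model uses self attention"))  # 8/9 ~ 0.889
```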
Results
On the QASPER benchmark, UAE improves retrieval Recall@1 by 30.59%, MAP by 30.16%, and Token F1 by 17.3% over the strong semantic baseline BGE-Base. UAE is over 180x faster than efficient LLM re-ranking methods while preserving competitive performance, demonstrating that aligning retrieval with generative utility yields reliable contexts at scale. In experiments, UAE achieves a Recall@1 of 54.90 on the NewsQA dataset, surpassing the computationally expensive RankGPT (49.68), indicating that embeddings aligned with generative utility can achieve reranker-level precision in a single retrieval step. This advantage extends to ExpUtil@1 (average utility of the top-1 context). On NewsQA, UAE (5.818) surpasses both BGE-Base (4.738) and even the computation-heavy RankGPT (4.968), confirming that UAE prioritizes contexts maximally conducive to generation rather than mere semantic relevance.
Applications
Direct application scenarios for UAE include:
- In large-scale information retrieval systems, UAE can serve as an efficient first-stage retriever, providing high-quality candidate contexts for subsequent processing.
- In real-time applications requiring quick responses, UAE's low latency makes it ideal, especially in scenarios where user experience is critical.
- In complex tasks requiring long-context processing, UAE can improve generation quality by reducing semantic distractors, applicable in scientific literature analysis and long-document question answering.
Limitations & Outlook
UAE may struggle with very long contexts as they can introduce more semantic distractors, affecting generation quality. Additionally, the performance of UAE may be limited by the quality of the pre-trained generative model it relies on. In certain domains or tasks, UAE may require additional fine-tuning to achieve optimal performance. Future research directions include exploring UAE's application across different domains and tasks, further improving its performance with long contexts, and investigating ways to enhance UAE's generative quality without compromising efficiency. Additionally, exploring how to integrate other advanced retrieval techniques to further enhance UAE's performance and applicability is a promising direction.
Plain Language (accessible to non-experts)
Imagine you're in a library searching for books. Traditional methods involve finding books by title or author, similar to retrieving information based on semantic similarity. You might find a book with a similar title to what you're looking for, but its content isn't what you need. UAE acts like a smart assistant in the library, not only looking at the book titles but also quickly skimming through the content to determine if the book is truly useful to you. This way, even if the title doesn't exactly match, it can find the book that best suits your needs. UAE uses a technique called utility alignment to ensure that the books it recommends are not only relevant but also genuinely helpful in solving your problem. This process is like the assistant quickly analyzing the value of each book behind the scenes and then giving you the best suggestion. UAE's speed and accuracy save you a lot of time in the library, preventing you from being distracted by irrelevant books.
ELI14 (explained like you're 14)
Hey there! Imagine you're playing a super complex game where you need to find hidden clues to level up. The old way is to look for clues based on their color or shape, but that might lead you to things that look similar but aren't helpful. UAE is like a super helper in the game, not just looking at the clues' appearance but also quickly analyzing their content to see if they can really help you level up. This way, even if the clues don't look exactly the same, it can find the ones that are best for you. UAE uses a technique called utility alignment to make sure the clues it recommends are not only related but also truly help you win the game. It's like the helper quickly analyzing each clue's value behind the scenes and then giving you the best advice. UAE's speed and accuracy save you a lot of time in the game, preventing you from being distracted by irrelevant clues. Isn't that cool?
Glossary
Utility-Aligned Embeddings
A method that distills generative utility directly into the bi-encoder embedding space, avoiding expensive test-time LLM inference.
In this paper, UAE is used to enhance retrieval efficiency and generation quality.
Retrieval-Augmented Generation (RAG)
A method combining information retrieval and generative models to improve performance on generation tasks.
RAG systems rely on dense vector retrieval to provide high-quality context.
Approximate Nearest Neighbor (ANN)
An efficient search algorithm used to quickly find the nearest neighbors in large-scale datasets.
ANN is used for dense vector retrieval in large-scale datasets.
Utility-Modulated InfoNCE
A training objective that optimizes the bi-encoder by mimicking a utility distribution derived from perplexity reduction.
This objective is used to inject utility signals into the embedding space.
Perplexity
A measure of uncertainty in a language model, with lower values indicating higher confidence.
Perplexity is used to estimate the utility of context documents.
Distribution Matching
An optimization strategy that aligns model outputs by minimizing the difference between two distributions.
UAE formulates retrieval as a distribution matching problem.
Kullback-Leibler Divergence (KL Divergence)
A measure of difference between two probability distributions.
KL divergence is used to optimize the retriever's distribution matching objective.
Noise Contrastive Estimation (NCE)
A technique used to approximate global distribution by combining positive and negative samples.
NCE is used for utility-aware hard negative mining.
Transformer-based Encoding Model
A model using the Transformer architecture for encoding, commonly used in NLP tasks.
The reward model uses a Transformer-based encoding model to capture utility.
Low-Rank Adaptation (LoRA)
A parameter-efficient fine-tuning technique for adapting large language models.
LoRA is used for parameter-efficient fine-tuning in UAE.
Open Questions (unanswered questions from this research)
1. How can UAE's performance with long contexts be further improved without increasing computational burden? Current methods may introduce more semantic distractors when handling long contexts, affecting generation quality. New techniques are needed to reduce this distraction while maintaining efficient retrieval performance.
2. How can UAE be applied across different domains and tasks? While UAE performs well on QASPER and NewsQA, its applicability in other domains and tasks has not been fully validated. More experiments are needed to evaluate its performance in different scenarios.
3. How can other advanced retrieval techniques be integrated to further enhance UAE's performance and applicability? UAE has shown potential in utility-driven retrieval, but combining other techniques may bring greater performance improvements.
4. How can UAE's generative quality be enhanced without compromising efficiency? While UAE improves generative quality through utility alignment, there may still be shortcomings in some cases. New methods need to be explored to further optimize generative outcomes.
Applications
Immediate Applications
Large-Scale Information Retrieval Systems
UAE can serve as an efficient first-stage retriever, providing high-quality candidate contexts for subsequent processing, suitable for real-time applications requiring quick responses.
Scientific Literature Analysis
In analyzing long scientific documents, UAE can improve generation quality by reducing semantic distractors, helping researchers quickly find relevant information.
Long-Document Question Answering Systems
UAE performs well in long-document QA systems, providing high-quality contexts to help generative models produce accurate answers.
Long-term Vision
Cross-Domain Applications
UAE's utility alignment technique has the potential to be applied in multiple domains, including legal, medical, and financial industries that require efficient information retrieval.
Intelligent Assistants
UAE can serve as a core technology for intelligent assistants, helping users quickly find relevant information and improve work efficiency.
Abstract
Dense vector retrieval is the practical backbone of Retrieval-Augmented Generation (RAG), but similarity search can suffer from precision limitations. Conversely, utility-based approaches leveraging LLM re-ranking often achieve superior performance but are computationally prohibitive and prone to noise inherent in perplexity estimation. We propose Utility-Aligned Embeddings (UAE), a framework designed to merge these advantages into a practical, high-performance retrieval method. We formulate retrieval as a distribution matching problem, training a bi-encoder to imitate a utility distribution derived from perplexity reduction using a Utility-Modulated InfoNCE objective. This approach injects graded utility signals directly into the embedding space without requiring test-time LLM inference. On the QASPER benchmark, UAE improves retrieval Recall@1 by 30.59%, MAP by 30.16%, and Token F1 by 17.3% over the strong semantic baseline BGE-Base. Crucially, UAE is over 180x faster than efficient LLM re-ranking methods while preserving competitive performance, demonstrating that aligning retrieval with generative utility yields reliable contexts at scale.