Aligning Dense Retrievers with LLM Utility via Distillation
UAE improves Recall@1 by 30.59% on the QASPER benchmark.
Key Findings
Methodology
The paper introduces Utility-Aligned Embeddings (UAE), a framework that distills generative utility directly into the bi-encoder's embedding space, avoiding expensive test-time LLM inference. UAE formulates retrieval as a distribution matching problem, training the bi-encoder to mimic a utility distribution derived from perplexity reduction via a Utility-Modulated InfoNCE objective, thereby injecting graded utility signals directly into the embedding space.
Key Results
- On the QASPER benchmark, UAE improves retrieval Recall@1 by 30.59%, MAP by 30.16%, and Token F1 by 17.3% over the strong semantic baseline BGE-Base.
- UAE is over 180x faster than efficient LLM re-ranking methods while preserving competitive performance, demonstrating that aligning retrieval with generative utility yields reliable contexts at scale.
- In experiments, UAE achieves a Recall@1 of 54.90 on the NewsQA dataset, surpassing the computationally expensive RankGPT (49.68), indicating that embeddings aligned with generative utility can achieve reranker-level precision in a single retrieval step.
Significance
The UAE framework significantly reduces semantic distractors and improves generation quality while operating 180x faster than LLM-based re-ranking methods. By maintaining standard ANN compatibility and serving as a high-quality foundation for multi-stage pipelines, UAE provides a practical and scalable solution for utility-driven RAG systems. This research has significant implications for academia and industry as it addresses the long-standing gap between semantic similarity and generative utility.
Technical Contribution
UAE achieves efficient dense retrieval by distilling generative utility directly into the bi-encoder's embedding space, avoiding expensive test-time LLM inference. This method formulates retrieval as a distribution matching problem, using a Utility-Modulated InfoNCE objective to train the bi-encoder to mimic a utility distribution derived from perplexity reduction. This technical contribution provides new theoretical guarantees and engineering possibilities, enabling efficient and high-performance retrieval on large-scale datasets.
Novelty
UAE is the first method to distill generative utility directly into the bi-encoder embedding space via distribution matching, offering a fundamental innovation compared to existing methods that rely on semantic similarity or expensive LLM re-ranking. By avoiding test-time LLM inference, UAE significantly enhances retrieval efficiency and achieves high performance on large-scale datasets.
Limitations
- UAE may struggle with very long contexts as they can introduce more semantic distractors, affecting generation quality.
- The performance of UAE may be limited by the quality of the pre-trained generative model it relies on.
- In certain domains or tasks, UAE may require additional fine-tuning to achieve optimal performance.
Future Work
Future research directions include exploring UAE's application across different domains and tasks, further improving its performance with long contexts, and investigating ways to enhance UAE's generative quality without compromising efficiency. Additionally, exploring how to integrate other advanced retrieval techniques to further enhance UAE's performance and applicability is a promising direction.
AI Executive Summary
Dense vector retrieval is the practical backbone of Retrieval-Augmented Generation (RAG) systems, but similarity search can suffer from precision limitations. Conversely, utility-based approaches leveraging LLM re-ranking often achieve superior performance but are computationally prohibitive and prone to noise inherent in perplexity estimation. We propose Utility-Aligned Embeddings (UAE), a framework designed to merge these advantages into a practical, high-performance retrieval method. We formulate retrieval as a distribution matching problem, training a bi-encoder to imitate a utility distribution derived from perplexity reduction using a Utility-Modulated InfoNCE objective. This approach injects graded utility signals directly into the embedding space without requiring test-time LLM inference.
On the QASPER benchmark, UAE improves retrieval Recall@1 by 30.59%, MAP by 30.16%, and Token F1 by 17.3% over the strong semantic baseline BGE-Base. UAE is over 180x faster than efficient LLM re-ranking methods while preserving competitive performance, demonstrating that aligning retrieval with generative utility yields reliable contexts at scale.
UAE significantly reduces semantic distractors and improves generation quality while operating 180x faster than LLM-based re-ranking methods. By maintaining standard ANN compatibility and serving as a high-quality foundation for multi-stage pipelines, UAE provides a practical and scalable solution for utility-driven RAG systems. This research has significant implications for academia and industry as it addresses the long-standing gap between semantic similarity and generative utility.
UAE achieves efficient dense retrieval by distilling generative utility directly into the bi-encoder's embedding space, avoiding expensive test-time LLM inference. This method formulates retrieval as a distribution matching problem, using a Utility-Modulated InfoNCE objective to train the bi-encoder to mimic a utility distribution derived from perplexity reduction. This technical contribution provides new theoretical guarantees and engineering possibilities, enabling efficient and high-performance retrieval on large-scale datasets.
Future research directions include exploring UAE's application across different domains and tasks, further improving its performance with long contexts, and investigating ways to enhance UAE's generative quality without compromising efficiency. Additionally, exploring how to integrate other advanced retrieval techniques to further enhance UAE's performance and applicability is a promising direction.
Deep Analysis
Background
In the field of information retrieval, dense vector retrieval has become the cornerstone of Retrieval-Augmented Generation (RAG) systems. These systems map queries and candidates into a shared representation space, leveraging efficient Approximate Nearest Neighbor (ANN) search to handle large-scale datasets with minimal latency. However, as technology advances, this paradigm is increasingly criticized for its reliance on semantic similarity as a proxy for generative utility. Research shows that passages with high semantic similarity (topical overlap) often fail to provide answer-critical information and can even introduce semantic distractors that mislead the generator, especially in long-context settings where incorrect but similar passages increase decoding uncertainty. To bridge this gap, current state-of-the-art approaches shift towards utility-based retrieval, where relevance is defined by how effectively a document helps a Large Language Model (LLM) produce a correct response. In practice, this is often measured via perplexity reduction: a document is considered useful if its presence as context makes the ground-truth answer more predictable.
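The perplexity-reduction notion of utility described above can be sketched numerically. This is an illustrative toy, not the paper's implementation: it assumes access to the generator's per-token log-probabilities of the ground-truth answer, and the log-prob values below are invented.

```python
import math

def sequence_nll(token_logprobs):
    """Average negative log-likelihood of the answer tokens (lower = more predictable)."""
    return -sum(token_logprobs) / len(token_logprobs)

def utility(answer_logprobs_with_doc, answer_logprobs_without_doc):
    """Perplexity-reduction utility: how much a document lowers the generator's
    perplexity on the ground-truth answer. Positive values mean the document
    makes the answer more predictable."""
    ppl_with = math.exp(sequence_nll(answer_logprobs_with_doc))
    ppl_without = math.exp(sequence_nll(answer_logprobs_without_doc))
    return ppl_without - ppl_with

# Toy log-probs for a 3-token answer, with and without a context document.
with_doc = [-0.2, -0.1, -0.3]      # answer is easy to predict given the doc
without_doc = [-1.5, -2.0, -1.0]   # much harder without it
print(utility(with_doc, without_doc))  # positive, so the document is useful
```

As the paper notes, raw scores of this kind are noisy and token-sensitive, which is why UAE distills them into a parameterized reward model rather than using them directly as training targets.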
Core Problem
Despite the conceptual soundness of utility-based approaches, they face significant practical challenges. Relying on LLMs for query generation or post-hoc re-ranking is computationally prohibitive for large-scale deployment. Furthermore, utility signals derived from perplexity are notoriously noisy and stochastic, sensitive to token-level variations and decoding dynamics that make them difficult to use as stable training targets. This necessitates complex, multi-stage architectures that improve performance at the cost of extreme inference latency and high computational overhead. Therefore, the challenge lies in enhancing retrieval's generative utility without increasing computational burden.
Innovation
The core innovation of the UAE framework lies in distilling generative utility directly into the bi-encoder's embedding space, thus avoiding expensive test-time LLM inference. Specifically:
- UAE formulates retrieval as a distribution matching problem, using a Utility-Modulated InfoNCE objective to train the bi-encoder to mimic a utility distribution derived from perplexity reduction. This approach injects graded utility signals directly into the embedding space without requiring test-time LLM inference.
- UAE stabilizes noisy utility signals by distilling them into a parameterized reward model during training, aligning the dense retriever with this model through supervised distribution matching. This method maintains standard ANN compatibility, providing a practical and scalable solution for utility-driven RAG systems.
- UAE significantly improves retrieval performance on the QASPER benchmark while being over 180x faster than efficient LLM re-ranking methods, demonstrating that aligning retrieval with generative utility yields reliable contexts at scale.
Methodology
The implementation of UAE involves several key steps:
- Parameterized Utility Approximation: Estimate the utility of context documents via perplexity and distill them into a parameterized reward model to stabilize noisy utility signals.
- Reward-Guided Embeddings Training: Use the reward model as an offline teacher to distill utility preferences into a dense bi-encoder, preserving ANN search efficiency.
- Distribution Matching Objective: Define a teacher distribution and a student distribution, optimizing the retriever by minimizing the KL divergence between the two distributions, reshaping the embedding space to reflect the generator's preferences.
- Utility-Aware Hard Negative Mining: Adopt the Noise Contrastive Estimation (NCE) paradigm to approximate the global distribution using a combination of gold contexts and informative negative samples, ensuring the retriever focuses on resolving semantic distractors.
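The distribution matching step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the utility scores, similarity values, and temperature are invented, and a real system would compute this loss over training batches with learned encoders.

```python
import math

def softmax(scores, temperature=1.0):
    """Turn raw scores into a probability distribution."""
    exps = [math.exp(s / temperature) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def kl_divergence(p, q):
    """KL(p || q): how far the student distribution q is from the teacher p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Teacher: reward-model utility scores for a gold passage plus hard negatives
# (values are illustrative).
teacher_utilities = [5.8, 2.1, 1.9, 0.4]
# Student: query-passage similarities from the bi-encoder for the same candidates.
student_sims = [0.62, 0.55, 0.54, 0.30]

p = softmax(teacher_utilities)               # target utility distribution
q = softmax(student_sims, temperature=0.05)  # student distribution (sharpened)

loss = kl_divergence(p, q)  # minimized during training to align the embedding space
```

Minimizing this KL term over gold contexts and mined hard negatives is what reshapes the embedding space so that nearest-neighbor rank tracks generative utility rather than topical overlap alone.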
Experiments
The experimental design includes evaluation on two distinct RAG benchmarks: QASPER (long-doc scientific QA) and NewsQA (short-doc news extraction). We employ a hard-negative setting where a candidate pool of 50 passages per query is constructed via dense retrieval (BGE-Base) and reward model utility. This setting rigorously tests the model's ability to prioritize true generative utility over high-similarity non-answers. The generation protocol uses Llama-3-8B-Instruct as the fixed generator with greedy decoding (temperature=0) for reproducibility. Dataset-specific system prompts align the generator's output with the ground-truth format: extractive phrases for NewsQA and evidence-based summaries for QASPER. Performance is quantified using Token F1 and ROUGE-L to assess informational accuracy and structural fluency.
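Token F1, one of the metrics mentioned above, can be computed as a bag-of-tokens overlap between the generated answer and the reference. A minimal sketch, using only lowercasing and whitespace tokenization (standard QA evaluation scripts typically also strip punctuation and articles):

```python
from collections import Counter

def token_f1(prediction, reference):
    """Token-level F1 between a generated answer and the ground truth."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)  # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("the model uses attention", "the model uses self attention"))  # 8/9 ~ 0.889
```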
Results
On the QASPER benchmark, UAE improves retrieval Recall@1 by 30.59%, MAP by 30.16%, and Token F1 by 17.3% over the strong semantic baseline BGE-Base. UAE is over 180x faster than efficient LLM re-ranking methods while preserving competitive performance, demonstrating that aligning retrieval with generative utility yields reliable contexts at scale. In experiments, UAE achieves a Recall@1 of 54.90 on the NewsQA dataset, surpassing the computationally expensive RankGPT (49.68), indicating that embeddings aligned with generative utility can achieve reranker-level precision in a single retrieval step. This advantage extends to ExpUtil@1 (average utility of the top-1 context). On NewsQA, UAE (5.818) surpasses both BGE-Base (4.738) and even the computation-heavy RankGPT (4.968), confirming that UAE prioritizes contexts maximally conducive to generation rather than mere semantic relevance.
Applications
Direct application scenarios for UAE include:
- In large-scale information retrieval systems, UAE can serve as an efficient first-stage retriever, providing high-quality candidate contexts for subsequent processing.
- In real-time applications requiring quick responses, UAE's low latency makes it ideal, especially in scenarios where user experience is critical.
- In complex tasks requiring long-context processing, UAE can improve generation quality by reducing semantic distractors, applicable in scientific literature analysis and long-document question answering.
Limitations & Outlook
UAE may struggle with very long contexts as they can introduce more semantic distractors, affecting generation quality. Additionally, the performance of UAE may be limited by the quality of the pre-trained generative model it relies on. In certain domains or tasks, UAE may require additional fine-tuning to achieve optimal performance. Future research directions include exploring UAE's application across different domains and tasks, further improving its performance with long contexts, and investigating ways to enhance UAE's generative quality without compromising efficiency. Additionally, exploring how to integrate other advanced retrieval techniques to further enhance UAE's performance and applicability is a promising direction.
Plain Language (accessible to non-experts)
Imagine you're in a library searching for books. Traditional methods involve finding books by title or author, similar to retrieving information based on semantic similarity. You might find a book with a similar title to what you're looking for, but its content isn't what you need. UAE acts like a smart assistant in the library, not only looking at the book titles but also quickly skimming through the content to determine if the book is truly useful to you. This way, even if the title doesn't exactly match, it can find the book that best suits your needs. UAE uses a technique called utility alignment to ensure that the books it recommends are not only relevant but also genuinely helpful in solving your problem. This process is like the assistant quickly analyzing the value of each book behind the scenes and then giving you the best suggestion. UAE's speed and accuracy save you a lot of time in the library, preventing you from being distracted by irrelevant books.
ELI14 (explained like you're 14)
Hey there! Imagine you're playing a super complex game where you need to find hidden clues to level up. The old way is to look for clues based on their color or shape, but that might lead you to things that look similar but aren't helpful. UAE is like a super helper in the game, not just looking at the clues' appearance but also quickly analyzing their content to see if they can really help you level up. This way, even if the clues don't look exactly the same, it can find the ones that are best for you. UAE uses a technique called utility alignment to make sure the clues it recommends are not only related but also truly help you win the game. It's like the helper quickly analyzing each clue's value behind the scenes and then giving you the best advice. UAE's speed and accuracy save you a lot of time in the game, preventing you from being distracted by irrelevant clues. Isn't that cool?
Glossary
Utility-Aligned Embeddings
A method that distills generative utility directly into the bi-encoder embedding space, avoiding expensive test-time LLM inference.
In this paper, UAE is used to enhance retrieval efficiency and generation quality.
Retrieval-Augmented Generation (RAG)
A method combining information retrieval and generative models to improve performance on generation tasks.
RAG systems rely on dense vector retrieval to provide high-quality context.
Approximate Nearest Neighbor (ANN)
An efficient search algorithm used to quickly find the nearest neighbors in large-scale datasets.
ANN is used for dense vector retrieval in large-scale datasets.
Utility-Modulated InfoNCE
A training objective that optimizes the bi-encoder by mimicking a utility distribution derived from perplexity reduction.
This objective is used to inject utility signals into the embedding space.
Perplexity
A measure of uncertainty in a language model, with lower values indicating higher confidence.
Perplexity is used to estimate the utility of context documents.
Distribution Matching
An optimization strategy that aligns model outputs by minimizing the difference between two distributions.
UAE formulates retrieval as a distribution matching problem.
Kullback-Leibler Divergence (KL Divergence)
A measure of difference between two probability distributions.
KL divergence is used to optimize the retriever's distribution matching objective.
Noise Contrastive Estimation (NCE)
A technique used to approximate global distribution by combining positive and negative samples.
NCE is used for utility-aware hard negative mining.
Transformer-based Encoding Model
A model using the Transformer architecture for encoding, commonly used in NLP tasks.
The reward model uses a Transformer-based encoding model to capture utility.
Low-Rank Adaptation (LoRA)
A parameter-efficient fine-tuning technique for adapting large language models.
LoRA is used for parameter-efficient fine-tuning in UAE.
Open Questions (unanswered questions from this research)
1. How can UAE's performance with long contexts be further improved without increasing computational burden? Current methods may introduce more semantic distractors when handling long contexts, affecting generation quality. New techniques are needed to reduce this distraction while maintaining efficient retrieval performance.
2. How can UAE be applied across different domains and tasks? While UAE performs well on QASPER and NewsQA, its applicability in other domains and tasks has not been fully validated. More experiments are needed to evaluate its performance in different scenarios.
3. How can other advanced retrieval techniques be integrated to further enhance UAE's performance and applicability? UAE has shown potential in utility-driven retrieval, but combining other techniques may bring greater performance improvements.
4. How can UAE's generative quality be enhanced without compromising efficiency? While UAE improves generative quality through utility alignment, there may still be shortcomings in some cases. New methods need to be explored to further optimize generative outcomes.
Applications
Immediate Applications
Large-Scale Information Retrieval Systems
UAE can serve as an efficient first-stage retriever, providing high-quality candidate contexts for subsequent processing, suitable for real-time applications requiring quick responses.
Scientific Literature Analysis
In analyzing long scientific documents, UAE can improve generation quality by reducing semantic distractors, helping researchers quickly find relevant information.
Long-Document Question Answering Systems
UAE performs well in long-document QA systems, providing high-quality contexts to help generative models produce accurate answers.
Long-term Vision
Cross-Domain Applications
UAE's utility alignment technique has the potential to be applied in multiple domains, including legal, medical, and financial industries that require efficient information retrieval.
Intelligent Assistants
UAE can serve as a core technology for intelligent assistants, helping users quickly find relevant information and improve work efficiency.
Abstract
Dense vector retrieval is the practical backbone of Retrieval-Augmented Generation (RAG), but similarity search can suffer from precision limitations. Conversely, utility-based approaches leveraging LLM re-ranking often achieve superior performance but are computationally prohibitive and prone to noise inherent in perplexity estimation. We propose Utility-Aligned Embeddings (UAE), a framework designed to merge these advantages into a practical, high-performance retrieval method. We formulate retrieval as a distribution matching problem, training a bi-encoder to imitate a utility distribution derived from perplexity reduction using a Utility-Modulated InfoNCE objective. This approach injects graded utility signals directly into the embedding space without requiring test-time LLM inference. On the QASPER benchmark, UAE improves retrieval Recall@1 by 30.59%, MAP by 30.16%, and Token F1 by 17.3% over the strong semantic baseline BGE-Base. Crucially, UAE is over 180x faster than efficient LLM re-ranking methods while preserving competitive performance, demonstrating that aligning retrieval with generative utility yields reliable contexts at scale.