Can QPP Choose the Right Query Variant? Evaluating Query Variant Selection for RAG Pipelines
Research evaluates QPP for selecting the best query variant in RAG pipelines to enhance generation quality.
Key Findings
Methodology
This study investigates the application of Query Performance Prediction (QPP) for selecting the optimal query variant in Retrieval-Augmented Generation (RAG) pipelines. Large-scale experiments were conducted on the TREC-RAG dataset, evaluating pre-retrieval and post-retrieval predictors under sparse and dense retrievers. The study assesses these predictors using correlation- and decision-based metrics.
Key Results
- On the TREC-RAG dataset, lightweight pre-retrieval predictors often match or outperform more expensive post-retrieval methods, offering a latency-efficient route to robust RAG.
- The study reveals a systematic divergence between retrieval and generation objectives: variants maximizing ranking metrics like nDCG often fail to produce the best generated answers, exposing a 'utility gap' between retrieval relevance and generation fidelity.
- QPP can reliably identify variants that improve end-to-end quality over the original query, especially in terms of generation quality.
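Pre-retrieval predictors of the kind highlighted above rely only on the query text and corpus statistics. As a minimal illustrative sketch (not the paper's exact predictor set), one classic signal, average inverse document frequency (avg-IDF), can be computed and used to rank variants; the corpus and queries below are toy assumptions:

```python
import math
from collections import Counter

def avg_idf(query, doc_freq, num_docs):
    """Average IDF of the query terms; higher values usually signal a
    more discriminative query. Unseen terms are smoothed to df=1."""
    terms = query.lower().split()
    idfs = [math.log(num_docs / max(doc_freq.get(t, 0), 1)) for t in terms]
    return sum(idfs) / len(idfs)

# Toy corpus statistics (hypothetical, for illustration only).
docs = ["the jaguar is a big cat", "the jaguar car brand", "big cats run fast"]
df = Counter(t for d in docs for t in set(d.split()))

# Rank two competing variants of the same information need.
variants = ["jaguar speed", "how fast can a jaguar run"]
best = max(variants, key=lambda q: avg_idf(q, df, len(docs)))
```

No retrieval run is needed to compute this score, which is why such predictors add essentially no latency to the pipeline.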
Significance
This research holds significant implications for both academia and industry. It highlights the importance of selecting the best query variant in RAG pipelines, particularly when there is a utility gap between generation quality and retrieval relevance. By using QPP, systems can enhance the quality of generated answers without significantly increasing computational costs, which is crucial for real-world applications that require efficient handling of large volumes of queries.
Technical Contribution
Technical contributions include introducing a novel application of QPP: selecting the best query variant within RAG pipelines. Traditional QPP methods primarily estimate query difficulty across topics; this study instead targets intra-topic variant selection. Additionally, the study demonstrates that pre-retrieval predictors can improve generation quality without increasing computational complexity.
Novelty
This study is the first to apply QPP to query variant selection in RAG pipelines, proposing an evaluation framework for variant selection that can improve generation quality without adding computational cost. This contrasts sharply with traditional QPP methods, which mainly focus on estimating retrieval effectiveness across topics.
Limitations
- The study is primarily conducted on the TREC-RAG dataset, and its performance on other datasets may vary.
- While pre-retrieval predictors perform well in many cases, they may not be as effective as post-retrieval methods for certain complex queries.
Future Work
Future research directions include validating the method's effectiveness on more diverse datasets and exploring ways to further bridge the utility gap between retrieval relevance and generation fidelity. Additionally, research could explore combining multiple QPP methods to enhance variant selection accuracy.
AI Executive Summary
In modern information retrieval systems, Retrieval-Augmented Generation (RAG) has become a dominant architectural paradigm. Unlike traditional ad-hoc retrieval, RAG inserts a Large Language Model (LLM) between retrieval and the user, delegating answer synthesis to a generative model conditioned on retrieved evidence. However, executing the full pipeline for every query reformulation is computationally expensive, motivating selective execution: can we identify the best query variant before incurring downstream retrieval and generation costs?
This study explores Query Performance Prediction (QPP) as a mechanism for variant selection across ad-hoc retrieval and end-to-end RAG. Unlike traditional QPP, which estimates query difficulty across topics, this study examines intra-topic discrimination—selecting the optimal reformulation among competing variants of the same information need. Large-scale experiments on the TREC-RAG dataset evaluate pre-retrieval and post-retrieval predictors under sparse and dense retrievers.
The results reveal a systematic divergence between retrieval and generation objectives: variants that maximize ranking metrics such as nDCG often fail to produce the best generated answers, exposing a 'utility gap' between retrieval relevance and generation fidelity. Nevertheless, QPP can reliably identify variants that improve end-to-end quality over the original query. Notably, lightweight pre-retrieval predictors frequently match or outperform more expensive post-retrieval methods, offering a latency-efficient approach to robust RAG.
This research holds significant implications for both academia and industry. It highlights the importance of selecting the best query variant in RAG pipelines, particularly when there is a utility gap between generation quality and retrieval relevance. By using QPP, systems can enhance the quality of generated answers without significantly increasing computational costs, which is crucial for real-world applications that require efficient handling of large volumes of queries.
Future research directions include validating the method's effectiveness on more diverse datasets and exploring ways to further bridge the utility gap between retrieval relevance and generation fidelity. Additionally, research could explore combining multiple QPP methods to enhance variant selection accuracy.
Deep Analysis
Background
Retrieval-Augmented Generation (RAG) has rapidly become a dominant architectural paradigm for modern information systems. Traditional ad-hoc retrieval methods typically provide a ranked list for user consumption, whereas RAG inserts a Large Language Model (LLM) between retrieval and the user, delegating answer synthesis to a generative model conditioned on retrieved evidence. This shift fundamentally alters both the objective and the economics of search. In this setting, the role of query reformulation is significantly emphasized. A user's original query may fail to retrieve passages that adequately ground generation, exacerbating vocabulary mismatch, intent drift, and underspecification issues. LLM-based query reformulation has become common practice to mitigate this problem by generating multiple semantically equivalent query variants to improve recall and coverage.
Core Problem
Executing the full pipeline for every query reformulation is computationally expensive, especially in production settings where exhaustive execution is often infeasible. This necessitates a more efficient alternative: can we identify the best query variant before incurring the downstream generation cost? Query Performance Prediction (QPP) offers a natural mechanism for this problem. Traditionally, QPP estimates retrieval effectiveness without relevance judgments and has been used for tasks such as selective query expansion, system routing, and risk-sensitive retrieval. However, its evaluation has largely relied on correlation with ranking metrics such as nDCG or Average Precision.
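The selective-execution idea can be sketched in a few lines. Everything here is hypothetical scaffolding: the `qpp_score` and `run_rag` callables stand in for a real predictor and a real RAG pipeline, which the paper does not prescribe.

```python
def select_and_run(original_query, variants, qpp_score, run_rag):
    """Score every candidate cheaply with a QPP function, then execute
    the full retrieval + generation pipeline only on the top-scoring one."""
    candidates = [original_query] + list(variants)
    best = max(candidates, key=qpp_score)
    return best, run_rag(best)

# Toy demo: a stand-in predictor that favours longer (more specific) queries.
best, answer = select_and_run(
    "jaguar speed",
    ["how fast can a jaguar run", "jaguar car top speed"],
    qpp_score=lambda q: len(q.split()),
    run_rag=lambda q: f"<answer for: {q}>",
)
```

The design point is that `qpp_score` is cheap relative to `run_rag`, so scoring all variants costs far less than executing the pipeline for each.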
Innovation
This study is the first to apply QPP to query variant selection in RAG pipelines, proposing an evaluation framework for variant selection that can improve generation quality without adding computational cost. This contrasts sharply with traditional QPP methods, which mainly focus on estimating retrieval effectiveness across topics. The study assesses predictors using correlation- and decision-based metrics, revealing a systematic divergence between retrieval and generation objectives: variants maximizing ranking metrics like nDCG often fail to produce the best generated answers, exposing a 'utility gap' between retrieval relevance and generation fidelity.
Methodology
- Conduct large-scale experiments on the TREC-RAG dataset, evaluating pre-retrieval and post-retrieval predictors under sparse and dense retrievers.
- Use correlation- and decision-based metrics to assess these predictors' effectiveness.
- Reveal a systematic divergence between retrieval and generation objectives, where variants maximizing ranking metrics like nDCG often fail to produce the best generated answers.
- Demonstrate that lightweight pre-retrieval predictors often match or outperform more expensive post-retrieval methods, providing a latency-efficient approach to robust RAG.
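The two evaluation views above can be illustrated with a small sketch: a correlation metric (Kendall's tau over the per-topic ranking of variants) and a decision metric (did the predictor pick the truly best variant). The scores below are made up for illustration and are not the paper's numbers:

```python
from itertools import combinations

def kendall_tau(pred, true):
    """Correlation view: (concordant - discordant) pairs over all
    pairs of variants, comparing predicted and true orderings."""
    conc = disc = 0
    for i, j in combinations(range(len(pred)), 2):
        s = (pred[i] - pred[j]) * (true[i] - true[j])
        conc += s > 0
        disc += s < 0
    return (conc - disc) / (len(pred) * (len(pred) - 1) / 2)

def top1_accuracy(pred, true):
    """Decision view: did the predictor's top-scored variant actually
    have the highest true utility?"""
    return float(pred.index(max(pred)) == true.index(max(true)))

# Hypothetical QPP scores vs. true utilities for four variants of one topic.
pred = [0.2, 0.9, 0.5, 0.4]
true = [0.1, 0.6, 0.8, 0.3]
```

A predictor can correlate well overall yet still miss the single best variant, which is why the study reports both views.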
Experiments
Experiments were conducted on the TREC-RAG 2024 benchmark, explicitly designed to evaluate RAG systems and provide evaluation protocols for retrieval and RAG tasks separately. The benchmark consists of 56 queries constructed over the MS MARCO v2.1 corpus, which contains over 138 million passages. Importantly, these queries have been carefully and thoroughly judged across retrieval and generative dimensions by both human assessors and LLM-based judges, enabling a fair comparison of performance under different pipeline configurations. The study specifically utilizes human annotations for both retrieval and nugget-based evaluations.
Results
The study reveals a systematic divergence between retrieval and generation objectives: variants that maximize ranking metrics such as nDCG often fail to produce the best generated answers, exposing a 'utility gap' between retrieval relevance and generation fidelity. Nevertheless, QPP can reliably identify variants that improve end-to-end quality over the original query. Notably, lightweight pre-retrieval predictors frequently match or outperform more expensive post-retrieval methods, offering a latency-efficient approach to robust RAG.
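Post-retrieval predictors, by contrast, inspect the retrieved ranking itself. One classic example from the QPP literature is Normalized Query Commitment (NQC); the sketch below uses made-up retrieval scores and does not reproduce the paper's exact predictor set:

```python
import statistics

def nqc(topk_scores, corpus_score):
    """Normalized Query Commitment: standard deviation of the top-k
    retrieval scores, normalized by the query's corpus-wide score.
    Higher values are usually read as less query drift (an easier query)."""
    return statistics.pstdev(topk_scores) / corpus_score

# Pick the variant whose ranked list looks most 'committed'
# (hypothetical scores; a real pipeline would use retriever output).
variant_scores = {
    "how fast can a jaguar run": [12.1, 10.4, 9.8, 5.2],
    "jaguar speed": [8.3, 8.1, 8.0, 7.9],
}
best = max(variant_scores, key=lambda q: nqc(variant_scores[q], 6.0))
```

Such predictors require running retrieval for every variant first, which is exactly the latency cost the pre-retrieval approach avoids.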
Applications
QPP-based variant selection applies directly wherever RAG systems must serve large query volumes: by predicting the most promising reformulation up front, systems can improve the quality of generated answers without executing the full retrieval-and-generation pipeline for every variant, keeping latency and computational cost manageable. This matters especially when there is a utility gap between generation quality and retrieval relevance.
Limitations & Outlook
The study is primarily conducted on the TREC-RAG dataset, and its performance on other datasets may vary. While pre-retrieval predictors perform well in many cases, they may not be as effective as post-retrieval methods for certain complex queries. Additionally, the observed divergence between retrieval and generation objectives means that selecting variants purely for ranking metrics such as nDCG can yield suboptimal generated answers, so the 'utility gap' must be accounted for when deploying QPP-based selection.
Plain Language: Accessible to Non-experts
Imagine you're in a library trying to find the best book on a particular topic. If you ask the librarian your question in just one way, the books you get back might not fully meet your needs. A cleverer librarian rephrases your question in several different ways and then, before fetching anything, predicts which phrasing is most likely to surface the right books. This is similar to the 'query variant selection' discussed in the study: the system generates multiple rewordings of your query and uses a quick prediction step to pick the most promising one, so it can find the answer that best meets your needs without doing the expensive full search for every single rewording.
ELI14: Explained Like You're 14
Imagine you're playing a game where you need to find the best path to a treasure. You could try every possible path, but that would take a lot of time and effort. So, you decide to use a smart method: first, you generate multiple possible paths, then choose the one most likely to lead to the treasure. This is similar to the concept of 'query variant selection' discussed in the study. By generating multiple 'paths' (i.e., query variants), the system can find the answer that best meets your needs without significantly increasing computational costs. This method is like choosing the best path in a game to ensure you find the treasure as quickly as possible.
Glossary
Query Performance Prediction
QPP is a method for estimating a query's performance in retrieval tasks, typically without relying on relevance judgments.
In this paper, QPP is used to select the best query variant to enhance generation quality.
Retrieval-Augmented Generation
RAG is an information retrieval architecture that combines retrieval and generation models to provide higher-quality answers.
The paper explores how to select the best query variant in RAG pipelines.
Large Language Model
LLM is a deep learning-based model capable of generating natural language text, widely used in natural language processing tasks.
The paper uses LLMs to generate query variants to improve retrieval and generation quality.
nDCG (Normalized Discounted Cumulative Gain)
nDCG is a ranking metric used to evaluate the performance of information retrieval systems, considering both relevance and position of results.
The paper uses nDCG to evaluate the retrieval effectiveness of query variants.
TREC-RAG Dataset
TREC-RAG is a dataset specifically designed to evaluate retrieval-augmented generation systems, containing carefully judged queries and passages.
The paper conducts experiments on the TREC-RAG dataset to validate the method's effectiveness.
Sparse Retriever
A sparse retriever is a retrieval method based on sparse vector representations, typically implemented using inverted indexes.
The paper evaluates query variant selection under both sparse and dense retrievers.
Dense Retriever
A dense retriever is a retrieval method based on dense vector representations, typically implemented using neural networks.
The paper evaluates query variant selection under both sparse and dense retrievers.
Utility Gap
The utility gap refers to the difference between retrieval relevance and generation fidelity, where high-ranking documents do not necessarily improve generation quality.
The paper reveals a utility gap between retrieval and generation objectives.
Pre-retrieval Predictor
A pre-retrieval predictor estimates query effectiveness before retrieval, typically based on statistical features of the query.
The paper finds that pre-retrieval predictors often match or outperform post-retrieval methods.
Post-retrieval Predictor
A post-retrieval predictor estimates query effectiveness after retrieval, typically based on statistical features of the retrieval results.
The paper compares the performance of pre-retrieval and post-retrieval predictors.
Open Questions: Unanswered Questions from This Research
1. How can QPP's effectiveness in RAG pipelines be validated on more diverse datasets? The current study focuses primarily on the TREC-RAG dataset, and its performance on other datasets remains unclear.
2. How can the utility gap between retrieval relevance and generation fidelity be further bridged? The current study reveals this gap but does not provide specific solutions.
3. How can multiple QPP methods be combined to enhance variant selection accuracy? The current study primarily focuses on evaluating single methods and has not explored the potential of combining multiple methods.
4. How do pre-retrieval predictors perform on complex queries? The current study suggests that pre-retrieval methods may not be as effective as post-retrieval methods for certain complex queries.
5. How can generation quality be further improved without increasing computational complexity? The current study demonstrates the potential of pre-retrieval predictors, but there is still room for improvement.
Applications
Immediate Applications
Search Engine Optimization
By using QPP to select the best query variant, search engines can improve retrieval and generation quality, reducing user wait times.
Intelligent Customer Service Systems
Apply QPP in intelligent customer service systems to select the query variant that best answers user questions, improving user satisfaction.
Online Education Platforms
Use QPP in online education platforms to select the most relevant learning resources, enhancing learning outcomes.
Long-term Vision
Personalized Information Retrieval
Achieve personalized information retrieval through QPP, providing answers that better meet user needs and enhance user experience.
Automated Content Generation
Use QPP to improve the quality of automated content generation, providing more efficient tools for content creation.
Abstract
Large Language Models (LLMs) have made query reformulation ubiquitous in modern retrieval and Retrieval-Augmented Generation (RAG) pipelines, enabling the generation of multiple semantically equivalent query variants. However, executing the full pipeline for every reformulation is computationally expensive, motivating selective execution: can we identify the best query variant before incurring downstream retrieval and generation costs? We investigate Query Performance Prediction (QPP) as a mechanism for variant selection across ad-hoc retrieval and end-to-end RAG. Unlike traditional QPP, which estimates query difficulty across topics, we study intra-topic discrimination - selecting the optimal reformulation among competing variants of the same information need. Through large-scale experiments on TREC-RAG using both sparse and dense retrievers, we evaluate pre- and post-retrieval predictors under correlation- and decision-based metrics. Our results reveal a systematic divergence between retrieval and generation objectives: variants that maximize ranking metrics such as nDCG often fail to produce the best generated answers, exposing a "utility gap" between retrieval relevance and generation fidelity. Nevertheless, QPP can reliably identify variants that improve end-to-end quality over the original query. Notably, lightweight pre-retrieval predictors frequently match or outperform more expensive post-retrieval methods, offering a latency-efficient approach to robust RAG.