A Systematic Study of Pseudo-Relevance Feedback with LLMs

TL;DR

Study shows LLM-generated pseudo-relevance feedback significantly improves query performance, especially in low-resource tasks.

cs.IR · Advanced · 2026-03-12
Nour Jedidi Jimmy Lin
pseudo-relevance feedback · large language models · information retrieval · low-resource tasks · experimental study

Key Findings

Methodology

This paper systematically investigates the application of pseudo-relevance feedback (PRF) in large language models (LLMs), focusing on how feedback source and feedback model affect PRF effectiveness. The study uses 13 low-resource BEIR tasks and five LLM PRF methods, controlling experimental variables to ensure reliability. By comparing different feedback sources (e.g., corpus documents, LLM-generated text, and their combinations) and feedback models (e.g., Rocchio algorithm, RM3 model), the authors reveal the impact of different design choices on PRF effectiveness.

Key Results

  • Result 1: The choice of feedback model is crucial for PRF effectiveness, especially when using LLM-generated text. On BM25, the Rocchio algorithm outperforms RM3 by about one percentage point.
  • Result 2: Feedback derived solely from LLM-generated text provides the most cost-effective solution, particularly in low-resource tasks, with the HyDE method improving effectiveness by 4.2% with the Contriever retriever.
  • Result 3: Feedback derived from the corpus is most beneficial when the initial retriever is strong, especially when feedback sources are combined, which significantly enhances performance.

Significance

This study provides a systematic exploration of key factors in PRF design, offering important guidance for future PRF method development. It demonstrates that LLM-generated feedback text has significant advantages in low-resource environments, improving information retrieval effectiveness without substantial computational costs. This finding is significant for both academia and industry, especially in resource-constrained applications.

Technical Contribution

Technical contributions include: the first systematic analysis of the independent roles of feedback source and feedback model in LLM PRF; best-practice recommendations for different retrievers and feedback models; and validation of the advantages of LLM-generated text in low-resource tasks, which opens new engineering possibilities.

Novelty

This study is the first to systematically separate the effects of feedback source and feedback model, revealing the unique advantages of LLM-generated text in PRF. Compared to previous studies, this research is more comprehensive in methodology, controlling confounding factors in experiments.

Limitations

  • Limitation 1: The study focuses mainly on low-resource tasks and has not been validated in high-resource environments, which may limit the generalizability of the results.
  • Limitation 2: The types of LLM models and feedback models used in the experiments are limited; future research could explore more model combinations.
  • Limitation 3: The study does not explore the performance differences of different feedback models in complex query scenarios.

Future Work

Future research could explore PRF effectiveness in high-resource environments to further validate the advantages of LLM-generated text. Additionally, the performance of different feedback models in complex query scenarios could be investigated, and new combinations of feedback models and LLMs could be explored.

AI Executive Summary

Pseudo-relevance feedback (PRF) is a technique that improves query representation by leveraging initial retrieval results. Traditionally, PRF relies on relevance signals extracted from top-ranked documents. However, with the advent of large language models (LLMs), researchers are exploring how text generated by these models can enhance PRF effectiveness.

This study focuses on two key design dimensions: the feedback source and the feedback model. The feedback source refers to where the text used to improve the query comes from, while the feedback model describes how this text is used to update the query representation. Through systematic experiments across 13 low-resource BEIR tasks and five LLM PRF methods, the authors reveal the impact of different design choices on PRF effectiveness.
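To make the feedback-model dimension concrete, a Rocchio-style update over dense query embeddings can be sketched as follows. This is a minimal NumPy sketch, not the paper's implementation; the `alpha`/`beta` weights and the toy vectors are illustrative assumptions:

```python
import numpy as np

def rocchio_update(query_vec, feedback_vecs, alpha=1.0, beta=0.5):
    """Rocchio-style feedback model: move the query embedding toward
    the centroid of the feedback texts. The feedback source decides
    what those texts are (corpus hits, LLM-generated text, or both)."""
    centroid = np.mean(feedback_vecs, axis=0)
    return alpha * np.asarray(query_vec) + beta * centroid

# Toy example: a 4-dimensional query and two feedback documents.
q = np.array([1.0, 0.0, 0.0, 0.0])
docs = np.array([[0.0, 1.0, 0.0, 0.0],
                 [0.0, 0.0, 1.0, 0.0]])
# alpha*q + beta*mean(docs) = [1, 0.25, 0.25, 0]
q_new = rocchio_update(q, docs)
```

The same update shape covers both design dimensions the paper separates: swapping what goes into `feedback_vecs` changes the feedback source, while swapping this function for an RM3-style term reweighting changes the feedback model.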

The results indicate that the choice of feedback model is crucial for PRF effectiveness, especially when using LLM-generated text. The Rocchio algorithm outperforms RM3 on BM25, with an improvement of about 1 percentage point. Additionally, feedback derived solely from LLM-generated text provides the most cost-effective solution, particularly in low-resource tasks, with the HyDE method improving by 4.2% on the Contriever model.

Feedback derived from the corpus is most beneficial when using a strong initial retriever, especially when combining feedback sources, significantly enhancing performance. This finding is significant for both academia and industry, especially in resource-constrained applications.

Despite these promising results, the study has some limitations. It focuses mainly on low-resource tasks and has not been validated in high-resource environments, which may limit the generalizability of the results. Additionally, the types of LLM models and feedback models used in the experiments are limited. Future research could explore PRF effectiveness in high-resource environments to further validate the advantages of LLM-generated text. Additionally, the performance of different feedback models in complex query scenarios could be investigated, and new combinations of feedback models and LLMs could be explored.

Deep Analysis

Background

Pseudo-relevance feedback (PRF) techniques have been widely used in the field of information retrieval. Traditional PRF methods typically rely on relevance signals extracted from top-ranked documents to improve query representation. However, with the development of large language models (LLMs), researchers are exploring how text generated by these models can enhance PRF effectiveness. In recent years, LLMs have shown outstanding performance in natural language processing tasks, particularly in text generation and context understanding. Therefore, using LLM-generated text as a feedback source has become a new research direction. Although some studies have explored the application of LLMs in PRF, the independent roles of feedback source and feedback model in PRF effectiveness have not been systematically studied.

Core Problem

The core problem is how to effectively utilize LLM-generated text to improve PRF effectiveness. Specifically, the independent roles of feedback source and feedback model in PRF are unclear, as both are often entangled in empirical evaluations. Additionally, existing studies often fail to control other variables such as the number of feedback terms and feedback documents, complicating the interpretation of results. Therefore, this paper aims to reveal the independent impact of feedback source and feedback model on PRF effectiveness through systematic experiments.

Innovation

The core innovation of this paper lies in the first systematic analysis of the independent effects of feedback source and feedback model. Specifically, the authors study the impact of different feedback sources (e.g., corpus documents, LLM-generated text, and their combinations) and feedback models (e.g., Rocchio algorithm, RM3 model) on PRF effectiveness by controlling experimental variables. Additionally, the paper explores whether combining different feedback sources can outperform using a single source. Compared to previous studies, this paper is more comprehensive in methodology, controlling confounding factors in experiments to provide more reliable results.

Methodology

The research methodology includes the following steps:


  • Select 13 low-resource BEIR tasks as experimental datasets to ensure the broad applicability of the research results.
  • Implement five LLM PRF methods, representing three primary feedback sources: corpus only, LLM only, and corpus & LLM.
  • Control confounding factors, such as the number of feedback terms and feedback documents, to ensure the reliability of results.
  • Evaluate each method on three retrievers: BM25, Contriever, and Contriever MS-MARCO.
  • Conduct experiments using open-source toolkits such as Anserini and Pyserini to support reproducibility.
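The RM3-style feedback model compared against Rocchio above can be sketched as a term-distribution interpolation. This is a simplified pure-Python sketch under stated assumptions: real RM3 weights feedback documents by retrieval score and applies smoothing, both of which are omitted here, and the example query and feedback strings are invented:

```python
from collections import Counter

def rm3_expand(query_terms, feedback_docs, fb_terms=10, orig_weight=0.5):
    """RM3-style feedback model: interpolate the original query's term
    distribution with a relevance model estimated from feedback text.
    The feedback text may come from top-ranked corpus documents,
    LLM-generated passages, or a mix of both."""
    # Relevance model: normalized term frequencies over the feedback text.
    counts = Counter(w for doc in feedback_docs for w in doc.split())
    top = counts.most_common(fb_terms)
    total = sum(c for _, c in top)
    rel_model = {w: c / total for w, c in top}

    # Original query as a uniform distribution over its terms.
    q_model = {w: 1 / len(query_terms) for w in query_terms}

    # Interpolate the two distributions with weight orig_weight.
    return {w: orig_weight * q_model.get(w, 0.0)
               + (1 - orig_weight) * rel_model.get(w, 0.0)
            for w in set(q_model) | set(rel_model)}

weights = rm3_expand(["neural", "retrieval"],
                     ["dense retrieval with neural encoders",
                      "neural ranking models for retrieval"])
```

Here the original query terms keep the highest weights (0.35 each), while expansion terms such as "dense" enter with small weights (0.05), which is the behavior RM3 is designed for.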

Experiments

The experimental design includes selecting 13 low-resource BEIR datasets for evaluation, covering various tasks such as news retrieval, financial question answering, entity retrieval, and biomedical information retrieval. The experiments use three retrievers: BM25, Contriever, and Contriever MS-MARCO, and compare the effects of different feedback models (e.g., Rocchio algorithm, RM3 model). To ensure the reliability of results, the experiments control the number of feedback terms and feedback documents. Additionally, ablation studies are conducted to explore the independent roles of different feedback sources and feedback models.

Results

The experimental results indicate that the choice of feedback model is crucial for PRF effectiveness, especially when using LLM-generated text: on BM25, the Rocchio algorithm outperforms RM3 by about one percentage point. Feedback derived solely from LLM-generated text provides the most cost-effective solution, particularly in low-resource tasks, with the HyDE method improving effectiveness by 4.2% with the Contriever retriever. Feedback derived from the corpus is most beneficial when the initial retriever is strong, especially when feedback sources are combined, which significantly enhances performance.

Applications

The research results have important implications for multiple application scenarios. First, in low-resource environments, LLM-generated text feedback can significantly improve information retrieval effectiveness, suitable for resource-constrained applications. Second, combining different feedback sources can enhance retrieval effectiveness without substantial computational costs, suitable for scenarios requiring efficient retrieval. Additionally, the proposed methods can be used to improve existing PRF techniques, providing new solutions for academia and industry.

Limitations & Outlook

Despite the significant findings, the study has some limitations. First, it focuses mainly on low-resource tasks and has not been validated in high-resource environments, which may limit the generalizability of the results. Second, the types of LLM models and feedback models used in the experiments are limited; future research could explore more model combinations. Additionally, the study does not explore the performance differences of different feedback models in complex query scenarios. Future research could explore PRF effectiveness in high-resource environments to further validate the advantages of LLM-generated text.

Plain Language (Accessible to non-experts)

Imagine you're in a kitchen cooking a meal. Traditionally, you'd rely on a recipe to select ingredients and spices, much like traditional pseudo-relevance feedback methods that depend on extracting information from top-ranked documents to improve query representation. But now, you have an experienced chef assistant (large language model) who can quickly generate new recipes and suggestions based on your needs. This is akin to the LLM-generated text feedback used in this study, which can provide more effective query improvements in low-resource environments. By combining information from different sources, like using both the recipe and the chef assistant's suggestions, you can create a more delicious dish without significantly increasing costs. This approach not only improves efficiency but also offers new directions for future research.

ELI14 (Explained like you're 14)

Hey there! Imagine you're playing a game where you need to find hidden treasures. The old way is to follow clues on a map, just like old-school search engines that rely on information from top-ranked documents. But now, you have a super helper (large language model) that can quickly generate new clues and suggestions based on your needs. This is like the LLM-generated text feedback used in this study, which can provide more effective query improvements in low-resource environments. By combining information from different sources, like using both the map and the super helper's suggestions, you can find the treasure faster. Isn't that cool? This approach not only improves efficiency but also offers new directions for future research.

Glossary

Pseudo-Relevance Feedback

A method that improves query representation by leveraging initial retrieval results, typically by extracting relevance signals from top-ranked documents.

This paper studies the application of pseudo-relevance feedback in large language models.

Large Language Model

A deep learning-based model capable of generating and understanding natural language text.

This paper uses text generated by large language models as a feedback source.

Rocchio Algorithm

A classic feedback model used to update query representation based on feedback documents.

The Rocchio algorithm is used on the BM25 retriever in this study.

RM3 Model

A feedback model that updates query representation by combining terms from the original query and feedback documents.

The study compares the RM3 model and Rocchio algorithm in experiments.

BEIR Dataset

A benchmark dataset for evaluating information retrieval systems, covering various low-resource tasks.

The study conducts experiments on 13 BEIR datasets.

Contriever Model

A dense retrieval model trained with unsupervised contrastive learning, used for efficient retrieval over large-scale corpora.

The Contriever model is used in the experiments.

HyDE Method

A method that uses large language models to generate hypothetical answer documents for improving query representation.

The HyDE method is used to generate LLM text feedback in this study.
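The HyDE idea can be sketched in a few lines: embed one or more LLM-generated hypothetical answer documents and average them with the query embedding. This is a minimal sketch, not the paper's implementation; the hash-based `embed` function stands in for a real dense encoder such as Contriever, and `fake_llm` stands in for a real LLM call (e.g. the Qwen3-14B backbone used in the study):

```python
import hashlib
import numpy as np

def embed(text, dim=8):
    """Stand-in encoder: a deterministic pseudo-embedding seeded from a
    hash of the text. A real system would use a dense retriever here."""
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "big")
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def hyde_query_vector(query, generate_fn, n_docs=1):
    """HyDE-style feedback: average the query embedding with the
    embeddings of LLM-generated hypothetical answer documents."""
    vecs = [embed(query)] + [embed(generate_fn(query)) for _ in range(n_docs)]
    return np.mean(vecs, axis=0)

# `fake_llm` is a stub for the generation step.
fake_llm = lambda q: f"A hypothetical passage answering: {q}"
qv = hyde_query_vector("what causes tides?", fake_llm)
```

The resulting vector is then used in place of the plain query embedding for nearest-neighbor search, which is why HyDE counts as an "LLM only" feedback source in the paper's taxonomy.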

Anserini Toolkit

An open-source toolkit for information retrieval research, supporting the implementation of various retrieval algorithms.

The experiments in this study are based on the Anserini toolkit.

Pyserini Toolkit

An open-source Python toolkit for reproducible information retrieval research, built on top of Anserini.

The experiments in this study are based on the Pyserini toolkit.

Qwen3-14B Model

A large language model used as the backbone model in the experiments.

The Qwen3-14B model serves as the backbone model in the study's experiments.

Open Questions (Unanswered questions from this research)

  • Open Question 1: Does LLM-generated text feedback still hold significant advantages in high-resource environments? Current research focuses mainly on low-resource tasks, and future studies need to validate this on larger datasets.
  • Open Question 2: How do different feedback models perform in complex query scenarios? Existing research focuses mainly on simple queries, and future studies need to explore feedback effectiveness on complex queries.
  • Open Question 3: How can more types of LLMs and feedback models be effectively combined? The models used in current research are limited, and future studies could explore more combinations.
  • Open Question 4: How can the computational cost of LLM-generated text be reduced for real-time applications? Current research focuses mainly on effectiveness, and future studies need to consider computational efficiency.
  • Open Question 5: Does LLM-generated text feedback apply universally in multilingual environments? Current research focuses mainly on a single language, and future studies need to explore multilingual applications.
  • Open Question 6: How can the effectiveness of LLM-generated text be further improved without significantly increasing computational costs? Current research has made progress in cost-effectiveness, but there is still room for improvement.
  • Open Question 7: Is LLM-generated text feedback consistently effective across different application domains? Current research focuses mainly on information retrieval, and future studies need to explore other domains.

Applications

Immediate Applications

Information Retrieval in Low-Resource Environments

LLM-generated text feedback can significantly improve information retrieval effectiveness in low-resource environments, suitable for resource-constrained applications.

Efficient Retrieval Systems

Combining different feedback sources can enhance retrieval effectiveness without substantial computational costs, suitable for scenarios requiring efficient retrieval.

PRF Improvement in Academic Research

The proposed methods can be used to improve existing PRF techniques, providing new solutions for academia.

Long-term Vision

Multilingual Information Retrieval Systems

Explore the application of LLM-generated text in multilingual environments to develop universally applicable multilingual information retrieval systems.

LLM Optimization in Real-Time Applications

Optimize the computational cost of LLM-generated text in real-time applications to develop efficient real-time information retrieval systems.

Abstract

Pseudo-relevance feedback (PRF) methods built on large language models (LLMs) can be organized along two key design dimensions: the feedback source, which is where the feedback text is derived from, and the feedback model, which is how the given feedback text is used to refine the query representation. However, the independent role that each dimension plays is unclear, as both are often entangled in empirical evaluations. In this paper, we address this gap by systematically studying how the choice of feedback source and feedback model impact PRF effectiveness through controlled experimentation. Across 13 low-resource BEIR tasks with five LLM PRF methods, our results show: (1) the choice of feedback model can play a critical role in PRF effectiveness; (2) feedback derived solely from LLM-generated text provides the most cost-effective solution; and (3) feedback derived from the corpus is most beneficial when utilizing candidate documents from a strong first-stage retriever. Together, our findings provide a better understanding of which elements in the PRF design space are most important.

cs.IR cs.CL

References (15)

  • Relevance feedback in information retrieval. J. Rocchio, 1971. (3480 citations)
  • QueryGym: A Toolkit for Reproducible LLM-Based Query Reformulation. Amin Bigdeli, Radin Hamidi Rad, Mert Incesu et al., 2025. (2 citations)
  • Unsupervised Dense Information Retrieval with Contrastive Learning. Gautier Izacard, Mathilde Caron, Lucas Hosseini et al., 2021. (1342 citations)
  • ThinkQE: Query Expansion via an Evolving Thinking Process. Yibin Lei, Tao Shen, Andrew Yates, 2025. (6 citations)
  • Precise Zero-Shot Dense Retrieval without Relevance Labels. Luyu Gao, Xueguang Ma, Jimmy J. Lin et al., 2022. (587 citations)
  • Anserini: Enabling the Use of Lucene for Information Retrieval Research. Peilin Yang, Hui Fang, Jimmy J. Lin, 2017. (409 citations)
  • BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models. Nandan Thakur, Nils Reimers, Andreas Rücklé et al., 2021. (1498 citations)
  • Efficient Memory Management for Large Language Model Serving with PagedAttention. Woosuk Kwon, Zhuohan Li, Siyuan Zhuang et al., 2023. (4910 citations)
  • UMass at TREC 2004: Novelty and HARD. Nasreen Abdul Jaleel, James Allan, W. Bruce Croft et al., 2004. (351 citations)
  • GenQREnsemble: Zero-Shot LLM Ensemble Prompting for Generative Query Reformulation. Kaustubh D. Dhole, Eugene Agichtein, 2024. (27 citations)
  • Pseudo-Relevance Feedback with Dense Retrievers in Pyserini. Hang Li, 2022. (5 citations)
  • Zero-Shot Dense Retrieval with Embeddings from Relevance Feedback. Nour Jedidi, Yung-Sung Chuang, L. Shing et al., 2024. (6 citations)
  • UMBRELA: UMbrela is the (Open-Source Reproduction of the) Bing RELevance Assessor. Shivani Upadhyay, Ronak Pradeep, Nandan Thakur et al., 2024. (55 citations)
  • Pyserini: A Python Toolkit for Reproducible Information Retrieval Research with Sparse and Dense Representations. Jimmy J. Lin, Xueguang Ma, Sheng-Chieh Lin et al., 2021. (590 citations)
  • Retrieval-Augmented Retrieval: Large Language Models are Strong Zero-Shot Retriever. Tao Shen, Guodong Long, Xiubo Geng et al., 2024. (45 citations)