Retrieval Improvements Do Not Guarantee Better Answers: A Study of RAG for AI Policy QA
This study finds that retrieval improvements in a RAG system do not guarantee better question-answering performance in AI policy analysis.
Key Findings
Methodology
This study investigates the application of RAG systems in AI policy QA using the AGORA corpus. The system combines a ColBERT-based retriever fine-tuned with contrastive learning and a generator aligned to human preferences using Direct Preference Optimization (DPO). Synthetic queries and pairwise preferences are used to adapt the system to the policy domain. Experiments evaluating retrieval quality, answer relevance, and faithfulness show that domain-specific fine-tuning improves retrieval metrics but does not consistently enhance QA performance.
Key Results
- Result 1: Fine-tuning the retriever improved retrieval metrics (Mean Reciprocal Rank reached 0.748) but did not significantly enhance QA performance.
- Result 2: In some cases, stronger retrieval led to more confident hallucinations, especially when relevant documents were missing.
- Result 3: The GPT-5.4 baseline, even without web search, achieved significantly higher answer accuracy than the RAG system.
Significance
This study highlights a critical challenge for policy-focused RAG systems: improvements to individual components do not necessarily translate into more reliable answers. This is significant for researchers and developers designing grounded QA systems over dynamic regulatory corpora. The findings provide practical insights into achieving more reliable QA in complex policy texts.
Technical Contribution
The technical contributions include: 1) proposing a RAG pipeline combining contrastive retriever fine-tuning and preference-based generator alignment for policy analysis tasks; 2) analyzing how improvements in retrieval metrics can lead to more confident hallucinations; 3) offering practical suggestions for designing QA systems over dynamic regulatory corpora.
Novelty
This study is the first to systematically analyze the application of RAG systems to AI policy QA. Unlike previous work, this paper emphasizes the interaction between retrieval and generation components and their impact on QA performance.
Limitations
- Limitation 1: The system may generate confident but inaccurate answers when relevant documents are missing from the corpus.
- Limitation 2: The generator may incorrectly reference unrelated documents when handling cross-jurisdictional policies.
- Limitation 3: The collection of preference data is limited by the availability of domain experts, potentially not fully capturing the expectations of policy researchers.
Future Work
Future research could explore stronger hallucination mitigation strategies, cross-document contextual grounding, and improved handling of document status changes. The authors suggest further investigation into effectively applying RAG systems in high-stakes tasks, particularly in policy analysis.
AI Executive Summary
In the rapidly evolving field of artificial intelligence governance, governments and regulatory bodies are continuously introducing new laws, guidelines, and standards. These policy documents are often lengthy, legally dense, and distributed across multiple jurisdictions, making analysis and comparison challenging. Resources such as the AI Governance and Regulatory Archive (AGORA) provide structured collections of AI policy documents, but extracting insights from these materials still requires substantial manual effort. Automated question-answering systems could help researchers and policymakers navigate this growing body of regulation.
Large language models (LLMs) offer powerful tools for analyzing complex text, but they often struggle with legal and regulatory documents due to domain-specific terminology, conceptual ambiguity, and nested references. Moreover, when applied directly to policy corpora, LLMs may generate fluent but unsupported claims. Retrieval-augmented generation (RAG) addresses this limitation by grounding responses in retrieved documents, yet the effectiveness of RAG depends heavily on both retrieval quality and generation alignment.
Despite recent advances in retriever training and preference-based alignment, it remains unclear whether improvements to individual RAG components consistently translate into better end-to-end question answering performance, particularly in complex and high-stakes domains. AI governance corpora are especially challenging, given dense legal language, sometimes ambiguous policy and technical jargon, and evolving, cross-referenced regulatory coverage across sectors and jurisdictions.
In this work, we investigate how domain adaptation affects RAG systems for AI policy question answering. We construct a RAG pipeline over the AGORA corpus using a ColBERT-based retriever and a generator aligned to human preferences via Direct Preference Optimization (DPO). The retriever is fine-tuned using contrastive learning with synthetically generated queries and manually labeled examples, while the generator is aligned using pairwise preference data collected from policy-focused question-answer tasks.
Our experiments evaluate retrieval performance, answer relevance, and response faithfulness. We find that while retriever fine-tuning improves retrieval metrics, it does not consistently improve end-to-end question answering performance. In some cases, stronger retrieval produces more confident hallucinations when relevant documents are absent from the corpus. These findings highlight an important challenge for policy-focused RAG systems: improvements to individual components do not necessarily translate into more reliable grounded responses. Our contributions include:
1) An empirical study of retrieval-augmented generation for question answering over AI governance documents in the AGORA corpus.
2) A domain-adapted RAG pipeline combining contrastive retriever fine-tuning and preference-based generator alignment for policy analysis tasks.
3) An analysis showing that improvements in retrieval metrics may not translate into better end-to-end question answering performance, and can increase confident hallucinations when the underlying corpus lacks coverage.
Deep Analysis
Background
The field of artificial intelligence (AI) governance is rapidly evolving, with governments and regulatory bodies continuously introducing new laws, guidelines, and standards. These policy documents are often lengthy, legally dense, and distributed across multiple jurisdictions, making analysis and comparison challenging. Resources such as the AI Governance and Regulatory Archive (AGORA) provide structured collections of AI policy documents, but extracting insights from these materials still requires substantial manual effort. Automated question-answering systems could help researchers and policymakers navigate this growing body of regulation.
Large language models (LLMs) offer powerful tools for analyzing complex text, but they often struggle with legal and regulatory documents due to domain-specific terminology, conceptual ambiguity, and nested references. Moreover, when applied directly to policy corpora, LLMs may generate fluent but unsupported claims. Retrieval-augmented generation (RAG) addresses this limitation by grounding responses in retrieved documents, yet the effectiveness of RAG depends heavily on both retrieval quality and generation alignment.
Despite recent advances in retriever training and preference-based alignment, it remains unclear whether improvements to individual RAG components consistently translate into better end-to-end question answering performance, particularly in complex and high-stakes domains. AI governance corpora are especially challenging, given dense legal language, sometimes ambiguous policy and technical jargon, and evolving, cross-referenced regulatory coverage across sectors and jurisdictions.
Core Problem
In the field of AI governance, policy documents are often lengthy, legally dense, and distributed across multiple jurisdictions, making analysis and comparison challenging. Although resources like AGORA provide structured collections of AI policy documents, extracting insights from these materials still requires substantial manual effort. Automated question-answering systems could help researchers and policymakers navigate this growing body of regulation. However, existing QA systems often struggle with legal and regulatory documents due to domain-specific terminology, conceptual ambiguity, and nested references. Moreover, when applied directly to policy corpora, LLMs may generate fluent but unsupported claims.
Innovation
The core innovations of this paper include proposing a RAG pipeline that combines contrastive learning and preference alignment for policy analysis tasks. Specifically:
1) The retriever is fine-tuned using contrastive learning, while the generator is aligned using pairwise preference data. This method better adapts to the specific needs of the policy domain.
2) Analyzing how improvements in retrieval metrics can lead to more confident hallucinations, particularly when relevant documents are missing.
3) Providing practical suggestions for designing QA systems over dynamic regulatory corpora, helping researchers and developers better address the complexity of policy texts.
Methodology
The paper proposes a RAG pipeline that combines contrastive learning and preference alignment for policy analysis tasks. The specific steps are as follows:
1) Use a ColBERT-based retriever, fine-tuned with contrastive learning, while the generator is aligned using pairwise preference data (a toy sketch of the resulting retrieve-then-generate flow follows this list).
2) Conduct experiments on the AGORA corpus to evaluate retrieval performance, answer relevance, and response faithfulness.
3) Observe that, when relevant documents are missing from the corpus, stronger retrieval can lead to more confident hallucinations.
4) Distill practical suggestions for designing QA systems over dynamic regulatory corpora, helping researchers and developers address the complexity of policy texts.
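To make the retrieve-then-generate flow concrete, below is a minimal, self-contained sketch of a RAG-style answer loop. The bag-of-words scorer stands in for the ColBERT retriever and the prompt builder stands in for the DPO-aligned generator; the corpus snippets and all function names are hypothetical illustrations, not taken from the paper.

```python
from collections import Counter

# Toy stand-in corpus; the paper uses the AGORA collection of AI policy documents.
CORPUS = {
    "doc1": "The EU AI Act classifies AI systems by risk level.",
    "doc2": "The US executive order directs agencies to set AI safety standards.",
    "doc3": "AGORA archives AI laws and guidelines across jurisdictions.",
}

def score(query: str, doc: str) -> float:
    """Bag-of-words overlap score (toy stand-in for ColBERT late interaction)."""
    q, d = Counter(query.lower().split()), Counter(doc.lower().split())
    return sum(min(q[t], d[t]) for t in q)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the ids of the top-k scoring documents."""
    ranked = sorted(CORPUS, key=lambda i: score(query, CORPUS[i]), reverse=True)
    return ranked[:k]

def build_prompt(query: str, doc_ids: list[str]) -> str:
    """Ground the generator in retrieved passages (the LLM call itself is stubbed out)."""
    context = "\n".join(f"[{i}] {CORPUS[i]}" for i in doc_ids)
    return f"Answer using only the passages below.\n{context}\n\nQuestion: {query}"

query = "How does the EU regulate AI risk?"
print(build_prompt(query, retrieve(query)))
```

Note how the failure mode discussed in the paper arises naturally here: if no passage in the corpus actually answers the question, the retriever still returns its top-k, and the generator must answer from irrelevant context.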
Experiments
The experimental design includes evaluating retrieval-augmented generation using the AGORA corpus. Specifically, the retriever is fine-tuned using contrastive learning, while the generator is aligned using pairwise preference data. Experiments evaluate retrieval performance, answer relevance, and response faithfulness.
Various metrics are used to evaluate retrieval performance, including Mean Reciprocal Rank (MRR), Recall@k, and MAP@k. The generator alignment is performed using Direct Preference Optimization (DPO) with pairwise preference data.
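For reference, the three retrieval metrics can be computed as in the following sketch; these are the standard textbook formulas, not evaluation code from the paper.

```python
def mrr(ranked_relevance: list[list[bool]]) -> float:
    """Mean Reciprocal Rank: average of 1/rank of the first relevant hit per query."""
    total = 0.0
    for rels in ranked_relevance:
        for rank, rel in enumerate(rels, start=1):
            if rel:
                total += 1.0 / rank
                break
    return total / len(ranked_relevance)

def recall_at_k(rels: list[bool], k: int, num_relevant: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    return sum(rels[:k]) / num_relevant

def average_precision_at_k(rels: list[bool], k: int, num_relevant: int) -> float:
    """AP@k for one query; MAP@k is this value averaged over queries."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(rels[:k], start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / min(k, num_relevant)

# One query whose 2nd-ranked document is the only relevant one.
print(mrr([[False, True, False]]))                        # 0.5
print(recall_at_k([False, True, False], 2, 1))            # 1.0
print(average_precision_at_k([False, True, False], 3, 1)) # 0.5
```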
Results
Results show that while retriever fine-tuning improves retrieval metrics, it does not consistently enhance end-to-end QA performance. In some cases, stronger retrieval leads to more confident hallucinations when relevant documents are missing.
Specifically, fine-tuning the retriever improved retrieval metrics (Mean Reciprocal Rank reached 0.748) but did not significantly enhance QA performance. The GPT-5.4 baseline, even without web search, achieved significantly higher answer accuracy than the RAG system.
These results suggest that improvements to individual components do not necessarily translate into more reliable answers. This is significant for researchers and developers designing grounded QA systems over dynamic regulatory corpora.
Applications
The proposed method can be applied to various policy analysis scenarios, helping researchers and policymakers better navigate complex regulatory systems. Specifically:
1) Automated QA systems can help researchers and policymakers navigate the growing body of regulation, reducing the workload of manual analysis.
2) In the field of AI governance, the system can be used to analyze and compare policy documents across different jurisdictions, providing more comprehensive policy insights.
3) The system can also be applied to policy analysis in other fields, such as healthcare and finance, helping researchers and policymakers better understand and address complex policy texts.
Limitations & Outlook
Despite the progress made by the proposed method, there are still some limitations:
1) The system may generate confident but inaccurate answers when relevant documents are missing from the corpus.
2) The generator may incorrectly reference unrelated documents when handling cross-jurisdictional policies.
3) The collection of preference data is limited by the availability of domain experts, potentially not fully capturing the expectations of policy researchers.
Future research could explore stronger hallucination mitigation strategies, cross-document contextual grounding, and improved handling of document status changes.
Plain Language (Accessible to non-experts)
Imagine you're in a gigantic library with thousands of books, each covering different laws and policies. You need to find a specific book to answer a question about AI policy. At this point, you can use a super-smart librarian assistant who can quickly browse all the books and find the most relevant chapters to help you answer the question.
This assistant is our RAG system. It works by first finding the most relevant books from the library (this is the job of the retriever), and then extracting the most useful information from these books to answer your question (this is the job of the generator).
However, sometimes the library might not have the book you need, and the assistant might make some educated guesses based on the available information. This is like the assistant trying to give you the best possible answer even when there's not enough information.
Our research found that even if the assistant gets better at finding books, it doesn't necessarily give more accurate answers, especially when the library lacks relevant books. Therefore, we need to keep improving the assistant's ability to provide more reliable answers even when information is scarce.
ELI14 (Explained like you're 14)
Hey there, friends! Today we're going to talk about something super cool called the RAG system. Imagine you're playing a huge treasure hunt game, and there are countless treasure spots on the map. Your mission is to find the most valuable treasure!
The RAG system is like your super assistant, helping you find the closest spots to the treasure on the map. First, it uses a tool called a retriever to scan the entire map and find places that might have treasure.
Next, it uses another tool called a generator to extract the most useful information from these places to help you find the real treasure!
But sometimes, the map might not show all the treasure spots, and the assistant might make some guesses based on the clues it has. This is like when you're playing a game and guessing where the treasure might be based on the clues.
Our research found that even if the assistant gets better at finding clues, it doesn't always find all the treasures, especially when the map lacks some important clues. So, we need to keep improving the assistant's ability to help you find more treasures even when information is scarce!
Glossary
RAG (Retrieval-Augmented Generation)
RAG is a technique that combines retrieval and generation: relevant documents are retrieved first, and the generator grounds its answer in them, improving output quality on complex texts.
In this paper, RAG is used to analyze AI policy documents.
ColBERT
ColBERT is a BERT-based retrieval model that scores queries against documents through token-level late interaction, enabling efficient semantic search. In this work, its retrieval performance is further optimized through contrastive learning.
The paper uses ColBERT as the basis for the retriever.
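As an illustration, ColBERT's late-interaction (MaxSim) scoring can be sketched as follows, assuming token embeddings are already computed; the real model adds BERT encoders, query/document preprocessing, and an index.

```python
import torch

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """Late interaction: for each query token, take its best-matching doc token,
    then sum the maxima. Shapes: query_emb (Lq, d), doc_emb (Ld, d)."""
    sim = query_emb @ doc_emb.T          # (Lq, Ld) token-token similarities
    return sim.max(dim=1).values.sum()   # max over doc tokens, sum over query tokens

torch.manual_seed(0)
q = torch.nn.functional.normalize(torch.randn(8, 128), dim=-1)    # 8 query tokens
d = torch.nn.functional.normalize(torch.randn(120, 128), dim=-1)  # 120 doc tokens
print(maxsim_score(q, d))
```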
DPO (Direct Preference Optimization)
DPO is an optimization technique used to align the generator's output with human preferences. It trains using pairwise preference data to improve output quality.
In this paper, DPO is used for generator preference alignment.
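For intuition, the DPO objective can be sketched as below on dummy sequence log-probabilities; the value of β and the tensor contents are illustrative, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO: push the policy to prefer the chosen answer more strongly than the
    frozen reference model does. Inputs are per-example sequence log-probs."""
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    return -F.logsigmoid(beta * margin).mean()

# Dummy log-probs for a batch of 3 preference pairs.
pol_c, pol_r = torch.tensor([-9.0, -8.5, -10.0]), torch.tensor([-11.0, -9.0, -9.5])
ref_c, ref_r = torch.tensor([-10.0, -9.0, -10.0]), torch.tensor([-10.5, -9.0, -10.0])
print(dpo_loss(pol_c, pol_r, ref_c, ref_r))
```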
AGORA (AI Governance and Regulatory Archive)
AGORA is a structured collection of AI policy documents from multiple jurisdictions, including laws, regulations, and policy guidelines.
The paper uses the AGORA corpus for experiments.
MRR (Mean Reciprocal Rank)
MRR is a metric for evaluating retrieval system performance, representing the average of the reciprocal ranks of the first relevant document in the retrieval results.
Used in the paper to evaluate retriever performance.
Recall@k
Recall@k is a metric for evaluating retrieval system performance, representing the proportion of relevant documents found in the top k retrieval results.
Used in the paper to evaluate retriever performance.
MAP@k (Mean Average Precision)
MAP@k is a metric for evaluating retrieval system performance: the mean over queries of the average precision computed on the top k retrieval results.
Used in the paper to evaluate retriever performance.
Hallucination
In generation models, hallucination refers to the situation where the model generates content that is inconsistent with or unsupported by the input.
Discussed in the paper regarding the issue of enhanced retrieval leading to hallucinations.
Preference Alignment
Preference alignment is a technique that optimizes the generator's output to better match human preferences and expectations.
DPO is used in the paper for preference alignment.
Contrastive Learning
Contrastive learning is a machine learning technique that improves model discrimination by comparing positive and negative samples.
Used in the paper for retriever fine-tuning.
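A common instantiation is the InfoNCE loss with in-batch negatives, sketched here on random embeddings; the paper's exact contrastive training setup may differ.

```python
import torch
import torch.nn.functional as F

def info_nce(query_emb: torch.Tensor, doc_emb: torch.Tensor, temperature: float = 0.05):
    """In-batch contrastive loss: row i of query_emb should match row i of
    doc_emb; every other row in the batch serves as a negative."""
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    logits = q @ d.T / temperature   # (B, B) similarity matrix
    labels = torch.arange(q.size(0)) # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

torch.manual_seed(0)
print(info_nce(torch.randn(4, 128), torch.randn(4, 128)))
```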
Open Questions (Unanswered questions from this research)
- Open Question 1: How can the accuracy of RAG systems be improved when relevant documents are missing? Current methods perform poorly when the corpus is incomplete, requiring stronger hallucination mitigation strategies.
- Open Question 2: How can cross-jurisdictional policy documents be handled better? Existing systems struggle with similar terminology across jurisdictions, requiring more nuanced semantic understanding.
- Open Question 3: How can the system stay up to date and accurate in a dynamically changing policy environment? Existing systems may fail to update promptly when new policies arrive, requiring more efficient update mechanisms.
- Open Question 4: How can generator preference alignment be improved to better meet the expectations of policy researchers? Preference data collection is limited by the availability of domain experts, requiring broader expert involvement.
- Open Question 5: How can the performance of RAG systems be improved without increasing computational costs? Existing systems perform poorly under limited computational resources, requiring more efficient algorithm design.
Applications
Immediate Applications
Policy Analysis Automation
RAG systems can help researchers and policymakers automate the analysis of complex policy documents, reducing manual workload and increasing efficiency.
Cross-Jurisdictional Policy Comparison
The system can be used to analyze and compare policy documents across different jurisdictions, providing more comprehensive policy insights.
Dynamic Regulation Monitoring
The system can be used for real-time monitoring of policy changes, helping policymakers stay informed about the latest regulatory developments.
Long-term Vision
Global Policy Coordination
RAG systems can facilitate global policy coordination and harmonization, helping governments better address cross-border policy challenges.
Intelligent Policy Recommendations
In the future, the system could evolve into an intelligent policy recommendation tool, helping governments formulate more effective policies and drive societal progress.
Abstract
Retrieval-augmented generation (RAG) systems are increasingly used to analyze complex policy documents, but achieving sufficient reliability for expert usage remains challenging in domains characterized by dense legal language and evolving, overlapping regulatory frameworks. We study the application of RAG to AI governance and policy analysis using the AI Governance and Regulatory Archive (AGORA) corpus, a curated collection of 947 AI policy documents. Our system combines a ColBERT-based retriever fine-tuned with contrastive learning and a generator aligned to human preferences using Direct Preference Optimization (DPO). We construct synthetic queries and collect pairwise preferences to adapt the system to the policy domain. Through experiments evaluating retrieval quality, answer relevance, and faithfulness, we find that domain-specific fine-tuning improves retrieval metrics but does not consistently improve end-to-end question answering performance. In some cases, stronger retrieval counterintuitively leads to more confident hallucinations when relevant documents are absent from the corpus. These results highlight a key concern for those building policy-focused RAG systems: improvements to individual components do not necessarily translate to more reliable answers. Our findings provide practical insights for designing grounded question-answering systems over dynamic regulatory corpora.