Retrieval Improvements Do Not Guarantee Better Answers: A Study of RAG for AI Policy QA
This study finds that retrieval improvements in a RAG system do not guarantee better question-answering performance in AI policy analysis.
Key Findings
Methodology
This study investigates the application of RAG systems in AI policy QA using the AGORA corpus. The system combines a ColBERT-based retriever fine-tuned with contrastive learning and a generator aligned to human preferences using Direct Preference Optimization (DPO). Synthetic queries and pairwise preferences are used to adapt the system to the policy domain. Experiments evaluating retrieval quality, answer relevance, and faithfulness show that domain-specific fine-tuning improves retrieval metrics but does not consistently enhance QA performance.
Key Results
- Result 1: Fine-tuning the retriever improved retrieval metrics (Mean Reciprocal Rank reached 0.748) but did not significantly enhance QA performance.
- Result 2: In some cases, stronger retrieval led to more confident hallucinations, especially when relevant documents were missing.
- Result 3: The GPT-5.4 baseline, even without web search, achieved significantly higher answer accuracy than the RAG system.
Significance
This study highlights a critical challenge for policy-focused RAG systems: improvements to individual components do not necessarily translate into more reliable answers. This is significant for researchers and developers designing grounded QA systems over dynamic regulatory corpora. The findings provide practical insights into achieving more reliable QA in complex policy texts.
Technical Contribution
The technical contributions include: 1) proposing a RAG pipeline combining contrastive retriever fine-tuning and preference-based generator alignment for policy analysis tasks; 2) analyzing how improvements in retrieval metrics can lead to more confident hallucinations; 3) offering practical suggestions for designing QA systems over dynamic regulatory corpora.
Novelty
This study is the first to systematically analyze the application of RAG systems to AI policy QA. Unlike previous work, this paper emphasizes the interaction between retrieval and generation components and their impact on QA performance.
Limitations
- Limitation 1: The system may generate confident but inaccurate answers when relevant documents are missing from the corpus.
- Limitation 2: The generator may incorrectly reference unrelated documents when handling cross-jurisdictional policies.
- Limitation 3: The collection of preference data is limited by the availability of domain experts, potentially not fully capturing the expectations of policy researchers.
Future Work
Future research could explore stronger hallucination mitigation strategies, cross-document contextual grounding, and improved handling of document status changes. The authors suggest further investigation into effectively applying RAG systems in high-stakes tasks, particularly in policy analysis.
AI Executive Summary
In the rapidly evolving field of artificial intelligence governance, governments and regulatory bodies are continuously introducing new laws, guidelines, and standards. These policy documents are often lengthy, legally dense, and distributed across multiple jurisdictions, making analysis and comparison challenging. Resources such as the AI Governance and Regulatory Archive (AGORA) provide structured collections of AI policy documents, but extracting insights from these materials still requires substantial manual effort. Automated question-answering systems could help researchers and policymakers navigate this growing body of regulation.
Large language models (LLMs) offer powerful tools for analyzing complex text, but they often struggle with legal and regulatory documents due to domain-specific terminology, conceptual ambiguity, and nested references. Moreover, when applied directly to policy corpora, LLMs may generate fluent but unsupported claims. Retrieval-augmented generation (RAG) addresses this limitation by grounding responses in retrieved documents, yet the effectiveness of RAG depends heavily on both retrieval quality and generation alignment.
Despite recent advances in retriever training and preference-based alignment, it remains unclear whether improvements to individual RAG components consistently translate into better end-to-end question answering performance, particularly in complex and high-stakes domains. AI governance corpora are especially challenging, given dense legal language, sometimes ambiguous policy and technical jargon, and evolving, cross-referenced regulatory coverage across sectors and jurisdictions.
In this work, we investigate how domain adaptation affects RAG systems for AI policy question answering. We construct a RAG pipeline over the AGORA corpus using a ColBERT-based retriever and a generator aligned to human preferences via Direct Preference Optimization (DPO). The retriever is fine-tuned using contrastive learning with synthetically generated queries and manually labeled examples, while the generator is aligned using pairwise preference data collected from policy-focused question-answer tasks.
Our experiments evaluate retrieval performance, answer relevance, and response faithfulness. We find that while retriever fine-tuning improves retrieval metrics, it does not consistently improve end-to-end question answering performance. In some cases, stronger retrieval produces more confident hallucinations when relevant documents are absent from the corpus. These findings highlight an important challenge for policy-focused RAG systems: improvements to individual components do not necessarily translate into more reliable grounded responses. Our contributions include:
1) An empirical study of retrieval-augmented generation for question answering over AI governance documents in the AGORA corpus.
2) A domain-adapted RAG pipeline combining contrastive retriever fine-tuning and preference-based generator alignment for policy analysis tasks.
3) An analysis showing that improvements in retrieval metrics may not translate into better end-to-end question answering performance, and can increase confident hallucinations when the underlying corpus lacks coverage.
Deep Analysis
Background
The field of artificial intelligence (AI) governance is rapidly evolving, with governments and regulatory bodies continuously introducing new laws, guidelines, and standards. These policy documents are often lengthy, legally dense, and distributed across multiple jurisdictions, making analysis and comparison challenging. Resources such as the AI Governance and Regulatory Archive (AGORA) provide structured collections of AI policy documents, but extracting insights from these materials still requires substantial manual effort. Automated question-answering systems could help researchers and policymakers navigate this growing body of regulation.
Large language models (LLMs) offer powerful tools for analyzing complex text, but they often struggle with legal and regulatory documents due to domain-specific terminology, conceptual ambiguity, and nested references. Moreover, when applied directly to policy corpora, LLMs may generate fluent but unsupported claims. Retrieval-augmented generation (RAG) addresses this limitation by grounding responses in retrieved documents, yet the effectiveness of RAG depends heavily on both retrieval quality and generation alignment.
Despite recent advances in retriever training and preference-based alignment, it remains unclear whether improvements to individual RAG components consistently translate into better end-to-end question answering performance, particularly in complex and high-stakes domains. AI governance corpora are especially challenging, given dense legal language, sometimes ambiguous policy and technical jargon, and evolving, cross-referenced regulatory coverage across sectors and jurisdictions.
Core Problem
In the field of AI governance, policy documents are often lengthy, legally dense, and distributed across multiple jurisdictions, making analysis and comparison challenging. Although resources like AGORA provide structured collections of AI policy documents, extracting insights from these materials still requires substantial manual effort. Automated question-answering systems could help researchers and policymakers navigate this growing body of regulation. However, existing QA systems often struggle with legal and regulatory documents due to domain-specific terminology, conceptual ambiguity, and nested references. Moreover, when applied directly to policy corpora, LLMs may generate fluent but unsupported claims.
Innovation
The core innovations of this paper include proposing a RAG pipeline that combines contrastive learning and preference alignment for policy analysis tasks. Specifically:
1) The retriever is fine-tuned using contrastive learning, while the generator is aligned using pairwise preference data. This method better adapts to the specific needs of the policy domain.
2) Analyzing how improvements in retrieval metrics can lead to more confident hallucinations, particularly when relevant documents are missing.
3) Providing practical suggestions for designing QA systems over dynamic regulatory corpora, helping researchers and developers better address the complexity of policy texts.
Methodology
The paper proposes a RAG pipeline that combines contrastive learning and preference alignment for policy analysis tasks. The specific steps are as follows:
1) Use a ColBERT-based retriever, fine-tuned with contrastive learning, while the generator is aligned using pairwise preference data (a toy sketch of the resulting retrieve-then-generate flow follows this list).
2) Conduct experiments on the AGORA corpus to evaluate retrieval performance, answer relevance, and response faithfulness.
3) Observe that, when relevant documents are missing from the corpus, stronger retrieval can lead to more confident hallucinations.
4) Distill practical suggestions for designing QA systems over dynamic regulatory corpora, helping researchers and developers address the complexity of policy texts.
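To make the retrieve-then-generate flow concrete, below is a minimal, self-contained sketch of a RAG-style answer loop. The bag-of-words scorer stands in for the ColBERT retriever and the prompt builder stands in for the DPO-aligned generator; the corpus snippets and all function names are hypothetical illustrations, not taken from the paper.

```python
from collections import Counter

# Toy stand-in corpus; the paper uses the AGORA collection of AI policy documents.
CORPUS = {
    "doc1": "The EU AI Act classifies AI systems by risk level.",
    "doc2": "The US executive order directs agencies to set AI safety standards.",
    "doc3": "AGORA archives AI laws and guidelines across jurisdictions.",
}

def score(query: str, doc: str) -> float:
    """Bag-of-words overlap score (toy stand-in for ColBERT late interaction)."""
    q, d = Counter(query.lower().split()), Counter(doc.lower().split())
    return sum(min(q[t], d[t]) for t in q)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the ids of the top-k scoring documents."""
    ranked = sorted(CORPUS, key=lambda i: score(query, CORPUS[i]), reverse=True)
    return ranked[:k]

def build_prompt(query: str, doc_ids: list[str]) -> str:
    """Ground the generator in retrieved passages (the LLM call itself is stubbed out)."""
    context = "\n".join(f"[{i}] {CORPUS[i]}" for i in doc_ids)
    return f"Answer using only the passages below.\n{context}\n\nQuestion: {query}"

query = "How does the EU regulate AI risk?"
print(build_prompt(query, retrieve(query)))
```

Note how the failure mode discussed in the paper arises naturally here: if no passage in the corpus actually answers the question, the retriever still returns its top-k, and the generator must answer from irrelevant context.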
Experiments
The experimental design includes evaluating retrieval-augmented generation using the AGORA corpus. Specifically, the retriever is fine-tuned using contrastive learning, while the generator is aligned using pairwise preference data. Experiments evaluate retrieval performance, answer relevance, and response faithfulness.
Various metrics are used to evaluate retrieval performance, including Mean Reciprocal Rank (MRR), Recall@k, and MAP@k. The generator alignment is performed using Direct Preference Optimization (DPO) with pairwise preference data.
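For reference, the three retrieval metrics can be computed as in the following sketch; these are the standard textbook formulas, not evaluation code from the paper.

```python
def mrr(ranked_relevance: list[list[bool]]) -> float:
    """Mean Reciprocal Rank: average of 1/rank of the first relevant hit per query."""
    total = 0.0
    for rels in ranked_relevance:
        for rank, rel in enumerate(rels, start=1):
            if rel:
                total += 1.0 / rank
                break
    return total / len(ranked_relevance)

def recall_at_k(rels: list[bool], k: int, num_relevant: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    return sum(rels[:k]) / num_relevant

def average_precision_at_k(rels: list[bool], k: int, num_relevant: int) -> float:
    """AP@k for one query; MAP@k is this value averaged over queries."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(rels[:k], start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / min(k, num_relevant)

# One query whose 2nd-ranked document is the only relevant one.
print(mrr([[False, True, False]]))                        # 0.5
print(recall_at_k([False, True, False], 2, 1))            # 1.0
print(average_precision_at_k([False, True, False], 3, 1)) # 0.5
```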
Results
Results show that while retriever fine-tuning improves retrieval metrics, it does not consistently enhance end-to-end QA performance. In some cases, stronger retrieval leads to more confident hallucinations when relevant documents are missing.
Specifically, fine-tuning the retriever improved retrieval metrics (Mean Reciprocal Rank reached 0.748) but did not significantly enhance QA performance. The GPT-5.4 baseline, even without web search, achieved significantly higher answer accuracy than the RAG system.
These results suggest that improvements to individual components do not necessarily translate into more reliable answers. This is significant for researchers and developers designing grounded QA systems over dynamic regulatory corpora.
Applications
The proposed method can be applied to various policy analysis scenarios, helping researchers and policymakers better navigate complex regulatory systems. Specifically:
1) Automated QA systems can help researchers and policymakers navigate the growing body of regulation, reducing the workload of manual analysis.
2) In the field of AI governance, the system can be used to analyze and compare policy documents across different jurisdictions, providing more comprehensive policy insights.
3) The system can also be applied to policy analysis in other fields, such as healthcare and finance, helping researchers and policymakers better understand and address complex policy texts.
Limitations & Outlook
Despite the progress made by the proposed method, there are still some limitations:
1) The system may generate confident but inaccurate answers when relevant documents are missing from the corpus.
2) The generator may incorrectly reference unrelated documents when handling cross-jurisdictional policies.
3) The collection of preference data is limited by the availability of domain experts, potentially not fully capturing the expectations of policy researchers.
Future research could explore stronger hallucination mitigation strategies, cross-document contextual grounding, and improved handling of document status changes.
Plain Language (Accessible to non-experts)
Imagine you're in a gigantic library with thousands of books, each covering different laws and policies. You need to find a specific book to answer a question about AI policy. At this point, you can use a super-smart librarian assistant who can quickly browse all the books and find the most relevant chapters to help you answer the question.
This assistant is our RAG system. It works by first finding the most relevant books from the library (this is the job of the retriever), and then extracting the most useful information from these books to answer your question (this is the job of the generator).
However, sometimes the library might not have the book you need, and the assistant might make some educated guesses based on the available information. This is like the assistant trying to give you the best possible answer even when there's not enough information.
Our research found that even if the assistant gets better at finding books, it doesn't necessarily give more accurate answers, especially when the library lacks relevant books. Therefore, we need to keep improving the assistant's ability to provide more reliable answers even when information is scarce.
ELI14 (Explained like you're 14)
Hey there, friends! Today we're going to talk about something super cool called the RAG system. Imagine you're playing a huge treasure hunt game, and there are countless treasure spots on the map. Your mission is to find the most valuable treasure!
The RAG system is like your super assistant, helping you find the closest spots to the treasure on the map. First, it uses a tool called a retriever to scan the entire map and find places that might have treasure.
Next, it uses another tool called a generator to extract the most useful information from these places to help you find the real treasure!
But sometimes, the map might not show all the treasure spots, and the assistant might make some guesses based on the clues it has. This is like when you're playing a game and guessing where the treasure might be based on the clues.
Our research found that even if the assistant gets better at finding clues, it doesn't always find all the treasures, especially when the map lacks some important clues. So, we need to keep improving the assistant's ability to help you find more treasures even when information is scarce!
Glossary
RAG (Retrieval-Augmented Generation)
RAG is a technique that combines retrieval and generation: relevant documents are retrieved first, and the generator grounds its answer in them, improving output quality on complex texts.
In this paper, RAG is used to analyze AI policy documents.
ColBERT
ColBERT is a BERT-based retrieval model that scores queries against documents through token-level late interaction, enabling efficient semantic search. In this work, its retrieval performance is further optimized through contrastive learning.
The paper uses ColBERT as the basis for the retriever.
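As an illustration, ColBERT's late-interaction (MaxSim) scoring can be sketched as follows, assuming token embeddings are already computed; the real model adds BERT encoders, query/document preprocessing, and an index.

```python
import torch

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """Late interaction: for each query token, take its best-matching doc token,
    then sum the maxima. Shapes: query_emb (Lq, d), doc_emb (Ld, d)."""
    sim = query_emb @ doc_emb.T          # (Lq, Ld) token-token similarities
    return sim.max(dim=1).values.sum()   # max over doc tokens, sum over query tokens

torch.manual_seed(0)
q = torch.nn.functional.normalize(torch.randn(8, 128), dim=-1)    # 8 query tokens
d = torch.nn.functional.normalize(torch.randn(120, 128), dim=-1)  # 120 doc tokens
print(maxsim_score(q, d))
```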
DPO (Direct Preference Optimization)
DPO is an optimization technique used to align the generator's output with human preferences. It trains using pairwise preference data to improve output quality.
In this paper, DPO is used for generator preference alignment.
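For intuition, the DPO objective can be sketched as below on dummy sequence log-probabilities; the value of β and the tensor contents are illustrative, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO: push the policy to prefer the chosen answer more strongly than the
    frozen reference model does. Inputs are per-example sequence log-probs."""
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    return -F.logsigmoid(beta * margin).mean()

# Dummy log-probs for a batch of 3 preference pairs.
pol_c, pol_r = torch.tensor([-9.0, -8.5, -10.0]), torch.tensor([-11.0, -9.0, -9.5])
ref_c, ref_r = torch.tensor([-10.0, -9.0, -10.0]), torch.tensor([-10.5, -9.0, -10.0])
print(dpo_loss(pol_c, pol_r, ref_c, ref_r))
```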
AGORA (AI Governance and Regulatory Archive)
AGORA is a structured collection of AI policy documents from multiple jurisdictions, including laws, regulations, and policy guidelines.
The paper uses the AGORA corpus for experiments.
MRR (Mean Reciprocal Rank)
MRR is a metric for evaluating retrieval system performance, representing the average of the reciprocal ranks of the first relevant document in the retrieval results.
Used in the paper to evaluate retriever performance.
Recall@k
Recall@k is a metric for evaluating retrieval system performance, representing the proportion of relevant documents found in the top k retrieval results.
Used in the paper to evaluate retriever performance.
MAP@k (Mean Average Precision)
MAP@k is a metric for evaluating retrieval system performance: the mean over queries of the average precision computed on the top k retrieval results.
Used in the paper to evaluate retriever performance.
Hallucination
In generation models, hallucination refers to the situation where the model generates content that is inconsistent with or unsupported by the input.
Discussed in the paper regarding the issue of enhanced retrieval leading to hallucinations.
Preference Alignment
Preference alignment is a technique that optimizes the generator's output to better match human preferences and expectations.
DPO is used in the paper for preference alignment.
Contrastive Learning
Contrastive learning is a machine learning technique that improves model discrimination by comparing positive and negative samples.
Used in the paper for retriever fine-tuning.
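A common instantiation is the InfoNCE loss with in-batch negatives, sketched here on random embeddings; the paper's exact contrastive training setup may differ.

```python
import torch
import torch.nn.functional as F

def info_nce(query_emb: torch.Tensor, doc_emb: torch.Tensor, temperature: float = 0.05):
    """In-batch contrastive loss: row i of query_emb should match row i of
    doc_emb; every other row in the batch serves as a negative."""
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    logits = q @ d.T / temperature   # (B, B) similarity matrix
    labels = torch.arange(q.size(0)) # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

torch.manual_seed(0)
print(info_nce(torch.randn(4, 128), torch.randn(4, 128)))
```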
Open Questions (Unanswered questions from this research)
- Open Question 1: How can the accuracy of RAG systems be improved when relevant documents are missing? Current methods perform poorly when the corpus is incomplete, requiring stronger hallucination mitigation strategies.
- Open Question 2: How can cross-jurisdictional policy documents be handled better? Existing systems struggle with similar terminology across jurisdictions, requiring more nuanced semantic understanding.
- Open Question 3: How can the system stay up to date and accurate in a dynamically changing policy environment? Existing systems may fail to update promptly when new policies arrive, requiring more efficient update mechanisms.
- Open Question 4: How can generator preference alignment be improved to better meet the expectations of policy researchers? Preference data collection is limited by the availability of domain experts, requiring broader expert involvement.
- Open Question 5: How can the performance of RAG systems be improved without increasing computational costs? Existing systems perform poorly under limited computational resources, requiring more efficient algorithm design.
Applications
Immediate Applications
Policy Analysis Automation
RAG systems can help researchers and policymakers automate the analysis of complex policy documents, reducing manual workload and increasing efficiency.
Cross-Jurisdictional Policy Comparison
The system can be used to analyze and compare policy documents across different jurisdictions, providing more comprehensive policy insights.
Dynamic Regulation Monitoring
The system can be used for real-time monitoring of policy changes, helping policymakers stay informed about the latest regulatory developments.
Long-term Vision
Global Policy Coordination
RAG systems can facilitate global policy coordination and harmonization, helping governments better address cross-border policy challenges.
Intelligent Policy Recommendations
In the future, the system could evolve into an intelligent policy recommendation tool, helping governments formulate more effective policies and drive societal progress.
Abstract
Retrieval-augmented generation (RAG) systems are increasingly used to analyze complex policy documents, but achieving sufficient reliability for expert usage remains challenging in domains characterized by dense legal language and evolving, overlapping regulatory frameworks. We study the application of RAG to AI governance and policy analysis using the AI Governance and Regulatory Archive (AGORA) corpus, a curated collection of 947 AI policy documents. Our system combines a ColBERT-based retriever fine-tuned with contrastive learning and a generator aligned to human preferences using Direct Preference Optimization (DPO). We construct synthetic queries and collect pairwise preferences to adapt the system to the policy domain. Through experiments evaluating retrieval quality, answer relevance, and faithfulness, we find that domain-specific fine-tuning improves retrieval metrics but does not consistently improve end-to-end question answering performance. In some cases, stronger retrieval counterintuitively leads to more confident hallucinations when relevant documents are absent from the corpus. These results highlight a key concern for those building policy-focused RAG systems: improvements to individual components do not necessarily translate to more reliable answers. Our findings provide practical insights for designing grounded question-answering systems over dynamic regulatory corpora.