MARCH: Multi-Agent Reinforced Self-Check for LLM Hallucination
The MARCH framework significantly reduces LLM hallucination using a multi-agent reinforced self-check, raising the factual consistency of an 8B-parameter model to a level competitive with powerful closed-source models.
Key Findings
Methodology
The MARCH framework uses multi-agent reinforcement learning to let large language models self-check for hallucinations. It consists of three specialized agents: the Solver generates an initial response, the Proposer decomposes it into verifiable atomic propositions, and the Checker validates these propositions against the retrieved evidence without ever seeing the Solver's output. This deliberate information asymmetry breaks the cycle of confirmation bias, and joint multi-agent reinforcement learning lets the agents co-evolve and optimize factual adherence.
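As a concrete illustration of this data flow, the following minimal Python sketch wires the three roles together around a generic `generate` callable. The prompts, function names, and the `generate` interface are assumptions made for illustration, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical LLM call; swap in any chat-completion client.
# This sketches the data flow only, not the paper's exact prompts.
GenerateFn = Callable[[str], str]

@dataclass
class CheckResult:
    proposition: str
    supported: bool

def solver(generate: GenerateFn, query: str, documents: List[str]) -> str:
    """Solver: draft an initial RAG answer from the query and retrieved documents."""
    context = "\n\n".join(documents)
    return generate(f"Documents:\n{context}\n\nQuestion: {query}\nAnswer:")

def proposer(generate: GenerateFn, answer: str) -> List[str]:
    """Proposer: split the draft answer into minimal, independently checkable claims."""
    raw = generate(
        "Split the following answer into atomic factual claims, one per line:\n" + answer
    )
    return [line.strip() for line in raw.splitlines() if line.strip()]

def checker(generate: GenerateFn, propositions: List[str], documents: List[str]) -> List[CheckResult]:
    """Checker: judge each claim against the documents alone.
    It never sees the Solver's full answer -- this is the information asymmetry."""
    context = "\n\n".join(documents)
    results = []
    for prop in propositions:
        verdict = generate(
            f"Documents:\n{context}\n\nClaim: {prop}\n"
            "Is the claim supported by the documents? Answer yes or no."
        )
        results.append(CheckResult(prop, verdict.strip().lower().startswith("yes")))
    return results

def march_self_check(generate: GenerateFn, query: str, documents: List[str]):
    answer = solver(generate, query, documents)
    propositions = proposer(generate, answer)
    checks = checker(generate, propositions, documents)  # Checker works in isolation
    return answer, checks
```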
Key Results
- Result 1: MARCH significantly reduces hallucination rates across multiple benchmarks. An 8B-parameter LLM equipped with MARCH achieves performance comparable to powerful closed-source models, demonstrating its effectiveness on RAG tasks.
- Result 2: On the RAGTruth and FaithBench benchmarks, MARCH-STEM and MARCH-General increase average accuracy to 74.93% and 75.23%, respectively, a significant improvement from the base model's 55.20%.
- Result 3: On the FACTS Grounding benchmark, MARCH-STEM and MARCH-General achieve factuality scores of 85.23% and 80.12%, significantly higher than the base model's 57.09%.
Significance
The MARCH framework significantly enhances factual consistency in large language models, particularly on knowledge-intensive tasks, by breaking the cycle of confirmation bias. The method provides a scalable path for the self-improvement of LLMs, which is crucial for reliability in high-stakes domains such as finance, law, and healthcare. MARCH's success demonstrates the potential of multi-agent collaboration on complex tasks and strengthens the credibility of LLMs in real-world applications.
Technical Contribution
MARCH achieves self-checking of LLM hallucinations through multi-agent reinforcement learning, overcoming the limitations of traditional methods. Its innovation lies in introducing an information asymmetry-based collaborative mechanism that breaks confirmation bias. The framework operates without additional human annotations or external fact-checking tools, showcasing the potential of multi-agent collaboration in complex tasks and providing new engineering possibilities for LLM self-improvement.
Novelty
MARCH is the first to achieve self-checking of LLM hallucinations through multi-agent reinforcement learning, breaking the cycle of confirmation bias present in traditional methods. Compared to existing supervised fine-tuning and RLHF methods, MARCH offers a more granular factual verification mechanism through its information asymmetry design.
Limitations
- Limitation 1: MARCH's performance depends on the quality and diversity of training data. In high-noise and highly heterogeneous documents, agents may struggle to effectively perform factual verification.
- Limitation 2: Although MARCH performs well across multiple benchmarks, its generalization ability in specific domains still needs further validation.
- Limitation 3: MARCH's computational cost is high, especially when training on large-scale datasets, potentially requiring substantial computational resources.
Future Work
Future work can focus on optimizing MARCH's computational efficiency and extending its generalization capabilities across different domains. Further research could explore applying MARCH on larger-scale datasets and evaluating its performance in other complex tasks. Additionally, integrating other advanced reinforcement learning techniques may further enhance MARCH's factual consistency.
AI Executive Summary
Hallucination remains a critical bottleneck for large language models (LLMs), particularly in Retrieval-Augmented Generation (RAG) systems. Existing hallucination detection methods often employ LLM-as-a-judge to verify outputs, but this approach suffers from inherent confirmation bias, leading verifiers to inadvertently reproduce the original generation's errors.
To address this issue, the paper introduces the Multi-Agent Reinforced Self-Check (MARCH) framework, which enforces rigorous factual alignment by leveraging deliberate information asymmetry. The MARCH framework orchestrates a collaborative pipeline of three specialized agents: a Solver, a Proposer, and a Checker. The Solver generates an initial RAG response, the Proposer decomposes it into claim-level verifiable atomic propositions, and the Checker validates these propositions in isolation, without access to the Solver's original output.
This well-crafted information asymmetry scheme breaks the cycle of self-confirmation bias. By training this pipeline with multi-agent reinforcement learning, the agents can co-evolve and optimize factual adherence. Extensive experiments demonstrate that MARCH substantially reduces hallucination rates. Notably, an 8B-parameter LLM equipped with MARCH achieves performance competitive with powerful closed-source models.
MARCH paves a scalable path for factual self-improvement of LLMs through co-evolution. The framework operates without additional human annotations or external fact-checking tools, showcasing the potential of multi-agent collaboration in complex tasks.
However, MARCH's performance depends on the quality and diversity of training data. In high-noise and highly heterogeneous documents, agents may struggle to effectively perform factual verification. Additionally, although MARCH performs well across multiple benchmarks, its generalization ability in specific domains still needs further validation. Future research can focus on optimizing MARCH's computational efficiency and extending its generalization capabilities across different domains.
Deep Analysis
Background
Large language models (LLMs) have made significant strides in natural language processing, particularly in generation and comprehension tasks. However, LLMs often exhibit hallucinations, generating content that contradicts factual information. This phenomenon is especially prevalent in Retrieval-Augmented Generation (RAG) systems, which rely on retrieving information from external documents to generate responses. Existing hallucination detection methods typically use LLMs as judges to verify outputs, but these methods suffer from inherent confirmation bias, leading verifiers to inadvertently reproduce the original generation's errors. To enhance LLMs' factual consistency, researchers have explored various methods, including supervised fine-tuning and reinforcement learning from human feedback (RLHF). However, these methods still fall short in providing fine-grained factual consistency.
Core Problem
The hallucination problem in LLMs is a major challenge for their practical application. Hallucinations not only undermine the model's credibility but can also have serious consequences in high-stakes domains such as finance, law, and healthcare. Existing hallucination detection methods often rely on LLMs for verification, but this approach is prone to confirmation bias, leading to inaccurate verification results. Additionally, traditional reinforcement learning methods lack the granularity needed to supervise fine-grained factual consistency, making it difficult to meet the complex demands of RAG tasks.
Innovation
The MARCH framework achieves self-checking of LLM hallucinations through multi-agent reinforcement learning, overcoming the limitations of traditional methods. Its innovations include:
- Introducing an information asymmetry-based collaborative mechanism that breaks confirmation bias.
- Designing three specialized agents: the Solver generates initial responses, the Proposer decomposes these into verifiable atomic propositions, and the Checker validates these propositions without referencing the Solver's output.
- Enabling agents to co-evolve and optimize factual adherence through multi-agent reinforcement learning.
These innovations allow MARCH to significantly enhance LLMs' factual consistency without relying on additional human annotations or external tools.
Methodology
The MARCH framework achieves self-checking of LLM hallucinations through the following steps:
- Solver generates an initial RAG response: produces a draft answer from the input query and the retrieved documents.
- Proposer decomposes the response into atomic propositions: breaks the draft down into a series of verifiable atomic propositions for subsequent validation.
- Checker validates the atomic propositions: judges each proposition against the retrieved documents alone, without referencing the Solver's output.
- Multi-agent reinforcement learning: trains the whole pipeline jointly, allowing the agents to co-evolve and optimize factual adherence.
This method breaks the cycle of self-confirmation bias through its information asymmetry design, enhancing LLMs' factual consistency.
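The summary does not spell out the reward design, but one plausible reading of "optimizing factual adherence" is that the Checker's per-proposition verdicts are aggregated into a scalar training signal. The sketch below shows such an aggregation under that assumption; how MARCH actually shapes and assigns rewards across the three agents is not reproduced here.

```python
from typing import List

def factuality_reward(checks: List[bool]) -> float:
    """Illustrative reward signal: the fraction of atomic propositions the
    Checker judged supported by the retrieved documents. This aggregation
    is an assumption for illustration, not the paper's reported reward."""
    if not checks:
        # No propositions means no evidence of factuality; score it zero
        # rather than treating an empty answer as perfectly factual.
        return 0.0
    return sum(checks) / len(checks)

# Example: 3 of 4 claims verified -> reward 0.75
print(factuality_reward([True, True, False, True]))
```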
Experiments
The experiments evaluate MARCH across multiple hallucination benchmarks. Datasets used include BioASQ, 2WikiMultiHopQA, and MuSiQue, covering different domains and task types. Meta-Llama-3.1-8B-Instruct serves as the initial policy and is trained with multi-agent reinforcement learning. Key hyperparameters include the learning rate, batch size, and number of training epochs. Ablation studies verify the contribution of each component.
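For orientation, a training configuration for such a setup could be organized as in the placeholder sketch below; the datasets and base model are taken from this section, while every numeric value and the algorithm label are illustrative assumptions rather than the paper's reported settings.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class MARCHTrainingConfig:
    """Placeholder configuration. Datasets and base model follow the summary
    above; all numeric values and the RL algorithm label are assumptions."""
    base_model: str = "meta-llama/Meta-Llama-3.1-8B-Instruct"   # initial policy
    datasets: Tuple[str, ...] = ("BioASQ", "2WikiMultiHopQA", "MuSiQue")
    learning_rate: float = 1e-6    # placeholder
    batch_size: int = 64           # placeholder
    num_epochs: int = 1            # placeholder
    rl_algorithm: str = "PPO-style multi-agent RL"  # assumption, not stated here

config = MARCHTrainingConfig()
print(config.base_model, config.datasets)
```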
Results
Experimental results show that MARCH significantly reduces hallucination rates across multiple benchmarks. On the RAGTruth and FaithBench benchmarks, MARCH-STEM and MARCH-General raise average accuracy to 74.93% and 75.23%, respectively, up from the base model's 55.20%. On the FACTS Grounding benchmark, MARCH-STEM and MARCH-General achieve factuality scores of 85.23% and 80.12%, significantly higher than the base model's 57.09%. These results demonstrate MARCH's effectiveness in enhancing LLMs' factual consistency.
Applications
The MARCH framework has broad application potential across multiple domains. Direct application scenarios include:
- Financial analysis: Enhancing the accuracy of financial reports and analyses, reducing risks associated with misinformation, and aiding financial professionals in making more accurate decisions.
- Legal document drafting: Assisting legal professionals in case analysis and legal document drafting, ensuring information accuracy and improving the quality of legal services.
- Medical literature analysis: Supporting the retrieval and analysis of medical literature, providing accurate medical advice, and aiding healthcare professionals in making better diagnostic and treatment decisions.
These application scenarios require high-quality data and robust computational capabilities to fully leverage MARCH's potential.
Limitations & Outlook
Although MARCH performs well across multiple benchmarks, its generalization ability in specific domains still needs further validation. Additionally, MARCH's computational cost is high, especially when training on large-scale datasets, potentially requiring substantial computational resources. Future research can focus on optimizing MARCH's computational efficiency and extending its generalization capabilities across different domains.
Plain Language (Accessible to non-experts)
Imagine you're running a kitchen. A cook (Solver) prepares the meal by following the recipe. A sous-chef (Proposer) then breaks the finished meal down into individual, checkable items, such as "the vegetables are chopped" and "the fish is fully cooked". Finally, a taster (Checker) checks each item against the recipe on its own, without ever watching how the cook worked.
In this process, the cook might make mistakes, such as using the wrong ingredients or skipping a step. The taster's job is to verify each item independently against the recipe, ensuring quality rather than simply trusting the cook's own judgment.
The MARCH framework is like this kitchen team, using multi-agent collaboration to ensure that the content generated by large language models aligns with factual information. Each agent has its role and task, and by leveraging information asymmetry, MARCH breaks the cycle of confirmation bias present in traditional methods, significantly improving the model's accuracy and reliability.
ELI14 (Explained like you're 14)
Hey there! Imagine you're playing a super complex game that requires you and your friends to work together. You're the main character (Solver) in the game, responsible for making decisions based on the mission guide. Then, one of your friends (Proposer) breaks your mission into smaller tasks, like fighting monsters or collecting items. Finally, there's another friend (Checker) who independently checks if each small task is done well without watching what you did.
This game has a big challenge: sometimes you might make mistakes, like fighting the wrong monster or collecting the wrong items. The Checker’s job is to ensure each small task is done correctly, rather than just trusting what you did.
This is like the MARCH framework, using multi-agent collaboration to ensure that the content generated by large language models aligns with factual information. Each agent has its role and task, and by leveraging information asymmetry, MARCH breaks the cycle of confirmation bias present in traditional methods, significantly improving the model's accuracy and reliability. Isn't that cool?
Glossary
Multi-Agent Reinforcement Learning
A machine learning approach where multiple agents learn to solve complex tasks through collaboration and competition.
MARCH uses multi-agent reinforcement learning to train agents for factual verification.
Hallucination
In natural language processing, the phenomenon where a model generates content that contradicts factual information.
MARCH aims to reduce hallucinations in large language models.
Information Asymmetry
A situation in information processing where different participants have varying amounts of information.
MARCH leverages information asymmetry to break confirmation bias.
Retrieval-Augmented Generation (RAG)
A generation method that enhances content accuracy by retrieving information from external documents.
MARCH is applied in RAG systems to improve factual consistency.
Confirmation Bias
A cognitive bias where individuals tend to validate existing beliefs while ignoring contradictory evidence.
MARCH breaks the cycle of confirmation bias through its design.
Supervised Fine-Tuning (SFT)
Fine-tuning a model using labeled data to improve its performance on specific tasks.
Traditional SFT methods have limitations in fine-grained factual consistency.
Reinforcement Learning from Human Feedback (RLHF)
A reinforcement learning approach that optimizes model decisions based on human feedback.
Existing RLHF approaches still fall short of providing fine-grained factual consistency.
Atomic Proposition
A minimal unit of fact that can be independently verified.
The Proposer decomposes responses into verifiable atomic propositions.
Ablation Study
An experimental approach to assess the impact of removing or replacing certain parts of a model on its overall performance.
Ablation studies were conducted to verify the contribution of each component.
Factual Consistency
The alignment of generated content with real-world facts.
MARCH aims to enhance the factual consistency of large language models.
Open Questions (Unanswered questions from this research)
- Open Question 1: How can effective factual verification be performed in high-noise and highly heterogeneous documents? Current methods perform poorly in these complex scenarios, requiring more robust algorithms to improve verification accuracy.
- Open Question 2: What is MARCH's generalization ability in specific domains? Although it performs well across multiple benchmarks, its adaptability to specific domains still needs further validation.
- Open Question 3: How can MARCH's computational cost be reduced? The current computational cost is high, especially when training on large-scale datasets, requiring substantial computational resources.
- Open Question 4: How can MARCH's computational efficiency be further optimized? Future research can focus on optimizing the algorithm's computational efficiency for application on larger-scale datasets.
- Open Question 5: How can other advanced reinforcement learning techniques be integrated to enhance MARCH's factual consistency? Exploring new technology combinations may further improve model performance.
- Open Question 6: How can MARCH's performance be further improved without additional human annotations? Exploring new data augmentation and self-supervised learning methods is needed.
- Open Question 7: How can MARCH's application be extended across different domains and tasks? Its performance and adaptability need to be verified in other complex tasks.
Applications
Immediate Applications
Financial Analysis
Enhancing the accuracy of financial reports and analyses, reducing risks associated with misinformation, and aiding financial professionals in making more accurate decisions.
Legal Document Drafting
Assisting legal professionals in case analysis and legal document drafting, ensuring information accuracy and improving the quality of legal services.
Medical Literature Analysis
Supporting the retrieval and analysis of medical literature, providing accurate medical advice, and aiding healthcare professionals in making better diagnostic and treatment decisions.
Long-term Vision
Intelligent Assistants
Developing more intelligent personal assistants capable of providing accurate information and advice across multiple domains, enhancing user experience and satisfaction.
Automated Decision Systems
Applying automated decision systems across various industries to improve efficiency and accuracy, reducing human errors and biases.
Abstract
Hallucination remains a critical bottleneck for large language models (LLMs), undermining their reliability in real-world applications, especially in Retrieval-Augmented Generation (RAG) systems. While existing hallucination detection methods employ LLM-as-a-judge to verify LLM outputs against retrieved evidence, they suffer from inherent confirmation bias, where the verifier inadvertently reproduces the errors of the original generation. To address this, we introduce Multi-Agent Reinforced Self-Check for Hallucination (MARCH), a framework that enforces rigorous factual alignment by leveraging deliberate information asymmetry. MARCH orchestrates a collaborative pipeline of three specialized agents: a Solver, a Proposer, and a Checker. The Solver generates an initial RAG response, which the Proposer decomposes into claim-level verifiable atomic propositions. Crucially, the Checker validates these propositions against retrieved evidence in isolation, deprived of the Solver's original output. This well-crafted information asymmetry scheme breaks the cycle of self-confirmation bias. By training this pipeline with multi-agent reinforcement learning (MARL), we enable the agents to co-evolve and optimize factual adherence. Extensive experiments across hallucination benchmarks demonstrate that MARCH substantially reduces hallucination rates. Notably, an 8B-parameter LLM equipped with MARCH achieves performance competitive with powerful closed-source models. MARCH paves a scalable path for factual self-improvement of LLMs through co-evolution. The code is at https://github.com/Qwen-Applications/MARCH.