MARCH: Multi-Agent Reinforced Self-Check for LLM Hallucination

TL;DR

The MARCH framework substantially reduces LLM hallucination through a multi-agent reinforced self-check pipeline; an 8B-parameter model equipped with MARCH achieves factual consistency competitive with powerful closed-source models.

cs.CL · Advanced · 2026-03-26
Zhuo Li, Yupeng Zhang, Pengyu Cheng, Jiajun Song, Mengyu Zhou, Hao Li, Shujie Hu, Yu Qin, Erchao Zhao, Xiaoxi Jiang, Guanjun Jiang
multi-agent reinforcement learning · hallucination detection · large language model · factual consistency

Key Findings

Methodology

The MARCH framework employs multi-agent reinforcement learning to achieve self-checking of hallucinations in large language models. It consists of three specialized agents: the Solver generates initial responses, the Proposer decomposes these into verifiable atomic propositions, and the Checker validates these propositions without referencing the Solver's output. This information asymmetry design breaks the cycle of confirmation bias. Multi-agent reinforcement learning allows the agents to co-evolve and optimize factual adherence.
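The pipeline above can be sketched as a minimal orchestrator. The agent internals below are hypothetical placeholders (in the paper, each agent is an LLM policy); the structural point the sketch preserves is that the Checker is called with only a proposition and the retrieved documents, never the Solver's full response.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Minimal sketch of the MARCH pipeline; agent internals are placeholders,
# since the paper's prompts and policies are not reproduced here.

@dataclass
class MARCHPipeline:
    solver: Callable[[str, str], str]     # (query, docs) -> response text
    proposer: Callable[[str], List[str]]  # response -> atomic propositions
    checker: Callable[[str, str], bool]   # (proposition, docs) -> verdict

    def run(self, query: str, docs: str) -> Tuple[str, List[str], List[bool]]:
        response = self.solver(query, docs)
        propositions = self.proposer(response)
        # Information asymmetry: the Checker never sees `response`, only
        # individual propositions plus the retrieved documents.
        verdicts = [self.checker(p, docs) for p in propositions]
        return response, propositions, verdicts
```

With toy lambdas standing in for the agents, an unsupported claim surfaces as a False verdict rather than being rubber-stamped alongside the rest of the response.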

Key Results

  • Result 1: MARCH significantly reduces hallucination rates across multiple hallucination benchmarks; an 8B-parameter LLM equipped with MARCH achieves performance comparable to powerful closed-source models, demonstrating its effectiveness on RAG tasks.
  • Result 2: On the RAGTruth and FaithBench benchmarks, MARCH-STEM and MARCH-General increase average accuracy to 74.93% and 75.23%, respectively, a significant improvement from the base model's 55.20%.
  • Result 3: In the Facts Grounding benchmark, MARCH-STEM and MARCH-General achieve factuality scores of 85.23% and 80.12%, significantly higher than the base model's 57.09%.

Significance

The MARCH framework significantly enhances factual consistency in large language models, particularly in data-intensive tasks, by breaking the cycle of confirmation bias. This method provides a scalable path for self-improvement of LLMs, crucial for improving reliability in high-stakes domains such as finance, law, and healthcare. MARCH's success demonstrates the potential of multi-agent collaboration in complex tasks, boosting the credibility of LLMs in real-world applications.

Technical Contribution

MARCH achieves self-checking of LLM hallucinations through multi-agent reinforcement learning, overcoming the limitations of traditional methods. Its innovation lies in introducing an information asymmetry-based collaborative mechanism that breaks confirmation bias. The framework operates without additional human annotations or external fact-checking tools, showcasing the potential of multi-agent collaboration in complex tasks and providing new engineering possibilities for LLM self-improvement.

Novelty

MARCH is the first to achieve self-checking of LLM hallucinations through multi-agent reinforcement learning, breaking the cycle of confirmation bias present in traditional methods. Compared to existing supervised fine-tuning and RLHF methods, MARCH offers a more granular factual verification mechanism through its information asymmetry design.

Limitations

  • Limitation 1: MARCH's performance depends on the quality and diversity of training data. In high-noise and highly heterogeneous documents, agents may struggle to effectively perform factual verification.
  • Limitation 2: Although MARCH performs well across multiple benchmarks, its generalization ability in specific domains still needs further validation.
  • Limitation 3: MARCH's computational cost is high, especially when training on large-scale datasets, potentially requiring substantial computational resources.

Future Work

Future work can focus on optimizing MARCH's computational efficiency and extending its generalization capabilities across different domains. Further research could explore applying MARCH on larger-scale datasets and evaluating its performance in other complex tasks. Additionally, integrating other advanced reinforcement learning techniques may further enhance MARCH's factual consistency.

AI Executive Summary

Hallucination remains a critical bottleneck for large language models (LLMs), particularly in Retrieval-Augmented Generation (RAG) systems. Existing hallucination detection methods often employ LLM-as-a-judge to verify outputs, but this approach suffers from inherent confirmation bias, leading verifiers to inadvertently reproduce the original generation's errors.

To address this issue, the paper introduces the Multi-Agent Reinforced Self-Check (MARCH) framework, which enforces rigorous factual alignment by leveraging deliberate information asymmetry. The MARCH framework orchestrates a collaborative pipeline of three specialized agents: a Solver, a Proposer, and a Checker. The Solver generates an initial RAG response, the Proposer decomposes it into claim-level verifiable atomic propositions, and the Checker validates these propositions in isolation, without access to the Solver's original output.

This well-crafted information asymmetry scheme breaks the cycle of self-confirmation bias. By training this pipeline with multi-agent reinforcement learning, the agents can co-evolve and optimize factual adherence. Extensive experiments demonstrate that MARCH substantially reduces hallucination rates. Notably, an 8B-parameter LLM equipped with MARCH achieves performance competitive with powerful closed-source models.
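The summary leaves open how claim-level verdicts become a scalar training signal for the multi-agent reinforcement learning step. One plausible shaping is sketched below; this is an assumption for illustration, not the paper's actual reward design, and the penalty for producing nothing verifiable is likewise hypothetical.

```python
def factuality_reward(verdicts: list, penalty: float = 1.0) -> float:
    """Hypothetical shared reward for MARL training: the fraction of atomic
    propositions the Checker verifies against the evidence. An empty
    proposition set is penalized so the agents cannot game the signal by
    producing nothing checkable. Illustrative only; the paper's reward
    shaping may differ."""
    if not verdicts:
        return -penalty
    return sum(verdicts) / len(verdicts)
```

Under this shaping, the Solver and Proposer are jointly pushed toward responses whose claims survive isolated verification, which is the co-evolution the framework aims for.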

MARCH paves a scalable path for factual self-improvement of LLMs through co-evolution. The framework operates without additional human annotations or external fact-checking tools, showcasing the potential of multi-agent collaboration in complex tasks.

However, MARCH's performance depends on the quality and diversity of training data. In high-noise and highly heterogeneous documents, agents may struggle to effectively perform factual verification. Additionally, although MARCH performs well across multiple benchmarks, its generalization ability in specific domains still needs further validation. Future research can focus on optimizing MARCH's computational efficiency and extending its generalization capabilities across different domains.

Deep Analysis

Background

Large language models (LLMs) have made significant strides in natural language processing, particularly in generation and comprehension tasks. However, LLMs often exhibit hallucinations, generating content that contradicts factual information. This phenomenon is especially prevalent in Retrieval-Augmented Generation (RAG) systems, which rely on retrieving information from external documents to generate responses. Existing hallucination detection methods typically use LLMs as judges to verify outputs, but these methods suffer from inherent confirmation bias, leading verifiers to inadvertently reproduce the original generation's errors. To enhance LLMs' factual consistency, researchers have explored various methods, including supervised fine-tuning and reinforcement learning from human feedback (RLHF). However, these methods still fall short in providing fine-grained factual consistency.

Core Problem

The hallucination problem in LLMs is a major challenge for their practical application. Hallucinations not only undermine the model's credibility but can also have serious consequences in high-stakes domains such as finance, law, and healthcare. Existing hallucination detection methods often rely on LLMs for verification, but this approach is prone to confirmation bias, leading to inaccurate verification results. Additionally, traditional reinforcement learning methods lack the granularity needed to supervise fine-grained factual consistency, making it difficult to meet the complex demands of RAG tasks.

Innovation

The MARCH framework achieves self-checking of LLM hallucinations through multi-agent reinforcement learning, overcoming the limitations of traditional methods. Its innovations include:


  • Introducing an information-asymmetry-based collaborative mechanism that breaks confirmation bias.

  • Designing three specialized agents: the Solver generates initial responses, the Proposer decomposes them into verifiable atomic propositions, and the Checker validates those propositions without referencing the Solver's output.

  • Enabling the agents to co-evolve and optimize factual adherence through multi-agent reinforcement learning.

These innovations allow MARCH to significantly enhance LLMs' factual consistency without relying on additional human annotations or external tools.

Methodology

The MARCH framework achieves self-checking of LLM hallucinations through the following steps:


  • Solver generates the initial RAG response: an initial answer is produced from the input query and the retrieved documents.

  • Proposer decomposes the response into atomic propositions: the generated response is broken down into a series of verifiable atomic propositions for subsequent validation.

  • Checker validates the atomic propositions: each proposition is validated against the retrieved documents, without access to the Solver's output.

  • Multi-agent reinforcement learning: the pipeline is trained with multi-agent reinforcement learning, allowing the agents to co-evolve and optimize factual adherence.

This method breaks the cycle of self-confirmation bias through its information asymmetry design, enhancing LLMs' factual consistency.
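The Proposer's decomposition step can be illustrated with a naive sentence-level stand-in. The real Proposer is an LLM that produces genuinely atomic, self-contained claims, so the regex split below is only a toy approximation of the interface.

```python
import re

def naive_decompose(response: str) -> list:
    """Toy stand-in for the Proposer: split a response into sentence-level
    claims. The actual Proposer is an LLM producing atomic propositions
    (one independently checkable fact each); this split is illustrative
    only and does not resolve pronouns or compound sentences."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", response) if s.strip()]
```

Each resulting claim can then be handed to the Checker in isolation, which is what makes the validation step independent of the full response.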

Experiments

The experiments evaluate MARCH across multiple hallucination benchmarks. Training datasets include BioASQ, 2WikiMultiHopQA, and MuSiQue, covering different domains and task types. Meta-Llama-3.1-8B-Instruct serves as the initial policy and is trained with multi-agent reinforcement learning. Key hyperparameters include the learning rate, batch size, and number of training epochs. Ablation studies verify the contribution of each component.

Results

Experimental results show that MARCH significantly reduces hallucination rates across multiple benchmarks. Specifically, on the RAGTruth and FaithBench benchmarks, MARCH-STEM and MARCH-General increase average accuracy to 74.93% and 75.23%, respectively, a significant improvement from the base model's 55.20%. In the Facts Grounding benchmark, MARCH-STEM and MARCH-General achieve factuality scores of 85.23% and 80.12%, significantly higher than the base model's 57.09%. These results demonstrate MARCH's effectiveness in enhancing LLMs' factual consistency.
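For concreteness, the "average accuracy" reported across RAGTruth and FaithBench is presumably a macro-average over benchmarks; the helper below shows that aggregation under that assumption (the paper's exact protocol may differ).

```python
def macro_average_accuracy(results: dict) -> float:
    """Macro-average accuracy across benchmarks.

    `results` maps benchmark name -> (num_correct, num_total). Assumes each
    benchmark contributes equally regardless of size, which is an assumption
    about the paper's aggregation, not a documented detail."""
    per_bench = [correct / total for correct, total in results.values()]
    return sum(per_bench) / len(per_bench)
```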

Applications

The MARCH framework has broad application potential across multiple domains. Direct application scenarios include:


  • Financial analysis: enhancing the accuracy of financial reports and analyses, reducing risks associated with misinformation, and aiding financial professionals in making more accurate decisions.

  • Legal document drafting: assisting legal professionals in case analysis and legal document drafting, ensuring information accuracy and improving the quality of legal services.

  • Medical literature analysis: supporting the retrieval and analysis of medical literature, providing accurate medical advice, and aiding healthcare professionals in making better diagnostic and treatment decisions.

These application scenarios require high-quality data and robust computational capabilities to fully leverage MARCH's potential.

Limitations & Outlook

Although MARCH performs well across multiple benchmarks, its generalization ability in specific domains still needs further validation. Additionally, MARCH's computational cost is high, especially when training on large-scale datasets, potentially requiring substantial computational resources. Future research can focus on optimizing MARCH's computational efficiency and extending its generalization capabilities across different domains.

Plain Language (accessible to non-experts)

Imagine you're in a kitchen preparing a meal. You have a helper (Solver) who prepares all the dishes according to the recipe. Then, you have an inspector (Proposer) who breaks each dish down into specific checkable claims, like "the vegetables were chopped" or "the fish was fried". Finally, you have a taster (Checker) who tastes each dish independently, comparing it against the recipe, without ever watching how the helper cooked.

In this process, the helper might make mistakes, such as using the wrong ingredients or following incorrect steps. The taster's job is to independently verify the taste of each dish, ensuring quality rather than simply trusting the helper's judgment.

The MARCH framework is like this kitchen team, using multi-agent collaboration to ensure that the content generated by large language models aligns with factual information. Each agent has its role and task, and by leveraging information asymmetry, MARCH breaks the cycle of confirmation bias present in traditional methods, significantly improving the model's accuracy and reliability.

ELI14 (explained like you're 14)

Hey there! Imagine you're playing a super complex game that requires you and your friends to work together. You're the main character (Solver) in the game, responsible for making decisions based on the mission guide. Then, one of your friends (Proposer) breaks your mission into smaller tasks, like fighting monsters or collecting items. Finally, there's another friend (Checker) who independently checks if each small task is done well without watching what you did.

This game has a big challenge: sometimes you might make mistakes, like fighting the wrong monster or collecting the wrong items. The Checker’s job is to ensure each small task is done correctly, rather than just trusting what you did.

This is like the MARCH framework, using multi-agent collaboration to ensure that the content generated by large language models aligns with factual information. Each agent has its role and task, and by leveraging information asymmetry, MARCH breaks the cycle of confirmation bias present in traditional methods, significantly improving the model's accuracy and reliability. Isn't that cool?

Glossary

Multi-Agent Reinforcement Learning

A machine learning approach where multiple agents learn to solve complex tasks through collaboration and competition.

MARCH uses multi-agent reinforcement learning to train agents for factual verification.

Hallucination

In natural language processing, the phenomenon where a model generates content that contradicts factual information.

MARCH aims to reduce hallucinations in large language models.

Information Asymmetry

A situation in information processing where different participants have varying amounts of information.

MARCH leverages information asymmetry to break confirmation bias.

Retrieval-Augmented Generation (RAG)

A generation method that enhances content accuracy by retrieving information from external documents.

MARCH is applied in RAG systems to improve factual consistency.

Confirmation Bias

A cognitive bias where individuals tend to validate existing beliefs while ignoring contradictory evidence.

MARCH breaks the cycle of confirmation bias through its design.

Supervised Fine-Tuning (SFT)

Fine-tuning a model using labeled data to improve its performance on specific tasks.

Traditional SFT methods have limitations in fine-grained factual consistency.

Reinforcement Learning from Human Feedback (RLHF)

A reinforcement learning approach that optimizes model decisions based on human feedback.

RLHF is commonly used to align model outputs, including improving factual consistency.

Atomic Proposition

A minimal unit of fact that can be independently verified.

The Proposer decomposes responses into verifiable atomic propositions.

Ablation Study

An experimental approach to assess the impact of removing or replacing certain parts of a model on its overall performance.

Ablation studies were conducted to verify the contribution of each component.

Factual Consistency

The alignment of generated content with real-world facts.

MARCH aims to enhance the factual consistency of large language models.

Open Questions (unanswered questions from this research)

  • Open Question 1: How can effective factual verification be performed on high-noise, highly heterogeneous documents? Current methods perform poorly in these complex scenarios, and more robust algorithms are needed to improve verification accuracy.
  • Open Question 2: What is MARCH's generalization ability in specific domains? Although it performs well across multiple benchmarks, its adaptability to particular domains still needs further validation.
  • Open Question 3: How can MARCH's computational cost be reduced? Training on large-scale datasets currently requires substantial computational resources.
  • Open Question 4: How can MARCH's computational efficiency be further optimized for application to even larger datasets?
  • Open Question 5: Can other advanced reinforcement learning techniques be integrated to further enhance MARCH's factual consistency?
  • Open Question 6: How can MARCH's performance be improved without additional human annotations? New data augmentation and self-supervised learning methods are worth exploring.
  • Open Question 7: How can MARCH be extended to other domains and tasks? Its performance and adaptability in other complex tasks remain to be verified.

Applications

Immediate Applications

Financial Analysis

Enhancing the accuracy of financial reports and analyses, reducing risks associated with misinformation, and aiding financial professionals in making more accurate decisions.

Legal Document Drafting

Assisting legal professionals in case analysis and legal document drafting, ensuring information accuracy and improving the quality of legal services.

Medical Literature Analysis

Supporting the retrieval and analysis of medical literature, providing accurate medical advice, and aiding healthcare professionals in making better diagnostic and treatment decisions.

Long-term Vision

Intelligent Assistants

Developing more intelligent personal assistants capable of providing accurate information and advice across multiple domains, enhancing user experience and satisfaction.

Automated Decision Systems

Applying automated decision systems across various industries to improve efficiency and accuracy, reducing human errors and biases.

Abstract

Hallucination remains a critical bottleneck for large language models (LLMs), undermining their reliability in real-world applications, especially in Retrieval-Augmented Generation (RAG) systems. While existing hallucination detection methods employ LLM-as-a-judge to verify LLM outputs against retrieved evidence, they suffer from inherent confirmation bias, where the verifier inadvertently reproduces the errors of the original generation. To address this, we introduce Multi-Agent Reinforced Self-Check for Hallucination (MARCH), a framework that enforces rigorous factual alignment by leveraging deliberate information asymmetry. MARCH orchestrates a collaborative pipeline of three specialized agents: a Solver, a Proposer, and a Checker. The Solver generates an initial RAG response, which the Proposer decomposes into claim-level verifiable atomic propositions. Crucially, the Checker validates these propositions against retrieved evidence in isolation, deprived of the Solver's original output. This well-crafted information asymmetry scheme breaks the cycle of self-confirmation bias. By training this pipeline with multi-agent reinforcement learning (MARL), we enable the agents to co-evolve and optimize factual adherence. Extensive experiments across hallucination benchmarks demonstrate that MARCH substantially reduces hallucination rates. Notably, an 8B-parameter LLM equipped with MARCH achieves performance competitive with powerful closed-source models. MARCH paves a scalable path for factual self-improvement of LLMs through co-evolution. The code is at https://github.com/Qwen-Applications/MARCH.

