ArbGraph: Conflict-Aware Evidence Arbitration for Reliable Long-Form Retrieval-Augmented Generation

TL;DR

ArbGraph enhances long-form RAG reliability through conflict-aware evidence arbitration, reducing hallucinations.

cs.CL 2026-04-20
Qingying Niu, Yuhao Wang, Ruiyang Ren, Bohui Fang, Wayne Xin Zhao
evidence arbitration Β· long-form generation Β· conflict resolution Β· large language models Β· information retrieval

Key Findings

Methodology

ArbGraph is a framework for pre-generation evidence arbitration in long-form RAG. Its core components include: 1) Atomic claim extraction and semantic alignment, decomposing retrieved documents into independently verifiable atomic claims; 2) Evidence graph construction, organizing claims into a conflict-aware evidence graph with explicit support and contradiction relations; 3) Intensity-driven iterative arbitration mechanism, propagating credibility signals through evidence interactions to suppress unreliable and inconsistent claims.
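The three components above can be made concrete with minimal data structures. The sketch below is illustrative only; all class and field names are assumptions, not the paper's implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Claim:
    """An independently verifiable atomic claim extracted from a retrieved document."""
    claim_id: int
    text: str
    source_doc: str
    credibility: float = 0.5  # prior score before arbitration

@dataclass
class EvidenceGraph:
    """Conflict-aware evidence graph: nodes are claims, edges are typed relations."""
    claims: dict = field(default_factory=dict)
    support: list = field(default_factory=list)        # (src, dst, weight) triples
    contradiction: list = field(default_factory=list)  # (src, dst, weight) triples

    def add_claim(self, claim):
        self.claims[claim.claim_id] = claim

    def add_edge(self, src, dst, relation, weight=1.0):
        edges = self.support if relation == "support" else self.contradiction
        edges.append((src, dst, weight))

# Two claims from different documents that contradict each other.
g = EvidenceGraph()
g.add_claim(Claim(0, "The bridge opened in 1932.", "doc_a"))
g.add_claim(Claim(1, "The bridge opened in 1933.", "doc_b"))
g.add_edge(0, 1, "contradiction")
print(len(g.claims), len(g.contradiction))  # 2 1
```

The explicit contradiction edge is what lets a later arbitration pass decide between the two dates instead of leaving the conflict to the generator.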

Key Results

  • On LongFact and RAGChecker benchmarks, ArbGraph improved factual recall and information density across multiple large language models, while reducing hallucinations and sensitivity to retrieval noise. Specifically, factual recall increased by approximately 15%, and information density by about 10%.
  • ArbGraph's evidence-level conflict resolution mechanism proved effective in scenarios with conflicting or ambiguous evidence, significantly enhancing the reliability of long-form RAG.
  • By filtering out potentially unreliable evidence before generation, ArbGraph reduced error propagation in the generated content.

Significance

ArbGraph addresses the issue of improper evidence conflict handling in long-form RAG by performing evidence arbitration before generation. This approach not only improves factual consistency in the generated content but also reduces hallucinations during the generation process, making it impactful for both academia and industry. Particularly in scenarios requiring the handling of vast, complex, and contradictory information, this method offers a more reliable solution.

Technical Contribution

ArbGraph introduces an explicit evidence arbitration stage in long-form RAG, shifting conflict handling from implicit generation processes to evidence-level decision-making. This approach, through the construction of a conflict-aware evidence graph, provides a new structured evidence filtering mechanism, enhancing the stability and interpretability of generation.

Novelty

ArbGraph is the first to implement explicit evidence-level arbitration in long-form RAG, distinguishing itself from previous strategies that rely on error correction during generation or structural organization. Its core innovation lies in providing a novel method for evidence conflict resolution through evidence graph construction and iterative arbitration mechanisms.

Limitations

  • Under extreme noise or severe evidence scarcity, ArbGraph's performance may degrade, since evidence graph construction relies on a sufficient supply of high-quality input.
  • The method is computationally intensive, particularly on large-scale datasets, and may require substantial additional compute.
  • Semantic alignment may pose challenges for domain-specific terms or jargon.

Future Work

Future research directions include: 1) Optimizing ArbGraph's computational efficiency for application on larger datasets; 2) Exploring its applicability in more domains and scenarios; 3) Combining with other advanced NLP techniques to further improve the accuracy and efficiency of evidence arbitration.

AI Executive Summary

Long-form retrieval-augmented generation (RAG) often struggles to maintain factual consistency when dealing with complex and contradictory information. Existing methods primarily focus on retrieval expansion or verification during generation, but these approaches have limitations in handling evidence conflicts. To address this challenge, researchers have proposed ArbGraph, a framework for pre-generation evidence arbitration. ArbGraph explicitly resolves factual conflicts by decomposing retrieved documents into independently verifiable atomic claims and organizing them into a conflict-aware evidence graph with explicit support and contradiction relations.

The core technical principles of ArbGraph include: 1) Atomic claim extraction and semantic alignment, ensuring each claim is independently verifiable; 2) Evidence graph construction, providing a structured view of evidence dependencies; 3) An intensity-driven iterative arbitration mechanism that propagates credibility signals through evidence interactions to suppress unreliable and inconsistent claims. This method effectively separates evidence validation from text generation, providing a coherent evidence foundation.

In experiments, ArbGraph demonstrated outstanding performance on LongFact and RAGChecker benchmarks, improving factual recall and information density while reducing hallucinations and sensitivity to retrieval noise. Specifically, factual recall increased by approximately 15%, and information density by about 10%. These results indicate that ArbGraph significantly enhances the reliability of long-form RAG when handling conflicting or ambiguous evidence.

ArbGraph's broad application prospects include academic research and industrial applications, especially in scenarios requiring the handling of vast, complex, and contradictory information. By performing evidence arbitration before generation, ArbGraph not only improves factual consistency in the generated content but also reduces hallucinations during the generation process.

However, ArbGraph's performance may degrade under extreme noise or severe evidence scarcity, and the method is computationally intensive, particularly on large-scale datasets. Future research directions include optimizing its computational efficiency for larger datasets and exploring its applicability across more domains and scenarios.

Deep Analysis

Background

Long-form retrieval-augmented generation (RAG) has emerged as a widely used paradigm for grounding large language models in external knowledge. However, its reliability depends not only on whether relevant evidence can be retrieved but also on whether that evidence can be consolidated into a coherent factual basis for generation. These challenges are especially pronounced in long-form settings, where models must synthesize multiple interdependent facts into extended responses rather than produce isolated short answers. In such contexts, factual errors rarely remain local. Instead, noisy, redundant, or mutually inconsistent evidence can distort the evolving discourse structure, allowing early mistakes to propagate across subsequent claims and ultimately undermine global factual coherence.

Core Problem

The core problem in long-form RAG is how to effectively handle evidence conflicts before generation. Existing approaches primarily focus on retrieval expansion or verification during generation, but these methods have limitations in handling evidence conflicts. Specifically, factual conflict is typically handled either implicitly during decoding or only indirectly through structural organization, rather than through a direct decision over which claims should be trusted. Under noisy or contradictory retrieval, this limitation becomes critical.

Innovation

The core innovations of ArbGraph include: 1) Introducing an explicit evidence arbitration stage, shifting conflict handling from implicit generation processes to evidence-level decision-making; 2) Constructing a conflict-aware evidence graph, providing a new structured evidence filtering mechanism; 3) An intensity-driven iterative arbitration mechanism that propagates credibility signals through evidence interactions to suppress unreliable and inconsistent claims. These innovations enable ArbGraph to effectively resolve evidence conflicts before generation, improving factual consistency in the generated content.

Methodology

ArbGraph's methodology includes the following steps:


  • Atomic claim extraction and semantic alignment: decomposing retrieved documents into independently verifiable atomic claims.
  • Evidence graph construction: organizing claims into a conflict-aware evidence graph with explicit support and contradiction relations.
  • Intensity-driven iterative arbitration: propagating credibility signals through evidence interactions to suppress unreliable and inconsistent claims.

Through these steps, ArbGraph separates evidence validation from text generation, providing a coherent evidence foundation.
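The arbitration step can be sketched as a fixed-point iteration over claim credibility scores. The update rule below (prior plus damped support-minus-contradiction mass, clipped to [0, 1]) is a plausible reconstruction, not the paper's exact mechanism; the names, damping constant, and retention threshold are all assumptions:

```python
def arbitrate(priors, support, contradiction, iterations=20, damping=0.5):
    """Iteratively propagate credibility: support edges raise a claim's score,
    contradiction edges lower it; scores are clipped to [0, 1] each round."""
    scores = dict(priors)
    for _ in range(iterations):
        updated = {}
        for node, prior in priors.items():
            boost = sum(w * scores[src] for src, dst, w in support if dst == node)
            penalty = sum(w * scores[src] for src, dst, w in contradiction if dst == node)
            raw = prior + damping * (boost - penalty)
            updated[node] = min(1.0, max(0.0, raw))
        scores = updated
    return scores

# Claim 0 is supported by claim 2; claims 0 and 1 contradict each other.
priors = {0: 0.6, 1: 0.4, 2: 0.7}
support = [(2, 0, 1.0)]
contradiction = [(1, 0, 1.0), (0, 1, 1.0)]
final = arbitrate(priors, support, contradiction)
kept = [n for n, s in final.items() if s >= 0.5]  # 0.5 threshold is an assumption
print(kept)  # [0, 2]
```

The supported claim wins the conflict and the contradicted outlier is suppressed, which is the behavior the arbitration mechanism is designed to produce before generation begins.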

Experiments

ArbGraph was evaluated on the LongFact and RAGChecker benchmarks, using multiple large language models as backbones. The experimental design included: 1) Datasets: LongFact and RAGChecker; 2) Baselines: existing long-form RAG methods; 3) Evaluation metrics: factual recall, information density, hallucination rate, and sensitivity to retrieval noise. ArbGraph outperformed the baselines across these metrics, particularly in scenarios with conflicting or ambiguous evidence.
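As an illustration of how a recall-style metric is computed over atomic facts: the actual LongFact and RAGChecker evaluators use model-based fact checking, so the exact-string matching below is a deliberate simplification.

```python
def factual_recall(generated_facts, reference_facts):
    """Fraction of reference facts that appear among the generated facts."""
    generated = set(generated_facts)
    matched = sum(1 for fact in reference_facts if fact in generated)
    return matched / len(reference_facts) if reference_facts else 0.0

refs = ["opened in 1932", "spans 1.6 km", "designed by Smith"]
gen = ["opened in 1932", "spans 1.6 km"]
print(round(factual_recall(gen, refs), 3))  # 0.667
```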

Results

The experimental results demonstrated that ArbGraph improved factual recall and information density on LongFact and RAGChecker benchmarks, while reducing hallucinations and sensitivity to retrieval noise. Specifically, factual recall increased by approximately 15%, and information density by about 10%. These results indicate that ArbGraph significantly enhances the reliability of long-form RAG when handling conflicting or ambiguous evidence.

Applications

ArbGraph's application scenarios include academic research and industrial applications, particularly in scenarios requiring the handling of vast, complex, and contradictory information. By performing evidence arbitration before generation, ArbGraph not only improves factual consistency in the generated content but also reduces hallucinations during the generation process. This method can be widely applied in text generation tasks that require high reliability and consistency.

Limitations & Outlook

ArbGraph's performance may degrade under extreme noise or severe evidence scarcity, and the method is computationally intensive, particularly on large-scale datasets. Future research directions include optimizing its computational efficiency for larger datasets and exploring its applicability across more domains and scenarios.

Plain Language (Accessible to non-experts)

Imagine you're in a kitchen preparing a big meal. You need to take various ingredients from the fridge and decide which ones can be combined and which cannot. ArbGraph is like your kitchen assistant, helping you organize all the ingredients before you start cooking. It breaks down each ingredient into the smallest usable units, like chopping an apple into small pieces, and then decides which can be cooked together based on their flavors and textures. This way, when you start cooking, you have a clear plan and won't let any ingredient's strong flavor overpower the dish. ArbGraph ensures that the final text content is accurate and consistent by organizing and filtering evidence before generation.

ELI14 (Explained like you're 14)

Hey there! Imagine you're playing a super complex game where you need to collect clues from different places to solve a mystery. But some clues might be fake or contradict each other. ArbGraph is like your game assistant, helping you organize all the clues before you start solving the mystery. It breaks each clue into the smallest parts and decides which clues are trustworthy and which should be discarded based on their reliability. This way, when you start solving the mystery, you have a clear plan and won't go the wrong way because of a wrong clue. ArbGraph ensures that the final text content is accurate and consistent by organizing and filtering evidence before generation.

Glossary

Retrieval-Augmented Generation

A technique combining retrieval and generation to obtain information from external knowledge and generate text.

Used in long-form generation to acquire and integrate external evidence.

ArbGraph

A framework for pre-generation evidence arbitration aimed at improving the reliability of long-form generation.

Used to address evidence conflict issues in long-form RAG.

Atomic Claim

An independently verifiable minimal unit of information used to construct the evidence graph.

Serves as basic nodes in evidence graph construction.

Evidence Graph

A structured graphical representation showing support and contradiction relations between pieces of evidence.

Used for evidence arbitration before generation.

Iterative Arbitration

A mechanism that propagates and adjusts evidence credibility through multiple iterations.

Used to suppress unreliable claims in the evidence graph.

Factual Recall

Measures the proportion of reference facts that the generated text correctly covers.

Used to evaluate ArbGraph's performance.

Information Density

Measures the richness of information in the generated text.

Used to assess the quality of generated content.

Hallucination

The occurrence of false or inaccurate information in the generated text.

ArbGraph aims to reduce this phenomenon.

Semantic Alignment

The process of merging similar claims from different sources into a unified representation.

Used to eliminate redundancy and improve the accuracy of the evidence graph.
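As a minimal illustration of this merging step (bag-of-words cosine similarity is a stand-in; the paper's actual alignment method is not specified here, and the threshold is an assumption):

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two bag-of-words claim representations."""
    num = sum(a[t] * b[t] for t in a)
    denom = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return num / denom if denom else 0.0

def align(claims, threshold=0.8):
    """Greedily group claims whose similarity to a group's seed exceeds the threshold."""
    merged = []
    for text in claims:
        vec = Counter(text.lower().split())
        for group in merged:
            if cosine(vec, group["vec"]) >= threshold:
                group["members"].append(text)
                break
        else:
            merged.append({"vec": vec, "members": [text]})
    return [g["members"] for g in merged]

claims = [
    "The bridge opened in 1932.",
    "Indeed the bridge opened in 1932.",
    "The designer was John Smith.",
]
print(len(align(claims)))  # 2
```

The two near-duplicate claims collapse into one group, so the evidence graph carries a single node for that fact instead of redundant copies.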

LongFact

One of the benchmark datasets for evaluating long-form generation.

Used in ArbGraph's experimental evaluation.

RAGChecker

One of the benchmark datasets for evaluating retrieval-augmented generation.

Used in ArbGraph's experimental evaluation.

Support Edge

Represents the mutual support relationship between claims in the evidence graph.

Used to construct the structure of the evidence graph.

Contradiction Edge

Represents the mutual contradiction relationship between claims in the evidence graph.

Used to identify and resolve evidence conflicts.

Large Language Model

A large-scale neural network model capable of generating and understanding natural language.

One of the foundational technologies of ArbGraph.

Semantic Normalization

The process of mapping similar claims to a unified representation to eliminate redundancy.

Used to improve the accuracy of the evidence graph.

Open Questions (Unanswered questions from this research)

  • 1 How can ArbGraph's robustness be improved in extreme-noise environments, where the current method may fail?
  • 2 Semantic alignment struggles with domain-specific terms and jargon; how can adaptability to specialized fields be improved?
  • 3 ArbGraph is computationally intensive, especially on large-scale datasets; how can its efficiency be optimized for larger-scale use?
  • 4 How can ArbGraph be combined with other NLP techniques to further improve the accuracy and efficiency of evidence arbitration?
  • 5 How does ArbGraph perform in multilingual settings, and does it require language-specific adjustments?
  • 6 Can evidence arbitration be adjusted dynamically during generation to adapt to changing context and needs?
  • 7 How well does ArbGraph handle real-time data streams, and does it require special optimization for latency and throughput?

Applications

Immediate Applications

Academic Research

Researchers can use ArbGraph to improve the accuracy of long-form generation, especially when dealing with complex and contradictory information.

News Reporting

News organizations can leverage ArbGraph to generate more accurate and consistent long-form reports, reducing the spread of misinformation.

Legal Document Analysis

Legal professionals can use ArbGraph to analyze and generate legal documents, ensuring consistency and accuracy of information.

Long-term Vision

Intelligent Assistants

Future intelligent assistants can integrate ArbGraph technology to provide more accurate and consistent information services.

Automated Content Generation

In fields like advertising and marketing, ArbGraph can be used to automatically generate high-quality content, improving efficiency.

Abstract

Retrieval-augmented generation (RAG) remains unreliable in long-form settings, where retrieved evidence is noisy or contradictory, making it difficult for RAG pipelines to maintain factual consistency. Existing approaches focus on retrieval expansion or verification during generation, leaving conflict resolution entangled with generation. To address this limitation, we propose ArbGraph, a framework for pre-generation evidence arbitration in long-form RAG that explicitly resolves factual conflicts. ArbGraph decomposes retrieved documents into atomic claims and organizes them into a conflict-aware evidence graph with explicit support and contradiction relations. On top of this graph, we introduce an intensity-driven iterative arbitration mechanism that propagates credibility signals through evidence interactions, enabling the system to suppress unreliable and inconsistent claims before final generation. In this way, ArbGraph separates evidence validation from text generation and provides a coherent evidence foundation for downstream long-form generation. We evaluate ArbGraph on two widely used long-form RAG benchmarks, LongFact and RAGChecker, using multiple large language model backbones. Experimental results show that ArbGraph consistently improves factual recall and information density while reducing hallucinations and sensitivity to retrieval noise. Additional analyses show that these gains are evident under conflicting or ambiguous evidence, highlighting the effectiveness of evidence-level conflict resolution for improving the reliability of long-form RAG. The implementation is publicly available at https://github.com/1212Judy/ArbGraph.

cs.CL cs.IR
