Benchmarking LLM Agents on Meta-Analysis Articles from Nature Portfolio

TL;DR

This study benchmarks 12 LLM pipeline configurations on MetaSyn and finds a screening bottleneck: no system recovers more than 52.7% of ground-truth included studies, even though retrieval recall reaches 90.9% at K=200.

cs.CL 🔴 Advanced 2026-06-16 1 citation
Anzhe Xie, Weihang Su, Yujia Zhou, Yiqun Liu, Qingyao Ai
Large Language Models, Information Retrieval, Systematic Review, Benchmarking, Evidence Synthesis

Key Findings

Methodology

This paper introduces MetaSyn, a comprehensive dataset comprising 442 expert-curated meta-analyses from Nature journals, each paired with a PubMed corpus of 140,585 articles including verified positives and hard negatives. The dataset incorporates explicit PI/ECO research questions, search strategies, date bounds, and inclusion/exclusion criteria, enabling full pipeline evaluation. Nine RAG variants and a protocol-driven agent are systematically tested, with performance metrics including Recall@K, stage-specific accuracy, and expert validation. Results reveal a significant bottleneck at the screening stage: while retrieval models like MA-Retriever achieve 90.9% recall at K=200, no end-to-end system exceeds 52.7% coverage of ground-truth studies, highlighting the challenge of discriminating eligible studies from topically similar but PI/ECO-ineligible distractors.
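As a rough illustration of the retrieval metric named above, Recall@K can be read as the fraction of ground-truth included studies that appear among the top K ranked candidates. The function name, the identifiers, and the toy data below are assumptions made for illustration; this is a minimal sketch, not the paper's evaluation code.

```python
from typing import Sequence, Set


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of relevant items that appear in the top-k of a ranked list."""
    if not relevant_ids:
        raise ValueError("relevant_ids must not be empty")
    top_k = set(ranked_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


# Toy example with hypothetical identifiers (not taken from MetaSyn).
ranked = ["p7", "p2", "p9", "p4", "p1"]
included = {"p2", "p4", "p8"}
print(recall_at_k(ranked, included, k=3))  # one of three included studies is in the top 3
```

Because the denominator is the full set of included studies, a study that never enters the top K counts against recall no matter how well the later screening stage performs.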

Key Results

  • Retrieval models such as MA-Retriever reach 90.9% recall at K=200, but the overall inclusion coverage in end-to-end pipelines drops to a maximum of 52.7%, mainly due to the difficulty in filtering PI/ECO-ineligible studies. Comparative analysis shows that models excel in retrieval but struggle with fine-grained eligibility criteria, especially in hard negative scenarios. Stage-specific metrics demonstrate that the bottleneck is primarily in the screening phase, where models often misclassify studies that are topically similar but do not meet the strict PI/ECO standards.
  • Stage-wise evaluation indicates that precision and recall in the screening phase are significantly lower than retrieval, with false positives and negatives mainly arising from models' limited understanding of complex eligibility conditions. Expert validation confirms that current LLMs lack the nuanced reasoning needed for reliable qualification, emphasizing the importance of multi-stage evaluation. The discrepancy between high retrieval recall and low screening coverage underscores the need for more sophisticated eligibility reasoning modules.
  • The experimental results suggest that integrating structured reasoning, multi-modal data, and domain-specific knowledge could improve screening accuracy. The findings also highlight that single end-to-end scores obscure the performance variations across different pipeline stages, advocating for stage-attributed metrics to guide targeted improvements.
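One way to see why a single end-to-end score hides the screening loss is to factor coverage into a retrieval term and a screening term. The notation below is an assumption introduced for illustration, and the identity holds per meta-analysis (or for micro-averaged counts); the reported figures are aggregates, and the summary does not state that the best pipeline screens the K=200 pool, so the final ratio is only a rough indication.

```latex
% G: ground-truth included studies; C_K: top-K retrieved candidates;
% S \subseteq C_K: studies accepted at screening (notation assumed here).
\[
\underbrace{\frac{|G \cap S|}{|G|}}_{\text{end-to-end coverage}}
=
\underbrace{\frac{|G \cap C_K|}{|G|}}_{\text{Recall@}K}
\times
\underbrace{\frac{|G \cap S|}{|G \cap C_K|}}_{\text{screening recall on retrieved studies}}
\]
% Illustrative arithmetic only, assuming the 52.7% pipeline screens the 90.9% pool:
\[
\frac{0.527}{0.909} \approx 0.580
\]
```

Under that assumption, roughly four in ten retrieved eligible studies would be lost at screening, which is consistent with the bottleneck described above.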

Significance

This work provides a critical benchmark for AI-assisted evidence synthesis, addressing a longstanding gap in evaluating models across the entire meta-analysis pipeline. By establishing a verifiable, multi-stage evaluation framework, it enables systematic comparison of models' retrieval, screening, and synthesis capabilities. The insights gained reveal the current limitations of LLMs in nuanced eligibility reasoning, guiding future research toward more structured, explainable, and domain-aware AI systems. The MetaSyn dataset and evaluation protocol serve as foundational tools for accelerating the development of automated evidence synthesis, with potential impacts spanning clinical decision-making, policy formulation, and scientific research automation.

Technical Contribution

The paper's key technical innovation lies in constructing MetaSyn, a structured dataset with verified ground truth at each pipeline stage, enabling detailed performance analysis. The introduction of stage-attributed metrics allows disentangling retrieval and screening errors, providing precise diagnostics. The models evaluated include dense retrievers like DPR and ColBERT, fine-tuned on MetaSyn, combined with language models such as BERT and T5 for eligibility reasoning. The experimental framework incorporates multi-model comparisons, ablation studies, and expert validation, establishing a comprehensive performance landscape. This systematic approach advances the state of the art in automated meta-analysis, emphasizing the importance of multi-stage evaluation and structured reasoning modules.
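To make the idea of stage attribution concrete, the sketch below labels each ground-truth study by the stage at which it was lost. The function name, the three category labels, and the toy identifiers are hypothetical; the paper's actual metric definitions may differ.

```python
from typing import Dict, Iterable, Set


def attribute_stage_errors(
    ground_truth: Set[str],
    retrieved: Iterable[str],
    screened_in: Iterable[str],
) -> Dict[str, int]:
    """Count ground-truth studies by the pipeline stage where they were lost or kept."""
    retrieved_set = set(retrieved)
    # A study can only be accepted at screening if it survived retrieval.
    accepted = set(screened_in) & retrieved_set
    counts = {"missed_at_retrieval": 0, "rejected_at_screening": 0, "recovered": 0}
    for study in ground_truth:
        if study not in retrieved_set:
            counts["missed_at_retrieval"] += 1
        elif study not in accepted:
            counts["rejected_at_screening"] += 1
        else:
            counts["recovered"] += 1
    return counts


# Toy example with hypothetical identifiers.
truth = {"a", "b", "c", "d"}
stage_counts = attribute_stage_errors(truth, retrieved=["a", "b", "c", "x"], screened_in=["a", "x"])
print(stage_counts)
```

In this toy case one study is lost at retrieval and two at screening, so the breakdown points to screening as the larger source of loss, the same kind of diagnosis a single coverage number cannot give.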

Novelty

This is the first comprehensive benchmark that evaluates the entire meta-analysis pipeline with verifiable ground truth, covering retrieval, screening, and synthesis stages. Unlike prior work focusing on isolated tasks or relevance ranking, MetaSyn emphasizes the critical screening step governed by PI/ECO criteria, which is often overlooked. The stage-attributed metrics and expert validation introduce a nuanced evaluation framework that captures the complex reasoning involved in eligibility decisions. This holistic approach bridges the gap between IR and scientific reasoning, setting a new standard for AI evaluation in evidence synthesis.

Limitations

  • Models currently lack a deep understanding of complex eligibility criteria, especially in multi-condition scenarios, leading to high false negative and false positive rates during screening. This reflects limited reasoning capabilities and contextual comprehension.
  • MetaSyn's corpus, primarily based on PubMed, does not fully cover other important biomedical databases like EMBASE or Cochrane, which may affect generalizability across sources. Data annotation relies on manual extraction, introducing potential biases and inconsistencies.
  • The computational cost of training and inference, especially with large models and long texts, remains high, limiting scalability. Future work should focus on optimizing model efficiency and expanding multi-source coverage.

Future Work

Future research will focus on enhancing models' structured reasoning abilities, integrating domain knowledge bases, and leveraging multi-modal data to improve eligibility discrimination. Expanding MetaSyn to include additional databases and non-English literature will improve generalization. Developing more efficient architectures and explainability modules will facilitate deployment in real-world settings. Additionally, exploring reinforcement learning and multi-task training could further boost performance across all pipeline stages, ultimately enabling fully automated, trustworthy evidence synthesis systems.

AI Executive Summary

In an era where scientific literature is expanding exponentially, the ability to rapidly and accurately synthesize evidence has become critical. Traditional manual meta-analyses, while rigorous, are increasingly impractical given the volume of publications. This challenge has spurred interest in AI-driven automation, yet progress remains hindered by the complexity of the task. Meta-analysis involves multiple tightly coupled stages: comprehensive literature retrieval, precise screening based on detailed eligibility criteria, data extraction, and statistical synthesis. Each step requires nuanced understanding and strict adherence to protocols, making full automation difficult.

This study introduces MetaSyn, a meticulously curated dataset comprising 442 expert-validated meta-analyses from Nature journals, paired with a large PubMed corpus of over 140,000 articles. Each meta-analysis is annotated with explicit PI/ECO research questions, search strategies, date bounds, and inclusion/exclusion criteria, providing a verifiable ground truth across all pipeline stages. The dataset captures the real-world complexity of scientific eligibility, including topically similar but ineligible studies, making it an ideal benchmark for AI systems.
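The annotation layout described above can be pictured as one record per meta-analysis. The sketch below is an assumed schema: every class and field name is hypothetical, PI/ECO is read here as Population, Intervention or Exposure, Comparator, and Outcome, and the example values are placeholders rather than MetaSyn data.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class PIECOQuestion:
    population: str
    intervention_or_exposure: str
    comparator: str
    outcome: str


@dataclass
class MetaAnalysisEntry:
    question: PIECOQuestion
    search_strategy: str
    date_start: Optional[date] = None  # None means no lower bound
    date_end: Optional[date] = None    # None means no upper bound
    inclusion_criteria: List[str] = field(default_factory=list)
    exclusion_criteria: List[str] = field(default_factory=list)
    positive_ids: List[str] = field(default_factory=list)       # verified included studies
    hard_negative_ids: List[str] = field(default_factory=list)  # on-topic but ineligible

    def within_date_bounds(self, published: date) -> bool:
        """Check a publication date against the entry's search window."""
        if self.date_start is not None and published < self.date_start:
            return False
        if self.date_end is not None and published > self.date_end:
            return False
        return True


# Placeholder entry for illustration only.
example = MetaAnalysisEntry(
    question=PIECOQuestion("adults", "treatment A", "placebo", "mortality"),
    search_strategy="(placeholder query)",
    date_start=date(2000, 1, 1),
    date_end=date(2020, 12, 31),
)
print(example.within_date_bounds(date(2010, 5, 1)))
```

Keeping hard negatives alongside verified positives in the same record is what lets a screening model be scored on candidates that look relevant but fail the eligibility criteria.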

The core experimental framework evaluates twelve pipeline configurations, spanning nine retrieval-augmented generation (RAG) variants and a protocol-driven agent. Performance metrics include Recall@K for retrieval, stage-specific precision and recall for screening, and expert validation of synthesis quality. Results reveal a stark bottleneck: while models like MA-Retriever achieve high retrieval recall (90.9% at K=200), their ability to accurately filter studies based on PI/ECO criteria remains limited, and no end-to-end system recovers more than 52.7% of ground-truth included studies. This gap underscores the difficulty of modeling complex eligibility conditions and the need for more structured reasoning capabilities.

The findings have profound implications for the future of automated evidence synthesis. They highlight that current models excel at retrieving relevant literature but falter in the critical screening phase, which determines the trustworthiness of meta-analyses. The stage-attributed metrics introduced in this work enable targeted diagnostics, guiding future model development. The study advocates for integrating structured reasoning, multi-modal data, and domain knowledge to overcome current limitations.

Despite these advances, challenges remain. Models struggle with multi-condition eligibility, computational costs are high, and database coverage is incomplete. Future efforts should focus on enhancing reasoning depth, expanding data sources, and optimizing efficiency. The MetaSyn benchmark sets a new standard for evaluating AI in scientific evidence synthesis, fostering progress toward fully automated, reliable meta-analyses that can accelerate scientific discovery and improve clinical decision-making.

Deep Analysis

Background

The rapid growth of scientific literature across disciplines has created a pressing need for automated tools capable of synthesizing evidence efficiently. Traditional manual meta-analyses, although considered the gold standard, are labor-intensive and time-consuming, often taking months to complete. Early efforts in automating systematic reviews focused on text mining and rule-based systems, such as the development of tools like RobotReviewer and Abstrackr, which aimed to assist in screening and data extraction. With the advent of deep learning, models like BERT and SciBERT have been employed to improve relevance ranking and information extraction. However, these approaches typically target isolated tasks and lack the capacity to handle the full pipeline with verifiable ground truth. Recent advances in retrieval-augmented generation (RAG) models have shown promise in generating summaries and answering scientific questions, but their application in the context of strict eligibility criteria remains limited. The need for a comprehensive benchmark that captures the entire meta-analysis workflow, including the critical screening step governed by PI/ECO criteria, has been recognized as a major gap in the field. MetaSyn addresses this gap by providing a structured, verifiable dataset that enables systematic evaluation of AI models across all stages, fostering progress toward fully automated evidence synthesis.

Core Problem

Despite significant progress in individual components of scientific literature processing, the core challenge remains in the screening stage: accurately distinguishing studies that meet complex eligibility criteria from those that do not. Existing models excel at retrieval but falter when it comes to applying nuanced, multi-dimensional PI/ECO standards, especially in the presence of topically similar but ineligible studies. This bottleneck impairs the reliability and trustworthiness of automated meta-analyses, limiting their adoption in critical domains like healthcare and environmental policy. Additionally, current evaluation frameworks often rely on relevance labels that do not fully capture the protocol adherence, making it difficult to diagnose specific failure modes. The lack of a unified, verifiable benchmark hampers systematic progress, as models cannot be reliably compared or optimized for the entire pipeline. Addressing this problem requires a dataset with explicit, expert-verified ground truth at each stage and metrics that disentangle retrieval from eligibility reasoning.

Innovation

MetaSyn introduces several key innovations. First, it constructs a large, expert-curated dataset with 442 meta-analyses, each with detailed search strategies, PI/ECO questions, and verified study lists, providing a verifiable ground truth for the entire pipeline. Second, it develops a multi-stage evaluation framework that separates retrieval, screening, and synthesis, enabling precise diagnostics of model performance at each step. Third, it incorporates stage-specific metrics validated against expert judgments, ensuring that improvements are meaningful and aligned with scientific standards. Fourth, the evaluation includes multiple RAG variants and a protocol-driven agent, highlighting the performance gaps and potential pathways for enhancement. Finally, the dataset's diversity across clinical and non-clinical domains ensures broad applicability, encouraging the development of models capable of handling complex eligibility criteria in real-world scenarios.

Methodology

  • Data collection: Extracted 442 meta-analyses from Nature journals, ensuring each included a complete list of analyzed studies, search strategies, and eligibility criteria. Paired each with a PubMed corpus of 140,585 articles, including positive and hard negative samples.
  • Ground truth extraction: Human experts reviewed supplementary materials, forest plots, and methods sections to confirm study inclusion and extract detailed metadata.
  • Research question structuring: Used GLM-4.6 to parse abstracts and generate initial PI/ECO components, followed by manual correction.
  • Retrieval model training: Fine-tuned dense retrievers like DPR and ColBERT on the dataset, optimizing for Recall@K.
  • Screening model design: Developed multi-stage classifiers based on BERT and T5, trained to discriminate eligible studies according to PI/ECO standards.
  • Evaluation: Employed stage-specific metrics (Recall, Precision, F1, and expert validation) to assess each pipeline component.
  • Ablation studies: Tested the impact of different retriever architectures, prompting strategies, and multi-modal inputs.
  • Error analysis: Analyzed misclassifications with expert input to identify reasoning gaps and guide model improvements.
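The screening metrics in the evaluation step can be sketched as precision, recall, and F1 computed only over the candidates that reached the screening stage, so that retrieval misses are not charged to the screener. The function name, argument names, and toy data are assumptions for illustration, not the paper's implementation.

```python
from typing import Dict, Set


def screening_prf(
    predicted_eligible: Set[str],
    truly_eligible: Set[str],
    candidate_pool: Set[str],
) -> Dict[str, float]:
    """Precision, recall, and F1 for screening, restricted to the retrieved pool."""
    predicted = predicted_eligible & candidate_pool
    eligible_in_pool = truly_eligible & candidate_pool
    true_positives = len(predicted & eligible_in_pool)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(eligible_in_pool) if eligible_in_pool else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy example: "f" is eligible but was never retrieved, so it is excluded
# from the screening denominator; "c" is a false positive, "b" a false negative.
pool = {"a", "b", "c", "d", "e"}
scores = screening_prf(predicted_eligible={"a", "c"}, truly_eligible={"a", "b", "f"}, candidate_pool=pool)
print(scores)
```

Restricting the denominator to the retrieved pool is a design choice: it isolates screening quality, while the unretrieved study would instead show up as a Recall@K loss.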

Experiments

The experimental setup involved training and testing models on the MetaSyn dataset, with a focus on nine RAG variants and a protocol-driven agent. The models were evaluated across retrieval, screening, and synthesis tasks, with metrics including Recall@K, precision, recall, F1, and expert-validated synthesis accuracy. Hyperparameters such as learning rate, embedding dimensions, and retrieval top-K thresholds were tuned via grid search. Cross-validation ensured robustness, and ablation studies examined the contribution of different model components. Expert annotations validated the model outputs, especially in the screening phase, where eligibility judgments are most challenging. The experiments aimed to quantify the gap between retrieval recall and screening coverage, diagnose failure modes, and establish baseline performance levels for future improvements.

Results

The models achieved high retrieval recall, with MA-Retriever reaching 90.9% at K=200, but the overall inclusion coverage in end-to-end pipelines was capped at 52.7%. Stage-wise analysis revealed that the primary bottleneck was in the screening phase, where models struggled to accurately discriminate PI/ECO-ineligible studies, especially in hard negative cases. The discrepancy between retrieval and screening performance underscores the need for more sophisticated eligibility reasoning modules. Expert validation confirmed that current models lack the nuanced understanding required for complex eligibility criteria, leading to significant false negatives and positives. These results highlight the importance of multi-stage evaluation and targeted model enhancements to bridge the performance gap.

Applications

The findings can be immediately applied to develop AI-assisted tools for systematic reviews in medicine, environmental science, and social sciences, enabling faster, more reliable literature screening. Such tools can assist researchers and clinicians by automating the initial search and eligibility assessment, reducing manual workload and increasing reproducibility. In the long term, integrating structured reasoning, multi-modal data, and domain knowledge into models could lead to fully automated meta-analysis pipelines, transforming evidence-based decision-making. These advancements would benefit regulatory agencies, healthcare providers, and policymakers by providing timely, high-quality syntheses of scientific evidence, ultimately accelerating scientific discovery and improving public health outcomes.

Limitations & Outlook

Current models are limited by their shallow understanding of complex eligibility criteria, often misclassifying studies that require multi-faceted reasoning. The dataset's reliance on PubMed restricts coverage of other important biomedical databases, affecting generalizability. Computational costs remain high, especially for long texts and multi-modal inputs, hindering scalability. Additionally, the manual annotation process, while rigorous, introduces potential biases and inconsistencies. Future work should focus on enhancing reasoning depth, expanding data sources, improving model efficiency, and developing explainability features to foster broader adoption.

Abstract

Meta-analysis is a demanding form of evidence synthesis that combines literature retrieval, PI/ECO-guided study selection, and statistical aggregation. Its structured, verifiable workflow makes it an ideal substrate for evaluating systematic scientific reasoning, yet existing benchmarks lack ground truth across the full retrieval-screening-synthesis pipeline. We introduce MetaSyn, a dataset of 442 expert-curated meta-analyses from Nature Portfolio journals. Each entry pairs a research question with PI/ECO criteria, a retrieval corpus of 140k PubMed articles, verified positive studies, hard negatives that are topically similar but PI/ECO-ineligible, and complete search strategies and date bounds. Benchmarking twelve pipeline configurations (nine RAG variants and a protocol-driven agent) reveals a critical screening bottleneck: despite a retrieval ceiling of 90.9% recall at K=200, no system recovers more than 52.7% of ground-truth included literature. Current LLMs fail to reliably separate eligible studies from PI/ECO-failing distractors in pools of comparable topical relevance. Stage-attributed metrics capture where systems succeed and fail; a single end-to-end score does not.

cs.CL cs.IR
