MedHopQA: A Disease-Centered Multi-Hop Reasoning Benchmark and Evaluation Framework for LLM-Based Biomedical Question Answering

TL;DR

MedHopQA evaluates biomedical QA via multi-hop reasoning with 1,000 expert-curated question-answer pairs.

cs.CL 2026-05-13

Rezarta Islamaj, Robert Leaman, Joey Chan, Nicholas Wan, Qiao Jin, Natalie Xie, John Wilbur, Shubo Tian, Lana Yeganova, Po-Ting Lai, Chih-Hsuan Wei, Yifan Yang, Yao Ge, Qingqing Zhu, Zhizheng Wang, Zhiyong Lu


multi-hop QA, biomedical NLP, benchmark evaluation, LLM evaluation, dataset contamination

Key Findings

Methodology

The MedHopQA dataset was constructed through a multi-stage human-AI collaborative process, combining structured human annotation, AI-assisted augmentation, and multi-stage validation. Each question requires synthesizing information from two distinct Wikipedia articles, and answers are provided in an open-ended free-text format. Gold annotations are augmented with ontology-grounded synonym sets from MONDO, NCBI Gene, and NCBI Taxonomy to support both lexical and concept-level evaluation.

Key Results

In a zero-shot setting, evaluation of four frontier LLMs (GPT-5.1, Gemini 2.5 Pro, Claude Sonnet 4.5, and GPT-4o) reveals performance variation across answer types, with overall accuracy ranging from 66.3% to 83.4%. Performance is strongest on chemical and anatomical questions and most variable on disease and gene/protein categories, where fine-grained semantic discrimination is required.
The design of the MedHopQA dataset makes it resistant to performance saturation and training data contamination, providing a reusable framework for constructing future biomedical QA datasets that prioritize compositional reasoning, saturation resistance, and contamination resistance.
By embedding the 1,000 scored questions within a publicly downloadable set of 10,000 questions and withholding answers on a CodaBench leaderboard, MedHopQA reduces leaderboard gaming and contamination risk.
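The embedding idea can be illustrated with a small scoring sketch. This is not the CodaBench scorer: the function name `score_submission`, the question IDs, the toy answers, and the case-insensitive exact match are all assumptions made for illustration. The point is only that a submission covers every public question while accuracy is computed on the hidden scored subset.

```python
def score_submission(predictions: dict[str, str], hidden_gold: dict[str, str]) -> float:
    """Accuracy over the hidden scored subset only (hypothetical scorer).

    Predictions for unscored public questions are ignored; a missing
    prediction for a scored question counts as wrong.
    """
    if not hidden_gold:
        raise ValueError("hidden_gold must contain at least one scored question")
    hits = sum(
        predictions.get(qid, "").strip().lower() == answer.strip().lower()
        for qid, answer in hidden_gold.items()
    )
    return hits / len(hidden_gold)


# Toy stand-in for the public set: 10 question IDs, of which 2 are scored.
public_ids = [f"q{i}" for i in range(10)]
hidden_gold = {"q3": "chromosome 15", "q7": "FBN1"}  # known only to the organizers

# A participant must answer everything, since the scored IDs are not disclosed.
predictions = {qid: "unknown" for qid in public_ids}
predictions["q3"] = "Chromosome 15"

score = score_submission(predictions, hidden_gold)
print(score)  # 0.5
```

Because participants cannot tell which questions are scored, tuning against the scored items specifically is harder than with a fully disclosed test set.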

Significance

MedHopQA provides a new benchmark and evaluation framework focused on multi-hop reasoning, which is crucial for clinically meaningful tasks such as diagnostic support, literature-based discovery, and hypothesis generation. It fills a gap in existing biomedical QA benchmarks, addressing long-standing pain points such as performance saturation and training data contamination. By using an open-ended answer format and multi-hop structure, MedHopQA advances deeper evaluation of LLM reasoning capabilities.

Technical Contribution

MedHopQA's technical contributions include its explicit multi-hop structure, open-ended answer format, and community-scale evaluation. These design features overcome structural limitations of existing benchmarks, such as format rigidity, saturation, contamination vulnerability, and shallow reasoning issues. By using Wikipedia as a knowledge source, MedHopQA ensures that the challenge is relational rather than encyclopedic.

Novelty

MedHopQA is the first biomedical QA benchmark to combine explicit multi-hop structure, open-ended answer format, and community-scale evaluation. Compared to existing multi-hop evaluation methods, it eliminates answer cueing effects and requires models to generate rather than select the correct inferential output, broadening the surface of reasoning that must be engaged.

Limitations

One limitation of MedHopQA is its reliance on Wikipedia as a knowledge source, which may lead to insufficient coverage of certain specific domains or the latest research.
The complexity of question design requires significant human and time investment in dataset construction and validation.
Despite measures to reduce contamination risk, potential similar instances in training data cannot be completely ruled out.

Future Work

Future work directions include expanding MedHopQA to cover more biomedical domains and knowledge sources, further increasing the diversity and challenge of the questions. Additionally, exploring automated generation and validation processes could improve dataset construction efficiency. The community can leverage the MedHopQA framework to develop new evaluation benchmarks focused on other complex reasoning tasks.

AI Executive Summary

Multi-hop reasoning, defined as the ability to integrate information across multiple sources to derive an answer, is central to clinically meaningful tasks such as diagnostic support, literature-based discovery, and hypothesis generation, yet remains underrepresented in current biomedical QA benchmarks. MedHopQA is a disease-centered multi-hop reasoning benchmark consisting of 1,000 expert-curated question-answer pairs introduced as a shared task at BioCreative IX. Each question requires synthesis of information across two distinct Wikipedia articles, and answers are provided in an open-ended free-text format rather than as multiple-choice selections.

Gold annotations are augmented with ontology-grounded synonym sets (MONDO, NCBI Gene, NCBI Taxonomy) to support both lexical and concept-level evaluation. The dataset was constructed through a multi-stage human-AI pipeline combining structured human annotation, triage, iterative verification, and LLM-as-a-judge validation. To reduce leaderboard gaming and contamination risk, the 1,000 scored questions are embedded within a publicly downloadable set of 10,000 questions, with answers withheld, on a CodaBench leaderboard.

Evaluation of four frontier LLMs under a zero-shot setting (GPT-5.1, Gemini 2.5 Pro, Claude Sonnet 4.5, and GPT-4o) reveals performance variation across answer types, with overall accuracy ranging from 66.3% to 83.4%. Performance is strongest on chemical and anatomical questions and most variable on disease and gene/protein categories, where fine-grained semantic discrimination is required.

MedHopQA provides both a benchmark and a reusable framework for constructing future biomedical QA datasets that prioritize compositional reasoning, saturation resistance, and contamination mitigation as design constraints. It addresses long-standing pain points such as performance saturation and training data contamination, advancing deeper evaluation of LLM reasoning capabilities.

However, MedHopQA also has its limitations. Its reliance on Wikipedia as a knowledge source may lead to insufficient coverage of certain specific domains or the latest research. Additionally, the complexity of question design requires significant human and time investment in dataset construction and validation. Despite measures to reduce contamination risk, potential similar instances in training data cannot be completely ruled out. Future work directions include expanding MedHopQA to cover more biomedical domains and knowledge sources, further increasing the diversity and challenge of the questions.

Deep Analysis

Background

Biomedical text mining has traditionally been organized around task-specific information extraction. Named entity recognition, relation extraction, event detection, and document classification have been developed and evaluated as discrete, pipelined subtasks, each with dedicated models, training data, and evaluation benchmarks. This task-specific paradigm has supported large-scale biocuration efforts by enabling the systematic extraction of structured knowledge from the literature through pre-defined schemas.

In contrast, natural language question answering provides end users with a more flexible interaction, in which information needs are posed directly as ad-hoc questions and system responses are returned in fluent natural language. The rise of large language models (LLMs) has made this setting increasingly practical, enabling systems to retrieve, combine, and interpret information to produce coherent answers. However, natural language question answering exposes different aspects of system behavior, particularly the ability to reliably integrate information across sources and perform multi-step reasoning. As a result, this shift introduces new demands on dataset construction and evaluation: benchmarks must not only capture factual correctness but also, crucially, reliably evoke and evaluate reasoning over biomedical knowledge.

Core Problem

Existing biomedical QA benchmarks are limited in evaluating LLM reasoning capabilities. Multiple-choice formats allow models to succeed by answer elimination rather than inference, and widely circulated exam-style datasets are subject to performance saturation and training data contamination. Multi-hop reasoning, the ability to integrate information across multiple sources to derive an answer, is central to clinically meaningful tasks such as diagnostic support, literature-based discovery, and hypothesis generation, yet remains underrepresented in current biomedical QA benchmarks.

Innovation

MedHopQA addresses the limitations of existing benchmarks through the following innovations:

- Multi-hop structure: Each question requires synthesizing information from two distinct Wikipedia articles, so it cannot be answered from a single source.

- Open-ended answer format: Answers are provided in free-text format, eliminating answer cueing effects and requiring models to generate rather than select the correct inferential output.

- Community-scale evaluation: By embedding the 1,000 scored questions within a publicly downloadable set of 10,000 questions and withholding answers on a CodaBench leaderboard, MedHopQA reduces leaderboard gaming and contamination risk.

- Ontology enhancement: Gold annotations are augmented with ontology-grounded synonym sets (MONDO, NCBI Gene, NCBI Taxonomy) to support both lexical and concept-level evaluation.
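To make the distinction between lexical and concept-level evaluation concrete, here is a minimal sketch. It is not the benchmark's scorer: the normalization rules, the function names, and the toy gold record (including the placeholder concept ID) are assumptions for illustration only.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Lowercase, strip accents, replace punctuation with spaces, trim."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def lexical_match(prediction: str, gold_answer: str) -> bool:
    """Lexical level: the prediction must equal the single gold string."""
    return normalize(prediction) == normalize(gold_answer)


def concept_match(prediction: str, synonyms: set[str]) -> bool:
    """Concept level: any synonym of the gold concept is accepted."""
    return normalize(prediction) in {normalize(s) for s in synonyms}


# Hypothetical gold record; the concept ID is a placeholder, not a real MONDO ID.
GOLD = {
    "answer": "Marfan syndrome",
    "concept_id": "MONDO:EXAMPLE",
    "synonyms": {"Marfan syndrome", "Marfan's syndrome", "MFS"},
}

print(lexical_match("Marfan's Syndrome", GOLD["answer"]))    # False
print(concept_match("Marfan's Syndrome", GOLD["synonyms"]))  # True
```

The same free-text answer fails the strict lexical check but passes at the concept level, which is the case the synonym sets are meant to cover.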

Methodology

The MedHopQA dataset was constructed through a multi-stage human-AI collaborative process, with the following steps:

- Source Material and Seed Page Selection: The seed dataset was constructed from Wikipedia's curated list of disease pages, each paired with pages reachable via its outgoing hyperlinks.

- Human Annotation: Sixteen researchers selected page pairs and formulated questions requiring synthesis of information from both articles.

- AI Data Augmentation: An AI generation module was used to produce additional QA pairs, which entered the same triage pool as human-authored ones.

- Triage and Verification: All QA pairs entered a shared triage pool and were distributed to reviewers for validation, ensuring question quality.
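The seed-selection step can be sketched as pairing each disease page with the pages it links to. The link graph below is a hand-written toy, not real Wikipedia data, and `candidate_pairs` is a hypothetical helper; how the authors actually collected and filtered links is not described here.

```python
def candidate_pairs(seed_pages: list[str], links: dict[str, list[str]]) -> list[tuple[str, str]]:
    """Pair each seed page with every distinct page it links to.

    Self-links and repeated links are skipped so that each pair names
    two different articles exactly once.
    """
    pairs = []
    for seed in seed_pages:
        seen = set()
        for target in links.get(seed, []):
            if target != seed and target not in seen:
                seen.add(target)
                pairs.append((seed, target))
    return pairs


# Toy outgoing-link graph (illustrative page names only).
outgoing_links = {
    "Marfan syndrome": ["FBN1", "Aortic aneurysm", "FBN1", "Marfan syndrome"],
    "Cystic fibrosis": ["CFTR", "Pancreas"],
}

pairs = candidate_pairs(["Marfan syndrome", "Cystic fibrosis"], outgoing_links)
for seed, target in pairs:
    print(f"{seed} -> {target}")
```

Each resulting pair is a candidate for an annotator to turn into a question that needs facts from both pages.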

Experiments

Four frontier LLMs (GPT-5.1, Gemini 2.5 Pro, Claude Sonnet 4.5, and GPT-4o) were evaluated on MedHopQA in a zero-shot setting. The experimental design included:

- Dataset: 1,000 expert-curated question-answer pairs embedded within a publicly downloadable set of 10,000 questions.

- Baselines: Comparison with existing biomedical QA benchmarks.

- Evaluation Metrics: Accuracy, semantic discrimination ability, and performance variation across answer types.

- Hyperparameters: Default model settings, no fine-tuning performed.

- Per-category analysis: Performance broken down by answer type.

Results

Experimental results show:

- Overall accuracy ranges from 66.3% to 83.4%, with strongest performance on chemical and anatomical questions.

- Performance variation is most pronounced on disease and gene/protein categories, where fine-grained semantic discrimination is required.

- In a zero-shot setting, LLMs still face challenges in multi-hop reasoning tasks, indicating the need for further model improvements.
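A per-answer-type breakdown like the one summarized above can be computed with a few lines. This is a sketch under stated assumptions: the record fields (`answer_type`, `correct`) are hypothetical, and the toy records are made up for illustration; they are not results from the paper.

```python
from collections import defaultdict


def accuracy_by_type(records: list[dict]) -> tuple[dict[str, float], float]:
    """Return (accuracy per answer type, overall accuracy)."""
    if not records:
        raise ValueError("records must not be empty")
    totals = defaultdict(int)
    correct = defaultdict(int)
    for rec in records:
        totals[rec["answer_type"]] += 1
        correct[rec["answer_type"]] += int(rec["correct"])
    per_type = {t: correct[t] / totals[t] for t in totals}
    overall = sum(correct.values()) / sum(totals.values())
    return per_type, overall


# Made-up judgments for one model; NOT figures from the paper.
records = [
    {"answer_type": "chemical", "correct": True},
    {"answer_type": "chemical", "correct": True},
    {"answer_type": "disease", "correct": True},
    {"answer_type": "disease", "correct": False},
    {"answer_type": "gene/protein", "correct": False},
    {"answer_type": "gene/protein", "correct": True},
]

per_type, overall = accuracy_by_type(records)
for answer_type, acc in sorted(per_type.items()):
    print(f"{answer_type}: {acc:.1%}")
print(f"overall: {overall:.1%}")
```

Reporting per-type numbers alongside the overall figure is what exposes the spread between categories that a single accuracy score would hide.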

Applications

Application scenarios for MedHopQA include:

- Diagnostic Support: Assisting physicians in integrating information from different sources to make more accurate diagnostic decisions through multi-hop reasoning.

- Literature Discovery: Supporting researchers in discovering new associations and hypotheses in the literature by integrating information from different documents.

- Hypothesis Generation: Providing scientists with new hypotheses and research directions by discovering new research paths through multi-hop reasoning.

Limitations & Outlook

Limitations of MedHopQA include:

- Reliance on Wikipedia as a knowledge source, which may lead to insufficient coverage of certain specific domains or the latest research.

- The complexity of question design requires significant human and time investment in dataset construction and validation.

- Despite measures to reduce contamination risk, potential similar instances in training data cannot be completely ruled out.

Future work directions include expanding MedHopQA to cover more biomedical domains and knowledge sources, further increasing the diversity and challenge of the questions.

Plain Language (accessible to non-experts)

Imagine you're in a library trying to find a book about a rare disease. You know the book might mention a specific gene, but you're not sure which one. You need to look through multiple books, maybe one book mentions the symptoms of the disease, and another mentions the related gene. You need to piece this information together to find the answer you need.

MedHopQA is like a test for a guide in this library: it checks whether the guide can find the right books and combine what they say. Looking up a single fact is not enough; the guide has to reason across books and synthesize an answer.

This approach is crucial in the medical field because doctors often need to integrate information from different studies and literature when diagnosing patients. MedHopQA simulates this complex reasoning process, helping to evaluate and improve large language models' capabilities in the biomedical domain.

In this way, MedHopQA is not a question-answering tool itself but a yardstick for how well such tools handle complex, connected information.

ELI14 (explained like you're 14)

Hey there! Have you ever wondered how doctors know what to do when we're sick? They don't just rely on one book, you know!

Imagine you're playing a super complex puzzle game. Each puzzle piece comes from a different box, and you need to put them together to see the full picture. Doctors are like that when diagnosing; they need to find clues from different medical books and studies and then piece them together to find out what's wrong and how to treat it.

MedHopQA is like a test that checks whether AI helpers are any good at this puzzle game. It asks questions whose answers need pieces from two different boxes, so the AI can't just grab one piece and guess. An AI that does well on the test is more likely to be useful to doctors one day!

So next time you visit a doctor, remember they're using a lot of smart tools and methods to help you out!

Glossary

Multi-hop Reasoning

Multi-hop reasoning refers to the ability to integrate information from multiple sources to derive a conclusion. This is especially important in complex question-answering tasks.

In MedHopQA, each question requires multi-hop reasoning to synthesize information from two distinct Wikipedia articles.
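As an illustration of the two-hop structure, a question can be represented as a record that names both supporting facts. This is a sketch only: the class names and fields are hypothetical, the example question is written for this illustration and is not taken from the dataset, and the article titles are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hop:
    article: str  # source article for this step (illustrative title)
    fact: str     # the fact contributed by that article


@dataclass(frozen=True)
class TwoHopQuestion:
    question: str
    hops: tuple[Hop, Hop]
    answer: str

    def articles(self) -> tuple[str, ...]:
        """Titles of the articles that must be combined."""
        return tuple(hop.article for hop in self.hops)


example = TwoHopQuestion(
    question="On which chromosome is the gene whose mutations cause Marfan syndrome?",
    hops=(
        Hop("Marfan syndrome", "Marfan syndrome is caused by mutations in the FBN1 gene."),
        Hop("FBN1", "The FBN1 gene is located on chromosome 15."),
    ),
    answer="chromosome 15",
)

print(example.articles())
```

Neither fact alone answers the question: the first hop identifies the gene, and only the second hop gives its chromosomal location.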

Open-ended Answer

An open-ended answer is not restricted to predefined options and is typically provided in free-text format.

MedHopQA uses an open-ended answer format, requiring models to generate rather than select answers.

Wikipedia

Wikipedia is a free online encyclopedia, collaboratively written by volunteers around the world, covering a wide range of topics.

MedHopQA uses Wikipedia as its knowledge source so that the difficulty lies in relating facts across articles rather than in recalling obscure knowledge.

Ontology-grounded Synonym Sets

Ontology-grounded synonym sets are collections of synonyms based on domain ontologies, used to support lexical and concept-level evaluation.

MedHopQA enhances gold annotations with ontology-grounded synonym sets from MONDO, NCBI Gene, and NCBI Taxonomy.

CodaBench Leaderboard

CodaBench is an online platform for shared tasks and benchmark evaluations, providing leaderboard functionality to foster community participation.

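MedHopQA is scored through a CodaBench leaderboard with answers withheld; the 1,000 scored questions are embedded within a publicly downloadable set of 10,000 questions to reduce leaderboard gaming and contamination risk.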

Zero-shot Setting

A zero-shot setting refers to evaluating a model on a task without task-specific training and without worked examples in the prompt.

MedHopQA evaluates the performance of four frontier LLMs in a zero-shot setting.

Performance Saturation

Performance saturation occurs when a benchmark no longer effectively distinguishes model capabilities after models achieve high scores.

MedHopQA resists performance saturation through its multi-hop structure and open-ended answer design.

Dataset Contamination

Dataset contamination refers to the presence of benchmark items, or close variants of them, in a model's training data, which lets models achieve high scores through memorization rather than reasoning.

MedHopQA reduces dataset contamination risk by embedding the 1,000 scored questions within a larger public set of 10,000 questions and withholding the answers.

BioCreative IX

BioCreative is a community evaluation event for biomedical text mining, aimed at advancing research and technology in the field.

MedHopQA was introduced as a shared task at BioCreative IX to foster community participation and evaluation.

LLM-as-a-judge

LLM-as-a-judge refers to the process of using large language models to validate and evaluate question answers.

MedHopQA uses LLM-as-a-judge validation in the dataset construction process.

Open Questions (unanswered questions from this research)

1. How can we construct multi-hop reasoning datasets with broader applicability without relying on specific knowledge sources? Existing methods heavily rely on sources like Wikipedia, which may lead to insufficient coverage of certain domains. New dataset construction methods are needed to cover more knowledge domains and sources.
2. How can we effectively evaluate models' reasoning abilities rather than memorization in multi-hop reasoning tasks? Existing benchmarks may not fully distinguish between models' reasoning and memorization abilities. New evaluation metrics and methods are needed to more accurately assess reasoning capabilities.
3. How can we automate the dataset construction and validation process to improve efficiency and reduce human intervention? The current dataset construction and validation process requires significant human and time investment. New automation tools and methods are needed to improve dataset construction efficiency.
4. How can we effectively handle and integrate information from different sources in multi-hop reasoning tasks? Existing methods may not effectively integrate and process information from different sources. New information integration and processing methods are needed to improve performance in multi-hop reasoning tasks.
5. How can we reduce the risk of training data contamination in multi-hop reasoning tasks? Existing datasets may be affected by training data contamination, leading models to achieve high scores through memorization rather than reasoning. New dataset design and evaluation methods are needed to reduce contamination risk.

Applications

Immediate Applications

Diagnostic Support

MedHopQA can assist physicians in integrating information from different sources to make more accurate diagnostic decisions. Through multi-hop reasoning, physicians can find relevant information faster, improving patient outcomes.

Literature Discovery

Researchers can use MedHopQA to discover new associations and hypotheses in the literature, advancing scientific research. By integrating information from different documents, researchers can gain a more comprehensive understanding of research topics.

Hypothesis Generation

Scientists can leverage MedHopQA to generate new research hypotheses and explore new research directions. Through multi-hop reasoning, scientists can discover new research paths, driving scientific progress.

Long-term Vision

Personalized Medicine

By integrating patient data and the latest medical research, MedHopQA can help achieve personalized medicine, improving treatment outcomes. Although data privacy and technical challenges remain, this vision is promising for the future.

Automated Scientific Discovery

MedHopQA can advance automated scientific discovery by integrating and analyzing large volumes of scientific data to uncover new scientific laws and theories. Despite current technical and computational challenges, this vision is promising for the future.

Abstract

Evaluating large language models (LLMs) in the biomedical domain requires benchmarks that can distinguish reasoning from pattern matching and remain discriminative as model capabilities improve. Existing biomedical question answering (QA) benchmarks are limited in this respect. Multiple-choice formats can allow models to succeed through answer elimination rather than inference, while widely circulated exam-style datasets are increasingly vulnerable to performance saturation and training data contamination. Multi-hop reasoning, defined as the ability to integrate information across multiple sources to derive an answer, is central to clinically meaningful tasks such as diagnostic support, literature-based discovery, and hypothesis generation, yet remains underrepresented in current biomedical QA benchmarks. MedHopQA is a disease-centered multi-hop reasoning benchmark consisting of 1,000 expert-curated question-answer pairs introduced as a shared task at BioCreative IX. Each question requires synthesis of information across two distinct Wikipedia articles, and answers are provided in an open-ended free-text format. Gold annotations are augmented with ontology-grounded synonym sets from MONDO, NCBI Gene, and NCBI Taxonomy to support both lexical and concept-level evaluation. MedHopQA was constructed through a structured process combining human annotation, triage, iterative verification, and LLM-as-a-judge validation. To reduce leaderboard gaming and contamination risk, the 1,000 scored questions are embedded within a publicly downloadable set of 10,000 questions, with answers withheld, on a CodaBench leaderboard. MedHopQA provides both a benchmark and a reusable framework for constructing future biomedical QA datasets that prioritize compositional reasoning, saturation resistance, and contamination resistance as core design constraints.

cs.CL cs.AI cs.IR

