MedHopQA: A Disease-Centered Multi-Hop Reasoning Benchmark and Evaluation Framework for LLM-Based Biomedical Question Answering
MedHopQA evaluates biomedical QA via multi-hop reasoning with 1,000 expert-curated question-answer pairs.
Key Findings
Methodology
The MedHopQA dataset was constructed through a multi-stage human-AI collaborative process, combining structured human annotation, AI-assisted augmentation, and multi-stage validation. Each question requires synthesizing information from two distinct Wikipedia articles, and answers are provided in an open-ended free-text format. Gold annotations are augmented with ontology-grounded synonym sets from MONDO, NCBI Gene, and NCBI Taxonomy to support both lexical and concept-level evaluation.
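As a rough illustration of how synonym sets can separate lexical from concept-level scoring, the sketch below compares a free-text prediction first against the gold string alone and then against the gold string plus its synonyms. The function names, the normalization rules, and the example synonym list are assumptions made for illustration; they are not taken from the MedHopQA evaluation code or from the ontologies themselves.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Lowercase, strip accents, and replace punctuation runs with one space."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def lexical_match(prediction: str, gold: str) -> bool:
    """Lexical evaluation: the prediction must equal the gold string after normalization."""
    return normalize(prediction) == normalize(gold)


def concept_match(prediction: str, gold: str, synonyms: list[str]) -> bool:
    """Concept-level evaluation: any listed synonym of the gold concept is accepted."""
    accepted = {normalize(gold)} | {normalize(s) for s in synonyms}
    return normalize(prediction) in accepted


# Hand-written example record (hypothetical, not copied from the dataset).
gold_answer = "FBN1"
gold_synonyms = ["fibrillin 1"]

lexical_ok = lexical_match("Fibrillin-1", gold_answer)
concept_ok = concept_match("Fibrillin-1", gold_answer, gold_synonyms)
```

In this toy case the prediction fails the lexical check but passes the concept-level one, which is the kind of gap a synonym set is meant to close.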
Key Results
- In a zero-shot setting, evaluation of four frontier LLMs (GPT-5.1, Gemini 2.5 Pro, Claude Sonnet 4.5, and GPT-4o) reveals performance variation across answer types, with overall accuracy ranging from 66.3% to 83.4%. Performance is strongest on chemical and anatomical questions and most variable on disease and gene/protein categories, where fine-grained semantic discrimination is required.
- MedHopQA is designed to resist performance saturation and training data contamination, and it offers a reusable framework for constructing future biomedical QA datasets that prioritize compositional reasoning alongside those two properties.
- By embedding the 1,000 scored questions within a publicly downloadable set of 10,000 questions and withholding answers on a CodaBench leaderboard, MedHopQA reduces leaderboard gaming and contamination risk.
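The embedding idea in the last point can be sketched as follows: a submission covers every public question, but accuracy is computed only over the hidden scored subset. This is a minimal sketch under stated assumptions; the function name, the question IDs, and the case-insensitive exact-match rule are hypothetical and do not describe the actual CodaBench scoring program.

```python
def score_submission(predictions: dict[str, str], scored_gold: dict[str, str]) -> float:
    """Return accuracy over the hidden scored subset only.

    predictions: answers for all public question IDs (scored and unscored alike).
    scored_gold: gold answers for the withheld scored subset.
    """
    if not scored_gold:
        raise ValueError("scored_gold must not be empty")
    correct = 0
    for qid, gold in scored_gold.items():
        # A missing prediction counts as wrong.
        pred = predictions.get(qid, "")
        if pred.strip().lower() == gold.strip().lower():
            correct += 1
    return correct / len(scored_gold)


# Toy data: four public questions, of which only q2 and q3 are secretly scored.
predictions = {"q1": "FBN1", "q2": "aorta", "q3": "dopamine", "q4": "anything"}
scored_gold = {"q2": "Aorta", "q3": "serotonin"}
accuracy = score_submission(predictions, scored_gold)
```

Because a participant cannot tell which IDs are scored, answers to the unscored questions have no effect on the result.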
Significance
MedHopQA provides a new benchmark and evaluation framework focused on multi-hop reasoning, which is crucial for clinically meaningful tasks such as diagnostic support, literature-based discovery, and hypothesis generation. It fills a gap in existing biomedical QA benchmarks, addressing long-standing pain points such as performance saturation and training data contamination. By using an open-ended answer format and multi-hop structure, MedHopQA advances deeper evaluation of LLM reasoning capabilities.
Technical Contribution
MedHopQA's technical contributions include its explicit multi-hop structure, open-ended answer format, and community-scale evaluation. These design features overcome structural limitations of existing benchmarks, such as format rigidity, saturation, contamination vulnerability, and shallow reasoning issues. By using Wikipedia as a knowledge source, MedHopQA ensures that the challenge is relational rather than encyclopedic.
Novelty
MedHopQA is the first biomedical QA benchmark to combine explicit multi-hop structure, open-ended answer format, and community-scale evaluation. Compared to existing multi-hop evaluation methods, it eliminates answer cueing effects and requires models to generate rather than select the correct inferential output, broadening the surface of reasoning that must be engaged.
Limitations
- One limitation of MedHopQA is its reliance on Wikipedia as a knowledge source, which may lead to insufficient coverage of certain specific domains or the latest research.
- The complexity of question design requires significant human and time investment in dataset construction and validation.
- Despite measures to reduce contamination risk, potential similar instances in training data cannot be completely ruled out.
Future Work
Future work directions include expanding MedHopQA to cover more biomedical domains and knowledge sources, further increasing the diversity and challenge of the questions. Additionally, exploring automated generation and validation processes could improve dataset construction efficiency. The community can leverage the MedHopQA framework to develop new evaluation benchmarks focused on other complex reasoning tasks.
AI Executive Summary
Evaluating large language models (LLMs) in the biomedical domain requires benchmarks that can distinguish reasoning from pattern matching and remain discriminative as model capabilities improve. Existing biomedical question answering (QA) benchmarks are limited in this respect. Multiple-choice formats can allow models to succeed through answer elimination rather than inference, while widely circulated exam-style datasets are increasingly vulnerable to performance saturation and training data contamination.
Multi-hop reasoning, defined as the ability to integrate information across multiple sources to derive an answer, is central to clinically meaningful tasks such as diagnostic support, literature-based discovery, and hypothesis generation, yet remains underrepresented in current biomedical QA benchmarks. MedHopQA is a disease-centered multi-hop reasoning benchmark consisting of 1,000 expert-curated question-answer pairs introduced as a shared task at BioCreative IX. Each question requires synthesis of information across two distinct Wikipedia articles, and answers are provided in an open-ended free-text format rather than as multiple-choice selections.
Gold annotations are augmented with ontology-grounded synonym sets (MONDO, NCBI Gene, NCBI Taxonomy) to support both lexical and concept-level evaluation. The dataset was constructed through a multi-stage human-AI pipeline combining structured human annotation, triage, iterative verification, and LLM-as-a-judge validation. To reduce leaderboard gaming and contamination risk, the 1,000 scored questions are embedded within a publicly downloadable set of 10,000 questions, with answers withheld, on a CodaBench leaderboard.
Evaluation of four frontier LLMs under a zero-shot setting (GPT-5.1, Gemini 2.5 Pro, Claude Sonnet 4.5, and GPT-4o) reveals performance variation across answer types, with overall accuracy ranging from 66.3% to 83.4%. Performance is strongest on chemical and anatomical questions and most variable on disease and gene/protein categories, where fine-grained semantic discrimination is required.
MedHopQA provides both a benchmark and a reusable framework for constructing future biomedical QA datasets that prioritize compositional reasoning, saturation resistance, and contamination mitigation as design constraints. It addresses long-standing pain points such as performance saturation and training data contamination, advancing deeper evaluation of LLM reasoning capabilities.
However, MedHopQA also has its limitations. Its reliance on Wikipedia as a knowledge source may lead to insufficient coverage of certain specific domains or the latest research. Additionally, the complexity of question design requires significant human and time investment in dataset construction and validation. Despite measures to reduce contamination risk, potential similar instances in training data cannot be completely ruled out. Future work directions include expanding MedHopQA to cover more biomedical domains and knowledge sources, further increasing the diversity and challenge of the questions.
Deep Analysis
Background
Biomedical text mining has traditionally been organized around task-specific information extraction. Named entity recognition, relation extraction, event detection, and document classification have been developed and evaluated as discrete, pipelined subtasks, each with dedicated models, training data, and evaluation benchmarks. This task-specific paradigm has supported large-scale biocuration efforts by enabling the systematic extraction of structured knowledge from the literature through pre-defined schemas.
In contrast, natural language question answering provides end users with a more flexible interaction, in which information needs are posed directly as ad-hoc questions and system responses are returned in fluent natural language. The rise of large language models (LLMs) has made this setting increasingly practical, enabling systems to retrieve, combine, and interpret information to produce coherent answers. However, natural language question answering exposes different aspects of system behavior, particularly the ability to reliably integrate information across sources and perform multi-step reasoning. As a result, this shift introduces new demands on dataset construction and evaluation: benchmarks must not only capture factual correctness but, crucially, reliably evoke and evaluate reasoning over biomedical knowledge.
Core Problem
Existing biomedical QA benchmarks are limited in evaluating LLM reasoning capabilities. Multiple-choice formats allow models to succeed by answer elimination rather than inference, and widely circulated exam-style datasets are subject to performance saturation and training data contamination. Multi-hop reasoning, the ability to integrate information across multiple sources to derive an answer, is central to clinically meaningful tasks such as diagnostic support, literature-based discovery, and hypothesis generation, yet remains underrepresented in current biomedical QA benchmarks.
Innovation
MedHopQA addresses the limitations of existing benchmarks through the following innovations:
- Multi-hop structure: Each question requires synthesizing information from two distinct Wikipedia articles, so it cannot be answered from a single page.
- Open-ended answer format: Answers are provided in free-text format, eliminating answer cueing effects and requiring models to generate rather than select the correct inferential output.
- Community-scale evaluation: The 1,000 scored questions are embedded within a publicly downloadable set of 10,000 questions, and answers are withheld on a CodaBench leaderboard, which reduces leaderboard gaming and contamination risk.
- Ontology enhancement: Gold annotations are augmented with ontology-grounded synonym sets (MONDO, NCBI Gene, NCBI Taxonomy) to support both lexical and concept-level evaluation.
Methodology
The MedHopQA dataset was constructed through a multi-stage human-AI collaborative process, with the following steps:
- Source Material and Seed Page Selection: The seed dataset was constructed from Wikipedia's curated list of disease pages, each paired with pages reachable via its outgoing hyperlinks.
- Human Annotation: Sixteen researchers selected page pairs and formulated questions requiring synthesis of information from both articles.
- AI Data Augmentation: An AI generation module was used to produce additional QA pairs, which entered the same triage pool as human-authored ones.
- Triage and Verification: All QA pairs entered a shared triage pool and were distributed to reviewers for validation, ensuring question quality.
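The seed-page step above, in which each disease page is paired with the pages its hyperlinks reach, can be sketched as a small graph traversal. The link graph here is a hand-written toy with placeholder page names, and the function name and filtering rules (dropping self-links and duplicate links) are assumptions; the authors' actual tooling is not described in this summary.

```python
def candidate_page_pairs(
    seed_pages: list[str],
    outgoing_links: dict[str, list[str]],
) -> list[tuple[str, str]]:
    """Pair each seed disease page with every distinct page it links to."""
    pairs: list[tuple[str, str]] = []
    seen: set[tuple[str, str]] = set()
    for seed in seed_pages:
        for target in outgoing_links.get(seed, []):
            # Skip self-links and links that appear more than once on a page.
            if target == seed or (seed, target) in seen:
                continue
            seen.add((seed, target))
            pairs.append((seed, target))
    return pairs


# Toy link graph with placeholder titles (not real Wikipedia data).
outgoing_links = {
    "Disease A": ["Gene X", "Drug Y", "Disease A", "Gene X"],
    "Disease B": ["Gene X"],
}
pairs = candidate_page_pairs(["Disease A", "Disease B"], outgoing_links)
```

Each resulting pair is only a candidate; in the process described above, human annotators choose which pairs to use and write the question that links the two pages.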
Experiments
In a zero-shot setting, four frontier LLMs (GPT-5.1, Gemini 2.5 Pro, Claude Sonnet 4.5, and GPT-4o) were evaluated on MedHopQA. The experimental design included:
- Dataset: 1,000 expert-curated question-answer pairs embedded within a publicly downloadable set of 10,000 questions.
- Baselines: Comparison with existing biomedical QA benchmarks.
- Evaluation Metrics: Accuracy, semantic discrimination ability, and performance variation across answer types.
- Hyperparameters: Default model settings, no fine-tuning performed.
- Ablation Studies: Analysis of performance variation across different answer types.
Results
Experimental results show:
- Overall accuracy ranges from 66.3% to 83.4%, with strongest performance on chemical and anatomical questions.
- Performance variation is most pronounced on disease and gene/protein categories, where fine-grained semantic discrimination is required.
- In a zero-shot setting, LLMs still face challenges in multi-hop reasoning tasks, indicating the need for further model improvements.
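A per-answer-type breakdown like the one summarized above can be computed with a simple grouped tally. The sketch below uses made-up toy records rather than any reported result, and the record layout of (answer type, correctness flag) is an assumption about how graded outputs might be stored.

```python
from collections import defaultdict


def accuracy_by_type(records: list[tuple[str, bool]]) -> dict[str, float]:
    """Group graded answers by answer type and return accuracy per type."""
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for answer_type, is_correct in records:
        totals[answer_type] += 1
        hits[answer_type] += int(is_correct)
    return {t: hits[t] / totals[t] for t in totals}


# Toy graded outputs (illustrative only, not results from the paper).
records = [
    ("chemical", True),
    ("chemical", True),
    ("disease", True),
    ("disease", False),
    ("gene/protein", False),
]
per_type = accuracy_by_type(records)
```

Reporting accuracy per type rather than as a single number is what exposes the category-level variation described in the results.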
Applications
Application scenarios for MedHopQA include:
- Diagnostic Support: Assisting physicians in integrating information from different sources to make more accurate diagnostic decisions through multi-hop reasoning.
- Literature Discovery: Supporting researchers in discovering new associations and hypotheses in the literature by integrating information from different documents.
- Hypothesis Generation: Providing scientists with new hypotheses and research directions by discovering new research paths through multi-hop reasoning.
Limitations & Outlook
Limitations of MedHopQA include:
- Reliance on Wikipedia as a knowledge source, which may lead to insufficient coverage of certain specific domains or the latest research.
- The complexity of question design requires significant human and time investment in dataset construction and validation.
- Despite measures to reduce contamination risk, potential similar instances in training data cannot be completely ruled out.
Future work directions include expanding MedHopQA to cover more biomedical domains and knowledge sources, further increasing the diversity and challenge of the questions.
Plain Language Accessible to non-experts
Imagine you're in a library trying to find a book about a rare disease. You know the book might mention a specific gene, but you're not sure which one. You need to look through multiple books, maybe one book mentions the symptoms of the disease, and another mentions the related gene. You need to piece this information together to find the answer you need.
MedHopQA is like a test held in this library: it poses questions whose answers can only be found by combining information from two different books. Looking up a single fact is not enough; you have to reason and synthesize.
This approach is crucial in the medical field because doctors often need to integrate information from different studies and literature when diagnosing patients. MedHopQA simulates this complex reasoning process, helping to evaluate and improve large language models' capabilities in the biomedical domain.
In this way, MedHopQA is not a question-answering tool itself, but a yardstick that shows how well AI systems can understand and combine complex information.
ELI14 Explained like you're 14
Hey there! Have you ever wondered how doctors know what to do when we're sick? They don't just rely on one book, you know!
Imagine you're playing a super complex puzzle game. Each puzzle piece comes from a different box, and you need to put them together to see the full picture. Doctors are like that when diagnosing; they need to find clues from different medical books and studies and then piece them together to find out what's wrong and how to treat it.
MedHopQA is like a test that checks how well AI models can play this puzzle game. It asks questions whose answers need pieces from two different sources, just like finding puzzle pieces in different boxes. If an AI does well on the test, that is a sign it might someday help doctors find answers faster and more accurately!
So next time you visit a doctor, remember they're using a lot of smart tools and methods to help you out!
Glossary
Multi-hop Reasoning
Multi-hop reasoning refers to the ability to integrate information from multiple sources to derive a conclusion. This is especially important in complex question-answering tasks.
In MedHopQA, each question requires multi-hop reasoning to synthesize information from two distinct Wikipedia articles.
Open-ended Answer
An open-ended answer is not restricted to predefined options and is typically provided in free-text format.
MedHopQA uses an open-ended answer format, requiring models to generate rather than select answers.
Wikipedia
Wikipedia is a free online encyclopedia, collaboratively written by volunteers around the world, covering a wide range of topics.
MedHopQA uses Wikipedia as a knowledge source so that the challenge is relational rather than encyclopedic.
Ontology-grounded Synonym Sets
Ontology-grounded synonym sets are collections of synonyms based on domain ontologies, used to support lexical and concept-level evaluation.
MedHopQA enhances gold annotations with ontology-grounded synonym sets from MONDO, NCBI Gene, and NCBI Taxonomy.
CodaBench Leaderboard
CodaBench is an online platform for shared tasks and benchmark evaluations, providing leaderboard functionality to foster community participation.
MedHopQA's 1,000 scored questions are embedded within a publicly downloadable set of 10,000 questions, with answers withheld on a CodaBench leaderboard, to reduce leaderboard gaming and contamination risk.
Zero-shot Setting
A zero-shot setting refers to evaluating a model on a task without task-specific training or worked examples in the prompt.
MedHopQA evaluates the performance of four frontier LLMs in a zero-shot setting.
Performance Saturation
Performance saturation occurs when a benchmark no longer effectively distinguishes model capabilities after models achieve high scores.
MedHopQA resists performance saturation through its multi-hop structure and open-ended answer design.
Dataset Contamination
Dataset contamination refers to the presence of benchmark items, or close variants of them, in a model's training data, leading models to achieve high scores through memorization rather than reasoning.
MedHopQA reduces dataset contamination risk by hiding its scored questions within a larger public question set and withholding the answers.
BioCreative IX
BioCreative is a community evaluation event for biomedical text mining, aimed at advancing research and technology in the field.
MedHopQA was introduced as a shared task at BioCreative IX to foster community participation and evaluation.
LLM-as-a-judge
LLM-as-a-judge refers to the process of using large language models to validate and evaluate question answers.
MedHopQA uses LLM-as-a-judge validation in the dataset construction process.
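A generic LLM-as-a-judge check can be sketched as a prompt template plus a verdict parser. Everything below is an assumption made for illustration: the prompt wording, the YES/NO verdict format, and the `ask_llm` callable (a placeholder for whatever model client is used) are hypothetical and are not the prompts or pipeline used to build MedHopQA.

```python
from typing import Callable

# Hypothetical judge prompt; the real validation prompts are not given in this summary.
JUDGE_PROMPT = (
    "Question: {question}\n"
    "Reference answer: {gold}\n"
    "Candidate answer: {candidate}\n"
    "Does the candidate refer to the same entity as the reference? "
    "Reply with exactly YES or NO."
)


def judge_answer(
    question: str,
    gold: str,
    candidate: str,
    ask_llm: Callable[[str], str],
) -> bool:
    """Ask a judge model whether the candidate matches the reference answer."""
    prompt = JUDGE_PROMPT.format(question=question, gold=gold, candidate=candidate)
    verdict = ask_llm(prompt).strip().upper()
    return verdict.startswith("YES")


def stub_llm(prompt: str) -> str:
    """Offline stand-in for a model call: YES only on a case-insensitive exact match."""
    fields = dict(line.split(": ", 1) for line in prompt.splitlines()[:3])
    same = fields["Reference answer"].lower() == fields["Candidate answer"].lower()
    return "YES" if same else "NO"


accepted = judge_answer("Which gene is involved?", "FBN1", "fbn1", stub_llm)
rejected = judge_answer("Which gene is involved?", "FBN1", "FBN2", stub_llm)
```

Passing the model call in as a function keeps the sketch runnable offline; a real judge would replace `stub_llm` with a call to an actual model.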
Open Questions Unanswered questions from this research
- 1. How can we construct multi-hop reasoning datasets with broader applicability without relying on specific knowledge sources? Existing methods heavily rely on sources like Wikipedia, which may lead to insufficient coverage of certain domains. New dataset construction methods are needed to cover more knowledge domains and sources.
- 2. How can we effectively evaluate models' reasoning abilities rather than memorization in multi-hop reasoning tasks? Existing benchmarks may not fully distinguish between models' reasoning and memorization abilities. New evaluation metrics and methods are needed to more accurately assess reasoning capabilities.
- 3. How can we automate the dataset construction and validation process to improve efficiency and reduce human intervention? The current dataset construction and validation process requires significant human and time investment. New automation tools and methods are needed to improve dataset construction efficiency.
- 4. How can we effectively handle and integrate information from different sources in multi-hop reasoning tasks? Existing methods may not effectively integrate and process information from different sources. New information integration and processing methods are needed to improve performance in multi-hop reasoning tasks.
- 5. How can we reduce the risk of training data contamination in multi-hop reasoning tasks? Existing datasets may be affected by training data contamination, leading models to achieve high scores through memorization rather than reasoning. New dataset design and evaluation methods are needed to reduce contamination risk.
Applications
Immediate Applications
Diagnostic Support
MedHopQA can assist physicians in integrating information from different sources to make more accurate diagnostic decisions. Through multi-hop reasoning, physicians can find relevant information faster, improving patient outcomes.
Literature Discovery
Researchers can use MedHopQA to discover new associations and hypotheses in the literature, advancing scientific research. By integrating information from different documents, researchers can gain a more comprehensive understanding of research topics.
Hypothesis Generation
Scientists can leverage MedHopQA to generate new research hypotheses and explore new research directions. Through multi-hop reasoning, scientists can discover new research paths, driving scientific progress.
Long-term Vision
Personalized Medicine
By integrating patient data and the latest medical research, MedHopQA can help achieve personalized medicine, improving treatment outcomes. Although data privacy and technical challenges remain, this vision is promising for the future.
Automated Scientific Discovery
MedHopQA can advance automated scientific discovery by integrating and analyzing large volumes of scientific data to uncover new scientific laws and theories. Despite current technical and computational challenges, this vision is promising for the future.
Abstract
Evaluating large language models (LLMs) in the biomedical domain requires benchmarks that can distinguish reasoning from pattern matching and remain discriminative as model capabilities improve. Existing biomedical question answering (QA) benchmarks are limited in this respect. Multiple-choice formats can allow models to succeed through answer elimination rather than inference, while widely circulated exam-style datasets are increasingly vulnerable to performance saturation and training data contamination. Multi-hop reasoning, defined as the ability to integrate information across multiple sources to derive an answer, is central to clinically meaningful tasks such as diagnostic support, literature-based discovery, and hypothesis generation, yet remains underrepresented in current biomedical QA benchmarks. MedHopQA is a disease-centered multi-hop reasoning benchmark consisting of 1,000 expert-curated question-answer pairs introduced as a shared task at BioCreative IX. Each question requires synthesis of information across two distinct Wikipedia articles, and answers are provided in an open-ended free-text format. Gold annotations are augmented with ontology-grounded synonym sets from MONDO, NCBI Gene, and NCBI Taxonomy to support both lexical and concept-level evaluation. MedHopQA was constructed through a structured process combining human annotation, triage, iterative verification, and LLM-as-a-judge validation. To reduce leaderboard gaming and contamination risk, the 1,000 scored questions are embedded within a publicly downloadable set of 10,000 questions, with answers withheld, on a CodaBench leaderboard. MedHopQA provides both a benchmark and a reusable framework for constructing future biomedical QA datasets that prioritize compositional reasoning, saturation resistance, and contamination resistance as core design constraints.