MedCase-Structured: A Text-to-FHIR Dataset for Benchmarking Diagnostic Reasoning in Clinically Realistic EHR Settings

TL;DR

Proposes MedCase-Structured, a pipeline combining LLMs and terminology validation to generate HL7 FHIR R4 clinical datasets for diagnostic reasoning, with an 82.5% success rate.

cs.CL 2026-05-29

Valentina Bui Muti Eugénie Dulout Ziquan Fu

medical AI, EHR interoperability, FHIR standard, synthetic data, diagnostic reasoning

Key Findings

Methodology

This paper introduces a multi-stage generation pipeline that integrates large language models (LLMs) with terminology grounding to convert unstructured clinical text into structured, interoperable HL7 FHIR R4 bundles. The process has three stages: information extraction, FHIR resource synthesis, and semantic validation. During extraction, LLMs identify key clinical elements and retain the supporting source quotes for later validation. The synthesis stage maps the extracted data into FHIR resources such as Patient, Encounter, and Condition, guided by predefined templates. The validation stage uses SapBERT embeddings and FAISS indexing to check codes against curated terminologies such as SNOMED CT, LOINC, and RxNorm, and a repair loop corrects hallucinated or unsupported codes. Repeated validation and repair iterations improve structural and semantic fidelity. Applied to the MedCaseReasoning dataset, the pipeline produced valid FHIR bundles for 82.5% of cases, a rate reported as higher than that of prior approaches.
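The extraction stage is described as retaining source quotes for validation. A minimal sketch of what such a quote check could look like follows. The `ExtractedItem` schema, the `split_by_support` helper, and the sample note are all hypothetical and are not taken from the paper; the check shown is a plain normalized substring match, which may be stricter or looser than what the authors use.

```python
import re
from dataclasses import dataclass


@dataclass
class ExtractedItem:
    """One clinical element plus the quote that supports it (hypothetical schema)."""
    kind: str   # e.g. "symptom", "lab", "condition"
    value: str  # normalized value proposed by the extractor
    quote: str  # verbatim span the extractor claims to have read


def _normalize(text: str) -> str:
    # Collapse whitespace and lowercase so line wraps do not cause false mismatches.
    return re.sub(r"\s+", " ", text).strip().lower()


def split_by_support(source: str, items: list[ExtractedItem]):
    """Return (supported, unsupported) by checking each quote against the source text."""
    haystack = _normalize(source)
    supported, unsupported = [], []
    for item in items:
        quote = _normalize(item.quote)
        if quote and quote in haystack:
            supported.append(item)
        else:
            unsupported.append(item)
    return supported, unsupported


# Illustrative note and extractions (invented for this sketch).
note = "A 54-year-old man presented with fever and\nproductive cough. Serum sodium was 128 mmol/L."
items = [
    ExtractedItem("symptom", "fever", "presented with fever"),
    ExtractedItem("lab", "sodium 128 mmol/L", "Serum sodium was 128 mmol/L"),
    ExtractedItem("condition", "pneumonia", "diagnosed with pneumonia"),  # quote absent from the note
]
supported, unsupported = split_by_support(note, items)
print("supported:", [i.value for i in supported])
print("unsupported:", [i.value for i in unsupported])
```

Items whose quote cannot be found would be candidates for rejection or re-extraction before any FHIR resource is built from them.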

Key Results

In the MedCase-Structured dataset, LLMs such as GPT-5.4 and Claude-Opus-4.6 achieved diagnostic accuracies over 85% on plain text inputs, but performance dropped to as low as 70% when operating on structured FHIR data, indicating increased complexity and reasoning difficulty.
Approximately 17.5% of cases did not yield a valid FHIR bundle, mainly due to hallucinated codes and coverage gaps. The terminology validation and repair mechanisms mitigated these issues and improved data quality.
Across multiple diagnostic tasks, models performed consistently worse on structured data compared to unstructured text, emphasizing the challenges posed by the formal data format and the importance of deployment-aligned evaluation protocols.
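The text-versus-FHIR comparison above is a paired evaluation: the same case is scored once per input format. The sketch below shows one way to summarize such results. The `CaseResult` record, the `summarize` helper, and the four toy cases are hypothetical and only illustrate the bookkeeping; they are not results from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseResult:
    """Outcome for one diagnostic case under both input formats (hypothetical record)."""
    case_id: str
    correct_on_text: bool
    correct_on_fhir: bool


def summarize(results: list[CaseResult]) -> dict:
    """Compute per-format accuracy, the gap in percentage points, and cases lost on FHIR."""
    if not results:
        raise ValueError("no results to summarize")
    n = len(results)
    text_acc = sum(r.correct_on_text for r in results) / n
    fhir_acc = sum(r.correct_on_fhir for r in results) / n
    # Cases the model solved from plain text but missed from the structured bundle.
    lost = [r.case_id for r in results if r.correct_on_text and not r.correct_on_fhir]
    return {
        "text_accuracy": text_acc,
        "fhir_accuracy": fhir_acc,
        "gap_points": round(100 * (text_acc - fhir_acc), 1),
        "lost_on_fhir": lost,
    }


# Toy data, invented for illustration only.
toy = [
    CaseResult("c1", True, True),
    CaseResult("c2", True, False),
    CaseResult("c3", True, True),
    CaseResult("c4", False, False),
]
summary = summarize(toy)
print(summary)
```

Listing the cases lost on FHIR, rather than reporting only the aggregate gap, makes it easier to inspect whether failures come from the format itself or from information dropped during conversion.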

Significance

This work addresses a critical gap in clinical AI research by providing a scalable, controllable method to generate realistic, structured clinical datasets aligned with real-world interoperability standards. Such datasets enable rigorous benchmarking of clinical decision support systems (CDSS) in environments that closely mimic actual healthcare settings, facilitating better model generalization and robustness. The approach enhances the reproducibility and comparability of AI models across studies, accelerating clinical translation. Furthermore, it supports the development of more reliable and explainable AI tools, ultimately contributing to safer, more effective patient care. The pipeline’s ability to simulate complex diagnostic scenarios with high fidelity marks a significant step toward deploying AI in routine clinical workflows.

Technical Contribution

The paper’s main technical innovation lies in the integration of LLM-based extraction, structured resource synthesis, and a robust terminology grounding and validation framework. By employing SapBERT embeddings and FAISS for high-precision code matching, the method effectively reduces hallucination errors common in LLM outputs. The multi-stage validation and repair loop ensures that generated FHIR bundles meet both structural and semantic standards, enabling scalable and reliable synthetic data production. Compared to prior static or rule-based methods like Synthea, this approach offers higher flexibility, resource diversity, and control over clinical complexity, making it suitable for diagnostic reasoning benchmarks. The framework also lays the groundwork for future extensions to longitudinal data modeling and multi-resource integration.
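The control flow of a bounded validate-and-repair loop can be sketched as follows. This is an assumption-laden illustration: the `validate_and_repair` function, the `max_rounds` budget, the simplified `codings` structure, and the toy validator and repairer are all hypothetical. In the paper the repair step presumably involves the LLM and the terminology index; here it is a dictionary lookup so the example stays runnable.

```python
from typing import Callable

Bundle = dict
Validator = Callable[[Bundle], list[str]]
Repairer = Callable[[Bundle, list[str]], Bundle]


def validate_and_repair(bundle: Bundle, validate: Validator, repair: Repairer, max_rounds: int = 3):
    """Alternate validation and repair until the bundle is clean or the budget is spent.

    Returns (bundle, remaining_errors, rounds_used).
    """
    errors = validate(bundle)
    rounds = 0
    while errors and rounds < max_rounds:
        bundle = repair(bundle, errors)
        errors = validate(bundle)
        rounds += 1
    return bundle, errors, rounds


# Toy terminology table standing in for a curated code system.
KNOWN = {"44054006": "Diabetes mellitus type 2"}


def toy_validate(bundle: Bundle) -> list[str]:
    # Report every coding whose code is missing from the toy table.
    return [f"unknown code {c['code']}" for c in bundle["codings"] if c["code"] not in KNOWN]


def toy_repair(bundle: Bundle, errors: list[str]) -> Bundle:
    # Stand-in repair: look the code up by exact display text, else leave it unchanged.
    by_display = {display.lower(): code for code, display in KNOWN.items()}
    fixed = [{**c, "code": by_display.get(c["display"].lower(), c["code"])} for c in bundle["codings"]]
    return {**bundle, "codings": fixed}


draft = {"codings": [{"code": "00000000", "display": "Diabetes mellitus type 2"}]}
final, remaining, rounds = validate_and_repair(draft, toy_validate, toy_repair)
print(final, remaining, rounds)

hopeless = {"codings": [{"code": "00000000", "display": "Unmapped finding"}]}
_, still_bad, used = validate_and_repair(hopeless, toy_validate, toy_repair)
print(still_bad, used)
```

Capping the number of rounds means a case that cannot be repaired is returned with its remaining errors instead of looping indefinitely, which is one plausible way such a pipeline ends up with a share of cases that never reach a valid bundle.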

Novelty

This study is the first to propose a comprehensive, LLM-driven, multi-stage pipeline specifically designed for generating high-fidelity, structured FHIR datasets tailored for diagnostic reasoning evaluation. Unlike existing tools that mainly reconstruct existing records or generate static datasets, this pipeline allows on-demand, controllable synthesis from unstructured text, with integrated terminology validation to minimize hallucinations. The achievement of an 82.5% valid case generation rate demonstrates its robustness. This innovation bridges the gap between unstructured clinical narratives and structured interoperable data, providing a new standard for clinical AI benchmarking and development.

Limitations

The current pipeline supports a limited subset of FHIR resources, primarily focusing on core clinical entities, and does not yet model complex temporal relationships or longitudinal patient trajectories, which are vital for comprehensive clinical reasoning.
Terminology grounding remains imperfect, especially for emerging, ambiguous, or context-dependent codes, leading to residual hallucination errors that could impact downstream tasks.
The computational cost of multi-stage validation and repair processes is high, limiting scalability for large-scale or real-time applications. Further optimization is necessary to enhance efficiency.

Future Work

Future efforts will focus on expanding resource coverage, including more complex resource types such as procedures, observations, and longitudinal data representations. Improving the semantic validation process with context-aware mechanisms and dynamic terminology updates is also planned. Additionally, integrating multi-modal data sources like imaging and genomics could enrich synthetic datasets. Developing end-to-end training strategies that fine-tune LLMs on curated clinical datasets will further enhance the realism and accuracy of generated data. Ultimately, the goal is to create a comprehensive, scalable platform capable of supporting real-time clinical decision support evaluation and deployment, bridging the gap between research and practice.

AI Executive Summary

In the rapidly evolving landscape of healthcare AI, electronic health records (EHRs) serve as a vital repository of patient information, yet their inherent complexity and heterogeneity pose significant challenges for model training and evaluation. Traditional datasets like MIMIC-IV provide valuable benchmarks but are limited in scope, often lacking the structural richness and clinical diversity necessary for robust diagnostic reasoning assessments. Synthetic data generation offers a promising alternative, but existing tools such as Synthea rely on heuristic rules and predefined modules, which restrict their ability to simulate complex, atypical, or nuanced clinical scenarios.

This paper introduces MedCase-Structured, a novel framework designed to generate high-fidelity, structured clinical datasets aligned with HL7 FHIR R4 standards. The core innovation is a multi-stage pipeline that leverages large language models (LLMs) for information extraction and resource synthesis, combined with a rigorous terminology grounding and validation mechanism. The process begins with LLMs parsing unstructured clinical narratives to identify key patient data, including demographics, symptoms, labs, and medications. These extracted elements are then mapped into FHIR resources such as Patient, Encounter, and Condition, guided by predefined templates to ensure structural integrity.
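To make the target format concrete, the sketch below builds a small FHIR R4 `Bundle` holding a `Patient`, an `Encounter`, and a `Condition` as plain Python dictionaries, then checks that every internal reference resolves. The helper names (`make_bundle`, `dangling_references`), the identifiers, and the clinical values are illustrative assumptions; the paper's actual templates are not shown in this summary, and a real pipeline would also run a full FHIR validator.

```python
def make_bundle(patient_id: str, encounter_id: str, condition_id: str) -> dict:
    """Assemble a minimal collection Bundle with three linked resources (illustrative values)."""
    patient = {"resourceType": "Patient", "id": patient_id, "gender": "male", "birthDate": "1971-03-02"}
    encounter = {
        "resourceType": "Encounter",
        "id": encounter_id,
        "status": "finished",
        "class": {
            "system": "http://terminology.hl7.org/CodeSystem/v3-ActCode",
            "code": "AMB",
            "display": "ambulatory",
        },
        "subject": {"reference": f"Patient/{patient_id}"},
    }
    condition = {
        "resourceType": "Condition",
        "id": condition_id,
        "code": {
            "coding": [
                {"system": "http://snomed.info/sct", "code": "44054006", "display": "Diabetes mellitus type 2"}
            ]
        },
        "subject": {"reference": f"Patient/{patient_id}"},
        "encounter": {"reference": f"Encounter/{encounter_id}"},
    }
    resources = (patient, encounter, condition)
    return {"resourceType": "Bundle", "type": "collection", "entry": [{"resource": r} for r in resources]}


def iter_references(node):
    """Yield every string stored under a 'reference' key anywhere in a nested structure."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "reference" and isinstance(value, str):
                yield value
            else:
                yield from iter_references(value)
    elif isinstance(node, list):
        for child in node:
            yield from iter_references(child)


def dangling_references(bundle: dict) -> list[str]:
    """Return references that do not point at a resource contained in the bundle."""
    known = {f"{e['resource']['resourceType']}/{e['resource']['id']}" for e in bundle["entry"]}
    return sorted({ref for ref in iter_references(bundle) if ref not in known})


bundle = make_bundle("pat-1", "enc-1", "cond-1")
print(dangling_references(bundle))

# Break one link to show what the check reports.
bundle["entry"][2]["resource"]["encounter"]["reference"] = "Encounter/missing"
print(dangling_references(bundle))
```

A reference check of this kind covers only one aspect of structural consistency; required fields, cardinalities, and value-set bindings would need separate validation.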

A critical component of the pipeline is the terminology validation step, which employs SapBERT embeddings and FAISS indexing to verify and correct clinical codes against curated terminologies like SNOMED CT, LOINC, and RxNorm. This step effectively reduces hallucinated or unsupported codes, ensuring the generated datasets are both clinically plausible and interoperable. Multiple validation and repair iterations further enhance data quality, resulting in a synthetic dataset with an 82.5% success rate in producing valid FHIR bundles from clinician-authored cases.
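The grounding idea, embedding a mention and retrieving the nearest terminology entry above a similarity threshold, can be illustrated without the real models. In the sketch below, character-trigram counts stand in for SapBERT embeddings and a brute-force cosine scan stands in for a FAISS index; both substitutions are made only to keep the example dependency-free. The three-entry `TERMINOLOGY` table, the `ground` function, and the `min_score` threshold are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: character-trigram counts (a stand-in for a SapBERT vector)."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Tiny illustrative slice of a terminology (SNOMED CT concept IDs with short labels).
TERMINOLOGY = {
    "44054006": "Diabetes mellitus type 2",
    "38341003": "Hypertensive disorder",
    "195967001": "Asthma",
}
# Precomputed vectors; a brute-force scan over them replaces the FAISS index here.
INDEX = {code: embed(display) for code, display in TERMINOLOGY.items()}


def ground(mention: str, min_score: float = 0.5):
    """Return (code, score) for the nearest entry, or (None, score) if below the threshold."""
    query = embed(mention)
    best_code, best_score = max(
        ((code, cosine(query, vec)) for code, vec in INDEX.items()),
        key=lambda pair: pair[1],
    )
    if best_score >= min_score:
        return best_code, best_score
    return None, best_score


print(ground("type 2 diabetes mellitus"))
print(ground("fractured femur"))
```

The threshold is the main design choice: set too low, unsupported mentions are forced onto an unrelated code; set too high, valid paraphrases are rejected and fall through to the repair step.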

Evaluation on this dataset shows that while LLMs perform well on plain-text diagnostic tasks, their accuracy falls when they operate on structured FHIR data. For instance, GPT-5.4 achieved over 85% accuracy on unstructured inputs but dropped to approximately 70% on structured data, highlighting the added reasoning difficulty of formal data formats. These findings underscore the need to evaluate models on inputs that reflect real-world clinical data structures.

Overall, MedCase-Structured provides a scalable, controllable, and clinically realistic platform for benchmarking AI systems in diagnostic reasoning. It bridges the gap between unstructured narrative data and structured interoperable formats, facilitating more accurate and meaningful evaluation of clinical decision support tools. Future enhancements will aim to expand resource coverage, incorporate longitudinal data, and improve semantic validation, paving the way for AI models that are better suited for real-world deployment and ultimately improving patient outcomes.

Deep Dive

Abstract

Large language models (LLMs) show promise for clinical reasoning and decision support, but evaluation in realistic, electronic health record-congruent settings remains limited. Existing benchmarks often rely on static datasets or unstructured inputs that do not reflect the structured, interoperable data formats used in clinical systems. We introduce a pipeline for generating clinically realistic HL7 FHIR R4 bundles from unstructured text, enabling controllable evaluation of clinical decision support systems. The pipeline combines staged LLM generation with terminology-grounded validation and repair to reduce hallucinated codes and enforce structural and semantic consistency. Applying this approach to MedCaseReasoning, we construct MedCase-Structured, a synthetic dataset aligned with clinician-authored diagnostic cases, achieving valid FHIR generation for 82.5% of cases. Evaluation on MedCase-Structured reveals consistently lower diagnostic accuracy for LLMs on structured FHIR inputs than with plain text, highlighting the importance of deployment-aligned benchmarking.

cs.CL cs.AI

References (13)

Synthea: An approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record

Jason A. Walonoski, Mark Kramer, Joseph Nichols et al.

2017 428 citations

MedCaseReasoning: Evaluating and learning diagnostic reasoning from clinical case reports

Kevin Wu, Eric Wu, R. Thapa et al.

2025 22 citations

Billion-Scale Similarity Search with GPUs

Jeff Johnson, Matthijs Douze, H. Jégou

2017 5214 citations

Evaluation format, not model capability, drives triage failure in the assessment of consumer health AI

David Fraile Navarro, Farah Magrabi, Enrico W. Coiera

2026 4 citations

MIMIC-IV, a freely accessible electronic health record dataset

A. Johnson, Lucas Bulgarelli, Lu Shen et al.

2023 2830 citations

A systematic review of large language model (LLM) evaluations in clinical medicine

Sina Shool, Sara Adimi, Reza Saboori Amleshi et al.

2025 240 citations

MIMIC-IV on FHIR: converting a decade of in-patient data into an exchangeable, interoperable format

A. Bennett, Hannes Ulrich, P. Damme et al.

2023 34 citations

Self-Alignment Pretraining for Biomedical Entity Representations

Fangyu Liu, Ehsan Shareghi, Zaiqiao Meng et al.

2020 443 citations

Reasoning with large language models in medicine: a systematic review of techniques, challenges and clinical integration

Isra Mansoor, Muhammad Abdullah, M. Rizwan et al.

2025 11 citations

FHIR-GPT Enhances Health Interoperability with Large Language Models.

Yikuan Li, Hanyin Wang, H. Yerebakan et al.

2024 23 citations

A scoping review of using Large Language Models (LLMs) to investigate Electronic Health Records (EHRs)

Lingyao Li, Jiayan Zhou, Zhenxiang Gao et al.

2024 78 citations

Infherno: End-to-end Agent-based FHIR Resource Synthesis from Free-form Clinical Notes

Johann Frei, Nils Feldhus, Lisa Raithel et al.

2025 1 citation

EHRStruct: A Comprehensive Benchmark Framework for Evaluating Large Language Models on Structured Electronic Health Record Tasks

Xiao Yang, Xuejiao Zhao, Zhiqi Shen

2025 5 citations
