LLM-Guided Evolution for Medical Decision Pipelines

TL;DR

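This paper introduces LLM-guided MAP-Elites evolution for optimizing medical decision pipelines at inference time. Evolved programs, policies, and prompts improve accuracy and safety over manually designed baselines in triage, interactive consultation, and medical image classification; for example, Semigran triage accuracy rises from 77.3% to 87.1%.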

cs.CL Advanced 2026-06-05

Ivan Sviridov, Artem Oskin, Ivan Panin, Iaroslav Bespalov, Dmitry Dylov, Ivan Oseledets, Aleksandr Nesterov


AI medical decision, evolutionary algorithms, large language models, optimization

Key Findings

Methodology

This study employs a novel inference-time optimization framework combining large language models (LLMs) with MAP-Elites quality-diversity evolutionary algorithms. The core process involves using a pre-trained gpt-oss-120b model as a mutation operator to generate program variants through rewriting and mutation. Candidate solutions include executable decision programs, prompts, or policies tailored for specific clinical tasks. These candidates are evaluated via task-specific fitness functions—such as accuracy, safety, and interaction cost—and stored in an archive structured by behavioral descriptors. The MAP-Elites algorithm maintains diversity by exploring different behavioral niches, preventing premature convergence. The process iterates with LLM-driven mutations, selection based on fitness, and archive updates, enabling the discovery of multiple high-performing strategies without fine-tuning the underlying models. Experiments span three clinical scenarios: triage classification (Semigran, MIMIC-IV-ED), interactive consultation (MEDIQ, iCRAFTMD), and medical image classification (PneumoniaMNIST). The approach emphasizes interpretability, safety, and transferability, demonstrating that inference-time evolution can outperform manually engineered pipelines.
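The loop described above (mutate, evaluate, keep the best candidate per behavioral niche) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `mutate` is a stand-in for the LLM rewrite step and makes no model call, and the toy `evaluate` function, its fitness, and its length-based descriptor are all assumptions chosen so the example runs on its own.

```python
import random
from dataclasses import dataclass


@dataclass
class Candidate:
    program: str       # stands in for a prompt, policy, or decision program
    fitness: float     # task-specific score, higher is better
    descriptor: tuple  # behavioral descriptor: the archive cell it occupies


def mutate(parent_program: str, rng: random.Random) -> str:
    # Placeholder for the LLM rewrite step (assumption): a real system would
    # prompt the mutation model with the parent program and task feedback.
    return parent_program + rng.choice("abc")


def evaluate(program: str) -> tuple[float, tuple]:
    # Toy evaluation (assumption): fitness counts 'a' characters and the
    # descriptor buckets programs by length, capped at cell 4.
    fitness = float(program.count("a"))
    descriptor = (min(len(program) // 2, 4),)
    return fitness, descriptor


def map_elites(seed_program: str, iterations: int, seed: int = 0) -> dict:
    rng = random.Random(seed)
    archive: dict[tuple, Candidate] = {}

    def try_insert(program: str) -> None:
        fitness, descriptor = evaluate(program)
        incumbent = archive.get(descriptor)
        # Keep only the best candidate (the elite) in each behavioral cell.
        if incumbent is None or fitness > incumbent.fitness:
            archive[descriptor] = Candidate(program, fitness, descriptor)

    try_insert(seed_program)
    for _ in range(iterations):
        parent = rng.choice(list(archive.values()))  # uniform pick among elites
        try_insert(mutate(parent.program, rng))
    return archive


if __name__ == "__main__":
    for cell, elite in sorted(map_elites("a", 200).items()):
        print(cell, elite.fitness, elite.program)
```

Because each cell holds its own elite, a weaker but behaviorally different candidate survives alongside the global best, which is what lets the archive return several distinct strategies rather than one.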

Key Results

In the triage task, the evolved program SG-c1189 increased Semigran accuracy from 77.3% to 87.1%, and emergency recall from 0.60 to 0.97, significantly surpassing baseline methods. On MIMIC-ESI, the program reduced severe undertriage from 3.6% to 1.2%, indicating enhanced safety. These improvements were statistically significant (p<0.001).
In interactive consultation, the evolved policies improved the accuracy-cost frontier across models such as Llama-3, Qwen-3.5, and Gemma-4. For example, on Llama-3-70B, accuracy increased from 60.9% to 62.2%, while token usage decreased by 67.6%. The strategies transferred effectively to the held-out iCRAFTMD dataset, confirming robustness.
In PneumoniaMNIST classification, prompt-only evolution improved the accuracy of MedGemma-4B from below 51% to 68-72%, depending on resolution, with the highest at 224×224 reaching 72.5%. These results demonstrate that simple prompt modifications, guided by evolution, can substantially enhance vision-language model performance while maintaining strict JSON output constraints.
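The two triage safety metrics reported above can be computed roughly as follows. This is a sketch under stated assumptions: the label string `"emergency"` is hypothetical, and treating a prediction as severe undertriage when it is at least two levels less urgent than the truth is an illustrative definition, not necessarily the one used in the paper.

```python
def emergency_recall(y_true, y_pred, emergency_label="emergency"):
    """Share of true emergency cases that were predicted as emergency."""
    emergencies = [(t, p) for t, p in zip(y_true, y_pred) if t == emergency_label]
    if not emergencies:
        raise ValueError("no emergency cases in y_true")
    hits = sum(1 for _, p in emergencies if p == emergency_label)
    return hits / len(emergencies)


def severe_undertriage_rate(true_levels, pred_levels, gap=2):
    """Share of cases predicted at least `gap` levels less urgent than the truth.

    Levels are integers where 1 is the most urgent, as in ESI. The default
    gap of 2 is an illustrative assumption.
    """
    if len(true_levels) != len(pred_levels) or not true_levels:
        raise ValueError("inputs must be non-empty and of equal length")
    severe = sum(1 for t, p in zip(true_levels, pred_levels) if p - t >= gap)
    return severe / len(true_levels)


if __name__ == "__main__":
    y_true = ["emergency", "emergency", "non-emergency", "self-care"]
    y_pred = ["emergency", "non-emergency", "non-emergency", "self-care"]
    print(emergency_recall(y_true, y_pred))                        # 0.5
    print(severe_undertriage_rate([1, 2, 3, 3], [3, 2, 3, 5]))     # 0.5
```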

Significance

This research addresses critical limitations of current AI-driven medical decision systems, notably the high costs of fine-tuning and the manual effort involved in prompt engineering. By leveraging LLMs as mutation operators within a quality-diversity evolutionary framework, it enables automatic, interpretable, and task-specific optimization of decision pipelines at inference time. This approach significantly reduces development overhead, enhances safety by calibrating decision boundaries, and improves transferability across models and tasks. The methodology offers a scalable, flexible solution adaptable to diverse clinical scenarios, paving the way for more autonomous and trustworthy AI-assisted healthcare systems. Its ability to produce multiple diverse strategies aligns well with the complex, multi-objective nature of clinical decision-making, addressing long-standing challenges of robustness and explainability.

Technical Contribution

The core technical innovation lies in integrating large language models directly into the evolutionary optimization process as mutation engines, enabling program-level modifications without retraining. The use of MAP-Elites ensures behavioral diversity, allowing exploration of multiple strategies that balance accuracy, safety, and cost. The framework supports heterogeneous candidate representations—ranging from prompts to executable programs—making it adaptable to various clinical tasks. The combination of structured fitness functions, lineage tracking, and multi-model transfer evaluation constitutes a comprehensive pipeline that advances the state-of-the-art in inference-time optimization. This approach opens new avenues for scalable, interpretable AI in safety-critical domains, with potential for automated policy discovery and continuous improvement.
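Lineage tracking, mentioned above, can be as simple as storing a parent pointer with every candidate so that an elite's history can be replayed. The record fields and identifiers below are hypothetical and only illustrate the idea; the source does not describe the authors' data structures.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LineageRecord:
    candidate_id: str
    parent_id: Optional[str]  # None marks a seed program
    fitness: float
    mutation_note: str        # short description of what the mutation changed


def trace_lineage(records: dict, candidate_id: str) -> list:
    """Return the chain of records from the seed program to `candidate_id`."""
    chain = []
    seen = set()
    current = candidate_id
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle detected at {current}")
        seen.add(current)
        record = records[current]
        chain.append(record)
        current = record.parent_id
    return list(reversed(chain))


if __name__ == "__main__":
    # Hypothetical identifiers and scores, for illustration only.
    records = {
        "seed": LineageRecord("seed", None, 0.70, "manual baseline"),
        "c1": LineageRecord("c1", "seed", 0.75, "added red-flag check"),
        "c2": LineageRecord("c2", "c1", 0.80, "tightened urgency threshold"),
    }
    for rec in trace_lineage(records, "c2"):
        print(rec.candidate_id, rec.fitness, rec.mutation_note)
```

Reading the mutation notes along a chain is one way to see which program-level change produced a gain.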

Novelty

This work is pioneering in applying LLM-guided MAP-Elites evolution specifically to medical decision pipelines, a domain where safety and interpretability are paramount. Unlike prior studies focused on static prompt tuning or fine-tuning, it dynamically evolves executable artifacts at inference time, enabling multi-objective optimization in complex clinical tasks. The integration of quality-diversity algorithms with large language model mutations is novel, providing a flexible framework that can generate diverse, high-quality solutions without retraining models. This approach addresses a gap in the literature by demonstrating effective, safe, and transferable strategies across multiple healthcare scenarios, establishing a new paradigm for adaptive AI in medicine.

Limitations

The method relies heavily on the quality and stability of the underlying large language models; stochasticity in model outputs can affect reproducibility and consistency of evolved solutions, especially in high-stakes clinical settings.
The computational cost of iterative evolution, involving multiple model calls and candidate evaluations, remains high, limiting real-time deployment in resource-constrained environments.
While the approach enhances interpretability through program-level mechanisms, the complexity of evolved artifacts may still pose challenges for clinical validation and regulatory approval, necessitating further explainability studies.

Future Work

Future research will focus on integrating multi-modal data sources, such as electronic health records and imaging, to create more comprehensive decision pipelines. Combining reinforcement learning with evolutionary strategies may further improve robustness and adaptability. Additionally, involving clinicians in the loop for feedback and validation can enhance interpretability and trust. Scaling the framework to real-world clinical workflows, optimizing computational efficiency, and establishing rigorous safety and efficacy standards will be crucial steps toward clinical adoption. Automated multi-objective optimization and formal verification of evolved programs are also promising directions.

AI Executive Summary

The rapid advancement of large language models (LLMs) has revolutionized many AI applications, including healthcare. However, adapting these models to complex clinical workflows remains a significant challenge. Traditional approaches rely heavily on costly fine-tuning or manual prompt engineering, which are time-consuming, resource-intensive, and often lack robustness and interpretability. This paper introduces a novel inference-time optimization framework—LLM-guided MAP-Elites evolution—that addresses these limitations by enabling dynamic, multi-objective optimization of medical decision pipelines without retraining models.

At its core, the framework leverages a pre-trained LLM (gpt-oss-120b) as a mutation engine, which rewrites and mutates candidate decision programs, prompts, or policies. These candidates are evaluated using task-specific fitness functions that measure accuracy, safety, and interaction cost, reflecting real-world clinical priorities. The MAP-Elites algorithm maintains a diverse archive of high-performing solutions across behavioral niches, ensuring exploration of multiple strategies rather than converging on a single solution. This approach facilitates the discovery of multiple effective decision policies suited for different clinical scenarios.

The methodology was validated across three critical medical tasks: urgency triage, interactive consultation, and medical image classification. In triage, the evolved program SG-c1189 increased accuracy from 77.3% to 87.1%, and emergency recall from 0.60 to 0.97, outperforming baseline methods and reducing severe undertriage. In interactive consultation, strategies optimized for models like Llama-3 and Qwen-3.5 achieved better accuracy-cost trade-offs, with transferability demonstrated on unseen datasets such as iCRAFTMD. For image classification, prompt-only evolution improved MedGemma models’ accuracy on PneumoniaMNIST from below 51% to over 72%, maintaining strict output formats.

These results highlight the potential of inference-time evolution to generate interpretable, safe, and high-performance decision strategies in healthcare. The evolved programs incorporate mechanisms such as calibrated triage boundaries, targeted evidence acquisition, and visual decision rules, which are crucial for clinical trust and safety. The approach significantly reduces manual effort, enhances model transferability, and offers a scalable solution adaptable to diverse tasks. Despite current limitations related to model stochasticity and computational costs, the framework paves the way for more autonomous, explainable, and efficient AI systems in medicine. Future work will explore multi-modal data integration, reinforcement learning enhancements, and clinical validation, aiming to bring these innovations closer to real-world deployment and impact.

Deep Analysis

Background

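AI in healthcare has moved from rule-based systems to deep learning models, and large language models (LLMs) such as GPT-3, Llama, and Qwen have shown great potential in medical text understanding, diagnostic assistance, and image analysis. Early research focused on fine-tuning pre-trained models for specific tasks, which is costly and slow to adapt to new scenarios. More recently, methods based on prompt engineering and inference-time optimization have emerged that try to improve performance without fine-tuning. Most of these, however, rely on static prompts or hand-designed rules and cannot adapt dynamically. Quality-diversity algorithms such as MAP-Elites offer a new route to multi-strategy, multi-objective optimization. By combining them with the mutation capabilities of large models, researchers have begun to generate and optimize decision programs dynamically at inference time, gaining flexibility and interpretability. Against this background, this study proposes a medical decision optimization framework that combines LLM mutation with MAP-Elites, aiming to address the shortcomings of existing methods in safety, efficiency, and transferability.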

Core Problem

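Medical decision workflows are complex and variable, spanning emergency triage, clinical consultation, and imaging diagnosis. Traditional methods rely on hand-designed rules or fine-tuned models, which consume substantial human effort and struggle to meet the diverse demands of clinical environments. Manual prompt engineering depends on experience and repeated trial and error; it lacks systematic structure and interpretability, and it is hard to guarantee safety and reliability. Fine-tuning can improve performance, but it is expensive, and fine-tuned models are difficult to transfer quickly to new tasks or environments. The core problem is how to maintain model performance while lowering development cost and improving strategy diversity and safety. In realistic multi-task, multi-model, multi-objective settings in particular, a single optimization objective cannot cover every requirement, so a flexible and efficient solution is needed.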

Innovation

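The main innovations of this study are: 1) an LLM-guided MAP-Elites evolution framework that optimizes programs and policies dynamically at inference time, avoiding fine-tuning cost and improving adaptability; 2) multi-objective fitness functions that combine accuracy, safety metrics, and interaction cost to meet the multi-dimensional needs of clinical practice; 3) validation across emergency triage, clinical consultation, and image classification, demonstrating cross-task transfer; 4) program-level mechanism adjustments (such as calibrated boundaries, targeted evidence acquisition, and visual prompts) that strengthen interpretability and safety. Together these go beyond the limits of static prompts and fine-tuning and give medical AI a dynamic, interpretable optimization path.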

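Methodology

Use the pre-trained gpt-oss-120b model as the mutation operator to rewrite and mutate candidate programs, ensuring diversity.
Build task-specific fitness functions covering accuracy, recall, safety metrics (such as missed-diagnosis rate), and interaction cost, reflecting real clinical needs.
Use the MAP-Elites algorithm to maintain a diverse set of candidates in the behavioral descriptor space, avoiding local optima and ensuring that different strategies are explored.
Design candidate program representations for multiple tasks, from simple prompts to complete decision programs, to support the optimization needs of different scenarios.
Over multiple rounds of evolution, select the programs that perform best on the validation set and test them on unseen data to confirm that the strategies generalize.
In each round, use the LLM to mutate and rewrite candidate programs and refine them using task feedback, forming a closed optimization loop.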
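The task-specific fitness function in the list above combines accuracy, a safety metric, and interaction cost. One simple way to do that is a weighted scalar score, sketched here. The field names, the weights, and the token budget are all illustrative assumptions; the source does not give the authors' actual formula.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    accuracy: float            # fraction correct, in [0, 1]
    missed_urgent_rate: float  # safety metric: fraction of urgent cases missed
    mean_tokens: float         # interaction cost per case


def scalar_fitness(result: EvalResult,
                   safety_weight: float = 2.0,
                   cost_weight: float = 0.1,
                   token_budget: float = 2000.0) -> float:
    """Higher is better. All weights are illustrative assumptions."""
    # Cap the cost term so very long interactions cannot dominate the score.
    normalized_cost = min(result.mean_tokens / token_budget, 1.0)
    return (result.accuracy
            - safety_weight * result.missed_urgent_rate
            - cost_weight * normalized_cost)


if __name__ == "__main__":
    risky = EvalResult(accuracy=0.8, missed_urgent_rate=0.1, mean_tokens=1000.0)
    safer = EvalResult(accuracy=0.8, missed_urgent_rate=0.0, mean_tokens=1000.0)
    print(scalar_fitness(risky), scalar_fitness(safer))
```

Weighting the safety term more heavily than accuracy means that, at equal accuracy, a candidate that misses fewer urgent cases always ranks higher.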

Experiments

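The experiments cover three main scenarios: emergency triage (Semigran, MIMIC-IV-ED), interactive consultation (MEDIQ, iCRAFTMD), and medical image classification (PneumoniaMNIST). Each scenario has a baseline (a hand-designed program or prompt) and an evolved variant, and the evaluation metrics include accuracy, recall, safety metrics, interaction cost, and how well the output adheres to the required structured format. Transfer is tested across multiple models (such as GPT-4, Llama-3, Qwen-3.5, and Gemma-4) to check that strategies generalize. Each task is split into training, validation, and test sets; evolution runs on the training set and final performance is reported on the test set. Ablation experiments assess the contribution of individual mechanisms such as program rewriting, behavioral descriptor design, and multi-objective optimization.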
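Output-format adherence, one of the metrics listed above, can be checked mechanically. The sketch below assumes a hypothetical schema with a single `"label"` key and the two PneumoniaMNIST classes as allowed values; the paper's actual JSON schema is not given in this summary.

```python
import json

# Assumed label set and key name, for illustration only.
ALLOWED_LABELS = {"normal", "pneumonia"}


def parse_strict_json(raw: str):
    """Return the predicted label, or None if the output breaks the format."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return None
    # Reject anything other than an object with exactly the expected key.
    if not isinstance(payload, dict) or set(payload) != {"label"}:
        return None
    label = payload["label"]
    return label if isinstance(label, str) and label in ALLOWED_LABELS else None


def format_compliance(outputs) -> float:
    """Fraction of raw model outputs that parse into a valid label."""
    if not outputs:
        raise ValueError("outputs must be non-empty")
    return sum(parse_strict_json(o) is not None for o in outputs) / len(outputs)


if __name__ == "__main__":
    outputs = [
        '{"label": "pneumonia"}',
        'Sure! {"label": "normal"}',              # extra prose breaks strict JSON
        '{"label": "pneumonia", "why": "haze"}',  # unexpected extra key
    ]
    print([parse_strict_json(o) for o in outputs])
    print(format_compliance(outputs))
```

Counting unparseable outputs as failures keeps an evolved prompt from gaining accuracy by drifting away from the required format.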

Results

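In emergency triage, the evolved program raised Semigran accuracy from 77.3% to 87.1% and recall from 0.60 to 0.97, clearly outperforming the hand-designed approach. On MIMIC-ESI, the best program lowered the rate of severe undertriage (from 3.6% to 1.2%), improving safety. In interactive consultation, the policies improved accuracy on Llama-3 and Qwen-3.5 (for example, Llama-3 from 45.8% to 48.2% and Qwen-3.5 from 71.1% to 73.6%) while substantially reducing interaction tokens (for example, from 2100 to 961 on Qwen-3.5), improving the cost-effectiveness balance. In medical image classification, prompt-only evolution raised MedGemma accuracy on PneumoniaMNIST from below 51% to above 68%, with a maximum of 72.5%, showing what is achievable within a limited mutation space. These results demonstrate the effectiveness and transferability of evolved strategies across scenarios and models.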
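The accuracy versus token-cost trade-off discussed above is naturally summarized as a Pareto frontier: the set of policies for which no other policy is both at least as accurate and at least as cheap. A minimal sketch follows; the policy names and numbers in the usage example are made up for illustration and are not results from the paper.

```python
def pareto_frontier(points):
    """Keep policies not dominated in (higher accuracy, lower token cost).

    `points` is a list of (name, accuracy, tokens) tuples.
    """
    frontier = []
    for name, acc, tokens in points:
        # A policy is dominated if another is no worse on both axes
        # and strictly better on at least one.
        dominated = any(
            other_acc >= acc and other_tokens <= tokens
            and (other_acc > acc or other_tokens < tokens)
            for _, other_acc, other_tokens in points
        )
        if not dominated:
            frontier.append((name, acc, tokens))
    return sorted(frontier, key=lambda p: p[2])


if __name__ == "__main__":
    # Illustrative values only.
    policies = [
        ("baseline", 0.60, 2000),
        ("evolved_a", 0.62, 900),
        ("evolved_b", 0.65, 1500),
        ("evolved_c", 0.61, 1600),
    ]
    print(pareto_frontier(policies))
```

In this toy input, `evolved_a` and `evolved_b` form the frontier: one is cheaper, the other more accurate, and neither dominates the other.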

Applications

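The method applies to clinical emergency triage, telemedicine consultation, and medical imaging decision support. It improves strategies through program optimization without changing the underlying model, which lowers deployment cost. In the future it could be combined with multi-modal information and human-computer interaction to build smarter and safer clinical decision support systems. In the long run, the technique may help spread personalized medicine and automated diagnostic workflows, reduce the burden on medical staff, and improve the efficiency and safety of care.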

Limitations & Outlook

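The method currently depends on high-quality pre-trained models, and the stability and controllability of model-driven mutation still need to be verified, especially in complex clinical scenarios where bias may appear. During multi-objective optimization the evolutionary process may get stuck in local optima, with no guarantee of reaching a global optimum. The experiments were mostly run in simulated environments or on limited datasets; real clinical use must also account for real-time performance, user experience, and ethical compliance. In addition, the evolutionary process is computationally expensive, so algorithmic efficiency will need to improve before large-scale clinical deployment.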

Plain Language Accessible to non-experts

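Imagine you are in a kitchen preparing a complicated dish. Traditionally, you follow the recipe step by step, and it may take repeated trial and error, adjusting the seasoning and the heat, before the dish turns out well. Now suppose you have a smart chef (like a large language model) who watches as you cook, learns your preferences, and even helps you improve the recipe. This chef does not change your kitchen equipment, but offers suggestions and adjusts the steps as you go, so the dish keeps getting better. This research is like letting that smart chef keep trying different approaches in the kitchen to find the one that suits you best, saving time and producing a tastier result. It uses a method called "evolution" that mimics natural selection, repeatedly trying and refining until the dish best fits your needs. That way, even without a professional chef's experience, you can make a high-quality meal. Medical decision-making is similar: complex, changing situations require continual testing and adjustment of strategies to find the safest and most efficient approach.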

ELI14 Explained like you're 14

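Imagine you are entering a competition at school and need to design a strategy that will help you win. In the past, you might have spent a lot of time working it out yourself, writing down some rules and testing them over and over to see which works best. Now you have a super-smart friend (like a large language model) who can come up with many different strategies and tell you which ones look most reliable. Together you try these strategies to see which fits the competition best. This research uses a method called "evolution" to let that smart friend keep improving your strategy, trying out many different approaches until you find the best one. It is like learning through trial and error to find the method that suits you. So even if you are not an expert, this smart friend can help you win, and the same idea can be used for complicated things like medical decisions to find the safest and most effective plan. The process is like leveling up in a game until you become stronger and smarter.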

Abstract

Adapting large language models (LLMs) to clinical workflows often requires costly fine-tuning or manual prompt and pipeline engineering. We study LLM-guided MAP-Elites evolution as an inference-time alternative for discovering medical decision strategies and provide an implementation repository at https://github.com/univanxx/llm_guided_evo_medical. We formulate urgency triage, interactive consultation, and medical image classification as evolutionary searches over executable artifacts optimized by task-specific fitness functions. Across all three settings, evolution improves over manually designed baselines under practical constraints. In triage, evolved programs increase Semigran accuracy from 77.3% to 87.1% and emergency recall from 0.60 to 0.97, while improving safety-weighted held-out MIMIC-ESI performance. In interactive consultation, evolved policies improve the accuracy-cost frontier across Llama-3, Qwen-3.5, and Gemma-4 and transfer to held-out iCRAFTMD. In PneumoniaMNIST, prompt-only evolution improves frozen MedGemma VLMs while preserving strict JSON outputs. Qualitative analysis shows that the gains come from interpretable program-level mechanisms (calibrated triage boundaries, targeted evidence acquisition, selective commitment, and finding-oriented visual decision rules) rather than superficial prompt rewording alone.

cs.CL cs.NE

