MLEvolve: A Self-Evolving Framework for Automated Machine Learning Algorithm Discovery

TL;DR

MLEvolve is a self-evolving, LLM-based multi-agent framework for end-to-end machine learning algorithm discovery. It achieves a 65.3% medal rate on MLE-Bench under a 12-hour budget.

cs.AI 2026-06-05

Shangheng Du, Xiangchao Yan, Jinxin Shi, Zongsheng Cao, Shiyang Feng, Zichen Liang, Boyuan Sun, Tianshuo Peng, Yifan Zhou, Xin Li, Jie Zhou, Liang He, Bo Zhang, Lei Bai


AutoML, Multi-agent System, Graph Search, Experience Memory, Self-evolution

Key Findings

Methodology

MLEvolve is built around Progressive MCGS (Monte Carlo Graph Search), an extension of Monte Carlo Tree Search that adds graph-based cross-branch information sharing and an entropy-inspired progressive exploration schedule. The framework combines three core modules: (1) Progressive MCGS, which improves the exploration-exploitation balance through graph structure and adaptive scheduling; (2) Retrospective Memory, which unifies static domain knowledge with dynamic experience storage for continuous learning; and (3) Hierarchical Planning with Adaptive Code Generation, which decouples strategic planning from code implementation and switches among full-rewrite, stepwise, and diff modes based on the search state. Together, these components enable stable, long-horizon iterative optimization of ML pipelines. On MLE-Bench, MLEvolve reaches a 65.3% medal rate under a 12-hour budget, outperforming the baseline methods; on mathematical optimization tasks it outperforms AlphaEvolve, which indicates strong cross-domain generalization.
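As a rough illustration of switching among coding modes based on the search state, the sketch below picks a mode from two hypothetical signals. The `SearchState` fields, the failure threshold, and the rules themselves are assumptions made for illustration; this summary does not state the criteria MLEvolve actually uses.

```python
from dataclasses import dataclass
from enum import Enum


class CodingMode(Enum):
    FULL_REWRITE = "full_rewrite"
    STEPWISE = "stepwise"
    DIFF = "diff"


@dataclass
class SearchState:
    """Hypothetical summary of where the search currently stands."""
    has_working_parent: bool   # is there a parent solution that runs end to end?
    consecutive_failures: int  # recent failed attempts on this branch


def choose_mode(state: SearchState) -> CodingMode:
    """Pick a coding mode from the search state (illustrative rules only)."""
    if not state.has_working_parent:
        # Nothing to edit yet: draft a complete pipeline.
        return CodingMode.FULL_REWRITE
    if state.consecutive_failures >= 2:
        # Repeated failures: rebuild the solution one step at a time.
        return CodingMode.STEPWISE
    # A working parent on a stable branch: apply a small, targeted edit.
    return CodingMode.DIFF


print(choose_mode(SearchState(has_working_parent=True, consecutive_failures=0)))
```

Keeping the decision in one small function means the planner can stay unaware of how code is produced, which is the kind of separation the hierarchical planning module describes.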

Key Results

Under a 12-hour time budget, MLEvolve achieved an average medal rate of 65.3% on MLE-Bench, outperforming all baseline methods, indicating highly efficient exploration and solution refinement capabilities.
On mathematical algorithm optimization tasks, MLEvolve surpassed AlphaEvolve, demonstrating its ability to generalize across different problem domains with improved success rates and solution quality.
The integration of Progressive MCGS and Retrospective Memory significantly improved search stability and efficiency, reducing redundant exploration and enabling experience-driven optimization.

Significance

This work addresses longstanding challenges in long-horizon automated machine learning, providing a comprehensive framework that combines graph-based search, experience memory, and hierarchical control. It advances the state-of-the-art in autonomous AI systems capable of continuous self-improvement over extended periods. The demonstrated cross-domain capabilities suggest broad applicability in scientific discovery, industrial automation, and complex system optimization. By enabling AI agents to learn from accumulated experience and adapt their strategies dynamically, this research paves the way for more autonomous, intelligent systems that can operate effectively in real-world, long-term scenarios.

Technical Contribution

The paper introduces Progressive MCGS, a novel graph-structured extension of Monte Carlo Tree Search that facilitates cross-branch information flow and adaptive exploration. The Retrospective Memory mechanism integrates static domain knowledge with dynamic experience storage, enabling continuous learning during search. The hierarchical planning module decouples strategic reasoning from code implementation, supporting multiple coding modes for stability and flexibility. These innovations collectively push the frontier of automated algorithm discovery, providing a robust, scalable approach for long-horizon optimization tasks.

Novelty

This is the first work to incorporate Progressive MCGS with cross-branch information sharing and entropy-based exploration scheduling into automated machine learning. The integration of Retrospective Memory for experience accumulation and the decoupling of planning and coding modes further distinguishes this framework from prior methods, enabling sustained self-evolution and cross-domain generalization. These innovations collectively establish a new paradigm for autonomous, long-term AI optimization.

Limitations

Despite significant improvements, the approach may still face challenges in extremely high-dimensional or highly complex search spaces, where exploration becomes computationally expensive within limited time frames.
The reliance on large language models introduces computational costs and potential biases, which could affect robustness and reproducibility in real-world applications.
The current framework's performance depends on the quality of the static domain knowledge base and experience retrieval, which may require domain-specific tuning and curation.

Future Work

Future research could focus on integrating reinforcement learning techniques to enhance exploration strategies further. Developing more efficient memory storage and retrieval systems will be critical for scaling to larger, more complex problems. Extending the framework to multi-modal data and multi-agent collaboration could unlock new applications in scientific research and industrial automation. Additionally, improving interpretability and explainability of the search process will be vital for practical deployment and trustworthiness.

AI Executive Summary

The rapid advancement of artificial intelligence has driven significant interest in automating the design of high-performance machine learning algorithms. Traditional AutoML approaches, while effective in automating model selection and hyperparameter tuning, often rely heavily on manual intervention and struggle with complex, multi-stage pipelines. Recent developments leveraging large language models (LLMs) have opened new avenues for autonomous AI agents capable of long-term, iterative optimization. These agents can plan, generate code, execute, evaluate, and adapt strategies over extended horizons, mimicking human-like problem-solving processes.

However, existing systems face critical limitations. Many suffer from information isolation between different search trajectories, preventing effective knowledge transfer and reuse. They also lack mechanisms to accumulate and leverage past experiences, leading to redundant exploration. Furthermore, the coupling of strategic planning and code generation hampers the flexibility and stability of the search process. These issues collectively hinder the ability of autonomous agents to perform sustained, efficient, and reliable long-horizon optimization.

In response, this paper introduces MLEvolve, a novel self-evolving multi-agent framework designed to address these challenges. Central to MLEvolve is Progressive MCGS, an innovative graph-structured extension of Monte Carlo Tree Search. This mechanism facilitates cross-branch information sharing through reference edges, enabling the transfer of successful strategies across different search paths. Coupled with an entropy-inspired progressive exploration schedule, it dynamically balances exploration and exploitation, improving search efficiency.
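To make the shift from exploration to exploitation concrete, here is a minimal sketch of UCT selection with an exploration constant that shrinks as the budget is consumed. The UCT formula is the standard one; the linear decay, the endpoint values, and the dictionary layout of a child are assumptions that stand in for the entropy-inspired schedule, whose exact form is not given here.

```python
import math


def exploration_weight(progress: float, c_start: float = 1.4, c_end: float = 0.2) -> float:
    """Linearly decay the exploration constant as the budget is used up.

    `progress` is the fraction of the search budget consumed, clamped to [0, 1].
    The linear form and the endpoints are assumptions for illustration.
    """
    p = min(max(progress, 0.0), 1.0)
    return c_start + (c_end - c_start) * p


def uct_score(total_reward: float, visits: int, parent_visits: int, c: float) -> float:
    """Standard UCT: mean reward plus an exploration bonus."""
    if visits == 0:
        return math.inf  # always try unvisited children first
    return total_reward / visits + c * math.sqrt(math.log(parent_visits) / visits)


def select_child(children: list[dict], parent_visits: int, progress: float) -> dict:
    """Return the child with the highest UCT score under the current schedule."""
    c = exploration_weight(progress)
    return max(children, key=lambda ch: uct_score(ch["reward"], ch["visits"], parent_visits, c))


children = [
    {"name": "well_explored", "reward": 6.0, "visits": 10},
    {"name": "rarely_tried", "reward": 0.8, "visits": 2},
]
print(select_child(children, parent_visits=12, progress=0.0)["name"])
print(select_child(children, parent_visits=12, progress=1.0)["name"])
```

With these numbers, the rarely tried child wins early in the run because its exploration bonus dominates, and the well-explored child with the higher mean reward wins once the constant has decayed.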

Complementing this is Retrospective Memory, which combines a curated static knowledge base with a dynamic experience repository. This setup allows the system to automatically accumulate, retrieve, and reuse valuable search experiences, significantly reducing redundant efforts and enhancing solution quality over time. To further improve stability and control, the framework decouples strategic planning from code generation, employing adaptive modes such as full rewrite, stepwise, and diff-based editing, tailored to the current search context.
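The Deep Analysis section names FAISS and RRF as the retrieval components. Assuming RRF refers to reciprocal rank fusion, the sketch below shows how two ranked lists of stored experiences could be merged into one. The experience ids and the split into a dense and a keyword retriever are hypothetical, and `k = 60` is the conventional default rather than a value reported for MLEvolve.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: each item scores sum(1 / (k + rank)) over the lists."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Sort by fused score, highest first; ties are broken by id for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical results from two retrievers over the experience store.
dense_hits = ["exp_a", "exp_b", "exp_c"]
keyword_hits = ["exp_b", "exp_d", "exp_a"]
print(reciprocal_rank_fusion([dense_hits, keyword_hits]))
# → ['exp_b', 'exp_a', 'exp_d', 'exp_c']
```

Because the fusion uses only ranks, it does not need the two retrievers to produce comparable similarity scores.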

Experimental results demonstrate the effectiveness of MLEvolve. On the comprehensive MLE-Bench, it achieved a 65.3% medal rate within a 12-hour window, outperforming all existing methods, including proprietary and open-source baselines. Its cross-domain generalization was validated through superior performance on mathematical optimization tasks, surpassing AlphaEvolve. These findings highlight the framework’s robustness, efficiency, and adaptability.

This research marks a significant step toward autonomous AI systems capable of long-term self-improvement. By integrating graph-based search, experience memory, and hierarchical control, MLEvolve provides a scalable, flexible platform for automated machine learning pipeline discovery. Its ability to learn from past experiences and adapt strategies dynamically opens new horizons for scientific discovery, industrial automation, and beyond. Future directions include incorporating reinforcement learning, multi-modal data, and multi-agent collaboration to further enhance autonomous capabilities, making AI systems more intelligent, reliable, and versatile in tackling complex real-world problems.

Deep Analysis

Background

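As AI technology has evolved, the automated design of high-performance machine learning algorithms has become a research focus. Early AutoML methods such as Auto-WEKA and TPOT automated mainly hyperparameter optimization and model selection, but still depended on substantial human expertise and tedious tuning. In recent years, automated algorithm search driven by deep learning and reinforcement learning, such as Neural Architecture Search (NAS), has greatly raised the level of automation. In particular, agent systems built on large pretrained models (such as the GPT series) have begun to show potential for autonomous evolution on long-horizon tasks. Representative work includes AlphaEvolve and ML-Master, which explore candidate solutions through tree search, evolutionary algorithms, and multi-agent collaboration. However, these methods commonly suffer from information silos, missing experience, and a lack of hierarchical control, which limits their performance on complex long-horizon tasks. As demand for autonomous AI systems grows, researchers have turned to the question of how systems can keep improving themselves and generalize across domains, which motivates this work.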

Core Problem

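Existing automated algorithm discovery methods face three main difficulties in long-horizon optimization. First, information silos: branches do not communicate effectively, so successful strategies are hard to transfer between search paths. Second, no memory mechanism: past experience cannot be accumulated and reused, which lowers exploration efficiency. Third, no hierarchical control: code generation is mostly monolithic and does not separate strategy from implementation, which hurts the stability and efficiency of the search. These problems are especially pronounced in complex, multi-stage, multi-task machine learning engineering, and they severely constrain the effectiveness and generalization of automated algorithm discovery. Resolving these bottlenecks is key to autonomous long-horizon optimization.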

Innovation

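The core innovations of this work are:

Progressive MCGS: introduces cross-reference edges in a graph structure to support cross-branch information sharing, combined with an entropy-inspired progressive exploration schedule, so that the search gradually shifts from exploration to exploitation, improving efficiency and stability.
Retrospective Memory: combines a static domain knowledge base with a dynamic global memory to automatically accumulate and reuse search experience, avoiding repeated exploration and strengthening the system's ability to learn autonomously.
Hierarchical planning and adaptive coding: separates strategic decisions from code implementation and supports three modes (full rewrite, stepwise generation, and diff-based editing), improving the stability and controllability of code generation.

Together, these innovations push the technical boundary of end-to-end automated machine learning discovery and offer a new approach to long-horizon optimization of complex tasks.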

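Methodology

Search space construction: candidate solutions are organized as a directed graph in which each node is a complete ML pipeline and the edges comprise generation relations (E_T) and reference relations (E_ref).
Progressive MCGS: in the selection phase, the UCT criterion is combined with an information-entropy schedule that dynamically adjusts the exploration strategy, so that the search gradually shifts from exploration to exploitation.
Graph expansion: cross-reference edges support information flow across branches and the fusion of solutions, increasing the diversity and efficiency of the search.
Experience recall: a static knowledge base is combined with a dynamic global memory, with FAISS and RRF used for efficient retrieval, supporting the accumulation and reuse of task-relevant experience.
Hierarchical planning: strategic decisions are separated from code generation, and one of three coding modes (full rewrite, stepwise, and diff) is selected adaptively according to the search state.
Experimental design: the method is evaluated on MLE-Bench and mathematical optimization tasks against multiple baselines, including AlphaEvolve, using metrics such as medal rate and submission rate.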
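The directed graph with generation edges (E_T) and reference edges (E_ref) can be sketched as a small data structure. The class names, field names, and the direction chosen for reference edges (from the referenced node to the node that borrows from it) are assumptions for illustration, not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SolutionNode:
    """One node: a complete ML pipeline plus its evaluation score."""
    node_id: str
    code: str
    score: Optional[float] = None


@dataclass
class SearchGraph:
    nodes: dict = field(default_factory=dict)
    tree_edges: set = field(default_factory=set)  # E_T: (parent, child)
    ref_edges: set = field(default_factory=set)   # E_ref: (referenced, child)

    def add_node(self, node: SolutionNode, parent_id: Optional[str] = None,
                 reference_ids: tuple = ()) -> None:
        """Insert a node, linking it to its parent and to any referenced nodes."""
        self.nodes[node.node_id] = node
        if parent_id is not None:
            self.tree_edges.add((parent_id, node.node_id))
        for ref_id in reference_ids:
            self.ref_edges.add((ref_id, node.node_id))

    def sources(self, node_id: str) -> list:
        """All nodes that contributed to `node_id`, through either edge type."""
        return sorted(src for src, dst in self.tree_edges | self.ref_edges if dst == node_id)


graph = SearchGraph()
graph.add_node(SolutionNode("root", "# baseline pipeline"))
graph.add_node(SolutionNode("a", "# gradient boosting branch"), parent_id="root")
graph.add_node(SolutionNode("b", "# neural network branch"), parent_id="root")
# Node "c" is generated from "a" but also borrows ideas from the sibling branch "b".
graph.add_node(SolutionNode("c", "# blended pipeline"), parent_id="a", reference_ids=("b",))
print(graph.sources("c"))
# → ['a', 'b']
```

Keeping the two edge sets separate preserves the tree needed for backpropagating scores while still recording which other branches informed a node.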

Experiments

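The experiments use two main benchmarks: MLE-Bench (75 Kaggle tasks spanning low, medium, and high complexity) and the mathematical optimization tasks from AlphaEvolve (15 instances). The agent uses the Gemini-3.1-Pro model, and each run is provisioned with 21 vCPUs, 234 GB of memory, and an NVIDIA H200 GPU. Each task is limited to at most 500 expansions and a 12-hour time limit. Evaluation metrics include medal rate, valid submission rate, and task success rate. The comparison covers a range of AutoML frameworks and algorithm discovery tools, and ablation studies verify the contributions of Progressive MCGS, Retrospective Memory, and hierarchical planning. Hyperparameters are tuned on a validation set to ensure fairness and stability.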
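The two per-task limits (at most 500 expansions, 12 hours) can be enforced by a simple driver loop. This is a sketch under assumptions: `run_search`, `expand_once`, and the injectable `clock` parameter are hypothetical names, and the real system may check its budget differently.

```python
import time
from typing import Callable


def run_search(expand_once: Callable[[], None],
               max_expansions: int = 500,
               time_limit_s: float = 12 * 3600.0,
               clock: Callable[[], float] = time.monotonic) -> int:
    """Call `expand_once()` until either budget is exhausted.

    Returns the number of expansions performed. The time check happens
    before each expansion, so an expansion already in flight is not cut short.
    """
    start = clock()
    expansions = 0
    while expansions < max_expansions and clock() - start < time_limit_s:
        expand_once()
        expansions += 1
    return expansions


# A no-op expansion hits the expansion cap long before the time limit.
print(run_search(lambda: None, max_expansions=3))
# → 3
```

Passing the clock as a parameter makes the time limit testable without waiting; `time.monotonic` is used by default because it is unaffected by system clock adjustments.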

Results

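Under a 12-hour budget, MLEvolve achieves an average medal rate of 65.3% on MLE-Bench, better than all compared methods, showing its advantage in long-running exploration. On the mathematical optimization tasks, MLEvolve surpasses AlphaEvolve, with a higher success rate and better solution quality. Progressive MCGS markedly improves search efficiency and reduces wasted exploration; Retrospective Memory improves the use of experience and reduces repeated work; hierarchical coding improves the stability of code generation. These results support the effectiveness and practicality of the framework design.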
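For readers unfamiliar with the reported metrics, the sketch below computes a medal rate and a valid submission rate from per-task outcomes. The `TaskResult` layout and the task names are hypothetical, and the sketch covers a single run only; how the reported 65.3% is averaged across runs or complexity tiers is not specified in this summary.

```python
from typing import Optional, TypedDict


class TaskResult(TypedDict):
    task: str
    valid_submission: bool
    medal: Optional[str]  # "gold", "silver", "bronze", or None


def medal_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks on which the agent earned any medal."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(r["medal"] is not None for r in results) / len(results)


def valid_submission_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks with a submission that passed format validation."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(r["valid_submission"] for r in results) / len(results)


# Made-up outcomes for four hypothetical tasks.
results: list[TaskResult] = [
    {"task": "task_1", "valid_submission": True, "medal": "gold"},
    {"task": "task_2", "valid_submission": True, "medal": None},
    {"task": "task_3", "valid_submission": True, "medal": "bronze"},
    {"task": "task_4", "valid_submission": False, "medal": None},
]
print(medal_rate(results), valid_submission_rate(results))
# → 0.5 0.75
```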

Applications

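The method applies to automated machine learning pipeline design, algorithm exploration in scientific research, and model optimization in industry. Given only a task description and the underlying data, the system explores solutions autonomously, reducing manual intervention. In the future, combining it with automated hardware scheduling and multimodal data input could enable fully automated AI system design and support progress in smart manufacturing, scientific innovation, and personalized services.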

Limitations & Outlook

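Despite notable progress, MLEvolve may still hit exploration bottlenecks in extremely complex or high-dimensional spaces, and a global optimum is hard to guarantee within limited time. Storing and retrieving from a large knowledge base poses efficiency challenges that may affect response times in practice. In addition, performance on some specific tasks remains limited by the reasoning ability of the underlying LLM; stronger knowledge fusion and reasoning mechanisms are needed to improve it.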

Abstract

Large language model (LLM) agents are increasingly applied to long-horizon tasks such as scientific discovery and machine learning engineering (MLE), where sustained self-evolution becomes a key capability. However, existing MLE agents suffer from inter-branch information isolation, memoryless search, and lack of hierarchical control, which together hinder long-horizon optimization. We present MLEvolve, an LLM-based self-evolving multi-agent framework for end-to-end machine learning algorithm discovery. By extending tree search to Progressive MCGS, MLEvolve enables cross-branch information flow through graph-based reference edges and gradually shifts the search from broad exploration to focused exploitation with an entropy-inspired progressive schedule. To allow the agent to evolve with accumulated experience, we introduce Retrospective Memory, which combines a cold-start domain knowledge base with a dynamic global memory for task-specific experience retrieval and reuse. For stable long-horizon iteration, we further decouple strategic planning from code generation with adaptive coding modes. Evaluation on MLE-Bench shows that MLEvolve achieves state-of-the-art performance across multiple dimensions including average medal rate and valid submission rate under a 12-hour budget (half the standard runtime). Moreover, MLEvolve also outperforms specialized algorithm discovery methods including AlphaEvolve on mathematical algorithm optimization tasks, demonstrating strong cross-domain generalization. Our code is available at https://github.com/InternScience/MLEvolve.

cs.AI cs.CL

References (20)

Annan Li, Chufan Wu, Z. Ge et al. The FM Agent. 2025 (13 citations).

Jun Shern Chan, Neil Chowdhury, Oliver Jaffe et al. MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering. 2024 (247 citations).

Xinyu Zhu, Yuzhu Cai, Zexi Liu et al. Toward Ultra-Long-Horizon Agentic Science: Cognitive Accumulation for Machine Learning Engineering. 2026 (18 citations).

Bogdan Georgiev, Javier Gómez-Serrano, Terence Tao et al. Mathematical exploration and discovery at scale. 2025 (58 citations).

Zhaoling Chen, Xiangru Tang, Gangda Deng et al. LocAgent: Graph-Guided LLM Agents for Code Localization. 2025 (65 citations).

Ruiyi Zhang, Peijia Qin, Qingmei Cao et al. AIBuildAI: An AI Agent for Automatically Building AI Models. 2026 (2 citations).

Zeyu Zhang, Quanyu Dai, Xiaohe Bo et al. A Survey on the Memory Mechanism of Large Language Model-based Agents. 2024 (568 citations).

Chris Lu, Cong Lu, R. Lange et al. Towards end-to-end automation of AI research. 2026 (71 citations).

Richard Van Noorden, Jeffrey Perkel. AI and science: what 1,600 researchers think. 2023 (291 citations).

Qian Huang, Jian Vora, Percy Liang et al. MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation. 2023 (234 citations).

Shiyang Feng, Runmin Ma, Xiang-yu Yan et al. InternAgent-1.5: A Unified Agentic Framework for Long-Horizon Autonomous Scientific Discovery. 2026 (18 citations).

Shangheng Du, Xiangchao Yan, Dengyang Jiang et al. AutoMLGen: Navigating Fine-Grained Optimization for Coding Agents. 2025 (12 citations).

Edan Toledo, Karen Hambardzumyan, Martin Josifoski et al. AI Research Agents for Machine Learning: Search, Exploration, and Generalization in MLE-bench. 2025 (34 citations).

Xu Yang, Xiao Yang, Shikai Fang et al. R&D-Agent: An LLM-Agent Framework Towards Autonomous Data Science. 2025 (19 citations).

Jiefeng Chen, Bhavana Dalvi, Jaehyun Nam et al. MARS: Modular Agent with Reflective Search for Automated AI Research. 2026 (8 citations).

Edouard Leurent, Odalric-Ambrym Maillard. Monte-Carlo Graph Search: the Value of Merging Similar States. 2020 (22 citations).

Alireza Nadafian, Alireza Mohammadshahi, Majid Yazdani. KAPSO: A Knowledge-grounded framework for Autonomous Program Synthesis and Optimization. 2026 (5 citations).

Alexander Novikov, Ngân Vũ, Marvin Eisenberger et al. AlphaEvolve: A coding agent for scientific and algorithmic discovery. 2025 (529 citations).

Saleema Amershi, Andrew Begel, C. Bird et al. Software Engineering for Machine Learning: A Case Study. 2019 (998 citations).

Yifei Zhang, Xu Yang, Xiao Yang et al. Reasoning as Gradient: Scaling MLE Agents Beyond Tree Search. 2026 (2 citations).
