MUSE-Autoskill: Self-Evolving Agents via Skill Creation, Memory, Management, and Evaluation

TL;DR

MUSE-Autoskill reaches a 68.40% task success rate on SkillsBench through unified skill lifecycle management, and the skills it generates transfer to other agents.

cs.AI Advanced 2026-05-27
Huawei Lin, Peng Li, Jie Song, Fuxin Jiang, Tieying Zhang
Large Language Models, Skill Evolution, Skill Management, Memory Mechanisms, Autonomous Agents

Key Findings

Methodology

This paper introduces MUSE-Autoskill, a skill-centric agent framework built on large language models (LLMs) that integrates the full skill lifecycle: creation, memory, management, evaluation, and refinement. Its core innovations are an embedded skill_create tool that generates skills on demand inside the runtime loop, a skill-level memory module that accumulates experience across tasks, unit-test-driven automatic evaluation and feedback to check skill reliability, and adaptive context compression with cross-session persistence for long-horizon tasks. Skills are executed through a unified interface inside isolated sandbox environments, which keeps them modular and testable. The framework also supports cross-agent skill transfer, so skills can be reused beyond a single agent instance.
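To make the idea of skill-level memory concrete, here is a minimal sketch of a per-skill experience log. The summary does not specify the paper's storage format, so the class names (SkillNote, SkillMemory), the fields, and the example skill name are all hypothetical and chosen only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SkillNote:
    """One free-form observation about a skill (hypothetical structure)."""
    task_id: str
    succeeded: bool
    note: str


@dataclass
class SkillMemory:
    """Hypothetical per-skill experience log, keyed by skill name."""
    notes: dict[str, list[SkillNote]] = field(default_factory=dict)

    def record(self, skill: str, task_id: str, succeeded: bool, note: str) -> None:
        # Append the observation to this skill's history, creating it if needed.
        self.notes.setdefault(skill, []).append(SkillNote(task_id, succeeded, note))

    def failure_modes(self, skill: str) -> list[str]:
        # Notes from failed invocations, in the order they were recorded.
        return [n.note for n in self.notes.get(skill, []) if not n.succeeded]

    def success_rate(self, skill: str) -> Optional[float]:
        # Fraction of recorded invocations that succeeded; None if no history.
        entries = self.notes.get(skill, [])
        if not entries:
            return None
        return sum(n.succeeded for n in entries) / len(entries)


memory = SkillMemory()
memory.record("pdf_table_extract", "task-1", True, "works on single-page PDFs")
memory.record("pdf_table_extract", "task-2", False, "fails on scanned PDFs without OCR")
print(memory.failure_modes("pdf_table_extract"))
```

Because the notes persist across tasks, a later invocation could consult them before reuse, for example to avoid a known failure mode or to decide that the skill needs refinement.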

Key Results

  • On the SkillsBench benchmark of 51 real-world tasks, MUSE-Autoskill achieves 68.40% accuracy, a 15.21 percentage point improvement over the no-skill baseline, outperforming Codex and Hermes agents.
  • In 35 tasks where skills were successfully generated, MUSE-Autoskill surpasses the human-skill ceiling, demonstrating the effectiveness and reliability of automatic skill creation.
  • Cross-agent skill transfer experiments show that injecting MUSE-generated skills into the Hermes agent raises its accuracy to 79%, validating the portability and generality of the skills.

Significance

MUSE-Autoskill addresses long-standing limitations in skill-centric agents by treating skills as long-lived, testable, and experience-aware assets rather than isolated static artifacts. This holistic lifecycle approach enhances agents' problem-solving capabilities, efficiency, and skill reuse, while enabling cross-agent sharing. The training-free design ensures broad applicability across LLM backbones and agent architectures, facilitating scalable and sustainable intelligent agent development. These advances have significant implications for both academic research and industrial applications in autonomous AI systems.

Technical Contribution

This work is the first to comprehensively integrate all five stages of the skill lifecycle within a single training-free framework. It introduces a novel skill-level memory that accumulates per-skill experience across tasks, enabling dynamic adaptation. The unit-test-driven evaluation and automatic refinement loop ensure high skill reliability and robustness. The unified skill execution interface leveraging sandbox environments guarantees modular, secure, and consistent skill invocation. Furthermore, the framework empirically validates cross-agent skill transfer, breaking the conventional coupling of skills to specific models and enabling broader skill portability.

Novelty

MUSE-Autoskill is the first training-free system to cover the entire skill lifecycle (creation, memory, management, evaluation, and refinement) within one framework. Its skill-level memory and automated unit-test-driven refinement distinguish it from prior methods that address only some lifecycle stages or rely on reinforcement learning. The explicit demonstration of cross-agent skill transfer is unprecedented, establishing skills as transferable, long-lived experience assets rather than transient model behaviors.

Limitations

  • In the science and engineering domain, MUSE-Autoskill underperforms Codex on some boundary tasks, likely due to the complexity and constraints in skill generation.
  • Skill creation depends on successful task trajectories; high initial failure rates can limit skill generation efficiency and quality.
  • The multi-level memory and context management impose significant computational and storage overhead, posing challenges for large-scale deployment.

Future Work

Future directions include extending skill generation to multimodal inputs to enhance expressiveness and coverage in complex domains; optimizing memory compression and retrieval to reduce resource consumption; exploring broader cross-agent and cross-model skill transfer scenarios; integrating reinforcement learning for improved skill selection and refinement automation; and developing more comprehensive skill evaluation frameworks encompassing functionality, robustness, and safety.

AI Executive Summary

Large language model (LLM) agents have rapidly advanced in solving complex, multi-step tasks involving diverse domains and external tool interactions. However, existing approaches often treat skills—reusable units of capability—as isolated, static artifacts, limiting their reusability, reliability, and long-term improvement. This paper presents MUSE-Autoskill, a novel skill-centric agent framework that unifies the entire skill lifecycle: creation, memory, management, evaluation, and refinement. By embedding a skill creation tool within the runtime loop, MUSE-Autoskill enables on-demand generation of executable skill packages complete with interface definitions, scripts, resources, and unit tests. A unique skill-level memory accumulates experience across tasks, informing future invocations and adaptations. Skills are evaluated through automated unit tests and runtime feedback, triggering automatic refinement when failures occur. Adaptive context compression and cross-session state persistence support long-horizon task execution without information loss.

The framework executes skills via a unified interface within sandboxed environments, ensuring modularity, safety, and consistent behavior. Skills are indexed and managed in a skill bank, facilitating efficient retrieval, merging, pruning, and cross-agent transfer. Experiments on the SkillsBench benchmark, comprising 51 real-world tasks across science & engineering, data analysis, document processing, and operations & planning domains, demonstrate that MUSE-Autoskill achieves a 68.40% task success rate, outperforming strong baselines including Codex and Hermes. Notably, in tasks where skills are successfully generated, the system surpasses human skill performance ceilings. Cross-agent skill transfer experiments show that skills generated by MUSE-Autoskill can be injected into other agents, significantly improving their accuracy and validating skill portability.
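As a rough illustration of the skill bank described above, the following sketch indexes skills by a text description, retrieves them by keyword overlap, and prunes unused or frequently failing entries. The retrieval scoring, the pruning thresholds, and every name here (SkillEntry, SkillBank, max_failure_rate, min_uses) are assumptions made for the example, not details taken from the paper. Merging is omitted.

```python
from dataclasses import dataclass


@dataclass
class SkillEntry:
    """Hypothetical skill metadata record."""
    name: str
    description: str
    uses: int = 0
    failures: int = 0


class SkillBank:
    """Toy skill index with keyword retrieval and pruning (illustrative only)."""

    def __init__(self) -> None:
        self._skills = {}

    def register(self, entry: SkillEntry) -> None:
        self._skills[entry.name] = entry

    def retrieve(self, query: str, top_k: int = 3) -> list:
        # Score each skill by the number of words shared with the query.
        query_words = set(query.lower().split())
        scored = [
            (len(query_words & set(s.description.lower().split())), s.name)
            for s in self._skills.values()
        ]
        # Keep only skills with some overlap; best score first, ties by name.
        ranked = sorted((p for p in scored if p[0] > 0), key=lambda p: (-p[0], p[1]))
        return [name for _, name in ranked[:top_k]]

    def prune(self, max_failure_rate: float = 0.5, min_uses: int = 1) -> list:
        # Drop skills that were never used enough or that fail too often.
        removed = [
            s.name
            for s in self._skills.values()
            if s.uses < min_uses
            or (s.uses > 0 and s.failures / s.uses > max_failure_rate)
        ]
        for name in removed:
            del self._skills[name]
        return removed


bank = SkillBank()
bank.register(SkillEntry("pdf_tables", "extract tables from pdf documents", uses=4, failures=1))
bank.register(SkillEntry("csv_stats", "compute summary statistics for csv data"))
bank.register(SkillEntry("gantt_plan", "build project schedule", uses=2, failures=2))
print(bank.retrieve("extract tables from a pdf report"))
print(bank.prune())
```

A production system would more likely rank skills with embeddings or let the LLM choose from injected metadata. Keyword overlap is used here only to keep the sketch dependency-free.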

MUSE-Autoskill addresses critical gaps in prior work by treating skills as long-lived, testable, and experience-aware assets, enabling continuous self-improvement and scalable skill reuse. Its training-free design ensures broad applicability across different LLM backbones and agent architectures. This holistic approach advances the state of the art in autonomous agent design, paving the way for more robust, efficient, and adaptable AI systems capable of lifelong learning and cross-domain generalization.

Despite these advances, challenges remain. The system shows slightly lower performance in certain scientific and engineering tasks, reflecting the complexity of skill generation in these domains. Skill creation depends on successful task executions, which may limit efficiency in early learning phases. Additionally, the memory and context management components impose computational overhead that must be optimized for large-scale deployment. Future work aims to extend skill generation to multimodal inputs, enhance memory efficiency, explore richer cross-agent transfer scenarios, and integrate reinforcement learning to further automate skill refinement.

Overall, MUSE-Autoskill offers a comprehensive framework for self-evolving agents that autonomously create, manage, and refine reusable skills, significantly advancing the capabilities and scalability of intelligent autonomous systems.

Deep Analysis

Background

The field of large language model (LLM) agents has witnessed rapid progress in recent years, driven by advances in pretrained models and interactive frameworks. Seminal works such as ReAct introduced the paradigm of interleaving reasoning and action, enabling agents to interact with external tools and environments dynamically. Subsequent systems like Agent-Omni and OmniGAIA expanded this to multimodal autonomous agents capable of handling diverse workflows. Parallel research has focused on equipping agents with tool-use capabilities, ranging from few-shot tool invocation to large-scale API orchestration and software engineering assistants like CodeAgent and SWE-Agent. These systems are benchmarked on suites such as GAIA, SWE-bench, and AgentBench, covering web browsing, real-world software tasks, and multi-environment tool use. Despite these advances, most frameworks treat available actions as fixed tool registries or flat conversational contexts, lacking native support for agents to autonomously author, validate, and accumulate reusable capabilities over time. Skills have emerged as a natural abstraction to decouple capabilities from monolithic model weights, enabling modular execution and structured knowledge accumulation. However, enabling agents to continuously improve through self-managed skills remains an open challenge.

Core Problem

Existing automatic skill creation approaches suffer from four main limitations. First, a creation-usage mismatch arises because skills are often generated without access to the agent’s runtime context, leading to skills that do not align well with actual task needs. Second, there is no structured per-skill memory to accumulate free-form experience about individual skills across tasks, limiting adaptation and reuse. Third, skills are typically static and unvalidated, lacking unit-test-driven evaluation or automatic refinement mechanisms, which undermines reliability. Fourth, poor context handling results in truncated or overflowing conversation histories during long-horizon tasks, causing information loss and degraded performance. These issues prevent skills from evolving into long-lived, transferable assets and hinder the scalability and robustness of skill-centric agent systems.

Innovation

MUSE-Autoskill introduces several key innovations to address these challenges. First, it formalizes the skill lifecycle into five integrated stages—creation, memory, management, evaluation, and refinement—providing a unified framework for skill evolution. Second, it embeds skill creation within the agent’s runtime loop via a built-in skill_create tool, eliminating the creation-usage mismatch by leveraging current context. Third, it introduces a novel skill-level memory that accumulates experience and metadata for each skill across tasks, enabling more effective reuse and adaptation. Fourth, it implements unit-test-driven evaluation, automatically triggering skill refinement upon test failures, ensuring high reliability. Fifth, it employs a structured context manager with adaptive compression and cross-session persistence to handle long-horizon tasks without context window overflow. Finally, it demonstrates cross-agent skill transfer, showing that skills generated by one agent can be used by others without modification, enhancing portability and scalability.
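The unit-test-driven evaluation and refinement loop can be sketched as follows. This is a simplified illustration under stated assumptions: the function names (evaluate_and_refine, run_tests, patch) and the max_rounds limit are hypothetical, the patch step stands in for an LLM-generated fix, and exec is used in-process only for brevity where the described system runs tests inside a sandbox.

```python
from typing import Callable, Optional


def evaluate_and_refine(
    source: str,
    run_tests: Callable[[str], list],
    patch: Callable[[str, list], str],
    max_rounds: int = 3,
) -> Optional[str]:
    """Return skill source that passes its tests, or None if refinement fails.

    run_tests returns a list of failure messages (empty means all passed).
    patch proposes a new source given the failures (an LLM call in practice).
    """
    for round_index in range(max_rounds + 1):
        failures = run_tests(source)
        if not failures:
            return source  # Only passing skills would be registered.
        if round_index == max_rounds:
            break
        source = patch(source, failures)
    return None


def run_tests(source: str) -> list:
    # Toy unit test: load the skill and check one expected behaviour.
    namespace = {}
    exec(source, namespace)  # Illustration only; a real system would sandbox this.
    failures = []
    if namespace["add"](2, 3) != 5:
        failures.append("add(2, 3) should be 5")
    return failures


def toy_patch(source: str, failures: list) -> str:
    # Stand-in for an LLM-generated fix driven by the failure messages.
    return source.replace("a - b", "a + b")


buggy_skill = "def add(a, b):\n    return a - b\n"
fixed_skill = evaluate_and_refine(buggy_skill, run_tests, toy_patch)
print(fixed_skill)
```

Bounding the number of refinement rounds is a design choice of this sketch: it keeps a skill that cannot be repaired from looping forever and makes the "reject" outcome explicit.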

Methodology

The MUSE-Autoskill framework operates as follows:


  • Iterative Decision Loop: The agent cycles through planning, action, and observation stages, dynamically decomposing tasks, selecting or creating skills, and refining plans based on feedback.
  • Skill Creation: When existing skills are insufficient, the embedded skill_create tool generates a complete skill package, including a SKILL.md interface file, scripts, resources, and unit tests, guided by a high-level specification.
  • Skill Evaluation: Newly created skills undergo automated unit testing within sandboxed Docker environments; only passing skills are registered in the skill bank. Failed tests trigger automatic patching and re-evaluation.
  • Skill Execution: Skills are invoked through a unified interface, executing code or reading resources inside isolated sandboxes. Execution is integrated into the agent’s ReAct loop, allowing iterative refinement and error handling.
  • Skill Memory: A multi-level memory system maintains short-term task context, long-term general experience, and skill-level notes capturing usage history and failure modes, supporting informed skill retrieval and adaptation.
  • Skill Management: Skills are indexed by metadata and injected into the system prompt at task start. The agent selects relevant skills during planning, merges overlapping skills, refines failing ones, and prunes unused or faulty skills to maintain a compact, reliable skill bank.
  • Context Management: Conversation history is represented as a directed acyclic graph (DAG) of nodes, with adaptive two-level compression (node summarization and multi-node merging) to fit within model token limits. Cross-session snapshots enable task resumption without loss.
  • Cross-Agent Transfer: Generated skills are portable and can be injected into different agents without modification, facilitating skill sharing and reuse across heterogeneous systems.
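The two-level context compression mentioned in the list above (node summarization, then multi-node merging) could look roughly like the sketch below. It is an assumption-heavy illustration: token counts are approximated by word counts, summarization is a simple truncation standing in for an LLM summary, and the names Node, summarize, and compress are hypothetical. Nodes are assumed to arrive in chronological (topological) order, and the most recent node is kept verbatim unless only two nodes remain.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One conversation node in the history DAG (hypothetical structure)."""
    node_id: int
    text: str
    parents: list = field(default_factory=list)
    summarized: bool = False


def tokens(text: str) -> int:
    # Crude token estimate: whitespace-separated words.
    return len(text.split())


def summarize(text: str, keep: int = 8) -> str:
    # Placeholder for an LLM summary: keep the first few words.
    words = text.split()
    return text if len(words) <= keep else " ".join(words[:keep]) + " ..."


def compress(nodes: list, budget: int) -> list:
    """Shrink the history to fit the budget; mutates the given nodes."""
    def total() -> int:
        return sum(tokens(n.text) for n in nodes)

    # Level 1: summarize older nodes one at a time, oldest first.
    for node in nodes[:-1]:
        if total() <= budget:
            return nodes
        node.text = summarize(node.text)
        node.summarized = True

    # Level 2: merge the two oldest nodes until the budget is met.
    while total() > budget and len(nodes) > 2:
        first, second = nodes[0], nodes[1]
        merged = Node(first.node_id, summarize(first.text + " " + second.text),
                      list(first.parents), True)
        for node in nodes[2:]:
            # Re-point edges from the absorbed node to the merged node.
            rewired = [merged.node_id if p == second.node_id else p for p in node.parents]
            node.parents = list(dict.fromkeys(rewired))
        nodes = [merged] + nodes[2:]
    return nodes


long_text = " ".join(f"w{i}" for i in range(20))
history = [Node(i, long_text, [i - 1] if i else []) for i in range(5)]
compressed = compress(history, budget=50)
print(len(compressed), sum(tokens(n.text) for n in compressed))
```

Summarizing before merging keeps per-node structure for as long as possible, so merging only removes nodes once individual summaries are not enough to fit the budget.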

Experiments

Experiments were conducted on SkillsBench, a benchmark comprising 51 real-world tasks across four super-domains: science & engineering, data analysis, document processing, and operations & planning. Each task runs in an isolated Docker container with automated verification of final outputs, yielding a reward score in [0,1]. The evaluation protocol averages scores over five runs per task, excluding environment failures. Baselines include Codex and Hermes agents, both powered by GPT-5.5, ensuring that performance differences arise from system design rather than model capacity. The study evaluates (1) the impact of skill usage on task accuracy, (2) the effectiveness of automatic skill generation from agent experience, and (3) cross-agent skill transfer by injecting MUSE-generated skills into Hermes. All agents share identical settings except for skill-related components, enabling fair comparison.
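The scoring protocol described above can be sketched as a short aggregation routine. Several details are assumptions: environment failures are represented as None, a task whose runs all failed is skipped, and the benchmark-level figure is taken as the unweighted mean of per-task scores. The function names and the example rewards are made up for illustration.

```python
from statistics import mean
from typing import Optional


def task_score(run_rewards: list) -> Optional[float]:
    """Average reward over the runs of one task, ignoring environment failures.

    Each entry is a reward in [0, 1], or None for an environment failure.
    """
    valid = [r for r in run_rewards if r is not None]
    return mean(valid) if valid else None


def benchmark_accuracy(per_task_runs: dict) -> float:
    """Unweighted mean of per-task scores, expressed as a percentage."""
    scores = [task_score(runs) for runs in per_task_runs.values()]
    scored = [s for s in scores if s is not None]
    return 100 * mean(scored)


# Made-up rewards for two tasks, five runs each; None marks an environment failure.
example_runs = {
    "task_a": [1.0, 1.0, None, 0.0, 1.0],
    "task_b": [0.5, 0.5, 0.5, 0.5, 0.5],
}
print(benchmark_accuracy(example_runs))
```

In this example task_a averages over four valid runs rather than five, which is the effect of excluding the failed run instead of counting it as zero.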

Results

Key findings include:


  • MUSE-Autoskill achieves 68.40% average accuracy across 51 tasks, outperforming Codex (67.28%) and Hermes (61.21%), with a 15.21 percentage point lift over no-skill baselines.
  • On 35 tasks where skills were successfully generated, MUSE-Autoskill surpasses the human skill ceiling, demonstrating the high quality and utility of automatically created skills.
  • Cross-agent skill transfer experiments show that injecting MUSE skills into Hermes raises its accuracy to 79%, confirming skill portability and generalization.
  • Domain-wise analysis reveals MUSE leads in data analysis, document processing, and operations & planning, while trailing Codex slightly in science & engineering, highlighting areas for future improvement.
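The headline numbers above imply a few quantities that are not stated directly. The short calculation below derives them, assuming the 15.21 point lift is measured against a no-skill configuration evaluated on the same 51 tasks. The implied baseline is therefore an inference from the reported figures, not a number quoted from the paper.

```python
from decimal import Decimal

# Reported accuracies (percent) and lift (percentage points).
muse = Decimal("68.40")
codex = Decimal("67.28")
hermes = Decimal("61.21")
lift_over_no_skill = Decimal("15.21")

# Implied no-skill baseline, under the assumption stated above.
implied_no_skill = muse - lift_over_no_skill

# Margins of MUSE-Autoskill over each baseline agent, in percentage points.
margin_over_codex = muse - codex
margin_over_hermes = muse - hermes

print(implied_no_skill, margin_over_codex, margin_over_hermes)
```

Decimal is used so the subtraction is exact. The margin over Codex is about one percentage point, far smaller than the margin over Hermes, which is worth keeping in mind when reading "outperforming" in the first bullet.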

Applications

MUSE-Autoskill is applicable to complex, multi-step tasks requiring domain-specific knowledge and tool use, such as scientific computing and simulation, automated data analysis, intelligent document processing, and system operations planning. Its skill lifecycle management enables agents to accumulate and reuse capabilities over time, supporting lifelong learning and scalable autonomous systems. The cross-agent skill transfer capability facilitates skill sharing across different products and models, promoting industrial collaboration and ecosystem development. The framework’s modularity and training-free design also make it adaptable to emerging multimodal and interactive AI applications, including intelligent assistants, automated software engineering, and complex workflow management.

Limitations & Outlook

Despite its strengths, MUSE-Autoskill has limitations:


  • Slightly lower performance in certain scientific and engineering tasks suggests challenges in generating high-quality skills for complex domains.
  • Skill creation depends on successful task trajectories; high failure rates in early stages can hinder skill generation efficiency and quality.
  • The multi-level memory and adaptive context compression impose computational and storage overhead, which may limit scalability.
  • The evaluation mechanism relies primarily on unit tests, which may not capture all aspects of skill robustness and safety.
  • Cross-agent transfer, while effective, may face challenges due to interface and environment heterogeneity across agents.

Abstract

Large language model (LLM) agents rely on reusable skills to solve complex tasks. However, existing skill creation approaches treat skills as isolated and static artifacts, limiting their reusability, reliability, and long-term improvement. We propose MUSE-Autoskill Agent (Memory-Utilizing Skill Evolution), a skill-centric agent framework that lets agents continuously improve their task-solving capability by creating, reusing, and refining skills under a unified lifecycle (creation, memory, management, evaluation, and refinement). Our framework enables agents to create skills on demand, store and reuse them across tasks, organize and select them efficiently, and evaluate them through unit tests and runtime feedback for continuous refinement. We further introduce skill-level memory that accumulates experience for each skill across tasks, enabling more effective reuse and adaptation over time. Experiments on SkillsBench provide initial evidence that lifecycle-managed skills can improve task success, efficiency, reuse, and cross-agent transfer, highlighting the importance of treating skills as long-lived, experience-aware, and testable assets.

cs.AI cs.CL cs.LG cs.MA
