MOSS: Self-Evolution through Source-Level Rewriting in Autonomous Agent Systems

TL;DR

MOSS enables source-level self-rewriting in autonomous agents, boosting OpenClaw’s four-task mean grader score from 0.25 to 0.61 in one cycle.

cs.AI Advanced 2026-05-22
Qianshu Cai, Yonggang Zhang, Xianzhang Jia, Wei Xue, Jun Song, Xinmei Tian, Yike Guo
autonomous agents, source code rewriting, self-evolution, production systems, container hot swap

Key Findings

Methodology

This paper introduces MOSS, a system for source-level self-rewriting in production-grade autonomous agent systems. MOSS anchors each evolution iteration to an automatically curated batch of real user failure evidence and runs a deterministic seven-stage pipeline: fault localization, repair planning, plan review, implementation, code review, task evaluation, and verdict. Code editing is performed by a pluggable external coding-agent CLI, while MOSS controls stage sequencing and the final decision. Candidate fixes are verified by replaying the failure batch in isolated, ephemeral trial containers. Once the user approves, the candidate is promoted via in-place container hot swap with health-probe-gated rollback, so the live agent is updated safely.

Key Results

  • On the OpenClaw platform, using four operations/compliance audit tasks from the claweval benchmark dataset, MOSS improved the mean grader score from 0.25 at baseline to 0.61 post-evolution in a single iteration, demonstrating effective autonomous repair of complex multi-tool execution and data integration failures.
  • Source-level rewriting enabled fixes unattainable by prior approaches limited to text-mutable artifacts (skills, prompts, memory schemas, workflows), such as routing errors, hook ordering issues, and concurrency-induced invariant violations in the agent harness core.
  • The multistage iterative pipeline ensured incremental problem diagnosis and solution refinement, avoiding random untargeted mutations, while the pluggable coding agent design supported various LLM-based code generation engines like Claude Code and OpenAI Codex for flexibility.
  • Runtime verification on trial workers running production-equivalent containers ensured candidate correctness and stability, with user-gated deployment using container hot swap and comprehensive health checks preventing regressions on live services.

Significance

This work addresses a longstanding limitation in autonomous agent self-evolution: the confinement to editable text artifacts leaves structural and harness-level failures unresolved. By extending autonomous adaptation to the Turing-complete source code layer, MOSS substantially enhances agents’ ability to self-repair and restructure at a depth unattainable before. This reduces costly human intervention, improves system robustness, and enables continuous autonomous operation in production. MOSS sets a new paradigm combining academic theory of source rewriting with engineering practices to realize safe, dynamic evolution in deployed systems.

Technical Contribution

Technically, MOSS is the first system to implement full source-level self-rewriting in a large-scale production multi-module autonomous agent. It establishes a real-world directed evolution loop anchored on concrete user failure batches rather than synthetic benchmarks, thereby avoiding ineffective random search. The architecture distinctly separates deterministic orchestration from external multi-LLM coding agents for modularity. Runtime validation via ephemeral containerized trial runs verifies fixes within realistic environments. The safe in-place container hot swap design with health probe rollback enables operational deployment while preserving user sessions and state, demonstrating a highly practical evolution pipeline beyond previous minimal academic scaffolds.

Novelty

MOSS expands the scope of self-evolution from text-mutable artifacts alone to source code modifications. According to the authors, source-level changes are Turing-complete, take effect deterministically rather than through base-model compliance, and do not erode under long-context drift. Prior self-evolving systems could not address structural harness faults because they were restricted to the prompt or memory layers. MOSS bridges this gap and is presented as the first system to integrate production-grade source-level rewriting into a complex autonomous agent ecosystem with practical deployment safeguards.

Limitations

  • MOSS depends heavily on the quality and representativeness of automatically collected failure samples; insufficient sampling or noisy user feedback may lead to misdirected repair efforts and reduced efficacy.
  • Source code rewriting inherently carries risks of introducing regressions or unpredictable bugs despite safeguards; further advances in preemptive static and dynamic analysis are needed to improve safety.
  • The current MOSS design targets single-instance containerized agents; its applicability to multi-node distributed autonomous agent systems with shared state synchronization remains to be demonstrated at scale.

Future Work

Future work includes enhancing automatic failure detection and batch quality assessment algorithms for more precise evolution triggers; integrating advanced code analysis and semantic repair suggestion models to improve fix quality; extending MOSS to distributed multi-agent systems supporting coordinated source-level rewriting and state consistency; and refining user interaction flows for smoother, more automated approval cycles. Additionally, incorporating enhanced security policies and multidimensional health monitoring will further safeguard production deployments during autonomous evolution.

AI Executive Summary

Autonomous agent systems deployed at application level—such as OpenClaw—have shown remarkable capabilities automating multi-step tasks across platforms like Slack and Discord. However, once deployed, they remain essentially static, failing to learn from real user interactions or fix recurring failure modes until manual updates arrive. Existing self-evolving agents attempt to address this, but their scope is limited to mutable textual artifacts like skill files, prompt configurations, memory schemas, and workflows. This leaves fundamental structural defects—arising from routing logic, hook sequencing, state invariants, and dispatch mechanisms embedded in source code—unreachable and unfixable via these means.

This paper introduces MOSS, a system that achieves self-evolution through automated source-level rewriting within production-grade autonomous agents. At its core is a deterministic multi-stage pipeline driven by real-world failure evidence, curated automatically from user sessions and flagged conversation turns. Each iteration spans fault localization, repair planning (with iterative plan reviews), implementation, code review, qualitative task evaluation, and a final verdict. The source code rewriting itself is delegated to pluggable external coding-agent CLIs such as Claude Code or OpenAI Codex. For verification, isolated ephemeral trial containers replay the failure batch to check candidate reliability before any live deployment.
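As a rough illustration of how failure evidence might be grouped before an evolution run, the sketch below collects turns that an evaluator labeled "weak" or "missing" (the labels named in the methodology summary) together with user-flagged turns, and groups them per conversation. The `Turn` record, its field names, and `build_failure_batches` are hypothetical names chosen for this example; the paper's actual data model is not given here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Turn:
    """One conversation turn (hypothetical record layout)."""
    conversation_id: str
    turn_index: int
    label: str          # evaluator label, e.g. "ok", "weak", "missing"
    user_flagged: bool  # True if the user flagged this turn


# Labels treated as failure evidence (assumption based on the summary).
FAILURE_LABELS = {"weak", "missing"}


def build_failure_batches(turns):
    """Group failing or flagged turns by conversation, in turn order."""
    batches = defaultdict(list)
    for turn in turns:
        if turn.label in FAILURE_LABELS or turn.user_flagged:
            batches[turn.conversation_id].append(turn)
    return {
        cid: sorted(items, key=lambda t: t.turn_index)
        for cid, items in batches.items()
    }
```

Conversations with no failing or flagged turn produce no batch at all, which keeps the repair target tied to observed failures rather than to the whole transcript.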

On the OpenClaw platform, MOSS is evaluated on four operations and compliance audit tasks from the claweval benchmark. Starting from a baseline mean grader score of 0.25, well below the 0.75 pass threshold, MOSS autonomously performs a single evolution cycle that raises the score to 0.61. The gain indicates progress on multi-tool scheduling, semantic dispatch, and session management failures that text-based evolution methods cannot reach, although the post-evolution mean is still below the pass threshold. The source-level approach relies on the Turing-completeness of code-based adaptation to overcome the limits of edits confined to prompts or skill files.
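The reported numbers can be put in perspective with a few lines of arithmetic. The sketch below uses only the three figures stated above (0.25, 0.61, and the 0.75 pass threshold); per-task scores are not given in this summary, so it works on the means alone. Comparing the mean against the threshold is itself a simplifying assumption, since the threshold may be applied per task. The function name `summarize` is illustrative.

```python
BASELINE_MEAN = 0.25   # mean grader score before evolution
EVOLVED_MEAN = 0.61    # mean grader score after one cycle
PASS_THRESHOLD = 0.75  # pass threshold quoted in the summary


def summarize(baseline, evolved, threshold):
    """Return absolute gain, relative gain, and the gap left to the threshold."""
    absolute_gain = evolved - baseline
    relative_gain = absolute_gain / baseline
    remaining_gap = max(0.0, threshold - evolved)
    return {
        "absolute_gain": round(absolute_gain, 2),
        "relative_gain_pct": round(relative_gain * 100),
        "remaining_gap": round(remaining_gap, 2),
        "passes": evolved >= threshold,
    }


print(summarize(BASELINE_MEAN, EVOLVED_MEAN, PASS_THRESHOLD))
```

On these figures the absolute gain is 0.36, a relative gain of 144 percent, with 0.14 still separating the mean from the pass threshold.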

From an engineering perspective, MOSS deploys a host-resident daemon overseeing container lifecycles, an in-container gateway exposing a CLI interface for evolution control, and ephemeral worker containers for runtime verification. Its use of a plugin abstraction layer for coding agents enables flexible integration with multiple LLM providers. An in-place container hot swap mechanism safely updates the live agent image, coupled with a configurable health-probe rollback process that preserves persistent user states. This architecture balances robustness and flexibility to deliver production-ready autonomous evolution.
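One way to picture the plugin abstraction for coding agents is a small interface that hides which CLI is behind it. The sketch below is an assumption-laden illustration, not MOSS's actual interface: `CodingAgent`, `SubprocessCodingAgent`, and `AgentResult` are hypothetical names, and passing the prompt on standard input is a choice made for this example. No real Claude Code or Codex flags are shown; a Python one-liner stands in for the external CLI so the example runs anywhere.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Protocol, Sequence


@dataclass
class AgentResult:
    returncode: int
    stdout: str


class CodingAgent(Protocol):
    """Anything the orchestrator can hand an editing task to (hypothetical)."""

    def run(self, prompt: str, workdir: str) -> AgentResult: ...


class SubprocessCodingAgent:
    """Wraps a CLI that reads its instructions from standard input."""

    def __init__(self, argv: Sequence[str], timeout_s: float = 600.0):
        self.argv = list(argv)
        self.timeout_s = timeout_s

    def run(self, prompt: str, workdir: str) -> AgentResult:
        proc = subprocess.run(
            self.argv,
            input=prompt,
            cwd=workdir,
            capture_output=True,
            text=True,
            timeout=self.timeout_s,
        )
        return AgentResult(proc.returncode, proc.stdout)


# Stand-in "CLI": echoes its input upper-cased instead of editing code.
echo_agent = SubprocessCodingAgent(
    [sys.executable, "-c", "import sys; print(sys.stdin.read().upper())"]
)
```

Because the orchestrator only depends on `run`, swapping providers would mean constructing a different wrapper rather than changing the pipeline.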

In summary, MOSS elevates autonomous agent systems’ self-adaptation capabilities by pioneering source-level self-rewriting anchored in real production feedback. It addresses critical robustness gaps inherent in prior text-based evolution approaches, radically diminishing the operational maintenance burden and enabling agents to self-improve continuously and reliably. This work paves the way for future distributed system scaling, enhanced analysis methods, and safer autonomous deployment cycles, marking a significant milestone in autonomous agent research and engineering.

Deep Analysis

Background

The field of autonomous agents has witnessed rapid maturation from conceptual demonstrations to production-grade systems capable of handling sophisticated multi-step user tasks across various platforms. Systems like OpenClaw incorporate multi-channel gateways, plugins, hooks, session management, and persistent user state to deliver real-world utility. Concurrently, the academic community has explored self-evolving agents capable of autonomous adaptation post-deployment. Representative works include Hermes Agent, SkillClaw, GenericAgent, EvoAgentX, and others, predominantly focusing on mutable text artifacts such as skills, prompts, memory schemas, and workflows for evolution. Some projects like SICA, Darwin Gödel Machine, and HyperAgents have demonstrated source-level self-rewriting feasibility on minimal experimental scaffolds guided by benchmark scores.


However, these advances remain isolated from production contexts characterized by large, intricate codebases embedding routing logic, hook ordering, state invariants, and dispatch mechanisms that are immutable through textual edits. This separation creates a gap wherein structural classes of failures remain inaccessible to prevailing self-evolution techniques, exacerbated as systems grow in complexity. Therefore, a more comprehensive approach targeting the source code layer promises to unify and generalize the adaptation medium, enabling deterministic, long-term stable behavioral modifications that text-based approaches cannot reliably achieve.

Core Problem

The core problem addressed in this paper is the inability of existing autonomous agent self-evolution approaches to rectify structural failures rooted in the agent harness's source code, as they restrict fixes to text-mutable artifacts. This limitation results in recurring failures such as misrouted messages, hook execution order errors, corrupted session state, and improper atomicity across concurrent skills that evade prompt or skill-based rewrites. Additionally, randomness and undirected mutations employed in many exploration-driven models are impractical in production environments due to the scale and complexity of codebases, lack of global fitness signals, real-time user expectations, and impossibility of harmless failures during live operation. These constraints necessitate a directed, evidence-anchored, deterministic evolution methodology that can operate safely and effectively within production-grade systems.

Innovation

MOSS introduces several unique innovations to overcome these challenges:


1. Directed Evolution Anchored to Real Failure Evidence: Unlike exploratory approaches driven by benchmarks, MOSS constructs fix targets from automatically curated user session failures and user flags, enabling precise and meaningful repair objectives.


2. Deterministic Multi-Stage Pipeline: The evolution process is structured into Locate, Plan, Plan-Review, Implement, Code-Review, Task-Evaluate, and Verdict stages, promoting rigorous stepwise diagnosis, solution design, coding, review, testing, and decision, improving fix quality and reducing regression risk.


3. Pluggable External Coding Agents: Delegating code editing to pluggable CLI interfaces powered by diverse large language models decouples orchestration from code generation, fostering modularity and adaptability according to resource availability and provider capabilities.


4. Ephemeral Production-Equivalent Trial Containers: Runtime verification through isolated short-lived containers replicates real application deployment, ensuring fixes are validated under realistic operational conditions beyond static code or benchmark testing.


5. In-Place Container Hot Swap with Health-Probe-Gated Rollback: This deployment paradigm enables seamless, state-preserving upgrades with automatic failback on health probe failure, balancing innovation with production safety.
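The hot-swap idea in item 5 can be sketched as a small control loop: switch to the candidate image, watch a health probe for a fixed window, and revert on the first failed probe. The code below is a minimal simulation under stated assumptions. `hot_swap`, `swap_to`, and `probe` are hypothetical names, the container operations are injected callables rather than real container-runtime calls, and the 5-second probe interval is invented for illustration. The 90-second window is the figure this summary gives for the rolling health probe.

```python
import time
from typing import Callable


def hot_swap(
    swap_to: Callable[[str], None],   # switches the live container to an image
    probe: Callable[[], bool],        # returns True while the agent is healthy
    old_image: str,
    new_image: str,
    window_s: float = 90.0,           # probe window from the summary
    interval_s: float = 5.0,          # assumed probe interval
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Swap to new_image; roll back to old_image if any probe fails."""
    swap_to(new_image)
    deadline = clock() + window_s
    while clock() < deadline:
        if not probe():
            swap_to(old_image)        # health check failed: revert
            return "rolled_back"
        sleep(interval_s)
    return "committed"                # window elapsed with no failed probe
```

Injecting `sleep` and `clock` lets the loop be exercised with a fake clock, so the rollback path can be checked without waiting out the real window.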

Methodology

MOSS methodology unfolds through the following components and steps:


  • Failure Evidence Collection: Periodic background scans extract dialogue segments labeled weak or missing by task-specific evaluation, supplemented by user-flagged conversational turns, and aggregate them into per-conversation failure batches.

  • Evolution Triggering and Batch Management: Users or automated schedulers issue a start command that selects the latest sealed batch to initiate evolution.

  • Iterative Multi-Stage Pipeline:
      • Locate: Analyze failure transcripts to pinpoint source-level defects.
      • Plan: Propose targeted repair strategies.
      • Plan-Review: Validate and refine repair plans iteratively.
      • Implement: Apply code changes via the external coding-agent CLI.
      • Code-Review: Review and refine code modifications in cycles.
      • Task-Evaluate: Score outputs qualitatively against fixed keypoint criteria.
      • Verdict: Decide whether the fix has converged or further iterations are needed.

  • External Coding-Agent Integration: The coding-agent CLI is invoked as a host subprocess per stage, using LLM-based code editing tools such as Claude Code or OpenAI Codex.

  • Runtime Verification: A host-resident daemon spins up multiple ephemeral trial worker containers from the candidate image, which autonomously replay the batch tasks to detect runtime faults.

  • Hot Swap Deployment: Upon user consent, the daemon atomically replaces the container using the candidate image, runs a 90-second rolling health probe (heartbeat freshness, container checks, and CLI checks), and commits or rolls back accordingly. Persistent user state is maintained across swaps via host-mounted volumes.
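The stage sequencing described in this list can be sketched as a small state machine. The version below is a hedged illustration of one pass through the seven stages, assuming that a rejecting review stage sends control back to the stage it reviews (Plan-Review to Plan, Code-Review to Implement). The handler signature, `RETRY_TARGET`, and the step budget are hypothetical; how MOSS actually loops, and how the Verdict stage triggers further iterations, is not specified in this summary.

```python
from enum import Enum
from typing import Callable, Dict, List, Tuple


class Stage(Enum):
    LOCATE = "locate"
    PLAN = "plan"
    PLAN_REVIEW = "plan-review"
    IMPLEMENT = "implement"
    CODE_REVIEW = "code-review"
    TASK_EVALUATE = "task-evaluate"
    VERDICT = "verdict"


ORDER: List[Stage] = list(Stage)  # fixed, deterministic stage order

# Assumption: a rejecting review returns control to the stage it reviews.
RETRY_TARGET = {
    Stage.PLAN_REVIEW: Stage.PLAN,
    Stage.CODE_REVIEW: Stage.IMPLEMENT,
}


def run_pipeline(
    handlers: Dict[Stage, Callable[[dict], bool]],
    max_steps: int = 50,
) -> Tuple[List[Stage], dict]:
    """Run one pass; each handler returns True to approve and move on."""
    ctx: dict = {}            # shared scratch space passed between stages
    trace: List[Stage] = []   # stages in the order they actually ran
    i = 0
    while i < len(ORDER):
        if len(trace) >= max_steps:
            raise RuntimeError("step budget exhausted")
        stage = ORDER[i]
        trace.append(stage)
        approved = handlers[stage](ctx)
        if not approved and stage in RETRY_TARGET:
            i = ORDER.index(RETRY_TARGET[stage])  # loop back for rework
        else:
            i += 1
    return trace, ctx
```

The step budget bounds the review loops, so a reviewer that never approves ends the run with an error instead of cycling forever.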

Experiments

The evaluation employs OpenClaw, a production-grade autonomous agent platform with a complex multi-plugin architecture and persistent user state. For controlled assessment, the authors use four claweval compliance audit tasks (T141zh/T142 and T137zh/T138), covering SLA compliance and restock-chain checking in Chinese and English, respectively. With the DeepSeek V3.2 model, baseline scores average 0.25 on the claweval grader metric, reflecting recurring partial and incorrect outputs. MOSS executes a single evolution iteration covering the full pipeline and the validation phases. The code changes target gaps in harness-level dispatch and in the synthesis pipeline's annotations, addressing failures in multi-entity payload handling.


Multi-round planning and code review loops improve repair scope and depth. Runtime verification ensures fixes are stable under concurrent multi-task execution in isolated containers. The candidate image is then hot-swapped live, preserving user sessions and credentials. Experimental rigor includes controlled pre/post scoring on identical tasks with the claweval benchmark as a reproducible fitness proxy.

Results

MOSS lifts the average grader score on the strict four-task batch from a baseline 0.25 to 0.61 after a single evolution cycle, demonstrating substantial autonomous performance gains. Diagnosed issues include generic execution path biases over semantic tooling and a lack of annotation branches for multi-entity response handling in the harness. Introduced fixes comprise an added annotation branch with explicit usage hints and a pre-call deny gate to block improper syntheses, effectively closing coverage gaps. Iterative planning and code reviews reduce overfitting and regressions, while trial workers' real-world environment validation catches runtime-level concurrency anomalies. This outcome confirms the efficacy of source-level rewriting in production realities, surpassing prior text-layer-limited techniques.

Applications

MOSS is directly applicable to complex, multi-skill, multi-modal autonomous agent platforms requiring deep module coordination and state lifecycle management. Use cases encompass automated operational assistants, compliance auditing bots, and multi-service workflow automation systems. By extending autonomous adaptation to the source code, agents can self-repair intricate failure modes, drastically reducing human maintenance overhead and improving uptime. Prerequisites include containerized deployment environments, source code accessibility, integration with coding-agent LLM services, and infrastructure supporting container orchestration and health monitoring.

Limitations & Outlook

MOSS’s dependence on high-fidelity failure batch construction makes it vulnerable to incomplete sampling or noisy user flags, potentially misguiding repair focus and limiting effectiveness. The complexity of source-level rewriting poses inherent risks of introducing new defects despite protective mechanisms like multi-stage review and rollback; stronger predictive verifications are needed. Its current architectural design caters primarily to single-instance monolithic containerized agents, lacking support for distributed multi-node autonomous agent systems where state synchronization and coordinated upgrades present additional challenges.

Plain Language (accessible to non-experts)

Imagine a factory with a robot that makes cakes. This robot follows a recipe written on sticky notes attached to its control panel. If the robot starts making mistakes—mixing ingredients in the wrong order or forgetting a step—you can rewrite the sticky notes to fix the problem. But what if the real issue is how the robot’s internal wiring controls the sequence of actions? No matter how many sticky notes you change, the wiring remains the same, and mistakes persist.

MOSS is like sending a smart engineer into the robot’s control system to directly rewire its circuits, not just adjust the recipe notes. It first gathers clues about where the robot failed during cake-making (like burnt edges or missing sugar), then carefully plans how to fix the wiring. Instead of guessing randomly, it takes a step-by-step approach: find the fault, design a repair, write new wiring instructions, double-check them, and test by baking again in a safe mock kitchen.

Only if all tests pass does this new wiring replace the old one in the real robot, and if something goes wrong, the engineer can quickly revert to the previous setup. This way, the robot continuously learns and improves from its mistakes, fixing problems deep inside its system rather than just the surface notes.

In simple terms, MOSS helps machines not only learn new instructions but rewrite their inner workings to get better over time, ensuring that what causes errors gets truly fixed.

ELI14 (explained like you're 14)

Hey there! Imagine you have a super cool robot buddy that helps you with chores. But sometimes, it messes up—for example, putting your toys in the fridge or starting to eat before washing hands! How would you teach it better? Telling it “do this” doesn’t always work because it only changes the to-do list, not *how* the robot thinks.

This paper talks about MOSS, a magic helper that goes inside the robot's brain and rewrites its rules! It watches when the robot messes up, figures out exactly what’s wrong, then writes new rules to fix those mistakes. But wait, it doesn't just rush; it first tests the new rules in a pretend robot to make sure it won’t make things worse.

Once everything looks awesome, MOSS swaps the robot’s brain with the improved one—all without stopping the robot or losing your stuff! So your robot gets smarter all by itself, fixing problems way deeper than just changing lists.

Pretty neat, right? It’s like giving your robot the power to learn and upgrade all by itself, making life easier and more fun!

Glossary

Autonomous Agent

An AI entity capable of sensing its environment, making decisions, and completing tasks independently.

OpenClaw functions as a production-grade autonomous agent performing multi-step user workflows.

Source-Level Rewriting

Modifying the agent's source code directly to alter its behavior and structure fundamentally.

MOSS performs self-evolution by rewriting the agent's own source code rather than just textual configurations.

Text-Mutable Artifacts

Editable configurations such as prompts, skill files, memory schemas, or workflow graphs that influence agent behavior without changing code.

Prior self-evolution approaches limited modifications to these artifacts, insufficient for structural fixes.

Coding-Agent CLI

A command-line interface for an external code editing agent powered by large language models, facilitating code modifications.

MOSS integrates multiple coding-agent CLIs like Claude Code and OpenAI Codex for autonomous code rewriting.

Trial Worker

A short-lived isolated container that executes the candidate agent on failure batches to verify patch correctness before deployment.

Used by MOSS to validate fixes within a production-equivalent environment.

Container Hot Swap

Replacing a running container with a new version without downtime, preserving user sessions and states.

MOSS deploys evolved agents through in-place container hot swaps guarded by health probes.

Task-Evaluate Stage

A pipeline stage that scores agent outputs qualitatively against failure evidence keypoints to assess fix effectiveness.

Drives feedback loops and verdict decisions during MOSS’s evolutionary iterations.

Agent Harness

Core agent infrastructure including routing, hook ordering, state management, and dispatch logic.

MOSS targets the harness source code to fix failures unreachable through text artifact edits.

Open Questions (unanswered questions from this research)

  1. Current automatic failure batch collection is limited by incomplete or noisy data, risking inaccurate repair focus; richer, more reliable evidence integration is required.
  2. Predicting and preventing new bugs introduced by autonomous source-level rewriting remains challenging; advanced static and dynamic code analysis techniques must be developed.
  3. Extending MOSS to multi-node distributed autonomous agents requires solutions for distributed state synchronization and coordinated source modifications.
  4. Streamlining user authorization and intervention processes is necessary to make autonomous evolution seamless while maintaining safety guarantees.
  5. Reliability and efficiency of language-model-driven coding agents in complex, large-scale codebases warrant further study and optimization.

Applications

Immediate Applications

Automated Operational Assistant Self-Repair

Deployable on enterprise IT operation bots to autonomously identify and fix failures originating from complex multi-module code, reducing manual maintenance.

Compliance Audit Bot Evolution

Enables auditing agents to autonomously update compliance logic and multi-source data integration rules in response to real-world regulatory feedback, improving accuracy.

Workflow Automation Platform Stability

Supports complex multi-tool task execution platforms to self-correct dispatch and state management errors, enhancing reliability and reducing downtime.

Long-term Vision

Autonomous Agent Operating Systems

Foster development of self-managing AI operating environments capable of continuous self-improvement, learning, and adaptation via source-level evolution.

Distributed Multi-Agent Cooperative Evolution

Enable large-scale autonomous agent clusters to coordinate source-level updates and state consistency across nodes, advancing scalable AI ecosystems.

Abstract

Autonomous agentic systems are largely static after deployment: they do not learn from user interactions, and recurring failures persist until the next human-driven update ships a fix. Self-evolving agents have emerged in response, but all confine evolution to text-mutable artifacts -- skill files, prompt configurations, memory schemas, workflow graphs -- and leave the agent harness untouched. Since routing, hook ordering, state invariants, and dispatch live in code rather than in any text artifact, an entire class of structural failure is physically unreachable from the text layer. We argue that source-level adaptation is a fundamentally more general medium: it is Turing-complete, a strict superset of every text-mutable scope, takes effect deterministically rather than through base-model compliance, and does not erode under long-context drift. We present MOSS, a system that performs self-rewriting at the source level on production agentic substrates. Each evolution is anchored to an automatically curated batch of production-failure evidence and proceeds through a deterministic multi-stage pipeline; code modification is delegated to a pluggable external coding-agent CLI while MOSS retains stage ordering and verdicts. Candidates are verified by replaying the batch against the candidate image in ephemeral trial workers, then promoted via user-consent-gated, in-place container swap with health-probe-gated rollback. On OpenClaw, MOSS lifts a four-task mean grader score from 0.25 to 0.61 in a single cycle without human intervention.

cs.AI cs.LG

References (20)

Meta-Harness: End-to-End Optimization of Model Harnesses. Yoonho Lee, Roshen Nair, Qizheng Zhang et al. 2026.

MetaGPT: Meta Programming for Multi-Agent Collaborative Framework. Sirui Hong, Xiawu Zheng, Jonathan P. Chen et al. 2023.

GenericAgent: A Token-Efficient Self-Evolving LLM Agent via Contextual Information Density Maximization (V1.0). Jiaqing Liang, Jinyi Han, Weijia Li et al. 2026.

GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning. Lakshya A. Agrawal, Shangyin Tan, Dilara Soylu et al. 2025.

The Claude 3 Model Family: Opus, Sonnet, Haiku.

A Self-Improving Coding Agent. Maxime Robeyns, M. Szummer, Laurence Aitchison. 2025.

Automated Design of Agentic Systems. Shengran Hu, Cong Lu, Jeff Clune. 2024.

EvoAgentX: An Automated Framework for Evolving Agentic Workflows. Yingxu Wang, Siwei Liu, Jinyuan Fang et al. 2025.

AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation. Qingyun Wu, Gagan Bansal, Jieyu Zhang et al. 2023.

ChatDev: Communicative Agents for Software Development. Cheng Qian, Wei Liu, Hongzhang Liu et al. 2023.

Large Language Model based Multi-Agents: A Survey of Progress and Challenges. Taicheng Guo, Xiuying Chen, Yaqi Wang et al. 2024.

CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society. G. Li, Hasan Hammoud, Hani Itani et al. 2023.

From Procedural Skills to Strategy Genes: Towards Experience-Driven Test-Time Evolution. Junjie Wang, Yiming Ren, Haoyang Zhang. 2026.

ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs. Yujia Qin, Shi Liang, Yining Ye et al. 2023.

Darwin Godel Machine: Open-Ended Evolution of Self-Improving Agents. Jenny Zhang, Shengran Hu, Cong Lu et al. 2025.

Toolformer: Language Models Can Teach Themselves to Use Tools. Timo Schick, Jane Dwivedi-Yu, Roberto Dessì et al. 2023.

SkillClaw: Let Skills Evolve Collectively with Agentic Evolver. Ziyu Ma, Shidong Yang, Yuxiang Ji et al. 2026.

Claw-Eval: Towards Trustworthy Evaluation of Autonomous Agents. Bowen Ye, Rang Li, Qibin Yang et al. 2026.

A survey on large language model based autonomous agents. Lei Wang, Chengbang Ma, Xueyang Feng et al. 2023.

Hyperagents. Jenny Zhang, Bingchen Zhao, Wannan Yang et al. 2026.