Internalizing Agency from Reflective Experience

TL;DR

The LEAFE framework internalizes recovery agency from reflective experience, improving Pass@k performance on long-horizon tasks.

cs.AI · 2026-03-18
Rui Ge, Yichao Fu, Yuyang Qian, Junda Su, Yiming Zhao, Peng Zhao, Hao Zhang
Large Language Models, Autonomous Agents, Feedback Learning, Reflective Experience, Long-Horizon Tasks

Key Findings

Methodology

The LEAFE (Learning Feedback-Grounded Agency from Reflective Experience) framework internalizes recovery agency from reflective experience. During exploration, the agent summarizes environment feedback into actionable experience, backtracks to earlier decision points, and explores alternative branches with revised actions. These experience-guided corrections are then distilled into the model through supervised fine-tuning, enabling the policy to recover more effectively in future interactions.
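The exploration loop above can be made concrete with a toy sketch. Everything below (the guess-the-code environment, the per-decision-point policy state, the feedback hints) is hypothetical scaffolding invented for illustration, not the paper's implementation; it only shows the shape of summarize feedback → backtrack → revise action → collect the corrected trajectory for SFT.

```python
# Illustrative LEAFE-style loop on a toy environment (NOT the paper's code).
# The environment and policy below are invented for this sketch.

class ToyEnv:
    """Guess a hidden digit sequence; wrong guesses return corrective feedback."""

    def __init__(self, target):
        self.target = target
        self.pos = 0  # current decision point

    def step(self, digit):
        """Return (success, feedback). On failure the episode does not advance,
        which models backtracking to the same decision point."""
        if digit == self.target[self.pos]:
            self.pos += 1
            return True, None
        hint = "higher" if digit < self.target[self.pos] else "lower"
        return False, (self.pos, hint)

    @property
    def done(self):
        return self.pos == len(self.target)


def leafe_episode(env, max_steps=100):
    """Explore, summarize feedback into experience, revise, and collect SFT data."""
    guesses = {}       # per-decision-point action, revised after each failure
    experience = []    # actionable summaries distilled from environment feedback
    sft_examples = []  # (decision point, corrected action) pairs kept for SFT
    steps = 0
    while not env.done and steps < max_steps:
        pos = env.pos
        action = guesses.get(pos, 5)  # default first attempt
        ok, feedback = env.step(action)
        steps += 1
        if ok:
            sft_examples.append((pos, action))  # the corrected branch succeeded
        else:
            # Summarize the feedback and explore an alternative branch
            # from the same decision point with a revised action.
            _, hint = feedback
            experience.append(feedback)
            guesses[pos] = action + 1 if hint == "higher" else action - 1
    return env.done, sft_examples, experience
```

Running `leafe_episode(ToyEnv([7, 3]))` succeeds after a few revised branches and yields the corrected decision-point/action pairs that, in the real framework, would be distilled into the model through supervised fine-tuning.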

Key Results

  • LEAFE consistently improves Pass@1 over the base model and achieves higher Pass@k than outcome-driven baselines (e.g., GRPO) and experience-based methods (e.g., Early Experience), with gains of up to 14% on Pass@128 across a diverse set of interactive coding and agentic tasks under fixed interaction budgets.
  • In the WebShop task, LEAFE achieves a higher Pass@128 on the Qwen2.5-7B model, even though GRPO performs better on Pass@1.
  • On CodeContests, LEAFE improves Pass@128 by up to 47.88%, highlighting the advantage of internalizing feedback-grounded agency in domains requiring iterative correction.

Significance

The LEAFE framework significantly enhances the performance of large language models in long-horizon tasks by internalizing feedback-guided recovery capabilities. This approach not only increases the success rate but also reduces reliance on test-time resampling, decreasing both deployment complexity and latency. By turning environment feedback into actionable supervision, LEAFE offers a new perspective on the autonomous agency of large language models, advancing their application in complex tasks.

Technical Contribution

LEAFE contrasts sharply with existing outcome-driven methods (e.g., GRPO) by internalizing feedback-guided recovery capabilities. It not only focuses on reinforcing successful trajectories but also extends the model's behavioral coverage by identifying critical decision points and making feedback-conditioned corrections. This method provides richer supervision signals for large language models, enhancing their performance in long-horizon interactions.

Novelty

LEAFE is the first framework to internalize recovery agency from reflective experience. Unlike traditional outcome-driven methods, it emphasizes correcting failed trajectories by turning environment feedback into actionable experience, offering an innovative approach to expanding the model's exploration capabilities.

Limitations

  • LEAFE may still face challenges in handling extremely complex tasks, especially when feedback signals are unclear or inconsistent.
  • The computational overhead of LEAFE could be significant due to the need for reflective rollback and correction, particularly in large-scale applications.
  • On certain tasks, LEAFE's performance might not surpass that of specially optimized outcome-driven methods.

Future Work

Future research directions include optimizing the computational efficiency of the LEAFE framework, exploring its application in more complex tasks, and integrating other learning strategies (e.g., meta-learning) to further enhance the autonomous agency of models.

AI Executive Summary

As large language models (LLMs) evolve, they are increasingly deployed as autonomous agents that must interact over long horizons with environments providing rich feedback. However, existing outcome-driven post-training methods (e.g., RLVR) primarily optimize final success signals, leaving rich environment feedback underutilized. This results in policies that reproduce a narrow set of already-successful behaviors, failing to improve feedback-grounded agency.

To address this, the paper proposes the LEAFE (Learning Feedback-Grounded Agency from Reflective Experience) framework. This framework internalizes recovery agency from reflective experience. During exploration, the agent summarizes environment feedback into actionable experience, backtracks to earlier decision points, and explores alternative branches with revised actions. These experience-guided corrections are then distilled into the model through supervised fine-tuning, enabling the policy to recover more effectively in future interactions.

LEAFE demonstrates outstanding performance across a diverse set of interactive coding and agentic tasks. Under fixed interaction budgets, LEAFE consistently improves Pass@1 over the base model and achieves higher Pass@k than outcome-driven baselines (e.g., GRPO) and experience-based methods (e.g., Early Experience), with gains of up to 14% on Pass@128.

The core technical principle of this framework is transforming environment feedback into actionable supervision, reducing reliance on test-time resampling, and decreasing deployment complexity and latency. By internalizing feedback-guided recovery capabilities, LEAFE offers a new perspective on the autonomous agency of large language models, advancing their application in complex tasks.

However, LEAFE may still face challenges in handling extremely complex tasks, especially when feedback signals are unclear or inconsistent. Additionally, the computational overhead of LEAFE could be significant due to the need for reflective rollback and correction, particularly in large-scale applications. Future research directions include optimizing the computational efficiency of the LEAFE framework, exploring its application in more complex tasks, and integrating other learning strategies (e.g., meta-learning) to further enhance the autonomous agency of models.

Deep Analysis

Background

Large language models (LLMs) have made significant strides in natural language processing, particularly in generation and comprehension tasks. However, as application scenarios become more complex, LLMs are increasingly deployed as autonomous agents that must interact over long horizons with environments providing rich feedback. In this context, traditional outcome-driven post-training methods (e.g., RLVR) primarily optimize final success signals, failing to fully utilize the rich feedback provided by the environment. This results in policies that reproduce a narrow set of already-successful behaviors, failing to improve feedback-grounded agency. Existing research indicates that environment feedback not only contains simple failure signals but also provides structured information on why a trajectory is unproductive and how it can be corrected. Thus, effectively utilizing this feedback to enhance decision-making capabilities becomes a crucial research question.

Core Problem

In long-horizon interaction tasks, models need robust recovery capabilities to adjust their strategies effectively when errors occur. However, existing outcome-driven methods (e.g., GRPO) typically focus only on reinforcing successful trajectories, neglecting the analysis and correction of failed ones. This limits performance on long-horizon tasks, especially those requiring multiple attempts and corrections. The core problem is how to enhance a model's recovery and exploration capabilities without increasing deployment complexity and latency.

Innovation

The LEAFE framework addresses the above issues through the following innovations:

1) Reflective Experience Internalization: By summarizing environment feedback into actionable experience, the agent can identify critical decision points and make feedback-conditioned corrections. This innovation enables the model to recover more effectively in future interactions.

2) Supervised Fine-Tuning: Experience-guided corrections are distilled into the model through supervised fine-tuning, enabling the policy to recover more effectively in future interactions. This process not only increases the success rate but also reduces reliance on test-time resampling.

3) Expanded Behavioral Coverage: By identifying critical decision points and making feedback-conditioned corrections, LEAFE extends the model's behavioral coverage, enhancing performance in long-horizon interactions.
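Step (2) is ordinary supervised fine-tuning on the corrected trajectories; in practice the token-level loss is often masked so that only the agent's revised actions are supervised, not the feedback and context tokens they were conditioned on. A minimal NumPy sketch of such a masked cross-entropy (the masking scheme is our assumption for illustration, not a detail taken from the paper):

```python
import numpy as np

def masked_sft_loss(logits, targets, mask):
    """Masked token-level cross-entropy for SFT on corrected trajectories.

    logits:  (seq_len, vocab) unnormalized scores
    targets: (seq_len,) target token ids
    mask:    (seq_len,) 1.0 where the token is a revised action to supervise,
             0.0 for context/feedback tokens that should carry no loss.
    (Illustrative sketch; real training would use a framework's CE loss.)
    """
    # Numerically stable log-softmax over the vocabulary dimension.
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    # Negative log-likelihood of each target token.
    nll = -log_probs[np.arange(len(targets)), targets]
    # Average only over supervised (correction) positions.
    return (nll * mask).sum() / mask.sum()
```

Because masked positions contribute nothing, the gradient pushes the policy only toward reproducing the experience-guided corrections themselves.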

Methodology

The specific methodology of the LEAFE framework is as follows:

  • Reflective Experience Internalization: During exploration, the agent summarizes environment feedback into actionable experience, backtracks to earlier decision points, and explores alternative branches with revised actions.
  • Supervised Fine-Tuning: These experience-guided corrections are distilled into the model through supervised fine-tuning, enabling the policy to recover more effectively in future interactions.
  • Expanded Behavioral Coverage: By identifying critical decision points and making feedback-conditioned corrections, the model's behavioral coverage is extended.
  • Experimental Evaluation: Experiments are conducted across a diverse set of interactive coding and agentic tasks to evaluate LEAFE's performance under fixed interaction budgets.

Experiments

The experimental design includes the following aspects:

  • Datasets: CodeContests, WebShop, ALFWorld, ScienceWorld, and Sokoban, covering a range of tasks from programming to multi-step interactive reasoning.
  • Baselines: GRPO and Early Experience are selected as baseline methods for comparative experiments.
  • Evaluation Metrics: Pass@1 and Pass@128 serve as the main metrics, measuring single-try success rate and performance under larger sampling budgets, respectively.
  • Hyperparameters: Hyperparameters are tuned to optimize model performance, and ablation studies verify the contribution of each component.
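The Pass@k numbers reported in this kind of evaluation are conventionally computed with the standard unbiased estimator introduced for HumanEval (Chen et al., 2021) rather than by literally drawing k samples. For reference (this is the generic estimator, not code from the paper):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator: probability that at least one of k samples,
    drawn without replacement from n generations of which c are correct,
    solves the task.  Equals 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With k = 1 this reduces to the plain success rate c / n, while large k (e.g. k = 128) rewards any correct behavior the policy can reach at all, which is why broader behavioral coverage shows up as Pass@128 gains.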

Results

Experimental results show that LEAFE performs excellently across diverse tasks:

  • On CodeContests, LEAFE improves Pass@128 by up to 47.88%, highlighting the advantage of internalizing feedback-grounded agency in domains requiring iterative correction.
  • In the WebShop task, LEAFE achieves a higher Pass@128 on the Qwen2.5-7B model, even though GRPO performs better on Pass@1.
  • Under fixed interaction budgets, LEAFE consistently improves Pass@1 over the base model and achieves higher Pass@k than outcome-driven baselines (e.g., GRPO) and experience-based methods (e.g., Early Experience), with gains of up to 14% on Pass@128.

Applications

Application scenarios for the LEAFE framework include:

  • Programming Task Optimization: LEAFE can be used to optimize code generation in programming tasks, improving code accuracy and efficiency through internalized feedback, applicable to automated programming tools.
  • Autonomous Agents in Complex Environments: In complex environments requiring long-horizon interaction and error recovery, LEAFE can significantly enhance the autonomous agency of models, applicable to fields such as robotics and autonomous driving.
  • Multi-Step Interactive Reasoning: LEAFE can be used in multi-step interactive reasoning tasks, improving task completion rates by identifying and correcting critical decision points, applicable to intelligent assistants and dialogue systems.

Limitations & Outlook

Despite LEAFE's excellent performance across multiple tasks, there are still some limitations:

  • Computational Overhead: The computational overhead of LEAFE could be significant due to the need for reflective rollback and correction, particularly in large-scale applications.
  • Unclear Feedback Signals: LEAFE's performance may degrade when feedback signals are unclear or inconsistent.
  • Specific Task Performance: On certain tasks, LEAFE's performance might not surpass that of specially optimized outcome-driven methods.

Future research directions include optimizing the computational efficiency of the LEAFE framework, exploring its application in more complex tasks, and integrating other learning strategies (e.g., meta-learning) to further enhance the autonomous agency of models.

Plain Language (accessible to non-experts)

Imagine you're navigating a maze, with markings on the walls indicating which paths are dead ends and which lead to the exit. LEAFE is like a smart assistant that not only remembers which paths are dead ends but also tells you how to avoid them and find better routes. Traditional methods are like an assistant that only focuses on whether you successfully exit the maze, remembering only the successful paths without telling you how to improve failed attempts. LEAFE helps you make better decisions in your next attempt by summarizing the experience of each attempt. This way, you not only exit the maze faster but also learn more from each attempt. This approach not only increases success rates but also reduces the number of times you get lost in the maze.

ELI14 (explained like you're 14)

Imagine you're playing a super complex video game with many levels, each with different challenges. Traditional methods are like a coach who only cares if you beat the level, remembering only the successful paths without telling you how to improve failed attempts. LEAFE is like a super smart game assistant that not only remembers why you failed each time but also tells you how to improve your strategy, making it easier to beat the level on your next try. This way, you not only finish the game faster but also learn more skills with each attempt. Isn't that cool?

Glossary

Large Language Model

A deep learning-based model capable of understanding and generating natural language text, widely used in natural language processing tasks.

In this paper, large language models are used as autonomous agents that must interact over long horizons with environments providing rich feedback.

Autonomous Agent

A system capable of making decisions and taking actions independently, typically executing tasks in complex environments.

In this paper, large language models are viewed as autonomous agents that must make decisions and recover in long-horizon tasks.

Feedback Learning

A learning method that uses feedback information from the environment to improve model decisions.

The LEAFE framework internalizes recovery capabilities through feedback learning, enhancing model performance in long-horizon tasks.

Reflective Experience

A method of improving future decisions by summarizing past experiences and feedback.

The LEAFE framework internalizes recovery capabilities from reflective experience, enabling more effective recovery in future interactions.

Long-Horizon Task

A complex task requiring decision-making and action over multiple steps.

Tasks such as WebShop and ALFWorld in this paper are long-horizon tasks requiring robust recovery capabilities.

Outcome-Driven Method

A learning method that primarily focuses on final success signals, often neglecting feedback information during the process.

Traditional outcome-driven methods like GRPO primarily optimize final success signals without fully utilizing environment feedback.

Supervised Fine-Tuning

A method of improving model performance by fine-tuning with supervision signals.

The LEAFE framework internalizes experience-guided corrections into the model through supervised fine-tuning.

Behavioral Coverage

The diversity and breadth of behaviors a model can execute in a task.

LEAFE extends the model's behavioral coverage by identifying critical decision points and making feedback-conditioned corrections.

Pass@k

A metric evaluating the model's success in at least one of k attempts, reflecting the model's exploration ability and success rate.

The paper uses Pass@1 and Pass@128 as the main evaluation metrics, measuring single-try success rate and performance under larger sampling budgets, respectively.

GRPO

An outcome-driven reinforcement learning method that primarily optimizes the probability of successful trajectories.

GRPO is used as one of the baseline methods in this paper for comparative experiments.
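To make the contrast with this baseline concrete: GRPO scores each sampled trajectory against the other samples in its group, removing the need for a learned value function. A simplified sketch of the group-relative advantage (the published GRPO objective also adds importance-ratio clipping and a KL penalty, omitted here):

```python
import statistics

def grpo_advantages(group_rewards, eps=1e-8):
    """Group-relative advantages: normalize each sampled trajectory's reward
    by its group's mean and standard deviation (simplified illustration)."""
    mean = statistics.fmean(group_rewards)
    std = statistics.pstdev(group_rewards)
    return [(r - mean) / (std + eps) for r in group_rewards]
```

Because only the final reward enters this signal, a trajectory that fails for an informative reason receives the same negative advantage as one that fails uselessly; that discarded feedback is exactly what LEAFE turns into supervision.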

Open Questions (unanswered questions from this research)

  1. How can the efficiency and performance of the LEAFE framework be further improved without increasing computational overhead? Current methods may face computational resource limitations when handling complex tasks, necessitating exploration of more efficient algorithms in the future.
  2. How can LEAFE's robustness be improved when feedback signals are unclear or inconsistent? Current methods rely on clear feedback signals, and future research needs to explore how to make effective decisions in uncertain environments.
  3. Can LEAFE's performance on specific tasks surpass that of specially optimized outcome-driven methods? Further experimental validation and theoretical analysis are needed.
  4. How can LEAFE be combined with other learning strategies (e.g., meta-learning) to further enhance the autonomous agency of models? This may require new algorithm designs and experimental validation.
  5. How does LEAFE perform in multi-task learning environments? Research is needed on its transferability and adaptability across different tasks.

Applications

Immediate Applications

Programming Task Optimization

LEAFE can be used to optimize code generation in programming tasks, improving code accuracy and efficiency through internalized feedback, applicable to automated programming tools.

Autonomous Agents in Complex Environments

In complex environments requiring long-horizon interaction and error recovery, LEAFE can significantly enhance the autonomous agency of models, applicable to fields such as robotics and autonomous driving.

Multi-Step Interactive Reasoning

LEAFE can be used in multi-step interactive reasoning tasks, improving task completion rates by identifying and correcting critical decision points, applicable to intelligent assistants and dialogue systems.

Long-term Vision

General Artificial Intelligence

By continuously optimizing and expanding the LEAFE framework, it may be possible to achieve more powerful general artificial intelligence systems in the future, capable of autonomous decision-making and learning in complex environments.

Cross-Domain Applications

The technology behind LEAFE can be extended to more domains, such as medical diagnosis and financial analysis, improving decision accuracy and robustness through internalized feedback.

Abstract

Large language models are increasingly deployed as autonomous agents that must plan, act, and recover from mistakes through long-horizon interaction with environments that provide rich feedback. However, prevailing outcome-driven post-training methods (e.g., RL with verifiable rewards) primarily optimize final success signals, leaving rich environment feedback underutilized. Consequently, they often lead to distribution sharpening: the policy becomes better at reproducing a narrow set of already-successful behaviors, while failing to improve the feedback-grounded agency needed to expand problem-solving capacity (e.g., Pass@k) in long-horizon settings. To address this, we propose LEAFE (Learning Feedback-Grounded Agency from Reflective Experience), a framework that internalizes recovery agency from reflective experience. Specifically, during exploration, the agent summarizes environment feedback into actionable experience, backtracks to earlier decision points, and explores alternative branches with revised actions. We then distill these experience-guided corrections into the model through supervised fine-tuning, enabling the policy to recover more effectively in future interactions. Across a diverse set of interactive coding and agentic tasks under fixed interaction budgets, LEAFE consistently improves Pass@1 over the base model and achieves higher Pass@k than outcome-driven baselines (GRPO) and experience-based methods such as Early Experience, with gains of up to 14% on Pass@128.


References (20)

1. Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models. Qizheng Zhang, Changran Hu, Shubhangi Upasani et al., 2025.
2. FLEX: Continuous Agent Evolution via Forward Learning from Experience. Zhicheng Cai, Xinyuan Guo, Yu Pei et al., 2025.
3. Qwen2 Technical Report. An Yang, Baosong Yang, Binyuan Hui et al., 2024.
4. The Llama 3 Herd of Models. Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey et al., 2024.
5. SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild. Weihao Zeng, Yuzhen Huang, Qian Liu et al., 2025.
6. StepFun-Prover Preview: Let's Think and Verify Step by Step. Shijie Shang, Ruosi Wan, Yue Peng et al., 2025.
7. Agent-R1: Training Powerful LLM Agents with End-to-End Reinforcement Learning. Mingyue Cheng, Ouyang Jie, Shuo Yu et al., 2025.
8. GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models. GLM-4.5 Team: Aohan Zeng, Xin Lv, Qinkai Zheng et al., 2025.
9. Agent Learning via Early Experience. Kai Zhang, Xiang Chen, Bo Liu et al., 2025.
10. Mastering Diverse Domains through World Models. Danijar Hafner, J. Pašukonis, Jimmy Ba et al., 2023.
11. HybridFlow: A Flexible and Efficient RLHF Framework. Guangming Sheng, Chi Zhang, Zilingfeng Ye et al., 2024.
12. Process Reinforcement through Implicit Rewards. Ganqu Cui, Lifan Yuan, Zefan Wang et al., 2025.
13. Internalizing World Models via Self-Play Finetuning for Agentic RL. Shiqi Chen, Tongyao Zhu, Zian Wang et al., 2025.
14. Agent0: Unleashing Self-Evolving Agents from Zero Data via Tool-Integrated Reasoning. Peng Xia, Kaide Zeng et al., 2025.
15. ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory. Siru Ouyang, Jun Yan, I-Hung Hsu et al., 2025.
16. ReAct: Synergizing Reasoning and Acting in Language Models. Shunyu Yao, Jeffrey Zhao, Dian Yu et al., 2022.
17. OpenAI o1 System Card. Ahmed El-Kishky, 2024.
18. Optimizing Anytime Reasoning via Budget Relative Policy Optimization. Penghui Qi, Zi-Yan Liu, Tianyu Pang et al., 2025.
19. AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning. Wei Fu, Jiaxuan Gao, Xu Shen et al., 2025.
20. Generator-Assistant Stepwise Rollback Framework for Large Language Model Agent. Xingzuo Li, Kehai Chen, Yunfei Long et al., 2025.