Internalizing Agency from Reflective Experience
The LEAFE framework internalizes recovery agency from reflective experience, enhancing Pass@k performance in long-horizon tasks.
Key Findings
Methodology
The LEAFE (Learning Feedback-Grounded Agency from Reflective Experience) framework internalizes recovery agency from reflective experience. During exploration, the agent summarizes environment feedback into actionable experience, backtracks to earlier decision points, and explores alternative branches with revised actions. These experience-guided corrections are then distilled into the model through supervised fine-tuning, enabling the policy to recover more effectively in future interactions.
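The summary above describes this procedure only in prose. The following minimal Python sketch illustrates a reflect-backtrack-retry loop of this kind; every name in it (the `agent` and `env` interfaces, `summarize`, `locate_critical_step`, `restore`) is a hypothetical placeholder introduced for illustration, not LEAFE's actual API.

```python
# Minimal sketch of a reflective exploration loop (hypothetical interfaces, not LEAFE's code).

def reflective_rollout(agent, env, max_steps=30, max_retries=3):
    """Roll out a trajectory; on failure, summarize feedback, backtrack, and retry."""
    trajectory, success = [], False
    state = env.reset()
    for _ in range(max_steps):
        action = agent.act(state, experience=None)
        state, feedback, done, success = env.step(action)
        trajectory.append((state, action, feedback))
        if done:
            break

    retries = 0
    while not success and retries < max_retries:
        # 1) Summarize environment feedback into actionable experience.
        experience = agent.summarize(trajectory)
        # 2) Backtrack to an earlier critical decision point.
        branch_idx = agent.locate_critical_step(trajectory, experience)
        state = env.restore(trajectory, branch_idx)
        trajectory = trajectory[:branch_idx]
        # 3) Explore an alternative branch with revised, experience-conditioned actions.
        for _ in range(max_steps - branch_idx):
            action = agent.act(state, experience=experience)
            state, feedback, done, success = env.step(action)
            trajectory.append((state, action, feedback))
            if done:
                break
        retries += 1
    return trajectory, success
```

Successful corrected branches produced by a loop like this are the raw material that the supervised fine-tuning step distills into the policy.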
Key Results
- LEAFE consistently improves Pass@1 over the base model and achieves higher Pass@k than outcome-driven baselines (e.g., GRPO) and experience-based methods (e.g., Early Experience), with gains of up to 14% on Pass@128 across a diverse set of interactive coding and agentic tasks under fixed interaction budgets.
- In the WebShop task, LEAFE achieves a higher Pass@128 than GRPO on the Qwen2.5-7B model, even though GRPO performs better on Pass@1.
- On CodeContests, LEAFE improves Pass@128 by up to 47.88%, highlighting the advantage of internalizing feedback-grounded agency in domains requiring iterative correction.
Significance
The LEAFE framework significantly enhances the performance of large language models in long-horizon tasks by internalizing feedback-guided recovery capabilities. This approach not only increases the success rate but also reduces reliance on test-time resampling, decreasing both deployment complexity and latency. By turning environment feedback into actionable supervision, LEAFE offers a new perspective on the autonomous agency of large language models, advancing their application in complex tasks.
Technical Contribution
LEAFE contrasts with existing outcome-driven methods (e.g., GRPO) by internalizing feedback-guided recovery capabilities. Rather than only reinforcing already-successful trajectories, it extends the model's behavioral coverage by identifying critical decision points and making feedback-conditioned corrections. This provides richer supervision signals for large language models and improves their performance in long-horizon interactions.
Novelty
LEAFE is the first framework to internalize recovery agency from reflective experience. Unlike traditional outcome-driven methods, it emphasizes correcting failed trajectories by turning environment feedback into actionable experience, offering an innovative approach to expanding the model's exploration capabilities.
Limitations
- LEAFE may still face challenges in handling extremely complex tasks, especially when feedback signals are unclear or inconsistent.
- The computational overhead of LEAFE could be significant due to the need for reflective rollback and correction, particularly in large-scale applications.
- In certain specific tasks, LEAFE's performance might not surpass that of specially optimized outcome-driven methods.
Future Work
Future research directions include optimizing the computational efficiency of the LEAFE framework, exploring its application in more complex tasks, and integrating other learning strategies (e.g., meta-learning) to further enhance the autonomous agency of models.
AI Executive Summary
As large language models (LLMs) evolve, they are increasingly deployed as autonomous agents that must interact over long horizons with environments providing rich feedback. However, existing outcome-driven post-training methods (e.g., RLVR) primarily optimize final success signals, leaving rich environment feedback underutilized. This results in policies that reproduce a narrow set of already-successful behaviors, failing to improve feedback-grounded agency.
To address this, the paper proposes the LEAFE (Learning Feedback-Grounded Agency from Reflective Experience) framework. This framework internalizes recovery agency from reflective experience. During exploration, the agent summarizes environment feedback into actionable experience, backtracks to earlier decision points, and explores alternative branches with revised actions. These experience-guided corrections are then distilled into the model through supervised fine-tuning, enabling the policy to recover more effectively in future interactions.
LEAFE demonstrates outstanding performance across a diverse set of interactive coding and agentic tasks. Under fixed interaction budgets, LEAFE consistently improves Pass@1 over the base model and achieves higher Pass@k than outcome-driven baselines (e.g., GRPO) and experience-based methods (e.g., Early Experience), with gains of up to 14% on Pass@128.
The core technical principle of this framework is to transform environment feedback into actionable supervision, which reduces reliance on test-time resampling and decreases deployment complexity and latency. By internalizing feedback-guided recovery capabilities, LEAFE offers a new perspective on the autonomous agency of large language models, advancing their application in complex tasks.
However, LEAFE may still face challenges in handling extremely complex tasks, especially when feedback signals are unclear or inconsistent. Additionally, the computational overhead of LEAFE could be significant due to the need for reflective rollback and correction, particularly in large-scale applications. Future research directions include optimizing the computational efficiency of the LEAFE framework, exploring its application in more complex tasks, and integrating other learning strategies (e.g., meta-learning) to further enhance the autonomous agency of models.
Deep Analysis
Background
Large language models (LLMs) have made significant strides in natural language processing, particularly in generation and comprehension tasks. However, as application scenarios become more complex, LLMs are increasingly deployed as autonomous agents that must interact over long horizons with environments providing rich feedback. In this context, traditional outcome-driven post-training methods (e.g., RLVR) primarily optimize final success signals, failing to fully utilize the rich feedback provided by the environment. This results in policies that reproduce a narrow set of already-successful behaviors, failing to improve feedback-grounded agency. Existing research indicates that environment feedback not only contains simple failure signals but also provides structured information on why a trajectory is unproductive and how it can be corrected. Thus, effectively utilizing this feedback to enhance decision-making capabilities becomes a crucial research question.
Core Problem
In long-horizon interaction tasks, models need robust recovery capabilities to effectively adjust strategies when errors occur. However, existing outcome-driven methods (e.g., GRPO) typically focus only on reinforcing successful trajectories, neglecting the analysis and correction of failed trajectories. This limits model performance in long-horizon tasks, especially those requiring multiple attempts and corrections. How to enhance model recovery and exploration capabilities without increasing deployment complexity and latency is the core problem currently faced.
Innovation
The LEAFE framework addresses the above issues through the following innovations:
1) Reflective Experience Internalization: By summarizing environment feedback into actionable experience, the agent can identify critical decision points and make feedback-conditioned corrections. This innovation enables the model to recover more effectively in future interactions.
2) Supervised Fine-Tuning: Experience-guided corrections are distilled into the model through supervised fine-tuning, enabling the policy to recover more effectively in future interactions. This process not only increases the success rate but also reduces reliance on test-time resampling.
3) Expanded Behavioral Coverage: By identifying critical decision points and making feedback-conditioned corrections, LEAFE extends the model's behavioral coverage, enhancing performance in long-horizon interactions.
Methodology
The specific methodology of the LEAFE framework is as follows:
- Reflective Experience Internalization: During exploration, the agent summarizes environment feedback into actionable experience, backtracks to earlier decision points, and explores alternative branches with revised actions.
- Supervised Fine-Tuning: These experience-guided corrections are distilled into the model through supervised fine-tuning, enabling the policy to recover more effectively in future interactions (a data-construction sketch follows this list).
- Expanded Behavioral Coverage: By identifying critical decision points and making feedback-conditioned corrections, the model's behavioral coverage is extended.
- Experimental Evaluation: Experiments are conducted across a diverse set of interactive coding and agentic tasks to evaluate LEAFE's performance under fixed interaction budgets.
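To make the distillation step concrete, below is a hedged sketch of how corrected branches might be serialized into supervised fine-tuning examples. The field names and prompt format are illustrative assumptions, not the paper's exact data schema.

```python
# Hypothetical sketch: turning experience-guided corrections into SFT examples.
# Field names and the prompt format are assumptions, not the paper's exact schema.

def build_sft_examples(corrected_trajectories):
    """Convert corrected branches into (prompt, target) pairs for supervised fine-tuning."""
    examples = []
    for traj in corrected_trajectories:
        context = traj["task_instruction"]
        for step in traj["steps"]:
            prompt = (
                f"{context}\n"
                f"Observation: {step['observation']}\n"
                f"Experience: {step.get('experience', 'none')}\n"
                "Next action:"
            )
            examples.append({"prompt": prompt, "target": step["revised_action"]})
    return examples
```

Standard cross-entropy fine-tuning on such (prompt, target) pairs is then enough to internalize the feedback-conditioned corrections into the model's weights.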
Experiments
The experimental design includes the following aspects:
- Datasets: CodeContests, WebShop, ALFWorld, ScienceWorld, and Sokoban, covering a range of tasks from programming to multi-step interactive reasoning.
- Baselines: GRPO and Early Experience are selected as baseline methods for comparison.
- Evaluation Metrics: Pass@1 and Pass@128 serve as the main metrics, measuring single-try success rate and performance under larger sampling budgets, respectively (a common Pass@k estimator is sketched after this list).
- Hyperparameters: Hyperparameters are tuned to optimize model performance, and ablation studies verify the contribution of each component.
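Pass@k is commonly computed with the unbiased estimator popularized by earlier code-generation evaluations; whether the paper uses this exact estimator is an assumption, but it is the standard way to obtain Pass@k from n sampled attempts of which c succeed.

```python
import numpy as np

def pass_at_k(n, c, k):
    """Unbiased Pass@k estimate for one problem: n samples drawn, c correct, budget k."""
    if n - c < k:
        return 1.0  # any size-k subset must contain at least one correct sample
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))
```

The reported Pass@128 numbers would then correspond to averaging this quantity over all evaluation problems with k = 128.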
Results
Experimental results show that LEAFE performs excellently across diverse tasks:
- On CodeContests, LEAFE improves Pass@128 by up to 47.88%, highlighting the advantage of internalizing feedback-grounded agency in domains requiring iterative correction.
- In the WebShop task, LEAFE achieves a higher Pass@128 than GRPO on the Qwen2.5-7B model, even though GRPO performs better on Pass@1.
- Under fixed interaction budgets, LEAFE consistently improves Pass@1 over the base model and achieves higher Pass@k than outcome-driven baselines (e.g., GRPO) and experience-based methods (e.g., Early Experience), with gains of up to 14% on Pass@128.
Applications
Application scenarios for the LEAFE framework include:
- Programming Task Optimization: LEAFE can be used to optimize code generation in programming tasks, improving code accuracy and efficiency through internalized feedback, applicable to automated programming tools.
- Autonomous Agents in Complex Environments: In complex environments requiring long-horizon interaction and error recovery, LEAFE can significantly enhance the autonomous agency of models, applicable to fields such as robotics and autonomous driving.
- Multi-Step Interactive Reasoning: LEAFE can be used in multi-step interactive reasoning tasks, improving task completion rates by identifying and correcting critical decision points, applicable to intelligent assistants and dialogue systems.
Limitations & Outlook
Despite LEAFE's excellent performance across multiple tasks, there are still some limitations:
- Computational Overhead: The computational overhead of LEAFE could be significant due to the need for reflective rollback and correction, particularly in large-scale applications.
- Unclear Feedback Signals: LEAFE's performance may degrade when feedback signals are unclear or inconsistent.
- Specific Task Performance: In certain tasks, LEAFE's performance might not surpass that of specially optimized outcome-driven methods.
Future research directions include optimizing the computational efficiency of the LEAFE framework, exploring its application in more complex tasks, and integrating other learning strategies (e.g., meta-learning) to further enhance the autonomous agency of models.
Plain Language (Accessible to Non-Experts)
Imagine you're navigating a maze, with markings on the walls indicating which paths are dead ends and which lead to the exit. LEAFE is like a smart assistant that not only remembers which paths are dead ends but also tells you how to avoid them and find better routes. Traditional methods are like an assistant that only focuses on whether you successfully exit the maze, remembering only the successful paths without telling you how to improve failed attempts. LEAFE helps you make better decisions in your next attempt by summarizing the experience of each attempt. This way, you not only exit the maze faster but also learn more from each attempt. This approach not only increases success rates but also reduces the number of times you get lost in the maze.
ELI14 (Explained Like You're 14)
Imagine you're playing a super complex video game with many levels, each with different challenges. Traditional methods are like a coach who only cares if you beat the level, remembering only the successful paths without telling you how to improve failed attempts. LEAFE is like a super smart game assistant that not only remembers why you failed each time but also tells you how to improve your strategy, making it easier to beat the level on your next try. This way, you not only finish the game faster but also learn more skills with each attempt. Isn't that cool?
Glossary
Large Language Model
A deep learning-based model capable of understanding and generating natural language text, widely used in natural language processing tasks.
In this paper, large language models are used as autonomous agents that must interact over long horizons with environments providing rich feedback.
Autonomous Agent
A system capable of making decisions and taking actions independently, typically executing tasks in complex environments.
In this paper, large language models are viewed as autonomous agents that must make decisions and recover in long-horizon tasks.
Feedback Learning
A learning method that uses feedback information from the environment to improve model decisions.
The LEAFE framework internalizes recovery capabilities through feedback learning, enhancing model performance in long-horizon tasks.
Reflective Experience
A method of improving future decisions by summarizing past experiences and feedback.
The LEAFE framework internalizes recovery capabilities from reflective experience, enabling more effective recovery in future interactions.
Long-Horizon Task
A complex task requiring decision-making and action over multiple steps.
Tasks such as WebShop and ALFWorld in this paper are long-horizon tasks requiring robust recovery capabilities.
Outcome-Driven Method
A learning method that primarily focuses on final success signals, often neglecting feedback information during the process.
Traditional outcome-driven methods like GRPO primarily optimize final success signals without fully utilizing environment feedback.
Supervised Fine-Tuning
A method of improving model performance by fine-tuning with supervision signals.
The LEAFE framework internalizes experience-guided corrections into the model through supervised fine-tuning.
Behavioral Coverage
The diversity and breadth of behaviors a model can execute in a task.
LEAFE extends the model's behavioral coverage by identifying critical decision points and making feedback-conditioned corrections.
Pass@k
A metric measuring the probability that at least one of k sampled attempts succeeds, reflecting the model's exploration ability as well as its success rate.
The paper uses Pass@1 and Pass@128 as the main evaluation metrics, measuring single-try success rate and performance under larger sampling budgets, respectively.
GRPO
Group Relative Policy Optimization, an outcome-driven reinforcement learning method that reinforces sampled trajectories according to group-relative rewards, primarily optimizing the probability of successful trajectories.
GRPO is used as one of the baseline methods in this paper for comparative experiments.
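For context, GRPO's outcome-driven update is built on a group-relative advantage: rewards for a group of rollouts sampled from the same prompt are normalized by the group's mean and standard deviation. A minimal sketch of that normalization step (not the authors' implementation) follows.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize per-rollout rewards within a group sampled for the same prompt."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)
```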
Open Questions (Unanswered Questions from This Research)
1. How can the efficiency and performance of the LEAFE framework be further improved without increasing computational overhead? Current methods may face computational resource limitations when handling complex tasks, necessitating exploration of more efficient algorithms in the future.
2. How can LEAFE's robustness be improved when feedback signals are unclear or inconsistent? Current methods rely on clear feedback signals, and future research needs to explore how to make effective decisions in uncertain environments.
3. Can LEAFE's performance in specific tasks surpass that of specially optimized outcome-driven methods? Further experimental validation and theoretical analysis are needed.
4. How can LEAFE be combined with other learning strategies (e.g., meta-learning) to further enhance the autonomous agency of models? This may require new algorithm designs and experimental validation.
5. How does LEAFE perform in multi-task learning environments? Research is needed on its transferability and adaptability across different tasks.
Applications
Immediate Applications
Programming Task Optimization
LEAFE can be used to optimize code generation in programming tasks, improving code accuracy and efficiency through internalized feedback, applicable to automated programming tools.
Autonomous Agents in Complex Environments
In complex environments requiring long-horizon interaction and error recovery, LEAFE can significantly enhance the autonomous agency of models, applicable to fields such as robotics and autonomous driving.
Multi-Step Interactive Reasoning
LEAFE can be used in multi-step interactive reasoning tasks, improving task completion rates by identifying and correcting critical decision points, applicable to intelligent assistants and dialogue systems.
Long-term Vision
General Artificial Intelligence
By continuously optimizing and expanding the LEAFE framework, it may be possible to achieve more powerful general artificial intelligence systems in the future, capable of autonomous decision-making and learning in complex environments.
Cross-Domain Applications
The technology behind LEAFE can be extended to more domains, such as medical diagnosis and financial analysis, improving decision accuracy and robustness through internalized feedback.
Abstract
Large language models are increasingly deployed as autonomous agents that must plan, act, and recover from mistakes through long-horizon interaction with environments that provide rich feedback. However, prevailing outcome-driven post-training methods (e.g., RL with verifiable rewards) primarily optimize final success signals, leaving rich environment feedback underutilized. Consequently, they often lead to distribution sharpening: the policy becomes better at reproducing a narrow set of already-successful behaviors, while failing to improve the feedback-grounded agency needed to expand problem-solving capacity (e.g., Pass@k) in long-horizon settings. To address this, we propose LEAFE (Learning Feedback-Grounded Agency from Reflective Experience), a framework that internalizes recovery agency from reflective experience. Specifically, during exploration, the agent summarizes environment feedback into actionable experience, backtracks to earlier decision points, and explores alternative branches with revised actions. We then distill these experience-guided corrections into the model through supervised fine-tuning, enabling the policy to recover more effectively in future interactions. Across a diverse set of interactive coding and agentic tasks under fixed interaction budgets, LEAFE consistently improves Pass@1 over the base model and achieves higher Pass@k than outcome-driven baselines (GRPO) and experience-based methods such as Early Experience, with gains of up to 14% on Pass@128.
References (20)
Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models
Qizheng Zhang, Changran Hu, Shubhangi Upasani et al.
FLEX: Continuous Agent Evolution via Forward Learning from Experience
Zhicheng Cai, Xinyuan Guo, Yu Pei et al.
Qwen2 Technical Report
An Yang, Baosong Yang, Binyuan Hui et al.
The Llama 3 Herd of Models
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey et al.
SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild
Weihao Zeng, Yuzhen Huang, Qian Liu et al.
StepFun-Prover Preview: Let's Think and Verify Step by Step
Shijie Shang, Ruosi Wan, Yue Peng et al.
Agent-R1: Training Powerful LLM Agents with End-to-End Reinforcement Learning
Mingyue Cheng, Ouyang Jie, Shuo Yu et al.
GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models
GLM-4.5 Team: Aohan Zeng, Xin Lv, Qinkai Zheng et al.
Agent Learning via Early Experience
Kai Zhang, Xiang Chen, Bo Liu et al.
Mastering Diverse Domains through World Models
Danijar Hafner, J. Pašukonis, Jimmy Ba et al.
HybridFlow: A Flexible and Efficient RLHF Framework
Guangming Sheng, Chi Zhang, Zilingfeng Ye et al.
Process Reinforcement through Implicit Rewards
Ganqu Cui, Lifan Yuan, Zefan Wang et al.
Internalizing World Models via Self-Play Finetuning for Agentic RL
Shiqi Chen, Tongyao Zhu, Zian Wang et al.
Agent0: Unleashing Self-Evolving Agents from Zero Data via Tool-Integrated Reasoning
Peng Xia, Kaide Zeng et al.
ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory
Siru Ouyang, Jun Yan, I-Hung Hsu et al.
ReAct: Synergizing Reasoning and Acting in Language Models
Shunyu Yao, Jeffrey Zhao, Dian Yu et al.
OpenAI o1 System Card
Ahmed El-Kishky
Optimizing Anytime Reasoning via Budget Relative Policy Optimization
Penghui Qi, Zi-Yan Liu, Tianyu Pang et al.
AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning
Wei Fu, Jiaxuan Gao, Xu Shen et al.
Generator-Assistant Stepwise Rollback Framework for Large Language Model Agent
Xingzuo Li, Kehai Chen, Yunfei Long et al.