EEVEE: Towards Test-time Prompt Learning in the Real World for Self-Improving Agents
The EEVEE framework uses a router-conditioned prompt set with router-prompt co-evolution to make LLMs more robust across heterogeneous task streams, improving average scores by 10.38 to 24.32 points.
Key Findings
Methodology
The EEVEE framework introduces a learnable router that partitions the incoming input stream into task-specific clusters, each associated with a dedicated prompt configuration. The router and prompts are trained with a co-evolution strategy that alternates between router evolution and prompt refinement phases. Training proceeds in three stages: initialization, which generates diverse prompts via Pareto front maintenance; exploration, which applies iterative coupled updates; and convergence, which begins once the routing stabilizes. The router is optimized with multi-objective scoring functions (accuracy, consistency, and balance) that guide the partitioning. Prompts are stored in a Pareto front pool to preserve diversity. On GPQA, Formula, TheoremQA, and HumanEval, EEVEE improves the average score by 10.38 points over Qwen3-4B-Instruct and 24.32 points over DeepSeek-V3.2, and surpasses the state-of-the-art methods GEPA and ACE by up to 37.2% and 48.2%.
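To make the idea of a router-conditioned prompt set concrete, here is a minimal sketch. It is an illustration only: the class `RoutedPromptSet`, the function `keyword_router`, the cluster names, and the prompt texts are all hypothetical, and the hand-written keyword rules stand in for the learned router described above.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class RoutedPromptSet:
    """A router plus one prompt per cluster (hypothetical names, not the paper's API)."""
    router: Callable[[str], str]   # maps a raw input to a cluster id
    prompts: Dict[str, str]        # cluster id -> prompt text
    fallback: str = "general"      # cluster used when the router returns an unknown id

    def build(self, user_input: str) -> str:
        """Prepend the prompt of the cluster chosen by the router."""
        cluster = self.router(user_input)
        prompt = self.prompts.get(cluster, self.prompts[self.fallback])
        return f"{prompt}\n\n{user_input}"


def keyword_router(text: str) -> str:
    """Toy stand-in for a learned router: fixed keyword rules."""
    lowered = text.lower()
    if "def " in lowered or "function" in lowered:
        return "code"
    if any(token in lowered for token in ("prove", "theorem", "integral")):
        return "math"
    return "general"


prompt_set = RoutedPromptSet(
    router=keyword_router,
    prompts={
        "code": "Write correct, tested Python.",
        "math": "Reason step by step and state the final answer.",
        "general": "Answer concisely.",
    },
)

full_prompt = prompt_set.build("Prove the theorem for all n.")
```

The point of the design is that each input only ever sees the prompt of its own cluster, so refining the "code" prompt cannot degrade behavior on "math" inputs.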
Key Results
- On the Qwen3-4B-Instruct model, EEVEE achieves an average score of 51.75, surpassing the baseline by 10.38 points, and outperforms GEPA and ACE by 16.83 and 14.02 points respectively, indicating strong multi-task adaptation.
- On DeepSeek-V3.2, the average score reaches 64.07, a 24.32-point improvement over the baseline, with per-benchmark gains of +30.55 on Formula, +18.63 on TheoremQA, and +50 on HumanEval, indicating that the gains carry over to a second model.
- Ablation studies reveal that jointly optimizing the router and prompts via co-evolution significantly outperforms static routing or single-stage training, confirming the importance of dynamic, coupled updates.
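The figures in the bullets above can be cross-checked with simple arithmetic. The sketch below assumes that each reported margin is an absolute point difference on the same average score; the helper names `implied_score` and `relative_gain_pct` are introduced here for illustration.

```python
# Figures copied from the Key Results bullets; nothing here is new data.
EEVEE_QWEN = 51.75       # EEVEE average on Qwen3-4B-Instruct
EEVEE_DEEPSEEK = 64.07   # EEVEE average on DeepSeek-V3.2


def implied_score(eevee: float, margin: float) -> float:
    """Score of the comparison system, assuming `margin` is an absolute point gap."""
    return round(eevee - margin, 2)


def relative_gain_pct(eevee: float, margin: float) -> float:
    """Percentage improvement of EEVEE over the implied comparison score."""
    return round(100 * margin / (eevee - margin), 1)


baseline_qwen = implied_score(EEVEE_QWEN, 10.38)          # 41.37
baseline_deepseek = implied_score(EEVEE_DEEPSEEK, 24.32)  # 39.75
gepa_qwen = implied_score(EEVEE_QWEN, 16.83)              # 34.92
ace_qwen = implied_score(EEVEE_QWEN, 14.02)               # 37.73
gain_over_gepa = relative_gain_pct(EEVEE_QWEN, 16.83)     # 48.2
gain_over_ace = relative_gain_pct(EEVEE_QWEN, 14.02)      # 37.2
```

Under that reading, the relative gains over GEPA and ACE on Qwen3-4B-Instruct come out to 48.2% and 37.2%, the same two percentages quoted elsewhere in this summary, which suggests those figures are relative gains in the average score.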
Significance
This work addresses a critical challenge in deploying large language models in real-world scenarios, where input streams are heterogeneous and constantly evolving. Traditional prompt tuning methods lack the ability to dynamically adapt to diverse tasks, leading to performance degradation due to interference. EEVEE's router-conditioned prompt set effectively mitigates this issue, enabling models to maintain high performance across multiple domains. The approach paves the way for more robust, scalable, and self-improving AI systems capable of continuous learning in complex environments, with broad implications for industrial applications such as customer service, automated coding, and knowledge management.
Technical Contribution
EEVEE introduces a novel multi-dataset test-time prompt learning framework that integrates a learnable router with a diverse prompt pool, optimized through a co-evolution strategy. The key technical innovations include:
- A multi-objective routing mechanism that balances accuracy, consistency, and task distribution, reducing cross-task interference;
- A three-stage training process (initialization, exploration, and convergence) that ensures prompt diversity and routing stability;
- Use of Pareto front-based prompt pool management to maintain a set of complementary prompts;
- An interleaved optimization scheme that allows the router and prompts to evolve jointly, significantly outperforming static or sequential training approaches.
These contributions enable scalable, flexible, and robust multi-task learning in large language models.
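One way to picture the multi-objective routing score is as a weighted sum of the three named terms. The summary does not give the actual formulas, so everything below is an assumption made for illustration: accuracy is the mean task score, consistency is the share of inputs sent to the majority cluster of their source dataset, balance is the normalized entropy of cluster sizes, and the weights are arbitrary.

```python
import math
from collections import Counter
from typing import Dict, Sequence


def balance(assignments: Sequence[str]) -> float:
    """Normalized entropy over the clusters actually used: 1.0 = even split, 0.0 = one cluster."""
    counts = Counter(assignments)
    if len(counts) <= 1:
        return 0.0
    total = len(assignments)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(len(counts))


def consistency(assignments: Sequence[str], sources: Sequence[str]) -> float:
    """Fraction of inputs routed to the majority cluster of their source dataset."""
    by_source: Dict[str, Counter] = {}
    for cluster, source in zip(assignments, sources):
        by_source.setdefault(source, Counter())[cluster] += 1
    majority = sum(max(counts.values()) for counts in by_source.values())
    return majority / len(assignments)


def router_score(
    correct: Sequence[float],
    assignments: Sequence[str],
    sources: Sequence[str],
    weights: Sequence[float] = (0.6, 0.2, 0.2),  # hypothetical weights
) -> float:
    """Weighted sum of accuracy, consistency, and balance (assumed definitions)."""
    accuracy = sum(correct) / len(correct)
    w_acc, w_con, w_bal = weights
    return (
        w_acc * accuracy
        + w_con * consistency(assignments, sources)
        + w_bal * balance(assignments)
    )


split = router_score([1, 1, 1, 0], ["a", "a", "b", "b"], ["x", "x", "y", "y"])
collapsed = router_score([1, 1, 0, 0], ["a", "a", "a", "a"], ["x", "x", "y", "y"])
```

With these definitions a router that sends everything to one cluster earns no balance credit, so the score pushes the search toward partitions that actually separate the stream.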
Novelty
This is the first framework explicitly designed for multi-dataset, test-time prompt learning with a learnable router that dynamically partitions input streams. Unlike prior methods like GEPA and ACE, which adapt prompts within a single task or static environment, EEVEE's co-evolution strategy allows the router and prompts to mutually adapt, effectively mitigating cross-task interference. The integration of Pareto front-based prompt pools and multi-objective routing distinguishes this work, offering a scalable solution for real-world multi-task scenarios. This approach fundamentally advances prompt learning by enabling models to self-organize and adapt continuously in heterogeneous environments.
Limitations
- The performance of the learned router heavily depends on the training data distribution; in unseen or highly skewed task environments, the routing may be suboptimal, affecting overall performance.
- The training process involves multiple phases and complex optimization, leading to high computational costs and potential difficulties in real-time deployment.
- Despite improvements, some task-specific interference and forgetting still occur, especially when task clusters are poorly defined or prompts lack sufficient coverage. Further research is needed to enhance generalization and reduce training overhead.
Future Work
Future directions include developing more efficient, lightweight routing mechanisms that can operate in real-time with lower computational costs. Extending the framework to incorporate multi-modal inputs, such as images and speech, could broaden its applicability. Additionally, integrating meta-learning or reinforcement learning techniques may further improve the adaptability and robustness of the routing and prompt co-evolution process, enabling truly autonomous, continuously self-improving AI agents in dynamic environments.
AI Executive Summary
The rapid evolution of large language models (LLMs) has revolutionized natural language processing, enabling unprecedented capabilities in understanding and generating human-like text. However, deploying these models in real-world scenarios presents significant challenges, particularly when faced with heterogeneous, multi-source input streams. Traditional prompt tuning methods, which rely on fixed prompts or static adaptation strategies, struggle to maintain performance amid diverse tasks, domains, and evolving data distributions. This limitation hampers the practical scalability of LLMs, especially in dynamic environments such as customer service, automated coding, and knowledge management.
Recognizing this gap, the authors introduce EEVEE, a novel framework designed to facilitate test-time prompt learning across multiple datasets and task streams. EEVEE employs a learnable router that dynamically partitions incoming inputs into task-specific clusters, each associated with a tailored prompt configuration. This approach preserves the benefits of prompt-based adaptation while mitigating cross-task interference, a common issue in multi-task learning. The core innovation lies in the router-prompt co-evolution strategy, which alternates between optimizing the routing mechanism and refining prompts, ensuring they evolve synergistically. The training process unfolds in three stages: initialization, exploration, and convergence. During initialization, diverse prompts are generated using Pareto front maintenance to cover a broad task space. Exploration involves iterative coupled updates, where the router and prompts are refined based on multi-objective scoring functions that balance accuracy, consistency, and task distribution. Once the routing stabilizes, the model enters the convergence phase, where large-scale prompt tuning occurs under a fixed routing scheme.
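The three-stage schedule described above hinges on detecting when the routing has stabilized. The summary does not say how that is measured, so the sketch below makes an assumption: stability is the fraction of validation inputs whose cluster assignment is unchanged between two consecutive rounds, compared against a hypothetical threshold. The function names are illustrative.

```python
from typing import Sequence


def routing_agreement(previous: Sequence[str], current: Sequence[str]) -> float:
    """Fraction of validation inputs whose cluster did not change between rounds."""
    if len(previous) != len(current) or not current:
        raise ValueError("assignment lists must be non-empty and equally long")
    return sum(p == c for p, c in zip(previous, current)) / len(current)


def next_phase(phase: str, agreement: float, threshold: float = 0.95) -> str:
    """Advance initialization -> exploration -> convergence.

    Exploration ends only when routing agreement reaches `threshold`
    (an assumed criterion); convergence is terminal.
    """
    if phase == "initialization":
        return "exploration"
    if phase == "exploration" and agreement >= threshold:
        return "convergence"
    return phase


agreement = routing_agreement(["a", "b", "a", "b"], ["a", "b", "b", "b"])
phase = next_phase("exploration", agreement)
```

In this example one of four assignments changed, so the agreement is 0.75 and the schedule stays in exploration; only once the router stops moving inputs between clusters would prompt tuning proceed under a fixed routing scheme.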
Extensive experiments demonstrate that EEVEE significantly outperforms existing methods such as GEPA and ACE across multiple benchmarks, including GPQA, Formula, TheoremQA, and HumanEval. On the four-benchmark suite, EEVEE raises the average score by 10.38 points on Qwen3-4B-Instruct and 24.32 points on DeepSeek-V3.2, and surpasses GEPA and ACE by up to 37.2% and 48.2%. Notably, in multi-task incremental learning scenarios, EEVEE maintains positive retention, ending with a +41.53 cumulative gain, whereas baseline methods decline. The framework also exhibits strong cross-model and cross-task generalization, transferring prompts effectively between different models and to unseen tasks. Furthermore, EEVEE maintains computational efficiency, with only modest token overhead compared to baseline methods.
This research advances the field of prompt learning by providing a scalable, adaptive, and robust solution for real-world multi-task environments. It addresses longstanding issues of task interference and catastrophic forgetting, paving the way for autonomous, self-improving AI systems capable of continuous learning. The framework's design principles—dynamic routing, multi-objective optimization, and joint evolution—offer a blueprint for future developments in multi-modal, multi-task AI. Despite its successes, challenges remain, including the high computational cost of training and potential limitations in extremely novel or skewed task distributions. Future work will focus on enhancing efficiency, extending to multi-modal inputs, and integrating meta-learning techniques to further boost adaptability. Overall, EEVEE represents a significant step toward intelligent agents that can learn, adapt, and improve autonomously in complex, real-world settings.
Deep Analysis
Background
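In recent years, the rise of large pre-trained models such as GPT, BERT, and T5 has made prompt learning a key technique for improving model adaptability. Early work on soft prompting and discrete prompting adapted models to tasks by optimizing the prompt rather than fine-tuning model parameters. Methods such as AutoPrompt and P-Tuning then automated prompt construction, using gradient information to optimize prompt content. Reflection-based methods such as GEPA and ACE use model feedback to improve prompts iteratively. However, most of these methods are confined to a single task or dataset and cope poorly with complex, changing real-world settings. More recent work targets multi-task and multi-domain prompt adaptation, using multiple prompt pools or memory mechanisms to learn jointly across task sources, but it still suffers from cross-task interference and limited generalization. Overall, prompt learning has expanded from single-task to multi-task, multi-domain settings, yet practical use still calls for new mechanisms that improve robustness and efficiency.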
Core Problem
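In practice, a model often faces an input stream drawn from different domains, with different formats and evaluation rules. A single prompt or a static router cannot serve such varied demands and tends to cause interference between tasks, which degrades performance. Traditional methods mostly rely on predefined routing or fixed prompts; lacking dynamic adaptation, they cannot keep up with a task stream that keeps changing. As a result, models behave unstably in multi-task settings and can suffer from forgetting and interference. The key to solving this is a mechanism that partitions the stream into task clusters dynamically, preserving task specificity while still allowing knowledge transfer across tasks. Achieving that requires a new routing strategy, a coordinated optimization mechanism, and an efficient training procedure, so that the model can keep improving in complex, changing environments.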
Innovation
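The core innovation is the EEVEE framework, which combines a learnable router with a diverse prompt pool and uses a router-prompt co-evolution strategy to address interference in heterogeneous multi-source task streams. Specific innovations include:
- A multi-objective scoring mechanism (accuracy, consistency, balance) that optimizes the router and partitions task clusters dynamically;
- A three-stage training process (initialization, exploration, convergence) that ensures prompt diversity and router stability;
- A Pareto front pool that maintains diverse prompts and avoids the limits of relying on a single prompt;
- An alternating optimization strategy in which the router and the prompt set co-evolve, improving overall adaptability.
Together these let the model learn continually and improve itself in multi-task, multi-domain settings, significantly outperforming existing methods.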
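The Pareto front pool mentioned above can be illustrated with a small dominance filter. This is a sketch under assumptions: each prompt is scored by a vector of per-task accuracies (the summary does not specify the objectives used for the pool), and the prompt names and scores are made up.

```python
from typing import Dict, List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if `a` is at least as good as `b` everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores: Dict[str, Sequence[float]]) -> List[str]:
    """Return the names of prompts that no other prompt dominates."""
    return [
        name
        for name, vector in scores.items()
        if not any(
            dominates(other, vector)
            for other_name, other in scores.items()
            if other_name != name
        )
    ]


# Hypothetical scores: (accuracy on a math task, accuracy on a code task).
candidates = {
    "p_math": (0.80, 0.30),
    "p_code": (0.35, 0.90),
    "p_mixed": (0.60, 0.60),
    "p_weak": (0.30, 0.25),
}

pool = pareto_front(candidates)  # ['p_math', 'p_code', 'p_mixed']
```

Keeping the whole front rather than the single best average is what preserves specialists such as `p_math` and `p_code`, which a router can later assign to different clusters.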
Methodology
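- Input: a heterogeneous multi-source task stream covering different domains, formats, and evaluation rules.
- Initialization: run prompt tuning on the mixed training set to generate a diverse prompt pool, then use the Pareto front to select complementary prompts.
- Router design: build a learnable router, optimized with multi-objective scoring (accuracy, consistency, balance), that assigns the input stream to task clusters.
- Co-evolution: during training, alternate between router evolution and prompt tuning.
  - Router evolution: with the prompt set fixed, generate several candidate routing strategies, evaluate them on the validation set, and keep the best one.
  - Prompt tuning: within each resulting task cluster, mutate and reflect on the prompts to improve their fit to the task.
- Training procedure: three stages (initialization, exploration, convergence), each with a different objective, improving performance step by step.
- Evaluation: test on several public datasets (GPQA, Formula, TheoremQA, HumanEval), compare strategies, and verify robustness and transferability.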
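The alternation between router evolution and prompt tuning in the list above can be sketched as a toy loop. Everything here is a simplified stand-in: `score` replaces an LLM evaluation with a substring check, the candidate routers and prompt mutations are fixed lists rather than generated ones, and the small cluster-count bonus in `router_objective` is an assumed, crude form of the balance objective.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]       # (input text, skill needed to answer it)
Router = Callable[[str], int]   # input text -> cluster index


def score(prompt: str, example: Example) -> float:
    """Toy evaluator: a prompt 'works' if it names the skill the example needs."""
    return 1.0 if example[1] in prompt else 0.0


def evaluate(router: Router, prompts: List[str], data: List[Example]) -> float:
    """Mean score when each input is answered with its cluster's prompt."""
    return sum(score(prompts[router(x[0])], x) for x in data) / len(data)


def router_objective(router: Router, prompts: List[str], data: List[Example]) -> float:
    """Accuracy plus a small bonus per cluster used (assumed stand-in for balance)."""
    used = len({router(x[0]) for x in data})
    return evaluate(router, prompts, data) + 0.01 * used


def evolve_router(candidates: List[Router], prompts: List[str], data: List[Example]) -> Router:
    """Router step: prompts are frozen; keep the best-scoring candidate router."""
    return max(candidates, key=lambda r: router_objective(r, prompts, data))


def refine_prompts(router: Router, prompts: List[str], mutations: List[str],
                   data: List[Example]) -> List[str]:
    """Prompt step: the router is frozen; each cluster keeps its best prompt variant."""
    refined = []
    for k, current in enumerate(prompts):
        members = [x for x in data if router(x[0]) == k]
        options = [current] + mutations
        refined.append(max(options, key=lambda p: sum(score(p, x) for x in members)))
    return refined


def co_evolve(candidates: List[Router], prompts: List[str], mutations: List[str],
              data: List[Example], rounds: int = 3) -> Tuple[Router, List[str]]:
    """Alternate the two steps for a fixed number of rounds."""
    router = candidates[0]
    for _ in range(rounds):
        router = evolve_router(candidates, prompts, data)
        prompts = refine_prompts(router, prompts, mutations, data)
    return router, prompts


def all_zero(text: str) -> int:
    return 0


def by_def(text: str) -> int:
    return 1 if "def" in text else 0


data = [("integrate x^2", "math"), ("solve 2x=4", "math"),
        ("def f(): pass", "code"), ("fix this def", "code")]
router, prompts = co_evolve(
    [all_zero, by_def],
    ["be helpful", "be helpful"],
    ["use math reasoning", "write code carefully"],
    data,
)
final_accuracy = evaluate(router, prompts, data)
```

The toy also shows why the two parts depend on each other: with identical starting prompts, splitting the stream yields no accuracy gain on its own, so the router only prefers the split because of the cluster bonus, and the prompts can specialize only after the split exists.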
Experiments
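- Datasets: GPQA (knowledge question answering), Formula (mathematical reasoning), TheoremQA (symbolic reasoning), and HumanEval (code generation), covering several task types.
- Baselines: the unadapted model, GEPA, ACE, static routing, and single-stage training.
- Metrics: average score, task retention, transferability, and robustness.
- Hyperparameters: routing objective weights, prompt pool size, number of training rounds, learning rate, and others, all tuned.
- Experimental design: multiple runs with random sampling and averaged results, ablations to verify the contribution of each component, and tests of different task-cluster partitioning strategies.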
Results
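- On the four benchmarks, EEVEE improves the average score by 10.38 to 24.32 points and significantly outperforms GEPA and ACE, showing especially strong resistance to interference in multi-task continual learning.
- In cross-model transfer, prompts trained on Qwen3-4B-Instruct and transferred to DeepSeek-V3.2 yield an average gain of 12.28 points, indicating good generalization.
- Ablations show that static routing and single-stage training perform clearly worse, while dynamic joint optimization significantly improves performance, validating the design.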
Applications
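- Immediate applications: intelligent customer service systems, multi-task question answering platforms, and automated programming assistants, where the framework can improve performance and stability on tasks from multiple sources.
- Long-term vision: advancing intelligent systems that learn and optimize themselves, adapting and improving continually in complex environments; reinforcement learning and meta-learning could be combined with the framework to strengthen it further.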
Limitations & Outlook
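- Router performance currently depends on the distribution of training samples; it may fall short on extreme or unseen task types.
- Training is complex and computationally expensive, and deployment places heavy demands on hardware.
- Interference or forgetting still occurs on some tasks, especially when the cluster partition is poor or prompt coverage is insufficient. Future work needs to improve training efficiency and generalization and to address weak performance in extreme scenarios.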
Plain Language Accessible to non-experts
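Imagine you work in a large factory with many production lines, each responsible for a different product. New orders arrive from different customers, each asking for something different. To keep the factory running efficiently, a manager assigns each order to a production line according to its type. Every line is especially good at one kind of product, but if all the orders pile up together, the factory becomes chaotic and output drops.
Now picture the large language model as the factory, the prompt set as the production lines, and the router as the manager who decides which line (prompt) each order (input) should go to. The manager keeps learning and adjusting its judgment so that every order is handled by the line best suited to it, which raises overall efficiency. Through repeated trial and error, the factory gets smarter, copes with all kinds of complicated orders, and keeps running efficiently. It can adapt quickly to new orders and perform at its best, just as EEVEE helps a large model perform better in multi-task settings.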
ELI14 Explained like you're 14
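Imagine your school has many teachers for different subjects, such as math, language arts, and science, and each teacher has their own way of teaching. Different teachers give you different kinds of homework: some like to quiz you with problems, others like to have you write essays. To help you learn better, the school appoints a "teacher dispatcher" who sends you to the most suitable teacher for the subject you need. The dispatcher keeps learning which teacher is good at what and, based on your assignment, sends you to the right one. That way you learn faster and better, without getting confused by different teaching styles.
EEVEE works the same way: it trains a "dispatcher" to look at each input and send different tasks to different "prompt teachers". Each prompt teacher specializes in one kind of task, such as math reasoning or writing code. By continually adjusting both the dispatcher and the teachers' prompts, the model gets smarter and can handle many different tasks, just as you learn better at school. This makes a large model more stable and more capable when it faces complex, varied tasks.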
Glossary
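Prompt Tuning
A method that steers model behavior by adjusting the input prompt without modifying model parameters. Technically, it adapts to a task by optimizing prompt vectors or prompt text.
In this paper, prompt tuning is used to optimize the model's response strategy dynamically at test time.
Router
A mechanism that partitions inputs into task clusters or selects a prompt configuration based on input features, much like a traffic controller.
This paper introduces a learnable router that assigns the input stream to different prompt sets.
Co-evolution
Two or more systems (here, the router and the prompt set) are optimized alternately during training, each improving the other to reach better overall performance.
This paper uses a router-prompt co-evolution strategy so that the two improve together.
Pareto Front
In multi-objective optimization, the set of solutions that no other solution dominates, that is, no alternative is at least as good on every objective and strictly better on at least one.
Used to maintain diversity in the prompt set and avoid getting stuck in a local optimum.
Multi-dataset Test-time Prompt Learning
A learning approach in which, after deployment, the model adjusts its prompts dynamically to handle inputs from multiple data sources or tasks.
The core goal of this paper.
Multi-objective Optimization
Optimizing several performance metrics at once (such as accuracy, consistency, and balance) to obtain more well-rounded model behavior.
Used for the router's scoring mechanism.
Task Cluster
A set of inputs with similar characteristics or needs, as grouped by the router, which is matched to a prompt configuration.
The basis for multi-task adaptation.
Prompt Pool
A collection of diverse prompts from which the best prompt is selected during training and inference.
Diversity is maintained through the Pareto front.
Self-Improving Agents
Intelligent systems that keep optimizing their behavior and strategy using their own feedback.
One of the goals of this paper.
Heterogeneous Task Streams
A stream of task inputs spanning many types, domains, and formats.
The practical challenge the model faces.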
Open Questions Unanswered questions from this research
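- The router generalizes poorly to extreme or unseen task types; future work could strengthen it with meta-learning or reinforcement learning.
- Training is complex, involving multi-stage, multi-objective optimization; it is computationally expensive and demands substantial hardware in deployment.
- Interference or forgetting can still occur on some tasks, especially when task clusters are poorly partitioned or the prompt set does not cover the full diversity of tasks.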
Applications
Immediate Applications
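Multi-task question answering systems
In intelligent customer service or knowledge QA platforms, the model can switch task clusters automatically according to the user's request, improving response accuracy and robustness.
Automated programming assistants
By combining prompts for different programming tasks, the framework supports multi-language, multi-task code generation and debugging, improving development efficiency.
Multi-domain knowledge management
In enterprise knowledge bases, the model allocates prompts automatically by task type, keeping knowledge accurate and consistent.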
Long-term Vision
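Autonomous learning and self-optimization
Future models could keep adjusting routing and prompts from feedback in real environments, improving continuously with less human intervention.
Cross-modal multi-task intelligent systems
Combining vision, speech, and other modalities would allow intelligent systems with multi-source perception and adaptation, with broad use in robotics, smart homes, and similar areas.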
Abstract
In this paper, we propose EEVEE, the first multi-dataset test-time prompt learning framework for LLM agents, enabling test-time prompt learning under real-world task streams. Existing methods are largely designed for single-dataset settings, while real-world applications require models to handle heterogeneous input streams drawn from multiple datasets, domains, and task distributions, limiting their practical applicability. To mitigate cross-dataset interference, EEVEE introduces a router that partitions incoming inputs into task clusters and assigns them to suitable prompt configurations. This design is optimized via a router-prompt co-evolution strategy, which employs interleaved router and prompt learning phases to address their mutual dependency. Experiments across multiple datasets demonstrate that the framework improves robustness under heterogeneous data streams while maintaining single-benchmark learning capability and efficiency. Specifically, EEVEE improves average multi-benchmark scores by 10.38 and 24.32 points over Qwen3-4B-Instruct and DeepSeek-V3.2, surpassing SOTA methods GEPA and ACE by up to 37.2% and 48.2%.