LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues

TL;DR

LongMemEval-V2 is a benchmark for long-term memory in agents; on it, AgentRunbook-C reaches the best average accuracy, 72.5%.

cs.CL 🔴 Advanced 2026-05-13
Di Wu Zixiang Ji Asmi Kawatkar Bryan Kwan Jia-Chen Gu Nanyun Peng Kai-Wei Chang
long-term memory agent systems RAG coding agent environment experience

Key Findings

Methodology

The paper introduces LongMemEval-V2, a benchmark for evaluating long-term memory in agent systems, and proposes two memory methods: AgentRunbook-R and AgentRunbook-C. AgentRunbook-R is a RAG-based memory that keeps separate knowledge pools for raw state observations, events, and strategy notes. AgentRunbook-C stores trajectories as files and uses a coding agent to gather evidence in an augmented sandbox.

Key Results

  • AgentRunbook-C achieved the best performance in experiments with an average accuracy of 72.5%, outperforming the strongest RAG baseline at 48.5% and the off-the-shelf coding agent baseline at 69.3%.
  • These accuracy gains come with high latency: the off-the-shelf Codex agent takes about 182 seconds per query, about 6.9 times slower than AgentRunbook-R.
  • AgentRunbook-C advances the accuracy-latency Pareto frontier, but substantial room for improvement remains.
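To make the gaps in the list above concrete, the short sketch below computes the differences in percentage points between the reported accuracies. The dictionary keys and the function name are illustrative labels, not names from the paper; only the three percentages come from the results quoted above.

```python
# Reported average accuracies (percent), as quoted in the Key Results list.
# The labels are illustrative; only the numbers come from the summary.
reported_accuracy = {
    "AgentRunbook-C": 72.5,
    "strongest RAG baseline": 48.5,
    "off-the-shelf coding agent": 69.3,
}


def gap_in_points(method: str, baseline: str) -> float:
    """Absolute accuracy difference in percentage points, rounded to one decimal."""
    return round(reported_accuracy[method] - reported_accuracy[baseline], 1)


gaps = {
    baseline: gap_in_points("AgentRunbook-C", baseline)
    for baseline in ("strongest RAG baseline", "off-the-shelf coding agent")
}
print(gaps)
```

The margin over the RAG baseline is 24.0 points and the margin over the off-the-shelf coding agent is 3.2 points, so the off-the-shelf coding agent already covers most of the distance from the RAG baseline.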

Significance

This research establishes a challenging testbed for developing long-term memory systems that convert environment experience into reusable knowledge. It fills the gap in existing benchmarks by directly evaluating the ability of memory systems to internalize environment-specific experience, advancing the application of agent systems in complex environments.

Technical Contribution

The technical contributions are a new standard for evaluating agent memory systems and two memory methods. AgentRunbook-C treats memory management as a file management problem. In the reported experiments, AgentRunbook-R improves over a simple RAG method that retrieves state slices (57.8% vs. 40.1% overall accuracy), and AgentRunbook-C improves over the off-the-shelf Codex agent (72.5% vs. 69.3%) while being 32% faster at query time.

Novelty

LongMemEval-V2 is the first benchmark to scale history length in agent environments to tens of millions of tokens and beyond 100 million (25M and 115M tokens in the small and medium tiers). Compared to the most related work, it provides more complex contexts and a new ability taxonomy focused on agent experience memory.

Limitations

  • Although AgentRunbook-C is the most accurate method, its query latency remains high, which limits its use in latency-sensitive applications.
  • Current methods still have room for improvement in handling multimodal contexts, especially in complex environments.
  • Future research needs to further optimize memory system efficiency for better real-time application performance.

Future Work

Future directions include optimizing memory system efficiency, exploring more complex environments and multimodal contexts, and developing stronger agent systems to enhance environmental adaptability.

AI Executive Summary

In modern web environments, the success of agent systems depends on their long-term memory capabilities, which allow them to recall interface affordances, state dynamics, workflows, and recurring failure modes. However, existing memory benchmarks mostly focus on user histories, short traces, or downstream task success, leaving open how to directly evaluate whether memory systems effectively internalize environment-specific experience. To address this gap, the paper introduces LongMemEval-V2, a benchmark for evaluating whether memory systems can help agents acquire the experience needed to become knowledgeable colleagues in customized environments. LME-V2 contains 451 manually curated questions covering five core memory abilities for web agents: static state recall, dynamic state tracking, workflow knowledge, environment gotchas, and premise awareness. Questions are paired with history trajectories containing up to 500 trajectories and 115M tokens. We use a context gathering formulation: memory systems consume history trajectories and return compact evidence for downstream question answering. We propose a suite of two memory methods: AgentRunbook-R, an efficient RAG-based memory with knowledge pools for raw state observations, events, and strategy notes, and AgentRunbook-C, which stores trajectories as files and invokes a coding agent to gather evidence in an augmented sandbox. Experiments show that AgentRunbook-C achieves the best performance with 72.5% average accuracy, outperforming the strongest RAG baseline (48.5%) and the off-the-shelf coding agent baseline (69.3%). Despite the strong performance gains, coding agent-based methods have high latency costs. While AgentRunbook-C advances the accuracy-latency Pareto frontier, substantial room for improvement remains. Together, these results establish LME-V2 as a challenging testbed for developing long-term memory systems for environment experience.
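The context gathering formulation described above can be sketched as a small interface plus an evaluation loop. Everything named here is an assumption made for illustration: the `MemorySystem` protocol, the toy `KeywordMemory`, the `evaluate` function, and the exact-match scoring are not taken from the paper, which may judge answers differently. The sketch only shows the shape of the protocol: the memory ingests trajectories, returns evidence per question, and the harness reports answer accuracy and query latency.

```python
import time
from typing import Callable, Protocol


class MemorySystem(Protocol):
    """Hypothetical interface: consume trajectories, return compact evidence."""

    def ingest(self, trajectories: list[str]) -> None: ...
    def gather(self, question: str) -> str: ...


class KeywordMemory:
    """Toy memory: stores trajectories verbatim, returns those sharing a word with the question."""

    def __init__(self) -> None:
        self._store: list[str] = []

    def ingest(self, trajectories: list[str]) -> None:
        self._store.extend(trajectories)

    def gather(self, question: str) -> str:
        words = set(question.lower().split())
        hits = [t for t in self._store if words & set(t.lower().split())]
        return "\n".join(hits)


def evaluate(
    memory: MemorySystem,
    trajectories: list[str],
    qa_pairs: list[tuple[str, str]],
    answer_fn: Callable[[str, str], str],
) -> dict[str, float]:
    """Report exact-match accuracy and mean query-time latency (seconds)."""
    memory.ingest(trajectories)
    correct = 0
    latencies = []
    for question, gold in qa_pairs:
        start = time.perf_counter()
        evidence = memory.gather(question)  # only evidence gathering is timed
        latencies.append(time.perf_counter() - start)
        correct += int(answer_fn(question, evidence) == gold)
    return {
        "accuracy": correct / len(qa_pairs),
        "mean_query_latency_s": sum(latencies) / len(latencies),
    }


# Toy data and a stub reader standing in for the downstream question answerer.
trajectories = [
    "clicked Save and the order status became Processing",
    "opened the forum settings page",
]
qa_pairs = [
    ("what status did the order get", "Processing"),
    ("which color is the logo", "blue"),
]
report = evaluate(
    KeywordMemory(),
    trajectories,
    qa_pairs,
    answer_fn=lambda q, ev: "Processing" if "Processing" in ev else "unknown",
)
print(report)
```

Separating `gather` from the downstream reader keeps the memory system responsible only for evidence selection, which is what the formulation evaluates.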

Long-term memory helps large language models (LLMs) operate beyond their context and parameters by storing and recalling information over long horizons. Memory is especially important for agent systems, where LLMs interact with specialized environments over many steps. Recent works show that memorizing task procedures, interface affordances, and hidden failure modes improves agent performance at inference time. However, benchmarks for memory in the agentic context remain limited. Existing memory works mainly evaluate retrieval and reasoning over long documents or user chat histories. Recent works consider evaluating memorization over agent trajectories, but often use simplified game environments, emphasize limited dependencies within one or a few trajectories, or evaluate indirectly through downstream task success. As a result, they provide limited insight into whether memory systems can accumulate holistic, environment-specific knowledge from sustained interaction with a complex environment. To highlight this perspective, this paper uses the following framing: A high-quality memory makes an agent an experienced colleague in a specialized environment. Driven by this view, we introduce LongMemEval-V2, a benchmark for evaluating whether memory systems can help web agents acquire the experience needed to become knowledgeable colleagues. LME-V2 leverages customized websites including Magento shopping, shopping admin, Postmill forum, and ServiceNow from WebArena and WorkArena. From task-solving web agent trajectories, we manually curate 451 questions covering five core memory abilities: static state recall, dynamic state tracking, workflow knowledge, environment gotchas, and premise awareness. We provide examples in Figure 1 and ability definitions in §3.1. These questions are specific to the customized environments and thus remain generally unanswerable by recent frontier LLMs.
LME-V2 further pairs the questions with a sequence of web agent trajectories (“haystacks”), where only a small fraction bears the answers to each question (“needles”). LME-V2-Small provides a 100-trajectory haystack shared by all questions, and LME-V2-Medium has 500-trajectory question-specific haystacks. Compared to prior benchmarks, LME-V2 poses new challenges with its deep context (25M/115M tokens in the small/medium tiers) and comprehensive memory ability coverage.
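For a sense of scale, the tier sizes quoted above imply a rough average trajectory length. The sketch below divides the stated token counts by the stated trajectory counts; the names `TIERS` and `mean_tokens_per_trajectory` are illustrative, and because the token figures are rounded in the text, the results are approximations rather than reported statistics.

```python
# Tier sizes as quoted in the text (token counts are rounded figures).
TIERS = {
    "LME-V2-Small": {"trajectories": 100, "tokens": 25_000_000},
    "LME-V2-Medium": {"trajectories": 500, "tokens": 115_000_000},
}


def mean_tokens_per_trajectory(tier: dict) -> float:
    """Approximate average trajectory length implied by the quoted totals."""
    return tier["tokens"] / tier["trajectories"]


averages = {name: mean_tokens_per_trajectory(tier) for name, tier in TIERS.items()}
for name, avg in averages.items():
    print(f"{name}: about {avg:,.0f} tokens per trajectory")
```

Under these rounded totals, a single trajectory averages roughly a quarter of a million tokens, which suggests why a memory system has to filter trajectories instead of passing them to a model whole.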

LME-V2 evaluates memory systems' ability to intelligently store and filter information from noisy agent trajectories, retaining both low-level observations as well as higher-level environment dynamics and procedural knowledge. As a result, naive application of popular agent memory methods could be ineffective as they are biased towards less noisy conversational contexts or high-level strategic knowledge. In this paper, we propose AgentRunbook, a simple yet effective baseline consisting of two variants, optimized separately for efficiency and accuracy. AgentRunbook-R is an efficient retrieval-augmented generation (RAG) pipeline inspired by agentic memory works. It prompts an LLM controller to update and to actively query three knowledge pools: raw observations, state transition events, and high-level strategy notes. AgentRunbook-R is efficient and covers major memory abilities, but its simple design is not optimized for detailed evidence selection. Inspired by Cao et al., we propose AgentRunbook-C, a coding agent-based memory method that casts memory management as a file management problem. AgentRunbook-C stores raw trajectories directly as files. At query time, it augments an off-the-shelf coding agent harness with workflow documents, memory manifests, and helper scripts, then invokes the agent to assemble a compact evidence set.
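The three-pool design of AgentRunbook-R can be illustrated with a minimal sketch. This is not the paper's implementation: the class `ThreePoolMemory`, the function `gather_evidence`, and the word-overlap scoring are hypothetical stand-ins, and the LLM controller that decides what to write and what to query is replaced here by a fixed loop over all pools.

```python
from dataclasses import dataclass, field

# Pool names mirror the three kinds of knowledge described in the text.
POOL_NAMES = ("observations", "events", "strategy_notes")


@dataclass
class ThreePoolMemory:
    """Toy memory with one list of text entries per knowledge pool."""

    pools: dict = field(default_factory=lambda: {name: [] for name in POOL_NAMES})

    def write(self, pool: str, text: str) -> None:
        if pool not in self.pools:
            raise KeyError(f"unknown pool: {pool}")
        self.pools[pool].append(text)

    def query(self, pool: str, query: str, top_k: int = 2) -> list:
        """Rank entries by word overlap with the query (a stand-in for a real retriever)."""
        q_words = set(query.lower().split())
        scored = [
            (len(q_words & set(entry.lower().split())), index, entry)
            for index, entry in enumerate(self.pools[pool])
        ]
        hits = [item for item in scored if item[0] > 0]
        hits.sort(key=lambda item: (-item[0], item[1]))  # best score first, then insertion order
        return [entry for _, _, entry in hits[:top_k]]


def gather_evidence(memory: ThreePoolMemory, question: str, top_k: int = 2) -> list:
    """Fixed-loop stand-in for the controller: query every pool and tag each hit."""
    evidence = []
    for pool in POOL_NAMES:
        for entry in memory.query(pool, question, top_k):
            evidence.append(f"[{pool}] {entry}")
    return evidence


memory = ThreePoolMemory()
memory.write("observations", "Orders grid shows a Status column")
memory.write("events", "Clicking Save changed order status to Processing")
memory.write("strategy_notes", "Filter the grid before exporting")

evidence = gather_evidence(memory, "order status")
print("\n".join(evidence))
```

Keeping the pools separate lets a controller ask different kinds of questions of different stores, for example low-level page content from observations and procedural advice from strategy notes.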

We evaluate the memory designs on the small and medium tiers of LME-V2. To begin with, a simple RAG method that retrieves state slices can only achieve an overall accuracy of 40.1%, and AgentRunbook-R further improves to 57.8%. Accuracy-wise, we find the off-the-shelf Codex agent has competitive performance, achieving a surprisingly high 69.3% accuracy. However, the agent achieves this at a cost of about 182 seconds per query, about 6.9 times slower than AgentRunbook-R. With our specialization designs, AgentRunbook-C performs best overall with 72.5% accuracy while being 32% faster than Codex at query time. Our further analyses reveal that AgentRunbook-C significantly advances the accuracy-latency frontier, but the room for future improvement remains large. Overall, LME-V2 formulates a new standard for agent memory evaluation and provides a concrete testbed for memory modules that make long-running agents more reliable, adaptive, and useful in real-world environments.
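The latency figures in the paragraph above can be turned into approximate per-query times and checked against the Pareto claim. The sketch below takes the quoted numbers at face value and rests on two labeled assumptions: AgentRunbook-R's latency is derived as 182 s divided by 6.9, and "32% faster" is read as 32% lower latency (reading it as a 1.32x speedup gives a somewhat higher figure). The function `pareto_frontier` is an illustrative helper, not part of the paper.

```python
def pareto_frontier(points: dict) -> list:
    """Names of methods not dominated on (lower latency, higher accuracy)."""
    frontier = []
    for name, (latency, accuracy) in points.items():
        dominated = any(
            (o_lat <= latency and o_acc >= accuracy)
            and (o_lat < latency or o_acc > accuracy)
            for other, (o_lat, o_acc) in points.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return frontier


codex_latency_s = 182.0  # "about 182 seconds per query"

# Derived: Codex is "about 6.9 times slower than AgentRunbook-R".
runbook_r_latency_s = codex_latency_s / 6.9

# Assumption: "32% faster" means 32% lower latency.
runbook_c_latency_s = codex_latency_s * (1 - 0.32)

# Alternative reading: a 1.32x speedup.
runbook_c_latency_alt_s = codex_latency_s / 1.32

# (latency in seconds, accuracy in percent), accuracies as quoted above.
points = {
    "AgentRunbook-R": (runbook_r_latency_s, 57.8),
    "Codex": (codex_latency_s, 69.3),
    "AgentRunbook-C": (runbook_c_latency_s, 72.5),
}

frontier = pareto_frontier(points)
print(f"AgentRunbook-R: about {runbook_r_latency_s:.1f} s per query")
print(f"AgentRunbook-C: about {runbook_c_latency_s:.1f} s per query (or {runbook_c_latency_alt_s:.1f} s)")
print("Pareto frontier:", frontier)
```

Under either reading of "32% faster", AgentRunbook-C is both quicker and more accurate than Codex, so Codex drops off the frontier, while AgentRunbook-R stays on it as the low-latency option.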

Deep Analysis

Background

Long-term memory plays a crucial role in the field of artificial intelligence, particularly in agent systems. As large language models (LLMs) evolve, researchers have begun to focus on enhancing these models' capabilities through memory systems, allowing them to operate efficiently over extended periods in complex environments. Early research primarily concentrated on information retrieval and instruction following over long input documents. As the demand for personalized memory increased, research expanded to cover explicit user facts and implicit preferences. However, existing benchmarks remain limited in evaluating the memory capabilities of agent systems, often focusing on simplified game environments or user chat histories. The emergence of LongMemEval-V2 marks a new shift, focusing on agent systems' experience memory, constructing complex contexts, and providing a new ability taxonomy.

Core Problem

Agent systems operating in complex web environments require long-term memory capabilities to effectively internalize environment-specific experience. This capability not only includes memory of interface affordances but also involves recognizing state dynamics, workflows, and recurring failure modes. However, existing memory benchmarks fail to directly evaluate these capabilities, typically focusing on user histories or downstream task success. To address this gap, LongMemEval-V2 proposes a new evaluation standard aimed at verifying whether memory systems can help agents acquire the experience needed to become knowledgeable colleagues in customized environments.

Innovation

The core innovations of LongMemEval-V2 lie in its evaluation framework and memory methods. First, it provides a comprehensive benchmark covering five core memory abilities: static state recall, dynamic state tracking, workflow knowledge, environment gotchas, and premise awareness. Second, it introduces two memory methods: AgentRunbook-R and AgentRunbook-C. AgentRunbook-R is a RAG-based memory that uses knowledge pools to store raw state observations, events, and strategy notes. AgentRunbook-C stores trajectories as files and uses a coding agent to gather evidence in an augmented sandbox. In the reported experiments, AgentRunbook-R improves accuracy over a simple RAG method, and AgentRunbook-C is both more accurate than the off-the-shelf Codex agent and 32% faster at query time.

Methodology

  • AgentRunbook-R: Based on RAG technology, using knowledge pools to store raw state observations, events, and strategy notes. An LLM controller generates retrieval queries, supporting multimodal memory context.
  • AgentRunbook-C: Stores trajectories as files, using a coding agent to gather evidence in an augmented sandbox. Adds lightweight scaffolding components, including workflow documents, query-time rendered manifests, and helper scripts.
  • Evaluation framework: Uses a context gathering formulation, in which memory systems consume history trajectories and return compact evidence for downstream question answering. Reports answer accuracy and query latency.
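The file-based side of AgentRunbook-C, storing trajectories as files and rendering a manifest at query time, can be sketched as follows. The paper's actual file layout and manifest format are not described here, so everything in this block is an assumption: the JSON layout, the field names `site`, `task`, and `steps`, the file naming scheme, and the functions `store_trajectories` and `render_manifest`. The coding agent itself is out of scope; the sketch only shows the kind of index such an agent could read before opening individual files.

```python
import json
import tempfile
from pathlib import Path


def store_trajectories(root: Path, trajectories: list) -> list:
    """Write each trajectory to its own JSON file (hypothetical layout)."""
    root.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, trajectory in enumerate(trajectories):
        path = root / f"trajectory_{index:04d}.json"
        path.write_text(json.dumps(trajectory, indent=2), encoding="utf-8")
        paths.append(path)
    return paths


def render_manifest(root: Path) -> str:
    """Build a tab-separated index of the stored files, one line per trajectory."""
    lines = []
    for path in sorted(root.glob("trajectory_*.json")):
        trajectory = json.loads(path.read_text(encoding="utf-8"))
        lines.append(
            f"{path.name}\t{trajectory['site']}\t"
            f"{len(trajectory['steps'])} steps\t{trajectory['task']}"
        )
    return "\n".join(lines)


# Toy trajectories; the field names are illustrative only.
demo_trajectories = [
    {
        "site": "shopping_admin",
        "task": "Update order status",
        "steps": ["open Orders grid", "click Save"],
    },
    {
        "site": "forum",
        "task": "Create a post",
        "steps": ["open forum", "click Submit", "confirm"],
    },
]

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "memory"
    store_trajectories(root, demo_trajectories)
    manifest = render_manifest(root)

print(manifest)
```

A compact manifest of this kind lets an agent narrow down which files to open instead of scanning every trajectory, which is one plausible way scaffolding could reduce query time.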

Experiments

The experiments cover the small and medium tiers of LME-V2. The environments are Magento shopping, shopping admin, Postmill forum, and ServiceNow. Baselines include simple RAG methods and off-the-shelf coding agents. Key hyperparameters include retrieval query generation and memory context truncation budget. Ablation studies analyze the impact of different knowledge pools and scaffolding components on performance.

Results

Experimental results show that AgentRunbook-C achieves the best accuracy, reaching an average of 72.5%. In contrast, the strongest RAG baseline only reaches 48.5%, while the off-the-shelf coding agent baseline is 69.3%. Ablation studies reveal that workflow instructions and manifest artifacts significantly impact efficiency, while helper functions positively affect small-tier results. Overall, AgentRunbook-C advances the accuracy-latency Pareto frontier, being 32% faster than the off-the-shelf Codex agent at query time, though substantial room for improvement remains.

Applications

Application scenarios for LongMemEval-V2 include evaluation and optimization of agent systems in complex web environments. Direct use cases include customer service agents for e-commerce websites and automated assistants in forum management systems. Industry impact involves enhancing agent systems' environmental adaptability and user experience.

Limitations & Outlook

Despite the significant progress made by LongMemEval-V2 in evaluating agent systems' memory capabilities, some limitations remain. Firstly, coding agent-based methods have high latency costs, limiting efficiency in real-time applications. Secondly, current methods still have room for improvement in handling multimodal contexts, especially in complex environments. Future research needs to further optimize memory system efficiency for better real-time application performance.

Plain Language Accessible to non-experts

Imagine you are in a kitchen preparing a meal. The kitchen is a complex environment, and you are the agent system. To make a good meal, you need to remember where the ingredients are, the cooking steps, and common mistakes. Existing memory benchmarks are like a cookbook, telling you how to cook but not teaching you how to work in this specific kitchen. LongMemEval-V2 is like a chef training course, helping you become an experienced chef in this specific kitchen. It not only teaches you how to remember where the ingredients are but also how to handle various issues that arise during cooking. In this way, you can work more efficiently in the kitchen and make tastier food. Just like in the kitchen, agent systems in complex web environments also need this long-term memory capability to better complete tasks.

ELI14 Explained like you're 14

Hey there! Imagine you're playing a super complex video game with lots of levels and missions. To win the game, you need to remember the rules of each level, enemy attack patterns, and hidden traps. Existing game guides are like a manual, telling you how to pass levels but not teaching you how to become a pro in this specific game. LongMemEval-V2 is like a game training camp, helping you become an experienced player in this specific game. It not only teaches you how to remember the rules of each level but also how to handle various issues that arise during gameplay. In this way, you can pass levels more efficiently and become a pro in the game. Just like in the game, agent systems in complex web environments also need this long-term memory capability to better complete tasks.

Glossary

Long-term Memory

The ability of agent systems to store and recall information over extended periods.

Used in the paper to evaluate agent systems' performance in complex environments.

Agent System

An automated system capable of performing tasks in web environments.

Used in the paper to evaluate long-term memory capabilities.

RAG (Retrieval-Augmented Generation)

A technique that retrieves relevant stored items and supplies them to a language model as context for generation.

Used in the paper for the AgentRunbook-R method.

Coding Agent

An automated agent capable of performing coding tasks.

Used in the paper for the AgentRunbook-C method.

Environment Experience

The knowledge and skills accumulated by agent systems in specific environments.

Used in the paper to evaluate memory system performance.

Knowledge Pool

A structure for storing information observed by agent systems in environments.

Used in the paper for the AgentRunbook-R method.

Sandbox

An isolated execution environment in which the coding agent reads stored trajectory files and runs scripts to gather evidence.

Used in the paper for the AgentRunbook-C method.

Workflow

The steps and processes for agent systems to perform tasks in environments.

Used in the paper to evaluate memory system performance.

Environment Gotchas

Common issues and challenges agent systems may encounter in environments.

Used in the paper to evaluate memory system performance.

Premise Awareness

The ability of agent systems to recognize assumptions and premises in environments.

Used in the paper to evaluate memory system performance.

Open Questions Unanswered questions from this research

  • 1 Existing memory benchmarks fail to directly evaluate the ability of agent systems to internalize environment-specific experience. New evaluation standards and methods are needed to verify whether memory systems can help agents acquire experience in complex environments.
  • 2 Coding agent-based methods have high latency costs, limiting efficiency in real-time applications. Further optimization of memory system efficiency is needed for better real-time application performance.
  • 3 Current methods still have room for improvement in handling multimodal contexts, especially in complex environments. Stronger agent systems need to be developed to enhance environmental adaptability.
  • 4 Agent systems operating in complex environments require long-term memory capabilities to effectively internalize environment-specific experience. New memory methods need to be developed to improve accuracy and efficiency.
  • 5 LongMemEval-V2 provides a comprehensive benchmark covering five core memory abilities. However, further research is needed to verify the performance of these abilities in different environments.

Applications

Immediate Applications

Customer Service Agents for E-commerce Websites

Evaluate and optimize customer service agents' memory capabilities using LongMemEval-V2 to improve user experience and service efficiency.

Automated Assistants in Forum Management Systems

Use LongMemEval-V2 to evaluate and optimize automated assistants in forum management systems, enhancing information processing capabilities and user interaction experience.

Evaluation of Agent Systems in Complex Web Environments

Evaluate agent systems' performance in complex web environments using LongMemEval-V2, optimizing their memory capabilities and environmental adaptability.

Long-term Vision

Environmental Adaptability of Agent Systems

Optimize memory systems to enhance agent systems' adaptability in complex environments, promoting their application in more fields.

Multimodal Context Handling Capability

Develop stronger agent systems to improve their ability to handle multimodal contexts, promoting their application in complex environments.

Abstract

Long-term memory is crucial for agents in specialized web environments, where success depends on recalling interface affordances, state dynamics, workflows, and recurring failure modes. However, existing memory benchmarks for agents mostly focus on user histories, short traces, or downstream task success, leaving open how to directly evaluate whether memory systems effectively internalize environment-specific experience. To address this gap, we introduce LongMemEval-V2 (LME-V2), a benchmark for evaluating whether memory systems can help agents acquire the experience needed to become knowledgeable colleagues in customized environments. LME-V2 contains 451 manually curated questions covering five core memory abilities for web agents: static state recall, dynamic state tracking, workflow knowledge, environment gotchas, and premise awareness. Questions are paired with history trajectories containing up to 500 trajectories and 115M tokens. We use a context gathering formulation: memory systems consume history trajectories and return compact evidence for downstream question answering. We propose a suite of two memory methods: AgentRunbook-R, an efficient RAG-based memory with knowledge pools for raw state observations, events, and strategy notes, and AgentRunbook-C, which stores trajectories as files and invokes a coding agent to gather evidence in an augmented sandbox. Experiments show that AgentRunbook-C achieves the best performance with 72.5% average accuracy, outperforming the strongest RAG baseline (48.5%) and the off-the-shelf coding agent baseline (69.3%). Despite the strong performance gains, coding agent based methods have high latency costs. While AgentRunbook-C advances the accuracy-latency Pareto frontier, substantial room for improvement remains. Together, these results establish LME-V2 as a challenging testbed for developing long-term memory systems for environment experience.
