MEME: Multi-entity & Evolving Memory Evaluation
MEME evaluates multi-entity and evolving memory tasks, exposing dependency reasoning failures in current systems.
Key Findings
Methodology
MEME introduces an evaluation framework for multi-entity and evolving memory tasks. It defines six tasks, including Cascade, Absence, and Deletion. Episodes are generated from a knowledge graph structured as a directed acyclic graph (DAG), so the correct answer after each propagated change can be verified. Experiments covered six memory systems across three paradigms and revealed failures on dependency reasoning tasks.
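To make the idea of verifiable propagation concrete, here is a minimal sketch; it is not the authors' generator. It assumes a hypothetical graph in which `commute_time` depends on `home_city` and `office_city`; the node names and the `downstream` helper are illustrative only. Because the graph is acyclic, the set of facts affected by an upstream change can be computed exactly, which is what makes a Cascade-style answer checkable.

```python
from collections import deque

# Hypothetical dependency edges: parent -> facts derived from it.
# These names are illustrative and do not come from the MEME dataset.
DEPENDENTS = {
    "home_city": ["commute_time", "nearest_gym"],
    "office_city": ["commute_time"],
    "commute_time": ["wake_up_time"],
}


def downstream(graph, changed):
    """Return every fact reachable from `changed`, in breadth-first order."""
    seen, order = set(), []
    queue = deque(graph.get(changed, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        queue.extend(graph.get(node, []))
    return order


affected = downstream(DEPENDENTS, "home_city")
print(affected)  # ['commute_time', 'nearest_gym', 'wake_up_time']
```

Every fact in `affected` is one whose stored answer should be re-derived or treated as uncertain after the update.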
Key Results
- Across 100 controlled episodes, all systems performed poorly on dependency reasoning tasks under default configurations, with the Cascade task averaging 3% accuracy and the Absence task 1%, despite adequate static retrieval performance. Prompt optimization, deeper retrieval, reduced filler noise, and most stronger LLMs failed to close this gap.
- Only a file-based agent using Claude Opus 4.7 as its internal LLM partially closed the gap, but at ~70x the baseline cost, so closing it currently depends on configurations that are not practical at scale.
- Experiments revealed critical blind spots in how current memory systems handle stateful, interdependent knowledge, particularly in Cascade and Absence tasks.
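The headline figures are per-task accuracies averaged over systems. The sketch below shows one way such an aggregate could be computed, assuming the average is taken over per-system accuracies. The record layout, system names, and values are placeholders and are not MEME results.

```python
from collections import defaultdict

# Placeholder graded outputs: (system, task, is_correct). Not real MEME results.
graded = [
    ("system_a", "Cascade", False),
    ("system_a", "Cascade", True),
    ("system_b", "Cascade", False),
    ("system_b", "Cascade", False),
    ("system_a", "Absence", False),
    ("system_b", "Absence", False),
]


def mean_accuracy_by_task(records):
    """Average per-system accuracy for each task (macro average over systems)."""
    hits = defaultdict(lambda: [0, 0])  # (task, system) -> [correct, total]
    for system, task, ok in records:
        hits[(task, system)][0] += int(ok)
        hits[(task, system)][1] += 1
    per_task = defaultdict(list)
    for (task, _system), (correct, total) in hits.items():
        per_task[task].append(correct / total)
    return {task: sum(vals) / len(vals) for task, vals in per_task.items()}


print(mean_accuracy_by_task(graded))  # {'Cascade': 0.25, 'Absence': 0.0}
```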
Significance
MEME exposes deficiencies in how existing memory systems handle dynamic, multi-entity information, particularly in dependency reasoning tasks. The results give concrete direction for future memory system designs: they need to handle knowledge updates whose effects extend to dependent facts.
Technical Contribution
MEME contributes a memory evaluation framework that defines new task types and generates its datasets from DAG knowledge graphs. It reveals structural flaws in how current memory systems handle dependency reasoning and suggests potential solutions.
Novelty
According to the authors, MEME is the first benchmark to systematically evaluate multi-entity and evolving memory tasks, especially dependency reasoning. Unlike existing benchmarks, which score only single-entity updates, MEME covers scenarios in which multiple entities change over time.
Limitations
- The current evaluation is limited to two handcrafted knowledge graphs (Personal Life and Software Project), which may restrict the generalizability of the results.
- Dialogue data is generated by LLMs rather than collected from real users, which may affect realism.
- The experiments cover only 100 episodes, which may not reveal patterns that emerge in longer contexts or larger samples.
Future Work
Future research can expand to broader domains and crowd-sourced knowledge graphs to test MEME's generalizability. Additionally, new memory architectures can be explored to propagate updates locally during maintenance rather than relying on costly internal LLMs.
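One way to read "propagating updates locally" is a store that records which facts depend on which and invalidates dependents at write time, without calling an LLM. The sketch below illustrates that idea under assumptions not taken from the paper: the `DependencyMemory` class and its methods are hypothetical and are not part of MEME or of any evaluated system.

```python
class DependencyMemory:
    """Toy key-value memory that marks dependent facts stale on update."""

    def __init__(self):
        self.values = {}    # fact -> current value
        self.children = {}  # fact -> set of facts derived from it
        self.stale = set()  # facts whose stored value may be outdated

    def write(self, fact, value, depends_on=()):
        self.values[fact] = value
        self.stale.discard(fact)
        for parent in depends_on:
            self.children.setdefault(parent, set()).add(fact)
        self._invalidate(fact)

    def _invalidate(self, fact):
        # Walk the dependency edges and flag every downstream fact.
        stack = list(self.children.get(fact, ()))
        seen = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            self.stale.add(node)
            stack.extend(self.children.get(node, ()))

    def read(self, fact):
        if fact in self.stale:
            return None  # caller should answer "unknown" or re-derive
        return self.values.get(fact)


mem = DependencyMemory()
mem.write("home_city", "Seoul")
mem.write("commute_time", "40 min", depends_on=["home_city"])
mem.write("home_city", "Busan")  # upstream update
print(mem.read("commute_time"))  # None: flagged stale, not silently reused
print(mem.read("home_city"))     # Busan
```

The design choice here is to pay a small bookkeeping cost on every write so that reads never return a value whose upstream has changed.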
AI Executive Summary
As large language models (LLMs) increasingly serve as agents interacting with users across multiple sessions, accurately storing, updating, and reasoning over past interactions has become essential. However, existing memory systems show significant deficiencies in handling dynamic and multi-entity information, especially in dependency reasoning tasks.
Deep Analysis
Background
Prior memory benchmarks have largely focused on single-entity updates, neglecting scenarios in which several entities change and depend on one another. Long-context benchmarks such as RULER and NoLiMa primarily measure attention window limits within a single input rather than persistent memory across sessions. Multi-session benchmarks such as LoCoMo and LongMemEval evaluate static preference retention and knowledge updates but do not assess the ripple effects that an upstream change should trigger in dependent entities.
Core Problem
Existing memory systems show significant deficiencies in handling dynamic and multi-entity information, especially in dependency reasoning tasks. Dependency reasoning covers two cases: a fact whose value must change after an upstream update (Cascade), and a previously valid answer that becomes uncertain (Absence). A third task, Deletion, tests post-removal state: whether a removed fact stops being reported. These tasks reveal critical blind spots in how current memory systems handle stateful, interdependent knowledge.
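The three behaviours can be illustrated with a toy fact store. This is a sketch under stated assumptions: the fact names, the lookup rule for commute time, and the sentinel strings are hypothetical, and the benchmark itself poses these as questions over dialogue rather than dictionary lookups.

```python
UNKNOWN = "unknown"        # expected reply when a dependent fact is no longer supported
NOT_STORED = "not stored"  # expected reply after the user removes a fact

# Hypothetical rule: commute time is derived from the home city.
COMMUTE_BY_CITY = {"Seoul": "40 min", "Busan": "15 min"}


def expected_answer(facts, deleted, question):
    """Ground-truth answer for a question given the current memory state."""
    if question in deleted:
        return NOT_STORED  # Deletion
    if question == "commute_time":
        city = facts.get("home_city")
        # Cascade when the new city has a known commute, Absence otherwise.
        return COMMUTE_BY_CITY.get(city, UNKNOWN)
    return facts.get(question, UNKNOWN)


facts = {"home_city": "Seoul", "pet_name": "Mochi"}
deleted = set()

before = expected_answer(facts, deleted, "commute_time")   # '40 min'
facts["home_city"] = "Busan"
cascade = expected_answer(facts, deleted, "commute_time")  # '15 min'
facts["home_city"] = "Daegu"
absence = expected_answer(facts, deleted, "commute_time")  # 'unknown'
del facts["pet_name"]
deleted.add("pet_name")
deletion = expected_answer(facts, deleted, "pet_name")     # 'not stored'
print(before, cascade, absence, deletion)
```

A system that still answers "40 min" after the move to Busan fails the Cascade case; one that still answers "15 min" after the move to Daegu fails the Absence case.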
Innovation
According to the authors, MEME is the first benchmark to systematically evaluate both the multi-entity and evolving axes of memory, and in particular dependency reasoning. Where earlier benchmarks score only single-entity updates, MEME's task types and DAG-generated datasets cover cases in which a change to one entity alters what should be true of others.
Methodology
- MEME defines six tasks covering multi-entity and evolving memory evaluations.
- DAG knowledge graphs are used to generate the datasets, ensuring that propagation answers are verifiable.
- Experiments cover six memory systems across three paradigms.
- Follow-up tests with prompt optimization, deeper retrieval, and reduced filler noise check whether the failures in dependency reasoning persist.
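The filler-noise condition mentioned above can be pictured as varying how many irrelevant turns surround each fact-bearing turn. The following is a minimal sketch of that idea, not the benchmark's dialogue generator. The `build_episode` helper, its parameters, and the sample sentences are hypothetical.

```python
import random


def build_episode(fact_turns, filler_pool, filler_per_fact, seed=0):
    """Follow each fact-bearing turn with `filler_per_fact` random filler turns."""
    rng = random.Random(seed)  # fixed seed keeps episodes reproducible
    episode = []
    for turn in fact_turns:
        episode.append(turn)
        episode.extend(rng.choice(filler_pool) for _ in range(filler_per_fact))
    return episode


facts = ["I live in Seoul.", "I just moved to Busan."]
filler = ["Nice weather today.", "Any film recommendations?", "Thanks!"]

noisy = build_episode(facts, filler, filler_per_fact=3)
quiet = build_episode(facts, filler, filler_per_fact=0)
print(len(noisy), len(quiet))  # 8 2
```

Comparing a system on `noisy` and `quiet` versions of the same facts separates retrieval difficulty from the dependency reasoning itself.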
Experiments
Experiments were conducted on six memory systems across three paradigms: raw retrieval, LLM-processed memory, and file-based agents. Performance on the Cascade, Absence, and Deletion tasks was evaluated over 100 controlled episodes. The experiments revealed that current systems fail at dependency reasoning, particularly on the Cascade and Absence tasks.
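A harness of this kind needs only a small common interface across paradigms. The sketch below assumes a hypothetical `MemorySystem` protocol with `ingest` and `answer` methods and a toy keyword-matching baseline standing in for raw retrieval. None of these names come from the MEME code, and the baseline is far simpler than the evaluated systems.

```python
from typing import Protocol


class MemorySystem(Protocol):
    def ingest(self, session: list[str]) -> None: ...
    def answer(self, question: str) -> str: ...


class KeywordRetrieval:
    """Toy raw-retrieval baseline: returns the latest stored line sharing a keyword."""

    def __init__(self) -> None:
        self.lines: list[str] = []

    def ingest(self, session: list[str]) -> None:
        self.lines.extend(session)

    def answer(self, question: str) -> str:
        keywords = {w.strip("?.,").lower() for w in question.split() if len(w) > 3}
        for line in reversed(self.lines):
            words = {w.strip("?.,").lower() for w in line.split()}
            if keywords & words:
                return line
        return "unknown"


def run_episode(system: MemorySystem, sessions: list[list[str]], question: str) -> str:
    """Feed every session to the system in order, then ask one question."""
    for session in sessions:
        system.ingest(session)
    return system.answer(question)


sessions = [
    ["I live in Seoul.", "My commute takes 40 minutes."],
    ["I just moved to Busan."],
]
reply = run_episode(KeywordRetrieval(), sessions, "How long is my commute?")
print(reply)  # My commute takes 40 minutes.
```

The toy baseline retrieves the commute statement correctly yet reports a value that the later move has invalidated, which is the shape of a Cascade or Absence failure that coexists with adequate static retrieval.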
Plain Language
Imagine you have a smart assistant that remembers everything about you, like where you live, what you like, and even your work projects. Now suppose you move to a new city. Your smart assistant needs to remember this change, and it also needs to know that things related to your old address, like commute time or nearby facilities, might no longer be valid. MEME is designed to test how these smart assistants perform when handling such changes. It's like a test to see if these assistants can effectively update and reason over information across multiple tasks. The results show that many systems struggle with handling these complex changes, especially in tasks requiring dependency reasoning. It's like a student facing a tough question in an exam; despite doing well in other subjects, they can't give the right answer to this particular problem.
ELI14: Explained like you're 14
Hey there! Imagine you have this super cool smart assistant that remembers everything about you, like where you live, what you like, and even your school projects. Now suppose you move to a new city. Your smart assistant needs to remember this change, and it also needs to know that things related to your old address, like how long it takes to get to school or what's around your neighborhood, might not be the same anymore. MEME is like a test to see how well these smart assistants handle such changes. The results show that many systems struggle with handling these complex changes, especially when they need to figure out how one change affects other things. It's like a student facing a tough question in an exam; even if they're good at other subjects, they might not get this one right!
Glossary
Multi-entity
Tasks involving the processing of information across multiple entities.
A dimension in MEME's evaluation framework.
Evolving Memory
Memory systems that update and change over time.
A dimension in MEME's evaluation framework.
Dependency Reasoning
The ability to reason over how a change to one fact affects the facts that depend on it.
The task category in MEME that contains Cascade and Absence.
Cascade
A task in which a fact's value must change after an upstream update.
One of the tasks evaluated by MEME.
Absence
A task in which a previously valid answer becomes uncertain.
One of the tasks evaluated by MEME.
Deletion
A task that checks whether a fact stops being reported after it is removed.
One of the tasks evaluated by MEME.
Directed Acyclic Graph (DAG)
A graph structure used to represent entities and their dependencies.
Used to generate MEME's evaluation datasets.
File-based Agent
LLM agents that manage persistent files via tool-calling.
One of the memory systems evaluated by MEME.
Claude Opus 4.7
The LLM that partially closed the gap on dependency reasoning tasks, at ~70x the baseline cost.
Used as the internal LLM of a file-based agent.
Prompt Optimization
Methods to improve system performance by optimizing prompts.
Used in MEME experiments.
Open Questions
- How can new memory architectures propagate updates locally during maintenance, given that current systems handle dynamic, multi-entity information poorly, especially in dependency reasoning tasks?
- Do the results generalize beyond the two handcrafted knowledge graphs? Broader domains and crowd-sourced knowledge graphs are needed to find out.
- Does LLM-generated dialogue reflect real usage? Data collected from real users would improve the realism of the evaluation.
- Do the same patterns hold in longer contexts and larger samples? The experiments cover only 100 episodes, so larger-scale experiments are needed to validate the results.
- What new approaches can handle stateful, interdependent knowledge updates, given current systems' failures on dependency reasoning tasks?
Applications
Immediate Applications
Smart Assistant Optimization
Improve smart assistants' performance in dynamic environments by enhancing memory systems, especially in handling complex knowledge updates.
Multi-entity Data Management
Apply the MEME framework in scenarios that require processing information across multiple entities, to improve data management efficiency.
Dynamic Knowledge Bases
Use the MEME framework in enterprise knowledge management to better handle dynamic information updates.
Long-term Vision
Comprehensive Memory Systems
Develop memory systems capable of propagating updates locally to handle complex knowledge updates in dynamic environments.
Cross-domain Applications
Extend the MEME framework to more domains to improve the generalizability and adaptability of memory systems.
Abstract
LLM-based agents increasingly operate in persistent environments where they must store, update, and reason over information across many sessions. While prior benchmarks evaluate only single-entity updates, MEME defines six tasks spanning the full space defined by the multi-entity and evolving axes, including three not scored by prior work: Cascade and Absence (dependency reasoning) and Deletion (post-removal state). Evaluating six memory systems spanning three memory paradigms on 100 controlled episodes, we find that all systems collapse on dependency reasoning under the default configuration (Cascade: 3%, Absence: 1% in average accuracy) despite adequate static retrieval performance. Prompt optimization, deeper retrieval, reduced filler noise, and most stronger LLMs fail to close this gap. Only a file-based agent paired with Claude Opus 4.7 as its internal LLM partially closes the gap, but at ~70x the baseline cost, indicating closure currently depends on configurations that are not practical at scale. Code and data are available on the project page: https://seokwonjung-jay.github.io/meme-eval/.
References (20)
LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory
Di Wu, Hongwei Wang, Wenhao Yu et al.
Judging LLM-as-a-judge with MT-Bench and Chatbot Arena
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng et al.
A Coefficient of Agreement for Nominal Scales
Jacob Cohen
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
Patrick Lewis, Ethan Perez, Aleksandra Piktus et al.
♫ MuSiQue: Multihop Questions via Single-hop Question Composition
H. Trivedi, Niranjan Balasubramanian, Tushar Khot et al.
Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions
Yuanzhe Hu, Yu Wang, Julian McAuley
MemBench: Towards More Comprehensive Evaluation on the Memory of LLM-based Agents
Haoran Tan, Zeyu Zhang, Chen Ma et al.
From Local to Global: A Graph RAG Approach to Query-Focused Summarization
Darren Edge, Ha Trinh, Newman Cheng et al.
A Survey on the Memory Mechanism of Large Language Model-based Agents
Zeyu Zhang, Quanyu Dai, Xiaohe Bo et al.
RULER: What's the Real Context Size of Your Long-Context Language Models?
Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman et al.
MemGPT: Towards LLMs as Operating Systems
Charles Packer, Vivian Fang, Shishir G. Patil et al.
Evaluating Very Long-Term Conversational Memory of LLM Agents
Adyasha Maharana, Dong-Ho Lee, S. Tulyakov et al.
NoLiMa: Long-Context Evaluation Beyond Literal Matching
Ali Modarressi, Hanieh Deilamsalehy, Franck Dernoncourt et al.
MQuAKE: Assessing Knowledge Editing in Language Models via Multi-Hop Questions
Zexuan Zhong, Zhengxuan Wu, Christopher D. Manning et al.
Unsupervised Dense Information Retrieval with Contrastive Learning
Gautier Izacard, Mathilde Caron, Lucas Hosseini et al.
HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering
Zhilin Yang, Peng Qi, Saizheng Zhang et al.
DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines
O. Khattab, Arnav Singhvi, Paridhi Maheshwari et al.
Evaluating the Ripple Effects of Knowledge Editing in Language Models
Roi Cohen, Eden Biran, Ori Yoran et al.
BM25S: Orders of magnitude faster lexical search via eager sparse scoring
Xing Han Lù
Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory
P. Chhikara, Dev Khant, Saket Aryan et al.