MEME: Multi-entity & Evolving Memory Evaluation

TL;DR

MEME evaluates multi-entity and evolving memory tasks, exposing dependency reasoning failures in current systems.

cs.LG 2026-05-13
Seokwon Jung, Alexander Rubinstein, Arnas Uselis, Sangdoo Yun, Seong Joon Oh
multi-entity, evolving memory, dependency reasoning, LLM evaluation, benchmark

Key Findings

Methodology

MEME introduces a novel evaluation framework focusing on multi-entity and evolving memory tasks. It defines six tasks, including Cascade, Absence, and Deletion, generated using a Directed Acyclic Graph (DAG) knowledge graph to ensure verifiable propagation answers. Experiments were conducted on six memory systems across three paradigms, revealing failures in dependency reasoning tasks.
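
To illustrate why a DAG makes propagation answers checkable, the sketch below finds every entity downstream of an updated one. The graph contents, the entity names, and the helpers `assert_acyclic` and `downstream` are hypothetical and are not taken from the MEME code base; only the idea of a dependency DAG comes from the description above.

```python
from collections import deque
from graphlib import CycleError, TopologicalSorter

# Hypothetical dependency edges: entity -> entities that depend on it.
DEPENDENTS = {
    "home_city": ["commute_time", "gym"],
    "employer": ["commute_time"],
    "commute_time": ["wake_up_time"],
    "gym": [],
    "wake_up_time": [],
}


def assert_acyclic(dependents):
    """Raise ValueError if the dependency graph contains a cycle."""
    try:
        # Edge direction does not matter for cycle detection, so the
        # mapping can be passed to TopologicalSorter as-is.
        tuple(TopologicalSorter(dependents).static_order())
    except CycleError as err:
        raise ValueError(f"dependency graph has a cycle: {err.args[1]}") from err


def downstream(dependents, updated):
    """Return all entities reachable from `updated`, excluding itself."""
    seen, queue = set(), deque([updated])
    while queue:
        node = queue.popleft()
        for child in dependents.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


assert_acyclic(DEPENDENTS)
print(sorted(downstream(DEPENDENTS, "home_city")))
# ['commute_time', 'gym', 'wake_up_time']
```

Because the graph has no cycles, the set of entities affected by an update is finite and well defined, so a generator can compute the expected answer for each dependent entity mechanically.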

Key Results

  • Across 100 controlled episodes, all systems performed poorly on dependency reasoning tasks under the default configuration, averaging 3% accuracy on Cascade and 1% on Absence, despite adequate static retrieval performance. Prompt optimization, deeper retrieval, reduced filler noise, and most stronger LLMs failed to close the gap.
  • Only a file-based agent paired with Claude Opus 4.7 partially closed the gap, but at ~70x the baseline cost, indicating current solutions are impractical at scale.
  • Experiments revealed critical blind spots in how current memory systems handle stateful, interdependent knowledge, particularly in Cascade and Absence tasks.

Significance

MEME's research exposes deficiencies in existing memory systems when handling dynamic and multi-entity information, particularly in dependency reasoning tasks. This work provides crucial guidance for future memory system designs, emphasizing the need to handle complex knowledge updates in dynamic environments.

Technical Contribution

MEME offers a memory evaluation framework that defines new task types and generates its datasets from DAG knowledge graphs. It shows that current memory systems are structurally weak at dependency reasoning and points to possible directions for addressing this.

Novelty

According to the authors, MEME is the first benchmark to systematically evaluate multi-entity and evolving memory tasks, and in particular dependency reasoning. Unlike existing benchmarks, which score only single-entity updates, MEME covers scenarios in which several entities change and depend on one another.

Limitations

  • The current evaluation is limited to two handcrafted knowledge graphs (Personal Life and Software Project), which may restrict the generalizability of the results.
  • Dialogue data is generated by LLMs rather than collected from real users, which may affect realism.
  • The experiments cover only 100 episodes, so they may miss patterns that emerge in longer contexts or larger samples.

Future Work

Future research can expand to broader domains and crowd-sourced knowledge graphs to test MEME's generalizability. Additionally, new memory architectures can be explored to propagate updates locally during maintenance rather than relying on costly internal LLMs.

AI Executive Summary

As large language models (LLMs) increasingly serve as agents interacting with users across multiple sessions, accurately storing, updating, and reasoning over past interactions has become essential. However, existing memory systems show significant deficiencies in handling dynamic and multi-entity information, especially in dependency reasoning tasks.

Deep Analysis

Background

LLM-based agents increasingly operate across many sessions, which makes persistent memory a core capability. Prior memory evaluations mostly focus on single-entity updates and neglect scenarios in which several entities change and depend on one another. Long-context benchmarks such as RULER and NoLiMa primarily measure attention window limits within a single input rather than persistent memory across sessions. Multi-session benchmarks such as LoCoMo and LongMemEval evaluate static preference retention and knowledge updates, but they do not assess the ripple effects that an upstream change should trigger in dependent entities.

Core Problem

Existing memory systems show significant deficiencies in handling dynamic, multi-entity information. MEME probes three behaviours that prior work did not score. Two are dependency reasoning tasks: Cascade asks how a fact changes after an upstream update, and Absence asks whether the system recognizes that a previously valid answer has become uncertain. The third, Deletion, checks the post-removal state, that is, whether a removed fact stops being reported. Together these tasks reveal critical blind spots in how current memory systems handle stateful, interdependent knowledge.
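
The three behaviours can be made concrete with a small oracle that replays an event log and returns the answer a correct memory should give. This is one possible reading of the task definitions above, not the benchmark's actual generator: the event format, the rule table `COMMUTE_BY_CITY`, the sentinel strings, and the function `expected_answer` are all assumptions made for illustration.

```python
UNKNOWN = "unknown"
NOT_RECORDED = "not recorded"

# Hypothetical rule: the commute time is determined by the home city.
COMMUTE_BY_CITY = {"Seoul": "20 min", "Busan": "35 min"}

# fact -> (upstream fact, derivation rule, or None when no rule is known)
DEPENDS_ON = {
    "commute_time": ("home_city", COMMUTE_BY_CITY.get),
    "gym": ("home_city", None),
}


def expected_answer(events, key):
    """Replay (op, key, value) events; return what a correct memory should answer."""
    value, stated_at = {}, {}
    for i, (op, k, v) in enumerate(events):
        if op == "set":
            value[k], stated_at[k] = v, i
        elif op == "delete":
            value.pop(k, None)
            stated_at.pop(k, None)
    if key not in value:
        return NOT_RECORDED  # Deletion: removed facts must not be reported
    parent, rule = DEPENDS_ON.get(key, (None, None))
    if parent is not None and stated_at.get(parent, -1) > stated_at[key]:
        # The upstream fact changed after this fact was last stated.
        if rule is None:
            return UNKNOWN  # Absence: the old answer can no longer be trusted
        return rule(value[parent]) or UNKNOWN  # Cascade: re-derive from upstream
    return value[key]


events = [
    ("set", "home_city", "Seoul"),
    ("set", "commute_time", "20 min"),
    ("set", "gym", "Riverside Gym"),
    ("set", "pet_name", "Mango"),
    ("set", "home_city", "Busan"),   # upstream update
    ("delete", "pet_name", None),    # explicit removal
]

for fact in ("commute_time", "gym", "pet_name"):
    print(fact, "->", expected_answer(events, fact))
# commute_time -> 35 min
# gym -> unknown
# pet_name -> not recorded
```

A system that simply returns the most recently stored statement for each fact would answer "20 min" and "Riverside Gym" here, which is the kind of stale answer these tasks are designed to catch.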

Innovation

According to the authors, MEME is the first benchmark to systematically evaluate multi-entity and evolving memory tasks, and in particular dependency reasoning. Unlike existing benchmarks, which score only single-entity updates, MEME covers scenarios in which several entities change and depend on one another. It does so by defining new task types and generating its datasets from DAG knowledge graphs.

Methodology

  • MEME defines six tasks that cover the multi-entity and evolving axes of memory evaluation.
  • Datasets are generated from Directed Acyclic Graph (DAG) knowledge graphs, which makes propagation answers verifiable.
  • Experiments cover six memory systems across three paradigms.
  • Additional runs with prompt optimization, deeper retrieval, and reduced filler noise test whether the dependency reasoning failures persist.

Experiments

Experiments covered six memory systems across three paradigms: raw retrieval, LLM-processed memory, and file-based agents. Each system was evaluated on 100 controlled episodes, including the Cascade, Absence, and Deletion tasks. The results show that current systems fail at dependency reasoning, particularly on Cascade and Absence.
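
An evaluation of this shape can be sketched as a loop that feeds each episode's sessions to a fresh memory system and then scores its answers per task. The `MemorySystem` interface, the episode layout, the toy `AlwaysUnknown` baseline, and exact-match scoring are all assumptions made for illustration; the paper's own harness and scoring rule may differ.

```python
from collections import defaultdict
from typing import Callable, Protocol


class MemorySystem(Protocol):
    """Hypothetical interface; real memory systems expose their own APIs."""

    def ingest(self, session: str) -> None: ...
    def answer(self, question: str) -> str: ...


class AlwaysUnknown:
    """Toy baseline that stores nothing and always abstains."""

    def ingest(self, session: str) -> None:
        pass

    def answer(self, question: str) -> str:
        return "unknown"


def accuracy_by_task(
    make_system: Callable[[], MemorySystem], episodes: list[dict]
) -> dict[str, float]:
    """Return {task: fraction of questions answered correctly}."""
    correct, total = defaultdict(int), defaultdict(int)
    for episode in episodes:
        system = make_system()  # fresh memory for every episode
        for session in episode["sessions"]:
            system.ingest(session)
        for task, question, gold in episode["questions"]:
            total[task] += 1
            # Exact match keeps the sketch simple; it is an assumed scoring rule.
            predicted = system.answer(question).strip().lower()
            correct[task] += int(predicted == gold.lower())
    return {task: correct[task] / total[task] for task in total}


# A single made-up episode, used only to exercise the loop.
episodes = [
    {
        "sessions": [
            "I live in Seoul. From Seoul my commute is 20 min; from Busan it would be 35 min.",
            "I go to Riverside Gym near home.",
            "I moved to Busan.",
        ],
        "questions": [
            ("Cascade", "How long is my commute now?", "35 min"),
            ("Absence", "Which gym do I go to?", "unknown"),
        ],
    }
]

print(accuracy_by_task(AlwaysUnknown, episodes))
# {'Cascade': 0.0, 'Absence': 1.0}
```

The toy baseline shows why per-task reporting matters: a system that always abstains looks perfect on an Absence-style question while failing Cascade entirely, so a single aggregate score would hide the difference.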

Results

Across 100 controlled episodes, all systems performed poorly on dependency reasoning tasks under the default configuration, averaging 3% accuracy on Cascade and 1% on Absence, despite adequate static retrieval performance. Prompt optimization, deeper retrieval, reduced filler noise, and most stronger LLMs failed to close the gap. Only a file-based agent with Claude Opus 4.7 as its internal LLM partially closed it, at roughly 70x the baseline cost, which suggests that closing the gap currently depends on configurations that are not practical at scale.

Limitations & Outlook

The current evaluation is limited to two handcrafted knowledge graphs (Personal Life and Software Project), which may restrict the generalizability of the results. Dialogue data is generated by LLMs rather than collected from real users, which may affect realism. The experiments cover only 100 episodes, so they may miss patterns that emerge in longer contexts or larger samples.

Plain Language (accessible to non-experts)

Imagine you have a smart assistant that remembers everything about you: where you live, what you like, and even your work projects. Now suppose you move to a new city. The assistant needs to remember the move, and it also needs to realize that facts tied to your old address, like your commute time or nearby facilities, might no longer be valid. MEME tests how well assistants handle such changes, that is, whether they can update information and reason over it across many sessions. The results show that many systems struggle, especially on tasks that require dependency reasoning. It is like a student who does well in other subjects but cannot answer one particular kind of exam question.

ELI14 (explained like you're 14)

Hey there! Imagine you have this super cool smart assistant that remembers everything about you, like where you live, what you like, and even your school projects. Now suppose you move to a new city. Your assistant needs to remember the move, and it also needs to know that things tied to your old address, like how long it takes to get to school or what's around your neighborhood, might not be the same anymore. MEME is like a test to see how well these smart assistants handle such changes. The results show that many systems struggle, especially when they need to figure out how one change affects other things. It's like a student facing a tough question in an exam: even if they're good at other subjects, they might not get this one right!

Glossary

Multi-entity

Tasks involving the processing of information across multiple entities.

A dimension in MEME's evaluation framework.

Evolving Memory

Memory systems that update and change over time.

A dimension in MEME's evaluation framework.

Dependency Reasoning

The ability to reason over dependencies between facts, so that a change to one fact is reflected in the facts that depend on it.

A task category in MEME's evaluation, comprising Cascade and Absence.

Cascade

Tasks that test how a dependent fact changes after an upstream fact is updated.

One of the tasks evaluated by MEME.

Absence

Tasks that test whether a system recognizes that a previously valid answer has become uncertain.

One of the tasks evaluated by MEME.

Deletion

Tasks that test whether a system stops reporting a fact after it has been removed.

One of the tasks evaluated by MEME.

Directed Acyclic Graph (DAG)

A graph with directed edges and no cycles, used here to represent entities and their dependencies.

Used to generate MEME's evaluation datasets.

File-based Agent

LLM agents that manage persistent files via tool-calling.

One of the memory systems evaluated by MEME.

Claude Opus 4.7

The LLM that, when used as the internal model of a file-based agent, partially closed the gap on dependency reasoning tasks.

Paired with file-based agents.

Prompt Optimization

Methods to improve system performance by optimizing prompts.

Used in MEME experiments.

Open Questions (unanswered questions from this research)

  • Current memory systems show deficiencies in handling dynamic, multi-entity information, especially in dependency reasoning tasks. New memory architectures are needed that propagate updates locally during maintenance.
  • Existing evaluations are limited to two handcrafted knowledge graphs, which may restrict the generalizability of the results. Expansion to broader domains and crowd-sourced knowledge graphs is needed.
  • Dialogue data is generated by LLMs rather than collected from real users, which may affect realism. Real user data is needed to improve evaluation realism.
  • The experiments cover only 100 episodes, so they may miss patterns that emerge in longer contexts or larger samples. Larger-scale experiments are needed to validate the results.
  • Current systems' failures in dependency reasoning tasks indicate the need for new solutions to handle stateful, interdependent knowledge updates.

Applications

Immediate Applications

Smart Assistant Optimization

Improve smart assistants' performance in dynamic environments by enhancing memory systems, especially in handling complex knowledge updates.

Multi-entity Data Management

Apply the MEME framework in scenarios that require processing information across multiple entities, to improve data management efficiency.

Dynamic Knowledge Bases

Use the MEME framework in enterprise knowledge management to better handle dynamic information updates.

Long-term Vision

Comprehensive Memory Systems

Develop memory systems capable of propagating updates locally to handle complex knowledge updates in dynamic environments.

Cross-domain Applications

Expand the MEME framework to more domains to enhance memory systems' generalizability and adaptability.

Abstract

LLM-based agents increasingly operate in persistent environments where they must store, update, and reason over information across many sessions. While prior benchmarks evaluate only single-entity updates, MEME defines six tasks spanning the full space defined by the multi-entity and evolving axes, including three not scored by prior work: Cascade and Absence (dependency reasoning) and Deletion (post-removal state). Evaluating six memory systems spanning three memory paradigms on 100 controlled episodes, we find that all systems collapse on dependency reasoning under the default configuration (Cascade: 3%, Absence: 1% in average accuracy) despite adequate static retrieval performance. Prompt optimization, deeper retrieval, reduced filler noise, and most stronger LLMs fail to close this gap. Only a file-based agent paired with Claude Opus 4.7 as its internal LLM partially closes the gap, but at ~70x the baseline cost, indicating closure currently depends on configurations that are not practical at scale. Code and data are available on the project page: https://seokwonjung-jay.github.io/meme-eval/.

cs.LG cs.CL

References (20)

LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory. Di Wu, Hongwei Wang, Wenhao Yu et al. 2024.

Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Lianmin Zheng, Wei-Lin Chiang, Ying Sheng et al. 2023.

A Coefficient of Agreement for Nominal Scales. Jacob Cohen. 1960.

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Patrick Lewis, Ethan Perez, Aleksandra Piktus et al. 2020.

MuSiQue: Multihop Questions via Single-hop Question Composition. H. Trivedi, Niranjan Balasubramanian, Tushar Khot et al. 2021.

Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions. Yuanzhe Hu, Yu Wang, Julian McAuley. 2025.

MemBench: Towards More Comprehensive Evaluation on the Memory of LLM-based Agents. Haoran Tan, Zeyu Zhang, Chen Ma et al. 2025.

From Local to Global: A Graph RAG Approach to Query-Focused Summarization. Darren Edge, Ha Trinh, Newman Cheng et al. 2024.

A Survey on the Memory Mechanism of Large Language Model-based Agents. Zeyu Zhang, Quanyu Dai, Xiaohe Bo et al. 2024.

RULER: What's the Real Context Size of Your Long-Context Language Models? Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman et al. 2024.

MemGPT: Towards LLMs as Operating Systems. Charles Packer, Vivian Fang, Shishir G. Patil et al. 2023.

Evaluating Very Long-Term Conversational Memory of LLM Agents. Adyasha Maharana, Dong-Ho Lee, S. Tulyakov et al. 2024.

NoLiMa: Long-Context Evaluation Beyond Literal Matching. Ali Modarressi, Hanieh Deilamsalehy, Franck Dernoncourt et al. 2025.

MQuAKE: Assessing Knowledge Editing in Language Models via Multi-Hop Questions. Zexuan Zhong, Zhengxuan Wu, Christopher D. Manning et al. 2023.

Unsupervised Dense Information Retrieval with Contrastive Learning. Gautier Izacard, Mathilde Caron, Lucas Hosseini et al. 2021.

HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. Zhilin Yang, Peng Qi, Saizheng Zhang et al. 2018.

DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines. O. Khattab, Arnav Singhvi, Paridhi Maheshwari et al. 2023.

Evaluating the Ripple Effects of Knowledge Editing in Language Models. Roi Cohen, Eden Biran, Ori Yoran et al. 2023.

BM25S: Orders of magnitude faster lexical search via eager sparse scoring. Xing Han Lù. 2024.

Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory. P. Chhikara, Dev Khant, Saket Aryan et al. 2025.