Structured Distillation for Personalized Agent Memory: 11x Token Reduction with Retrieval Preservation
Structured distillation reduces personalized agent memory tokens by 11x while preserving retrieval capabilities.
Key Findings
Methodology
The paper introduces a structured distillation method that compresses a user's conversation history with an AI agent into a compact retrieval layer. Each exchange is distilled into a compound object with four fields: exchange_core, specific_context, thematic room_assignments, and regex-extracted files_touched. This method reduces the average token count per exchange from 371 to 38, achieving an 11x compression.
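The four-field compound object can be sketched as a small data class. The field names (exchange_core, specific_context, room_assignments, files_touched) come from the paper; the types, example values, and the searchable_text helper are illustrative assumptions, not the paper's implementation:

```python
from dataclasses import dataclass, field

@dataclass
class DistilledExchange:
    # Field names follow the paper; types and helpers are assumed.
    exchange_core: str                # 1-2 sentence summary of the task completed
    specific_context: str             # one unique technical detail (error message, parameter name)
    room_assignments: list[str] = field(default_factory=list)  # thematic categories
    files_touched: list[str] = field(default_factory=list)     # regex-extracted file paths

    def searchable_text(self) -> str:
        """Concatenate all fields into the compact text that gets indexed."""
        return " ".join(
            [self.exchange_core, self.specific_context]
            + self.room_assignments
            + self.files_touched
        )

ex = DistilledExchange(
    exchange_core="Fixed the flaky retry logic in the upload worker.",
    specific_context="TimeoutError raised when max_retries=3",
    room_assignments=["error-handling", "networking"],
    files_touched=["src/upload/worker.py"],
)
print(len(ex.searchable_text().split()))  # rough size of the distilled layer, in words
```

Only this compact text is indexed for search; the 371-token verbatim exchange stays available for drill-down.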
Key Results
- Applied to 4,182 conversations (14,340 exchanges), the method reduces the average exchange length from 371 to 38 tokens, achieving an 11x compression ratio.
- In 201 recall-oriented queries, the best pure distilled configuration reached 96% of the best verbatim MRR (0.717 vs 0.745).
- All 20 vector search configurations remain non-significant after Bonferroni correction, while all 20 BM25 configurations degrade significantly (effect sizes |d|=0.031-0.756).
Significance
This study demonstrates how to compress single-user agent memory without significantly sacrificing retrieval quality. The method allows thousands of exchanges to fit within a single prompt while retaining the verbatim source for drill-down. This is significant for applications requiring large conversation history management, such as personalized assistants and customer service systems.
Technical Contribution
The study proposes a structured distillation method that sharply reduces the token cost of memory storage by compressing conversation history into retrievable compound objects. Unlike summarization methods that discard the original, this approach retains the key information needed for retrieval and validates its effectiveness across multiple retrieval modes.
Novelty
This method is the first to combine personalized agent memory distillation with structured information extraction, significantly improving memory compression efficiency while preserving retrieval quality. Compared to traditional conversation summarization methods, it offers a more efficient memory management solution.
Limitations
- In BM25 configurations, retrieval quality significantly degrades, indicating the method's limitations in scenarios heavily reliant on lexical overlap.
- The vector search results are null findings (non-significant after correction): they show no detected degradation but do not establish equivalence, so robustness claims for semantic matching tasks rest on absence of evidence of harm.
- The method primarily targets single-user scenarios and has not been validated for multi-user or cross-domain applications.
Future Work
Future research could explore the method's application in multi-user environments, further optimize the distillation process to enhance cross-domain retrieval performance, and investigate how to integrate other information retrieval technologies, such as deep learning models, to improve retrieval efficiency and accuracy.
AI Executive Summary
Long conversations with an AI agent create a simple problem for users: the history is useful, but carrying it verbatim is expensive. This paper studies personalized agent memory, where a user's conversation history with an agent is distilled into a compact retrieval layer for later search. Each exchange is compressed into a compound object with four fields: exchange_core, specific_context, thematic room_assignments, and regex-extracted files_touched. This method reduces the average token count per exchange from 371 to 38, achieving an 11x compression.
Applied to 4,182 conversations (14,340 exchanges), the method reduces the average exchange length from 371 to 38 tokens, achieving an 11x compression ratio. We evaluate whether personalized recall survives that compression using 201 recall-oriented queries, 107 configurations spanning 5 pure and 5 cross-layer search modes, and 5 LLM graders (214,519 consensus-graded query-result pairs). The best pure distilled configuration reaches 96% of the best verbatim MRR (0.717 vs 0.745).
Results are mechanism-dependent. All 20 vector search configurations remain non-significant after Bonferroni correction, while all 20 BM25 configurations degrade significantly (effect sizes |d|=0.031-0.756). The best cross-layer setup slightly exceeds the best pure verbatim baseline (MRR 0.759). Structured distillation compresses single-user agent memory without uniformly sacrificing retrieval quality. At 1/11 the context cost, thousands of exchanges fit within a single prompt while the verbatim source remains available for drill-down.
We release the implementation and analysis pipeline as open-source software.
Deep Analysis
Background
In the field of artificial intelligence, as conversational AI agents become more prevalent, effectively managing and retrieving user-agent conversation history has become a critical research topic. Traditional conversation summarization methods often compress and discard the original conversation, resulting in lossy summaries that degrade over long conversations. Recent advances in structured information extraction provide new approaches to address this issue. By transforming conversation history into retrievable structured data, it is possible to significantly reduce storage costs while retaining key information.
Core Problem
Long conversations with an AI agent generate a large amount of historical data, which is useful for users but expensive to retain verbatim. Traditional summarization methods lose significant key information during compression, leading to degraded retrieval quality. The challenge is to compress personalized agent memory without significantly sacrificing retrieval quality.
Innovation
This paper introduces a structured distillation method that compresses a user's conversation history with an AI agent into a compact retrieval layer. Each exchange is distilled into a compound object with four fields: exchange_core, specific_context, thematic room_assignments, and regex-extracted files_touched. This method reduces the average token count per exchange from 371 to 38, achieving an 11x compression. Unlike traditional summarization methods, this approach retains key information necessary for retrieval and validates its effectiveness through various retrieval modes.
Methodology
- Employ a structured distillation method to compress conversation history into retrievable compound objects.
- Each object includes four fields: exchange_core, specific_context, thematic room_assignments, and regex-extracted files_touched.
- Evaluate distillation effectiveness across retrieval modes, including vector search and BM25 configurations.
- Validate information retention by comparing retrieval results from distilled and verbatim text.
Experiments
The experiments used 4,182 conversations from six software engineering projects, totaling 14,340 exchanges. Evaluation involved 201 recall-oriented queries, 107 configurations spanning five pure search modes and five cross-layer search modes. Five large language model graders assessed 214,519 consensus-graded query-result pairs. Key metrics included MRR, mean grade, P@1, and nDCG@10.
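For reference on the metrics listed above, MRR and P@1 can be computed from per-query ranked relevance judgments as follows; the toy data here is invented for illustration and is not the paper's:

```python
def mrr(rankings):
    """Mean Reciprocal Rank: average of 1/rank of the first relevant result per query."""
    total = 0.0
    for relevant in rankings:  # each entry: ranked list of 0/1 relevance flags
        for i, rel in enumerate(relevant, start=1):
            if rel:
                total += 1.0 / i
                break
    return total / len(rankings)

def p_at_1(rankings):
    """P@1: fraction of queries whose top-ranked result is relevant."""
    return sum(r[0] for r in rankings) / len(rankings)

# Three toy queries whose first relevant hit appears at ranks 1, 2, and 4.
toy = [[1, 0, 0], [0, 1, 0], [0, 0, 0, 1]]
print(mrr(toy))     # (1 + 1/2 + 1/4) / 3
print(p_at_1(toy))  # 1/3
```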
Results
Experimental results show that the best pure distilled configuration reached 96% of the best verbatim MRR (0.717 vs 0.745). All 20 vector search configurations remain non-significant after Bonferroni correction, while all 20 BM25 configurations degrade significantly (effect sizes |d|=0.031-0.756). The best cross-layer setup slightly exceeds the best pure verbatim baseline (MRR 0.759).
Applications
The method is applicable to scenarios requiring large conversation history management, such as personalized assistants and customer service systems. By compressing conversation history into retrievable structured data, it is possible to significantly reduce storage costs while retaining key information, thereby improving system efficiency and user experience.
Limitations & Outlook
While the method performs well in vector search configurations, retrieval quality significantly degrades in BM25 configurations, indicating limitations in scenarios heavily reliant on lexical overlap. Additionally, the method primarily targets single-user scenarios and has not been validated for multi-user or cross-domain applications. Future research could explore the method's application in multi-user environments and further optimize the distillation process to enhance cross-domain retrieval performance.
Plain Language (Accessible to non-experts)
Imagine you're cooking in a kitchen. You have lots of ingredients and tools, but you don't need to bring everything out every time you cook. Instead, you choose specific ingredients and tools as needed. Similarly, when an AI agent interacts with a user, it doesn't need to remember all the conversation history every time. This paper introduces a method to compress conversation history into a compact retrieval layer, like organizing your kitchen ingredients and tools into a handy list. This way, when you need a specific ingredient, you can find it quickly without rummaging through the entire kitchen. This method not only saves space but also improves efficiency, allowing the AI agent to quickly find the information the user needs.
ELI14 (Explained like you're 14)
Imagine you're playing a massive multiplayer online game. You and your friends have lots of conversations and adventures in the game, but you don't need to remember all the details every time. Instead, you remember the important quests and key items. AI agents do the same! This paper talks about a method that helps AI agents remember important conversation content, not all the details. Just like in the game, you can quickly find the quest information you need without going through the entire chat history. This method makes AI agents smarter and more efficient!
Glossary
Structured Distillation
A technique that compresses conversation history into retrievable structured data, retaining key information for later retrieval.
Used to compress user-agent conversation history.
Personalized Agent Memory
A system for storing and retrieving a single user's conversation history with an AI agent.
Researching how to effectively manage and retrieve user conversation history.
MRR (Mean Reciprocal Rank)
A metric for evaluating information retrieval system performance, representing the average reciprocal rank of the first relevant result.
Used to evaluate retrieval effectiveness of distilled and verbatim text.
BM25
A lexical ranking function that scores documents using term frequency and inverse document frequency with document-length normalization.
Used to evaluate retrieval effectiveness of distilled text.
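For concreteness, a minimal Okapi BM25 scorer (conventional k1/b defaults; not the paper's implementation) shows how the score depends on exact lexical overlap, which is why aggressive compression of the indexed text can hurt BM25:

```python
import math
from collections import Counter

def bm25_score(query, doc, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of one tokenized document for a tokenized query."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    tf = Counter(doc)
    score = 0.0
    for term in query:
        df = sum(1 for d in docs if term in d)          # document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)  # non-negative idf variant
        freq = tf[term]
        score += idf * freq * (k1 + 1) / (freq + k1 * (1 - b + b * len(doc) / avgdl))
    return score

docs = [
    "fixed retry logic upload worker timeout".split(),
    "added pagination to search results".split(),
]
query = "upload timeout".split()
scores = [bm25_score(query, d, docs) for d in docs]
```

A document sharing no query terms scores zero, so dropping words during distillation directly removes BM25 evidence.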
Vector Search
A retrieval method that represents text as embedding vectors and ranks results by vector similarity.
Used to evaluate retrieval effectiveness of distilled text.
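A minimal exact vector-search sketch, assuming cosine similarity over precomputed embeddings; real systems embed text with a sentence encoder and serve the index with a library such as FAISS:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Toy 3-d "embeddings"; keys and values are invented for illustration.
index = {"doc_a": [0.9, 0.1, 0.0], "doc_b": [0.1, 0.9, 0.2]}
query = [1.0, 0.0, 0.1]
best = max(index, key=lambda k: cosine(query, index[k]))
```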
exchange_core
A brief description of the task completed in a conversation, typically 1-2 sentences.
A field in the distilled object to retain key information.
specific_context
A unique technical detail in a conversation, such as error messages or parameter names.
A field in the distilled object to retain key information.
thematic room_assignments
Categorization of themes or concepts involved in a conversation for organizing and retrieving information.
A field in the distilled object for organizing information.
regex-extracted files_touched
File paths mentioned in a conversation, extracted using regular expressions.
A field in the distilled object to retain key information.
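A possible shape for this extraction, with an assumed pattern; the paper's actual regex is not given here, so the pattern and helper below are illustrative stand-ins:

```python
import re

# Assumed pattern: path-like tokens containing a directory separator and an extension.
PATH_RE = re.compile(r"\b[\w./-]+/[\w.-]+\.\w+\b")

def extract_files_touched(text: str) -> list[str]:
    """Pull file paths out of a raw exchange, deduplicated in first-seen order."""
    seen = []
    for match in PATH_RE.findall(text):
        if match not in seen:
            seen.append(match)
    return seen

msg = "Edited src/upload/worker.py and tests/test_worker.py; see src/upload/worker.py again."
print(extract_files_touched(msg))
```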
Claude Code
Anthropic's AI coding agent, used here for software engineering projects and supporting user-agent dialogues.
The dialogue agent platform used in the study.
FAISS
A library for efficient similarity search, supporting large-scale vector search.
Used to store and retrieve vectors of distilled text.
HNSW
An approximate nearest neighbor search algorithm based on hierarchical navigable small world graphs.
Used to evaluate retrieval effectiveness of distilled text.
Exact
An exact vector search method, calculating precise distances between vectors for retrieval.
Used to evaluate retrieval effectiveness of distilled text.
Reciprocal Rank Fusion (RRF)
A method for fusing multiple ranked lists: each document's fused score is the sum over lists of 1/(k + rank).
Used for result fusion in multi-field modes.
CombMNZ
A fusion method that multiplies the sum of a document's normalized scores by the number of result lists that retrieved it.
Used for result fusion in cross-layer modes.
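Both fusion methods can be sketched as follows; k=60 is a common RRF default, and the runs and scores below are invented for illustration:

```python
def rrf(run_lists, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over runs of 1 / (k + rank_d)."""
    scores = {}
    for run in run_lists:  # each run: documents in ranked order
        for rank, doc in enumerate(run, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def combmnz(score_lists):
    """CombMNZ: (sum of max-normalized scores) * (number of runs retrieving the doc)."""
    totals, hits = {}, {}
    for run in score_lists:  # each run: {doc: raw score}
        top = max(run.values()) or 1.0
        for doc, s in run.items():
            totals[doc] = totals.get(doc, 0.0) + s / top
            hits[doc] = hits.get(doc, 0) + 1
    fused = {d: totals[d] * hits[d] for d in totals}
    return sorted(fused, key=fused.get, reverse=True)

bm25_run = ["doc_b", "doc_a", "doc_c"]
vec_run = ["doc_a", "doc_c", "doc_b"]
fused_rrf = rrf([bm25_run, vec_run])  # doc_a wins: near the top of both runs

bm25_scores = {"doc_b": 7.1, "doc_a": 6.4, "doc_c": 1.2}
vec_scores = {"doc_a": 0.83, "doc_c": 0.55, "doc_b": 0.40}
fused_mnz = combmnz([bm25_scores, vec_scores])
```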
Open Questions (Unanswered questions from this research)
1. How can structured distillation be applied in multi-user environments to support multiple users interacting with an AI agent simultaneously? The current method primarily targets single-user scenarios and has not been validated for multi-user environments.
2. How can the distillation process be further optimized to enhance cross-domain retrieval performance? The existing method has limited effectiveness in some semantic matching tasks and may need to integrate other information retrieval technologies.
3. How can deep learning models be integrated to improve retrieval efficiency and accuracy? The current method primarily relies on traditional information retrieval techniques and may not fully leverage the advantages of deep learning.
4. How can retrieval quality be improved without significantly increasing computational costs? The existing method shows retrieval quality degradation in some configurations, which may require further optimization.
5. How can the token count of conversation history be further reduced without losing key information? The existing method achieves an 11x compression, but there is still room for further optimization.
Applications
Immediate Applications
Personalized Assistants
Improve response speed and efficiency of personalized assistants by compressing user-agent conversation history.
Customer Service Systems
Apply the method in customer service systems to quickly retrieve and process historical customer conversations.
Software Engineering Project Management
Use the method in software engineering projects to enable team members to quickly access and retrieve project-related conversation history.
Long-term Vision
Multi-user Dialogue Management
Develop systems that support multiple users interacting with an AI agent simultaneously, improving collaboration efficiency.
Cross-domain Information Retrieval
Integrate deep learning technologies to develop systems that support cross-domain information retrieval, enhancing retrieval efficiency and accuracy.
Abstract
Long conversations with an AI agent create a simple problem for one user: the history is useful, but carrying it verbatim is expensive. We study personalized agent memory: one user's conversation history with an agent, distilled into a compact retrieval layer for later search. Each exchange is compressed into a compound object with four fields (exchange_core, specific_context, thematic room_assignments, and regex-extracted files_touched). The searchable distilled text averages 38 tokens per exchange. Applied to 4,182 conversations (14,340 exchanges) from 6 software engineering projects, the method reduces average exchange length from 371 to 38 tokens, yielding 11x compression. We evaluate whether personalized recall survives that compression using 201 recall-oriented queries, 107 configurations spanning 5 pure and 5 cross-layer search modes, and 5 LLM graders (214,519 consensus-graded query-result pairs). The best pure distilled configuration reaches 96% of the best verbatim MRR (0.717 vs 0.745). Results are mechanism-dependent. All 20 vector search configurations remain non-significant after Bonferroni correction, while all 20 BM25 configurations degrade significantly (effect sizes |d|=0.031-0.756). The best cross-layer setup slightly exceeds the best pure verbatim baseline (MRR 0.759). Structured distillation compresses single-user agent memory without uniformly sacrificing retrieval quality. At 1/11 the context cost, thousands of exchanges fit within a single prompt while the verbatim source remains available for drill-down. We release the implementation and analysis pipeline as open-source software.