Structured Distillation for Personalized Agent Memory: 11x Token Reduction with Retrieval Preservation
Structured distillation reduces personalized agent memory tokens by 11x while preserving retrieval capabilities.
Key Findings
Methodology
The paper introduces a structured distillation method that compresses a user's conversation history with an AI agent into a compact retrieval layer. Each exchange is distilled into a compound object with four fields: exchange_core, specific_context, thematic room_assignments, and regex-extracted files_touched. This method reduces the average token count per exchange from 371 to 38, achieving an 11x compression.
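The four-field compound object can be sketched as a small data class. The field names (exchange_core, specific_context, room_assignments, files_touched) come from the paper; the types, example values, and the searchable_text helper are illustrative assumptions, not the paper's implementation:

```python
from dataclasses import dataclass, field

@dataclass
class DistilledExchange:
    # Field names follow the paper; types and helpers are assumed.
    exchange_core: str                # 1-2 sentence summary of the task completed
    specific_context: str             # one unique technical detail (error message, parameter name)
    room_assignments: list[str] = field(default_factory=list)  # thematic categories
    files_touched: list[str] = field(default_factory=list)     # regex-extracted file paths

    def searchable_text(self) -> str:
        """Concatenate all fields into the compact text that gets indexed."""
        return " ".join(
            [self.exchange_core, self.specific_context]
            + self.room_assignments
            + self.files_touched
        )

ex = DistilledExchange(
    exchange_core="Fixed the flaky retry logic in the upload worker.",
    specific_context="TimeoutError raised when max_retries=3",
    room_assignments=["error-handling", "networking"],
    files_touched=["src/upload/worker.py"],
)
print(len(ex.searchable_text().split()))  # rough size of the distilled layer, in words
```

Only this compact text is indexed for search; the 371-token verbatim exchange stays available for drill-down.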
Key Results
- Applied to 4,182 conversations (14,340 exchanges), the method reduces the average exchange length from 371 to 38 tokens, achieving an 11x compression ratio.
- In 201 recall-oriented queries, the best pure distilled configuration reached 96% of the best verbatim MRR (0.717 vs 0.745).
- All 20 vector search configurations remain non-significant after Bonferroni correction, while all 20 BM25 configurations degrade significantly (effect sizes |d|=0.031-0.756).
Significance
This study demonstrates how to compress single-user agent memory without significantly sacrificing retrieval quality. The method allows thousands of exchanges to fit within a single prompt while retaining the verbatim source for drill-down. This is significant for applications requiring large conversation history management, such as personalized assistants and customer service systems.
Technical Contribution
The study proposes a structured distillation method that sharply reduces the token cost of memory storage by compressing conversation history into retrievable compound objects. Unlike summarization methods that discard the original, this approach retains the key information needed for retrieval and validates its effectiveness across multiple retrieval modes.
Novelty
This method is the first to combine personalized agent memory distillation with structured information extraction, significantly improving memory compression efficiency while preserving retrieval quality. Compared to traditional conversation summarization methods, it offers a more efficient memory management solution.
Limitations
- In BM25 configurations, retrieval quality significantly degrades, indicating the method's limitations in scenarios heavily reliant on lexical overlap.
- The vector search results are null findings (non-significant after correction): they show no detected degradation but do not establish equivalence, so robustness claims for semantic matching tasks rest on absence of evidence of harm.
- The method primarily targets single-user scenarios and has not been validated for multi-user or cross-domain applications.
Future Work
Future research could explore the method's application in multi-user environments, further optimize the distillation process to enhance cross-domain retrieval performance, and investigate how to integrate other information retrieval technologies, such as deep learning models, to improve retrieval efficiency and accuracy.
AI Executive Summary
Long conversations with an AI agent create a simple problem for users: the history is useful, but carrying it verbatim is expensive. This paper studies personalized agent memory, where a user's conversation history with an agent is distilled into a compact retrieval layer for later search. Each exchange is compressed into a compound object with four fields: exchange_core, specific_context, thematic room_assignments, and regex-extracted files_touched. This method reduces the average token count per exchange from 371 to 38, achieving an 11x compression.
Applied to 4,182 conversations (14,340 exchanges), the method reduces the average exchange length from 371 to 38 tokens, achieving an 11x compression ratio. We evaluate whether personalized recall survives that compression using 201 recall-oriented queries, 107 configurations spanning 5 pure and 5 cross-layer search modes, and 5 LLM graders (214,519 consensus-graded query-result pairs). The best pure distilled configuration reaches 96% of the best verbatim MRR (0.717 vs 0.745).
Results are mechanism-dependent. All 20 vector search configurations remain non-significant after Bonferroni correction, while all 20 BM25 configurations degrade significantly (effect sizes |d|=0.031-0.756). The best cross-layer setup slightly exceeds the best pure verbatim baseline (MRR 0.759). Structured distillation compresses single-user agent memory without uniformly sacrificing retrieval quality. At 1/11 the context cost, thousands of exchanges fit within a single prompt while the verbatim source remains available for drill-down.
We release the implementation and analysis pipeline as open-source software.
Deep Analysis
Background
In the field of artificial intelligence, as conversational AI agents become more prevalent, effectively managing and retrieving user-agent conversation history has become a critical research topic. Traditional conversation summarization methods often compress and discard the original conversation, resulting in lossy summaries that degrade over long conversations. Recent advances in structured information extraction provide new approaches to address this issue. By transforming conversation history into retrievable structured data, it is possible to significantly reduce storage costs while retaining key information.
Core Problem
Long conversations with an AI agent generate a large amount of historical data, which is useful for users but expensive to retain verbatim. Traditional summarization methods lose significant key information during compression, leading to degraded retrieval quality. The challenge is to compress personalized agent memory without significantly sacrificing retrieval quality.
Innovation
This paper introduces a structured distillation method that compresses a user's conversation history with an AI agent into a compact retrieval layer. Each exchange is distilled into a compound object with four fields: exchange_core, specific_context, thematic room_assignments, and regex-extracted files_touched. This method reduces the average token count per exchange from 371 to 38, achieving an 11x compression. Unlike traditional summarization methods, this approach retains key information necessary for retrieval and validates its effectiveness through various retrieval modes.
Methodology
- Employ a structured distillation method to compress conversation history into retrievable compound objects.
- Each object includes four fields: exchange_core, specific_context, thematic room_assignments, and regex-extracted files_touched.
- Evaluate distillation effectiveness across retrieval modes, including vector search and BM25 configurations.
- Validate information retention by comparing retrieval results from distilled and verbatim text.
Experiments
The experiments used 4,182 conversations from six software engineering projects, totaling 14,340 exchanges. Evaluation involved 201 recall-oriented queries, 107 configurations spanning five pure search modes and five cross-layer search modes. Five large language model graders assessed 214,519 consensus-graded query-result pairs. Key metrics included MRR, mean grade, P@1, and nDCG@10.
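For reference on the metrics listed above, MRR and P@1 can be computed from per-query ranked relevance judgments as follows; the toy data here is invented for illustration and is not the paper's:

```python
def mrr(rankings):
    """Mean Reciprocal Rank: average of 1/rank of the first relevant result per query."""
    total = 0.0
    for relevant in rankings:  # each entry: ranked list of 0/1 relevance flags
        for i, rel in enumerate(relevant, start=1):
            if rel:
                total += 1.0 / i
                break
    return total / len(rankings)

def p_at_1(rankings):
    """P@1: fraction of queries whose top-ranked result is relevant."""
    return sum(r[0] for r in rankings) / len(rankings)

# Three toy queries whose first relevant hit appears at ranks 1, 2, and 4.
toy = [[1, 0, 0], [0, 1, 0], [0, 0, 0, 1]]
print(mrr(toy))     # (1 + 1/2 + 1/4) / 3
print(p_at_1(toy))  # 1/3
```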
Results
Experimental results show that the best pure distilled configuration reached 96% of the best verbatim MRR (0.717 vs 0.745). All 20 vector search configurations remain non-significant after Bonferroni correction, while all 20 BM25 configurations degrade significantly (effect sizes |d|=0.031-0.756). The best cross-layer setup slightly exceeds the best pure verbatim baseline (MRR 0.759).
Applications
The method is applicable to scenarios requiring large conversation history management, such as personalized assistants and customer service systems. By compressing conversation history into retrievable structured data, it is possible to significantly reduce storage costs while retaining key information, thereby improving system efficiency and user experience.
Limitations & Outlook
While the method performs well in vector search configurations, retrieval quality significantly degrades in BM25 configurations, indicating limitations in scenarios heavily reliant on lexical overlap. Additionally, the method primarily targets single-user scenarios and has not been validated for multi-user or cross-domain applications. Future research could explore the method's application in multi-user environments and further optimize the distillation process to enhance cross-domain retrieval performance.
Plain Language (Accessible to non-experts)
Imagine you're cooking in a kitchen. You have lots of ingredients and tools, but you don't need to bring everything out every time you cook. Instead, you choose specific ingredients and tools as needed. Similarly, when an AI agent interacts with a user, it doesn't need to remember all the conversation history every time. This paper introduces a method to compress conversation history into a compact retrieval layer, like organizing your kitchen ingredients and tools into a handy list. This way, when you need a specific ingredient, you can find it quickly without rummaging through the entire kitchen. This method not only saves space but also improves efficiency, allowing the AI agent to quickly find the information the user needs.
ELI14 (Explained like you're 14)
Imagine you're playing a massive multiplayer online game. You and your friends have lots of conversations and adventures in the game, but you don't need to remember all the details every time. Instead, you remember the important quests and key items. AI agents do the same! This paper talks about a method that helps AI agents remember important conversation content, not all the details. Just like in the game, you can quickly find the quest information you need without going through the entire chat history. This method makes AI agents smarter and more efficient!
Glossary
Structured Distillation
A technique that compresses conversation history into retrievable structured data, retaining key information for later retrieval.
Used to compress user-agent conversation history.
Personalized Agent Memory
A system for storing and retrieving a single user's conversation history with an AI agent.
Researching how to effectively manage and retrieve user conversation history.
MRR (Mean Reciprocal Rank)
A metric for evaluating information retrieval system performance, representing the average reciprocal rank of the first relevant result.
Used to evaluate retrieval effectiveness of distilled and verbatim text.
BM25
A lexical ranking function that scores documents using term frequency and inverse document frequency with document-length normalization.
Used to evaluate retrieval effectiveness of distilled text.
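For concreteness, a minimal Okapi BM25 scorer (conventional k1/b defaults; not the paper's implementation) shows how the score depends on exact lexical overlap, which is why aggressive compression of the indexed text can hurt BM25:

```python
import math
from collections import Counter

def bm25_score(query, doc, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of one tokenized document for a tokenized query."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    tf = Counter(doc)
    score = 0.0
    for term in query:
        df = sum(1 for d in docs if term in d)          # document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)  # non-negative idf variant
        freq = tf[term]
        score += idf * freq * (k1 + 1) / (freq + k1 * (1 - b + b * len(doc) / avgdl))
    return score

docs = [
    "fixed retry logic upload worker timeout".split(),
    "added pagination to search results".split(),
]
query = "upload timeout".split()
scores = [bm25_score(query, d, docs) for d in docs]
```

A document sharing no query terms scores zero, so dropping words during distillation directly removes BM25 evidence.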
Vector Search
A retrieval method that represents text as embedding vectors and ranks results by vector similarity.
Used to evaluate retrieval effectiveness of distilled text.
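A minimal exact vector-search sketch, assuming cosine similarity over precomputed embeddings; real systems embed text with a sentence encoder and serve the index with a library such as FAISS:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Toy 3-d "embeddings"; keys and values are invented for illustration.
index = {"doc_a": [0.9, 0.1, 0.0], "doc_b": [0.1, 0.9, 0.2]}
query = [1.0, 0.0, 0.1]
best = max(index, key=lambda k: cosine(query, index[k]))
```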
exchange_core
A brief description of the task completed in a conversation, typically 1-2 sentences.
A field in the distilled object to retain key information.
specific_context
A unique technical detail in a conversation, such as error messages or parameter names.
A field in the distilled object to retain key information.
thematic room_assignments
Categorization of themes or concepts involved in a conversation for organizing and retrieving information.
A field in the distilled object for organizing information.
regex-extracted files_touched
File paths mentioned in a conversation, extracted using regular expressions.
A field in the distilled object to retain key information.
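A possible shape for this extraction, with an assumed pattern; the paper's actual regex is not given here, so the pattern and helper below are illustrative stand-ins:

```python
import re

# Assumed pattern: path-like tokens containing a directory separator and an extension.
PATH_RE = re.compile(r"\b[\w./-]+/[\w.-]+\.\w+\b")

def extract_files_touched(text: str) -> list[str]:
    """Pull file paths out of a raw exchange, deduplicated in first-seen order."""
    seen = []
    for match in PATH_RE.findall(text):
        if match not in seen:
            seen.append(match)
    return seen

msg = "Edited src/upload/worker.py and tests/test_worker.py; see src/upload/worker.py again."
print(extract_files_touched(msg))
```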
Claude Code
Anthropic's AI coding agent, used here for software engineering projects and supporting user-agent dialogues.
The dialogue agent platform used in the study.
FAISS
A library for efficient similarity search, supporting large-scale vector search.
Used to store and retrieve vectors of distilled text.
HNSW
An approximate nearest neighbor search algorithm based on hierarchical navigable small world graphs.
Used to evaluate retrieval effectiveness of distilled text.
Exact
An exact vector search method, calculating precise distances between vectors for retrieval.
Used to evaluate retrieval effectiveness of distilled text.
Reciprocal Rank Fusion (RRF)
A method for fusing multiple ranked lists: each document's fused score is the sum over lists of 1/(k + rank).
Used for result fusion in multi-field modes.
CombMNZ
A fusion method that multiplies the sum of a document's normalized scores by the number of result lists that retrieved it.
Used for result fusion in cross-layer modes.
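Both fusion methods can be sketched as follows; k=60 is a common RRF default, and the runs and scores below are invented for illustration:

```python
def rrf(run_lists, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over runs of 1 / (k + rank_d)."""
    scores = {}
    for run in run_lists:  # each run: documents in ranked order
        for rank, doc in enumerate(run, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def combmnz(score_lists):
    """CombMNZ: (sum of max-normalized scores) * (number of runs retrieving the doc)."""
    totals, hits = {}, {}
    for run in score_lists:  # each run: {doc: raw score}
        top = max(run.values()) or 1.0
        for doc, s in run.items():
            totals[doc] = totals.get(doc, 0.0) + s / top
            hits[doc] = hits.get(doc, 0) + 1
    fused = {d: totals[d] * hits[d] for d in totals}
    return sorted(fused, key=fused.get, reverse=True)

bm25_run = ["doc_b", "doc_a", "doc_c"]
vec_run = ["doc_a", "doc_c", "doc_b"]
fused_rrf = rrf([bm25_run, vec_run])  # doc_a wins: near the top of both runs

bm25_scores = {"doc_b": 7.1, "doc_a": 6.4, "doc_c": 1.2}
vec_scores = {"doc_a": 0.83, "doc_c": 0.55, "doc_b": 0.40}
fused_mnz = combmnz([bm25_scores, vec_scores])
```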
Open Questions (Unanswered questions from this research)
1. How can structured distillation be applied in multi-user environments to support multiple users interacting with an AI agent simultaneously? The current method primarily targets single-user scenarios and has not been validated for multi-user environments.
2. How can the distillation process be further optimized to enhance cross-domain retrieval performance? The existing method has limited effectiveness in some semantic matching tasks and may need to integrate other information retrieval technologies.
3. How can deep learning models be integrated to improve retrieval efficiency and accuracy? The current method primarily relies on traditional information retrieval techniques and may not fully leverage the advantages of deep learning.
4. How can retrieval quality be improved without significantly increasing computational costs? The existing method shows retrieval quality degradation in some configurations, which may require further optimization.
5. How can the token count of conversation history be further reduced without losing key information? The existing method achieves an 11x compression, but there is still room for further optimization.
Applications
Immediate Applications
Personalized Assistants
Improve response speed and efficiency of personalized assistants by compressing user-agent conversation history.
Customer Service Systems
Apply the method in customer service systems to quickly retrieve and process historical customer conversations.
Software Engineering Project Management
Use the method in software engineering projects to enable team members to quickly access and retrieve project-related conversation history.
Long-term Vision
Multi-user Dialogue Management
Develop systems that support multiple users interacting with an AI agent simultaneously, improving collaboration efficiency.
Cross-domain Information Retrieval
Integrate deep learning technologies to develop systems that support cross-domain information retrieval, enhancing retrieval efficiency and accuracy.
Abstract
Long conversations with an AI agent create a simple problem for one user: the history is useful, but carrying it verbatim is expensive. We study personalized agent memory: one user's conversation history with an agent, distilled into a compact retrieval layer for later search. Each exchange is compressed into a compound object with four fields (exchange_core, specific_context, thematic room_assignments, and regex-extracted files_touched). The searchable distilled text averages 38 tokens per exchange. Applied to 4,182 conversations (14,340 exchanges) from 6 software engineering projects, the method reduces average exchange length from 371 to 38 tokens, yielding 11x compression. We evaluate whether personalized recall survives that compression using 201 recall-oriented queries, 107 configurations spanning 5 pure and 5 cross-layer search modes, and 5 LLM graders (214,519 consensus-graded query-result pairs). The best pure distilled configuration reaches 96% of the best verbatim MRR (0.717 vs 0.745). Results are mechanism-dependent. All 20 vector search configurations remain non-significant after Bonferroni correction, while all 20 BM25 configurations degrade significantly (effect sizes |d|=0.031-0.756). The best cross-layer setup slightly exceeds the best pure verbatim baseline (MRR 0.759). Structured distillation compresses single-user agent memory without uniformly sacrificing retrieval quality. At 1/11 the context cost, thousands of exchanges fit within a single prompt while the verbatim source remains available for drill-down. We release the implementation and analysis pipeline as open-source software.