Chameleon: Episodic Memory for Long-Horizon Robotic Manipulation

TL;DR

Chameleon enhances robotic manipulation with geometry-grounded multimodal memory, improving decision reliability in long-horizon tasks.

cs.RO Advanced 2026-03-26

Xinying Guo, Chenxi Jiang, Hyun Bin Kim, Ying Sun, Yang Xiao, Yuhang Han, Jianfei Yang


robotic manipulation, memory systems, long-horizon tasks, multimodal, perceptual aliasing

Key Findings

Methodology

Chameleon is a bio-inspired memory architecture designed for long-horizon robotic manipulation. Its core components include a geometry-grounded perception module, a hierarchical differentiable memory stack, and the HoloHead goal-directed recall mechanism. The perception module converts multi-view observations into end-effector consistent patch tokens, preserving evidence needed for disambiguation. The memory stack couples episodic and working memory through continuous dynamics, producing a compact decision state. HoloHead trains the decision state with a latent imagination objective to predict near-future state evolution.

Key Results

In experiments on the Camo-Dataset, Chameleon achieved decision success rates of 100% on episodic recall, 73.5% on spatial tracking, and 72.2% on sequential tasks. These results demonstrate Chameleon's effectiveness in memory-driven decision-making under perceptual aliasing.
Compared with baselines such as Diffusion Policy and Flow Matching, Chameleon showed higher decision reliability and task completion rates in every task category; on sequential tasks, its decision success rate was nearly 60% higher than that of Diffusion Policy.
Ablation studies showed that removing either HoloHead or the geometry-grounded perception module significantly reduced performance, confirming that both components are critical to the Chameleon architecture.

Significance

Chameleon matters most for manipulation tasks that depend on long-term memory. Semantically compressed memory methods perform poorly under perceptual aliasing because they discard the fine-grained cues needed for disambiguation; Chameleon's geometry-grounded multimodal memory preserves those cues. Beyond improving decision reliability, the architecture offers design insights for future robotic systems.

Technical Contribution

Chameleon's technical contribution is a memory architecture that combines geometry-grounded perception with a hierarchical differentiable memory stack to handle the perceptual aliasing that defeats semantically compressed memory methods. In addition, HoloHead's goal-directed recall training improves stability and decision accuracy in long-horizon tasks.

Novelty

Chameleon is presented as the first system to apply bio-inspired episodic memory to robotic manipulation under perceptual aliasing. Unlike existing semantically compressed memory methods, it preserves disambiguating context through geometry-grounded multimodal tokens, which enables more precise recall and decision-making.

Limitations

Chameleon may perform suboptimally in highly dynamic environments, as its memory system primarily relies on geometry-grounded tokens, which may not update swiftly in rapidly changing scenes.
The system's computational complexity is relatively high, particularly when handling multimodal inputs and training the HoloHead mechanism, which may limit its use in real-time applications.
In certain tasks, Chameleon's performance may be constrained by the diversity and quality of training data.

Future Work

Future research directions include optimizing Chameleon's computational efficiency to adapt to more complex dynamic environments. Additionally, exploring a broader range of multimodal inputs and richer training data could further enhance the system's generalization capabilities and task adaptability. Researchers may also consider applying Chameleon's memory architecture to other domains, such as autonomous driving and human-robot interaction.

AI Executive Summary

In robotic manipulation tasks, memory systems play a crucial role, especially in long-horizon tasks where robots need to rely on past interaction history to make correct decisions. However, existing memory systems often rely on semantically compressed methods, which perform poorly under perceptual aliasing because they lose fine-grained perceptual cues needed for disambiguation.

Chameleon is a novel bio-inspired memory architecture designed for long-horizon robotic manipulation. Its core components include a geometry-grounded perception module, a hierarchical differentiable memory stack, and the HoloHead goal-directed recall mechanism. The perception module converts multi-view observations into end-effector consistent patch tokens, preserving evidence needed for disambiguation. The memory stack couples episodic and working memory through continuous dynamics, producing a compact decision state. HoloHead trains the decision state with a latent imagination objective to predict near-future state evolution.

In experiments on the Camo-Dataset, Chameleon achieved decision success rates of 100% on episodic recall, 73.5% on spatial tracking, and 72.2% on sequential tasks, demonstrating effective memory-driven decision-making under perceptual aliasing. Compared with baselines such as Diffusion Policy and Flow Matching, it showed higher decision reliability and task completion rates in every task category.

However, Chameleon may perform suboptimally in highly dynamic environments, as its memory system primarily relies on geometry-grounded tokens, which may not update swiftly in rapidly changing scenes. Additionally, the system's computational complexity is relatively high, particularly when handling multimodal inputs and training the HoloHead mechanism, which may limit its use in real-time applications. Future research directions include optimizing Chameleon's computational efficiency to adapt to more complex dynamic environments and exploring a broader range of multimodal inputs and richer training data to further enhance the system's generalization capabilities and task adaptability.

Deep Analysis

Background

The field of robotic manipulation has long faced the challenge of making effective decisions in complex environments. Traditional methods often rely on semantically compressed memory systems, which summarize experiences into semantic text-like traces for memory storage and retrieval. However, these methods perform poorly under perceptual aliasing because they lose fine-grained perceptual cues needed for disambiguation. In recent years, with advances in bio-inspired memory systems research, researchers have begun exploring how to apply human episodic memory mechanisms to robotic manipulation to enhance decision reliability in long-horizon tasks.

Core Problem

Perceptual aliasing is common in robotic manipulation, especially in long-horizon tasks where the robot must rely on its past interaction history to make the correct decision. Semantically compressed memory methods perform poorly in such cases because they lose the fine-grained perceptual cues needed for disambiguation. Designing a system that can make effective memory-driven decisions under perceptual aliasing is therefore a pressing problem.

Innovation

Chameleon's core innovations lie in its bio-inspired memory architecture design. First, the geometry-grounded perception module converts multi-view observations into end-effector consistent patch tokens, preserving evidence needed for disambiguation. Second, the hierarchical differentiable memory stack couples episodic and working memory through continuous dynamics, producing a compact decision state. Finally, the HoloHead mechanism trains the decision state with a latent imagination objective to predict near-future state evolution. These innovations enable Chameleon to effectively perform memory-driven decision-making under perceptual aliasing.

Methodology

Geometry-grounded perception module: Converts multi-view observations into end-effector consistent patch tokens, preserving evidence needed for disambiguation.

Hierarchical differentiable memory stack: Couples episodic and working memory through continuous dynamics, producing a compact decision state.

HoloHead mechanism: Trains the decision state with a latent imagination objective to predict near-future state evolution.

Memory-driven decision-making: Performs decision-making under perceptual aliasing through goal-directed recall mechanisms.

Experiments

The experiments cover three task categories on the Camo-Dataset: episodic recall, spatial tracking, and sequential tasks. In each category, Chameleon is compared against baselines such as Diffusion Policy and Flow Matching on decision success rate, task completion rate, and other metrics. Ablation studies then test the contribution of each component of the Chameleon architecture.
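As a rough illustration of how per-category metrics like these could be tallied, the sketch below aggregates trial records into a decision success rate and a task completion rate. The `Trial` fields, the way the two metrics are defined, and the trial outcomes are all assumptions made for this example; they are not the paper's evaluation code or data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Trial:
    category: str           # e.g. "episodic_recall"
    decision_correct: bool  # assumed: right choice at the aliased decision point
    task_completed: bool    # assumed: the full rollout finished the task


def summarize(trials):
    """Return {category: {metric: fraction of trials}} for both metrics."""
    totals = defaultdict(lambda: {"n": 0, "decision": 0, "completion": 0})
    for t in trials:
        row = totals[t.category]
        row["n"] += 1
        row["decision"] += int(t.decision_correct)
        row["completion"] += int(t.task_completed)
    return {
        category: {
            "decision_success_rate": row["decision"] / row["n"],
            "task_completion_rate": row["completion"] / row["n"],
        }
        for category, row in totals.items()
    }


# Made-up outcomes for illustration only; these are NOT results from the paper.
trials = [
    Trial("episodic_recall", True, True),
    Trial("episodic_recall", True, False),
    Trial("spatial_tracking", True, True),
    Trial("spatial_tracking", False, False),
    Trial("sequential", True, True),
    Trial("sequential", False, False),
    Trial("sequential", False, False),
    Trial("sequential", True, True),
]
summary = summarize(trials)
```

Keeping the two metrics separate makes it visible when a policy picks the right option at the decision point but still fails to finish the task.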

Results

Experimental results show that Chameleon achieves higher decision reliability and task completion rates than the baselines in every task category. Its decision success rate is 100% on episodic recall, 73.5% on spatial tracking, and 72.2% on sequential tasks. Ablation studies show that removing either HoloHead or the geometry-grounded perception module significantly reduces performance, confirming that both components are critical to the architecture.

Applications

Chameleon targets robotic manipulation tasks that require long-term memory and decision-making, such as complex assembly line operations and object tracking and interaction in dynamic environments. Its memory architecture lets the system make accurate decisions under perceptual aliasing, improving the adaptability and stability of robots in complex tasks.

Limitations & Outlook

Chameleon may perform suboptimally in highly dynamic environments, as its memory system primarily relies on geometry-grounded tokens, which may not update swiftly in rapidly changing scenes. Additionally, the system's computational complexity is relatively high, particularly when handling multimodal inputs and training the HoloHead mechanism, which may limit its use in real-time applications. Future research directions include optimizing Chameleon's computational efficiency to adapt to more complex dynamic environments and exploring a broader range of multimodal inputs and richer training data to further enhance the system's generalization capabilities and task adaptability.

Plain Language: Accessible to non-experts

Imagine cooking in a kitchen where you need to remember which ingredients you've already added and which ones you haven't. This is similar to what Chameleon does in robotic manipulation. Chameleon helps robots make the right decisions in complex tasks by using a memory system similar to human memory. Just like you remember adding salt and pepper while cooking, Chameleon remembers the actions and decisions a robot has made during a task. This memory system not only helps robots remain stable in long-horizon tasks but also allows them to make accurate judgments under perceptual aliasing. By using geometry-grounded multimodal tokens, Chameleon preserves context information needed for disambiguation, just like you ensure each step is done correctly in the kitchen through observation and memory. Ultimately, Chameleon's goal is to enable robots to work as flexibly and intelligently as humans in complex environments.

ELI14: Explained like you're 14

Hey there! Did you know robots need memory just like we need to remember things in school? Imagine playing a cup game where a ball is hidden under one cup, and you have to remember which cup it's under. Chameleon is like a super brain that helps robots remember these things! It's like a smart detective that remembers every detail to help robots make the right choices in complex tasks. Just like you need to remember every move in a game, Chameleon helps robots remember every step so they don't mess up! Isn't that cool?

Glossary

Chameleon

Chameleon is a bio-inspired memory architecture designed for long-horizon robotic manipulation, capable of memory-driven decision-making under perceptual aliasing.

In the paper, Chameleon is used to enhance decision reliability in robotic manipulation tasks.

Episodic Memory

Episodic memory is a memory system that preserves the spatiotemporal and causal context of specific events, supporting future behavior decisions.

In the paper, episodic memory helps robots make accurate decisions in complex tasks.

Perceptual Aliasing

Perceptual aliasing refers to decision uncertainty at the observation level, where the same observation may arise from different interaction histories.

In the paper, perceptual aliasing is a core issue Chameleon addresses.
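The definition above can be made concrete with a toy cup game. In this sketch, two episodes end in the same decision-time observation even though the ball was placed under different cups, so a policy that sees only the current observation cannot be right in both. The event names and both policies are hypothetical and exist only to illustrate the concept.

```python
def observe(history):
    """Decision-time observation: the cups hide the ball, so every episode looks alike."""
    return "three identical cups"


def memoryless_policy(observation):
    """Sees only the current observation, so it must pick the same cup every time."""
    return 0


def memory_policy(history):
    """Recalls the most recent event in which the ball was seen being placed."""
    for event, cup in reversed(history):
        if event == "ball_placed":
            return cup
    return 0


# Two interaction histories that differ only in where the ball was placed.
episodes = [
    [("ball_placed", 0), ("cups_covered", None)],
    [("ball_placed", 2), ("cups_covered", None)],
]
targets = [0, 2]

observations = [observe(ep) for ep in episodes]
memoryless_choices = [memoryless_policy(obs) for obs in observations]
memory_choices = [memory_policy(ep) for ep in episodes]
```

Both observations are identical, so the memoryless policy picks cup 0 twice and is wrong in the second episode, while the history-aware policy matches `targets`.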

Multimodal Tokens

Multimodal tokens are geometry-grounded tokens combining multi-view observations to preserve context information needed for disambiguation.

In the paper, multimodal tokens are used in Chameleon's perception module.
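One plausible reading of "end-effector consistent" is that each patch's 3D position is expressed in the end-effector frame, so the same physical point gets the same coordinates no matter which camera saw it. The sketch below shows that frame change with plain rigid transforms. This interpretation, the function names, and the poses are assumptions for illustration; the summary does not describe the paper's actual tokenization.

```python
def mat_vec(R, v):
    """Multiply a 3x3 rotation matrix by a 3-vector."""
    return [sum(R[i][j] * v[j] for j in range(3)) for i in range(3)]


def apply(T, p):
    """Apply a rigid transform T = (R, t) to point p: R @ p + t."""
    R, t = T
    rotated = mat_vec(R, p)
    return [rotated[i] + t[i] for i in range(3)]


def invert(T):
    """Invert a rigid transform: (R, t) -> (R^T, -R^T @ t)."""
    R, t = T
    Rt = [[R[j][i] for j in range(3)] for i in range(3)]
    return (Rt, [-x for x in mat_vec(Rt, t)])


def to_ee_frame(p_cam, T_base_cam, T_base_ee):
    """Map a point from a camera frame to the end-effector frame via the base frame."""
    return apply(invert(T_base_ee), apply(T_base_cam, p_cam))


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
ROT_Z_90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

# Hypothetical poses of two cameras and the end effector in the robot base frame.
T_base_cam_a = (IDENTITY, [0.0, 0.0, 1.0])
T_base_cam_b = (ROT_Z_90, [1.0, 0.0, 0.0])
T_base_ee = (IDENTITY, [0.4, 0.2, 0.3])

# The same physical patch center, as seen from each camera.
p_cam_a = [0.5, 0.2, -0.7]
p_cam_b = [0.2, 0.5, 0.3]

p_ee_a = to_ee_frame(p_cam_a, T_base_cam_a, T_base_ee)
p_ee_b = to_ee_frame(p_cam_b, T_base_cam_b, T_base_ee)
```

Up to floating-point error, both views map to the same end-effector-frame position, which is the property a view-independent token position would need.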

Differentiable Memory Stack

A differentiable memory stack is an architecture that couples episodic and working memory through continuous dynamics, producing a compact decision state.

In the paper, the differentiable memory stack is a core component of Chameleon.
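To give a feel for what coupling episodic and working memory through continuous dynamics could look like, the sketch below pairs a soft attention read over stored episodic slots with a leaky-integrator update of a working-memory state. Both pieces are generic stand-ins chosen for this example; the summary does not specify the paper's equations, and `read_episodic`, `step_working_memory`, `dt`, and `tau` are hypothetical names.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def read_episodic(query, keys, values):
    """Soft read: attention-weighted average of stored values (smooth in the query)."""
    weights = softmax([dot(query, k) for k in keys])
    dim = len(values[0])
    readout = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return readout, weights


def step_working_memory(state, readout, dt=0.1, tau=1.0):
    """One Euler step of d(state)/dt = (readout - state) / tau."""
    return [s + dt * (r - s) / tau for s, r in zip(state, readout)]


# Two stored episodes with hypothetical keys and values.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
query = [4.0, 0.0]  # strongly matches the first key

readout, weights = read_episodic(query, keys, values)
state = [0.0, 0.0]
for _ in range(50):
    state = step_working_memory(state, readout)
```

Because every operation here is smooth, gradients could flow from the final `state` back to the query and the stored slots, which is the sense of "differentiable" this sketch illustrates.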

HoloHead

HoloHead is a goal-directed recall mechanism that trains the decision state with a latent imagination objective to predict near-future state evolution.

In the paper, HoloHead enhances Chameleon's stability and decision accuracy in long-horizon tasks.
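A latent imagination objective of this kind can be sketched as a rollout loss: starting from the decision state, a predictor is rolled forward a few steps and penalized for deviating from the latents that actually followed. The linear predictor, the mean-squared-error loss, and all values below are assumptions for illustration, not HoloHead's actual design.

```python
def predict_next(state, weights):
    """Hypothetical linear transition model: next_state = W @ state."""
    return [sum(w * s for w, s in zip(row, state)) for row in weights]


def imagination_loss(decision_state, future_latents, weights):
    """Roll the predictor forward and average squared error over the horizon."""
    predicted = decision_state
    total, count = 0.0, 0
    for target in future_latents:
        predicted = predict_next(predicted, weights)
        total += sum((p - t) ** 2 for p, t in zip(predicted, target))
        count += len(target)
    return total / count


decision_state = [1.0, 2.0]
# Toy "true" near-future latents: the state halves at every step.
future_latents = [[0.5, 1.0], [0.25, 0.5]]

halving_model = [[0.5, 0.0], [0.0, 0.5]]   # matches the toy dynamics
identity_model = [[1.0, 0.0], [0.0, 1.0]]  # predicts no change

good_loss = imagination_loss(decision_state, future_latents, halving_model)
bad_loss = imagination_loss(decision_state, future_latents, identity_model)
```

The model that matches the toy dynamics gets zero loss, while the identity model is penalized; minimizing such a loss pushes the decision state to carry whatever information is needed to anticipate the next few steps.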

Camo-Dataset

Camo-Dataset is a real-world robot dataset used to evaluate Chameleon's performance in episodic recall, spatial tracking, and sequential tasks.

In the paper, Camo-Dataset is used to validate Chameleon's performance.

Spatial Tracking

Spatial tracking is a task requiring robots to track the position and state of objects in dynamic environments.

In the paper, spatial tracking is one of the tasks used to evaluate Chameleon's performance.

Sequential Manipulation

Sequential manipulation is a task requiring robots to maintain consistent decisions across multiple stages, avoiding repetition or omission.

In the paper, sequential manipulation is one of the tasks used to evaluate Chameleon's performance.

Flow Matching

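Flow Matching is a generative-modeling approach that learns a velocity field transporting noise samples to data samples; a policy built on it is used for performance comparison with Chameleon.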

In the paper, Flow Matching serves as a baseline to evaluate Chameleon's performance.

Open Questions: Unanswered questions from this research

1. Chameleon's performance in highly dynamic environments remains to be studied. While its geometry-grounded tokens can work effectively in static scenes, they may not update swiftly in rapidly changing environments, which limits its applicability.
2. How to optimize Chameleon's computational efficiency for real-time operation is a pressing issue. The system's computational complexity is currently relatively high, especially when handling multimodal inputs and training the HoloHead mechanism.
3. How to better integrate information from different multimodal sources to improve decision accuracy and stability is a direction worth exploring.
4. Chameleon's generalization to complex tasks still needs verification. Although it performs well on the Camo-Dataset, its performance across a broader range of tasks and environments remains to be studied.
5. How to apply Chameleon's memory architecture to other domains, such as autonomous driving and human-robot interaction, is a promising research direction.

Applications

Immediate Applications

Complex Assembly Line Operations

Chameleon can be applied to complex assembly line operations requiring long-term memory and decision-making, enhancing system adaptability and stability in complex tasks through its innovative memory architecture design.

Object Tracking in Dynamic Environments

In dynamic environments, Chameleon can accurately track the position and state of objects through its geometry-grounded multimodal token system, applicable in fields like warehousing and logistics.

Interactive Robotic Assistants

Chameleon can be used to develop interactive robotic assistants that help humans complete complex tasks in home and work environments, enhancing the intelligence level of robotic assistants through its memory-driven decision system.

Long-term Vision

Autonomous Driving

Chameleon's memory architecture can be applied to autonomous driving systems, helping vehicles make accurate decisions in complex traffic environments, enhancing driving safety and efficiency.

Human-Robot Interaction

In future human-robot interactions, Chameleon's memory system can help robots better understand and respond to human needs, enhancing the naturalness and effectiveness of interactions.

Abstract

Robotic manipulation often requires memory: occlusion and state changes can make decision-time observations perceptually aliased, making action selection non-Markovian at the observation level because the same observation may arise from different interaction histories. Most embodied agents implement memory via semantically compressed traces and similarity-based retrieval, which discards disambiguating fine-grained perceptual cues and can return perceptually similar but decision-irrelevant episodes. Inspired by human episodic memory, we propose Chameleon, which writes geometry-grounded multimodal tokens to preserve disambiguating context and produces goal-directed recall through a differentiable memory stack. We also introduce Camo-Dataset, a real-robot UR5e dataset spanning episodic recall, spatial tracking, and sequential manipulation under perceptual aliasing. Across tasks, Chameleon consistently improves decision reliability and long-horizon control over strong baselines in perceptually confusable settings.
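The failure mode of similarity-based retrieval described in the abstract can be illustrated with a toy memory. Here two stored episodes have nearly identical compressed embeddings but call for opposite decisions, so nearest-neighbor lookup separates them by a negligible margin. The embeddings, field names, and `retrieve` helper are hypothetical; this is not the retrieval used by any specific baseline.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def retrieve(query, memory):
    """Return the stored episode whose embedding is most similar to the query."""
    return max(memory, key=lambda ep: cosine(query, ep["embedding"]))


# Hypothetical compressed embeddings: the scenes look alike, the decisions differ.
memory = [
    {"id": "A", "embedding": [0.90, 0.10, 0.00], "ball_under": "left"},
    {"id": "B", "embedding": [0.89, 0.11, 0.00], "ball_under": "right"},
]
query = [0.90, 0.10, 0.00]  # embedding of the current, aliased observation

top = retrieve(query, memory)
scores = {ep["id"]: cosine(query, ep["embedding"]) for ep in memory}
margin = scores["A"] - scores["B"]
```

Retrieval returns episode A, but the margin over B is tiny, so the ranking says almost nothing about which episode is relevant to the decision. That is the gap the abstract attributes to discarding fine-grained perceptual cues.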

cs.RO cs.AI cs.CV
