Remember to be Curious: Episodic Context and Persistent Worlds for 3D Exploration

TL;DR

Introduces curiosity-driven 3D exploration using a persistent 3D Gaussian Splatting world model and a Transformer policy, achieving 74.94% 3D scene completeness on HM3D at 1024 steps.

cs.LG · Advanced · 2026-05-22
Lily Goli Justin Kerr Daniele Reda Alec Jacobson Andrea Tagliasacchi Angjoo Kanazawa
Reinforcement Learning Curiosity-driven Exploration 3D Reconstruction Transformer Robotics Navigation

Key Findings

Methodology

This paper presents a curiosity-driven exploration framework combining a persistent and dynamically updated 3D Gaussian Splatting (3DGS) world model with a Transformer-based policy maintaining episodic memory. The 3DGS model serves as a stable forward model providing intrinsic rewards based on prediction errors between rendered novel views and actual observations, addressing the local loop problem of traditional methods. The policy network operates on sequences of RGB images and actions, employing causal temporal self-attention and global linear attention to encode long-term episodic context. Training utilizes Proximal Policy Optimization (PPO) with a mixture of learned and random actions to ensure exploration diversity. During deployment, the agent relies solely on RGB inputs, enhancing generalization and practical applicability.

Key Results

  • On the HM3D dataset, the method achieves 74.94% 3D scene completeness at 1024 steps, outperforming the OccAnt-RGBD baseline (74.62%) by 0.32 percentage points, with the average distance from ground-truth points to the nearest observed point reduced to 0.14 cm, indicating more thorough exploration.
  • Zero-shot generalization tests on Gibson and AI-generated environments (Hobbit World and Spaceship) demonstrate robust exploratory behavior with minimal collisions, confirming strong adaptability to unseen and diverse environments.
  • Ablation studies reveal that both the persistent 3DGS world model and Transformer-based long-term memory significantly enhance exploration performance. Short-term memory variants cause local loops, and RNN or no-memory policies perform worse, highlighting the synergy between spatial persistence and episodic context.

Significance

This work addresses critical challenges in curiosity-driven reinforcement learning within photorealistic 3D environments, notably the local loop and false novelty reward issues caused by lack of spatial persistence and episodic memory. By integrating a persistent 3D reconstruction model with a Transformer policy encoding long-term episodic context, the approach substantially improves exploration efficiency in sparse-reward, long-horizon tasks. Its ability to generalize zero-shot to unseen environments and operate purely on RGB inputs at deployment lowers practical barriers, advancing autonomous agents' capabilities in real-world navigation and vision-based tasks.

Technical Contribution

Technically, this paper pioneers the use of online 3D Gaussian Splatting as a persistent and continuously updated world model for intrinsic curiosity, overcoming spatial forgetting inherent in prior statistical models like ICM. The Transformer-based policy leverages causal temporal self-attention combined with a global linear attention memory module to encode long-horizon episodic context, surpassing traditional RNN limitations. The training incorporates a random action mixing strategy within PPO to maintain exploration diversity, preventing policy collapse in sparse-reward settings. The end-to-end framework enables purely RGB-based deployment, balancing training complexity and deployment flexibility.
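The global linear attention memory module is described only at a high level here. The sketch below shows one common causal linear-attention formulation, in which a fixed-size running state replaces the growing key/value cache of standard attention. The feature map `phi` (elu(x) + 1), the class name `LinearAttentionMemory`, and its methods are assumptions for illustration; the paper's actual module may differ.

```python
import math

def phi(x):
    # Positive feature map elu(x) + 1 (an assumed choice, common in linear attention).
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]

class LinearAttentionMemory:
    """Hypothetical running memory: S accumulates phi(k) v^T, z accumulates phi(k)."""

    def __init__(self, d_k, d_v):
        self.S = [[0.0] * d_v for _ in range(d_k)]  # d_k x d_v state matrix
        self.z = [0.0] * d_k                        # normalizer vector

    def write(self, k, v):
        # Fold one (key, value) pair into the state; cost is independent of history length.
        fk = phi(k)
        for i, fki in enumerate(fk):
            self.z[i] += fki
            for j, vj in enumerate(v):
                self.S[i][j] += fki * vj

    def read(self, q, eps=1e-6):
        # Normalized readout: (phi(q)^T S) / (phi(q)^T z).
        fq = phi(q)
        denom = sum(a * b for a, b in zip(fq, self.z)) + eps
        d_v = len(self.S[0])
        return [sum(fq[i] * self.S[i][j] for i in range(len(fq))) / denom
                for j in range(d_v)]
```

Because the state has a fixed size, the per-step cost stays constant however long the episode runs, which is the property that makes such a module suitable for long-horizon episodic context.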

Novelty

The fundamental novelty lies in coupling a persistent 3DGS reconstruction with a Transformer policy maintaining episodic memory to systematically resolve the local loop and reward deception problems in curiosity-driven exploration. Unlike prior works relying on statistical priors or explicit geometric maps, this approach provides spatially consistent, dynamically updated world modeling and semantic-rich episodic context without depth or localization at test time, establishing a new paradigm for visual exploration in complex 3D environments.

Limitations

  • The approach assumes static environments, limiting applicability in dynamic or highly changing scenes where persistent modeling is more challenging.
  • The computational overhead of 3DGS reconstruction and rendering is significant, potentially hindering real-time performance in large-scale environments.
  • Although deployment requires only RGB input, training depends on privileged depth and camera pose data, increasing sensor requirements and environment setup complexity.

Future Work

Future research directions include extending persistent world models to handle dynamic scenes with real-time change detection and adaptation, improving computational efficiency of 3DGS for scalable online updates, and reducing reliance on depth and pose information during training to enable fully self-supervised RGB-only learning. Additionally, exploring multi-agent and highly dynamic environment scenarios will broaden applicability and robustness.

AI Executive Summary

Effective exploration is foundational for autonomous agents to learn useful behaviors in sparse-reward, long-horizon tasks, particularly within complex 3D environments. Traditional curiosity-driven reinforcement learning methods incentivize exploration through intrinsic rewards derived from prediction errors of learned forward models. However, in photorealistic settings, agents often fall into local loops, repeatedly revisiting familiar states and receiving spurious novelty rewards, which hampers learning.

To address these challenges, this work introduces a novel framework that integrates a persistent, online 3D Gaussian Splatting (3DGS) world model with a Transformer-based policy network that maintains episodic memory. The 3DGS model continuously reconstructs and updates a spatially consistent 3D representation of the environment, enabling reliable intrinsic rewards based on discrepancies between predicted and observed views. The policy network processes sequences of RGB observations and actions using causal temporal self-attention and a global linear attention memory module, allowing the agent to leverage long-term episodic context for planning exploratory trajectories.

Training employs Proximal Policy Optimization (PPO) combined with a random action mixing strategy to maintain exploration diversity and prevent policy collapse in sparse-reward regimes. While training leverages privileged depth and camera pose data to build the 3DGS model, deployment requires only RGB inputs, enhancing practical applicability and generalization.
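As a minimal sketch of the random action mixing idea, the function below draws from the mixture (1 - β)·π + β·Uniform by first flipping a β-weighted coin. The function name `sample_mixed_action` and its signature are hypothetical. How the mixture interacts with PPO's log-probabilities is not specified in this summary, so that part is omitted.

```python
import random

def sample_mixed_action(policy_probs, beta, rng=random):
    """With probability beta pick a uniform random action, else sample from the policy."""
    n = len(policy_probs)
    if rng.random() < beta:
        # Exploration branch: ignore the policy and act uniformly at random.
        return rng.randrange(n)
    # Exploitation branch: sample an action index according to the policy distribution.
    return rng.choices(range(n), weights=policy_probs, k=1)[0]
```

With `beta=0.0` this reduces to ordinary policy sampling, so the same code path can serve both early training and the fully annealed regime.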

Experimental evaluation on the Habitat Matterport 3D (HM3D) dataset demonstrates that the proposed method achieves 74.94% 3D scene completeness at 1024 steps, outperforming state-of-the-art map-based RL baselines. The agent generalizes zero-shot to the Gibson dataset and AI-generated environments, exhibiting coherent exploratory behavior and low collision rates. Ablation studies confirm the critical roles of persistent spatial memory and episodic context. Furthermore, the pretrained exploration policy can be fine-tuned efficiently for downstream tasks such as apple picking and image-goal navigation, surpassing from-scratch baselines, especially under sparse reward conditions.

This research advances the field by resolving fundamental limitations of curiosity-driven exploration in realistic 3D settings, offering a scalable, end-to-end framework that balances spatial persistence, episodic memory, and deployment flexibility. Despite current limitations related to static environment assumptions and computational costs, the work lays a strong foundation for future developments in dynamic scene modeling, efficient 3D reconstruction, and fully self-supervised visual exploration.

Overall, this study provides significant theoretical and practical contributions towards enabling autonomous agents to navigate and learn in complex, photorealistic 3D worlds, with broad implications for robotics, virtual reality, and embodied AI.

Deep Analysis

Background

Exploration is a fundamental prerequisite for autonomous agents to acquire meaningful behaviors, especially in sparse-reward, long-horizon tasks common in robotics and embodied AI. Early psychological studies, such as Edward Tolman's latent learning experiments, demonstrated that animals can learn complex environmental representations without explicit rewards, highlighting intrinsic motivation as a key driver. In reinforcement learning, curiosity-driven methods operationalize this by providing intrinsic rewards based on prediction errors of learned forward models, encouraging agents to seek novel and uncertain states.


Classic approaches like the Intrinsic Curiosity Module (ICM) use forward dynamics models to estimate prediction errors as novelty signals. While effective in simple or simulated environments, these methods struggle in complex, photorealistic 3D scenes due to local loop traps and reward deception. Agents tend to revisit forgotten states repeatedly, receiving spurious novelty rewards, which impedes efficient exploration.


Moreover, many existing methods rely on short-term memory or statistical priors rather than persistent spatial representations, limiting their ability to plan long-term exploratory trajectories. Some approaches incorporate explicit geometric maps to aid navigation, but these often abstract away semantic richness and restrict end-to-end learning and generalization. Advances in 3D reconstruction, particularly online 3D Gaussian Splatting (3DGS), offer spatially consistent and dynamically updatable scene representations, presenting new opportunities for persistent world modeling.


This paper builds upon these insights to develop a curiosity-driven exploration framework that integrates persistent 3D reconstruction with a Transformer policy encoding long-term episodic context, aiming to overcome the limitations of prior work and enable scalable exploration in realistic 3D environments.

Core Problem

The core problem addressed is enabling effective curiosity-driven exploration in complex, photorealistic 3D environments under sparse reward conditions. Key challenges include:


1. Lack of persistent spatial world models: Without a continuously updated and consistent representation of the environment, agents suffer from spatial forgetting, leading to repeated visits to the same locations and false novelty rewards.


2. Insufficient episodic memory in policies: Short-term or no memory policies cannot leverage historical trajectories to plan efficient exploration paths, limiting long-horizon exploration capabilities.


3. Reliance on privileged sensors: Many methods depend on depth sensors or explicit maps, complicating deployment and reducing flexibility.


4. Exploration collapse: Sparse rewards often cause policies to degenerate into repetitive or random behaviors, hindering sustained discovery of novel states.


Addressing these bottlenecks is critical for advancing autonomous agents capable of robust, scalable exploration and learning in real-world-like 3D settings.

Innovation

This work introduces several core innovations:


1. Persistent 3D World Model: Employing online 3D Gaussian Splatting (3DGS) as a persistent, dynamically updated world model provides spatial consistency and reliable intrinsic rewards based on prediction errors, overcoming spatial forgetting issues inherent in prior statistical models like ICM.


2. Transformer-based Episodic Policy: Designing a policy network that processes sequences of RGB observations and actions using causal temporal self-attention and a global linear attention memory module enables encoding of long-term episodic context, facilitating complex exploratory behaviors such as backtracking and branch discovery.


3. Pure RGB Deployment: While training leverages depth and pose data to build the 3DGS model, the policy requires only RGB inputs at test time, enhancing deployment flexibility and generalization.


4. Exploration Diversity via Random Action Mixing: Integrating a random action mixing strategy within PPO training maintains exploration diversity and prevents policy collapse in sparse-reward environments.


Together, these innovations address the dual challenges of spatial persistence and episodic context, establishing a new paradigm for curiosity-driven exploration in photorealistic 3D environments.

Methodology

  • Problem Setup: The agent interacts with a static 3D environment, executing discrete actions to move and receiving RGB observations. Training uses privileged inputs (depth, camera pose) to build the world model; testing relies solely on RGB.

  • Persistent World Model: An online 3D Gaussian Splatting (3DGS) model G_t is constructed incrementally from RGB-D frames and camera poses. Each pixel contributes a Gaussian primitive, optimized periodically for reconstruction quality and densified.

  • Intrinsic Reward Computation: At each timestep, the 3DGS model renders a predicted view Î_{t+1} from the agent's new pose. The prediction error e_t is computed as the mean squared difference between low-pass filtered and downsampled predicted and actual RGB images. A binary curiosity reward r_t^cur is assigned based on whether e_t exceeds a threshold τ, rewarding novel views and penalizing redundant ones.

  • Policy Network Architecture: Inputs are sequences of paired RGB images I_i and previous actions a_{i-1}, where actions are encoded as Plücker ray images representing intended camera transformations.

  • Image Encoding: Each RGB-action pair is encoded via a convolutional encoder and enriched with DINOv2 self-supervised visual features. A learnable query token cross-attends to patch tokens and DINOv2 features to produce frame tokens.

  • Temporal Modeling: Frame tokens pass through sliding-window causal self-attention layers capturing local temporal context, interleaved with a global linear attention memory module that maintains a running memory state, enabling long-term episodic context encoding.

  • Output Heads: Actor and critic heads produce action probability distributions π_θ and value estimates V_θ respectively. Actions are sampled from π_θ during training and testing.

  • Training Procedure: The policy is optimized using PPO with the curiosity reward as the sole signal. A random action mixing mechanism samples actions from a mixture of the learned policy and a uniform random distribution with annealed mixing coefficient β to maintain exploration diversity.

  • Deployment: The agent operates purely on RGB inputs without access to depth or pose information, relying on learned episodic memory and policy for navigation.
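The intrinsic reward step in the list above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: average pooling stands in for the low-pass filter and downsampling, the reward values +1 and -1 are assumed (the summary says only that the reward is binary), and the names `box_downsample`, `curiosity_reward`, and `factor` are hypothetical.

```python
def box_downsample(img, factor):
    """Average-pool an H x W x C image (nested lists); a box filter is a crude low-pass."""
    h, w, c = len(img), len(img[0]), len(img[0][0])
    out = []
    for y in range(0, h - factor + 1, factor):
        row = []
        for x in range(0, w - factor + 1, factor):
            px = [0.0] * c
            for dy in range(factor):
                for dx in range(factor):
                    for ch in range(c):
                        px[ch] += img[y + dy][x + dx][ch]
            row.append([p / (factor * factor) for p in px])
        out.append(row)
    return out

def curiosity_reward(rendered, observed, tau, factor=4):
    """Return (reward, e_t): e_t is the MSE between the filtered rendered and observed views."""
    a = box_downsample(rendered, factor)
    b = box_downsample(observed, factor)
    diffs = [(pa - pb) ** 2
             for ra, rb in zip(a, b)
             for qa, qb in zip(ra, rb)
             for pa, pb in zip(qa, qb)]
    e_t = sum(diffs) / len(diffs)
    # Assumed reward values: +1 when the world model failed to predict the view, -1 otherwise.
    return (1.0 if e_t > tau else -1.0), e_t
```

Comparing filtered, low-resolution images makes the error less sensitive to high-frequency rendering noise, so the threshold τ responds mainly to regions the world model has not yet reconstructed.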

Experiments

  • Datasets and Environment: Training is conducted on the Habitat Matterport 3D (HM3D) training set comprising 800 indoor scenes. Evaluation occurs on the HM3D validation set (100 scenes), Gibson dataset (86 scenes), and two AI-generated environments (Hobbit World and Spaceship) for zero-shot generalization.

  • Baselines: Compared against map-based RL methods including Active Neural SLAM (ANS) and Occupancy Anticipation (OccAnt) variants with RGB, depth, and RGB-D inputs.

  • Metrics: Exploration is quantified via 3D scene completeness (percentage of ground truth mesh points observed within a threshold) at 256, 512, and 1024 steps, and average distance from ground truth points to nearest observed points.

  • Ablations: Independently vary world model memory (ICM vs 3DGS, short vs persistent memory) and agent memory capacity (RNN vs Transformer with varying context lengths) to assess their impact on exploration performance.

  • Downstream Tasks: Fine-tune the pretrained exploration policy on apple-picking and image-goal navigation tasks, comparing performance against from-scratch training baselines.

  • Training Details: Use Adam optimizer with learning rate 1e-5, train for 110 million steps. Random action mixing starts at 20% and decays to zero over 5 million steps. The agent simulates a spherical drone with four discrete actions (move forward, look left/right, pause) in Habitat simulator.
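The annealing of the mixing coefficient from the training details above could look like the sketch below. The start value (20%) and horizon (5 million steps) come from the text; the linear decay shape and the name `mixing_coefficient` are assumptions, since the summary does not state how the decay is shaped.

```python
def mixing_coefficient(step, beta_start=0.20, decay_steps=5_000_000):
    """Linearly anneal beta from beta_start to 0 over decay_steps (assumed linear shape)."""
    # Clamp progress to [0, 1] so beta stays at 0 for the rest of training.
    frac = min(max(step / decay_steps, 0.0), 1.0)
    return beta_start * (1.0 - frac)
```

Under this assumption, random actions are mixed in only during the first 5 million of the 110 million training steps, after which the agent acts purely from the learned policy.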

Results

  • On HM3D, the proposed method achieves 74.94% completeness at 1024 steps, surpassing OccAnt-RGBD's 74.62%, with average point distance reduced to 0.14 cm, indicating superior coverage and precision.

  • On Gibson, the agent attains 82.42% completeness, outperforming baselines and demonstrating cross-dataset generalization.

  • Zero-shot tests on AI-generated Hobbit World and Spaceship show coherent exploratory behaviors with only 2-3 collisions over 256 steps, evidencing robustness to novel environments.

  • Ablations confirm that the persistent 3DGS world model significantly improves exploration compared to ICM or short-memory variants. Transformer policies with longer context windows outperform RNNs and no-memory policies.

  • Fine-tuning on downstream tasks yields higher success rates than from-scratch training, especially under sparse reward conditions, highlighting the benefit of exploratory pretraining.
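To make the two reported metrics concrete, the sketch below computes a completeness ratio and a mean nearest-point distance from two point sets. The function name `completeness` and the brute-force nearest-neighbour search are illustrative assumptions. The actual distance threshold and point sampling used in the evaluation are not given in this summary.

```python
import math

def completeness(gt_points, observed_points, threshold):
    """Return (percent of GT points within threshold of an observed point, mean distance)."""
    # Distance from each ground-truth point to its nearest observed point (O(N*M) brute force).
    dists = [min(math.dist(g, o) for o in observed_points) for g in gt_points]
    ratio = 100.0 * sum(d <= threshold for d in dists) / len(dists)
    mean_dist = sum(dists) / len(dists)
    return ratio, mean_dist
```

For realistic mesh sizes a spatial index such as a k-d tree would replace the brute-force loop; the definition of the metrics stays the same.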

Applications

This approach is directly applicable to autonomous indoor robot navigation, enabling efficient exploration without reliance on depth sensors or explicit maps, facilitating deployment on resource-constrained platforms. The pretrained curiosity-driven policy serves as a strong initialization for downstream vision-based tasks such as object retrieval and goal-directed navigation, improving learning efficiency in sparse-reward scenarios. Additionally, the framework can be leveraged in virtual reality and gaming for intelligent agent exploration, enhancing environment understanding and interaction. Long-term, it can support dynamic environment monitoring, warehouse automation, and other robotics applications requiring robust, scalable exploration capabilities.

Limitations & Outlook

The method assumes static environments, limiting applicability in dynamic or highly changing scenes where persistent modeling is more complex. The computational cost of 3DGS reconstruction and rendering is substantial, potentially restricting real-time use in large-scale settings. Training requires privileged depth and camera pose data, increasing sensor and setup complexity. The policy's performance in multi-agent or highly dynamic environments remains unexplored. Addressing these limitations is essential for broader real-world deployment and robustness.

Plain Language (accessible to non-experts)

Imagine you're exploring a huge amusement park without a map, trying to find new rides and attractions. Normally, you might forget where you've been and keep wandering in circles, thinking you've found something new when you haven't. This is what happens to many robots exploring complex 3D environments—they get stuck revisiting the same places because they lack a good memory and map.

This paper's method is like giving the robot a constantly updating 3D map of the park and a super-smart brain that remembers every path it took. The map is built from the robot's camera images and keeps getting better as it explores. The brain uses a Transformer model, which can remember long sequences of what it saw and did, helping it plan to visit new areas instead of going in circles.

During training, the robot gets rewarded when it sees something its map didn't predict well—meaning it's exploring new places. To keep things interesting, sometimes it takes random actions to avoid getting stuck. When it's time to work for real, the robot only needs its camera images, no fancy depth sensors or maps, making it easier to use.

Experiments show this robot explores more thoroughly and can even handle new parks it hasn't seen before. It can also quickly learn new tasks like picking apples or finding a specific spot using just pictures. In short, this method helps robots explore unknown places smartly and efficiently, just like a curious kid with a great memory and a map.

ELI14 (explained like you're 14)

Hey! Imagine you're in a giant amusement park with no map. You want to find all the cool rides and secret spots, but you keep forgetting where you've been and end up going in circles. That’s kinda like what robots do when they explore new places—they get stuck because they forget their path.

Now, this new method is like giving the robot a magic 3D map that updates itself as it walks around. Plus, it has a super brain called a Transformer that remembers everything it saw and did, so it knows where to go next to find new stuff. When the robot sees something surprising—something its map didn’t expect—it gets a little reward, making it want to explore more.

Sometimes, the robot even tries random moves to keep things exciting and avoid getting bored. And guess what? When it’s actually exploring for real, it only needs a regular camera, no fancy sensors! That means it can work in lots of places without extra gear.

The coolest part? This robot can explore new parks it’s never seen before and quickly learn new games like picking apples or finding a spot from a photo. So basically, it’s like a super curious kid with a great memory and a magic map, ready for adventure!

Glossary

Intrinsic Curiosity Module (ICM)

A mechanism that provides intrinsic rewards from the prediction error of a learned forward dynamics model, encouraging exploration of novel states.

Used as a baseline curiosity-driven exploration method; it lacks spatial persistence, which leads to local loops.

3D Gaussian Splatting (3DGS)

An explicit 3D scene representation composed of Gaussian primitives that are rendered by splatting them onto the image plane; it supports online incremental updates and efficient, spatially consistent rendering.

Employed as the persistent world model providing stable intrinsic rewards.
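Since the world model is built from RGB-D frames and camera poses, with each pixel contributing a Gaussian primitive, the center of each new primitive can be obtained by back-projecting the pixel into world space. The sketch below assumes a pinhole camera with intrinsics `fx, fy, cx, cy`, a +z-forward camera frame, and a camera-to-world pose `(R, t)`. All of these names and conventions are assumptions; simulators may use a different axis convention.

```python
def backproject_pixel(u, v, depth, fx, fy, cx, cy, R, t):
    """Map pixel (u, v) with metric depth to a world-space point (a candidate Gaussian center)."""
    # Camera-frame point under a pinhole model with +z forward (assumed convention).
    xc = (u - cx) / fx * depth
    yc = (v - cy) / fy * depth
    pc = (xc, yc, depth)
    # World-frame point: R @ pc + t, with R a 3x3 camera-to-world rotation (nested lists).
    return tuple(sum(R[i][j] * pc[j] for j in range(3)) + t[i] for i in range(3))
```

The pixel's RGB value would then initialize the primitive's color, while scale and opacity are refined by the periodic optimization.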

Transformer

A sequence model based on self-attention mechanisms, capable of capturing long-range dependencies in data sequences.

Used as the policy network backbone to encode episodic memory over RGB observations and actions.
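The sliding-window causal self-attention used for local temporal context amounts to a particular attention mask. The helper below, with the hypothetical name `sliding_window_causal_mask`, builds that mask for a sequence of frame tokens. The window size is a free parameter here, not a value taken from the paper.

```python
def sliding_window_causal_mask(seq_len, window):
    """mask[i][j] is True when frame i may attend to frame j: j <= i and i - j < window."""
    return [[(j <= i) and (i - j < window) for j in range(seq_len)]
            for i in range(seq_len)]
```

Each frame therefore sees only itself and the `window - 1` frames before it, never the future; anything older has to reach the policy through the global memory module.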

Proximal Policy Optimization (PPO)

A reinforcement learning algorithm that stabilizes policy updates by limiting the deviation from the previous policy.

Used to train the policy network with curiosity-based intrinsic rewards.
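The phrase "limiting the deviation from the previous policy" refers to PPO's clipped surrogate objective. A per-sample sketch is given below; the clip range `eps=0.2` is a commonly used default and an assumption here, as the summary does not report the value used.

```python
import math

def ppo_clip_term(logp_new, logp_old, advantage, eps=0.2):
    """Per-sample PPO clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    ratio = math.exp(logp_new - logp_old)            # r = pi_new(a|s) / pi_old(a|s)
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)  # keep r within [1 - eps, 1 + eps]
    # Taking the minimum removes any incentive to push r outside the clip range.
    return min(ratio * advantage, clipped * advantage)
```

The policy loss is the negative mean of this term over a batch. For a positive advantage the gain is capped once the ratio exceeds 1 + eps, which is what keeps each update close to the previous policy.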

Habitat Matterport 3D (HM3D)

A large-scale dataset of photorealistic indoor 3D scenes used for training and evaluating embodied AI agents.

Primary dataset for training and evaluation in experiments.

Episodic Memory

The agent's record of past observations and actions within a single exploration episode, used for planning and decision-making.

Encoded by the Transformer policy to avoid revisiting known states.

Random Action Mixing

A training technique mixing learned policy actions with random actions to maintain exploration diversity and prevent policy collapse.

Implemented during PPO training to encourage persistent exploration.

Zero-shot Generalization

The ability of a model to perform well in unseen environments or tasks without additional training.

Demonstrated by the agent's performance on Gibson and AI-generated scenes.

Plücker Ray

A six-dimensional representation of a 3D ray by its direction and its moment about the origin; here, images of such rays encode the intended camera transformation.

Used to encode actions as images concatenated with RGB inputs.
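For reference, the Plücker coordinates of a single ray are its unit direction d together with the moment m = o × d, where o is any point on the ray. The helper name `plucker_ray` is hypothetical. How the paper arranges these six values into a per-pixel action image is not detailed in this summary.

```python
import math

def plucker_ray(origin, direction):
    """Return the 6-vector (d, m) with d the unit direction and m = origin x d."""
    n = math.sqrt(sum(c * c for c in direction))
    d = tuple(c / n for c in direction)
    ox, oy, oz = origin
    dx, dy, dz = d
    # Cross product origin x d gives the moment of the ray about the world origin.
    m = (oy * dz - oz * dy, oz * dx - ox * dz, ox * dy - oy * dx)
    return d + m
```

The moment does not change when the origin slides along the ray, so the representation identifies the ray itself rather than a particular point on it.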

DINOv2 Features

Self-supervised visual features extracted to enrich image representations.

Combined with RGB encodings to enhance policy input features.

Open Questions (unanswered questions from this research)

  • Extending persistent world models to dynamic environments remains an open challenge, requiring real-time change detection and adaptation capabilities.
  • Reducing the computational overhead of 3DGS reconstruction and rendering is essential for real-time applications in large-scale or resource-constrained settings.
  • Achieving fully self-supervised training relying solely on RGB inputs, without privileged depth or pose information, is an unsolved problem limiting deployment flexibility.
  • Understanding and improving policy robustness in multi-agent or highly dynamic scenarios is necessary for broader applicability.
  • Integrating semantic understanding with geometric exploration to enhance task-specific performance and generalization is an important future direction.

Applications

Immediate Applications

Indoor Robot Navigation

Enables robots to autonomously explore complex indoor environments using only RGB cameras, without depth sensors or explicit maps, facilitating deployment on diverse platforms.

Visual Task Pretraining

Provides a pretrained exploration policy that accelerates learning on downstream sparse-reward tasks such as object retrieval and goal-directed navigation.

Virtual Environment Exploration

Applicable to autonomous agents in virtual reality or gaming, enhancing environment understanding and interactive behaviors across diverse scenarios.

Long-term Vision

Persistent Modeling in Dynamic Environments

Developing world models capable of real-time adaptation to environmental changes, enabling robust autonomous exploration in real-world dynamic settings.

Efficient Exploration on Resource-Constrained Devices

Optimizing 3D reconstruction and policy computation for deployment on mobile robots and embedded systems with limited computational resources.

Abstract

Exploration is a prerequisite for learning useful behaviors in sparse-reward, long-horizon tasks, particularly within 3D environments. Curiosity-driven reinforcement learning addresses this via intrinsic rewards derived from the mismatch between the agent's predictive model of the world and reality. However, translating this intrinsic motivation to complex, photorealistic environments remains difficult, as agents can become trapped in local loops and receive fresh rewards for revisiting forgotten states. In this work, we demonstrate that this failure stems from a lack of spatial persistence and episodic context. We show that effective curiosity requires a model of the world that is persistent and continuously updated, paired with an agent that maintains an episodic trajectory history to navigate toward novel regions. We achieve this using an online 3D reconstruction as a persistent model of the world, while the agent policy is parameterized as a sequence model over RGB observations to maintain episodic context. This design enables effective exploration during training while allowing the agent to navigate using solely RGB frames at deployment. Trained purely via curiosity on HM3D, our agent outperforms RL-based active mapping baselines and generalizes zero-shot to Gibson and AI-generated worlds. Our end-to-end policy enables efficient adaptation to downstream tasks, such as apple picking and image-goal navigation, outperforming from-scratch baselines. Please see video results at https://recuriosity.github.io/.
