NavTrust: Benchmarking Trustworthiness for Embodied Navigation
NavTrust benchmarks embodied navigation robustness by systematically introducing RGB, depth, and instruction corruptions, revealing significant robustness gaps in current models.
Key Findings
Methodology
NavTrust provides a unified benchmark that systematically introduces RGB, depth, and instruction corruptions in realistic scenarios to evaluate the performance of embodied navigation models. This benchmark is the first to expose embodied navigation agents to diverse RGB-depth corruptions and instruction variations within a single framework. The study evaluates seven state-of-the-art approaches, revealing significant performance degradation under realistic corruptions, highlighting critical robustness gaps, and providing a roadmap toward more trustworthy embodied navigation systems. Additionally, the study systematically evaluates four distinct mitigation strategies to enhance robustness against RGB-depth and instruction corruptions.
Key Results
- Under RGB image corruptions, RGB-only agents (e.g., Uni-NaVid and NaVid) are penalized more heavily than depth-involved or language-conditioned methods. Black-out and foreign-object corruptions reduce the success rate of RGB-only agents by 22% and 13%, respectively.
- Under depth corruptions, Gaussian noise is the most destructive: L3MVN's success rate collapses from 50% to 2%, and VLFM similarly drops from 50% to 0%.
- Under instruction corruptions, ETPNav, NaVid, and Uni-NaVid experience success rate declines of 28%, 12%, and 21%, respectively, under random masking.
Significance
The NavTrust study reveals the vulnerabilities of existing embodied navigation systems when faced with common real-world perception and language input corruptions. The benchmark provides an essential evaluation tool for developing more robust navigation systems and pushes both academia and industry toward more trustworthy navigation. By identifying and quantifying performance degradation under adverse conditions, NavTrust provides clear directions for future research and development.
Technical Contribution
NavTrust's main technical contribution is a unified framework for evaluating the robustness of embodied navigation systems under various input corruptions, covering not only RGB and depth sensor corruptions but also language instruction variations. The study also systematically evaluates four mitigation strategies (data augmentation, knowledge distillation, adapter tuning, and large language model fine-tuning), offering an empirical roadmap for future robustness improvements.
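To make one of these strategies concrete, below is a minimal sketch of knowledge distillation for robustness, assuming a frozen teacher policy that sees clean observations and a student that sees corrupted ones; `student`, `teacher`, and `corrupt_fn` are hypothetical placeholders, not NavTrust's released code.

```python
# Hedged sketch: distill a clean-input teacher into a corrupted-input student.
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, clean_obs, corrupt_fn, optimizer, temperature=2.0):
    corrupted_obs = corrupt_fn(clean_obs)        # e.g. Gaussian noise or low lighting
    with torch.no_grad():
        teacher_logits = teacher(clean_obs)      # soft targets from clean inputs
    student_logits = student(corrupted_obs)      # student only ever sees corrupted inputs
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```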
Novelty
NavTrust is the first benchmark to evaluate embodied navigation agents under diverse RGB-depth corruptions and instruction variations within a unified framework. Compared with prior work, its key innovation is the systematic introduction of realistic input corruptions combined with an evaluation of multiple mitigation strategies.
Limitations
- NavTrust's evaluation primarily focuses on simulated environments; although the base models were deployed on a real robot, further validation in more real-world scenarios is needed.
- The effectiveness of mitigation strategies varies across different models and corruption types, potentially requiring customization for specific applications.
- The current benchmark does not cover every possible perception and language corruption type and may need to be expanded in the future.
Future Work
Future research could explore validating NavTrust's applicability in more real-world scenarios and developing more effective mitigation strategies. Additionally, the benchmark's scope could be expanded to cover more types of input corruptions and explore applying these strategies across different embodied navigation tasks.
AI Executive Summary
Embodied navigation refers to the ability of robots to autonomously move in complex environments, often relying on vision and language instructions. However, existing navigation systems perform poorly when faced with common real-world perception and language input corruptions. The NavTrust benchmark systematically introduces RGB, depth, and instruction corruptions to evaluate the performance of embodied navigation models, revealing significant robustness gaps in current models.
NavTrust's framework is the first to expose embodied navigation agents to diverse RGB-depth corruptions and instruction variations within a single framework. The study evaluates seven state-of-the-art methods, revealing significant performance degradation under realistic corruptions, highlighting critical robustness gaps, and providing a roadmap toward more trustworthy embodied navigation systems.
In experiments, researchers found that RGB-only agents (e.g., Uni-NaVid and NaVid) are more heavily penalized under image corruptions, while depth-involved or language-conditioned methods are more robust. Furthermore, the study reveals that under depth corruptions, Gaussian noise is the most destructive, leading to significant drops in success rates for L3MVN and VLFM.
To enhance system robustness, researchers evaluated four mitigation strategies, including data augmentation, knowledge distillation, adapter tuning, and large language model fine-tuning. These strategies improved model performance under corruption conditions to varying degrees, providing an empirical roadmap.
The NavTrust study holds significant implications for academia and industry. By identifying and quantifying performance degradation under adverse conditions, NavTrust provides clear directions for future research and development. However, the study also has limitations, such as limited validation in real-world settings and the need to tailor mitigation strategies to specific models and corruption types.
Future research could explore validating NavTrust's applicability in more real-world scenarios and developing more effective mitigation strategies. Additionally, the benchmark's scope could be expanded to cover more types of input corruptions and explore applying these strategies across different embodied navigation tasks.
Deep Analysis
Background
Embodied navigation refers to the ability of robots to autonomously move in complex environments, often relying on vision and language instructions. In recent years, significant progress has been made in the field of embodied navigation with the development of deep learning and computer vision technologies. However, existing navigation systems perform poorly when faced with common real-world perception and language input corruptions. For example, Vision-Language Navigation (VLN) and Object-Goal Navigation (OGN) experience significant performance drops when encountering minor linguistic perturbations or small domain shifts. These vulnerabilities are often overlooked in existing benchmarks, which typically report performance under idealized input conditions. Additionally, current benchmarks lack a unified framework for systematically evaluating robustness mitigation strategies. To address these gaps, NavTrust provides a unified benchmark that systematically introduces RGB, depth, and instruction corruptions in realistic scenarios to evaluate the performance of embodied navigation models.
Core Problem
Existing embodied navigation systems perform poorly when faced with common real-world perception and language input corruptions. These corruptions include RGB image blurring, low lighting, noise, depth sensor Gaussian noise, data loss, multipath interference, quantization errors, and language instruction variations. These corruptions can lead to significant performance degradation in navigation systems, affecting their applicability in real-world scenarios. Therefore, evaluating and improving the robustness of embodied navigation systems under these corruption conditions is an important and challenging problem.
Innovation
NavTrust's innovation lies in providing a unified framework for evaluating the robustness of embodied navigation systems under various input corruptions. Specifically:
- NavTrust systematically introduces RGB, depth, and instruction corruptions to evaluate the performance of embodied navigation models. These corruptions include RGB image blurring, low lighting, noise, depth sensor Gaussian noise, data loss, multipath interference, quantization errors, and language instruction variations.
- NavTrust is the first to expose embodied navigation agents to diverse RGB-depth corruptions and instruction variations within a single framework.
- NavTrust provides evaluations of multiple mitigation strategies, including data augmentation, knowledge distillation, adapter tuning, and large language model fine-tuning, offering an empirical roadmap for future robustness enhancements.
Methodology
The research methodology of NavTrust includes the following key steps:
- Dataset Selection: Use the validation set of the Habitat-Matterport3D dataset for OGN evaluation; use the R2R and RxR datasets for VLN evaluation.
- Corruption Types: Introduce eight types of RGB image corruptions (e.g., motion blur, low lighting, noise), four types of depth corruptions (e.g., Gaussian noise, data loss), and five types of instruction corruptions (e.g., masking, stylistic variation); a minimal corruption sketch follows this list.
- Model Evaluation: Evaluate seven state-of-the-art methods: ETPNav, NaVid, Uni-NaVid, WMNav, L3MVN, PSL, and VLFM.
- Mitigation Strategies: Evaluate four mitigation strategies: data augmentation, knowledge distillation, adapter tuning, and large language model fine-tuning.
- Experimental Design: Conduct experiments both in simulation and on a real robot to evaluate model performance under different corruption conditions.
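As referenced above, here is a minimal sketch of how a benchmark harness might inject one corruption per modality at the paper's intensity setting of 0.5; the noise scales, dropout rate, and masking probability are illustrative assumptions, not NavTrust's exact parameters.

```python
# Hedged sketch of per-modality corruption injection (parameters are assumptions).
import random
import numpy as np

INTENSITY = 0.5  # severity level used in the paper's main experiments

def corrupt_rgb(rgb: np.ndarray) -> np.ndarray:
    """Example RGB corruption: additive Gaussian noise on a float image in [0, 1]."""
    noisy = rgb + INTENSITY * 0.1 * np.random.randn(*rgb.shape)
    return np.clip(noisy, 0.0, 1.0)

def corrupt_depth(depth: np.ndarray) -> np.ndarray:
    """Example depth corruptions: Gaussian noise plus random data loss (dropped pixels)."""
    noisy = depth + INTENSITY * 0.05 * np.random.randn(*depth.shape)
    dropout = np.random.rand(*depth.shape) < 0.1 * INTENSITY
    noisy[dropout] = 0.0  # missing returns are often reported as zero depth
    return noisy

def corrupt_instruction(instruction: str, mask_token: str = "[MASK]") -> str:
    """Example instruction corruption: randomly mask a fraction of the tokens."""
    tokens = instruction.split()
    return " ".join(mask_token if random.random() < 0.2 * INTENSITY else t for t in tokens)
```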
Experiments
The experimental design includes the following aspects:
- Datasets: Use the validation set of the Habitat-Matterport3D dataset for OGN evaluation; use the R2R and RxR datasets for VLN evaluation.
- Baselines: Evaluate seven state-of-the-art methods: ETPNav, NaVid, Uni-NaVid, WMNav, L3MVN, PSL, and VLFM.
- Evaluation Metrics: Use success rate (SR), success-weighted path length (SPL), and performance retention score (PRS) to evaluate model performance; a brief metric sketch follows this list.
- Hyperparameters: Set the corruption intensity to 0.5 to induce significant but realistic performance degradation.
- Ablation Studies: Evaluate the impact of different corruption types and mitigation strategies on model performance.
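As referenced in the metrics item above, here is a short sketch of the three metrics; SPL follows its standard definition in embodied navigation, PRS follows the corrupted-over-clean description used in this benchmark, and the variable names are illustrative.

```python
# Hedged sketch of SR, SPL, and PRS over a set of evaluation episodes.
def success_rate(successes: list) -> float:
    """SR: fraction of episodes in which the agent reaches the goal."""
    return sum(successes) / len(successes)

def spl(successes: list, shortest: list, taken: list) -> float:
    """SPL: success weighted by the ratio of shortest-path length to path length taken."""
    terms = [s * (l / max(p, l)) for s, l, p in zip(successes, shortest, taken)]
    return sum(terms) / len(terms)

def prs(corrupted_score: float, clean_score: float) -> float:
    """PRS: share of clean-condition performance retained under corruption."""
    return corrupted_score / clean_score if clean_score > 0 else 0.0
```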
Results
The experimental results show that:
- Under RGB image corruptions, RGB-only agents (e.g., Uni-NaVid and NaVid) are penalized more heavily than depth-involved or language-conditioned methods. Black-out and foreign-object corruptions reduce the success rate of RGB-only agents by 22% and 13%, respectively.
- Under depth corruptions, Gaussian noise is the most destructive: L3MVN's success rate collapses from 50% to 2%, and VLFM similarly drops from 50% to 0%.
- Under instruction corruptions, ETPNav, NaVid, and Uni-NaVid experience success rate declines of 28%, 12%, and 21%, respectively, under random masking.
- Data augmentation improves model performance under corruption to varying degrees, with per-episode augmentation performing better under RGB and depth corruptions; the sketch after this list contrasts per-episode and per-frame sampling.
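To illustrate the last finding, the sketch below contrasts per-episode corruption sampling (one corruption held fixed for the whole episode) with per-frame resampling; the helper signatures are assumptions for illustration only.

```python
# Hedged sketch: two augmentation schedules that differ only in when the corruption is sampled.
import random

def per_episode_corruption(frames, corruptions, intensity=0.5):
    """Sample one corruption per episode and apply it to every frame (temporally consistent)."""
    corrupt = random.choice(corruptions)
    return [corrupt(f, intensity) for f in frames]

def per_frame_corruption(frames, corruptions, intensity=0.5):
    """Resample the corruption independently for every frame (temporally inconsistent)."""
    return [random.choice(corruptions)(f, intensity) for f in frames]
```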
Applications
The findings from NavTrust have potential applications in several fields:
- Autonomous Driving: Improve the robustness of autonomous driving systems under adverse weather and lighting conditions, enhancing their navigation capabilities in complex environments.
- Service Robots: Enhance the navigation capabilities of service robots in home and commercial environments, particularly in complex and dynamic settings.
- Drone Navigation: Improve drone navigation performance in complex terrains and variable environments, supporting more application scenarios such as agricultural monitoring and disaster relief.
Limitations & Outlook
The NavTrust study has the following limitations:
- The evaluation primarily focuses on simulated environments; although the base models were deployed on a real robot, further validation in more real-world scenarios is needed.
- The effectiveness of mitigation strategies varies across models and corruption types, potentially requiring customization for specific applications.
- The current benchmark does not cover every possible perception and language corruption type and may need to be expanded in the future.
Plain Language (Accessible to non-experts)
Imagine you're in a kitchen trying to cook, and the kitchen is filled with various tools and ingredients. Embodied navigation is like having a smart robot chef that needs to find and use the right tools and ingredients based on your instructions. However, sometimes the kitchen lighting is poor, or your instructions aren't clear, which is like adding noise and interference to the robot chef's vision and hearing. The NavTrust study is like a test kitchen that deliberately creates various lighting and instruction interferences to test how the robot chef performs in these situations. Through these tests, we can find out where the robot chef is likely to make mistakes and find ways to improve it. Just like in the kitchen, we can improve the robot chef's performance by adjusting the lighting, using clearer instructions, or adding new features to the robot chef. The NavTrust study helps us better understand how robots perform in complex environments and provides directions for future improvements.
ELI14 (Explained like you're 14)
Hey there! Did you know that scientists are working on something called embodied navigation? It's like helping robots find their way through a maze. Imagine you're playing a maze game, but this maze has lots of obstacles, like dim lighting or unclear maps. Scientists found that robots can get lost in such mazes. So, they designed something called the NavTrust test, which adds all sorts of challenges to the maze to see if the robot can still find its way out. Through these tests, scientists discovered where robots are likely to make mistakes and figured out ways to improve them. Just like when you face challenges in a game and find ways to level up your skills, scientists are working hard to make robots smarter and more reliable. In the future, robots might help us in many places, like doing chores at home or working in factories. Isn't that cool?
Glossary
Embodied Navigation
Embodied navigation refers to the ability of robots to autonomously move in complex environments, often relying on vision and language instructions.
In the paper, embodied navigation includes both Vision-Language Navigation and Object-Goal Navigation tasks.
Vision-Language Navigation
Vision-Language Navigation is a task where robots navigate by following natural language instructions.
In the paper, Vision-Language Navigation is one of the main tasks of embodied navigation.
Object-Goal Navigation
Object-Goal Navigation is a task where robots navigate to a specified target object.
In the paper, Object-Goal Navigation is another main task of embodied navigation.
Robustness
Robustness refers to the ability of a system to maintain stable performance in the face of input corruptions or uncertainties.
In the paper, robustness is a key metric for evaluating the performance of embodied navigation systems.
RGB Corruption
RGB corruption refers to interference with visual inputs (e.g., images), such as blurring, low lighting, and noise.
In the paper, RGB corruption is used to evaluate the visual robustness of embodied navigation systems.
Depth Corruption
Depth corruption refers to interference with depth sensor data, such as Gaussian noise, data loss, and multipath interference.
In the paper, depth corruption is used to evaluate the depth perception robustness of embodied navigation systems.
Instruction Variation
Instruction variation refers to interference with language instructions, such as masking, stylistic variation, and malicious prompts.
In the paper, instruction variation is used to evaluate the language robustness of embodied navigation systems.
Data Augmentation
Data augmentation refers to methods that improve model robustness by transforming or expanding training data.
In the paper, data augmentation is used as one of the mitigation strategies to enhance the robustness of embodied navigation systems.
Knowledge Distillation
Knowledge distillation is a method where knowledge from a large model is transferred to a smaller model to improve its performance.
In the paper, knowledge distillation is used as one of the mitigation strategies to enhance the robustness of embodied navigation systems.
Adapter Tuning
Adapter tuning refers to inserting lightweight modules into specific layers of a model to improve its robustness.
In the paper, adapter tuning is used as one of the mitigation strategies to enhance the robustness of embodied navigation systems.
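As an illustration of what such a module might look like, here is a minimal bottleneck adapter in PyTorch; where it is inserted and its hidden width are design choices not specified here.

```python
# Hedged sketch of a bottleneck adapter; only these parameters train, the backbone stays frozen.
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)  # project down to a small bottleneck
        self.up = nn.Linear(bottleneck, dim)    # project back to the backbone width
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))  # residual connection keeps the original signal
```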
LLM Fine-tuning
Large Language Model fine-tuning is a method where a pre-trained large language model is fine-tuned to improve its performance on specific tasks.
In the paper, LLM fine-tuning is used as one of the mitigation strategies to enhance the robustness of embodied navigation systems.
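One common way to realize this is parameter-efficient fine-tuning with LoRA adapters, sketched below with the Hugging Face peft library; the checkpoint name and target modules are placeholders, and the paper's exact fine-tuning recipe may differ.

```python
# Hedged sketch of LoRA-based LLM fine-tuning (checkpoint and modules are placeholders).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("your-base-llm")  # placeholder checkpoint name
config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], lora_dropout=0.05)
model = get_peft_model(model, config)   # wraps the model; only LoRA weights are trainable
model.print_trainable_parameters()      # confirms the small trainable-parameter footprint
```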
Success Rate
Success rate is the proportion of tasks successfully completed by a model in an experiment.
In the paper, success rate is used as one of the metrics to evaluate the performance of embodied navigation systems.
Success-weighted Path Length
Success-weighted path length is a normalized metric that balances task completion with navigation efficiency.
In the paper, success-weighted path length is used as one of the metrics to evaluate the performance of embodied navigation systems.
Performance Retention Score
Performance retention score is the proportion of a model's performance under corruption conditions relative to its performance under clean conditions.
In the paper, performance retention score is used as one of the metrics to evaluate the robustness of embodied navigation systems.
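In formulas (SPL per its standard definition in embodied navigation; PRS per the description above):

$$\mathrm{SPL} = \frac{1}{N}\sum_{i=1}^{N} S_i \,\frac{\ell_i}{\max(p_i, \ell_i)}, \qquad \mathrm{PRS} = \frac{m_{\mathrm{corrupted}}}{m_{\mathrm{clean}}}$$

where $S_i$ indicates success on episode $i$, $\ell_i$ is the shortest-path length, $p_i$ is the path length the agent actually traversed, and $m$ is a base metric (e.g., SR or SPL) measured under corrupted versus clean inputs.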
Open Questions (Unanswered questions from this research)
1. Existing embodied navigation systems perform poorly when faced with complex perception and language input corruptions, especially under common real-world conditions such as lighting changes, noise interference, and language variations. These corruptions can cause significant performance degradation, limiting applicability in real-world scenarios. Future research needs to develop more effective mitigation strategies and validate them in more real-world scenarios.
2. Although NavTrust provides a unified framework for evaluating the robustness of embodied navigation systems under various input corruptions, the current benchmark does not cover every possible perception and language corruption type. Future work may need to expand its scope and explore how to apply these strategies across different embodied navigation tasks.
3. The effectiveness of mitigation strategies varies across models and corruption types, potentially requiring customization for specific applications. Future research could explore how to optimize these strategies for specific application scenarios to improve robustness and applicability.
4. The current study primarily focuses on simulated environments; although the base models were deployed on a real robot, further validation in more real-world scenarios is needed, along with more effective mitigation strategies.
5. Performance of embodied navigation systems in multilingual settings remains a challenge. Although ETPNav performs well under multilingual supervision, other models struggle with language switches. Future research could explore how to improve robustness and applicability in multilingual environments.
Applications
Immediate Applications
Autonomous Driving
NavTrust's findings can be used to improve the robustness of autonomous driving systems under adverse weather and lighting conditions, enhancing their navigation capabilities in complex environments.
Service Robots
By enhancing service robots' navigation capabilities in home and commercial environments, especially in complex and dynamic settings, NavTrust's findings can drive the application of service robots in more scenarios.
Drone Navigation
NavTrust's findings can improve drone navigation performance in complex terrains and variable environments, supporting more application scenarios such as agricultural monitoring and disaster relief.
Long-term Vision
Smart Cities
With advancements in embodied navigation technology, future smart cities could achieve more efficient traffic management and logistics distribution, driving urban smart development.
Human-Robot Interaction
By improving the robustness and applicability of embodied navigation systems, future human-robot interactions could become more natural and efficient, driving the adoption of smart assistants and service robots.
Abstract
There are two major categories of embodied navigation: Vision-Language Navigation (VLN), where agents navigate by following natural language instructions; and Object-Goal Navigation (OGN), where agents navigate to a specified target object. However, existing work primarily evaluates model performance under nominal conditions, overlooking the potential corruptions that arise in real-world settings. To address this gap, we present NavTrust, a unified benchmark that systematically corrupts input modalities, including RGB, depth, and instructions, in realistic scenarios and evaluates their impact on navigation performance. To the best of our knowledge, NavTrust is the first benchmark that exposes embodied navigation agents to diverse RGB-Depth corruptions and instruction variations in a unified framework. Our extensive evaluation of seven state-of-the-art approaches reveals substantial performance degradation under realistic corruptions, which highlights critical robustness gaps and provides a roadmap toward more trustworthy embodied navigation systems. Furthermore, we systematically evaluate four distinct mitigation strategies to enhance robustness against RGB-Depth and instruction corruptions. Our base models include Uni-NaVid and ETPNav. We deployed them on a real mobile robot and observed improved robustness to corruptions. The project website is: https://navtrust.github.io.
References (20)
Room-Across-Room: Multilingual Vision-and-Language Navigation with Dense Spatiotemporal Grounding
Alexander Ku, Peter Anderson, Roma Patel et al.
Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments
Peter Anderson, Qi Wu, Damien Teney et al.
PARTNR: A Benchmark for Planning and Reasoning in Embodied Multi-agent Tasks
Matthew Chang, Gunjan Chhablani, Alexander Clegg et al.
Robustness of Embodied Point Navigation Agents
Frano Rajič
Matterport3D: Learning from RGB-D Data in Indoor Environments
Angel X. Chang, Angela Dai, T. Funkhouser et al.
Physics-Based Noise Modeling for Extreme Low-Light Photography
Kaixuan Wei, Ying Fu, Yinqiang Zheng et al.
On the robustness of multimodal language model towards distractions
Ming Liu, Hao Chen, Jindong Wang et al.
RobustNav: Towards Benchmarking Robustness in Embodied Navigation
Prithvijit Chattopadhyay, Judy Hoffman, Roozbeh Mottaghi et al.
VLFM: Vision-Language Frontier Maps for Zero-Shot Semantic Navigation
Naoki Yokoyama, Sehoon Ha, Dhruv Batra et al.
Noise Analysis and Modeling of the PMD Flexx2 Depth Camera for Robotic Applications
Yuke Cai, Davide Plozza, Steven Marty et al.
Habitat: A Platform for Embodied AI Research
M. Savva, Abhishek Kadian, Oleksandr Maksymets et al.
L3MVN: Leveraging Large Language Models for Visual Target Navigation
Bangguo Yu, H. Kasaei, M. Cao
ETPNav: Evolving Topological Planning for Vision-Language Navigation in Continuous Environments
Dong An, H. Wang, Wenguan Wang et al.
Beyond the Nav-Graph: Vision-and-Language Navigation in Continuous Environments
Jacob Krantz, Erik Wijmans, Arjun Majumdar et al.
Waypoint Models for Instruction-guided Navigation in Continuous Environments
Jacob Krantz, Aaron Gokaslan, Dhruv Batra et al.
ON as ALC: Active Loop Closing Object Goal Navigation
Daiki Iwata, Kanji Tanaka, Shoya Miyazaki et al.
Auxiliary Tasks and Exploration Enable ObjectGoal Navigation
Joel Ye, Dhruv Batra, Abhishek Das et al.
Modeling and correction of multipath interference in time of flight cameras
David Jiménez, Daniel Pizarro-Perez, M. Mazo et al.
Multipath Interference Compensation in Time-of-Flight Camera Images
S. Fuchs
Habitat-Matterport 3D Dataset (HM3D): 1000 Large-scale 3D Environments for Embodied AI
Santhosh K. Ramakrishnan, Aaron Gokaslan, Erik Wijmans et al.