GMBFormer: An NDVI-Guided Global Memory Bank Transformer for Urban Green-Space Extraction from Ultra-High-Resolution Imagery

TL;DR

GMBFormer adds an NDVI-gated global memory bank to a SegFormer-based Transformer for urban green-space extraction, reaching a mean IoU of 89.25% on the Chengdu UHR dataset.

cs.CV 2026-06-05

Hao Lei, Xi Cheng, Chenlu Shu, Zhiheng Chen, Zhengjie Duan, Haoyu Wang, Zhanfeng Shen


Remote Sensing Segmentation Transformer Global Memory NDVI-guided Urban Green Space

Key Findings

Methodology

GMBFormer builds on the SegFormer architecture and feeds only RGB channels through the hierarchical Transformer backbone and decoder. NDVI is kept out of the backbone and serves instead as a physics-informed gate that admits high-confidence vegetation descriptors into a compact, fixed-capacity global memory bank. The bank is updated with an exponential moving average (EMA), that is, by momentum updates rather than by backpropagated gradients, which keeps the stored prototypes stable. During both training and inference, the current image patch queries the stored prototypes through cross-attention and retrieves semantically similar vegetation features from non-contiguous patches. This similarity-driven retrieval replaces adjacency-based feature propagation and enables cross-region semantic reuse. The framework is evaluated on a self-constructed Chengdu Ultra-High-Resolution (UHR) dataset with 7,700 labeled 512 x 512 patches and on two reduced-label settings derived from the ISPRS Potsdam dataset, and it improves on the controlled SegFormer-B4 baseline in each setting.
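The gating and momentum update described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the class name `MemoryBank`, the admission threshold of 0.3, the use of a mean NDVI value per descriptor, and the rule of updating the most similar slot once the bank is full are all hypothetical choices. Only the NDVI formula itself, (NIR - Red) / (NIR + Red), is standard.

```python
import math

def ndvi(nir, red, eps=1e-6):
    """Standard NDVI: (NIR - Red) / (NIR + Red); eps avoids division by zero."""
    return (nir - red) / (nir + red + eps)

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)

class MemoryBank:
    """Fixed-capacity prototype store with EMA updates (illustrative only)."""

    def __init__(self, capacity=64, momentum=0.99):
        self.capacity = capacity
        self.momentum = momentum
        self.slots = []

    def admit(self, descriptor, mean_ndvi, threshold=0.3):
        # Gate: reject descriptors whose NDVI confidence is too low.
        # The threshold value is a hypothetical choice, not from the paper.
        if mean_ndvi < threshold:
            return False
        # While there is free capacity, store the descriptor as a new prototype.
        if len(self.slots) < self.capacity:
            self.slots.append(list(descriptor))
            return True
        # Assumed slot selection: EMA-update the most similar stored prototype.
        idx = max(range(len(self.slots)),
                  key=lambda i: cosine(self.slots[i], descriptor))
        a = self.momentum
        self.slots[idx] = [a * m + (1 - a) * d
                           for m, d in zip(self.slots[idx], descriptor)]
        return True

bank = MemoryBank(capacity=2, momentum=0.99)
bank.admit([1.0, 0.0], mean_ndvi=ndvi(0.6, 0.1))  # vegetated: admitted
bank.admit([0.0, 1.0], mean_ndvi=ndvi(0.5, 0.1))  # vegetated: admitted
bank.admit([0.5, 0.5], mean_ndvi=ndvi(0.2, 0.3))  # negative NDVI: rejected
bank.admit([1.0, 0.2], mean_ndvi=0.8)             # bank full: EMA update
print(len(bank.slots))  # 2
```

With a momentum of 0.99, each admitted descriptor moves its prototype by only 1% of the difference, which is why the stored prototypes change slowly even when individual patches are noisy.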

Key Results

On the Chengdu UHR validation set, GMBFormer achieved a mean IoU of 89.25% and a mean Dice of 94.31%, compared with 87.40% and 92.83% for the SegFormer-B4 baseline, a gain of 1.85 and 1.48 percentage points, respectively. These gains are attributed to NDVI-guided memory retrieval improving vegetation recognition.
In the ISPRS Potsdam binary classification task, GMBFormer obtained a green space IoU of 90.45%, outperforming other models such as Swin-UPerNet and DeepLabV3, indicating strong generalization across different datasets and spatial resolutions.
Ablation studies indicate that NDVI gating, memory capacity (S=64), and EMA momentum (α=0.99) each contribute to the gain over the baseline and jointly shape the final performance.
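For readers unfamiliar with the two metrics reported above, the following minimal sketch computes IoU and Dice for binary masks. It assumes that the mean is taken over two classes (green space and background); the source does not spell out the averaging convention, so treat that as an assumption, and the function names are illustrative.

```python
def binary_iou_dice(pred, target):
    """IoU and Dice for the positive class of two flat 0/1 masks."""
    tp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 1)
    union = tp + fp + fn
    if union == 0:
        return 1.0, 1.0  # convention chosen here: empty vs. empty is perfect
    iou = tp / union
    dice = 2 * tp / (2 * tp + fp + fn)
    return iou, dice

def mean_iou_dice(pred, target):
    """Average the metrics over class 1 (green space) and class 0 (background)."""
    fg = binary_iou_dice(pred, target)
    bg = binary_iou_dice([1 - p for p in pred], [1 - t for t in target])
    return (fg[0] + bg[0]) / 2, (fg[1] + bg[1]) / 2

pred = [1, 1, 1, 0, 0, 0, 0, 0]
target = [1, 1, 0, 1, 0, 0, 0, 0]
miou, mdice = mean_iou_dice(pred, target)
print(round(miou, 4), round(mdice, 4))  # 0.5833 0.7333
```

For a single class, Dice = 2·IoU / (1 + IoU), so Dice is never lower than IoU. This is consistent with the reported pairs, where each mean Dice exceeds the corresponding mean IoU.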

Significance

This work proposes a physics-informed, similarity-driven global memory bank for remote sensing semantic segmentation. It targets cross-region semantic reuse, which matters for large-scale urban green-space monitoring because patch-wise processing otherwise treats each crop in isolation. The approach improves the continuity and accuracy of green-space maps, especially in fragmented and shadowed areas, which supports urban ecological assessment and planning. Its low computational overhead and high accuracy make it a candidate for deployment in city management systems. The methodology also points toward multi-modal, larger-capacity memory-based models for complex, dynamic urban environments.

Technical Contribution

The core technical contribution is the use of NDVI as a physics-informed gate that controls which vegetation prototypes enter a fixed-capacity global memory bank, coupled with Transformer-based cross-attention retrieval. This design decouples physical vegetation confidence from visual appearance learning and so avoids entangling the two signals. The memory bank is updated via EMA rather than by gradients, which keeps it stable. Retrieval improves recognition of non-contiguous green space, which adjacency-based propagation cannot link. The architecture remains end-to-end trainable while improving cross-region semantic reuse.
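The retrieval step can be illustrated with scaled dot-product cross-attention, in which patch tokens act as queries and stored prototypes act as keys and values. The sketch below is a simplification under assumptions: it omits the learned query, key, and value projections a real Transformer layer would have, and the residual addition used to fuse the retrieved response is an assumed choice, since the source only says the response is "integrated". The function names are illustrative.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def retrieve(queries, memory):
    """Cross-attention of query tokens over memory prototypes.

    Keys and values are the raw prototypes; learned projections are omitted.
    """
    dim = len(memory[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for q in queries:
        # Similarity of this token to every stored prototype.
        scores = [scale * sum(a * b for a, b in zip(q, m)) for m in memory]
        weights = softmax(scores)
        # Weighted sum of prototypes: the retrieved response.
        retrieved = [sum(w * m[j] for w, m in zip(weights, memory))
                     for j in range(dim)]
        # Assumed fusion: residual addition of the retrieved response.
        outputs.append([a + r for a, r in zip(q, retrieved)])
    return outputs

memory = [[1.0, 0.0], [0.0, 1.0]]
out = retrieve([[1.0, 1.0]], memory)
print(out)  # [[1.5, 1.5]]
```

Because the attention weights depend only on feature similarity, a token can draw on a prototype that was written from a distant patch, which is the cross-region reuse the section describes.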

Novelty

The work is presented as the first to embed NDVI as a physics-informed gate within a Transformer-based memory retrieval framework for remote sensing segmentation. Unlike prior multimodal strategies that concatenate or fuse inputs at the pixel or feature level, GMBFormer explicitly separates physical vegetation confidence from visual appearance, which enables more reliable cross-region semantic reuse. Using a fixed-capacity, EMA-updated memory bank to store and retrieve prototypes for ultra-high-resolution urban imagery is a further new element, and it points toward scalable, physically grounded semantic segmentation.

Limitations

The reliance on NDVI as the sole physical index may limit performance under adverse conditions such as cloud cover, shadows, or extreme illumination, where NDVI signals are unreliable. This could lead to contamination of the memory bank with false positives or missed detections.
The fixed capacity of the memory bank (S=64) constrains the diversity of stored prototypes, which might be insufficient for highly heterogeneous or large-scale scenes, potentially reducing cross-region generalization.
The framework currently relies on a single physical index, NDVI; integrating additional modalities such as further multispectral bands or LiDAR data could improve robustness but would require more complex fusion strategies and more computational resources.
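The fixed capacity noted above limits prototype diversity, but it is also what bounds retrieval cost. The following back-of-the-envelope sketch counts multiply-accumulate operations for attention. The token count (a stride-4 feature map of a 512 x 512 patch) and the channel width of 64 are hypothetical values chosen for illustration, and the dense all-pairs figure is only a yardstick: SegFormer itself uses a reduced-cost form of self-attention, and the source does not state at which stage the memory is queried.

```python
def attention_macs(num_queries, num_keys, dim):
    """Multiply-accumulates for QK^T plus the weighted sum over values."""
    return 2 * num_queries * num_keys * dim

tokens = (512 // 4) ** 2  # 16384 tokens, assuming a stride-4 feature map
dim = 64                  # hypothetical channel width

memory_cost = attention_macs(tokens, 64, dim)     # S = 64 prototypes
dense_cost = attention_macs(tokens, tokens, dim)  # all-pairs, for comparison

print(dense_cost // memory_cost)  # 256
```

Retrieval cost grows linearly with the number of tokens and with S, so doubling the memory capacity doubles the retrieval cost rather than changing its order of growth.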

Future Work

Future research will explore adaptive memory capacity strategies, multi-modal data integration (e.g., multispectral, LiDAR), and dynamic gating mechanisms to improve robustness under challenging environmental conditions. Additionally, scaling the memory bank and optimizing retrieval efficiency will be key to deploying the model in real-time urban monitoring systems. Extending the framework to temporal data for change detection and dynamic green space mapping is another promising direction. These advancements aim to realize a comprehensive, scalable urban ecological monitoring platform that can adapt to diverse cityscapes and environmental variations.

AI Executive Summary

Urban green spaces are vital for ecological balance, air quality, and residents’ well-being. Accurate mapping of these areas using remote sensing imagery, especially ultra-high-resolution (UHR) data, has become increasingly important for urban planning and environmental management. Traditional segmentation methods, relying on pixel-based classification or object-based analysis, struggle to handle the complexity and fragmentation inherent in dense urban environments. Deep learning models, such as convolutional neural networks (CNNs) and Transformer architectures like SegFormer, have advanced the field but still face significant challenges in achieving consistent semantic reuse across spatially separated but visually similar vegetation patches.

The core problem lies in the patch-wise processing paradigm, which discards spatial and semantic continuity after each crop. This leads to fragmented green space maps, with limited ability to recognize recurring vegetation patterns across different city regions. Moreover, visual appearance alone can be misleading due to shadows, illumination differences, and artificial surfaces that mimic vegetation. While NDVI provides a physical measure of vegetation confidence, naive integration with RGB features often entangles appearance and physical signals, reducing interpretability and robustness.

To address these issues, Hao Lei et al. propose GMBFormer, a framework that combines Transformer-based segmentation with a physics-informed, similarity-driven global memory bank. The key idea is to use NDVI as a gate that selectively admits high-confidence vegetation descriptors into a fixed-capacity memory, which is updated via EMA to ensure stability. During training and inference, each image patch queries this memory through cross-attention and retrieves prototypes that are semantically similar but spatially distant. This mechanism lets the model recognize recurring vegetation patterns across disconnected urban scenes, improving the continuity and accuracy of green-space maps.

Experiments on a self-constructed Chengdu UHR dataset and the public ISPRS Potsdam dataset demonstrate the effectiveness of GMBFormer. The model achieves a mean IoU of 89.25% on Chengdu, outperforming the SegFormer-B4 baseline (87.40%) by 1.85 percentage points. On Potsdam, it attains a green space IoU of 90.45%, showing strong generalization. Ablation studies indicate that NDVI-guided admission, memory capacity, and EMA momentum all matter for the final performance. The approach maintains low computational overhead, making it suitable for large-scale urban monitoring.

This research marks a significant step forward in remote sensing semantic segmentation, offering a scalable, physically grounded solution for cross-region green space recognition. It opens new avenues for integrating physical indices into deep models, fostering more robust and interpretable urban ecological analysis. Future directions include multi-modal fusion, adaptive memory management, and real-time deployment, aiming to support smarter, greener cities worldwide.

Deep Dive

Abstract

Urban green-space extraction from ultra-high-resolution (UHR) imagery is commonly performed patch by patch, which limits semantic reuse among spatially separated but visually similar vegetation patterns. Directly injecting the Normalized Difference Vegetation Index (NDVI) into red-green-blue (RGB) backbones can also blur the roles of visual appearance learning and physical vegetation confidence. We propose GMBFormer, a SegFormer-based framework that replaces adjacency-driven feature propagation with selective, similarity-driven prototype retrieval. Only RGB channels enter the backbone and decoder, while NDVI is decoupled as a physics-informed gate that admits high-confidence vegetation descriptors into a compact global memory bank through momentum updates. During training and inference, the current patch queries stored prototypes through memory-mediated cross-attention, and the retrieved response is integrated with bounded overhead. Experiments use a self-constructed Chengdu UHR dataset with 7,700 labeled 512 x 512 patches and two reduced-label settings derived from the public International Society for Photogrammetry and Remote Sensing (ISPRS) Potsdam dataset. Under the same training and evaluation protocol, GMBFormer obtains mean intersection over union (mIoU)/mean Dice (mDice) scores of 89.25%/94.31%, 92.17%/95.92%, and 83.72%/90.86%, respectively, improving the controlled SegFormer-B4 baseline in each setting. Ablation studies indicate that decoupled NDVI admission, memory retrieval, capacity, and momentum jointly shape the final performance.

