EndoVGGT: GNN-Enhanced Depth Estimation for Surgical 3D Reconstruction

TL;DR

EndoVGGT enhances surgical 3D reconstruction with DeGAT, improving PSNR by 24.6% and SSIM by 9.1%.

cs.CV · Advanced · 2026-03-26
Falong Fan, Yi Xie, Arnis Lektauers, Bo Liu, Jerzy Rozenblit
depth estimation · GNN · surgical reconstruction · 3D reconstruction · cross-dataset generalization

Key Findings

Methodology

The EndoVGGT framework employs a Deformation-aware Graph Attention (DeGAT) module to dynamically construct feature-space semantic graphs, capturing long-range correlations among coherent tissue regions. Unlike static spatial neighborhoods, DeGAT enables robust propagation of structural cues across occlusions, enforcing global consistency and improving non-rigid deformation recovery. Extensive experiments on the SCARED dataset validate its efficacy in surgical scene 3D reconstruction.
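The core idea of the dynamic graph can be illustrated with a minimal sketch: neighbours are chosen by feature similarity rather than spatial adjacency, so coherent tissue patches separated by an instrument can still be connected. This is a simplified stand-in for DeGAT's learned graph construction; the function name and the cosine-kNN rule are illustrative assumptions, not the paper's code.

```python
import numpy as np

def build_feature_graph(feats, k=4):
    """Build a k-nearest-neighbour graph in feature space.

    feats: (N, D) array of per-patch feature vectors.
    Returns an (N, k) index array: each row lists the k most
    similar patches by cosine similarity, regardless of spatial
    position -- unlike a fixed spatial-grid neighbourhood.
    """
    f = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    sim = f @ f.T                    # (N, N) cosine similarity
    np.fill_diagonal(sim, -np.inf)   # exclude self-loops
    return np.argsort(-sim, axis=1)[:, :k]

# Toy demo: patches 0 and 3 share a feature pattern, as do 1 and 2,
# so each patch links to its feature-space twin, not a spatial neighbour.
feats = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [0.1, 1.0],
                  [1.0, 0.1]])
nbrs = build_feature_graph(feats, k=1)
print(nbrs.ravel())
```

Because the graph is rebuilt from features at inference time, its topology adapts to each scene rather than being fixed in advance, which is what the text above contrasts with static spatial neighborhoods.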

Key Results

  • On the SCARED dataset, EndoVGGT improves PSNR by 24.6%, enhancing reconstruction fidelity, and SSIM by 9.1%, indicating substantially better structural consistency.
  • In terms of cross-dataset generalization, EndoVGGT exhibits strong zero-shot generalization to unseen SCARED and EndoNeRF domains, confirming that the DeGAT module learns domain-agnostic geometric priors.
  • Ablation studies confirm that feature-level integration of the DeGAT module significantly enhances reconstruction performance, especially in handling non-rigid scenarios.
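The two headline metrics can be computed as follows. This is a minimal NumPy sketch: the PSNR formula is standard, while the SSIM here uses a single global window for brevity (the standard metric averages the same formula over sliding Gaussian windows).

```python
import numpy as np

def psnr(x, y, max_val=1.0):
    """Peak Signal-to-Noise Ratio in dB (higher is better)."""
    mse = np.mean((x - y) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

def ssim_global(x, y, max_val=1.0):
    """Simplified global SSIM with the standard constants C1, C2."""
    C1, C2 = (0.01 * max_val) ** 2, (0.03 * max_val) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + C1) * (2 * cov + C2)) / \
           ((mx ** 2 + my ** 2 + C1) * (vx + vy + C2))

# Demo: a ground-truth image vs. a slightly noisy reconstruction.
rng = np.random.default_rng(0)
gt = rng.random((32, 32))
noisy = np.clip(gt + 0.05 * rng.standard_normal((32, 32)), 0.0, 1.0)
print(round(psnr(gt, noisy), 1), round(ssim_global(gt, noisy), 3))
```

A 24.6% PSNR gain is a relative improvement of the dB value, so its absolute size depends on the baseline score being compared against.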

Significance

The EndoVGGT framework addresses the geometric continuity fragmentation issues faced by traditional fixed-topology methods when dealing with low-texture surfaces, specular highlights, and instrument occlusions. This method not only improves the accuracy of 3D reconstruction in surgical scenes but also demonstrates strong cross-dataset generalization capabilities. This research provides more accurate geometric information for surgical robotic perception, advancing surgical navigation and training.

Technical Contribution

EndoVGGT's technical contribution lies in its dynamic feature-space modeling capability, enabling strong zero-shot generalization without relying on per-scene optimization. The DeGAT module dynamically constructs semantic graphs to capture long-range correlations across occlusions, significantly enhancing non-rigid deformation recovery. Compared with NeRF- and Gaussian-Splatting-based methods, which must be re-fitted for each new scene, it offers a more efficient feed-forward solution.

Novelty

EndoVGGT is the first to introduce the DeGAT module in surgical 3D reconstruction, achieving zero-shot cross-dataset generalization through dynamic feature-space modeling. Unlike traditional methods, it does not rely on scene optimization, maintaining high fidelity in complex non-rigid scenarios.

Limitations

  • EndoVGGT may experience performance degradation in extremely complex surgical scenes, particularly with significant occlusions and rapid movements.
  • The method still has high computational complexity, which may limit its use in real-time applications.
  • Further optimization may be required in specific surgical scenarios to enhance accuracy.

Future Work

Future research directions include further optimizing EndoVGGT's computational efficiency to meet real-time application demands. Additionally, exploring its application to other types of surgical scenarios will validate its applicability in broader fields. Extensions towards temporal consistency and robotic navigation are also promising directions.

AI Executive Summary

In modern surgical practice, accurate three-dimensional reconstruction is crucial for surgical robotic perception. However, existing methods often face challenges in maintaining geometric continuity when dealing with low-texture surfaces, specular highlights, and instrument occlusions, limiting their application in surgical scenes.

To address these issues, this paper proposes the EndoVGGT framework, which introduces a Deformation-aware Graph Attention (DeGAT) module to dynamically construct feature-space semantic graphs, capturing long-range correlations among coherent tissue regions. Unlike traditional fixed-topology methods, the DeGAT module enables robust propagation of structural cues across occlusions, improving non-rigid deformation recovery.

Extensive experiments on the SCARED dataset demonstrate that EndoVGGT significantly outperforms existing methods in terms of reconstruction fidelity and structural consistency. Specifically, PSNR increased by 24.6%, and SSIM improved by 9.1%, validating its effectiveness in surgical scene 3D reconstruction.

Moreover, EndoVGGT exhibits strong zero-shot cross-dataset generalization capabilities, performing well in unseen SCARED and EndoNeRF domains. This indicates that the DeGAT module learns domain-agnostic geometric priors, providing more accurate geometric information for surgical robotic perception.

Despite its impressive performance in handling complex non-rigid scenarios, EndoVGGT's computational complexity remains high, potentially limiting its use in real-time applications. Future research directions include further optimizing computational efficiency and exploring its applicability to other types of surgical scenarios.

Deep Analysis

Background

Three-dimensional reconstruction techniques play a critical role in surgical navigation, robotic assistance, and skill assessment. Early geometric and deep learning pipelines laid the foundation for this field, but recent research trends have shifted towards implicit neural representations like NeRF and explicit 3D Gaussian Splatting. However, these methods face challenges in surgical scenes due to intrinsic non-rigidity, soft-tissue deformation, and dynamic instrument occlusion, limiting their large-scale application. To address these efficiency bottlenecks, recent developments have introduced scene-agnostic, feed-forward alternatives based on large reconstruction models.

Core Problem

Existing large reconstruction models are primarily trained on rigid, object-centric datasets, assuming static geometry and stable illumination. In contrast, surgical scenes feature intrinsic non-rigidity, soft-tissue deformation, and dynamic instrument occlusion. Direct deployment of general-domain models results in artifacts, including disrupted tissue topology and depth errors. Additionally, surgical approaches based on NeRF or Gaussian Splatting rely on per-scene optimization, requiring repeated fitting for each new case, limiting their large-scale generalization across diverse procedures.

Innovation

The EndoVGGT framework introduces the DeGAT module to dynamically construct feature-space semantic graphs, capturing long-range correlations among coherent tissue regions. Unlike traditional fixed-topology methods, the DeGAT module enables robust propagation of structural cues across occlusions, improving non-rigid deformation recovery. Extensive experiments on the SCARED dataset validate its efficacy in surgical scene 3D reconstruction.

Methodology

  • DeGAT Module: Dynamically constructs feature-space semantic graphs, capturing long-range correlations among coherent tissue regions.
  • Feature-Space Modeling: Enables robust propagation of structural cues across occlusions through dynamic graph construction in feature space.
  • Experimental Validation: Extensive experiments on the SCARED dataset validate its efficacy in surgical scene 3D reconstruction.
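Once such a feature-space graph exists, a graph-attention aggregation step re-weights each node's neighbours before mixing their features. The sketch below is a generic single-head GAT-style update, with a plain dot-product score standing in for DeGAT's learned attention; all names are illustrative, not the paper's implementation.

```python
import numpy as np

def graph_attention(feats, nbrs):
    """Single-head graph-attention aggregation (GAT-style).

    feats: (N, D) node features; nbrs: (N, K) neighbour indices.
    Each node softmax-weights its neighbours by an attention score
    (here a dot product) and takes the weighted mean, so structural
    cues propagate along the graph edges.
    """
    out = np.empty_like(feats)
    for i in range(len(feats)):
        neigh = feats[nbrs[i]]            # (K, D) neighbour features
        scores = neigh @ feats[i]         # (K,) attention logits
        w = np.exp(scores - scores.max())
        w /= w.sum()                      # softmax weights
        out[i] = w @ neigh                # convex combination
    return out

# Demo: three patches, each attending over two graph neighbours.
feats = np.array([[1.0, 0.0],
                  [0.9, 0.1],
                  [0.0, 1.0]])
nbrs = np.array([[1, 2],
                 [0, 2],
                 [0, 1]])
out = graph_attention(feats, nbrs)
```

Because the output is a convex combination of neighbour features, similar patches reinforce each other while dissimilar ones are down-weighted, which is the mechanism behind the "robust propagation of structural cues" described above.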

Experiments

The experiments evaluate EndoVGGT on the SCARED dataset for reconstruction fidelity and structural consistency, and add a zero-shot transfer protocol to the EndoNeRF domain with no per-scene fine-tuning. EndoVGGT outperforms existing methods, improving PSNR by 24.6% and SSIM by 9.1%, and generalizes well to the unseen domains.
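The cross-dataset protocol amounts to running a frozen model over unseen datasets with no per-scene fitting and averaging a metric per dataset. A minimal harness might look like the following; `model`, the dataset names, and the metric are illustrative placeholders, not the paper's evaluation code.

```python
import numpy as np

def evaluate_zero_shot(model, datasets, metric):
    """Evaluate a frozen model on held-out datasets (zero-shot).

    datasets: {name: list of ground-truth frames}.
    The model is applied as-is -- no optimization or fine-tuning
    on the target data -- and the metric is averaged per dataset.
    """
    results = {}
    for name, frames in datasets.items():
        scores = [metric(model(f), f) for f in frames]
        results[name] = float(np.mean(scores))
    return results

# Demo with a trivial identity "model" and mean-absolute-error metric.
identity = lambda frame: frame
mae = lambda pred, gt: float(np.abs(pred - gt).mean())
data = {"EndoNeRF": [np.full((4, 4), 0.5), np.full((4, 4), 0.25)]}
res = evaluate_zero_shot(identity, data, mae)
```

The key property this protocol tests is exactly what the text claims: performance on a domain the model never optimized for, which would expose priors that are domain-specific rather than geometric.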

Results

The experimental results show that EndoVGGT significantly outperforms existing methods in terms of reconstruction fidelity and structural consistency. Specifically, PSNR increased by 24.6%, and SSIM improved by 9.1%. Additionally, EndoVGGT exhibits strong zero-shot cross-dataset generalization capabilities, performing well in unseen SCARED and EndoNeRF domains. This indicates that the DeGAT module learns domain-agnostic geometric priors, providing more accurate geometric information for surgical robotic perception.

Applications

The EndoVGGT framework can be directly applied to surgical navigation and robotic assistance, improving the accuracy of 3D reconstruction in surgical scenes. Its strong zero-shot cross-dataset generalization capabilities allow it to maintain high fidelity in unseen surgical scenarios, providing more accurate geometric information for surgical robotic perception.

Limitations & Outlook

Despite its impressive performance in handling complex non-rigid scenarios, EndoVGGT's computational complexity remains high, potentially limiting its use in real-time applications. Additionally, performance degradation may occur in extremely complex surgical scenes, particularly with significant occlusions and rapid movements. Future research directions include further optimizing computational efficiency and exploring its applicability to other types of surgical scenarios.

Plain Language (accessible to non-experts)

Imagine you're in a kitchen cooking a meal. You need to know the exact location and state of each ingredient to make a delicious dish. EndoVGGT is like a super-smart kitchen assistant that can observe every corner of the kitchen and accurately tell you the location and state of the ingredients. It not only sees what's in front of it but also predicts changes in the ingredients by analyzing past experiences. So even if some ingredients are covered by a lid, it can still accurately tell you where they are. This ability is crucial in surgery, where doctors need precise 3D information to make critical decisions. EndoVGGT, with its unique DeGAT module, provides high-precision 3D reconstruction in complex surgical scenarios, offering better visual guidance to doctors.

ELI14 (explained like you're 14)

Hey there! Imagine you're playing a super cool 3D game, and you need to know the exact position of every character and object to win. EndoVGGT is like a game cheat that helps you see every detail in the game, even if some characters are hidden behind obstacles. This ability is super important in real-life surgeries because doctors need precise 3D information to make critical decisions. EndoVGGT, with its unique DeGAT module, provides high-precision 3D reconstruction in complex surgical scenarios, giving doctors better visual guidance. Isn't that awesome?

Glossary

EndoVGGT

A framework for surgical 3D reconstruction that enhances depth estimation accuracy through the DeGAT module.

Used in the paper to improve 3D reconstruction accuracy in surgical scenes.

DeGAT

Deformation-aware Graph Attention module for dynamically constructing feature-space semantic graphs to capture long-range correlations.

Used in the EndoVGGT framework to enhance robust propagation of structural cues.

PSNR

Peak Signal-to-Noise Ratio, a metric for measuring image reconstruction quality; higher values indicate better quality.

Used to evaluate EndoVGGT's reconstruction fidelity on the SCARED dataset.

SSIM

Structural Similarity Index, a metric for measuring image structural consistency; higher values indicate better consistency.

Used to evaluate EndoVGGT's structural consistency on the SCARED dataset.

SCARED

A dataset providing realistic surgical data for validating 3D reconstruction methods.

Used in the paper to evaluate EndoVGGT's reconstruction performance.

EndoNeRF

A dataset for assessing reconstruction robustness, including scenarios with topological changes and tissue deformation.

Used to validate EndoVGGT's generalization capabilities in complex scenarios.

NeRF

Neural Radiance Fields, an implicit neural network method for scene representation.

Used as a baseline method for performance comparison with EndoVGGT.

3D Gaussian Splatting

An explicit method for real-time radiance field rendering using Gaussian distributions.

Used as a baseline method for performance comparison with EndoVGGT.

LPIPS

Learned Perceptual Image Patch Similarity, a metric for measuring perceptual image quality.

Used to evaluate EndoVGGT's reconstruction quality across different scenarios.

Zero-shot generalization

The ability of a model to maintain high performance on unseen datasets.

EndoVGGT demonstrates strong zero-shot cross-dataset generalization capabilities.

Open Questions (unanswered from this research)

  1. How can EndoVGGT's computational complexity be further reduced while maintaining high accuracy, to meet real-time application demands? The current method's complexity may limit its use in real-time settings.
  2. What causes performance degradation in extremely complex surgical scenes? Can further optimization of the DeGAT module address this?
  3. How can EndoVGGT be applied to other types of surgical scenarios? Are specific adjustments needed to adapt the model to different surgical environments?
  4. How does EndoVGGT perform under rapid movements and significant occlusions? Can incorporating temporal consistency improve robustness?
  5. Can EndoVGGT be integrated with other surgical navigation technologies to provide more comprehensive assistance, and what technical breakthroughs would that require?

Applications

Immediate Applications

Surgical Navigation

EndoVGGT can be used in surgical navigation to provide precise 3D reconstruction information, aiding doctors in making critical decisions in complex surgical scenarios.

Robotic Assistance

In robotic-assisted surgery, EndoVGGT can provide high-precision geometric information, enhancing the perception and operational accuracy of surgical robots.

Surgical Training

EndoVGGT can be used in surgical training to provide realistic surgical scene reconstruction, helping doctors improve their surgical skills.

Long-term Vision

Real-time Surgical Monitoring

In the future, EndoVGGT could be used in real-time surgical monitoring, providing real-time 3D reconstruction information to help doctors better understand the surgical process.

Cross-domain Applications

EndoVGGT's dynamic feature-space modeling capability could be extended to other fields, such as industrial inspection and autonomous driving, providing high-precision 3D reconstruction solutions.

Abstract

Accurate 3D reconstruction of deformable soft tissues is essential for surgical robotic perception. However, low-texture surfaces, specular highlights, and instrument occlusions often fragment geometric continuity, posing a challenge for existing fixed-topology approaches. To address this, we propose EndoVGGT, a geometry-centric framework equipped with a Deformation-aware Graph Attention (DeGAT) module. Rather than using static spatial neighborhoods, DeGAT dynamically constructs feature-space semantic graphs to capture long-range correlations among coherent tissue regions. This enables robust propagation of structural cues across occlusions, enforcing global consistency and improving non-rigid deformation recovery. Extensive experiments on SCARED show that our method significantly improves fidelity, increasing PSNR by 24.6% and SSIM by 9.1% over prior state-of-the-art. Crucially, EndoVGGT exhibits strong zero-shot cross-dataset generalization to the unseen SCARED and EndoNeRF domains, confirming that DeGAT learns domain-agnostic geometric priors. These results highlight the efficacy of dynamic feature-space modeling for consistent surgical 3D reconstruction.

cs.CV · cs.AI

References (20)

1. Jianyuan Wang, Minghao Chen, Nikita Karaev, et al. "VGGT: Visual Geometry Grounded Transformer." 2025.
2. Yuehao Wang, Yonghao Long, Siu Hin Fan, et al. "Neural Rendering for Stereo 3D Reconstruction of Deformable Tissues in Robotic Surgery." 2022.
3. Ethan Perez, Florian Strub, H. D. Vries, et al. "FiLM: Visual Reasoning with a General Conditioning Layer." 2017.
4. K. Ozyoruk, Guliz Irem Gokceler, Gulfize Coskun, et al. "EndoSLAM dataset and an unsupervised monocular visual odometry and depth estimation approach for endoscopic videos." 2021.
5. Djork-Arné Clevert, Thomas Unterthiner, Sepp Hochreiter. "Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)." 2015.
6. Shariq Farooq Bhat, R. Birkl, Diana Wofk, et al. "ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth." 2023.
7. Ruofeng Wei, Jiaxin Guo, Yiang Lu, et al. "Scale-aware monocular reconstruction via robot kinematics and visual data in neural radiance fields." 2024.
8. Johannes L. Schönberger, Jan-Michael Frahm. "Structure-from-Motion Revisited." 2016.
9. B. Göbel, Jonas Huurdeman, A. Reiterer, et al. "Robot-Based Procedure for 3D Reconstruction of Abdominal Organs Using the Iterative Closest Point and Pose Graph Algorithms." 2025.
10. Yifan Liu, Chenxin Li, Chen Yang, et al. "EndoGaussian: Real-time Gaussian Splatting for Dynamic Endoscopic Scene Reconstruction." 2024.
11. Ruyi Zha, Xuelian Cheng, Hongdong Li, et al. "EndoSurf: Neural Surface Reconstruction of Deformable Tissues with Stereo Endoscope Videos." 2023.
12. Yasuhide Hirohata, Maina Sogabe, Tetsuro Miyazaki, et al. "Confidence-aware self-supervised learning for dense monocular depth estimation in dynamic laparoscopic scene." 2023.
13. Beilei Cui, Mobarak Islam Hoque, Long Bai, et al. "Surgical-DINO: adapter learning of foundation models for depth estimation in endoscopic surgery." 2024.
14. Yiang Lu, Ruofeng Wei, Bin Li, et al. "Autonomous Intelligent Navigation for Flexible Endoscopy Using Monocular Depth Guidance and 3-D Shape Planning." 2023.
15. René Ranftl, Alexey Bochkovskiy, V. Koltun. "Vision Transformers for Dense Prediction." 2021.
16. Yao Yao, Zixin Luo, Shiwei Li, et al. "MVSNet: Depth Inference for Unstructured Multi-view Stereo." 2018.
17. Mengya Xu, Ziqi Guo, An-Chi Wang, et al. "A Review of 3D Reconstruction Techniques for Deformable Tissues in Robotic Surgery." 2024.
18. Isabel Funke, S. T. Mees, J. Weitz, et al. "Video-based surgical skill assessment using 3D convolutional neural networks." 2019.
19. Ryan A. Bartholomew, Haoyin Zhou, Maud Boreel, et al. "Surgical Navigation in the Anterior Skull Base Using 3-Dimensional Endoscopy and Surface Reconstruction." 2024.
20. M. Allan, J. Mcleod, Congcong Wang, et al. "Stereo Correspondence and Reconstruction of Endoscopic Data Challenge." 2021.