EndoVGGT: GNN-Enhanced Depth Estimation for Surgical 3D Reconstruction
EndoVGGT enhances surgical 3D reconstruction with DeGAT, improving PSNR by 24.6% and SSIM by 9.1%.
Key Findings
Methodology
The EndoVGGT framework employs a Deformation-aware Graph Attention (DeGAT) module to dynamically construct feature-space semantic graphs, capturing long-range correlations among coherent tissue regions. Unlike static spatial neighborhoods, DeGAT enables robust propagation of structural cues across occlusions, enforcing global consistency and improving non-rigid deformation recovery. Extensive experiments on the SCARED dataset validate its efficacy in surgical scene 3D reconstruction.
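As a rough illustration of this idea (not the paper's implementation), the NumPy sketch below builds a k-nearest-neighbor graph in *feature* space rather than over fixed spatial neighborhoods, then aggregates each node's neighbors with softmax attention weights. The function name and parameters are hypothetical.

```python
import numpy as np

def feature_space_graph_attention(feats, k=4):
    """Illustrative sketch: build a k-NN graph in feature space (a dynamic
    semantic neighborhood, not a fixed spatial one) and aggregate each
    node's neighbors with scaled dot-product attention."""
    n, d = feats.shape
    # Pairwise squared distances in feature space.
    d2 = ((feats[:, None, :] - feats[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)            # exclude self-loops
    nbrs = np.argsort(d2, axis=1)[:, :k]    # dynamic k-NN neighborhoods
    # Attention logits via scaled dot products.
    logits = (feats @ feats.T) / np.sqrt(d)
    out = np.empty_like(feats)
    for i in range(n):
        w = logits[i, nbrs[i]]
        w = np.exp(w - w.max()); w /= w.sum()   # softmax over neighbors
        out[i] = w @ feats[nbrs[i]]             # attention-weighted aggregation
    return out

feats = np.random.default_rng(0).standard_normal((8, 16))
refined = feature_space_graph_attention(feats, k=4)
print(refined.shape)   # (8, 16)
```

Because the graph is rebuilt from feature similarity, two patches of the same tissue separated by an occluding instrument can still be linked, which is the intuition behind propagating structural cues "across" occlusions.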
Key Results
- On the SCARED dataset, EndoVGGT improves PSNR by 24.6% and SSIM by 9.1% over prior state-of-the-art, indicating substantial gains in both reconstruction fidelity and structural consistency.
- EndoVGGT exhibits strong zero-shot generalization to the unseen SCARED and EndoNeRF domains, confirming that the DeGAT module learns domain-agnostic geometric priors.
- Ablation studies confirm that feature-level integration of the DeGAT module significantly enhances reconstruction performance, especially in handling non-rigid scenarios.
Significance
The EndoVGGT framework addresses the fragmentation of geometric continuity that traditional fixed-topology methods suffer on low-texture surfaces, specular highlights, and instrument occlusions. It improves the accuracy of 3D reconstruction in surgical scenes while demonstrating strong cross-dataset generalization, providing more accurate geometric information for surgical robotic perception and advancing surgical navigation and training.
Technical Contribution
EndoVGGT's central technical contribution is dynamic feature-space modeling, which enables strong zero-shot generalization without per-scene optimization. The DeGAT module dynamically constructs semantic graphs that capture long-range correlations across occlusions, significantly enhancing non-rigid deformation recovery. Unlike NeRF- and Gaussian-Splatting-based methods, it requires no repeated fitting for each new case, making it more efficient to deploy.
Novelty
EndoVGGT is the first to introduce the DeGAT module in surgical 3D reconstruction, achieving zero-shot cross-dataset generalization through dynamic feature-space modeling. Unlike traditional methods, it does not rely on scene optimization, maintaining high fidelity in complex non-rigid scenarios.
Limitations
- EndoVGGT may experience performance degradation in extremely complex surgical scenes, particularly with significant occlusions and rapid movements.
- The method still has high computational complexity, which may limit its use in real-time applications.
- Further optimization may be required in specific surgical scenarios to enhance accuracy.
Future Work
Future research directions include further optimizing EndoVGGT's computational efficiency to meet real-time application demands. Additionally, exploring its application to other types of surgical scenarios will validate its applicability in broader fields. Extensions towards temporal consistency and robotic navigation are also promising directions.
AI Executive Summary
In modern surgical practice, accurate three-dimensional reconstruction is crucial for surgical robotic perception. However, existing methods often face challenges in maintaining geometric continuity when dealing with low-texture surfaces, specular highlights, and instrument occlusions, limiting their application in surgical scenes.
To address these issues, this paper proposes the EndoVGGT framework, which introduces a Deformation-aware Graph Attention (DeGAT) module to dynamically construct feature-space semantic graphs, capturing long-range correlations among coherent tissue regions. Unlike traditional fixed-topology methods, the DeGAT module enables robust propagation of structural cues across occlusions, improving non-rigid deformation recovery.
Extensive experiments on the SCARED dataset demonstrate that EndoVGGT significantly outperforms existing methods in terms of reconstruction fidelity and structural consistency. Specifically, PSNR increased by 24.6%, and SSIM improved by 9.1%, validating its effectiveness in surgical scene 3D reconstruction.
Moreover, EndoVGGT exhibits strong zero-shot cross-dataset generalization capabilities, performing well in unseen SCARED and EndoNeRF domains. This indicates that the DeGAT module learns domain-agnostic geometric priors, providing more accurate geometric information for surgical robotic perception.
Despite its impressive performance in handling complex non-rigid scenarios, EndoVGGT's computational complexity remains high, potentially limiting its use in real-time applications. Future research directions include further optimizing computational efficiency and exploring its applicability to other types of surgical scenarios.
Deep Analysis
Background
Three-dimensional reconstruction techniques play a critical role in surgical navigation, robotic assistance, and skill assessment. Early geometric and deep learning pipelines laid the foundation for this field, and recent research has shifted towards implicit neural representations such as NeRF and explicit 3D Gaussian Splatting. These methods, however, struggle in surgical scenes with intrinsic non-rigidity, soft-tissue deformation, and dynamic instrument occlusion, and their per-scene optimization limits large-scale application. To sidestep this bottleneck, recent work has introduced scene-agnostic, feed-forward alternatives based on large reconstruction models.
Core Problem
Existing large reconstruction models are primarily trained on rigid, object-centric datasets, assuming static geometry and stable illumination. In contrast, surgical scenes feature intrinsic non-rigidity, soft-tissue deformation, and dynamic instrument occlusion. Direct deployment of general-domain models results in artifacts, including disrupted tissue topology and depth errors. Additionally, surgical approaches based on NeRF or Gaussian Splatting rely on per-scene optimization, requiring repeated fitting for each new case, limiting their large-scale generalization across diverse procedures.
Innovation
The EndoVGGT framework introduces the DeGAT module to dynamically construct feature-space semantic graphs, capturing long-range correlations among coherent tissue regions. Unlike traditional fixed-topology methods, the DeGAT module enables robust propagation of structural cues across occlusions, improving non-rigid deformation recovery. Extensive experiments on the SCARED dataset validate its efficacy in surgical scene 3D reconstruction.
Methodology
- DeGAT Module: Dynamically constructs feature-space semantic graphs, capturing long-range correlations among coherent tissue regions.
- Feature-Space Modeling: Enables robust propagation of structural cues across occlusions through dynamic graph construction in feature space.
- Experimental Validation: Extensive experiments on the SCARED dataset validate its efficacy in surgical scene 3D reconstruction.
Experiments
The experimental design centers on the SCARED dataset, evaluating reconstruction fidelity (PSNR) and structural consistency (SSIM) against existing methods, with additional zero-shot evaluation on the unseen EndoNeRF domain to test cross-dataset generalization.
Results
The experimental results show that EndoVGGT significantly outperforms existing methods in terms of reconstruction fidelity and structural consistency. Specifically, PSNR increased by 24.6%, and SSIM improved by 9.1%. Additionally, EndoVGGT exhibits strong zero-shot cross-dataset generalization capabilities, performing well in unseen SCARED and EndoNeRF domains. This indicates that the DeGAT module learns domain-agnostic geometric priors, providing more accurate geometric information for surgical robotic perception.
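For readers unfamiliar with the headline metric, PSNR is computed directly from mean-squared error. The minimal NumPy version below is illustrative only, not the paper's evaluation code; real benchmarks typically use library implementations such as scikit-image's.

```python
import numpy as np

def psnr(ref, test, data_range=1.0):
    """Peak Signal-to-Noise Ratio in dB (higher = better fidelity)."""
    mse = np.mean((np.asarray(ref, float) - np.asarray(test, float)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(data_range**2 / mse)

ref = np.zeros((4, 4))
noisy = ref + 0.1            # uniform error of 0.1 -> MSE = 0.01
print(psnr(ref, noisy))      # ≈ 20.0 dB
```

If the reported 24.6% is a relative gain, a hypothetical 20 dB baseline would rise to about 24.9 dB (20 × 1.246); since PSNR is logarithmic, that is a large reduction in pixel-level error.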
Applications
The EndoVGGT framework can be directly applied to surgical navigation and robotic assistance, improving the accuracy of 3D reconstruction in surgical scenes. Its strong zero-shot cross-dataset generalization capabilities allow it to maintain high fidelity in unseen surgical scenarios, providing more accurate geometric information for surgical robotic perception.
Limitations & Outlook
Despite its impressive performance in handling complex non-rigid scenarios, EndoVGGT's computational complexity remains high, potentially limiting its use in real-time applications. Additionally, performance degradation may occur in extremely complex surgical scenes, particularly with significant occlusions and rapid movements. Future research directions include further optimizing computational efficiency and exploring its applicability to other types of surgical scenarios.
Plain Language (accessible to non-experts)
Imagine you're in a kitchen cooking a meal. You need to know the exact location and state of each ingredient to make a delicious dish. EndoVGGT is like a super-smart kitchen assistant that can observe every corner of the kitchen and accurately tell you the location and state of the ingredients. It not only sees what's in front of it but also predicts changes in the ingredients by analyzing past experiences. So even if some ingredients are covered by a lid, it can still accurately tell you where they are. This ability is crucial in surgery, where doctors need precise 3D information to make critical decisions. EndoVGGT, with its unique DeGAT module, provides high-precision 3D reconstruction in complex surgical scenarios, offering better visual guidance to doctors.
ELI14 (explained like you're 14)
Hey there! Imagine you're playing a super cool 3D game, and you need to know the exact position of every character and object to win. EndoVGGT is like a game cheat that helps you see every detail in the game, even if some characters are hidden behind obstacles. This ability is super important in real-life surgeries because doctors need precise 3D information to make critical decisions. EndoVGGT, with its unique DeGAT module, provides high-precision 3D reconstruction in complex surgical scenarios, giving doctors better visual guidance. Isn't that awesome?
Glossary
EndoVGGT
A framework for surgical 3D reconstruction that enhances depth estimation accuracy through the DeGAT module.
Used in the paper to improve 3D reconstruction accuracy in surgical scenes.
DeGAT
Deformation-aware Graph Attention module for dynamically constructing feature-space semantic graphs to capture long-range correlations.
Used in the EndoVGGT framework to enhance robust propagation of structural cues.
PSNR
Peak Signal-to-Noise Ratio, a metric for measuring image reconstruction quality; higher values indicate better quality.
Used to evaluate EndoVGGT's reconstruction fidelity on the SCARED dataset.
SSIM
Structural Similarity Index, a metric for measuring image structural consistency; higher values indicate better consistency.
Used to evaluate EndoVGGT's structural consistency on the SCARED dataset.
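To make the SSIM definition concrete, here is a simplified single-window variant computed from whole-image statistics. Real evaluations use a sliding Gaussian window (e.g. scikit-image's `structural_similarity`); this sketch only illustrates the luminance/contrast/structure formula.

```python
import numpy as np

def ssim_global(x, y, data_range=1.0, k1=0.01, k2=0.03):
    """SSIM from whole-image statistics (simplified; no sliding window)."""
    c1, c2 = (k1 * data_range) ** 2, (k2 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx**2 + my**2 + c1) * (x.var() + y.var() + c2))

img = np.linspace(0, 1, 16).reshape(4, 4)
print(ssim_global(img, img))   # ≈ 1.0 for identical images
```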
SCARED
A dataset providing realistic surgical data for validating 3D reconstruction methods.
Used in the paper to evaluate EndoVGGT's reconstruction performance.
EndoNeRF
A dataset for assessing reconstruction robustness, including scenarios with topological changes and tissue deformation.
Used to validate EndoVGGT's generalization capabilities in complex scenarios.
NeRF
Neural Radiance Fields, an implicit neural network method for scene representation.
Used as a baseline method for performance comparison with EndoVGGT.
3D Gaussian Splatting
An explicit method for real-time radiance field rendering using Gaussian distributions.
Used as a baseline method for performance comparison with EndoVGGT.
LPIPS
Learned Perceptual Image Patch Similarity, a metric for measuring perceptual image quality.
Used to evaluate EndoVGGT's reconstruction quality across different scenarios.
Zero-shot generalization
The ability of a model to maintain high performance on unseen datasets.
EndoVGGT demonstrates strong zero-shot cross-dataset generalization capabilities.
Open Questions (unanswered questions from this research)
1. How can EndoVGGT's computational complexity be reduced while maintaining high accuracy, so that it meets real-time application demands?
2. What causes performance degradation in extremely complex surgical scenes? Can further optimization of the DeGAT module address this?
3. How can EndoVGGT be applied to other types of surgical scenarios? Are specific adjustments needed for different surgical environments?
4. How does EndoVGGT perform under rapid movements and significant occlusions? Can incorporating temporal consistency improve robustness?
5. Can EndoVGGT be integrated with other surgical navigation technologies to provide more comprehensive surgical assistance, and what technical breakthroughs would that require?
Applications
Immediate Applications
Surgical Navigation
EndoVGGT can be used in surgical navigation to provide precise 3D reconstruction information, aiding doctors in making critical decisions in complex surgical scenarios.
Robotic Assistance
In robotic-assisted surgery, EndoVGGT can provide high-precision geometric information, enhancing the perception and operational accuracy of surgical robots.
Surgical Training
EndoVGGT can be used in surgical training to provide realistic surgical scene reconstruction, helping doctors improve their surgical skills.
Long-term Vision
Real-time Surgical Monitoring
In the future, EndoVGGT could be used in real-time surgical monitoring, providing real-time 3D reconstruction information to help doctors better understand the surgical process.
Cross-domain Applications
EndoVGGT's dynamic feature-space modeling capability could be extended to other fields, such as industrial inspection and autonomous driving, providing high-precision 3D reconstruction solutions.
Abstract
Accurate 3D reconstruction of deformable soft tissues is essential for surgical robotic perception. However, low-texture surfaces, specular highlights, and instrument occlusions often fragment geometric continuity, posing a challenge for existing fixed-topology approaches. To address this, we propose EndoVGGT, a geometry-centric framework equipped with a Deformation-aware Graph Attention (DeGAT) module. Rather than using static spatial neighborhoods, DeGAT dynamically constructs feature-space semantic graphs to capture long-range correlations among coherent tissue regions. This enables robust propagation of structural cues across occlusions, enforcing global consistency and improving non-rigid deformation recovery. Extensive experiments on SCARED show that our method significantly improves fidelity, increasing PSNR by 24.6% and SSIM by 9.1% over prior state-of-the-art. Crucially, EndoVGGT exhibits strong zero-shot cross-dataset generalization to the unseen SCARED and EndoNeRF domains, confirming that DeGAT learns domain-agnostic geometric priors. These results highlight the efficacy of dynamic feature-space modeling for consistent surgical 3D reconstruction.
References (20)
VGGT: Visual Geometry Grounded Transformer
Jianyuan Wang, Minghao Chen, Nikita Karaev et al.
Neural Rendering for Stereo 3D Reconstruction of Deformable Tissues in Robotic Surgery
Yuehao Wang, Yonghao Long, Siu Hin Fan et al.
FiLM: Visual Reasoning with a General Conditioning Layer
Ethan Perez, Florian Strub, H. D. Vries et al.
EndoSLAM dataset and an unsupervised monocular visual odometry and depth estimation approach for endoscopic videos
K. Ozyoruk, Guliz Irem Gokceler, Gulfize Coskun et al.
Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)
Djork-Arné Clevert, Thomas Unterthiner, Sepp Hochreiter
ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth
Shariq Farooq Bhat, R. Birkl, Diana Wofk et al.
Scale-aware monocular reconstruction via robot kinematics and visual data in neural radiance fields
Ruofeng Wei, Jiaxin Guo, Yiang Lu et al.
Structure-from-Motion Revisited
Johannes L. Schönberger, Jan-Michael Frahm
Robot-Based Procedure for 3D Reconstruction of Abdominal Organs Using the Iterative Closest Point and Pose Graph Algorithms
B. Göbel, Jonas Huurdeman, A. Reiterer et al.
EndoGaussian: Real-time Gaussian Splatting for Dynamic Endoscopic Scene Reconstruction
Yifan Liu, Chenxin Li, Chen Yang et al.
EndoSurf: Neural Surface Reconstruction of Deformable Tissues with Stereo Endoscope Videos
Ruyi Zha, Xuelian Cheng, Hongdong Li et al.
Confidence-aware self-supervised learning for dense monocular depth estimation in dynamic laparoscopic scene
Yasuhide Hirohata, Maina Sogabe, Tetsuro Miyazaki et al.
Surgical-DINO: adapter learning of foundation models for depth estimation in endoscopic surgery
Beilei Cui, Mobarak Islam Hoque, Long Bai et al.
Autonomous Intelligent Navigation for Flexible Endoscopy Using Monocular Depth Guidance and 3-D Shape Planning
Yiang Lu, Ruofeng Wei, Bin Li et al.
Vision Transformers for Dense Prediction
René Ranftl, Alexey Bochkovskiy, V. Koltun
MVSNet: Depth Inference for Unstructured Multi-view Stereo
Yao Yao, Zixin Luo, Shiwei Li et al.
A Review of 3D Reconstruction Techniques for Deformable Tissues in Robotic Surgery
Mengya Xu, Ziqi Guo, An-Chi Wang et al.
Video-based surgical skill assessment using 3D convolutional neural networks
Isabel Funke, S. T. Mees, J. Weitz et al.
Surgical Navigation in the Anterior Skull Base Using 3-Dimensional Endoscopy and Surface Reconstruction.
Ryan A. Bartholomew, Haoyin Zhou, Maud Boreel et al.
Stereo Correspondence and Reconstruction of Endoscopic Data Challenge
M. Allan, J. Mcleod, Congcong Wang et al.