M^3: Dense Matching Meets Multi-View Foundation Models for Monocular Gaussian Splatting SLAM

TL;DR

M^3 integrates multi-view foundation models with monocular Gaussian splatting SLAM, reducing ATE RMSE by 64.3% relative to VGGT-SLAM 2.0.

cs.CV · 2026-03-18
Kerui Ren, Guanghao Li, Changjian Jiang, Yingxiang Xu, Tao Lu, Linning Xu, Junting Dong, Jiangmiao Pang, Mulin Yu, Bo Dai
SLAM · multi-view models · Gaussian splatting · monocular video · pose estimation

Key Findings

Methodology

The M^3 method augments a multi-view foundation model with a matching head that recovers pixel-level dense correspondences, and integrates the result into a robust monocular Gaussian splatting SLAM pipeline. Tracking stability is further strengthened through dynamic area suppression and cross-inference intrinsic alignment. Extensive experiments demonstrate state-of-the-art accuracy in pose estimation and scene reconstruction.
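To make the role of dense correspondences in pose refinement concrete, here is a minimal Python sketch, not the authors' code: matched keyframe pixels are lifted to 3D through the foundation model's pointmap, and the incoming frame's pose is recovered with RANSAC PnP. All names (`pointmap`, `matches_kf`, and so on) are illustrative assumptions.

```python
# Hedged sketch: pose refinement from dense matches + a predicted pointmap.
# Not the paper's implementation; inputs and names are hypothetical.
import numpy as np
import cv2

def refine_pose(pointmap, matches_kf, matches_cur, K, conf, conf_thresh=0.5):
    """pointmap: (H, W, 3) keyframe 3D points in world coordinates.
    matches_kf / matches_cur: (N, 2) matched pixel coords (x, y).
    K: (3, 3) intrinsics. conf: (N,) per-match confidence."""
    keep = conf > conf_thresh                       # discard weak matches
    xs = matches_kf[keep, 0].astype(int)
    ys = matches_kf[keep, 1].astype(int)
    pts3d = pointmap[ys, xs].astype(np.float64)     # lift keyframe pixels to 3D
    pts2d = matches_cur[keep].astype(np.float64)    # where they land in the new frame
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts3d, pts2d, K, distCoeffs=None,
        reprojectionError=2.0, iterationsCount=100)
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)                      # world-to-camera rotation
    return R, tvec.reshape(3)                       # pose of the incoming frame
```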

Key Results

  • On the ScanNet++ dataset, M^3 outperforms ARTDECO with a 2.11 dB improvement in PSNR, showcasing superior performance in scene reconstruction.
  • Compared to VGGT-SLAM 2.0, M^3 reduces ATE RMSE by 64.3%, significantly improving pose estimation accuracy.
  • Extensive experiments across diverse indoor and outdoor benchmarks demonstrate that M^3 maintains competitive efficiency on long-duration monocular video streams.
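For clarity on the headline numbers above, the reduction is a relative error:

$$\text{reduction} = \frac{e_{\text{baseline}} - e_{\text{M}^3}}{e_{\text{baseline}}} \times 100\%,$$

so a 64.3% reduction means M^3's ATE RMSE is roughly 0.357 times that of VGGT-SLAM 2.0.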

Significance

The M^3 method holds significant implications for both academia and industry. It addresses the long-standing challenge of high-precision pose estimation and efficient online refinement in dynamic environments using monocular video streams. By integrating multi-view foundation models with SLAM frameworks, M^3 offers an innovative solution that enables high-accuracy scene reconstruction and tracking in real-time applications.

Technical Contribution

The technical contributions of M^3 lie in its tight integration of multi-view foundation models with SLAM frameworks, achieved by introducing a dedicated matching head for pixel-level dense correspondences. This approach not only improves pose estimation accuracy but also enhances tracking stability through dynamic area suppression and cross-inference intrinsic alignment. Additionally, M^3 simultaneously updates geometry and tracking in a single feed-forward inference, significantly reducing redundant computations.

Novelty

M^3's novelty lies in being the first to integrate multi-view foundation models with monocular Gaussian splatting SLAM, achieving pixel-level dense correspondences through a dedicated matching head. Compared to existing SLAM methods, it delivers higher accuracy and stability, especially in dynamic scenes.

Limitations

  • M^3 may experience performance degradation in extremely dynamic scenes, as dynamic area suppression may not fully eliminate the influence of all moving objects.
  • The method may be limited on devices with constrained computational resources, as multi-view processing and Gaussian splatting require high computational power.
  • In some complex outdoor scenes, lighting variations may affect the accuracy of the matching head.

Future Work

Future research directions include optimizing M^3's performance on resource-constrained devices and further enhancing its robustness in extremely dynamic scenes. Additionally, exploring the application of M^3 to other types of sensor data, such as LiDAR, could expand its applicability.

AI Executive Summary

Real-time reconstruction from monocular video streams has been a challenge in the field of computer vision, particularly in dynamic environments where high-precision pose estimation and efficient online refinement are required. Existing methods often rely on batch-oriented multi-view foundation models, which have limitations in real-time feedback and scalability in open environments.

The M^3 method proposed in this paper enhances multi-view foundation models by introducing a matching head to improve pixel-level dense correspondences, integrating it into a robust monocular Gaussian splatting SLAM. This approach enhances tracking stability through dynamic area suppression and cross-inference intrinsic alignment, significantly reducing redundant computations.

The core technical principles of M^3 include leveraging multi-view processing capabilities to simultaneously update geometry and tracking in a single feed-forward inference, and detecting and suppressing transient objects through a dynamic region identification module. These innovations enable M^3 to maintain competitive efficiency on long-duration monocular video streams.

Extensive experiments across diverse indoor and outdoor benchmarks demonstrate that M^3 achieves state-of-the-art accuracy in pose estimation and scene reconstruction. For example, on the ScanNet++ dataset, M^3 outperforms ARTDECO with a 2.11 dB improvement in PSNR, while reducing ATE RMSE by 64.3% compared to VGGT-SLAM 2.0.

The M^3 method holds significant implications for both academia and industry. It addresses the long-standing challenge of high-precision pose estimation and efficient online refinement in dynamic environments using monocular video streams, offering an innovative solution for real-time applications.

However, M^3 may experience performance degradation in extremely dynamic scenes and may be limited on devices with constrained computational resources. Future research directions include optimizing M^3's performance on resource-constrained devices and further enhancing its robustness in extremely dynamic scenes.

Deep Analysis

Background

3D scene reconstruction has become a fundamental capability in computer vision, enabling applications ranging from robotic perception to large-scale scene digitization. Recently, the field has been revolutionized by two paradigms: per-scene optimization, such as 3D Gaussian Splatting (3DGS), which delivers high-fidelity rendering, and feed-forward geometric foundation models, which infer dense priors in a single pass. However, most existing foundation models are inherently batch-oriented, designed to process a fixed set of images jointly. This offline nature precludes real-time feedback and limits scalability in open-ended environments, underscoring the urgent need for streaming reconstruction, where camera trajectories and scene geometry are incrementally updated as new observations arrive.

Core Problem

Existing efforts toward streaming 3D reconstruction generally follow two trajectories, yet both face significant hurdles. The first family attempts to adapt feed-forward models to a streaming context by incorporating memory mechanisms that summarize past observations to predict geometry incrementally. While these methods are efficient, they typically produce low-resolution results and struggle with cumulative drift, as they lack the iterative global refinement mechanisms found in classical SLAM. The second family instead integrates foundation-model priors into a SLAM pipeline to guide optimization. However, these approaches are often trapped in a fundamental trade-off: pairwise-prior methods, such as MASt3R-SLAM, suffer from redundant computation and quadratic complexity, whereas multi-frame prior methods like VGGT-SLAM 2.0 provide global geometry but lack the pixel-level dense correspondences necessary for rigorous geometric optimization.

Innovation

M^3 bridges this gap by tightly coupling a multi-view foundation model with a robust SLAM pipeline:

  • It enhances a state-of-the-art multi-view geometric foundation model with a dedicated dense matching head, specifically trained to recover pixel-level correspondences (see the sketch after this list).
  • This enables the SLAM framework to leverage the foundation model's geometry for accurate, high-frequency pose refinement.
  • Unlike previous black-box integrations, M^3 performs a single feed-forward inference over both historical keyframes and incoming frames to simultaneously update geometry and tracking, significantly reducing redundant model invocations.
  • A dynamic region identification module detects and suppresses transient objects, ensuring stable static-scene reconstruction in real-world environments.
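As a rough illustration of the first point, a dense matching head can be as simple as a light projection on top of backbone features plus a correlation readout. The PyTorch sketch below is our assumption for exposition, not the paper's architecture; `MatchingHead` and `dense_match` are hypothetical names.

```python
# Hedged sketch of a dense matching head (illustrative, not the paper's design).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MatchingHead(nn.Module):
    """Maps backbone features to unit-norm per-pixel descriptors."""
    def __init__(self, feat_dim=768, desc_dim=128):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(feat_dim, 256, 3, padding=1), nn.GELU(),
            nn.Conv2d(256, desc_dim, 1))

    def forward(self, feats):                 # feats: (B, C, H, W)
        return F.normalize(self.proj(feats), dim=1)

def dense_match(desc_a, desc_b):
    """Nearest-neighbour matches from view a to view b by cosine similarity.
    Done at coarse resolution in practice: the volume is (HW x HW)."""
    B, C, H, W = desc_a.shape
    a = desc_a.flatten(2).transpose(1, 2)     # (B, HW, C)
    b = desc_b.flatten(2)                     # (B, C, HW)
    sim = a @ b                               # correlation volume (B, HW, HW)
    conf, idx = sim.max(dim=2)                # best match + confidence per pixel
    return idx, conf
```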

Methodology

The M^3 pipeline involves several key steps:

  • Enhance a multi-view geometric foundation model with a dedicated dense matching head, trained to recover pixel-level correspondences.
  • Leverage the foundation model's geometry for accurate, high-frequency pose refinement within the SLAM framework.
  • Perform a single feed-forward inference over both historical keyframes and incoming frames to simultaneously update geometry and tracking, reducing redundant model invocations.
  • Detect and suppress transient objects with a dynamic region identification module (see the sketch below), ensuring stable static-scene reconstruction in real-world environments.

Extensive experiments demonstrate state-of-the-art accuracy in both pose estimation and 3D reconstruction.
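To illustrate the dynamic-suppression step, the hedged sketch below flags pixels whose matched motion cannot be explained by the estimated camera pose and static geometry; the paper's module may work differently, and every name here is ours.

```python
# Hedged sketch: flag dynamic pixels via reprojection residuals (illustrative).
import numpy as np

def dynamic_mask(points3d, matches_cur, K, R, t, thresh_px=3.0):
    """points3d: (N, 3) world-frame 3D points for N matched keyframe pixels.
    matches_cur: (N, 2) where those pixels were matched in the new frame.
    (R, t): world-to-camera pose of the new frame; K: (3, 3) intrinsics."""
    cam = points3d @ R.T + t                  # move points into the new camera
    uvw = cam @ K.T                           # project (assumes depth w > 0)
    uv = uvw[:, :2] / uvw[:, 2:3]             # perspective divide
    resid = np.linalg.norm(uv - matches_cur, axis=1)
    return resid > thresh_px                  # True where motion is unexplained
```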

Experiments

The experimental design includes testing on diverse indoor and outdoor datasets such as ScanNet++, ScanNetV2, Waymo, and KITTI. Baselines for comparison include DROID-SLAM, MASt3R-SLAM, VGGT-SLAM, VGGT-SLAM 2.0, and ARTDECO. Evaluation metrics include Absolute Trajectory Error (ATE) RMSE, PSNR, SSIM, and LPIPS. Key hyperparameters include the matching search radius and keyframe insertion threshold. Ablation studies are conducted to evaluate the contribution of individual components.
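For reference, ATE RMSE for monocular SLAM is conventionally computed after aligning the estimated trajectory to ground truth with a similarity transform, since scale is unobservable from a single camera. The sketch below is the standard Umeyama-alignment protocol, not paper-specific code.

```python
# Standard ATE RMSE with Umeyama Sim(3) alignment (evaluation convention).
import numpy as np

def ate_rmse(est, gt):
    """est, gt: (N, 3) estimated / ground-truth camera positions."""
    mu_e, mu_g = est.mean(0), gt.mean(0)
    E, G = est - mu_e, gt - mu_g
    cov = G.T @ E / len(est)                    # 3x3 cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                          # avoid reflection solutions
    R = U @ S @ Vt                              # optimal rotation
    s = np.trace(np.diag(D) @ S) / (E**2).sum() * len(est)  # optimal scale
    t = mu_g - s * R @ mu_e                     # optimal translation
    aligned = est @ (s * R).T + t
    return np.sqrt(((aligned - gt) ** 2).sum(axis=1).mean())
```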

Results

Experimental results show that M^3 achieves state-of-the-art accuracy in pose estimation and scene reconstruction. For example, on the ScanNet++ dataset, M^3 outperforms ARTDECO with a 2.11 dB improvement in PSNR, while reducing ATE RMSE by 64.3% compared to VGGT-SLAM 2.0. Ablation studies reveal that the matching head and dynamic region identification module are crucial for performance improvement. M^3 maintains competitive efficiency on long-duration monocular video streams.

Applications

The M^3 method can be directly applied to scenarios such as robotic navigation, augmented reality, and drone surveillance. These applications require high-precision pose estimation and scene reconstruction to enable real-time environmental perception and interaction. M^3's efficient computational performance makes it suitable for processing long-duration video streams, particularly in dynamic environments.

Limitations & Outlook

Despite its strong performance across various scenarios, M^3 may experience performance degradation in extremely dynamic scenes. Additionally, the method may be limited on devices with constrained computational resources, as multi-view processing and Gaussian splatting require high computational power. Future research directions include optimizing M^3's performance on resource-constrained devices and further enhancing its robustness in extremely dynamic scenes.

Plain Language: Accessible to non-experts

Imagine you're cooking in a kitchen. You have a recipe (multi-view foundation model) that tells you how to make a delicious dish step by step. But there's a problem: the recipe assumes you have all the ingredients (image data) ready and perfect. In real life, ingredients might change (dynamic environments), and you need to adjust in real-time. M^3 is like a smart assistant that not only helps you find the ingredients you need but also adjusts the recipe as you cook, ensuring every dish you make is perfect. It uses a special tool called a matching head to identify and adjust the position of each ingredient (pixel-level dense correspondences) and ensures your kitchen environment remains stable (dynamic area suppression). In the end, you can make delicious dishes efficiently and stably in any environment.

ELI14: Explained like you're 14

Hey there, buddy! Imagine you're playing a super cool game where you need to find your way out of a challenging maze. This maze keeps changing, just like our world. M^3 is like your game assistant, helping you find the best path and adjusting strategies as you move forward. It has a super powerful tool called a matching head that helps you recognize every important clue and marker (pixel-level dense correspondences). Plus, it identifies obstacles that might get in your way (dynamic area identification module), ensuring you reach the finish line smoothly. Isn't that awesome? So, no matter how complex the maze is, you can easily find the exit with M^3's help!

Glossary

SLAM (Simultaneous Localization and Mapping)

SLAM estimates a sensor's trajectory while simultaneously building a map of the surrounding environment; it is widely used in robotic navigation and augmented reality.

In this paper, SLAM is used for real-time updates of camera trajectories and scene geometry.

ATE RMSE (Absolute Trajectory Error, Root-Mean-Square Error)

ATE RMSE measures trajectory estimation accuracy as the root-mean-square of the position error between the estimated and ground-truth camera trajectories after alignment; lower is better.

Used to evaluate M^3's performance in pose estimation.
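After alignment, the metric is the root-mean-square of per-frame position errors:

$$\mathrm{ATE\ RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left\lVert \hat{\mathbf{t}}_i - \mathbf{t}_i\right\rVert^2}$$

where $\hat{\mathbf{t}}_i$ and $\mathbf{t}_i$ are the estimated and ground-truth camera positions at frame $i$.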

PSNR (Peak Signal-to-Noise Ratio)

PSNR is a metric for evaluating image quality, measuring the log-scaled ratio between the peak signal value and the mean squared error relative to a reference image; higher values indicate a closer match.

Used to evaluate M^3's performance in scene reconstruction.
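Concretely, for images with peak value $\mathrm{MAX}$ (255 for 8-bit images):

$$\mathrm{PSNR} = 10\,\log_{10}\frac{\mathrm{MAX}^2}{\mathrm{MSE}}$$

where $\mathrm{MSE}$ is the mean squared error between the reconstructed and reference images; the 2.11 dB gain over ARTDECO is measured on this logarithmic scale.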

Multi-view Foundation Model

A multi-view foundation model is a model for 3D reconstruction using image data from multiple viewpoints.

M^3 enhances multi-view foundation models for high-precision scene reconstruction.

Gaussian Splatting

Gaussian splatting is a technique for 3D scene representation and rendering that models a scene as a set of 3D Gaussian primitives, which are projected (splatted) onto the image plane and alpha-composited.

Used in M^3 for efficient scene reconstruction.
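In the standard formulation, each pixel's color is alpha-composited from the depth-sorted Gaussians that overlap it:

$$C = \sum_{i} c_i\,\alpha_i \prod_{j<i}\left(1-\alpha_j\right)$$

where $c_i$ and $\alpha_i$ are the color and opacity contributions of the $i$-th Gaussian.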

Matching Head

A matching head is a module for identifying pixel-level correspondences in images.

Used in M^3 to enhance the accuracy of multi-view foundation models.

Dynamic Area Suppression

Dynamic area suppression is a technique for identifying and suppressing dynamic objects in a scene.

Used in M^3 to improve tracking stability.

Cross-inference Intrinsic Alignment

A mechanism for keeping the camera intrinsics estimated by the foundation model consistent across successive feed-forward inferences, so that geometry predicted in different passes remains comparable.

Used in M^3 to ensure geometric consistency.

Ablation Study

A method for evaluating the contribution of model components by removing or modifying them.

Used to evaluate the contribution of components in M^3.

ScanNet++

ScanNet++ is an indoor dataset for 3D scene reconstruction, containing various complex scenes.

Used to evaluate M^3's performance in scene reconstruction.

Open Questions: Unanswered questions from this research

  1. How can M^3's robustness in extremely dynamic scenes be further improved? Existing dynamic area suppression may not fully eliminate the influence of all moving objects, so more advanced identification and suppression techniques may be needed.
  2. How can M^3's performance be optimized on devices with constrained computational resources? Multi-view processing and Gaussian splatting demand high computational power, potentially necessitating more efficient algorithms.
  3. How can M^3 be applied to other types of sensor data, such as LiDAR? This would require adapting M^3 to handle different data modalities.
  4. How do lighting variations in complex outdoor scenes affect M^3's accuracy? Research is needed to keep the matching head accurate under changing illumination.
  5. How can M^3's computational redundancy be further reduced? Although M^3 updates geometry and tracking in a single feed-forward inference, there may still be room for optimization.

Applications

Immediate Applications

Robotic Navigation

M^3 can be used for real-time path planning and environmental perception in autonomous robots, helping them navigate efficiently in complex environments.

Augmented Reality

With high-precision scene reconstruction, M^3 can enhance AR devices' environmental interaction capabilities, improving user experience.

Drone Surveillance

M^3 can be used for real-time environmental monitoring by drones, helping to identify and track dynamic targets, improving surveillance efficiency.

Long-term Vision

Smart Cities

M^3 can be used for real-time environmental monitoring and management in smart cities, helping to optimize the allocation and use of city resources.

Autonomous Driving

With high-precision environmental perception, M^3 can provide safer and more efficient navigation solutions for autonomous vehicles.

Abstract

Streaming reconstruction from uncalibrated monocular video remains challenging, as it requires both high-precision pose estimation and computationally efficient online refinement in dynamic environments. While coupling 3D foundation models with SLAM frameworks is a promising paradigm, a critical bottleneck persists: most multi-view foundation models estimate poses in a feed-forward manner, yielding pixel-level correspondences that lack the requisite precision for rigorous geometric optimization. To address this, we present M^3, which augments the Multi-view foundation model with a dedicated Matching head to facilitate fine-grained dense correspondences and integrates it into a robust Monocular Gaussian Splatting SLAM. M^3 further enhances tracking stability by incorporating dynamic area suppression and cross-inference intrinsic alignment. Extensive experiments on diverse indoor and outdoor benchmarks demonstrate state-of-the-art accuracy in both pose estimation and scene reconstruction. Notably, M^3 reduces ATE RMSE by 64.3% compared to VGGT-SLAM 2.0 and outperforms ARTDECO by 2.11 dB in PSNR on the ScanNet++ dataset.


References (20)

ScanNet++: A High-Fidelity Dataset of 3D Indoor Scenes

Chandan Yeshwanth, Yueh-Cheng Liu, M. Nießner et al.

2023 · 564 citations

ARTDECO: Towards Efficient and High-Fidelity On-the-Fly 3D Reconstruction with Structured Scene Representation

Guanghao Li, Kerui Ren, Linning Xu et al.

2025 · 4 citations

VGGT-SLAM: Dense RGB SLAM Optimized on the SL(4) Manifold

Dominic Maggio, Hyungtae Lim, Luca Carlone

2025 · 65 citations

VGGT-SLAM 2.0: Real-time Dense Feed-forward Scene Reconstruction

Dominic Maggio, Luca Carlone

2026 · 3 citations

MASt3R-SLAM: Real-Time Dense SLAM with 3D Reconstruction Priors

Riku Murai, Eric Dexheimer, Andrew J. Davison

2024 · 149 citations

DROID-SLAM: Deep Visual SLAM for Monocular, Stereo, and RGB-D Cameras

Zachary Teed, Jia Deng

2021 · 904 citations

Grounding Image Matching in 3D with MASt3R

Vincent Leroy, Yohann Cabon, Jérôme Revaud

2024 · 647 citations

$\pi^3$: Permutation-Equivariant Visual Geometry Learning

Yifan Wang, Jianjun Zhou, Haoyi Zhu et al.

2025 · 67 citations

On-the-fly Reconstruction for Large-Scale Novel View Synthesis from Unposed Images

Andreas Meuleman, I. Shah, Alexandre Lanvin et al.

2025 · 19 citations

Structure-from-Motion Revisited

Johannes L. Schönberger, Jan-Michael Frahm

2016 · 6956 citations

PLANING: A Loosely Coupled Triangle-Gaussian Framework for Streaming 3D Reconstruction

Changjian Jiang, Kerui Ren, Xudong Li et al.

2026 · 1 citation

ScanNet: Richly-Annotated 3D Reconstructions of Indoor Scenes

Angela Dai, Angel X. Chang, M. Savva et al.

2017 · 5152 citations

Gaussian Error Linear Units (GELUs)

Dan Hendrycks, Kevin Gimpel

2016 · 6460 citations

Scaffold-GS: Structured 3D Gaussians for View-Adaptive Rendering

Tao Lu, Mulin Yu, Linning Xu et al.

2023 · 668 citations

Outdoor Monocular SLAM with Global Scale-Consistent 3D Gaussian Pointmaps

Chong Cheng, Sicheng Yu, Zijian Wang et al.

2025 · 12 citations

Optimal Transport Aggregation for Visual Place Recognition

Sergio Izquierdo, Javier Civera

2023 · 170 citations

2D Gaussian Splatting for Geometrically Accurate Radiance Fields

Binbin Huang, Zehao Yu, Anpei Chen et al.

2024 · 1076 citations

Point3R: Streaming 3D Reconstruction with Explicit Spatial Pointer Memory

Yuqi Wu, Wenzhao Zheng, Jie Zhou et al.

2025 · 40 citations

GS-SLAM: Dense Visual SLAM with 3D Gaussian Splatting

Chi Yan, Delin Qu, Dong Wang et al.

2023 · 406 citations

VGGT-Long: Chunk it, Loop it, Align it - Pushing VGGT's Limits on Kilometer-scale Long RGB Sequences

Kai Deng, Zexin Ti, Jiawei Xu et al.

2025 · 45 citations