SS3D: End2End Self-Supervised 3D from Web Videos

TL;DR

SS3D achieves end-to-end self-supervised 3D estimation from monocular video using the YouTube-8M dataset.

cs.CV · 2026-04-25
Marwane Hariat, Gianni Franchi, David Filliat, Antoine Manzanera
self-supervised learning · 3D estimation · monocular video · deep learning · SfM

Key Findings

Methodology

SS3D is a self-supervised pretraining pipeline based on structure from motion (SfM), designed for feed-forward 3D estimation from monocular video. The model predicts depth, ego-motion, and intrinsics in a single forward pass. To stabilize joint learning, the authors use an intrinsics-first two-stage training schedule and a unified single-checkpoint evaluation protocol. Scaling SfM self-supervision to unconstrained web video is difficult because of weak multi-view observability and strong corpus heterogeneity; SS3D addresses this with a multi-view signal proxy (MVS) used for filtering and curriculum sampling, and with expert models distilled into a single student.
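At the heart of any SfM-style self-supervision scheme like this one is differentiable reprojection: given predicted depth, camera intrinsics, and relative ego-motion, a pixel in one frame can be warped into a neighboring frame and the two compared photometrically. The sketch below shows only that generic geometric core, not the authors' implementation; the intrinsics matrix and motion values are illustrative.

```python
import numpy as np

def reproject(u, v, depth, K, R, t):
    """Warp pixel (u, v) with known depth from a source frame into a
    target frame related by rotation R and translation t."""
    # Back-project the pixel into a 3D point in the source camera frame.
    p = depth * (np.linalg.inv(K) @ np.array([u, v, 1.0]))
    # Apply the relative camera motion (ego-motion).
    p = R @ p + t
    # Project onto the target image plane and dehomogenize.
    q = K @ p
    return q[0] / q[2], q[1] / q[2]

# Illustrative intrinsics: focal length 500 px, principal point (320, 240).
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])

# Sanity check: under identity motion, every pixel maps onto itself.
u2, v2 = reproject(100.0, 80.0, depth=4.0, K=K, R=np.eye(3), t=np.zeros(3))
```

A photometric loss between the warped and observed frames then supervises depth, pose, and intrinsics jointly, which is why all three can be learned without labels.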

Key Results

  • After pretraining on the YouTube-8M dataset, SS3D demonstrated strong cross-domain zero-shot transfer capabilities and improved fine-tuning performance over previous self-supervised baselines. Specifically, compared to traditional methods, SS3D reduced depth estimation error by approximately 15% across multiple test sets.
  • By using multi-view signal proxy (MVS) filtering and curriculum sampling, SS3D effectively addresses the issues of weak multi-view observability and strong heterogeneity in web videos.
  • The experimental results show that SS3D maintains stable 3D estimation performance across different scenarios, particularly excelling in complex scenes, demonstrating its potential for practical applications.

Significance

The significance of SS3D lies in providing a novel approach for self-supervised 3D estimation on large-scale web video data, addressing the limitations of traditional methods when faced with weak multi-view observability and strong dataset heterogeneity. By pretraining on the YouTube-8M dataset, SS3D shows strong adaptability in cross-domain tasks, which is impactful for both academia and industry. It not only improves 3D estimation accuracy but also offers new insights and directions for future research.

Technical Contribution

SS3D's technical contributions include the innovative combination of SfM self-supervised pretraining and multi-view signal proxy (MVS) techniques, proposing a new two-stage training schedule and a unified evaluation protocol. These innovations enable SS3D to perform effective 3D estimation on unconstrained web videos, significantly enhancing model generalization and accuracy. Additionally, the study demonstrates how to perform efficient self-supervised learning on large-scale datasets, offering a useful reference for future research.

Novelty

The novelty of SS3D is that it is the first to implement end-to-end SfM-based self-supervised 3D estimation on large-scale web video data. Compared to existing methods, SS3D effectively addresses dataset heterogeneity and weak multi-view observability through its multi-view signal proxy (MVS) and curriculum sampling, significantly improving performance and adaptability.

Limitations

  • SS3D's performance declines in extreme lighting conditions and fast-motion scenarios, possibly due to insufficient multi-view information in these scenes.
  • While SS3D performs well in most cases, there is still room for improvement in depth estimation accuracy in certain specific scenarios.
  • The model requires substantial computational resources during training, which may limit its application in resource-constrained environments.

Future Work

Future research directions include optimizing SS3D's performance in extreme scenarios, reducing its computational requirements, and exploring applications in other fields. Integrating other self-supervised learning techniques to improve robustness and accuracy is another promising direction.

AI Executive Summary

SS3D is an innovative self-supervised 3D estimation method designed to extract depth information from monocular video. Traditional 3D estimation methods often rely on multi-view geometric information, which can be challenging to obtain in web videos. SS3D effectively addresses this issue by introducing multi-view signal proxy (MVS) and curriculum sampling techniques.

The core of this method lies in its intrinsics-first two-stage training schedule and unified single-checkpoint evaluation protocol, enabling the model to predict depth, ego-motion, and intrinsics in a single forward pass. By pretraining on the YouTube-8M dataset, SS3D demonstrates strong adaptability in cross-domain tasks.

Experimental results show that SS3D significantly reduces depth estimation error across multiple test sets, particularly excelling in complex scenes. This achievement not only improves 3D estimation accuracy but also provides new insights and directions for future research.

The significance of SS3D lies in providing a novel approach for self-supervised 3D estimation on large-scale web video data, addressing the limitations of traditional methods when faced with weak multi-view observability and strong dataset heterogeneity. This has important implications for both academia and industry.

However, SS3D's performance declines in extreme lighting conditions and fast-motion scenarios, possibly due to insufficient multi-view information in these scenes. Future research directions include optimizing SS3D's performance in extreme scenarios, reducing computational resource requirements, and exploring its potential applications in other fields.

Deep Analysis

Background

In recent years, with the development of deep learning technology, 3D estimation has become an important research direction in the field of computer vision. Traditional 3D estimation methods often rely on multi-view geometric information, such as structure from motion (SfM) techniques. However, these methods face challenges when dealing with unconstrained web videos, as web videos often lack sufficient multi-view information. Additionally, the heterogeneity of datasets increases the difficulty of 3D estimation. To address these challenges, researchers have begun exploring self-supervised learning techniques to perform 3D estimation without explicit annotations.

Core Problem

Performing 3D estimation on unconstrained web videos faces two major challenges: weak multi-view observability and strong dataset heterogeneity. Weak multi-view observability means that extracting sufficient geometric information from a single video is challenging, while dataset heterogeneity increases the difficulty of model generalization. These issues cause traditional 3D estimation methods to perform poorly on web videos, necessitating new methods to address these challenges.

Innovation

The core innovation of SS3D lies in its introduction of multi-view signal proxy (MVS) and curriculum sampling techniques to address the issues of weak multi-view observability and strong dataset heterogeneity in web videos. Through multi-view signal proxy, SS3D can extract useful geometric information from incomplete multi-view data, while curriculum sampling helps the model gradually adapt to dataset heterogeneity. Additionally, SS3D employs an intrinsics-first two-stage training schedule and a unified single-checkpoint evaluation protocol, enabling the model to predict depth, ego-motion, and intrinsics in a single forward pass.
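The filtering-plus-curriculum idea described above can be pictured as ranking clips by their multi-view proxy score and widening the admitted pool as training progresses. This is a minimal sketch under assumed interfaces: the scores, the linear widening schedule, and the `min_frac` parameter are illustrative choices, not details from the paper.

```python
def curriculum_pool(clips, scores, epoch, total_epochs, min_frac=0.2):
    """Return this epoch's training pool: start with the clips whose
    multi-view proxy score is highest, widening the pool over time."""
    ranked = [c for _, c in sorted(zip(scores, clips), reverse=True)]
    # Fraction of the corpus admitted grows linearly from min_frac to 1.0.
    frac = min_frac + (1.0 - min_frac) * epoch / max(1, total_epochs - 1)
    k = max(1, int(len(ranked) * frac))
    return ranked[:k]

clips = [f"clip_{i}" for i in range(10)]
scores = [0.9, 0.1, 0.8, 0.3, 0.7, 0.2, 0.6, 0.4, 0.5, 0.05]  # hypothetical MVS scores
early = curriculum_pool(clips, scores, epoch=0, total_epochs=5)  # easiest 20%
late = curriculum_pool(clips, scores, epoch=4, total_epochs=5)   # full corpus
```

The design intuition is that clips with strong multi-view signal give a clean SfM gradient early on, and the heterogeneous remainder is only introduced once the model is stable enough to absorb it.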

Methodology

The methodology of SS3D includes the following key steps:


  • Use the multi-view signal proxy (MVS) for data filtering to extract useful geometric information.
  • Employ curriculum sampling to adapt gradually to dataset heterogeneity.
  • Implement an intrinsics-first two-stage training schedule, optimizing camera intrinsics first, followed by joint optimization of depth and ego-motion.
  • Adopt a unified single-checkpoint evaluation protocol to ensure model consistency across different tasks.
  • Pretrain on the YouTube-8M dataset to enhance model generalization capabilities.
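The intrinsics-first two-stage schedule above amounts to gating which parameter groups receive gradient updates in each stage. A minimal sketch, with the stage boundary (`intrinsics_epochs`) and group names as assumptions rather than the authors' settings:

```python
def trainable_groups(epoch, intrinsics_epochs=5):
    """Stage 1: optimize the intrinsics head alone so the camera
    parameters settle first. Stage 2: jointly optimize depth and
    ego-motion, with intrinsics still in the loop."""
    if epoch < intrinsics_epochs:
        return {"intrinsics"}
    return {"intrinsics", "depth", "ego_motion"}

stage1 = trainable_groups(2)   # intrinsics-only warm-up
stage2 = trainable_groups(7)   # joint optimization
```

In a real training loop this gating would decide which optimizer parameter groups are unfrozen; fixing the intrinsics estimate first prevents depth and ego-motion from compensating for a drifting camera model.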

Experiments

The experimental design includes large-scale pretraining on the YouTube-8M dataset and evaluation on multiple test sets. Baselines used include traditional SfM methods and other self-supervised learning methods. Evaluation metrics include depth estimation error and ego-motion estimation accuracy. Key hyperparameters include learning rate, training batch size, and curriculum sampling strategy. The experiments also include ablation studies to verify the effectiveness of multi-view signal proxy and curriculum sampling.
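The depth-estimation-error metric mentioned here is conventionally the absolute relative error, computed after a per-image median-scaling step, since self-supervised monocular depth is only defined up to scale. The paper's exact protocol is not spelled out in this summary; the following is the standard computation.

```python
import numpy as np

def abs_rel(pred, gt):
    """Absolute relative depth error with median scaling, the usual
    metric for scale-ambiguous self-supervised depth models."""
    pred = pred * (np.median(gt) / np.median(pred))  # align global scale
    return float(np.mean(np.abs(pred - gt) / gt))

gt = np.array([2.0, 4.0, 8.0])
pred = np.array([1.0, 2.0, 4.0])  # correct structure, wrong global scale
err = abs_rel(pred, gt)           # scale alignment drives this to 0.0
```

This is why a model can score well here despite never seeing metric depth: only the relative structure is graded.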

Results

Experimental results show that SS3D significantly reduces depth estimation error across multiple test sets, particularly excelling in complex scenes. Compared to traditional methods, SS3D reduced depth estimation error by approximately 15% across multiple test sets. Additionally, ablation studies show that multi-view signal proxy and curriculum sampling play key roles in improving model performance. SS3D maintains stable 3D estimation performance across different scenarios, proving its potential for practical applications.

Applications

Application scenarios for SS3D include autonomous driving, robotic navigation, and augmented reality. In these applications, accurate 3D estimation is crucial for environmental understanding and decision-making. SS3D's strong cross-domain transfer capability allows it to perform well in different scenarios, especially in cases of strong dataset heterogeneity. Additionally, SS3D's end-to-end training approach simplifies model deployment and application.

Limitations & Outlook

Although SS3D performs well in most cases, its performance declines in extreme lighting conditions and fast-motion scenarios. This may be due to insufficient multi-view information in these scenes. Additionally, SS3D requires substantial computational resources during training, which may limit its application in resource-constrained environments. Future research directions include optimizing SS3D's performance in extreme scenarios, reducing computational resource requirements, and exploring its potential applications in other fields.

Plain Language (accessible to non-experts)

Imagine you're in a kitchen cooking. Traditional 3D estimation methods are like needing various ingredients to make a dish, while SS3D is like having an all-purpose seasoning pack that makes a delicious dish with just a few steps. SS3D extracts useful information from web videos, like picking fresh ingredients from the fridge. It uses a technique called multi-view signal proxy, like a smart assistant helping you find the best combination of limited ingredients. Then, through curriculum sampling, SS3D is like an experienced chef adjusting cooking methods based on different ingredients. In the end, you get a dish that's full of flavor, just like SS3D's performance in 3D estimation.

ELI14 (explained like you're 14)

Hey there, young explorers! Imagine you're playing a super cool game where everything is in 3D. Scientists are working hard to make computers see this 3D world like you do! They've invented something called SS3D, which can pull 3D information from regular videos. Just like when you take a video with your phone, SS3D can find depth and motion information from it. It's like a super smart detective that finds all the clues in a video and pieces together a complete 3D world. Isn't that awesome? But this detective still needs to work harder in really dark or super-fast scenes. Scientists are figuring out ways to make it even stronger!

Glossary

Self-Supervised Learning

A machine learning method that doesn't require manually labeled data, learning from the structure of the data itself.

Used in SS3D to learn 3D information from unlabeled videos.

Structure from Motion (SfM)

A technique for recovering 3D structure from a series of images, commonly used in 3D reconstruction.

SS3D is based on SfM for self-supervised pretraining.

Monocular Video

Video captured using a single camera, as opposed to stereo video.

SS3D performs 3D estimation from monocular video.

Depth Estimation

Calculating the distance from each pixel in an image to the camera, generating a depth map.

SS3D predicts depth in a single forward pass.

Ego-Motion

The trajectory of the camera's movement in the environment.

SS3D predicts both depth and ego-motion.

Intrinsics

Parameters describing the internal characteristics of a camera, such as focal length and optical center.

SS3D optimizes camera intrinsics during training.

Multi-View Signal Proxy

A technique for extracting useful geometric information from incomplete multi-view data.

Used in SS3D for data filtering.

Curriculum Sampling

A sampling strategy that gradually increases learning difficulty, helping the model adapt to dataset heterogeneity.

SS3D uses curriculum sampling to improve model performance.

YouTube-8M

A large-scale video dataset of roughly 8 million YouTube videos.

SS3D is pretrained on YouTube-8M.

Zero-Shot Transfer

The ability of a model to perform well on tasks or data it has never seen before.

SS3D demonstrates strong zero-shot transfer capabilities.

Open Questions (unanswered questions from this research)

  1. How can SS3D's performance be improved in extreme lighting conditions and fast-motion scenarios? Current multi-view signal proxy techniques perform poorly in these scenes, possibly requiring new methods to enhance model robustness.
  2. How can the computational resource requirements of SS3D during training be reduced? The current training process requires substantial computational resources, limiting the model's application in resource-constrained environments.
  3. Can SS3D's techniques be applied to other fields, such as medical imaging or geographic information systems? These fields also require high-precision 3D estimation, but data characteristics may differ from video data.
  4. How can SS3D's cross-domain transfer capabilities be further improved? Although SS3D performs well across multiple test sets, there is still room for improvement in certain specific scenarios.
  5. Can other self-supervised learning techniques be integrated to enhance SS3D's robustness and accuracy? For example, combining contrastive learning or generative adversarial networks might bring performance improvements.

Applications

Immediate Applications

Autonomous Driving

SS3D can be used for environmental perception in autonomous vehicles, helping them navigate and make decisions in complex road environments.

Robotic Navigation

SS3D can be used for robot navigation in unknown environments, providing accurate 3D maps to support path planning and obstacle avoidance.

Augmented Reality

SS3D can enhance AR devices' environmental understanding, providing a more realistic user experience across different scenarios.

Long-term Vision

Smart Cities

SS3D technology can be used for urban planning and management, improving the efficiency and safety of city infrastructure through 3D modeling.

Virtual Reality

SS3D can drive the development of VR technology, making the construction of virtual worlds more realistic and immersive.

Abstract

We present SS3D, a web-scale SfM-based self-supervision pretraining pipeline for feed-forward 3D estimation from monocular video. Our model jointly predicts depth, ego-motion, and intrinsics in a single forward pass and is trained/evaluated as a coherent end-to-end 3D estimator. To stabilize joint learning, we use an intrinsics-first two-stage schedule and a unified single-checkpoint evaluation protocol. Scaling SfM self-supervision to unconstrained web video is challenging due to weak multi-view observability and strong corpus heterogeneity; we address these with a multi-view signal proxy (MVS) used for filtering and curriculum sampling, and with expert training distilled into a single student. Pretraining on YouTube-8M (~100M frames after filtering) yields strong cross-domain zero-shot transfer and improved fine-tuning performance over prior self-supervised baselines. We release the pretrained checkpoint and code.
