PASR: Pose-Aware 3D Shape Retrieval from Occluded Single Views
PASR achieves 81.59% Top-1 retrieval accuracy on the Pix3D dataset and 76.43% on Pascal3D.
Key Findings
Methodology
PASR redefines 3D shape retrieval by distilling knowledge from a 2D foundation model, DINOv3, into a 3D encoder. By aligning pose-conditioned 3D projections with 2D feature maps, PASR bridges the gap between real-world images and synthetic meshes. During inference, PASR performs test-time optimization via analysis-by-synthesis, jointly searching for the shape and pose that best reconstruct the input image's feature map. This synthesis-based optimization is inherently robust to partial occlusion and sensitive to fine-grained geometric details.
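The test-time optimization described above can be sketched as a joint search over candidate shapes and poses. The toy NumPy version below is illustrative only: it replaces the differentiable renderer with a precomputed bank of pose-conditioned feature vectors and gradient-based pose refinement with exhaustive search, and all names and shapes are invented for this sketch rather than taken from the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: each "shape" is a bank of pose-conditioned feature vectors.
# In PASR these would come from projecting 3D encoder features with a
# differentiable renderer; here they are random and purely illustrative.
n_shapes, n_poses, feat_dim = 5, 8, 16
shape_banks = rng.normal(size=(n_shapes, n_poses, feat_dim))

def render_features(shape_id, pose_id):
    """Hypothetical stand-in for projecting a shape's 3D features to a
    2D feature map under a given camera pose."""
    return shape_banks[shape_id, pose_id]

# Query: the features of shape 3 seen under pose 5, with small noise
# acting as a crude proxy for occlusion and domain gap.
query = render_features(3, 5) + 0.05 * rng.normal(size=feat_dim)

# Analysis-by-synthesis: jointly search for the shape and pose whose
# synthesized features best reconstruct the query (lowest L2 residual).
best = min(
    ((shape, pose) for shape in range(n_shapes) for pose in range(n_poses)),
    key=lambda sp: np.sum((render_features(*sp) - query) ** 2),
)
print(best)  # → (3, 5): the true shape and pose are recovered
```

Because the comparison happens at the feature-map level rather than on a single global embedding, unoccluded regions can still dominate the residual, which is the intuition behind the robustness claim.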
Key Results
- On the Pix3D dataset, PASR achieved an 81.59% Top-1 retrieval accuracy, showing an average relative improvement of 11.09% over previous best baseline methods.
- On the Pascal3D dataset, PASR achieved a 76.43% Top-1 retrieval accuracy, with an average relative improvement of 7.15% over previous best baseline methods.
- PASR demonstrates strong performance under occlusion scenarios, maintaining an accuracy of 63.05% under L3 occlusion conditions.
Significance
PASR is significant for both academia and industry as it addresses long-standing robustness and generalization issues in single-view 3D shape retrieval, particularly in handling partial occlusions and fine-grained geometric details. By aligning 2D and 3D features in 2D space and performing pose optimization during inference, PASR offers a novel approach to enhancing accuracy and robustness in 3D shape retrieval. This method holds substantial value in academic research and practical applications such as autonomous driving and robotic navigation.
Technical Contribution
PASR's technical contributions lie in redefining 3D shape retrieval as an analysis-by-synthesis problem, offering a more interpretable formulation and new engineering possibilities compared to existing feed-forward SOTA methods. By injecting knowledge from a 2D foundation model into a 3D encoder, PASR generalizes well to novel mesh shapes. Additionally, the method supports multi-task learning, performing robust shape retrieval, competitive pose estimation, and accurate category classification within a single framework.
Novelty
PASR's novelty lies in introducing pose-awareness into 3D shape retrieval for the first time, using analysis-by-synthesis for feature-level alignment. This approach provides greater robustness and generalization compared to prior work, especially in handling partial occlusions and fine-grained geometric details.
Limitations
- PASR may experience performance degradation under extreme occlusion or complex backgrounds, as these situations can complicate feature alignment.
- The method requires pose optimization during inference, which may increase computational overhead, particularly when handling large-scale datasets.
- In some cases, higher-quality 3D model databases may be needed to achieve optimal performance.
Future Work
Future research directions include exploring ways to improve pose optimization efficiency without increasing computational overhead and enhancing robustness in more complex scenarios. Additionally, applying PASR to more practical applications, such as 3D object recognition in augmented and virtual reality, could be explored.
AI Executive Summary
Single-view 3D shape retrieval is a fundamental challenge in computer vision, becoming increasingly important with the growth of available 3D data. However, existing methods often fall short in handling partial occlusions and fine-grained geometric details, limiting their robustness and generalization to real-world applications. To address this, the paper proposes a novel framework called Pose-Aware 3D Shape Retrieval (PASR).
PASR redefines 3D shape retrieval by distilling knowledge from a 2D foundation model, DINOv3, into a 3D encoder. By aligning pose-conditioned 3D projections with 2D feature maps, PASR bridges the gap between real-world images and synthetic meshes. During inference, PASR performs test-time optimization via analysis-by-synthesis, jointly searching for the shape and pose that best reconstruct the input image's feature map.
In experiments, PASR demonstrates significant performance improvements on the Pix3D and Pascal3D datasets. On Pix3D, PASR achieved an 81.59% Top-1 retrieval accuracy, showing an average relative improvement of 11.09% over previous best baseline methods. On Pascal3D, PASR achieved a 76.43% Top-1 retrieval accuracy, with an average relative improvement of 7.15% over previous best baseline methods.
PASR's novelty lies in introducing pose-awareness into 3D shape retrieval for the first time, using analysis-by-synthesis for feature-level alignment. This approach provides greater robustness and generalization compared to prior work, especially in handling partial occlusions and fine-grained geometric details.
However, PASR may experience performance degradation under extreme occlusion or complex backgrounds, as these situations can complicate feature alignment. The method requires pose optimization during inference, which may increase computational overhead. Future research directions include exploring ways to improve pose optimization efficiency without increasing computational overhead and enhancing robustness in more complex scenarios.
Deep Analysis
Background
As computer vision technology evolves, single-view 3D shape retrieval has become an important research direction. Traditional methods often rely on large-scale multimodal alignment, aligning 3D shape features with existing image-text embedding spaces. However, these methods have limited generalization capabilities when dealing with real-world images, as available 3D models are often not exact instance-level matches for 2D images. Additionally, existing methods typically learn holistic, global embeddings rather than explicit 3D geometry representations. This view-agnostic design is inherently vulnerable to partial occlusions and limits generalization to unseen 3D mesh models.
Core Problem
The core problem of single-view 3D shape retrieval is how to retrieve a corresponding 3D mesh given only a single RGB image. As large-scale 3D data becomes increasingly common, the need for effective retrieval methods becomes significant. Existing methods often perform poorly when handling partial occlusions and fine-grained geometric details, limiting their robustness and generalization to real-world applications.
Innovation
PASR's core innovation lies in redefining 3D shape retrieval as an analysis-by-synthesis problem. First, the method distills fine-grained knowledge from a 2D foundation model, DINOv3, into a 3D encoder, achieving strong generalization capabilities to novel mesh shapes. Second, PASR aligns pose-conditioned 3D projections with 2D feature maps, bridging the gap between real-world images and synthetic meshes. Finally, during inference, PASR performs test-time optimization via analysis-by-synthesis, jointly searching for the shape and pose that best reconstruct the input image's feature map.
Methodology
- During training, distill fine-grained knowledge from a 2D foundation model, DINOv3, into a 3D encoder.
- Align pose-conditioned 3D projections with 2D feature maps, bridging the gap between real-world images and synthetic meshes.
- During inference, perform test-time optimization via analysis-by-synthesis, jointly searching for the shape and pose that best reconstruct the input image's feature map.
- Use a differentiable renderer to project 3D features into the 2D feature space, aligned according to a given camera pose.
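The pose-conditioned alignment in the steps above can be sketched as follows. This is a minimal NumPy illustration, not the paper's method: the renderer here is a simple non-differentiable orthographic splat, the point cloud and features are random, and every name (`rot_z`, `project_features`, the "DINOv3 map") is an invented stand-in.

```python
import numpy as np

rng = np.random.default_rng(1)

def rot_z(theta):
    """Rotation about the z-axis, standing in for a full camera pose."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

# Toy per-point features on a random point cloud. In the paper these
# would be produced by the 3D encoder; here they are illustrative only.
pts = rng.uniform(-1.0, 1.0, size=(200, 3))
pt_feats = rng.normal(size=(200, 4))

def project_features(points, feats, pose, grid=8):
    """Orthographically splat per-point features onto a 2D feature grid
    under a camera pose: a simplified, non-differentiable stand-in for
    the differentiable renderer used in PASR."""
    xy = (points @ pose.T)[:, :2]                 # rotate, then drop depth
    ij = np.clip(((xy + 1.0) / 2.0 * (grid - 1)).round().astype(int),
                 0, grid - 1)
    fmap = np.zeros((grid, grid, feats.shape[1]))
    cnt = np.zeros((grid, grid, 1))
    np.add.at(fmap, (ij[:, 1], ij[:, 0]), feats)  # accumulate per cell
    np.add.at(cnt, (ij[:, 1], ij[:, 0]), 1.0)
    return fmap / np.maximum(cnt, 1.0)            # mean-pool per cell

# Pretend 2D foundation-model map: features of the image at the true
# pose, plus a little noise so the comparison is not trivially zero.
true_pose = rot_z(0.3)
target_map = project_features(pts, pt_feats, true_pose)
target_map = target_map + 0.01 * rng.normal(size=target_map.shape)

mse = lambda a, b: float(np.mean((a - b) ** 2))
loss_right = mse(project_features(pts, pt_feats, true_pose), target_map)
loss_wrong = mse(project_features(pts, pt_feats, rot_z(1.5)), target_map)
print(loss_right < loss_wrong)  # projecting at the matching pose aligns far better
```

In the actual method the projection is differentiable, so the same comparison can drive gradient-based pose refinement during inference rather than a fixed-pose check.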
Experiments
Experiments were conducted on the Pix3D and Pascal3D datasets, using Top-1 retrieval accuracy as the primary evaluation metric. Baseline models included CMIC, SC-IBSR, OpenShape, and Uni3D. The experimental design considered different levels of occlusion (L0-L3) to evaluate the model's robustness in complex scenarios. Results showed that PASR outperformed existing methods under all occlusion conditions, particularly excelling under L3 occlusion.
Results
On the Pix3D dataset, PASR achieved an 81.59% Top-1 retrieval accuracy, showing an average relative improvement of 11.09% over previous best baseline methods. On the Pascal3D dataset, PASR achieved a 76.43% Top-1 retrieval accuracy, with an average relative improvement of 7.15% over previous best baseline methods. PASR demonstrates strong performance under occlusion scenarios, maintaining an accuracy of 63.05% under L3 occlusion conditions.
Applications
PASR has broad application potential in fields such as autonomous driving, robotic navigation, augmented reality, and virtual reality. In these applications, accurate 3D shape retrieval and pose estimation are crucial for efficient environmental perception and interaction. PASR offers a novel approach to enhancing accuracy and robustness in 3D shape retrieval, particularly in handling partial occlusions and fine-grained geometric details.
Limitations & Outlook
PASR may experience performance degradation under extreme occlusion or complex backgrounds, as these situations can complicate feature alignment. The method requires pose optimization during inference, which may increase computational overhead, particularly when handling large-scale datasets. In some cases, higher-quality 3D model databases may be needed to achieve optimal performance.
Plain Language (Accessible to Non-Experts)
Imagine you're in a kitchen trying to cook a dish. You have a recipe (2D image), but you need to know how to make the entire dish (3D shape). Existing methods are like guessing the final dish directly from the recipe, which can go wrong, especially when ingredients (image details) are missing. PASR is like a smart chef who first extracts key steps from the recipe (2D features) and then tries them out in the kitchen (3D space) until the dish matches the recipe description. Even if some ingredients are missing, it can infer the most reasonable way by analyzing other steps. This way, whether it's a simple dish or a complex banquet, it can cook it well.
ELI14 (Explained Like You're 14)
Hey there! Imagine you're playing a super cool game where you only have a picture, but you need to find the 3D model that matches it. It's like in Minecraft, where you have a picture of a block, but you want to know what the whole building looks like. Existing methods are like guessing the building directly from the picture, which can be wrong, especially when part of the picture is missing. PASR is like a super smart player who first extracts key information from the picture and then tries it out in the game until finding the building that matches the picture description. Even if some parts are missing, it can infer the most reasonable shape by analyzing other parts. This way, whether it's a simple building or a complex castle, it can find the closest model!
Glossary
3D Shape Retrieval
The task of retrieving, from a database of 3D models, the shape that corresponds to a given 2D image.
In this paper, 3D shape retrieval is the core task, with PASR improving retrieval accuracy through pose-awareness.
Single View
Analyzing and processing using only one viewpoint of an image.
PASR performs 3D shape retrieval using single-view images, overcoming occlusion and detail loss.
Pose-Aware
Considering the spatial pose information of objects to improve analysis accuracy.
PASR uses pose-awareness for 3D shape retrieval, enhancing sensitivity to fine-grained geometric details.
Occlusion
When parts of an image are obscured by other objects, leading to information loss.
PASR excels in handling partial occlusions, accurately retrieving 3D shapes.
Multi-task Learning
Learning multiple related tasks simultaneously to improve overall performance.
PASR excels in multi-task learning, performing 3D shape retrieval, pose estimation, and category classification.
DINOv3
A 2D foundation model used to extract fine-grained image features.
PASR extracts knowledge from DINOv3 and injects it into the 3D encoder.
Analysis-by-Synthesis
Interpreting an observation by searching for the synthesized output that best reproduces it.
PASR uses analysis-by-synthesis for test-time optimization, enhancing robustness to occlusion and detail.
Differentiable Renderer
A renderer capable of computing gradients during rendering.
PASR uses a differentiable renderer to project 3D features into 2D feature space.
Pix3D
A benchmark dataset for 3D shape retrieval, containing multiple categories of 3D models.
PASR was tested on the Pix3D dataset to validate its performance.
Pascal3D
A benchmark dataset for 3D shape retrieval, containing multiple categories of 3D models.
PASR was tested on the Pascal3D dataset to validate its performance.
Open Questions (Unanswered Questions from This Research)
1. How to improve feature alignment accuracy under extreme occlusion or complex backgrounds? Existing methods may experience performance degradation in these situations, requiring further research to enhance robustness.
2. How to improve pose optimization efficiency without increasing computational overhead? Existing methods require pose optimization during inference, which may lead to increased computational costs.
3. How to enhance robustness in more complex scenarios? Existing methods perform well in simple scenarios but may experience performance degradation in complex ones.
4. How to apply PASR to more practical applications, such as 3D object recognition in augmented and virtual reality?
5. How to achieve optimal performance on large-scale datasets? Existing methods may require higher-quality 3D model databases to achieve optimal performance.
Applications
Immediate Applications
Autonomous Driving
In autonomous driving, PASR can be used for real-time 3D shape retrieval and pose estimation, improving the accuracy and robustness of environmental perception.
Robotic Navigation
In robotic navigation, PASR can be used to identify and locate 3D objects, enhancing navigation precision and safety.
Augmented Reality
In augmented reality, PASR can be used to recognize and track 3D objects, enhancing the immersion and interactivity of user experiences.
Long-term Vision
Virtual Reality
In virtual reality, PASR can be used to create and manipulate 3D objects in virtual environments, providing more realistic and immersive experiences.
Smart Cities
In smart cities, PASR can be used for real-time monitoring and management of urban infrastructure, improving the efficiency and safety of city management.
Abstract
Single-view 3D shape retrieval is a fundamental yet challenging task that is increasingly important with the growth of available 3D data. Existing approaches largely fall into two categories: those using contrastive learning to map point cloud features into existing vision-language spaces and those that learn a common embedding space for 2D images and 3D shapes. However, these feed-forward, holistic alignments are often difficult to interpret, which in turn limits their robustness and generalization to real-world applications. To address this problem, we propose Pose-Aware 3D Shape Retrieval (PASR), a framework that formulates retrieval as a feature-level analysis-by-synthesis problem by distilling knowledge from a 2D foundation model (DINOv3) into a 3D encoder. By aligning pose-conditioned 3D projections with 2D feature maps, our method bridges the gap between real-world images and synthetic meshes. During inference, PASR performs a test-time optimization via analysis-by-synthesis, jointly searching for the shape and pose that best reconstruct the patch-level feature map of the input image. This synthesis-based optimization is inherently robust to partial occlusion and sensitive to fine-grained geometric details. PASR substantially outperforms existing methods on both clean and occluded 3D shape retrieval datasets by a wide margin. Additionally, PASR demonstrates strong multi-task capabilities, achieving robust shape retrieval, competitive pose estimation, and accurate category classification within a single framework.
References (20)
Uni3D: Exploring Unified 3D Representation at Scale
Junsheng Zhou, Jinsheng Wang, Baorui Ma et al.
OpenShape: Scaling Up 3D Shape Representation Towards Open-World Understanding
Minghua Liu, Ruoxi Shi, Kaiming Kuang et al.
Towards Large-Scale 3D Representation Learning with Multi-Dataset Point Prompt Training
Xiaoyang Wu, Zhuotao Tian, Xin Wen et al.
RoboTron-Mani: All-in-One Multimodal Large Model for Robotic Manipulation
Feng Yan, Fanfan Liu, Liming Zheng et al.
Robust Category-Level 6D Pose Estimation with Coarse-to-Fine Rendering of Neural Features
Wufei Ma, Angtian Wang, A. Yuille et al.
PointCLIP: Point Cloud Understanding by CLIP
Renrui Zhang, Ziyu Guo, Wei Zhang et al.
Location Field Descriptors: Single Image 3D Model Retrieval in the Wild
Alexander Grabner, P. Roth, V. Lepetit
Generalizing Single-View 3D Shape Retrieval to Occlusions and Unseen Objects
Qirui Wu, Daniel Ritchie, M. Savva et al.
A survey of content based 3D shape retrieval methods
J. Tangelder, R. Veltkamp
Single Image 3D Shape Retrieval via Cross-Modal Instance and Category Contrastive Learning
Ming-Xian Lin, Jie Yang, He Wang et al.
ImageNet3D: Towards General-Purpose Object-Level 3D Understanding
Wufei Ma, Guanning Zeng, Guofeng Zhang et al.
CLIP goes 3D: Leveraging Prompt Tuning for Language Grounded 3D Recognition
Deepti Hegde, Jeya Maria Jose Valanarasu, Vishal M. Patel
Scaling 3D Compositional Models for Robust Classification and Pose Estimation
Xiaoding Yuan, Guofeng Zhang, Prakhar Kaushik et al.
Chain of Semantics Programming in 3D Gaussian Splatting Representation for 3D Vision Grounding
Jiaxin Shi, Mingyue Xiang, Hao Sun et al.
Templates for 3D Object Pose Estimation Revisited: Generalization to New Objects and Robustness to Occlusions
Van Nguyen Nguyen, Yinlin Hu, Yang Xiao et al.
A survey on deep geometry learning: From a representation perspective
Yun-Peng Xiao, Yu-Kun Lai, Fang-Lue Zhang et al.
OPEN: Occlusion-Invariant Perception Network for Single Image-Based 3D Shape Retrieval
Fupeng Chu, Yang Cong, Ronghan Chen
SpatialReasoner: Towards Explicit and Generalizable 3D Spatial Reasoning
Wufei Ma, Yu-Cheng Chou, Qihao Liu et al.
Splat-Nav: Safe Real-Time Robot Navigation in Gaussian Splatting Maps
Timothy Chen, O. Shorinwa, Joseph Bruno et al.