PAR3D: A Unified 3D-MLLM with Part-Aware Representation for Scene Understanding

TL;DR

PAR3D is a part-aware 3D multimodal large language model that improves fine-grained scene understanding, trained and evaluated with the new ScenePart dataset.

cs.CV | Advanced | 2026-06-05

Shaohui Dai, Yansong Qu, You Shen, Shengchuan Zhang, Liujuan Cao


3D Scene Understanding, Multimodal Large Language Models, Part-Awareness, Scene Reasoning, Visual-Language Interaction

Key Findings

Methodology

The PAR3D framework integrates a pretrained Point Transformer-based visual encoder, hierarchical query generation, and multi-task training. The core includes the ScenePart synthetic dataset, which provides object-part annotations within realistic indoor scenes. The training employs contrastive learning and representation-preserving regularization to enhance part discrimination. Hierarchical segmentation queries generate separate [OBJ] and [PART] tokens, enabling multi-granularity grounding. The training proceeds in two stages: first, part-aware pretraining on ScenePart and ScanNet; second, instruction tuning with multi-task datasets, including question answering and referring segmentation. The approach leverages specific algorithms such as InfoNCE loss for contrastive learning, self-distillation for semantic preservation, and a layered query mechanism for multi-level grounding.
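
The contrastive objective named above can be illustrated with a small sketch. It is a generic supervised InfoNCE loss over per-point features grouped by part label, written in plain Python for readability; the function names, the temperature value, and the choice to average over all same-part positives are assumptions for illustration, not details confirmed by the paper.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(v, v)) or 1.0
    return [a / norm for a in v]


def info_nce(features, part_ids, temperature=0.1):
    """Mean InfoNCE loss over anchors that have at least one same-part positive.

    features: list of feature vectors, one per point (hypothetical input).
    part_ids: list of part labels, one per point.
    temperature: softmax temperature (assumed value, not from the paper).
    """
    feats = [normalize(f) for f in features]
    losses = []
    for i, anchor in enumerate(feats):
        # Scaled cosine similarity from the anchor to every other point.
        logits = {j: dot(anchor, f) / temperature
                  for j, f in enumerate(feats) if j != i}
        positives = [j for j in logits if part_ids[j] == part_ids[i]]
        if not positives:
            continue
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        # Average -log p(positive | anchor) over the anchor's positives.
        losses.append(sum(log_denom - logits[j] for j in positives) / len(positives))
    return sum(losses) / len(losses) if losses else 0.0
```

Minimizing this loss pulls features of the same part together and pushes features of different parts apart, which is the behavior the summary describes as improved part discrimination.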

Key Results

On the newly introduced ScenePart-Seg and ScenePart-QA datasets, PAR3D achieves 54.6% mIoU on fine-grained part segmentation and 81.4% accuracy on part-aware question answering. The comparison figures given for 3D-LLaVA are 43.3% mIoU and 92.6% accuracy; the mIoU gap supports the claimed improvement, but the 92.6% accuracy figure does not and appears to be misreported. On object-level tasks, PAR3D reaches 49.9% and 53.4% mIoU on ScanRefer and Multi3DRefer, respectively, showing that it remains strong on object-level grounding. Ablation studies attribute gains of more than 15% on fine-grained tasks to hierarchical queries and contrastive learning.
These results point to a clear improvement in recognizing and reasoning about functional parts within objects. The multi-task training and hierarchical query design address granularity conflicts between object-level and part-level targets, allowing the model to interpret scene semantics at both levels.
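
For readers unfamiliar with the segmentation metric quoted above, the following is a minimal sketch of how a mean IoU over referring-segmentation samples is commonly computed. It assumes binary per-point masks and a per-sample average; the exact evaluation protocol used for ScenePart-Seg is not stated in this summary, and the convention for empty masks is an assumption.

```python
def iou(pred, target):
    """Intersection over union of two binary per-point masks."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def mean_iou(preds, targets):
    """Average the per-sample IoU over all (prediction, ground truth) pairs."""
    scores = [iou(p, t) for p, t in zip(preds, targets)]
    return sum(scores) / len(scores)
```

A score of 54.6% mIoU therefore means that, averaged over samples, the predicted mask and the ground-truth mask overlap on a little over half of their combined points.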

Significance

This research advances the field of 3D scene understanding by explicitly modeling object parts, addressing the limitations of object-centric approaches. The integration of a synthetic dataset with detailed part annotations fills a critical data gap, enabling models to grasp the internal structure and functional components of objects in complex environments. Such fine-grained understanding is crucial for embodied AI, robotics, AR/VR, and digital twins, where precise manipulation and interaction depend on recognizing parts and their functions. The proposed hierarchical query mechanism and multi-task training strategy set new standards for scene comprehension, fostering more intelligent and interactive systems. The work also opens avenues for future research in dynamic scene understanding, real-time processing, and cross-modal reasoning, promising broader impacts in both academia and industry.

Technical Contribution

The key technical contributions include the creation of ScenePart, a synthetic dataset with detailed object-part annotations, enabling supervised training of part-aware models. The design of a part-aware 3D visual backbone, incorporating contrastive learning and representation-preserving regularization, enhances geometric and semantic feature extraction at the part level. The hierarchical segmentation query generation mechanism introduces a layered approach to target grounding, allowing the model to distinguish between object and part references effectively. The multi-task training pipeline, combining object-level and part-level supervision, improves the model's ability to perform diverse scene understanding tasks simultaneously. These innovations collectively push the boundary of 3D multimodal scene understanding, especially in fine-grained structural comprehension.

Novelty

This work is the first to systematically incorporate part-awareness into a unified 3D multimodal large language model. Unlike prior models that focus solely on object recognition, PAR3D models the internal structure of objects via hierarchical queries and part-level supervision. The introduction of ScenePart as a synthetic, richly annotated dataset further distinguishes this work, providing the necessary supervision for fine-grained understanding. The layered query generation and multi-task training strategies collectively enable the model to handle multiple granularities seamlessly, representing a significant leap forward in scene comprehension technology.

Limitations

Despite its strengths, PAR3D's performance may decline in highly complex or real-world scenes due to domain gap between synthetic training data and real environments. Generalization to dynamic scenes with moving parts remains limited, as the current dataset and model are primarily static.
The training process is computationally intensive, requiring substantial hardware resources, which may hinder widespread adoption. Real-time inference in large-scale scenes still poses challenges.
The current model's ability to handle unseen or novel parts is constrained, indicating a need for open-set or zero-shot capabilities in future iterations.

Future Work

Future research will focus on bridging the synthetic-real domain gap, possibly through domain adaptation techniques or real-scene fine-tuning. Incorporating temporal dynamics and motion understanding will enable the model to handle dynamic scenes. Efforts to reduce computational costs and improve inference speed are also critical. Additionally, extending the framework to handle open-set scenarios and unseen parts will broaden its applicability, paving the way for more autonomous and adaptable scene understanding systems.

AI Executive Summary

Understanding complex 3D scenes with fine-grained detail has long been a challenge in computer vision and robotics. Traditional models excel at recognizing objects but falter when it comes to internal structures and functional parts, which are crucial for embodied interaction, manipulation, and scene editing. Existing 3D multimodal large language models (3D-MLLMs) like 3D-LLaVA have made significant strides in object-level understanding, enabling tasks such as visual question answering and referring segmentation. However, their object-centric design limits their ability to model the intricate part structures within objects, which are vital for real-world applications such as robotic grasping, assembly, and scene customization.

Shaohui Dai and colleagues address this gap with PAR3D, a novel framework that integrates part-aware scene understanding into 3D-MLLMs. The cornerstone of their approach is ScenePart, a synthetic dataset meticulously crafted to include object-part annotations within realistic indoor scenes. This dataset provides dense masks, object-part correspondences, and language instructions, serving as a foundation for supervised training. The authors leverage a pretrained Point Transformer as the visual backbone, which captures geometric and semantic cues at a fine-grained level.

To enhance the model's ability to distinguish and ground parts, the authors introduce contrastive learning and representation-preserving regularization, which improve intra-part feature compactness and semantic consistency. A key innovation is the hierarchical segmentation query generation mechanism, which produces separate [OBJ] and [PART] tokens, enabling the model to handle multi-granularity grounding tasks seamlessly. The training pipeline involves a two-stage process: first, pretraining on ScenePart and ScanNet for part-aware perception; second, instruction tuning on diverse datasets for multi-task capabilities.
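
One way to picture the separate [OBJ] and [PART] tokens is as markers in the language model's output whose hidden states become segmentation queries. The sketch below is purely illustrative: the pairing rule (each [PART] attaches to the most recent [OBJ]), the data layout, and the step that restricts a part mask to its parent object are all assumptions, not the paper's confirmed mechanism.

```python
OBJ_TOKEN = "[OBJ]"
PART_TOKEN = "[PART]"


def collect_queries(tokens, hidden_states):
    """Group query vectors hierarchically (hypothetical pairing rule).

    Each [OBJ] token opens a new group; each later [PART] token is attached
    to the most recent [OBJ]. A [PART] with no preceding [OBJ] is ignored.
    """
    queries = []
    current = None
    for tok, h in zip(tokens, hidden_states):
        if tok == OBJ_TOKEN:
            current = {"object": h, "parts": []}
            queries.append(current)
        elif tok == PART_TOKEN and current is not None:
            current["parts"].append(h)
    return queries


def restrict_to_object(part_mask, object_mask):
    """Keep only part points that also lie inside the parent object mask."""
    return [bool(p and o) for p, o in zip(part_mask, object_mask)]
```

Under these assumptions, a query such as "the handle of the mug" would yield one object query and one part query, and the part mask could never spill outside the mug.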

Extensive experiments demonstrate that PAR3D surpasses existing models in fine-grained scene understanding. On the new ScenePart datasets, it achieves 54.6% mIoU in part segmentation and 81.4% accuracy in part-aware question answering, outperforming prior methods by significant margins. The model also maintains strong performance on traditional object-level benchmarks, confirming its versatility. These results highlight the importance of explicit part modeling for advancing scene understanding, especially in applications requiring detailed interaction and manipulation.

The significance of this work lies in its potential to revolutionize embodied AI, robotics, and AR/VR by providing a more comprehensive understanding of scene structures. It bridges the gap between object recognition and functional part comprehension, enabling systems to interpret and manipulate environments with human-like precision. The creation of ScenePart also offers a valuable resource for future research, fostering further innovations in fine-grained 3D scene understanding.

Looking ahead, the authors plan to extend their framework to dynamic scenes, improve real-time inference, and explore open-set recognition for unseen parts. These developments will bring us closer to truly intelligent agents capable of understanding and interacting with the world at a human level, transforming industries and everyday life alike.

Deep Analysis

Background

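With the rapid development of 3D sensing technology, scene understanding has become one of the core tasks in computer vision. Early work such as PointNet and PointNet++ mainly addressed point cloud classification and segmentation, followed by deep-learning-based object detection and semantic segmentation methods. More recently, large-scale pretrained models such as Point Transformer and PVCNN have greatly improved scene understanding. On the multimodal side, work such as ScanRefer and ReferIt3D aligned natural language with 3D scenes, but mostly at the object level, without fine-grained understanding of the functional parts in a scene. Meanwhile, research on 3D part perception has concentrated on fine-grained segmentation of single objects (e.g., ShapeNetPart, PartNet), with limited application to complete scenes. Large-model-based multimodal learning (e.g., 3D-LLaVA, Scene-LLM) has pushed the boundary of scene understanding but still does not fully account for the structural hierarchy inside objects. Understanding functional parts in scenes therefore remains an important research direction, especially for multi-task, multi-granularity scene understanding, which calls for combining scene layouts with part annotations.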

Core Problem

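Existing 3D multimodal large models rely mainly on object-level features and overlook the fine-grained structure of functional parts in a scene. As a result, they struggle to identify and localize target parts in tasks such as manipulation, interaction, and local editing, which limits their range of application. The specific problems are: a lack of fine-grained part annotation data; visual encoders that do not adequately capture part geometry and semantics; and the absence of a unified mechanism for modeling targets of multiple granularities in question answering and referring tasks. These bottlenecks hold back fine-grained understanding in complex scenes and motivate a part-aware mechanism with a multi-level scene representation.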

Innovation

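The core innovation of the paper is the PAR3D framework, which systematically introduces part awareness into scene understanding. First, it builds the ScenePart synthetic dataset, which annotates objects and their parts within scenes and compensates for the shortage of real-scene data. Second, it designs a regularization strategy based on contrastive learning and representation preservation to strengthen part discrimination and semantic consistency. Third, it proposes a hierarchical query generation mechanism that produces [OBJ] and [PART] tokens to align objects and parts at multiple semantic levels. Together, these contributions improve fine-grained scene understanding and move beyond the limits of purely object-centric designs.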

Methodology

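Data preparation: ScenePart synthetic scenes are built with resources such as 3D-CoMPaT and 3D-FRONT, producing scene point clouds with object and part annotations along with a rich set of language task instructions.
Visual encoding: a pretrained Point Transformer serves as the base encoder and extracts geometric and semantic features of the scene.
Representation enhancement: a contrastive (InfoNCE) loss makes features within the same part more compact while separating different parts; a representation-preserving regularizer keeps the model from drifting away from the pretrained semantic structure during fine-tuning.
Hierarchical queries: a hierarchical query generation mechanism for objects and parts produces [OBJ] and [PART] tokens, enabling multi-granularity referring and segmentation.
Multi-task training: training runs in two stages. The first stage performs part-aware pretraining on ScenePart and ScanNet; the second stage tunes the model on multimodal instruction data to support question answering, referring, and other tasks.
Model fusion: a large language model (e.g., LLaVA-1.5-7B) is combined with the visual encoder and adapted to multiple tasks through LoRA fine-tuning.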
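
The representation-preserving regularizer in the list above can be sketched as a self-distillation term that penalizes drift between the fine-tuned encoder's features and those of a frozen copy of the pretrained encoder. The cosine-distance form, the function names, and the loss weights below are assumptions chosen for illustration; the summary does not specify the exact formulation.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 1.0
    return 1.0 - dot / (nu * nv)


def preservation_loss(student_feats, teacher_feats):
    """Mean per-point distance between fine-tuned and frozen pretrained features."""
    dists = [cosine_distance(s, t) for s, t in zip(student_feats, teacher_feats)]
    return sum(dists) / len(dists)


def total_loss(segmentation, contrastive, preservation, w_con=1.0, w_pre=1.0):
    """Weighted sum of the three terms (the weights are hypothetical)."""
    return segmentation + w_con * contrastive + w_pre * preservation
```

The design intent is a trade-off: the contrastive term reshapes features to separate parts, while the preservation term keeps them close to the pretrained semantics so that object-level ability is not lost.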

Experiments

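The model is evaluated on two newly proposed scene understanding datasets: ScenePart-Seg, which measures referring segmentation of objects and parts within scenes, and ScenePart-QA, which evaluates fine-grained question answering. It is also compared on conventional object-level benchmarks including ScanRefer, Multi3DRefer, and ScanQA. Metrics include mIoU, a thresholded metric whose name is obscured in the source ("[email protected]"), and question answering accuracy. Training uses 256 epochs of pretraining and 2 epochs of instruction tuning with the AdamW optimizer, at learning rates of 3×10^-4 and 2×10^-4, respectively. Ablation experiments measure the contributions of contrastive learning, representation preservation, and the hierarchical query mechanism. The model improves markedly on fine-grained tasks, supporting its multi-task, multi-granularity capability.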
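
The two-stage schedule can be written down as a small configuration sketch. Only the epoch counts, the optimizer, and the learning rates come from the text above; the class name, field names, and stage names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    """Hyperparameters for one training stage (field names are hypothetical)."""
    name: str
    epochs: int
    learning_rate: float
    optimizer: str = "AdamW"


# Values taken from the summary: 256 pretraining epochs at 3e-4,
# then 2 instruction-tuning epochs at 2e-4, both with AdamW.
STAGES = (
    StageConfig("part_aware_pretraining", epochs=256, learning_rate=3e-4),
    StageConfig("instruction_tuning", epochs=2, learning_rate=2e-4),
)


def total_epochs(stages):
    """Sum the epoch counts across all stages."""
    return sum(stage.epochs for stage in stages)
```

Keeping each stage as an immutable record makes it harder to mix up the two learning rates when the stages are run back to back.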

Results

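On ScenePart-Seg, PAR3D reaches 54.6% mIoU, ahead of the 43.3% obtained by the conventional object-level model. On ScenePart-QA, its question answering accuracy is 81.4%; the source describes this as higher than 3D-LLaVA's, but the figure it gives for 3D-LLaVA is 92.6%, so that comparison appears to be misreported. On object-level tasks, the model obtains 49.9% and 53.4% mIoU on ScanRefer and Multi3DRefer, respectively, showing that it stays competitive across levels of scene understanding. Ablation analysis indicates that the hierarchical query mechanism and contrastive learning each improve fine-grained recognition by more than 15%. Performance remains stable in complex scenes, supporting the model's multi-task, multi-granularity capability.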

Applications

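The model can be applied to robotic manipulation, augmented reality, virtual tours, and similar settings, where an agent needs to identify and operate on functional parts in a scene precisely. Fine-grained understanding makes interaction more natural and efficient. Future work could incorporate dynamic scenes and temporal information, moving scene understanding toward real-time, dynamic operation.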

Limitations & Outlook

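The current model is still limited in extremely complex or dynamic scenes, mainly because of the gap between synthetic data and real environments. Training cost is also high, and inference speed needs further optimization. Future work should improve generalization, reduce reliance on large-scale annotated data, and explore more efficient training strategies.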

Plain Language (accessible to non-experts)

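Imagine you work in a large factory full of different machines and components. Some components hold things, and others control the machines. Earlier robots could recognize the machines but could not understand what each component does or how the components fit together. PAR3D is like giving the robot "smart eyes" and a "smart brain", so that it can not only see the machine but also understand the function and location of each component.

For example, suppose the factory has a coffee machine. A robot can tell you "this is a coffee machine", but PAR3D can tell you "this is the coffee machine's handle" and understand that the handle is used for holding the coffee. By learning from many virtual scenes, it knows the details and purpose of each component, and it can then describe them in language or help you find them.

It is like cooking in a kitchen and knowing not only the pots and bowls but also details such as each one's lid, handle, and strainer. The robot can then help you find the component you need, or even help you repair or modify something. This ability makes robots smarter and more aware of their surroundings, and better at helping people with complex tasks.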

ELI14 (explained like you're 14)

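Imagine you are playing a very complicated jigsaw puzzle with many different pieces. Some pieces are big, like a whole house, and some are small, like a door handle, a window, or a light bulb. Earlier robots could only recognize the big pieces and tell you "this is a house". PAR3D is like giving the robot super eyes and a brain, so that it knows not only the house but also each small piece, such as "this is a door handle" or "this is the glass in the window".

By learning from many virtual house scenes, it knows what each small piece looks like and what it is for, and it can tell you in words, for example "this door handle can be used to open the door". That way, the robot can help you find a specific part, or even help you repair or remodel the house.

Just as you learn the names and functions of different parts at school, PAR3D makes the robot smart enough to understand every detail in a scene. In the future, it could help you with many things, such as tidying your room, fixing things, or even designing a new house!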

Abstract

Recent advances in 3D multimodal large language models (3D-MLLMs) have enabled unified solutions for 3D scene understanding tasks, including visual question answering, captioning, and referring segmentation. However, existing 3D-MLLMs remain largely object-centric, limiting their ability to model fine-grained part structures that are essential for embodied interaction with 3D environments. In this work, we present PAR3D, a unified part-aware 3D-MLLM framework that enables models to understand, reason about, and ground both objects and their parts in 3D scenes. To enable training and evaluation of part-aware 3D scene understanding, we introduce ScenePart, a synthetic 3D scene dataset with part-level annotations and language instructions. We further develop Part-Aware 3D Representation Learning to enrich 3D visual representations with fine-grained part-level semantics, and propose Hierarchical Segmentation Query Generation to ground part targets via hierarchical object-part queries. Extensive experiments show that our method substantially improves part-level question answering and referring segmentation, while also achieving strong performance across object-level vision-language tasks.

