DreamPartGen: Semantically Grounded Part-Level 3D Generation via Collaborative Latent Denoising
DreamPartGen achieves semantically grounded part-level 3D generation via collaborative latent denoising, reducing Chamfer Distance by 53% and improving text-shape alignment by 20%.
Key Findings
Methodology
DreamPartGen introduces a collaborative latent denoising framework, employing Duplex Part Latents (DPLs) and Relational Semantic Latents (RSLs) for part-level 3D generation. DPLs jointly model the geometry and appearance of each part, while RSLs capture inter-part dependencies derived from language. A synchronized co-denoising process ensures mutual geometric and semantic consistency, enabling coherent, interpretable, and text-aligned 3D synthesis.
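To make the two latent types concrete, here is a minimal data-structure sketch. All class names, fields, and sizes are hypothetical illustrations; the paper does not specify these containers, only that DPLs hold per-part 3D/2D latent sequences and RSLs hold language-derived relational tokens.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DuplexPartLatent:
    """Hypothetical container: one DPL jointly holds a part's geometry and appearance."""
    part_name: str
    geometry_latent: List[float]    # stands in for the 3D latent sequence
    appearance_latent: List[float]  # stands in for the 2D latent sequence

@dataclass
class RelationalSemanticLatent:
    """Hypothetical container: one RSL encodes a language-derived inter-part relation."""
    subject: str    # e.g. "leg"
    predicate: str  # e.g. "supports"
    obj: str        # e.g. "seat"
    tokens: List[float] = field(default_factory=list)

# A chair described as "the legs support the seat" might decompose into:
dpls = [DuplexPartLatent("leg", [0.0] * 8, [0.0] * 8),
        DuplexPartLatent("seat", [0.0] * 8, [0.0] * 8)]
rsls = [RelationalSemanticLatent("leg", "supports", "seat")]
```

The key design point this sketch mirrors is that geometry and appearance travel together per part (the "duplex" of DPLs), while relations live in a separate, language-aligned stream (RSLs) that can condition all parts at once.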
Key Results
- Result 1: Across multiple benchmarks, DreamPartGen excels in geometric fidelity, reducing Chamfer Distance by 53% and improving text-shape alignment by 20%.
- Result 2: On the PartRel3D dataset, DreamPartGen surpasses previous baselines in geometric precision (53% reduction in CD, 33% reduction in EMD) and text-shape alignment (20% improvement in CLIP/ULIP).
- Result 3: In generalization tests for rare parts and unseen relation predicates, DreamPartGen outperforms prior baselines, improving Render-FID by 14.7-16.3%, reducing CD by 68.2-71.2%, and improving ULIP-T by 39.6-47.9%.
Significance
The significance of DreamPartGen lies in addressing the oversight of semantic and functional structures in existing text-to-3D generation methods. By introducing a semantically grounded part-level generation framework, DreamPartGen not only enhances geometric fidelity and text alignment but also provides fine control capabilities for downstream applications such as fine-grained part editing, articulated object generation, and mini-scene synthesis. This research offers new perspectives and methods for the 3D generation field, potentially attracting widespread attention in academia and industry.
Technical Contribution
DreamPartGen's technical contributions include its collaborative latent denoising framework, which unifies geometric, visual, and relational reasoning through the introduction of Duplex Part Latents (DPLs) and Relational Semantic Latents (RSLs). Compared to existing methods, DreamPartGen achieves significant improvements in geometric fidelity and text alignment, offering new theoretical guarantees and engineering possibilities, such as large-scale supervised training and maintaining local part fidelity and global consistency in complex 3D structures.
Novelty
The novelty of DreamPartGen lies in its first introduction of semantically grounded part-level generation into text-to-3D generation. Unlike existing geometry-focused methods, DreamPartGen achieves geometric and semantic consistency through a collaborative denoising process, ensuring that the generated 3D objects are precise in local details and coherent in global structure.
Limitations
- Limitation 1: DreamPartGen may encounter performance bottlenecks when handling very complex scenes, as the model's complexity and computational cost increase significantly.
- Limitation 2: The method's reliance on language descriptions may lead to inconsistent generation results when dealing with ambiguous or unclear text inputs.
- Limitation 3: There may still be issues with unstable generation or missing details in certain specific 3D shapes or structures.
Future Work
Future research directions include optimizing DreamPartGen's computational efficiency to handle larger-scale and more complex 3D scenes. Additionally, further exploration of how to improve the consistency and stability of generation results under more diverse language inputs is an important research topic. Researchers may also consider applying this framework to other fields, such as virtual reality and augmented reality, to explore its potential in practical applications.
AI Executive Summary
The generation of 3D objects has been a significant research topic in the field of computer vision. However, existing text-to-3D generation methods often overlook the semantic and functional structures of objects, leading to deficiencies in geometric fidelity and text alignment. The emergence of DreamPartGen offers a new solution to this problem.
DreamPartGen is a semantically grounded part-level 3D generation framework that achieves geometric and semantic consistency through collaborative latent denoising. The method introduces Duplex Part Latents (DPLs) and Relational Semantic Latents (RSLs), which respectively model the geometry and appearance of each part and capture inter-part semantic dependencies derived from language. Through a synchronized co-denoising process, DreamPartGen can generate coherent, interpretable, and text-aligned 3D objects.
In experiments, DreamPartGen demonstrates outstanding performance across multiple benchmarks, significantly improving geometric fidelity by reducing Chamfer Distance by 53% and enhancing text-shape alignment by 20%. Additionally, in generalization tests for rare parts and unseen relation predicates, DreamPartGen outperforms previous baselines, showcasing its robust capabilities in complex 3D structures.
The significance of DreamPartGen lies not only in enhancing the accuracy and consistency of 3D generation but also in providing fine control capabilities for downstream applications such as fine-grained part editing, articulated object generation, and mini-scene synthesis. This research offers new perspectives and methods for the 3D generation field, potentially attracting widespread attention in academia and industry.
However, DreamPartGen also has some limitations, such as potential performance bottlenecks when handling very complex scenes and reliance on language descriptions that may lead to inconsistent generation results when dealing with ambiguous or unclear text inputs. Future research directions include optimizing computational efficiency and improving the consistency and stability of generation results.
In summary, DreamPartGen brings new possibilities to the field of 3D generation, providing an effective solution to the shortcomings of existing methods with its semantically grounded part-level generation framework. Future research will continue to explore its potential in broader applications.
Deep Analysis
Background
3D object generation is a crucial research direction in computer vision and graphics, involving tasks that generate three-dimensional shapes from text descriptions. Traditional 3D generation methods mainly rely on geometric information, overlooking the semantic and functional structures of objects, leading to deficiencies in geometric fidelity and text alignment. In recent years, with the development of deep learning technology, neural network-based generation methods have gradually become mainstream, such as DreamFusion and ProlificDreamer. However, these methods typically focus only on generating whole objects without considering the relationships and semantic consistency between parts. To overcome these challenges, researchers have begun exploring part-level generation methods, introducing part decomposition and semantically grounded generation frameworks to improve the accuracy and consistency of generation. DreamPartGen was proposed in this context, achieving semantically grounded part-level 3D generation through collaborative latent denoising, providing new ideas for addressing the shortcomings of existing methods.
Core Problem
Existing text-to-3D generation methods often overlook the semantic and functional structures of objects when handling complex objects, leading to deficiencies in geometric fidelity and text alignment. Specifically, these methods typically focus only on generating whole objects without considering the relationships and semantic consistency between parts. Additionally, existing methods may produce unstable and inconsistent generation results when dealing with ambiguous or unclear text inputs. Achieving semantically consistent part-level generation while maintaining geometric fidelity is a significant challenge in current research.
Innovation
The core innovation of DreamPartGen lies in its collaborative latent denoising framework, which unifies geometric, visual, and relational reasoning through the introduction of Duplex Part Latents (DPLs) and Relational Semantic Latents (RSLs). Specifically, DPLs are used to jointly model the geometry and appearance of each part, while RSLs capture inter-part dependencies derived from language. Through a synchronized co-denoising process, DreamPartGen ensures geometric and semantic consistency, enabling coherent, interpretable, and text-aligned 3D synthesis. Compared to existing methods, DreamPartGen achieves significant improvements in geometric fidelity and text alignment, offering new theoretical guarantees and engineering possibilities.
Methodology
The methodology of DreamPartGen can be divided into several key steps:
- Introduction of Duplex Part Latents (DPLs): DPLs jointly model the geometry and appearance of each part, capturing local geometric and visual details through 3D and 2D latent sequences.
- Introduction of Relational Semantic Latents (RSLs): RSLs capture inter-part dependencies derived from language, providing control signals for part interactions through global relational and local semantic tokens.
- Collaborative denoising process: Through a synchronized co-denoising process, DPLs and RSLs co-evolve under part-level and object-level synchronization, ensuring geometric and semantic consistency.
- Use of the large-scale PartRel3D dataset: The PartRel3D dataset provides rich functional and spatial relational triplets for explicit language-based supervision of inter-part relations.
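The synchronized co-denoising idea can be caricatured with scalar latents: each part latent is denoised toward its own target (part-level update) while a synchronization term pulls all latents toward their shared mean (object-level consistency). This is a toy sketch of the coupling, not the paper's actual diffusion update; the step sizes and the consensus term are invented for illustration.

```python
import random

def co_denoise(latents, targets, steps=100, lr=0.1, sync=0.05):
    """Toy synchronized denoising: each latent moves toward its own target
    (per-part denoising) and toward the mean of all latents (object-level sync)."""
    latents = list(latents)
    for _ in range(steps):
        mean = sum(latents) / len(latents)
        latents = [x - lr * (x - t) - sync * (x - mean)
                   for x, t in zip(latents, targets)]
    return latents

random.seed(0)
noisy = [random.gauss(0, 1) for _ in range(4)]  # noisy per-part latents
targets = [1.0, 1.0, 1.0, 1.0]                  # per-part denoising targets
out = co_denoise(noisy, targets)
```

The point of the coupling term is that no part is denoised in isolation: even in this toy, each update sees the state of every other part, which is the mechanism DreamPartGen uses (via RSLs) to keep local part fidelity and global structure consistent.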
Experiments
In the experimental design, researchers used multiple benchmark datasets, including Objaverse, ShapeNet, ABO, and PartRel3D, to evaluate the performance of DreamPartGen. The baseline methods used in the experiments include Trellis, CLAY, HoloPart, and PartCrafter, which represent the latest advancements in the field of 3D generation. To assess the quality of the generated results, researchers adopted various metrics, including Chamfer Distance (CD), Earth Mover’s Distance (EMD), Render-FID, and Render-KID. Additionally, ablation studies were conducted to analyze the contribution of different components to the generation results.
Results
Experimental results show that DreamPartGen performs exceptionally well across multiple benchmarks, reducing Chamfer Distance by 53% and improving text-shape alignment by 20%. On the PartRel3D dataset, DreamPartGen surpasses previous baselines in geometric precision (53% reduction in CD, 33% reduction in EMD) and text-shape alignment (20% improvement in CLIP/ULIP). Additionally, in generalization tests for rare parts and unseen relation predicates, DreamPartGen outperforms prior baselines, improving Render-FID by 14.7-16.3%, reducing CD by 68.2-71.2%, and improving ULIP-T by 39.6-47.9%. These results demonstrate DreamPartGen's robust capabilities in complex 3D structures.
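The reported percentages are relative changes against the baseline value: for lower-is-better metrics (CD, EMD, Render-FID) they are reductions, for higher-is-better metrics (CLIP/ULIP) they are gains. A small helper makes the convention explicit; the numeric values below are illustrative placeholders, not figures from the paper.

```python
def relative_change(baseline, ours, lower_is_better=True):
    """Percent improvement of `ours` over `baseline`.
    For lower-is-better metrics this is the percent reduction."""
    if lower_is_better:
        return 100.0 * (baseline - ours) / baseline
    return 100.0 * (ours - baseline) / baseline

# Illustrative only: a baseline CD of 0.100 cut to 0.047 is about a 53% reduction.
cd_improvement = relative_change(0.100, 0.047)
# A CLIP score raised from 0.50 to 0.60 is about a 20% improvement.
clip_improvement = relative_change(0.50, 0.60, lower_is_better=False)
```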
Applications
Application scenarios for DreamPartGen include fine-grained part editing, articulated object generation, and mini-scene synthesis. Through its semantically grounded part-level generation framework, DreamPartGen provides fine control capabilities for these applications. Additionally, DreamPartGen can be applied to fields such as virtual reality and augmented reality, offering new solutions for 3D generation tasks in these areas. Its potential impact in academia and industry could be extensive and far-reaching.
Limitations & Outlook
Despite the significant advancements achieved by DreamPartGen in the field of 3D generation, there are still some limitations. First, DreamPartGen may encounter performance bottlenecks when handling very complex scenes, as the model's complexity and computational cost increase significantly. Second, the method's reliance on language descriptions may lead to inconsistent generation results when dealing with ambiguous or unclear text inputs. Additionally, there may still be issues with unstable generation or missing details in certain specific 3D shapes or structures. Future research directions include optimizing computational efficiency and improving the consistency and stability of generation results.
Plain Language (accessible to non-experts)
Imagine you're in a kitchen cooking. You have a recipe that lists all the ingredients and steps you need. Now, imagine you have a smart assistant that not only helps you prepare the ingredients but also automatically creates the dish based on your description. DreamPartGen is like this smart assistant, but instead of food, it generates three-dimensional objects.
In this process, DreamPartGen takes your description and breaks the object down into different parts, like the legs, seat, and backrest of a chair. Then, it ensures that each part matches your description and that the relationships between these parts are reasonable, just like ensuring the legs of a chair are under the seat.
What makes DreamPartGen special is that it not only focuses on the details of each part but also on how these parts come together to form a complete object. It's like making sure each ingredient is prepared correctly and that they ultimately combine into a delicious dish.
In this way, DreamPartGen can generate 3D objects that are both consistent with the description and structurally sound, bringing new possibilities to the field of 3D generation.
ELI14 (explained like you're 14)
Hey there, friends! Today I want to tell you about something super cool called DreamPartGen. Imagine you can describe an object with words, and then that thing magically turns into a 3D model on your computer! Isn't that amazing?
DreamPartGen is like a wizard that can turn your words into little parts, like the legs, seat, and backrest of a chair. Then, it puts these parts together to make a complete chair. Plus, it makes sure the parts are in the right place, like making sure the legs are under the seat.
This technology is really awesome because it can create detailed parts and make sure the whole object looks real, just like what you'd see in a store. And it can make different objects based on different descriptions, like a chair with armrests or one without a backrest.
So, next time you imagine an object, DreamPartGen can help you bring it to life! Isn't that cool?
Glossary
Duplex Part Latents
Latent variables that jointly model the geometry and appearance of each part. They capture local geometric and visual details through 3D and 2D latent sequences.
Used in DreamPartGen for part-level 3D generation.
Relational Semantic Latents
Latent variables that capture inter-part dependencies derived from language. They provide control signals for part interactions through global relational and local semantic tokens.
Used in DreamPartGen to ensure geometric and semantic consistency.
Collaborative Denoising
A process that ensures geometric and semantic consistency through synchronized denoising, enabling coherent, interpretable, and text-aligned 3D synthesis.
Used in DreamPartGen for semantically grounded part-level generation.
Chamfer Distance
A metric used to measure the distance between two sets of points, commonly used to evaluate the geometric precision of 3D generation results.
Used in experiments to evaluate DreamPartGen's geometric fidelity.
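For reference, the symmetric Chamfer Distance averages each point's distance to its nearest neighbor in the other set, in both directions. A minimal pure-Python version using squared Euclidean distances (the paper may use a different variant, e.g. unsquared distances or a different normalization):

```python
def chamfer_distance(a, b):
    """Symmetric Chamfer Distance between point sets a and b.
    Each point is a tuple of coordinates; uses squared Euclidean
    nearest-neighbor distances, averaged in both directions."""
    def sq_dist(p, q):
        return sum((pi - qi) ** 2 for pi, qi in zip(p, q))
    def one_way(src, dst):
        return sum(min(sq_dist(p, q) for q in dst) for p in src) / len(src)
    return one_way(a, b) + one_way(b, a)

a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
b = [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
cd = chamfer_distance(a, b)  # 0.5 each way -> 1.0
```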
Earth Mover’s Distance
A metric used to measure the distance between two probability distributions, commonly used to evaluate the geometric precision of generation results.
Used in experiments to evaluate DreamPartGen's geometric fidelity.
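For equal-size point sets, the Earth Mover's Distance reduces to the cost of an optimal one-to-one matching. A brute-force sketch for tiny sets, for illustration only; real implementations use the Hungarian algorithm or Sinkhorn approximations rather than enumerating permutations:

```python
import itertools
import math

def emd(a, b):
    """EMD between equal-size point sets: minimum average pairwise distance
    over all one-to-one matchings. O(n!) -- only for tiny illustrative sets."""
    assert len(a) == len(b)
    def dist(p, q):
        return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))
    best = min(
        sum(dist(p, q) for p, q in zip(a, perm))
        for perm in itertools.permutations(b)
    )
    return best / len(a)

a = [(0.0, 0.0), (1.0, 0.0)]
b = [(1.0, 0.0), (0.0, 1.0)]
d = emd(a, b)  # optimal matching pairs (1,0) with (1,0): cost (1 + 0) / 2
```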
Render-FID
A metric used to evaluate the quality of generated images by comparing the feature distributions of generated and real images.
Used in experiments to evaluate DreamPartGen's visual fidelity.
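FID fits Gaussians to the two feature distributions and measures the Frechet distance between them; Render-FID applies this to rendered views. In the univariate case the closed form is (mu1 - mu2)^2 + s1^2 + s2^2 - 2*s1*s2. A toy 1-D sketch (real FID uses multivariate Inception features and a matrix square root):

```python
import statistics

def fid_1d(xs, ys):
    """Frechet distance between 1-D Gaussian fits of two samples:
    (mu1 - mu2)^2 + s1^2 + s2^2 - 2*s1*s2 (the univariate FID formula)."""
    m1, m2 = statistics.fmean(xs), statistics.fmean(ys)
    s1, s2 = statistics.pstdev(xs), statistics.pstdev(ys)
    return (m1 - m2) ** 2 + s1 ** 2 + s2 ** 2 - 2 * s1 * s2

# Identical spreads: only the mean shift contributes, so shifting by 2 gives 4.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [2.0, 3.0, 4.0, 5.0]
score = fid_1d(xs, ys)
```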
Ablation Study
A study that evaluates the impact of removing or modifying certain components of a model on its overall performance.
Used in experiments to analyze the contribution of different components of DreamPartGen.
Text-to-3D Generation
A task that generates three-dimensional shapes from text descriptions, involving natural language processing and computer vision techniques.
The main research focus of DreamPartGen.
Part Decomposition
The process of breaking down complex objects into multiple parts for better modeling and generation.
Used in DreamPartGen for part-level generation.
Semantic Grounding
Providing semantic guidance to the generation process through language descriptions, ensuring consistency between the generated results and the descriptions.
Used in DreamPartGen for semantically consistent 3D generation.
Open Questions (unanswered questions from this research)
- Open question 1: How can DreamPartGen's performance in complex scenes be further improved without increasing computational costs? Existing methods may encounter performance bottlenecks when handling complex scenes, requiring more efficient computational strategies.
- Open question 2: How can the consistency of generation results be improved when dealing with ambiguous or unclear text inputs? The reliance on language descriptions in existing methods may lead to inconsistent generation results, requiring more robust semantic parsing.
- Open question 3: How can the stability of generation results be improved under more diverse language inputs? Existing methods may produce unstable generation results when handling diverse language inputs, requiring more powerful language models.
- Open question 4: How can the geometric fidelity of generation results be improved without losing details? Existing methods may still have issues with missing details in certain specific 3D shapes or structures.
- Open question 5: How can DreamPartGen be applied to other fields, such as virtual reality and augmented reality? Exploring its potential and challenges in practical applications is needed.
Applications
Immediate Applications
Fine-Grained Part Editing
Designers can use DreamPartGen to finely edit specific parts of 3D models, achieving higher design precision and flexibility.
Articulated Object Generation
DreamPartGen can be used to generate 3D objects with complex articulated structures, such as robots and mechanical arms, improving their design and manufacturing efficiency.
Mini-Scene Synthesis
With DreamPartGen, users can quickly generate small 3D scenes for game development and virtual reality applications.
Long-term Vision
3D Generation in Virtual Reality
DreamPartGen can be used for real-time 3D generation in virtual reality environments, providing users with a more immersive experience.
Object Recognition and Generation in Augmented Reality
By integrating DreamPartGen, augmented reality applications can achieve more accurate object recognition and generation, enhancing user interaction experiences.
Abstract
Understanding and generating 3D objects as compositions of meaningful parts is fundamental to human perception and reasoning. However, most text-to-3D methods overlook the semantic and functional structure of parts. While recent part-aware approaches introduce decomposition, they remain largely geometry-focused, lacking semantic grounding and failing to model how parts align with textual descriptions or their inter-part relations. We propose DreamPartGen, a framework for semantically grounded, part-aware text-to-3D generation. DreamPartGen introduces Duplex Part Latents (DPLs) that jointly model each part's geometry and appearance, and Relational Semantic Latents (RSLs) that capture inter-part dependencies derived from language. A synchronized co-denoising process enforces mutual geometric and semantic consistency, enabling coherent, interpretable, and text-aligned 3D synthesis. Across multiple benchmarks, DreamPartGen delivers state-of-the-art performance in geometric fidelity and text-shape alignment.
References (20)
PartCrafter: Structured 3D Mesh Generation via Compositional Latent Diffusion Transformers
Yuchen Lin, Chenguo Lin, Panwang Pan et al.
From One to More: Contextual Part Latents for 3D Generation
Shaocong Dong, Lihe Ding, Xiao Chen et al.
Magic3D: High-Resolution Text-to-3D Content Creation
Chen-Hsuan Lin, Jun Gao, Luming Tang et al.
Structured 3D Latents for Scalable and Versatile 3D Generation
Jianfeng Xiang, Zelong Lv, Sicheng Xu et al.
HoloPart: Generative 3D Part Amodal Segmentation
Yu-nuo Yang, Yuan-Chen Guo, Yukun Huang et al.
CoRe3D: Collaborative Reasoning as a Foundation for 3D Intelligence
Tianjiao Yu, Xinzhuo Li, Yifan Shen et al.
3DShape2VecSet: A 3D Shape Representation for Neural Fields and Generative Diffusion Models
Biao Zhang, Jiapeng Tang, M. Nießner et al.
SALAD: Part-Level Latent Diffusion for 3D Shape Generation and Manipulation
Juil Koo, Seungwoo Yoo, Minh Hoai Nguyen et al.
Auto-Encoding Variational Bayes
Diederik P. Kingma, Max Welling
ULIP: Learning a Unified Representation of Language, Images, and Point Clouds for 3D Understanding
Le Xue, Mingfei Gao, Chen Xing et al.
Qwen2.5-VL Technical Report
Shuai Bai, Keqin Chen, Xuejing Liu et al.
DreamBooth3D: Subject-Driven Text-to-3D Generation
Amit Raj, S. Kaza, Ben Poole et al.
Ghost in the Minecraft: Generally Capable Agents for Open-World Environments via Large Language Models with Text-based Knowledge and Memory
Xizhou Zhu, Yuntao Chen, Hao Tian et al.
Text to 3D Scene Generation with Rich Lexical Grounding
Angel X. Chang, Will Monroe, M. Savva et al.
PRIMA: Multi-Image Vision-Language Models for Reasoning Segmentation
Muntasir Wahed, Kiet A. Nguyen, Adheesh Juvekar et al.
DreamArt: Generating Interactable Articulated Objects from a Single Image
Ruijie Lu, Yu Liu, Jiaxiang Tang et al.
OmniPart: Part-Aware 3D Generation with Semantic Decoupling and Structural Cohesion
Yu-nuo Yang, Yufan Zhou, Yuan-Chen Guo et al.
Counterfactual Segmentation Reasoning: Diagnosing and Mitigating Pixel-Grounding Hallucination
Xinzhuo Li, Adheesh Juvekar, Xing Liu et al.
ShapeNet: An Information-Rich 3D Model Repository
Angel X. Chang, T. Funkhouser, L. Guibas et al.
MVDream: Multi-view Diffusion for 3D Generation
Yichun Shi, Peng Wang, Jianglong Ye et al.