Instruct-Particulate: Scaling Feed-Forward 3D Object Articulation with Kinematic Control
Instruct-Particulate employs large-scale heterogeneous datasets and instruction-guided neural networks to efficiently predict 3D articulated structures, significantly improving generalization.
Key Findings
Methodology
The proposed Instruct-Particulate model adopts an encoder-decoder architecture that integrates multimodal inputs (point clouds, textual instructions, and point prompts) to predict the articulated structure of 3D meshes. Its core is a transformer with multi-head attention that fuses shape features, part descriptions, and query points. Training uses a large-scale heterogeneous dataset of over 150,000 articulated 3D models, which extends existing public collections with additional models pseudo-labeled by vision-language models (VLMs). The model optimizes a multi-task loss that jointly predicts part segmentation and joint motion parameters, enabling it to handle diverse categories and varying annotation granularities. At inference, the kinematic specification can be obtained automatically from a large vision-language model, so the model can be applied to any input mesh, including AI-generated meshes and meshes reconstructed from real-world images.
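The multi-task objective described above can be illustrated with a minimal sketch. The summary only states that segmentation and joint parameters are optimized jointly, so the specific terms here (per-point cross-entropy plus a sign-invariant axis term) and the weights `w_seg` and `w_joint` are assumptions for illustration, not the authors' loss.

```python
import math

def cross_entropy(logits, label):
    """Softmax cross-entropy for one point, computed in a numerically stable way."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[label]

def axis_loss(pred_axis, true_axis):
    """1 - |cos(angle)|, so an axis pointing the opposite way is not penalized."""
    dot = sum(p * t for p, t in zip(pred_axis, true_axis))
    norm = math.sqrt(sum(p * p for p in pred_axis)) * math.sqrt(sum(t * t for t in true_axis))
    return 1.0 - abs(dot) / norm

def multi_task_loss(point_logits, point_labels, pred_axes, true_axes,
                    w_seg=1.0, w_joint=1.0):
    """Weighted sum of a segmentation term and a joint-axis term (assumed form)."""
    seg = sum(cross_entropy(l, y) for l, y in zip(point_logits, point_labels)) / len(point_labels)
    joint = sum(axis_loss(p, t) for p, t in zip(pred_axes, true_axes)) / len(true_axes)
    return w_seg * seg + w_joint * joint

# Two points with uninformative logits over two parts, one perfectly predicted axis.
loss = multi_task_loss([[0.0, 0.0], [0.0, 0.0]], [0, 1],
                       [(0.0, 0.0, -1.0)], [(0.0, 0.0, 1.0)])
```

In this toy call the axis term is zero because the predicted axis differs from the target only in sign, so the total equals the segmentation term.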
Key Results
- On the Lightwheel dataset, Instruct-Particulate achieves a part match accuracy of 94.3%, outperforming baseline methods such as PartField and Particulate by over 20%. The geometric Intersection over Union (gIoU) reaches 0.583, representing a 15% improvement over previous state-of-the-art. The model accurately predicts joint axes with an average angular error (AE) below 10 degrees and position error (LE) within 2mm, demonstrating precise articulation estimation. Its ability to generalize across unseen categories and to AI-generated meshes is validated through extensive experiments, maintaining high accuracy in complex scenarios.
- The incorporation of large-scale pseudo-labeled datasets, including synthetic and real models, significantly enhances the model's robustness. Ablation studies reveal that adding diverse data sources increases part match accuracy from 89.3% to 96.8%, and reduces joint parameter errors by approximately 30%. The model performs well even with minimal supervision, indicating strong zero-shot capabilities. It supports inputs ranging from static meshes to image-based reconstructions, enabling applications like automatic asset generation, robotic manipulation, and virtual avatar creation.
- Practically, the model facilitates automatic reconstruction of articulated assets from images, supporting applications in robotics, animation, and AR/VR. It enables end-to-end pipelines for converting 2D images into manipulable 3D models with accurate joint structures, reducing manual effort and increasing scalability. Its ability to predict complex joint configurations across categories opens new avenues for content creation, virtual prototyping, and intelligent scene understanding.
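The joint-axis metrics reported above (angular error and position error) can be made concrete with a small sketch. This summary does not give the exact metric definitions, so the choices below are assumptions: angular error is the sign-invariant angle between axis directions, and position error is the minimum distance between the predicted and ground-truth axis lines.

```python
import math

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _norm(a):
    return math.sqrt(_dot(a, a))

def axis_angle_error_deg(pred_dir, true_dir):
    """Angle between two axis directions in degrees, ignoring orientation sign."""
    c = abs(_dot(pred_dir, true_dir)) / (_norm(pred_dir) * _norm(true_dir))
    return math.degrees(math.acos(min(1.0, c)))

def axis_line_distance(p_pred, d_pred, p_true, d_true, eps=1e-9):
    """Minimum distance between two 3D lines, each given as (point, direction)."""
    diff = tuple(a - b for a, b in zip(p_pred, p_true))
    n = _cross(d_pred, d_true)
    if _norm(n) < eps * _norm(d_pred) * _norm(d_true):
        # Parallel lines: distance from one point to the other line.
        return _norm(_cross(diff, d_true)) / _norm(d_true)
    return abs(_dot(diff, n)) / _norm(n)

angle = axis_angle_error_deg((0.0, 0.0, 1.0), (0.0, 0.0, -1.0))
offset = axis_line_distance((0.0, 0.0, 1.0), (1.0, 0.0, 0.0),
                            (0.0, 0.0, 0.0), (0.0, 1.0, 0.0))
```

Treating a flipped direction as zero error reflects that a rotation or translation axis is a line, not an oriented vector; whether the paper's evaluation makes the same choice is not stated here.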
Significance
This research addresses a fundamental bottleneck in 3D understanding, namely the limited annotated data for articulated structures, by leveraging large-scale heterogeneous datasets and instruction-guided learning. The approach significantly enhances the generalization capacity of neural models, enabling accurate articulation prediction across diverse object categories, including unseen ones. Such advancements have profound implications for robotics, where understanding object kinematics is crucial for manipulation; for animation and gaming, where automatic asset creation accelerates workflows; and for AR/VR, where realistic virtual objects are essential. By reducing reliance on manual annotation and enabling zero-shot generalization, this work paves the way for scalable, intelligent 3D scene understanding and interaction.
Technical Contribution
The core technical innovation lies in integrating large-scale pseudo-labeled datasets with a transformer-based multi-modal architecture that encodes shape, part descriptions, and query points. The model employs a multi-task loss to jointly optimize part segmentation and joint parameter prediction, supported by a novel over-parameterized geometric fitting for joint axes. The data pipeline leverages vision-language models for automatic annotation, vastly expanding the training corpus. The architecture supports flexible conditioning via explicit kinematic instructions, enabling multi-category, multi-granularity predictions. This combination of data-driven pseudo-labeling, instruction-guided modeling, and geometric fitting constitutes a significant leap over prior methods limited by small datasets and rigid assumptions.
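The summary does not describe how the over-parameterized geometric fitting works. One plausible reading, sketched here purely as an assumption, is that many points each cast a redundant vote for the joint axis and the votes are aggregated robustly into a single line. The function name `aggregate_axis` and the vote format are hypothetical.

```python
import math
from statistics import median

def aggregate_axis(dir_votes, point_votes):
    """Combine redundant per-point votes into one axis (direction, point on axis).

    dir_votes:   list of 3D direction votes, possibly with flipped signs.
    point_votes: list of 3D votes for a point lying on the axis.
    """
    ref = dir_votes[0]
    acc = [0.0, 0.0, 0.0]
    for d in dir_votes:
        # Flip votes that point against the reference so they do not cancel out.
        sign = 1.0 if sum(a * b for a, b in zip(d, ref)) >= 0 else -1.0
        for i in range(3):
            acc[i] += sign * d[i]
    length = math.sqrt(sum(a * a for a in acc))
    direction = tuple(a / length for a in acc)
    # Component-wise median keeps a single outlier vote from shifting the axis.
    point = tuple(median(p[i] for p in point_votes) for i in range(3))
    return direction, point

direction, point = aggregate_axis(
    [(0.0, 0.0, 1.0), (0.0, 0.0, -1.0), (0.0, 0.0, 1.0)],
    [(1.0, 0.0, 0.0), (1.0, 0.0, 5.0), (100.0, 0.0, 2.0)],
)
```

In the toy call the third point vote is an outlier in x, and the median discards it. A component-wise median is not rotation-invariant, so a real system would likely use a more principled estimator.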
Novelty
This work is the first to systematically incorporate large-scale vision-language pseudo-labeling for 3D articulated structure prediction, enabling models to learn from heterogeneous, automatically annotated datasets. Unlike prior approaches that rely on manual labels or limited procedural generation, this method leverages off-the-shelf VLMs to generate diverse, category-agnostic annotations, significantly broadening the training scope. The explicit instruction mechanism allows the model to disambiguate multiple plausible structures, providing tailored predictions based on input prompts. The architecture's ability to handle multi-category, multi-granularity predictions in a single feed-forward pass marks a substantial advancement in 3D understanding, setting a new standard for scalable, generalizable articulated object reconstruction.
Limitations
- The model's performance degrades in scenarios with severe occlusion or highly deformable objects, due to limitations in pseudo-label accuracy and geometric fitting under complex conditions.
- Handling dynamic scenes or non-rigid objects remains challenging, requiring integration of temporal information and non-rigid modeling techniques in future work.
- The reliance on vision-language models for pseudo-labeling introduces biases and errors, especially in categories with limited training data or ambiguous visual features, which can affect the overall accuracy.
Future Work
Future research will focus on extending the model to handle non-rigid and deformable objects, incorporating temporal sequences for dynamic scene understanding. Enhancing the robustness of pseudo-labeling through self-supervised refinement and active learning is also a priority. Additionally, efforts will be made to develop real-time inference systems suitable for robotics and interactive applications, as well as exploring unsupervised or weakly supervised learning paradigms to further reduce dependency on annotated data.
AI Executive Summary
Understanding the articulated structure of 3D objects is a cornerstone challenge in computer vision, robotics, and digital content creation. Traditional approaches relied heavily on manual annotations or multi-view optimization, which are labor-intensive and limited in scalability. Recent advances in neural networks, especially in 3D point cloud and mesh understanding, have made strides, but the scarcity of annotated datasets for complex articulated structures remains a significant bottleneck.
In response, the authors introduce Instruct-Particulate, a novel framework that leverages large-scale heterogeneous datasets and instruction-guided neural modeling to predict 3D object articulation with unprecedented accuracy and generalization. The core idea is to use vision-language models (VLMs) to automatically generate pseudo-labels for a vast array of synthetic and real-world models, capturing diverse categories and granularities of articulation. These labels include part segmentations, connectivity, joint types, and optional point prompts, which serve as rich supervision signals for training a transformer-based encoder-decoder model.
The architecture of Instruct-Particulate is designed to incorporate multimodal inputs (shape point clouds, textual part descriptions, and point prompts), processed through a series of attention blocks that fuse geometric and semantic information. The model predicts per-point part labels and joint motion parameters, including axes and ranges, in a single feed-forward pass. During training, a multi-task loss ensures the model learns to produce coherent, accurate articulated structures. The training dataset, comprising over 150,000 models, is assembled by combining synthetic data, existing annotated datasets, and AI-generated models, with models that lack kinematic annotations labeled via the VLM pipeline.
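The fusion of geometric and semantic information through attention can be sketched with a single-head scaled dot-product cross-attention in which point features act as queries and part-description embeddings act as keys and values. This is a simplified illustration that assumes identity projections and one head; the actual block layout, dimensions, and learned weights are not given in this summary.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [v / total for v in exps]

def cross_attention(point_feats, part_feats):
    """Each point feature (query) attends over part embeddings (keys and values).

    Returns the fused point features (with a residual connection) and the
    attention weights, which can be read as a soft point-to-part assignment.
    """
    scale = 1.0 / math.sqrt(len(part_feats[0]))
    fused, weights = [], []
    for q in point_feats:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in part_feats]
        w = softmax(scores)
        context = [sum(w[j] * part_feats[j][i] for j in range(len(part_feats)))
                   for i in range(len(q))]
        fused.append([a + c for a, c in zip(q, context)])
        weights.append(w)
    return fused, weights

# Two toy point features and two toy part embeddings (both 2-dimensional).
fused, weights = cross_attention([[10.0, 0.0], [0.0, 10.0]],
                                 [[1.0, 0.0], [0.0, 1.0]])
labels = [max(range(len(w)), key=w.__getitem__) for w in weights]
```

Taking the argmax of the attention weights as a part label is a shortcut for this toy example; a trained model would normally use a dedicated prediction head.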
Experimental results demonstrate that Instruct-Particulate outperforms existing methods such as PartField and Particulate across multiple metrics, including part match accuracy (over 94%), geometric IoU (0.583), and joint axis error (below 10 degrees). The model exhibits strong cross-category generalization, successfully predicting articulation for unseen object types and AI-generated meshes. These capabilities enable practical applications like automatic asset generation from images, robotic manipulation of complex objects, and virtual environment creation, significantly reducing manual annotation efforts.
This work marks a substantial advance in 3D understanding, addressing the data scarcity challenge through innovative pseudo-labeling and instruction-guided modeling. It opens new avenues for scalable, automated 3D asset reconstruction, with broad implications for industry and academia. Future directions include extending the framework to non-rigid and deformable objects, integrating temporal information, and developing real-time inference systems, promising a future where intelligent agents can autonomously understand and manipulate complex articulated objects in diverse environments.
Deep Dive
Abstract
Reconstructing articulated 3D objects is important for animation, gaming, and robotic simulations. Recent neural networks can estimate the articulated structure of 3D objects, but their generalization remains limited by the scarcity of annotated data for this task. To address this gap, we introduce Instruct-Particulate, a model that takes a 3D mesh together with a target kinematic specification, including part descriptions, connectivity, joint types, and optional point prompts, and predicts the corresponding kinematic part segmentation and joint motion parameters. The kinematic specification disambiguates the task and allows the model to target annotations of different granularity, thereby making it possible to use more abundant heterogeneous training data. At test time, the kinematic specification can be obtained automatically from large-scale vision-language models, so the model can be applied to any input mesh. To train our model at scale, we construct a heterogeneous dataset of more than 150,000 articulated 3D objects, extending existing publicly available collections with data obtained by partially labelling other 3D models (monolithic or already decomposed into parts) with kinematic labels by means of vision-language models. Experiments show that our model generalizes better across categories and to AI-generated meshes, enabling articulated asset reconstruction from real-world images via image-to-3D models.
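The kinematic specification named in the abstract (part descriptions, connectivity, joint types, optional point prompts) can be pictured as a small tree-structured record. The sketch below is an illustrative data structure, not the authors' input format: the class names, the joint-type vocabulary, and the single-root tree check are all assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Assumed joint vocabulary; the abstract only says "joint types".
JOINT_TYPES = {"fixed", "revolute", "prismatic"}

@dataclass
class PartSpec:
    description: str                      # free-text part description
    parent: Optional[int]                 # index of the parent part, None for the root
    joint_type: str                       # joint connecting this part to its parent
    point_prompts: List[Tuple[float, float, float]] = field(default_factory=list)

@dataclass
class KinematicSpec:
    parts: List[PartSpec]

    def validate(self) -> None:
        """Check that connectivity forms a single-rooted tree with known joint types."""
        roots = [i for i, p in enumerate(self.parts) if p.parent is None]
        if len(roots) != 1:
            raise ValueError("expected exactly one root part")
        for i, part in enumerate(self.parts):
            if part.joint_type not in JOINT_TYPES:
                raise ValueError(f"unknown joint type: {part.joint_type}")
            seen, j = set(), i
            while j is not None:  # walk up to the root, rejecting cycles
                if j in seen or not 0 <= j < len(self.parts):
                    raise ValueError("connectivity is not a tree")
                seen.add(j)
                j = self.parts[j].parent

cabinet = KinematicSpec([
    PartSpec("cabinet body", None, "fixed"),
    PartSpec("left door", 0, "revolute", [(0.4, 0.1, 0.5)]),
    PartSpec("top drawer", 0, "prismatic"),
])
cabinet.validate()
```

Changing the list of parts, for example describing both doors as one part or as two, is how a specification of this kind would target annotations of different granularity for the same mesh.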
References (20)
GRUtopia: Dream General Robots in a City at Scale
Hanqing Wang, Jiahe Chen, Wensi Huang et al.
PartField: Learning 3D Feature Fields for Part Segmentation and Beyond
Minghua Liu, M. Uy, Donglai Xiang et al.
HY3D-Bench: Generation of 3D Assets
Bowen Zhang, Chunchao Guo, Dong Guo et al.
SAPIEN: A SimulAted Part-Based Interactive ENvironment
Fanbo Xiang, Yuzhe Qin, Kaichun Mo et al.
PhysX-3D: Physical-Grounded 3D Asset Generation
Ziang Cao, Zhaoxi Chen, Liang Pan et al.
Particulate: Feed-Forward 3D Object Articulation
Ruining Li, Yuxin Yao, Chuanxia Zheng et al.
P3-SAM: Native 3D Part Segmentation
Changfeng Ma, Yang Li, Xinhao Yan et al.
PAct: Part-Decomposed Single-View Articulated Object Generation
Qingming Liu, Xinyue Yao, Shuyuan Zhang et al.
Anymate: A Dataset and Baselines for Learning 3D Object Rigging
Yufan Deng, Yuhao Zhang, Chen Geng et al.
URDFormer: A Pipeline for Constructing Articulated Simulation Environments from Real-World Images
Z. Chen, Aaron Walsman, Marius Memmel et al.
REACTO: Reconstructing Articulated Objects from a Single Video
Chaoyue Song, Jiacheng Wei, Chuan-Sheng Foo et al.
ShapeNet: An Information-Rich 3D Model Repository
Angel X. Chang, T. Funkhouser, L. Guibas et al.
Language Models are Few-Shot Learners
Tom B. Brown, Benjamin Mann, Nick Ryder et al.
SAMPart3D: Segment Any Part in 3D Objects
Yu-nuo Yang, Yukun Huang, Yuan-Chen Guo et al.
FreeArt3D: Training-Free Articulated Object Generation using 3D Diffusion
Chuhao Chen, Isabella Liu, Xinyue Wei et al.
DreamArt: Generating Interactable Articulated Objects from a Single Image
Ruijie Lu, Yu Liu, Jiaxiang Tang et al.
URDF-Anything+: Autoregressive Articulated 3D Models Generation for Physical Simulation
Zhuangzhe Wu, Yuelin Xin, Chengkai Hou et al.
Infinigen-Sim: Procedural Generation of Articulated Simulation Assets
Abhishek Joshi, Beining Han, Jack Nugent et al.
RigAnything: Template-Free Autoregressive Rigging for Diverse 3D Assets
Isabella Liu, Zhan Xu, Wang Yifan et al.
WorldSimBench: Towards Video Generation Models as World Simulators
Yiran Qin, Zhelun Shi, Jiwen Yu et al.