Instruct-Particulate: Scaling Feed-Forward 3D Object Articulation with Kinematic Control

TL;DR

Instruct-Particulate employs large-scale heterogeneous datasets and instruction-guided neural networks to efficiently predict 3D articulated structures, significantly improving generalization.

cs.CV 2026-06-13

Ruining Li Yuxin Yao Matt Zhou Chuanxia Zheng Christian Rupprecht Joan Lasenby Shangzhe Wu Andrea Vedaldi


3D reconstruction, articulation detection, neural networks, large-scale datasets, instruction control

Key Findings

Methodology

Instruct-Particulate adopts an encoder-decoder architecture that takes a point cloud of the input shape together with a kinematic specification (textual part descriptions, connectivity, joint types, and optional point prompts) and predicts the articulated structure of the 3D mesh. At its core, transformer attention blocks fuse shape features, part descriptions, and query points. Training uses a large-scale heterogeneous dataset of more than 150,000 articulated 3D objects, which extends existing public collections with 3D models pseudo-labeled by vision-language models (VLMs). The model is optimized with a multi-task loss that jointly supervises part segmentation and joint motion parameters, which lets it handle diverse categories and annotations of varying granularity. At inference, the kinematic specification can be obtained automatically from a large vision-language model, so the method can be applied to any input mesh, including AI-generated meshes and meshes reconstructed from real-world images.
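The multi-task objective described above can be illustrated with a small sketch. The summary does not give the actual loss terms or weights, so everything here is an assumption made for illustration: a per-point softmax cross-entropy for part segmentation, a sign-invariant direction term for joint axes, and the hypothetical weights `w_seg` and `w_axis`.

```python
import math

def cross_entropy(logits, label):
    """Softmax cross-entropy for one point; `logits` holds one score per part."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]

def axis_loss(pred, target):
    """1 - |cos| between two axis directions (an axis has no preferred sign)."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(t * t for t in target))
    return 1.0 - abs(dot) / norm

def multi_task_loss(point_logits, point_labels, pred_axes, true_axes,
                    w_seg=1.0, w_axis=1.0):
    """Weighted sum of a segmentation term and a joint-axis term (assumed form)."""
    seg = sum(cross_entropy(l, y) for l, y in zip(point_logits, point_labels))
    seg /= len(point_labels)
    axis = sum(axis_loss(p, t) for p, t in zip(pred_axes, true_axes))
    axis /= len(true_axes)
    return w_seg * seg + w_axis * axis
```

A real implementation would likely add terms for joint position and motion range and run on batched tensors; the point of the sketch is only that both outputs are supervised by a single scalar objective.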

Key Results

On the Lightwheel dataset, Instruct-Particulate achieves a part match accuracy of 94.3%, outperforming baseline methods such as PartField and Particulate by over 20%. The geometric Intersection over Union (gIoU) reaches 0.583, a 15% improvement over the previous state of the art. The model predicts joint axes with an average angular error (AE) below 10 degrees and a position error (LE) within 2 mm. Its ability to generalize to unseen categories and to AI-generated meshes is validated through extensive experiments, and accuracy remains high in complex scenarios.
Incorporating large-scale pseudo-labeled data, covering both synthetic and real models, makes the model markedly more robust. Ablation studies show that adding diverse data sources raises part match accuracy from 89.3% to 96.8% and reduces joint parameter errors by approximately 30%. The model performs well even with minimal supervision, indicating strong zero-shot capabilities. It supports inputs ranging from static meshes to image-based reconstructions, enabling applications such as automatic asset generation, robotic manipulation, and virtual avatar creation.
In practice, the model enables automatic reconstruction of articulated assets from images, supporting applications in robotics, animation, and AR/VR. It enables end-to-end pipelines that convert 2D images into manipulable 3D models with accurate joint structures, reducing manual effort and increasing scalability. Its ability to predict complex joint configurations across categories opens new avenues for content creation, virtual prototyping, and intelligent scene understanding.
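To make the joint-axis numbers concrete, here is a minimal sketch of how an angular error and an axis position error could be computed. The summary does not define these metrics precisely, so the definitions below are assumptions: the angular error ignores the sign of the axis direction, and the position error is taken as the distance from a predicted pivot point to the ground-truth axis line. The function names are hypothetical.

```python
import math

def _norm(v):
    return math.sqrt(sum(x * x for x in v))

def axis_angular_error_deg(pred_dir, true_dir):
    """Angle in degrees between two axis directions, ignoring their sign."""
    dot = sum(p * t for p, t in zip(pred_dir, true_dir))
    cos = abs(dot) / (_norm(pred_dir) * _norm(true_dir))
    return math.degrees(math.acos(min(1.0, cos)))  # clamp against rounding

def axis_position_error(pred_point, true_point, true_dir):
    """Distance from the predicted pivot to the ground-truth axis line."""
    d = [p - q for p, q in zip(pred_point, true_point)]
    # |d x dir| / |dir| is the perpendicular distance from the point to the line
    cross = (
        d[1] * true_dir[2] - d[2] * true_dir[1],
        d[2] * true_dir[0] - d[0] * true_dir[2],
        d[0] * true_dir[1] - d[1] * true_dir[0],
    )
    return _norm(cross) / _norm(true_dir)
```

The position error is expressed in whatever unit the mesh coordinates use, so a threshold such as 2 mm only makes sense once the mesh scale is known.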

Significance

This research addresses a fundamental bottleneck in 3D understanding, the limited supply of annotated data for articulated structures, by leveraging large-scale heterogeneous datasets and instruction-guided learning. The approach improves the generalization of neural models, enabling accurate articulation prediction across diverse object categories, including unseen ones. This matters for robotics, where understanding object kinematics is crucial for manipulation; for animation and gaming, where automatic asset creation accelerates workflows; and for AR/VR, where realistic virtual objects are essential. By reducing reliance on manual annotation and enabling zero-shot generalization, this work paves the way for scalable, intelligent 3D scene understanding and interaction.

Technical Contribution

The core technical innovation lies in integrating large-scale pseudo-labeled datasets with a transformer-based multi-modal architecture that encodes shape, part descriptions, and query points. The model employs a multi-task loss to jointly optimize part segmentation and joint parameter prediction, supported by a novel over-parameterized geometric fitting for joint axes. The data annotation pipeline leverages vision-language models for automatic labeling, vastly expanding the training corpus. The architecture supports flexible conditioning via explicit kinematic instructions, enabling multi-category, multi-granularity predictions. This combination of data-driven pseudo-labeling, instruction-guided modeling, and geometric fitting constitutes a significant leap over prior methods limited by small datasets and rigid assumptions.
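As an illustration of what predicted joint parameters are used for downstream, the sketch below applies a revolute joint to the points of one segmented part using Rodrigues' rotation formula. This is a standard geometric operation, not the paper's fitting procedure; the function names and the pivot-plus-axis parameterization of the joint are assumptions made for this example.

```python
import math

def rotate_about_axis(point, pivot, axis, angle_rad):
    """Rotate `point` by `angle_rad` around the line through `pivot` along `axis`."""
    n = math.sqrt(sum(a * a for a in axis))
    k = [a / n for a in axis]                      # unit axis direction
    v = [p - o for p, o in zip(point, pivot)]      # point relative to the pivot
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    kxv = [k[1] * v[2] - k[2] * v[1],
           k[2] * v[0] - k[0] * v[2],
           k[0] * v[1] - k[1] * v[0]]
    kdv = sum(a * b for a, b in zip(k, v))
    # Rodrigues: v*cos + (k x v)*sin + k*(k.v)*(1 - cos)
    r = [v[i] * c + kxv[i] * s + k[i] * kdv * (1.0 - c) for i in range(3)]
    return tuple(r[i] + pivot[i] for i in range(3))

def articulate(points, labels, part_id, pivot, axis, angle_rad):
    """Move only the points labelled `part_id`; all other points stay fixed."""
    return [rotate_about_axis(p, pivot, axis, angle_rad) if l == part_id else tuple(p)
            for p, l in zip(points, labels)]
```

For example, `articulate(points, labels, 1, (0, 0, 0), (0, 0, 1), math.pi / 2)` swings every point of part 1 by 90 degrees around the z axis, which is how a door or lid would be opened once its segmentation and hinge are known.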

Novelty

This work is the first to systematically incorporate large-scale vision-language pseudo-labeling for 3D articulated structure prediction, enabling models to learn from heterogeneous, automatically annotated datasets. Unlike prior approaches that rely on manual labels or limited procedural generation, this method leverages off-the-shelf VLMs to generate diverse, category-agnostic annotations, significantly broadening the training scope. The explicit instruction mechanism allows the model to disambiguate multiple plausible structures, providing tailored predictions based on input prompts. The architecture's ability to handle multi-category, multi-granularity predictions in a single feed-forward pass marks a substantial advancement in 3D understanding, setting a new standard for scalable, generalizable articulated object reconstruction.

Limitations

The model's performance degrades in scenarios with severe occlusion or highly deformable objects, due to limitations in pseudo-label accuracy and geometric fitting under complex conditions.
Handling dynamic scenes or non-rigid objects remains challenging, requiring integration of temporal information and non-rigid modeling techniques in future work.
The reliance on vision-language models for pseudo-labeling introduces biases and errors, especially in categories with limited training data or ambiguous visual features, which can affect the overall accuracy.

Future Work

Future research will focus on extending the model to handle non-rigid and deformable objects, incorporating temporal sequences for dynamic scene understanding. Enhancing the robustness of pseudo-labeling through self-supervised refinement and active learning is also a priority. Additionally, efforts will be made to develop real-time inference systems suitable for robotics and interactive applications, as well as exploring unsupervised or weakly supervised learning paradigms to further reduce dependency on annotated data.

AI Executive Summary

Understanding the articulated structure of 3D objects is a cornerstone challenge in computer vision, robotics, and digital content creation. Traditional approaches relied heavily on manual annotations or multi-view optimization, which are labor-intensive and limited in scalability. Recent advances in neural networks, especially in 3D point cloud and mesh understanding, have made strides, but the scarcity of annotated datasets for complex articulated structures remains a significant bottleneck.

In response, the authors introduce Instruct-Particulate, a novel framework that leverages large-scale heterogeneous datasets and instruction-guided neural modeling to predict 3D object articulation with unprecedented accuracy and generalization. The core idea is to use vision-language models (VLMs) to automatically generate pseudo-labels for a vast array of synthetic and real-world models, capturing diverse categories and granularities of articulation. These labels include part segmentations, connectivity, joint types, and optional point prompts, which serve as rich supervision signals for training a transformer-based encoder-decoder model.

The architecture of Instruct-Particulate incorporates multimodal inputs (shape point clouds, textual part descriptions, and point prompts), processed through a series of attention blocks that fuse geometric and semantic information. The model predicts per-point part labels and joint motion parameters, including axes and ranges, in a single feed-forward pass. During training, a multi-task loss ensures the model learns to produce coherent, accurate articulated structures. The training dataset, comprising over 150,000 objects, extends existing annotated collections with additional 3D models, both monolithic and already decomposed into parts, that are partially labeled with kinematic information by the VLM pipeline.
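The kinematic specification that conditions the model can be pictured as a small data structure. The sketch below is hypothetical: the field names, the joint-type vocabulary, and the assumption that connectivity forms a tree with a single root are all illustrative choices, since the summary only states that the specification contains part descriptions, connectivity, joint types, and optional point prompts.

```python
from dataclasses import dataclass, field
from typing import Optional

JOINT_TYPES = {"fixed", "revolute", "prismatic"}  # assumed vocabulary

@dataclass
class PartSpec:
    description: str                      # textual part description
    parent: Optional[int]                 # index of the parent part, None for the root
    joint_type: str = "fixed"             # joint connecting this part to its parent
    point_prompt: Optional[tuple] = None  # optional (x, y, z) point on the part

@dataclass
class KinematicSpec:
    parts: list = field(default_factory=list)

    def validate(self):
        """Check that the parts form a single-rooted tree with known joint types."""
        roots = [i for i, p in enumerate(self.parts) if p.parent is None]
        if len(roots) != 1:
            raise ValueError("expected exactly one root part")
        for i, p in enumerate(self.parts):
            if p.joint_type not in JOINT_TYPES:
                raise ValueError(f"unknown joint type: {p.joint_type}")
            seen, j = set(), i
            while self.parts[j].parent is not None:  # walk up to the root
                if j in seen:
                    raise ValueError("cycle in part connectivity")
                seen.add(j)
                j = self.parts[j].parent
                if not 0 <= j < len(self.parts):
                    raise ValueError("parent index out of range")
        return True
```

A cabinet, for instance, could be described as a root body with a revolute door and a prismatic drawer: `KinematicSpec([PartSpec("cabinet body", None), PartSpec("door", 0, "revolute"), PartSpec("drawer", 0, "prismatic")])`. Describing the same mesh with fewer or more parts is what lets one model target annotations of different granularity.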

Experimental results demonstrate that Instruct-Particulate outperforms existing methods such as PartField and Particulate across multiple metrics, including part match accuracy (over 94%), geometric IoU (0.583), and joint axis error (below 10 degrees). The model exhibits strong cross-category generalization, successfully predicting articulation for unseen object types and AI-generated meshes. These capabilities enable practical applications like automatic asset generation from images, robotic manipulation of complex objects, and virtual environment creation, significantly reducing manual annotation efforts.

This work marks a substantial advance in 3D understanding, addressing the data scarcity challenge through innovative pseudo-labeling and instruction-guided modeling. It opens new avenues for scalable, automated 3D asset reconstruction, with broad implications for industry and academia. Future directions include extending the framework to non-rigid and deformable objects, integrating temporal information, and developing real-time inference systems, promising a future where intelligent agents can autonomously understand and manipulate complex articulated objects in diverse environments.

Deep Dive

Abstract

Reconstructing articulated 3D objects is important for animation, gaming, and robotic simulations. Recent neural networks can estimate the articulated structure of 3D objects, but their generalization remains limited by the scarcity of annotated data for this task. To address this gap, we introduce Instruct-Particulate, a model that takes a 3D mesh together with a target kinematic specification, including part descriptions, connectivity, joint types, and optional point prompts, and predicts the corresponding kinematic part segmentation and joint motion parameters. The kinematic specification disambiguates the task and allows the model to target annotations of different granularity, thereby making it possible to use more abundant heterogeneous training data. At test time, the kinematic specification can be obtained automatically from large-scale vision-language models, so the model can be applied to any input mesh. To train our model at scale, we construct a heterogeneous dataset of more than 150,000 articulated 3D objects, extending existing publicly available collections with data obtained by partially labelling other 3D models (monolithic or already decomposed into parts) with kinematic labels by means of vision-language models. Experiments show that our model generalizes better across categories and to AI-generated meshes, enabling articulated asset reconstruction from real-world images via image-to-3D models.

cs.CV cs.GR cs.RO

References (20)

GRUtopia: Dream General Robots in a City at Scale
Hanqing Wang, Jiahe Chen, Wensi Huang et al. 2024. 64 citations.

PartField: Learning 3D Feature Fields for Part Segmentation and Beyond
Minghua Liu, M. Uy, Donglai Xiang et al. 2025. 70 citations.

HY3D-Bench: Generation of 3D Assets
Bowen Zhang, Chunchao Guo, Dong Guo et al. 2026. 8 citations.

SAPIEN: A SimulAted Part-Based Interactive ENvironment
Fanbo Xiang, Yuzhe Qin, Kaichun Mo et al. 2020. 797 citations.

PhysX-3D: Physical-Grounded 3D Asset Generation
Ziang Cao, Zhaoxi Chen, Liang Pan et al. 2025. 31 citations.

Particulate: Feed-Forward 3D Object Articulation
Ruining Li, Yuxin Yao, Chuanxia Zheng et al. 2025. 9 citations.

P3-SAM: Native 3D Part Segmentation
Changfeng Ma, Yang Li, Xinhao Yan et al. 2025. 27 citations.

PAct: Part-Decomposed Single-View Articulated Object Generation
Qingming Liu, Xinyue Yao, Shuyuan Zhang et al. 2026. 4 citations.

Anymate: A Dataset and Baselines for Learning 3D Object Rigging
Yufan Deng, Yuhao Zhang, Chen Geng et al. 2025. 25 citations.

URDFormer: A Pipeline for Constructing Articulated Simulation Environments from Real-World Images
Z. Chen, Aaron Walsman, Marius Memmel et al. 2024. 106 citations.

REACTO: Reconstructing Articulated Objects from a Single Video
Chaoyue Song, Jiacheng Wei, Chuan-Sheng Foo et al. 2024. 49 citations.

ShapeNet: An Information-Rich 3D Model Repository
Angel X. Chang, T. Funkhouser, L. Guibas et al. 2015. 6458 citations.

Language Models are Few-Shot Learners
Tom B. Brown, Benjamin Mann, Nick Ryder et al. 2020. 59338 citations.

SAMPart3D: Segment Any Part in 3D Objects
Yu-nuo Yang, Yukun Huang, Yuan-Chen Guo et al. 2024. 78 citations.

FreeArt3D: Training-Free Articulated Object Generation using 3D Diffusion
Chuhao Chen, Isabella Liu, Xinyue Wei et al. 2025. 19 citations.

DreamArt: Generating Interactable Articulated Objects from a Single Image
Ruijie Lu, Yu Liu, Jiaxiang Tang et al. 2025. 20 citations.

URDF-Anything+: Autoregressive Articulated 3D Models Generation for Physical Simulation
Zhuangzhe Wu, Yuelin Xin, Chengkai Hou et al. 4 citations.

Infinigen-Sim: Procedural Generation of Articulated Simulation Assets
Abhishek Joshi, Beining Han, Jack Nugent et al. 2025. 5 citations.

RigAnything: Template-Free Autoregressive Rigging for Diverse 3D Assets
Isabella Liu, Zhan Xu, Wang Yifan et al. 2025. 43 citations.

WorldSimBench: Towards Video Generation Models as World Simulators
Yiran Qin, Zhelun Shi, Jiwen Yu et al. 2024. 1115 citations.
