SldprtNet: A Large-Scale Multimodal Dataset for CAD Generation in Language-Driven 3D Design

TL;DR

SldprtNet is a large-scale multimodal dataset with 242,000 industrial parts for semantic-driven CAD modeling.

cs.RO · Advanced · 2026-03-13
Ruogu Li, Sikai Li, Yao Mu, Mingyu Ding
CAD · multimodal dataset · 3D design · deep learning

Key Findings

Methodology

SldprtNet provides 3D models in .sldprt and .step formats to support diverse training and testing needs. The study developed encoder and decoder tools supporting 13 CAD commands, enabling lossless transformation between 3D models and a structured text representation. Each sample is paired with a composite image merged from seven rendered viewpoints; this image, combined with the encoder's parameterized text output, is fed to the lightweight multimodal language model Qwen2.5-VL-7B to generate natural language descriptions.
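The lossless transformation between a model's command sequence and structured text can be sketched as a reversible serialization. The `Command` class, command names, and text grammar below are illustrative assumptions — the paper's actual 13-command encoder/decoder is not reproduced here:

```python
from dataclasses import dataclass

@dataclass
class Command:
    name: str     # hypothetical command label, e.g. "Sketch", "Extrude"
    params: dict  # parameter name -> numeric value

def encode(commands):
    """Serialize a command sequence into a line-based structured text form."""
    lines = []
    for c in commands:
        args = ",".join(f"{k}={v}" for k, v in sorted(c.params.items()))
        lines.append(f"{c.name}({args})")
    return "\n".join(lines)

def decode(text):
    """Invert encode() exactly, so the round trip is lossless."""
    commands = []
    for line in text.splitlines():
        name, rest = line.split("(", 1)
        params = {}
        body = rest.rstrip(")")
        if body:
            for pair in body.split(","):
                k, v = pair.split("=")
                params[k] = float(v)
        commands.append(Command(name, params))
    return commands

seq = [Command("Sketch", {"plane": 0.0}), Command("Extrude", {"depth": 12.5})]
text = encode(seq)
assert decode(text) == seq  # round trip recovers the sequence exactly
```

The key property — that encoding and decoding compose to the identity — is what makes the text representation usable as a training target without information loss.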

Key Results

  • Result 1: In the fine-tuning of baseline models comparing image-plus-text inputs with text-only inputs, the model with image-plus-text inputs showed significant advantages in exact match score, achieving 0.0099 compared to 0.0058 for the text-only model.
  • Result 2: In command-level F1 score, the multimodal input model achieved 0.3670, compared to 0.3247 for the text-only model, indicating improved geometric semantic understanding.
  • Result 3: In partial match rate, the multimodal model scored 0.6162, surpassing the text-only model's 0.5554, further validating the effectiveness of multimodal supervision.
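The three headline metrics can be sketched as follows. These are plausible reconstructions, not the paper's exact formulas — in particular, the order-free multiset F1 and the position-wise partial match are assumptions:

```python
from collections import Counter

def exact_match(pred, gold):
    """1.0 only if the predicted command sequence matches the reference exactly."""
    return float(pred == gold)

def command_f1(pred, gold):
    """F1 over command names, treating each sequence as a multiset (order-free)."""
    p, g = Counter(c for c, _ in pred), Counter(c for c, _ in gold)
    tp = sum((p & g).values())  # overlapping command-name counts
    if tp == 0:
        return 0.0
    precision, recall = tp / sum(p.values()), tp / sum(g.values())
    return 2 * precision * recall / (precision + recall)

def partial_match(pred, gold):
    """Fraction of positions where command name and parameters both match."""
    hits = sum(a == b for a, b in zip(pred, gold))
    return hits / max(len(gold), 1)

# Toy sequences: one numeric parameter differs, so exact match fails but
# the command structure is fully recovered.
gold = [("Sketch", (0.0,)), ("Extrude", (12.5,)), ("Fillet", (1.0,))]
pred = [("Sketch", (0.0,)), ("Extrude", (10.0,)), ("Fillet", (1.0,))]
print(exact_match(pred, gold))    # 0.0
print(command_f1(pred, gold))     # 1.0 — all command names recovered
```

This illustrates why exact match scores are so low for both models (a single wrong parameter zeroes the sample) while command-level F1 and partial match are far more forgiving.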

Significance

The SldprtNet dataset holds significant importance for both academia and industry. It addresses long-standing issues of data scarcity and multimodal alignment in CAD modeling tasks, providing a solid foundation for semantic-driven CAD modeling and cross-modal learning. By offering rich supervision signals and diverse model complexities, SldprtNet not only supports geometric deep learning applications but also bridges the gap between natural language and CAD modeling, advancing research in related fields.

Technical Contribution

The technical contributions of SldprtNet lie in its construction of a multimodal dataset that combines precise geometry (3D CAD models), rendered images (multi-view projections), structured modeling sequences (parametric CAD commands), and natural language descriptions. This dataset provides rich supervision signals for understanding and generating models, significantly enhancing the model's capabilities in semantic-driven CAD modeling tasks. Additionally, the developed encoder and decoder tools enable closed-loop transformation between models and instructions, supporting data augmentation and future application expansion.

Novelty

SldprtNet is presented as the first multimodal dataset applied to semantic-driven CAD modeling. Unlike existing 3D model datasets, SldprtNet provides not only geometric information but also multi-view images and natural language descriptions, filling a gap in CAD automation and parametric modeling. Compared to other datasets, SldprtNet offers significant advantages in data scale and diversity.

Limitations

  • Limitation 1: Although SldprtNet has advantages in multimodal alignment and data scale, its generated natural language descriptions still require manual verification to ensure accuracy, which may impact large-scale automation applications.
  • Limitation 2: The dataset is primarily focused on industrial parts, which may limit its generalizability to other fields.
  • Limitation 3: Due to the complexity of the dataset, the training and inference process may require high computational resources.

Future Work

Future research directions include further optimizing the automatic generation of natural language descriptions to reduce the need for manual verification, exploring the application of SldprtNet to CAD modeling tasks in other fields to expand its generalizability, and reducing computational resource requirements to enable the dataset's use in a wider range of scenarios.

AI Executive Summary

Computer-Aided Design (CAD) plays a critical role in mechanical design and manufacturing, offering significant advantages. However, existing CAD datasets are small in scale and cannot meet the needs of semantic-driven CAD modeling tasks. To address this issue, researchers have introduced SldprtNet, a large-scale multimodal dataset containing over 242,000 industrial parts, supporting semantic-driven CAD modeling and geometric deep learning applications.

The SldprtNet dataset provides 3D models in .sldprt and .step formats to support diverse training and testing needs. Researchers developed encoder and decoder tools supporting 13 CAD commands, enabling lossless transformation between 3D models and a structured text representation. Each sample is paired with a composite image merged from seven rendered viewpoints; this image, combined with the encoder's parameterized text output, is fed to the lightweight multimodal language model Qwen2.5-VL-7B to generate natural language descriptions.

In experiments, researchers compared the fine-tuning of baseline models with image-plus-text inputs against text-only inputs. Results showed that models with image-plus-text inputs exhibited significant advantages in exact match score, command-level F1 score, and partial match rate, indicating the importance of multimodal supervision in enhancing model performance.

The SldprtNet dataset holds significant importance for both academia and industry. It addresses long-standing issues of data scarcity and multimodal alignment in CAD modeling tasks, providing a solid foundation for semantic-driven CAD modeling and cross-modal learning. By offering rich supervision signals and diverse model complexities, SldprtNet not only supports geometric deep learning applications but also bridges the gap between natural language and CAD modeling, advancing research in related fields.

However, SldprtNet also has some limitations. Although it has advantages in multimodal alignment and data scale, its generated natural language descriptions still require manual verification to ensure accuracy, which may impact large-scale automation applications. Additionally, the dataset is primarily focused on industrial parts, which may limit its generalizability to other fields. Future research directions include further optimizing the automatic generation of natural language descriptions to reduce the need for manual verification and exploring the application of SldprtNet to CAD modeling tasks in other fields.

Deep Analysis

Background

Computer-Aided Design (CAD) plays a critical role in mechanical design and manufacturing, offering significant advantages over traditional paper-based drafting. Unlike manual drawings, CAD allows intuitive visualization of a part’s shape and dimensions and simplifies modifications. SolidWorks, a powerful CAD platform, has become a default choice for many mechanical designers. Its native .sldprt format records the feature operations and parameters used in model creation, enabling quick iteration and flexible editing of designs. This parametric, feature-based representation ensures higher precision and editability than discrete 3D formats like point clouds or meshes. However, compared to other categories of 3D model datasets, CAD datasets are unique in that each sample must be manually created using specialized software. The high skill and time required for quality CAD modeling result in datasets that are much smaller in scale than image or text datasets. Furthermore, limited data quantity and quality, annotation difficulties, and the lack of a standardized parametric representation format for 3D models have constrained progress in this area. As a result, despite the surge of interest in LLMs, research on semantic-driven CAD modeling tasks remains in its early stages.

Core Problem

Existing CAD datasets are small in scale and cannot meet the needs of semantic-driven CAD modeling tasks. CAD datasets are unique in that each sample must be manually created using specialized software, which requires high skill and time, resulting in datasets that are much smaller in scale than image or text datasets. Furthermore, limited data quantity and quality, annotation difficulties, and the lack of a standardized parametric representation format for 3D models have constrained progress in this area. These issues have made it difficult for research on semantic-driven CAD modeling tasks to progress, despite the surge of interest in large language models (LLMs).

Innovation

The core innovations of SldprtNet lie in its construction of a multimodal dataset that combines precise geometry (3D CAD models), rendered images (multi-view projections), structured modeling sequences (parametric CAD commands), and natural language descriptions. This dataset provides rich supervision signals for understanding and generating models, significantly enhancing a model's capabilities in semantic-driven CAD modeling tasks. Additionally, the developed encoder and decoder tools enable closed-loop transformation between models and instructions, supporting data augmentation and future application expansion. SldprtNet is presented as the first multimodal dataset applied to semantic-driven CAD modeling. Unlike existing 3D model datasets, SldprtNet provides not only geometric information but also multi-view images and natural language descriptions, filling a gap in CAD automation and parametric modeling.

Methodology

The construction and application methodology of the SldprtNet dataset includes the following key steps:


  • Dataset Construction: Collect over 242,000 industrial parts in .sldprt and .step formats to support diverse training and testing needs.
  • Encoder and Decoder Tools: Develop encoder and decoder tools supporting 13 CAD commands, enabling lossless transformation between 3D models and a structured text representation.
  • Multimodal Input: Pair each sample with a composite image merged from seven viewpoints and the encoder's parameterized text output, then use the lightweight multimodal language model Qwen2.5-VL-7B to generate natural language descriptions.
  • Experimental Design: Compare the fine-tuning of baseline models on image-plus-text inputs against text-only inputs to evaluate the effectiveness of multimodal supervision.
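The multi-view composite step above can be sketched with a simple tiling function. The 4×2 grid layout, tile size, and white padding below are assumptions for illustration, not the paper's rendering specification:

```python
import numpy as np

def make_composite(views, cols=4):
    """Tile same-sized view arrays (H, W, 3) into a cols-wide grid,
    filling any unused grid cells with white."""
    h, w, _ = views[0].shape
    rows = -(-len(views) // cols)  # ceiling division
    canvas = np.full((rows * h, cols * w, 3), 255, dtype=np.uint8)
    for i, view in enumerate(views):
        r, c = divmod(i, cols)
        canvas[r*h:(r+1)*h, c*w:(c+1)*w] = view
    return canvas

# Stand-in renders: seven flat-color tiles in place of real viewpoint renders.
views = [np.full((64, 64, 3), 30 * i, dtype=np.uint8) for i in range(7)]
composite = make_composite(views)
print(composite.shape)  # (128, 256, 3)
```

Merging seven views into one image means the vision encoder processes a single input rather than seven, which is the mechanism behind the token-length reduction the authors describe.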

Experiments

To evaluate the effectiveness of the SldprtNet dataset in CAD generation tasks, researchers fine-tuned two baseline models on a 50K-sample subset of SldprtNet. The experimental design includes:


  • Dataset: Use a 50,000-sample subset of the SldprtNet dataset for experiments.
  • Baseline Models: Compare Qwen2.5-7B (trained with encoder text only) and Qwen2.5-VL-7B (trained with both images and encoder text).
  • Evaluation Metrics: Use exact match score, command-level F1 score, parameter tolerance accuracy, and partial match rate to evaluate model performance.
  • Ablation Studies: Analyze the impact of multimodal input on model performance, validating the importance of multimodal supervision.
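Parameter tolerance accuracy, listed among the evaluation metrics above, can be sketched as the fraction of predicted numeric parameters falling within a relative tolerance of the reference value. The 5% tolerance and the position-wise pairing are assumptions, not the paper's definition:

```python
def param_tolerance_accuracy(pred, gold, rel_tol=0.05):
    """Fraction of predicted parameters within rel_tol of the reference.

    pred, gold: equal-length lists of numeric parameter values.
    """
    pairs = list(zip(pred, gold))
    if not pairs:
        return 0.0
    hits = sum(abs(p - g) <= rel_tol * abs(g) for p, g in pairs)
    return hits / len(pairs)

# 12.4 vs 12.5 and 100.0 vs 99.0 fall inside 5%; 5.0 vs 4.0 does not.
print(param_tolerance_accuracy([12.4, 5.0, 100.0], [12.5, 4.0, 99.0]))
```

A tolerance-based metric like this rewards numerically close predictions, which helps explain how a text-only model could score well here while losing on structure-level metrics.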

Results

The experimental results show that the Qwen2.5-VL-7B model with multimodal input outperforms the text-only Qwen2.5-7B model across several key metrics. Specifically:


  • Exact Match Score: Qwen2.5-VL-7B achieved 0.0099, compared to 0.0058 for Qwen2.5-7B.
  • Command-Level F1 Score: Qwen2.5-VL-7B achieved 0.3670, compared to 0.3247 for Qwen2.5-7B, indicating improved geometric semantic understanding.
  • Partial Match Rate: The multimodal model scored 0.6162, surpassing the text-only model's 0.5554, further validating the effectiveness of multimodal supervision.
  • Parameter Tolerance Accuracy: The text-only model slightly outperformed here (0.5016 vs. 0.4630), which may suggest a tendency to overfit on numeric values.

Applications

The SldprtNet dataset has potential value in multiple application scenarios:


  • Semantic-Driven CAD Modeling: Generate CAD models from natural language descriptions, supporting automated design and rapid prototyping.
  • Cross-Modal Learning: Combine geometric information, images, and natural language descriptions to enhance cross-modal learning and reasoning capabilities.
  • Industrial Design Optimization: Support the design and optimization of complex industrial parts, improving design efficiency and accuracy.
  • Education and Training: Serve as a teaching tool to help students and engineers learn and master CAD modeling techniques.

Limitations & Outlook

Despite its advantages in multimodal alignment and data scale, SldprtNet also has some limitations:


  • The generated natural language descriptions still require manual verification to ensure accuracy, which may impact large-scale automation applications.
  • The dataset is primarily focused on industrial parts, which may limit its generalizability to other fields.
  • Due to the complexity of the dataset, training and inference may require substantial computational resources.

Future research directions include further optimizing the automatic generation of natural language descriptions to reduce the need for manual verification and exploring the application of SldprtNet to CAD modeling tasks in other fields.

Plain Language (Accessible to non-experts)

Imagine you're in a kitchen cooking a meal. You have a recipe that details each step and the ingredients needed. This is like the parametric commands in a CAD model, where each command is a step in creating the dish. SldprtNet is like a large library of recipes, containing detailed instructions and images for various dishes. With these recipes, you can learn how to make different dishes and adjust them to your taste.

In this process, the encoder acts like a translator, converting complex recipe steps into simple, understandable text instructions. The decoder is like a chef, following these instructions step-by-step to create a delicious dish. This way, you can not only learn how to make dishes but also innovate and improve based on your needs.

The unique aspect of SldprtNet is that it not only provides detailed recipe steps but also includes images and descriptions of each dish, serving as a visual guide you can refer to while cooking. By combining this information, you can better understand and master the cooking process, adjusting and optimizing based on your needs.

In summary, SldprtNet is like a large kitchen recipe library, providing detailed steps, images, and descriptions to help you better understand and master the CAD modeling process, allowing for innovation and improvement.

ELI14 (Explained like you're 14)

Hey there! Let's talk about something cool called SldprtNet. Imagine you're playing a huge LEGO game, where you have all sorts of LEGO models, each with detailed building instructions and pictures. These instructions are like the steps you use to build LEGO in the game, and the pictures are what you can refer to once you're done building.

SldprtNet is just like this massive LEGO library, with over 242,000 models of industrial parts, each with detailed building steps and pictures. With this information, you can learn how to build different models and innovate and improve based on your ideas.

In this process, there are two important tools: the encoder and the decoder. The encoder is like a translator, turning complex building steps into simple, understandable text instructions. The decoder is like a master builder, following these instructions step-by-step to build a complete model.

In summary, SldprtNet is like a giant LEGO library, providing detailed steps, pictures, and descriptions to help you better understand and master the building process, allowing for innovation and improvement. Isn't that cool?

Glossary

CAD (Computer-Aided Design)

Computer-Aided Design is a technology used for design and documentation with computer software, widely used in engineering, architecture, and manufacturing.

In this paper, CAD is used to create and edit 3D models of industrial parts.

SldprtNet

SldprtNet is a large-scale multimodal dataset containing over 242,000 industrial parts' 3D models for semantic-driven CAD modeling.

SldprtNet is the core dataset of this study, supporting multimodal learning and reasoning.

Multimodal

Multimodal refers to the ability to analyze and process multiple different types of data (such as text, images, and audio).

In this paper, multimodal refers to combining 3D models, images, and natural language descriptions for learning and reasoning.

Encoder

An encoder is a tool that converts input data into another format, used in this paper to convert CAD models into structured text representation.

The encoder is used to convert .sldprt files into parameterized text representation.

Decoder

A decoder is a tool that converts encoded data back into its original format, used in this paper to reconstruct CAD models from structured text.

The decoder is used to reconstruct 3D models from parameterized text.

Qwen2.5-VL-7B

Qwen2.5-VL-7B is a lightweight multimodal language model used to generate natural language descriptions.

Qwen2.5-VL-7B is used to generate descriptions of parts' appearance and functionality by combining images and text.

Parametric

Parametric refers to the use of parameters and variables to define and control a model's geometric shapes and features.

In this paper, parametric is used to describe the features and operations of CAD models.

Natural Language Description

Natural language description is a description of an object or process using human language, making it easy to understand and communicate.

In this paper, natural language description is used to describe the appearance and functionality of 3D models.

Feature Tree

A feature tree is a hierarchical structure used in CAD models to organize and manage features and operations.

The feature tree is used to record the features and parameters used in model creation.

Geometric Deep Learning

Geometric deep learning is a method that combines geometric information and deep learning techniques for analysis and processing.

In this paper, geometric deep learning is used to analyze and process 3D model data.

Open Questions (Unanswered questions from this research)

  • Open Question 1: How can the automatic generation of natural language descriptions be further optimized to reduce the need for manual verification? Current methods still face accuracy and consistency challenges, requiring more advanced generation models and algorithms.
  • Open Question 2: How can SldprtNet be applied to CAD modeling tasks in other fields to expand its generalizability? The existing dataset focuses primarily on industrial parts, which may limit its application elsewhere.
  • Open Question 3: How can the computational resource requirements of training and inference be reduced so that SldprtNet can be used in a wider range of scenarios? Current methods may demand substantial compute, limiting use in resource-constrained environments.
  • Open Question 4: How can geometric information, images, and natural language descriptions be combined more effectively in multimodal learning to enhance cross-modal learning and reasoning? Current methods still leave room for improvement in multimodal alignment and integration.
  • Open Question 5: How can function-level semantic annotations be introduced into SldprtNet to support function-aware generation and abstraction? The existing dataset mainly covers geometric and structural information and lacks function-level semantics.

Applications

Immediate Applications

Industrial Design Optimization

SldprtNet can be used to optimize the design of industrial parts, improving design efficiency and accuracy. Designers can quickly generate and adjust design schemes using the multimodal information provided by the dataset.

Education and Training

As a teaching tool, SldprtNet can help students and engineers learn and master CAD modeling techniques. With detailed steps and descriptions, users can better understand and apply CAD modeling.

Automated Design

By generating CAD models through natural language descriptions, SldprtNet supports automated design and rapid prototyping, shortening design cycles and improving production efficiency.

Long-term Vision

Cross-Field Applications

The multimodal nature of SldprtNet makes it potentially valuable for CAD modeling in fields such as architecture and aerospace. By expanding the dataset's generalizability, it can support design and optimization in more fields.

Intelligent Design Systems

Combining SldprtNet's data and models, intelligent design systems can be developed to achieve full-process automation and optimization from design to production, improving overall production efficiency and quality.

Abstract

We introduce SldprtNet, a large-scale dataset comprising over 242,000 industrial parts, designed for semantic-driven CAD modeling, geometric deep learning, and the training and fine-tuning of multimodal models for 3D design. The dataset provides 3D models in both .step and .sldprt formats to support diverse training and testing. To enable parametric modeling and facilitate dataset scalability, we developed supporting tools, an encoder and a decoder, which support 13 types of CAD commands and enable lossless transformation between 3D models and a structured text representation. Additionally, each sample is paired with a composite image created by merging seven rendered views from different viewpoints of the 3D model, effectively reducing input token length and accelerating inference. By combining this image with the parameterized text output from the encoder, we employ the lightweight multimodal language model Qwen2.5-VL-7B to generate a natural language description of each part's appearance and functionality. To ensure accuracy, we manually verified and aligned the generated descriptions, rendered images, and 3D models. These descriptions, along with the parameterized modeling scripts, rendered images, and 3D model files, are fully aligned to construct SldprtNet. To assess its effectiveness, we fine-tuned baseline models on a dataset subset, comparing image-plus-text inputs with text-only inputs. Results confirm the necessity and value of multimodal datasets for CAD generation. It features carefully selected real-world industrial parts, supporting tools for scalable dataset expansion, diverse modalities, and ensured diversity in model complexity and geometric features, making it a comprehensive multimodal dataset built for semantic-driven CAD modeling and cross-modal learning.

cs.RO cs.CV
