ECLASS-Augmented Semantic Product Search for Electronic Components

TL;DR

An ECLASS-augmented dense retrieval method achieves a HitRate@5 of 94.3% on expert queries in semantic search for electronic components.

cs.IR 🔴 Advanced 2026-04-22
Nico Baumgart, Markus Lange-Hegermann, Jan Henze
semantic search, dense retrieval, ECLASS, Industry 4.0, LLM

Key Findings

Methodology

This paper proposes a semantic retrieval method augmented with the ECLASS standard, using LLM-assisted dense retrieval to improve semantic search for industrial electronic components. The method has three stages: query rewriting, retrieval, and re-ranking. An LLM first transforms the natural language query into an attribute-focused expression; the rewritten query is then embedded into the same vector space as the products for retrieval; finally, a re-ranking model scores the relevance of each retrieved product to the query.

Key Results

  • Result 1: The ECLASS-augmented dense retrieval method achieved a HitRate@5 of 94.3% on expert queries, significantly outperforming the traditional BM25 method's 31.4%.
  • Result 2: Across different configurations, ECLASS semantics-enhanced product representations consistently showed performance improvements, particularly when combined with re-ranking, where MRR improved by approximately 10-20%.
  • Result 3: Experiments demonstrated that using higher-dimensional embeddings (e.g., 2560 dimensions) generally outperformed lower-dimensional embeddings (e.g., 1024 dimensions).

Significance

This research bridges the semantic gap between natural language queries and manufacturer-specific terminology in industrial product data by integrating hierarchical semantics from the ECLASS standard into embedding-based retrieval. The method holds significant academic value and offers practical solutions for factory automation and engineering workflows in the context of Industry 4.0.

Technical Contribution

Technically, the study introduces an ECLASS-augmented dense retrieval method that significantly enhances semantic search performance for industrial products. By incorporating standardized hierarchical metadata, it provides a crucial semantic bridge between user intent and sparse product descriptions. Additionally, the paper demonstrates the potential of effectively leveraging industrial classification standards in embedding-based retrieval pipelines.

Novelty

This study is the first to systematically evaluate the integration of ECLASS standard semantics into dense retrieval techniques for industrial electronic component semantic search. Compared to existing work, the method not only shows significant performance improvements but also addresses the long-standing issue of semantic mismatch by introducing standardized hierarchical semantics.

Limitations

  • Limitation 1: Pure dense retrieval with re-ranking may not reliably handle queries that require computing aggregate or ratio-like features from heterogeneous product fields.
  • Limitation 2: In highly specialized domains, terminology ambiguity may lead the retrieval pipeline to rank irrelevant products ahead of the target products.
  • Limitation 3: The query rewriting strategy may remove important information from the query, affecting retrieval effectiveness.

Future Work

Future directions include: 1) further optimizing query rewriting strategies to retain more critical information from queries; 2) exploring how to better handle aggregate or ratio-like features in dense retrieval; 3) investigating the application of ECLASS-augmented semantic retrieval methods in other industrial domains.

AI Executive Summary

In the context of Industry 4.0, the digital transformation of factory automation and engineering workflows is rapidly advancing. However, traditional retrieval methods like BM25 are limited in handling the semantic mismatch between natural language queries and manufacturer-specific terminology. To address this, the paper proposes a semantic retrieval method augmented with the ECLASS standard, using LLM-assisted dense retrieval techniques to enhance semantic search performance for industrial electronic components.

The method comprises three core components: query rewriting, retrieval, and re-ranking. Initially, an LLM transforms natural language queries into attribute-focused expressions, which are then embedded into a shared vector space for retrieval. Subsequently, a re-ranking model evaluates the relevance between queries and products, improving retrieval accuracy.

Experimental results demonstrate that the ECLASS-augmented dense retrieval method achieved a HitRate@5 of 94.3% on expert queries, significantly outperforming the traditional BM25 method's 31.4%. Additionally, across different configurations, ECLASS semantics-enhanced product representations consistently showed performance improvements, particularly when combined with re-ranking, where MRR improved by approximately 10-20%.

However, the study also highlights some limitations, such as the inability of pure dense retrieval with re-ranking to reliably compute aggregate or ratio-like features from heterogeneous product fields. Additionally, in highly specialized domains, terminology ambiguity may lead the retrieval pipeline to rank irrelevant products ahead of the target products. Future research directions include further optimizing query rewriting strategies and exploring the application of ECLASS-augmented semantic retrieval methods in other industrial domains.

Deep Analysis

Background

The rise of Industry 4.0 has driven the digital transformation of manufacturing, with technologies such as the Internet of Things, Artificial Intelligence, and Big Data being widely applied in production environments. In this context, the Asset Administration Shell (AAS) serves as a standardized digital representation of industrial assets, facilitating interoperability across heterogeneous systems. To achieve semantic interoperability, standardized vocabularies like ECLASS are widely used to describe products in a machine-interpretable manner. ECLASS is an international classification and description standard that organizes products in a hierarchical taxonomy and defines shared names, attributes, and semantics.

Core Problem

In industrial product data, the semantic mismatch between natural language queries and manufacturer-specific terminology is a long-standing issue. Traditional lexical retrieval methods like BM25 are limited in handling this semantic mismatch, especially when users or LLM agents are unfamiliar with manufacturer-specific terminology. Although recent advances in LLMs and dense retrieval have changed retrieval system design by combining vector search with query rewriting and re-ranking, enabling semantic matching beyond lexical overlap, their performance on structured industrial catalogs with attribute-centric product descriptions remains insufficiently studied.
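The lexical limitation described above can be illustrated with a minimal sketch of BM25 scoring. This is not the paper's baseline implementation: it uses the common variant with a smoothed IDF, and the parameter defaults (`k1=1.5`, `b=0.75`), the whitespace tokenizer, and the two product descriptions are assumptions chosen for illustration. The point is only that a query sharing no terms with a product description scores zero, however close the intent.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with BM25 (smoothed-IDF variant).

    Assumes `docs` is a non-empty list of strings; tokenization is a plain
    lowercase whitespace split, which is a simplification.
    """
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n
    # Document frequency: in how many documents each term occurs.
    df = Counter(t for d in tokenized for t in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # no lexical overlap, no contribution
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical catalog entries written in manufacturer-style terminology.
DOCS = ["switched-mode power supply 24 V DC", "DIN rail terminal block"]

lexical_match = bm25_scores("power supply", DOCS)
paraphrase = bm25_scores("device that converts mains voltage for the controller", DOCS)
```

In this sketch `lexical_match` gives the first product a positive score, while `paraphrase` scores every product at zero because none of its words appear in the catalog text. That is the gap that query rewriting and dense embeddings are meant to close.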

Innovation

The core innovation of this paper lies in integrating hierarchical semantics from the ECLASS standard into embedding-based retrieval, proposing an ECLASS-augmented dense retrieval method. Specifically, the method: 1) transforms natural language queries into attribute-focused expressions using LLMs, addressing the semantic mismatch issue; 2) enhances product representations with ECLASS standard semantics, providing a crucial semantic bridge between user intent and sparse product descriptions; 3) evaluates the relevance between queries and products using a re-ranking model, improving retrieval accuracy.

Methodology

Method details:

  • Query Rewriting: Use LLMs to transform natural language queries into attribute-focused expressions, embedding them into a shared vector space.
  • Retrieval: Use LLM embedding models to embed each product into a vector space, and at query time, embed the (rewritten) query for similarity comparison.
  • Re-ranking: Use a re-ranking model to evaluate the relevance between queries and products, capturing more complex semantic relationships and improving retrieval accuracy.
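The three stages above can be sketched as a small pipeline. This is a toy illustration under stated assumptions, not the authors' implementation: `rewrite_query` is a rule-based stand-in for the LLM rewriting step, `embed` is a bag-of-words stand-in for an LLM embedding model, `rerank` is a term-overlap stand-in for a learned re-ranking model, and the product entries are hypothetical.

```python
import math
from collections import Counter

def rewrite_query(query: str) -> str:
    """Stand-in for LLM query rewriting: keep only attribute-like terms."""
    stopwords = {"i", "need", "a", "an", "the", "for", "with", "that", "my"}
    return " ".join(t for t in query.lower().split() if t not in stopwords)

def embed(text: str) -> Counter:
    """Toy embedding (term counts); a real system would call an embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, products: list, k: int) -> list:
    """Stage 2: rank all products by similarity to the query, keep the top k."""
    q = embed(query)
    ranked = sorted(products, key=lambda p: cosine(q, embed(p["text"])), reverse=True)
    return ranked[:k]

def rerank(query: str, candidates: list) -> list:
    """Stage 3 stand-in: score candidates by the fraction of query terms they contain."""
    terms = set(query.split())
    def score(p):
        return len(terms & set(p["text"].lower().split())) / max(len(terms), 1)
    return sorted(candidates, key=score, reverse=True)

def search(query: str, products: list, k: int = 2) -> list:
    """Run rewriting, retrieval, and re-ranking in sequence."""
    rewritten = rewrite_query(query)
    return rerank(rewritten, retrieve(rewritten, products, k))

# Hypothetical catalog entries.
PRODUCTS = [
    {"id": "P1", "text": "power supply 24 v dc output"},
    {"id": "P2", "text": "terminal block screw connection"},
    {"id": "P3", "text": "circuit breaker 16 a"},
]

results = search("I need a 24 V power supply for the cabinet", PRODUCTS, k=2)
```

Keeping the stages as separate functions mirrors the evaluation setup, where rewriting and re-ranking can be switched on or off independently to measure their contribution.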

Experiments

The experiments use a product database based on the ECLASS 13.0 standard, covering a representative subset of products from the domain of control cabinet components, together with a manually curated dataset that combines expert and non-expert perspectives and enables both quantitative and qualitative analysis. They evaluate how retrieval components, including embedding models, query rewriting, re-ranking, and hyperparameter settings, interact with structured product data, and they find that ECLASS semantics-enhanced product representations consistently improve performance across configurations.

Results

Results analysis shows that the ECLASS-augmented dense retrieval method achieved a HitRate@5 of 94.3% on expert queries, significantly outperforming the traditional BM25 method's 31.4%. Additionally, using higher-dimensional embeddings (e.g., 2560 dimensions) generally outperformed lower-dimensional embeddings (e.g., 1024 dimensions). Across different configurations, ECLASS semantics-enhanced product representations consistently showed performance improvements, particularly when combined with re-ranking, where MRR improved by approximately 10-20%.

Applications

The method has direct application value in factory automation and engineering workflows in the context of Industry 4.0. By addressing the semantic mismatch between natural language queries and manufacturer-specific terminology, the method can be used to improve semantic retrieval performance for industrial product data, supporting engineers and autonomous agents in identifying suitable components from structured catalogs.

Limitations & Outlook

Despite significant performance improvements, the method may not reliably compute aggregate or ratio-like features from heterogeneous product fields when handling queries requiring such calculations. Additionally, in highly specialized domains, terminology ambiguity may lead the retrieval pipeline to rank irrelevant products ahead of the target products. Future research directions include further optimizing query rewriting strategies and exploring the application of ECLASS-augmented semantic retrieval methods in other industrial domains.

Plain Language Accessible to non-experts

Imagine you're in a massive electronics store trying to find a specific component. The store has thousands of products, each with detailed technical specifications but no simple descriptions. You might ask the store clerk, "I need a component suitable for a specific application," but the clerk might not understand your request because you're using everyday language, not technical jargon.

It's like being in a restaurant and wanting to order a dish you've never heard of. You might describe the flavors and feel you want, but the server needs to know the exact dish name and ingredients to help you find it. Our research is like equipping this restaurant with a super-intelligent server who can not only understand your description but also find the most suitable dish based on the detailed menu information.

Our method uses a standard called ECLASS, which is like the restaurant's menu classification system. It helps our "server" understand the specific details of each dish and translate your description into a language they can understand. This way, even if you use everyday language, our system can find the most suitable product.

In this way, we solve the common problem of semantic mismatch in industrial product search, making it easier for engineers and automated systems to find the components they need.

Glossary

ECLASS

ECLASS is an international classification and description standard used to organize products in a hierarchical taxonomy and define shared names, attributes, and semantics.

In this paper, ECLASS is used to enhance the semantic information of product representations.
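One plausible way to enhance a product representation with ECLASS semantics is to append the product's class path to the text that gets embedded. The sketch below assumes this approach; the source does not specify the exact serialization. The field names (`name`, `attributes`, `eclass_path`), the separator format, and the class labels in the example are hypothetical and are not taken from the ECLASS standard itself.

```python
def build_product_text(product: dict, use_eclass: bool = True) -> str:
    """Serialize a product into the text that is passed to the embedding model.

    With use_eclass=True the hierarchical class path is appended, so that
    sparse manufacturer descriptions gain standardized category vocabulary.
    """
    parts = [product["name"]]
    parts += [f"{key}: {value}" for key, value in product["attributes"].items()]
    if use_eclass:
        parts.append("ECLASS: " + " > ".join(product["eclass_path"]))
    return " | ".join(parts)

# Hypothetical product record; the class labels are illustrative placeholders.
product = {
    "name": "PS-24/5",
    "attributes": {"output voltage": "24 V DC", "output current": "5 A"},
    "eclass_path": ["Electrical engineering", "Power supply", "DC power supply"],
}

plain_text = build_product_text(product, use_eclass=False)
augmented_text = build_product_text(product, use_eclass=True)
```

Comparing retrieval quality over `plain_text` and `augmented_text` representations is the kind of ablation that shows whether the hierarchy adds signal beyond the raw attributes.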

Dense Retrieval

Dense retrieval is an information retrieval method that represents queries and documents as dense embedding vectors and ranks documents by their vector similarity to the query.

Dense retrieval techniques are used in this paper to improve semantic search performance.

Re-ranking

Re-ranking is a method that re-evaluates the relevance of candidate results after initial retrieval to improve the accuracy of retrieval results.

In this paper, re-ranking is used to evaluate the relevance between queries and products.

Large Language Model (LLM)

A large language model is a deep learning-based natural language processing model capable of understanding and generating natural language text.

LLMs are used in this paper for query rewriting and embedding generation.

HitRate@5

HitRate@5 is an evaluation metric in information retrieval that indicates the proportion of queries for which at least one relevant result is found in the top 5 results.

HitRate@5 is used in this paper to evaluate the performance of retrieval methods.
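The definition above translates directly into a few lines of code. The following is a minimal sketch; the function name, the argument layout (one ranked list and one set of relevant product IDs per query), and the example data are assumptions made for illustration.

```python
def hit_rate_at_k(ranked_ids: list, relevant_ids: list, k: int = 5) -> float:
    """Fraction of queries with at least one relevant product in the top k.

    ranked_ids[i] is the ranked result list for query i;
    relevant_ids[i] is the set of relevant product IDs for query i.
    """
    if not ranked_ids:
        return 0.0
    hits = 0
    for ranked, relevant in zip(ranked_ids, relevant_ids):
        if any(pid in relevant for pid in ranked[:k]):
            hits += 1
    return hits / len(ranked_ids)

# Hypothetical results for three queries.
ranked = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"]]
relevant = [{"c"}, {"x"}, {"g"}]

score_at_5 = hit_rate_at_k(ranked, relevant, k=5)
score_at_1 = hit_rate_at_k(ranked, relevant, k=1)
```

Here two of the three queries have a relevant product within the top 5, but only one has it at rank 1, so the metric drops as `k` shrinks.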

MRR (Mean Reciprocal Rank)

MRR is an information retrieval evaluation metric: the average, over all queries, of the reciprocal rank of the first relevant result.

MRR is used in this paper to evaluate the performance of retrieval methods.
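As a minimal sketch of the metric, the function below averages the reciprocal rank of the first relevant result per query, counting a query with no relevant result as zero (a common convention, assumed here). The function name, argument layout, and example data are illustrative.

```python
def mean_reciprocal_rank(ranked_ids: list, relevant_ids: list) -> float:
    """Average of 1/rank of the first relevant result, over all queries.

    Queries whose ranked list contains no relevant product contribute 0.
    """
    if not ranked_ids:
        return 0.0
    total = 0.0
    for ranked, relevant in zip(ranked_ids, relevant_ids):
        for rank, pid in enumerate(ranked, start=1):
            if pid in relevant:
                total += 1.0 / rank
                break  # only the first relevant result counts
    return total / len(ranked_ids)

# Hypothetical results: first relevant hit at rank 1, at rank 2, and not found.
ranked = [["a", "b"], ["c", "d"], ["e", "f"]]
relevant = [{"a"}, {"d"}, {"z"}]

mrr = mean_reciprocal_rank(ranked, relevant)
```

Unlike HitRate@5, this metric rewards moving the target product from rank 4 to rank 1, which is why re-ranking gains tend to show up in MRR.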

Query Rewriting

Query rewriting is a method that transforms a user's natural language query into a more effective form for retrieval.

Query rewriting is performed using LLMs in this paper.

Vector Space Model

A vector space model is an information retrieval model that represents documents and queries as vectors and computes their similarity for retrieval.

Vector space models are used in dense retrieval in this paper.

Semantic Mismatch

Semantic mismatch refers to the semantic differences between natural language queries and document descriptions, leading to poor retrieval performance.

Semantic mismatch is addressed in this paper using the ECLASS standard.

Industry 4.0

Industry 4.0 refers to the fourth industrial revolution, characterized by the digital transformation of manufacturing through the integration of IoT, AI, and Big Data into production environments.

The paper discusses semantic retrieval issues in the context of Industry 4.0.

Open Questions Unanswered questions from this research

  • 1. How to better handle aggregate or ratio-like features in dense retrieval remains an open question. Current methods may not reliably compute these features when handling queries requiring such calculations. Future research needs to explore new methods to address this issue.
  • 2. Terminology ambiguity in highly specialized domains remains a challenge. Although the proposed method addresses the semantic mismatch issue to some extent, the retrieval pipeline may still rank irrelevant products ahead of the target products when dealing with terminology ambiguity.
  • 3. How to apply ECLASS-augmented semantic retrieval methods in other industrial domains requires further research. While the method performs well in the electronic component domain, its applicability and performance in other domains need to be verified.
  • 4. Optimizing query rewriting strategies remains an area for research. Current strategies may remove important information from queries, affecting retrieval effectiveness. Future research needs to explore more effective query rewriting strategies.
  • 5. How to improve retrieval performance without increasing computational costs is an important research direction. While the method shows significant performance improvements, computational costs remain high. Future research needs to explore more efficient retrieval methods.

Applications

Immediate Applications

Industrial Product Semantic Retrieval

The method can be used to improve semantic retrieval performance for industrial product data, supporting engineers and autonomous agents in identifying suitable components from structured catalogs.

Factory Automation

By addressing the semantic mismatch between natural language queries and manufacturer-specific terminology, the method can be used for component selection and configuration in factory automation.

Engineering Workflow Optimization

The method can be used to optimize component search and selection processes in engineering workflows, improving efficiency and accuracy.

Long-term Vision

Cross-domain Semantic Retrieval

In the future, the method can be extended to other industrial domains, achieving cross-domain semantic retrieval and improving interoperability between different fields.

Smart Manufacturing

By combining with other smart manufacturing technologies, the method can be used to achieve more efficient production processes and smarter manufacturing systems.

Abstract

Efficient semantic access to industrial product data is a key enabler for factory automation and emerging LLM-based agent workflows, where both human engineers and autonomous agents must identify suitable components from highly structured catalogs. However, the vocabulary mismatch between natural-language queries and attribute-centric product descriptions limits the effectiveness of traditional retrieval approaches, e.g., BM25. In this work, we present a systematic evaluation of LLM-assisted dense retrieval for semantic product search on industrial electronic components, and investigate the integration of hierarchical semantics from the ECLASS standard into embedding-based retrieval. Our results show that dense retrieval combined with re-ranking substantially outperforms classical lexical methods and foundation model web-search baselines. In particular, the proposed approach achieves a Hit_Rate@5 of 94.3 %, compared to 31.4 % for BM25 on expert queries, while also exceeding foundation model baselines in both effectiveness and efficiency. Furthermore, augmenting product representations with ECLASS semantics yields consistent performance gains across configurations, demonstrating that standardized hierarchical metadata provides a crucial semantic bridge between user intent and sparse product descriptions.
