Characterising LLM-Generated Competency Questions: a Cross-Domain Empirical Study using Open and Closed Models
Using the CompCQ framework, this study analyzes LLM-generated competency questions across domains, revealing distinct generation profiles for open and closed models.
Key Findings
Methodology
The paper introduces a multi-dimensional framework called CompCQ for systematically comparing competency questions generated by different LLMs. This framework analyzes the complexity and readability of generated questions by quantifying linguistic, syntactic, and semantic features. The study uses open models like KimiK2-1T, LLama3.1-8B, LLama3.2-3B, and closed models like Gemini 2.5 Pro and GPT 4.1, conducting experiments across five domains including cultural heritage and healthcare.
Key Results
- In the personalized depression treatment ontology domain, questions generated by KimiK2 scored highest on the complexity metrics and on the Flesch-Kincaid Grade Level (FKGL of 21), indicating that a high educational level is required for comprehension.
- The Gemini model consistently produced the most concise and readable questions across most domains, achieving the lowest FKGL scores, indicating a generation style favoring simple, direct questions.
- Open models showed significant increases in output complexity in technically demanding domains, highlighting limitations in handling complex inputs.
Significance
This study systematically analyzes LLM-generated competency questions, revealing generation characteristics and performance differences across domains. This is crucial for selecting appropriate LLMs for ontology engineering, especially in applications requiring high-quality question generation. Results indicate that no single model fully covers the requirements space, so combining multiple models and retaining human-in-the-loop refinement remains necessary for comprehensive and accurate coverage.
Technical Contribution
The technical contribution lies in the introduction of a multi-dimensional framework, CompCQ, for analyzing and comparing LLM-generated competency questions. This framework considers not only linguistic and syntactic complexity but also introduces methods for evaluating semantic diversity and coverage, providing new perspectives and tools for assessing LLM-generated questions.
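The paper does not specify how semantic diversity and coverage are operationalized in code; as an illustration only, the sketch below approximates the semantic diversity of a set of generated CQs as the mean pairwise cosine distance between sentence embeddings. The embedding model and this definition of diversity are assumptions, not the paper's actual measures.

```python
# Illustrative sketch only: approximating semantic diversity of a CQ set as the
# mean pairwise cosine distance between sentence embeddings. The embedding
# model and this definition are assumptions, not the paper's actual measures.
from itertools import combinations

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

def semantic_diversity(questions: list[str],
                       model_name: str = "all-MiniLM-L6-v2") -> float:
    """Mean pairwise cosine distance between question embeddings (0 = all identical)."""
    model = SentenceTransformer(model_name)
    embeddings = model.encode(questions)
    similarities = cosine_similarity(embeddings)
    pairs = list(combinations(range(len(questions)), 2))
    if not pairs:
        return 0.0
    return sum(1.0 - similarities[i][j] for i, j in pairs) / len(pairs)

cqs = [
    "Which treatments are recommended for a patient with moderate depression?",
    "What side effects are associated with a given antidepressant?",
    "Which artworks in the collection were created before 1900?",
]
print(f"Mean pairwise semantic diversity: {semantic_diversity(cqs):.3f}")
```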
Novelty
This is the first systematic comparison of open and closed LLMs in generating competency questions, introducing a multi-dimensional analysis framework, CompCQ. Unlike previous studies, this research delves into the intrinsic characteristics of generated questions beyond mere feasibility.
Limitations
- Open models exhibit significant increases in question complexity in complex domains, potentially leading to comprehension difficulties.
- Some models generate fewer questions in specific domains, possibly failing to cover the full requirements.
- Closed models, while stable, lack diversity.
Future Work
Future research could explore combining multiple LLMs to enhance question coverage and diversity. Another direction is optimizing the CompCQ framework to better suit different domain needs and reduce the need for human intervention.
AI Executive Summary
In ontology engineering, competency questions (CQs) are essential tools for requirement elicitation, traditionally modeled by ontology engineers and domain experts through a manual process. This process is time-consuming and requires substantial expertise, limiting its widespread application. The introduction of generative AI automates CQ creation, broadening stakeholder engagement and access to ontology engineering.
However, with the widespread adoption of large language models (LLMs), understanding the intrinsic characteristics of their generated CQs becomes crucial. This paper introduces a multi-dimensional framework called CompCQ for systematically comparing CQs generated by different LLMs. Through cross-domain empirical studies, we analyze features such as readability, structural complexity, and semantic diversity of CQs.
The study uses open models like KimiK2-1T, LLama3.1-8B, LLama3.2-3B, and closed models like Gemini 2.5 Pro and GPT 4.1, conducting experiments across five domains including cultural heritage and healthcare. Experimental results show that LLM performance reflects distinct generation profiles shaped by use cases. Closed models excel in stability and readability, while open models offer higher diversity but sometimes sacrifice clarity.
In the personalized depression treatment ontology domain, questions generated by KimiK2 scored highest on the complexity metrics and on the Flesch-Kincaid Grade Level (FKGL of 21), indicating that a high educational level is required for comprehension. Conversely, the Gemini model consistently produced the most concise and readable questions across most domains, achieving the lowest FKGL scores, indicating a generation style favoring simple, direct questions.
Results indicate that no single model fully covers the requirements space, so combining multiple models and retaining human-in-the-loop refinement remains necessary for comprehensive and accurate coverage. Future research could explore optimizing the CompCQ framework to better suit different domain needs and reduce the need for human intervention.
Deep Analysis
Background
Ontology Engineering (OE) is a crucial field in information science, aimed at organizing and sharing knowledge through ontology construction. Requirement elicitation is a foundational stage in the OE lifecycle, determining the functional scope and semantic adequacy of the resulting model. Competency questions (CQs) are recognized as the standard mechanism for this task, serving as a natural language interface between domain experts and ontology engineers. By framing requirements as answerable questions, CQs direct the modeling of concepts and relations, underpin validation and testing, and inform assessments of ontology reuse. However, the manual formulation of CQs remains a major bottleneck, as it requires substantial domain and modeling expertise, leading to their under-utilization in practice. To mitigate this, the OE community has increasingly turned towards automation, from pattern-based approaches to the use of Large Language Models (LLMs).
Core Problem
Despite the rapid adoption of LLMs in OE, understanding their output properties across multiple dimensions and characteristics remains a significant challenge. For example, the influence of model architecture, parameter size, and input domain on features such as linguistic structure, complexity, and semantic diversity of generated CQs is poorly investigated. Treating LLMs as a monolithic solution ignores the substantial variability in their output profiles that ontology engineers must understand to effectively select LLM-based OE tools.
Innovation
The innovation of this paper lies in the introduction of a multi-dimensional framework called CompCQ for systematically comparing competency questions generated by different LLMs. This framework analyzes the complexity and readability of generated questions by quantifying linguistic, syntactic, and semantic features. The study uses open models like KimiK2-1T, LLama3.1-8B, LLama3.2-3B, and closed models like Gemini 2.5 Pro and GPT 4.1, conducting experiments across five domains including cultural heritage and healthcare. Through this approach, we identify salient properties of generated questions, including readability, relevance with respect to the input text, and structural complexity.
Methodology
- Introduce the CompCQ framework: a multi-dimensional framework for quantifying and comparing LLM-generated competency questions.
- Use multiple LLMs for experiments: including open models (KimiK2-1T, LLama3.1-8B, LLama3.2-3B) and closed models (Gemini 2.5 Pro, GPT 4.1).
- Cross-domain analysis: conduct experiments across five domains including cultural heritage and healthcare to identify salient properties of generated questions.
- Quantify features: analyze the readability, structural complexity, and semantic diversity of generated questions (a minimal sketch of this kind of feature extraction follows this list).
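The paper does not provide reference code for these measures; as a rough illustration only, the sketch below computes a few readability and surface-level complexity features for a single CQ. The use of the `textstat` library and this particular feature set are assumptions for illustration, not the paper's implementation.

```python
# Illustrative sketch only: simple readability and surface-complexity features
# for a generated competency question. The `textstat` library and the chosen
# features are assumptions, not the paper's implementation.
import textstat

def cq_features(question: str) -> dict:
    """Return a small feature vector for a single competency question."""
    words = question.split()
    return {
        "fkgl": textstat.flesch_kincaid_grade(question),        # readability as a school grade level
        "flesch_ease": textstat.flesch_reading_ease(question),  # higher = easier to read
        "word_count": len(words),                                # crude length proxy
        "avg_word_length": sum(len(w) for w in words) / max(len(words), 1),
    }

example = ("Which pharmacogenomic markers influence the selection of an "
           "antidepressant for a patient with recurrent depression?")
print(cq_features(example))
```

In a full pipeline, such per-question features would be aggregated per model and per domain to obtain the kind of comparisons reported in the results.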
Experiments
The experimental design covers multiple LLMs, including open and closed models, across five domains including cultural heritage and healthcare. The models used include KimiK2-1T, LLama3.1-8B, LLama3.2-3B, Gemini 2.5 Pro, and GPT 4.1. Metrics used in the experiments include readability, structural complexity, and semantic diversity. To ensure reproducibility, all models were prompted via their respective APIs with parameters set to temperature=0, Top-P=1, and seed=46.
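The paper reports only the decoding parameters, not the prompts themselves; the sketch below shows, under stated assumptions, how a closed model might be called through an OpenAI-compatible API with those deterministic settings. The model identifier, system prompt, and use-case text are placeholders, not the paper's actual prompts.

```python
# Illustrative sketch only: prompting a closed model with the deterministic
# settings reported in the paper (temperature=0, top_p=1, seed=46).
# The model name, prompts, and use case are placeholders, not the paper's.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

use_case = "A museum wants an ontology describing artworks, artists, and exhibitions."

response = client.chat.completions.create(
    model="gpt-4.1",  # placeholder identifier for one of the closed models
    messages=[
        {"role": "system", "content": "You are an ontology engineer eliciting requirements."},
        {"role": "user", "content": f"Generate competency questions for this use case:\n{use_case}"},
    ],
    temperature=0,
    top_p=1,
    seed=46,
)
print(response.choices[0].message.content)
```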
Results
Experimental results show that LLM performance reflects distinct generation profiles shaped by use cases. Closed models excel in stability and readability, while open models offer higher diversity but sometimes sacrifice clarity. In the personalized depression treatment ontology domain, questions generated by KimiK2 scored highest on the complexity metrics and on the Flesch-Kincaid Grade Level (FKGL of 21), indicating that a high educational level is required for comprehension. Conversely, the Gemini model consistently produced the most concise and readable questions across most domains, achieving the lowest FKGL scores, indicating a generation style favoring simple, direct questions.
Applications
The applications of this study include requirement elicitation and validation in ontology engineering, where automatically generating high-quality competency questions significantly reduces human intervention and improves efficiency. Additionally, the approach can be applied to other fields requiring natural language interfaces, such as knowledge graph construction and semantic search.
Limitations & Outlook
Despite providing a systematic method for comparing LLM-generated competency questions, the CompCQ framework faces challenges in handling complex domains where open models exhibit significant increases in question complexity, potentially leading to comprehension difficulties. Additionally, some models generate fewer questions in specific domains, possibly failing to cover the full requirements. Closed models, while stable, lack diversity. Future research could explore combining multiple LLMs to enhance question coverage and diversity.
Plain Language (accessible to non-experts)
Imagine you work in a large supermarket, and your task is to help customers find the products they need. Each customer has different needs; some need to find specific products, while others require recommendations based on certain criteria. To better serve the customers, you need to ask some questions to clarify their needs, similar to competency questions (CQs).
In a traditional supermarket, staff need to manually ask these questions based on their experience and the customer's description, which is time-consuming and requires a lot of experience. In a modern smart supermarket, we can use a technology called generative AI to automatically generate these questions. Generative AI acts like a super-smart assistant that can quickly generate a series of relevant questions based on the customer's description, helping staff better understand the customer's needs.
However, different generative AI assistants may have different styles and characteristics when generating questions. Some assistants generate simple and clear questions that are easy to understand, while others may generate more complex questions that require more background knowledge to understand. Therefore, we need a systematic method to compare the questions generated by these assistants to ensure they can meet the customer's needs.
By using this method, we can better choose the right assistant to help us serve customers, improve work efficiency, and ensure that customers get the products they need.
ELI14 (explained like you're 14)
Hey there! Imagine you're playing a super cool game where you're an explorer looking for hidden treasures on a map. To find these treasures, you need to ask some questions like 'Where is the treasure?' or 'What tools do I need to find it?' These questions are like the competency questions (CQs) we use in ontology engineering.
In this game, you can choose different assistants to help you generate these questions. Some assistants are really smart and can quickly give you simple and clear questions, while others like to give more complex questions that make you think a bit more. Our task is to find the best assistant to help us find the treasure faster.
To do this, we need a super cool tool called CompCQ. It helps us compare the questions generated by different assistants to see which one is better for our adventure. With this tool, we can know which assistant generates easier-to-understand questions and which one generates more creative questions.
So, next time you face a challenge in the game, don't forget to use CompCQ to choose the best assistant to help you complete the task and find all the treasures!
Glossary
Competency Questions
Competency questions are natural language questions used in ontology engineering for requirement elicitation, helping define the scope and functionality of an ontology.
In this paper, the competency questions generated by LLMs are themselves the objects of evaluation, analyzed for quality and applicability.
Generative AI
Generative AI is a type of artificial intelligence technology capable of automatically generating content, widely used in text generation, image generation, and other fields.
In this paper, generative AI is used to automatically generate competency questions.
Large Language Models (LLMs)
Large language models are deep learning-based natural language processing models with large-scale parameters and powerful generation capabilities.
In this paper, LLMs are used to generate and compare competency questions across different domains.
CompCQ Framework
CompCQ is a multi-dimensional framework for comparing LLM-generated competency questions, quantifying linguistic, syntactic, and semantic features.
The paper introduces the CompCQ framework to systematically analyze the complexity and readability of LLM-generated questions.
Readability
Readability refers to the ease with which a text can be read and understood, often evaluated using metrics like the Flesch-Kincaid Grade Level.
In this paper, readability is used to assess the difficulty of understanding LLM-generated questions.
Structural Complexity
Structural complexity refers to the syntactic and semantic complexity of a text, affecting its difficulty in understanding and processing.
In this paper, structural complexity is used to analyze the complexity of LLM-generated questions.
Semantic Diversity
Semantic diversity refers to the variety and coverage of meanings expressed across a set of texts, affecting the richness of information they convey.
In this paper, semantic diversity is used to evaluate the diversity of LLM-generated questions.
Open Models
Open models are LLMs whose weights are openly released and can be downloaded and run locally, typically offering higher diversity and flexibility.
In this paper, open models are used to generate and compare competency questions across different domains.
Closed Models
Closed models are proprietary LLMs accessed through a provider's API, with weights that are not publicly released, typically offering higher stability and consistency.
In this paper, closed models are used to generate and compare competency questions across different domains.
Flesch-Kincaid Grade Level
The Flesch-Kincaid Grade Level is a metric for assessing text readability, indicating the years of education required to understand the text.
In this paper, the Flesch-Kincaid Grade Level is used to evaluate the readability of LLM-generated questions.
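For reference, the standard Flesch-Kincaid Grade Level formula (a well-known published formula, not something specific to this paper) combines average sentence length and average syllables per word:

```latex
\mathrm{FKGL} = 0.39\,\frac{\text{total words}}{\text{total sentences}}
              + 11.8\,\frac{\text{total syllables}}{\text{total words}} - 15.59
```

An FKGL of 21, as reported for KimiK2 in the personalized depression treatment domain, therefore corresponds to text well beyond a typical undergraduate reading level.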
Open Questions (unanswered questions from this research)
- 1 Despite providing a systematic method for comparing LLM-generated competency questions, the CompCQ framework faces challenges in handling complex domains where open models exhibit significant increases in question complexity, potentially leading to comprehension difficulties. Future research could explore optimizing the framework to reduce this complexity.
- 2 Some models generate fewer questions in specific domains, possibly failing to cover the full requirements. This indicates a need for further research on how to improve LLMs' generation coverage and diversity.
- 3 Closed models, while stable, lack diversity. Future research could explore how to enhance diversity while maintaining stability.
- 4 Current research mainly focuses on text generation, and future work could explore how to apply the CompCQ framework to other generation tasks, such as image and audio generation.
- 5 While LLMs perform well in generating competency questions, their performance in handling multilingual and cross-cultural requirements needs further investigation.
Applications
Immediate Applications
Ontology Engineering Requirement Elicitation
Automatically generating high-quality competency questions significantly reduces human intervention and improves efficiency in ontology engineering.
Knowledge Graph Construction
Generative AI-generated competency questions can guide the construction and validation of knowledge graphs, ensuring semantic completeness.
Semantic Search Optimization
By generating relevant competency questions, semantic search accuracy and relevance can be improved, providing users with more precise search results.
Long-term Vision
Cross-Domain Knowledge Integration
By generating competency questions, integration and sharing of knowledge across different domains can be achieved, promoting interdisciplinary collaboration and innovation.
Intelligent Assistant Development
In the future, intelligent assistants based on generative AI can be developed to help users quickly access needed information in various scenarios, improving work and life efficiency.
Abstract
Competency Questions (CQs) are a cornerstone of requirement elicitation in ontology engineering. CQs represent requirements as a set of natural language questions that an ontology should satisfy; they are traditionally modelled by ontology engineers together with domain experts as part of a human-centred, manual elicitation process. The use of Generative AI automates CQ creation at scale, therefore democratising the process of generation, widening stakeholder engagement, and ultimately broadening access to ontology engineering. However, given the large and heterogeneous landscape of LLMs, varying in dimensions such as parameter scale, task and domain specialisation, and accessibility, it is crucial to characterise and understand the intrinsic, observable properties of the CQs they produce (e.g., readability, structural complexity) through a systematic, cross-domain analysis. This paper introduces a set of quantitative measures for the systematic comparison of CQs across multiple dimensions. Using CQs generated from well defined use cases and scenarios, we identify their salient properties, including readability, relevance with respect to the input text and structural complexity of the generated questions. We conduct our experiments over a set of use cases and requirements using a range of LLMs, including both open (KimiK2-1T, LLama3.1-8B, LLama3.2-3B) and closed models (Gemini 2.5 Pro, GPT 4.1). Our analysis demonstrates that LLM performance reflects distinct generation profiles shaped by the use case.