Characterising LLM-Generated Competency Questions: a Cross-Domain Empirical Study using Open and Closed Models

TL;DR

Using the CompCQ framework, this study analyzes LLM-generated competency questions across domains, revealing generation characteristics.

cs.AI · Advanced · 2026-04-18
Reham Alharbi Valentina Tamma Terry R. Payne Jacopo de Berardinis
Generative AI Competency Questions Cross-Domain Analysis Open Models Closed Models

Key Findings

Methodology

The paper introduces a multi-dimensional framework called CompCQ for systematically comparing competency questions generated by different LLMs. This framework analyzes the complexity and readability of generated questions by quantifying linguistic, syntactic, and semantic features. The study uses open models like KimiK2-1T, LLama3.1-8B, LLama3.2-3B, and closed models like Gemini 2.5 Pro and GPT 4.1, conducting experiments across five domains including cultural heritage and healthcare.
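
The paper does not ship a reference implementation of CompCQ; purely as an illustration of the kind of per-question feature extraction its linguistic and syntactic dimensions imply, a minimal sketch might look like the following (the function name, feature choices, and word lists are assumptions, not the paper's):

```python
import re

def cq_features(question: str) -> dict:
    """Toy per-question feature vector in the spirit of CompCQ.
    Feature names and heuristics are illustrative, not from the paper."""
    words = re.findall(r"[A-Za-z']+", question)
    wh = words[0].lower() if words else ""
    return {
        "n_words": len(words),  # linguistic: question length
        "wh_type": wh if wh in {"what", "which", "who", "when",
                                "where", "how", "why"} else "other",
        # crude syntactic proxy: relative pronouns and conjunctions
        # suggest embedded clauses
        "n_clauses": 1 + sum(w.lower() in {"that", "which", "who", "and"}
                             for w in words[1:]),
        "has_conjunction": any(w.lower() in {"and", "or"} for w in words),
    }

print(cq_features("Which treatments are recommended for patients who have depression?"))
```

A real feature extractor would use a proper parser for clause counts; the point here is only that each question maps to a comparable numeric profile.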

Key Results

  • In the personalized depression treatment ontology domain, KimiK2-generated questions scored highest on the complexity metrics and the FKGL readability index, at 21, indicating that a graduate-level education is required for comprehension.
  • The Gemini model consistently produced the most concise and readable questions across most domains, achieving the lowest FKGL scores, indicating a generation style favoring simple, direct questions.
  • Open models showed marked increases in output complexity in technically demanding domains, highlighting limitations in handling complex inputs.

Significance

This study systematically analyzes LLM-generated competency questions, revealing generation characteristics and performance differences across domains. This is crucial for selecting appropriate LLMs for ontology engineering, especially in applications requiring high-quality question generation. Results indicate that no single model fully covers the requirements space, so combining multiple models and retaining human-in-the-loop refinement remains necessary for comprehensive, accurate coverage.

Technical Contribution

The technical contribution lies in the introduction of a multi-dimensional framework, CompCQ, for analyzing and comparing LLM-generated competency questions. This framework considers not only linguistic and syntactic complexity but also introduces methods for evaluating semantic diversity and coverage, providing new perspectives and tools for assessing LLM-generated questions.

Novelty

This is the first systematic comparison of open and closed LLMs in generating competency questions, introducing a multi-dimensional analysis framework, CompCQ. Unlike previous studies, this research delves into the intrinsic characteristics of generated questions beyond mere feasibility.

Limitations

  • Open models exhibit significant increases in question complexity in complex domains, potentially leading to comprehension difficulties.
  • Some models generate fewer questions in specific domains, possibly failing to cover the full requirements.
  • Closed models, while stable, lack diversity.

Future Work

Future research could explore combining multiple LLMs to enhance question coverage and diversity. Another direction is optimizing the CompCQ framework to better suit different domain needs and reduce the need for human intervention.

AI Executive Summary

In ontology engineering, competency questions (CQs) are essential tools for requirement elicitation, traditionally modeled by ontology engineers and domain experts through a manual process. This process is time-consuming and requires substantial expertise, limiting its widespread application. The introduction of generative AI automates CQ creation, broadening stakeholder engagement and access to ontology engineering.

However, with the widespread adoption of large language models (LLMs), understanding the intrinsic characteristics of their generated CQs becomes crucial. This paper introduces a multi-dimensional framework called CompCQ for systematically comparing CQs generated by different LLMs. Through cross-domain empirical studies, we analyze features such as readability, structural complexity, and semantic diversity of CQs.

The study uses open models like KimiK2-1T, LLama3.1-8B, LLama3.2-3B, and closed models like Gemini 2.5 Pro and GPT 4.1, conducting experiments across five domains including cultural heritage and healthcare. Experimental results show that LLM performance reflects distinct generation profiles shaped by use cases. Closed models excel in stability and readability, while open models offer higher diversity but sometimes sacrifice clarity.

In the personalized depression treatment ontology domain, KimiK2-generated questions scored highest on the complexity metrics and the FKGL readability index, at 21, indicating that a graduate-level education is required for comprehension. Conversely, the Gemini model consistently produced the most concise and readable questions across most domains, achieving the lowest FKGL scores, indicating a generation style favoring simple, direct questions.

Results indicate that no single model fully covers the requirements space, so combining multiple models and retaining human-in-the-loop refinement remains necessary for comprehensive, accurate coverage. Future research could explore optimizing the CompCQ framework to better suit different domain needs and reduce the necessity for human intervention.

Deep Analysis

Background

Ontology Engineering (OE) is a crucial field in information science, aimed at organizing and sharing knowledge through ontology construction. Requirement elicitation is a foundational stage in the OE lifecycle, determining the functional scope and semantic adequacy of the resulting model. Competency questions (CQs) are recognized as the standard mechanism for this task, serving as a natural language interface between domain experts and ontology engineers. By framing requirements as answerable questions, CQs direct the modeling of concepts and relations, underpin validation and testing, and inform assessments of ontology reuse. However, the manual formulation of CQs remains a major bottleneck, as it requires substantial domain and modeling expertise, leading to their under-utilization in practice. To mitigate this, the OE community has increasingly turned towards automation, from pattern-based approaches to the use of Large Language Models (LLMs).

Core Problem

Despite the rapid adoption of LLMs in OE, understanding their output properties across multiple dimensions and characteristics remains a significant challenge. For example, the influence of model architecture, parameter size, and input domain on features such as linguistic structure, complexity, and semantic diversity of generated CQs is poorly investigated. Treating LLMs as a monolithic solution ignores the substantial variability in their output profiles that ontology engineers must understand to effectively select LLM-based OE tools.

Innovation

The innovation of this paper lies in the introduction of a multi-dimensional framework called CompCQ for systematically comparing competency questions generated by different LLMs. This framework analyzes the complexity and readability of generated questions by quantifying linguistic, syntactic, and semantic features. The study uses open models like KimiK2-1T, LLama3.1-8B, LLama3.2-3B, and closed models like Gemini 2.5 Pro and GPT 4.1, conducting experiments across five domains including cultural heritage and healthcare. Through this approach, we identify salient properties of generated questions, including readability, relevance with respect to the input text, and structural complexity.

Methodology

  • Introduce the CompCQ framework: a multi-dimensional framework for quantifying and comparing LLM-generated competency questions.
  • Use multiple LLMs for experiments: including open models (KimiK2-1T, LLama3.1-8B, LLama3.2-3B) and closed models (Gemini 2.5 Pro, GPT 4.1).
  • Cross-domain analysis: conduct experiments across five domains including cultural heritage and healthcare to identify salient properties of generated questions.
  • Quantify features: analyze the readability, structural complexity, and semantic diversity of generated questions.

Experiments

The experimental design covers multiple LLMs, including open and closed models, across five domains including cultural heritage and healthcare. The models used include KimiK2-1T, LLama3.1-8B, LLama3.2-3B, Gemini 2.5 Pro, and GPT 4.1. Metrics used in the experiments include readability, structural complexity, and semantic diversity. To ensure reproducibility, all models were prompted via their respective APIs with parameters set to temperature=0, Top-P=1, and seed=46.
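
The reported decoding settings can be captured in one shared configuration. The dictionary below mirrors the values stated above; how they are passed to each model differs per provider SDK, so any surrounding client code is deliberately omitted:

```python
# Decoding parameters reported in the paper, shared across all five models.
# This is just the shared config; each provider's API accepts these
# under its own client interface.
GENERATION_PARAMS = {
    "temperature": 0,  # deterministic (greedy-like) decoding
    "top_p": 1,        # no nucleus truncation
    "seed": 46,        # fixed seed, honoured where the API supports it
}

print(GENERATION_PARAMS)
```

Pinning temperature, top-p, and seed in one place makes cross-model comparisons attributable to the models rather than to sampling noise.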

Results

Experimental results show that LLM performance reflects distinct generation profiles shaped by use cases. Closed models excel in stability and readability, while open models offer higher diversity but sometimes sacrifice clarity. In the personalized depression treatment ontology domain, KimiK2-generated questions scored highest on the complexity metrics and the FKGL readability index, at 21, indicating that a graduate-level education is required for comprehension. Conversely, the Gemini model consistently produced the most concise and readable questions across most domains, achieving the lowest FKGL scores, indicating a generation style favoring simple, direct questions.

Applications

The applications of this study include requirement elicitation and validation processes in ontology engineering. By automatically generating high-quality competency questions, it significantly reduces human intervention and improves efficiency. Additionally, this approach can be applied to other fields requiring natural language interfaces, such as knowledge graph construction and semantic search.

Limitations & Outlook

Despite providing a systematic method for comparing LLM-generated competency questions, the CompCQ framework faces challenges in handling complex domains where open models exhibit significant increases in question complexity, potentially leading to comprehension difficulties. Additionally, some models generate fewer questions in specific domains, possibly failing to cover the full requirements. Closed models, while stable, lack diversity. Future research could explore combining multiple LLMs to enhance question coverage and diversity.

Plain Language: Accessible to non-experts

Imagine you work in a large supermarket, and your task is to help customers find the products they need. Each customer has different needs; some need to find specific products, while others require recommendations based on certain criteria. To better serve the customers, you need to ask some questions to clarify their needs, similar to competency questions (CQs).

In a traditional supermarket, staff need to manually ask these questions based on their experience and the customer's description, which is time-consuming and requires a lot of experience. In a modern smart supermarket, we can use a technology called generative AI to automatically generate these questions. Generative AI acts like a super-smart assistant that can quickly generate a series of relevant questions based on the customer's description, helping staff better understand the customer's needs.

However, different generative AI assistants may have different styles and characteristics when generating questions. Some assistants generate simple and clear questions that are easy to understand, while others may generate more complex questions that require more background knowledge to understand. Therefore, we need a systematic method to compare the questions generated by these assistants to ensure they can meet the customer's needs.

By using this method, we can better choose the right assistant to help us serve customers, improve work efficiency, and ensure that customers get the products they need.

ELI14: Explained like you're 14

Hey there! Imagine you're playing a super cool game where you're an explorer looking for hidden treasures on a map. To find these treasures, you need to ask some questions like 'Where is the treasure?' or 'What tools do I need to find it?' These questions are like the competency questions (CQs) we use in ontology engineering.

In this game, you can choose different assistants to help you generate these questions. Some assistants are really smart and can quickly give you simple and clear questions, while others like to give more complex questions that make you think a bit more. Our task is to find the best assistant to help us find the treasure faster.

To do this, we need a super cool tool called CompCQ. It helps us compare the questions generated by different assistants to see which one is better for our adventure. With this tool, we can know which assistant generates easier-to-understand questions and which one generates more creative questions.

So, next time you face a challenge in the game, don't forget to use CompCQ to choose the best assistant to help you complete the task and find all the treasures!

Glossary

Competency Questions

Competency questions are natural language questions used in ontology engineering for requirement elicitation, helping define the scope and functionality of an ontology.

In this paper, competency questions are used to evaluate the quality and applicability of LLM-generated questions.

Generative AI

Generative AI is a type of artificial intelligence technology capable of automatically generating content, widely used in text generation, image generation, and other fields.

In this paper, generative AI is used to automatically generate competency questions.

Large Language Models (LLMs)

Large language models are deep learning-based natural language processing models with large-scale parameters and powerful generation capabilities.

In this paper, LLMs are used to generate and compare competency questions across different domains.

CompCQ Framework

CompCQ is a multi-dimensional framework for comparing LLM-generated competency questions, quantifying linguistic, syntactic, and semantic features.

The paper introduces the CompCQ framework to systematically analyze the complexity and readability of LLM-generated questions.

Readability

Readability refers to the ease with which a text can be read and understood, often evaluated using metrics like the Flesch-Kincaid Grade Level.

In this paper, readability is used to assess the difficulty of understanding LLM-generated questions.

Structural Complexity

Structural complexity refers to the syntactic and semantic complexity of a text, affecting its difficulty in understanding and processing.

In this paper, structural complexity is used to analyze the complexity of LLM-generated questions.

Semantic Diversity

Semantic diversity refers to the diversity and coverage of a text in terms of semantics, affecting its richness of information.

In this paper, semantic diversity is used to evaluate the diversity of LLM-generated questions.
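
The paper's own diversity measure is not reproduced here; as a rough, self-contained stand-in for a question-set diversity score, mean pairwise lexical distance over bag-of-words representations can be computed as follows (a crude lexical proxy, not the paper's semantic metric):

```python
from itertools import combinations

def lexical_diversity(questions):
    """Mean pairwise Jaccard distance between bag-of-words sets.
    A lexical proxy for semantic diversity; real measures would use
    sentence embeddings instead of surface tokens."""
    bags = [set(q.lower().split()) for q in questions]
    pairs = list(combinations(bags, 2))
    return sum(1 - len(a & b) / len(a | b) for a, b in pairs) / len(pairs)

qs = [
    "What artworks does the museum hold?",
    "Who created each artwork in the collection?",
    "What artworks does the museum hold?",  # duplicate lowers diversity
]
print(round(lexical_diversity(qs), 2))
```

A model that rephrases the same requirement repeatedly scores low on such a measure even when each individual question is well formed, which is why diversity is tracked separately from readability.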

Open Models

Open models are LLMs that can be freely accessed and used, typically offering higher diversity and flexibility.

In this paper, open models are used to generate and compare competency questions across different domains.

Closed Models

Closed models are LLMs controlled by specific companies or organizations, typically offering higher stability and consistency.

In this paper, closed models are used to generate and compare competency questions across different domains.

Flesch-Kincaid Grade Level

The Flesch-Kincaid Grade Level is a metric for assessing text readability, indicating the years of education required to understand the text.

In this paper, the Flesch-Kincaid Grade Level is used to evaluate the readability of LLM-generated questions.
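
For reference, FKGL has a standard published formula; the sketch below implements it directly (the example inputs are illustrative, not drawn from the paper's data):

```python
def fkgl(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level: approximate years of education
    needed to understand the text (standard published coefficients)."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

# One 20-word sentence averaging two syllables per word yields a high
# grade level, as with long, polysyllabic competency questions.
print(round(fkgl(20, 1, 40), 2))
```

Long sentences and polysyllabic vocabulary both push the score up, which is why complex open-model questions in technical domains can reach an FKGL of 21.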

Open Questions: Unanswered questions from this research

  • Despite providing a systematic method for comparing LLM-generated competency questions, the CompCQ framework faces challenges in complex domains, where open models exhibit significant increases in question complexity, potentially leading to comprehension difficulties. Future research could explore optimizing the framework to reduce this complexity.
  • Some models generate fewer questions in specific domains, possibly failing to cover the full requirements. This indicates a need for further research on how to improve LLMs' generation coverage and diversity.
  • Closed models, while stable, lack diversity. Future research could explore how to enhance diversity while maintaining stability.
  • Current research mainly focuses on text generation; future work could explore applying the CompCQ framework to other generation tasks, such as image and audio generation.
  • While LLMs perform well in generating competency questions, their performance in handling multilingual and cross-cultural requirements needs further investigation.

Applications

Immediate Applications

Ontology Engineering Requirement Elicitation

By automatically generating high-quality competency questions, it significantly reduces human intervention and improves efficiency in ontology engineering.

Knowledge Graph Construction

Generative AI-generated competency questions can guide the construction and validation of knowledge graphs, ensuring semantic completeness.

Semantic Search Optimization

By generating relevant competency questions, semantic search accuracy and relevance can be improved, providing users with more precise search results.

Long-term Vision

Cross-Domain Knowledge Integration

By generating competency questions, integration and sharing of knowledge across different domains can be achieved, promoting interdisciplinary collaboration and innovation.

Intelligent Assistant Development

In the future, intelligent assistants based on generative AI can be developed to help users quickly access needed information in various scenarios, improving work and life efficiency.

Abstract

Competency Questions (CQs) are a cornerstone of requirement elicitation in ontology engineering. CQs represent requirements as a set of natural language questions that an ontology should satisfy; they are traditionally modelled by ontology engineers together with domain experts as part of a human-centred, manual elicitation process. The use of Generative AI automates CQ creation at scale, therefore democratising the process of generation, widening stakeholder engagement, and ultimately broadening access to ontology engineering. However, given the large and heterogeneous landscape of LLMs, varying in dimensions such as parameter scale, task and domain specialisation, and accessibility, it is crucial to characterise and understand the intrinsic, observable properties of the CQs they produce (e.g., readability, structural complexity) through a systematic, cross-domain analysis. This paper introduces a set of quantitative measures for the systematic comparison of CQs across multiple dimensions. Using CQs generated from well defined use cases and scenarios, we identify their salient properties, including readability, relevance with respect to the input text and structural complexity of the generated questions. We conduct our experiments over a set of use cases and requirements using a range of LLMs, including both open (KimiK2-1T, LLama3.1-8B, LLama3.2-3B) and closed models (Gemini 2.5 Pro, GPT 4.1). Our analysis demonstrates that LLM performance reflects distinct generation profiles shaped by the use case.
