An Extreme Multi-label Text Classification (XMTC) Library Dataset: What if we took "Use of Practical AI in Digital Libraries" seriously?
We release a large bilingual library dataset for GND-based multi-label classification.
Key Findings
Methodology
The study introduces a resource for GND-based multi-label classification, built from a large bilingual (English/German) library catalog dataset. The resource has three parts: 1) catalog records with GND subject annotations; 2) a machine-actionable GND taxonomy; 3) predefined train/dev/test splits. Together these support ontology-aware multi-label classification, mapping text to authority terms, and agent-assisted cataloging.
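To make the three components concrete, the following minimal Python sketch models a catalog record, a taxonomy concept, and a multi-hot label encoding. The field names (`record_id`, `gnd_ids`, `broader`) and the `toy:` identifiers are hypothetical and chosen for illustration; the released files may use a different schema.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogRecord:
    """One catalog record with its subject annotations (hypothetical schema)."""
    record_id: str
    language: str  # "en" or "de"
    title: str
    gnd_ids: list[str] = field(default_factory=list)


@dataclass
class GndConcept:
    """One taxonomy entry; `broader` lists parent concept identifiers."""
    gnd_id: str
    pref_label: str
    broader: list[str] = field(default_factory=list)


def build_label_index(taxonomy):
    """Assign each concept a stable column position, ordered by identifier."""
    ordered = sorted(taxonomy, key=lambda c: c.gnd_id)
    return {concept.gnd_id: i for i, concept in enumerate(ordered)}


def multi_hot(record, label_index):
    """Encode a record's subjects as a 0/1 vector over the taxonomy."""
    vec = [0] * len(label_index)
    for gnd_id in record.gnd_ids:
        if gnd_id in label_index:  # ignore subjects missing from the taxonomy
            vec[label_index[gnd_id]] = 1
    return vec


# Toy data; identifiers are placeholders, not real GND identifiers.
taxonomy = [
    GndConcept("toy:002", "Informatik"),
    GndConcept("toy:001", "Maschinelles Lernen", broader=["toy:002"]),
    GndConcept("toy:003", "Bibliothekskatalog"),
]
record = CatalogRecord("rec-1", "en", "Machine learning for catalogs",
                       gnd_ids=["toy:001", "toy:003"])
label_index = build_label_index(taxonomy)
encoded = multi_hot(record, label_index)
```

With tens of thousands of candidate subjects, a dense vector like this becomes impractical, so a sparse representation would be the more likely choice in practice.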
Key Results
- Result 1: The dataset comprises 136,569 records across multiple domains, providing rich subject annotations for multilingual consistency studies.
- Result 2: The paper provides a brief statistical profile of the dataset together with initial analyses and baselines, positioned as a statistical exploration to inform how machine-learning solutions are framed.
- Result 3: Qualitative error analyses of three systems reveal challenges in handling long-tail subjects, multilingual variations, and cross-domain distribution shifts.
Significance
This research gives library science a resource for multilingual, multi-domain subject indexing. Because every label links to the GND taxonomy, predictions can be traced back to an authority term, which supports interpretability and transparency. This matters for libraries that need to process large volumes of records in more than one language.
Technical Contribution
Technical contributions include: 1) providing a large-scale, bilingual library catalog dataset; 2) developing a machine-actionable GND taxonomy; 3) offering reproducible evaluation protocols for ontology-based multi-label classification. These contributions open new possibilities for automation and intelligence in library science.
Novelty
The novelty lies in combining authoritative GND subject annotations with a machine-actionable taxonomy at scale, so that machine learning methods can map text to authority terms and be evaluated against them.
Limitations
- Limitation 1: The dataset primarily relies on German GND subjects, which may limit applicability for non-German users.
- Limitation 2: The sparsity of long-tail subjects may affect classifier performance.
- Limitation 3: Current systems may face challenges in handling multilingual variations.
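The long-tail sparsity named in Limitation 2 can be quantified with a simple frequency profile. The sketch below is an illustration on toy label sets, not the paper's analysis; the function names and the `max_count` threshold are assumptions.

```python
from collections import Counter


def label_frequencies(label_sets):
    """Count in how many records each subject label occurs."""
    counts = Counter()
    for labels in label_sets:
        counts.update(set(labels))  # count a label once per record
    return counts


def tail_share(counts, max_count=1):
    """Fraction of distinct labels that occur at most `max_count` times."""
    if not counts:
        return 0.0
    tail = sum(1 for c in counts.values() if c <= max_count)
    return tail / len(counts)


# Toy annotations: label "a" is frequent, "c" and "d" occur only once.
toy_label_sets = [["a", "b"], ["a", "c"], ["a", "d"], ["b"]]
freqs = label_frequencies(toy_label_sets)
share = tail_share(freqs, max_count=1)
```

A high tail share means that many subjects have too few training examples for a classifier to learn them reliably.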
Future Work
Future research directions include: 1) expanding the dataset to support more languages; 2) developing more robust models to handle long-tail subjects; 3) exploring more efficient multilingual consistency and cross-domain adaptability methods.
AI Executive Summary
Subject indexing is crucial for discovery in libraries but challenging to maintain at scale and across languages. We release a large bilingual (English/German) library catalog dataset annotated with the German Integrated Authority File (GND), along with a machine-actionable GND taxonomy. This resource enables ontology-aware multi-label classification, mapping text to authority terms, and agent-assisted cataloging with reproducible, authority-grounded evaluation.
Libraries have long relied on expert subject indexing to make collections findable, interoperable, and durable. Yet the rapidly growing, multilingual volume of library catalog records increasingly strains purely manual indexing workflows. At the same time, large language models (LLMs) and emerging agentic pipelines promise support—but they must be grounded in authoritative vocabularies, auditable, and evaluated in library terms rather than by generic text-classification scores. We present a machine-learning-ready resource that directly addresses this gap: a bilingual (English/German), multi-domain corpus of catalog records indexed with subjects from the German Integrated Authority File (Gemeinsame Normdatei, GND), released together with a machine-actionable version of the GND subject taxonomy and predefined train/dev/test splits. The goal is not merely scale, but structured scale—where every prediction links to a controlled vocabulary that libraries already trust.
This resource is designed to help the community interrogate practical questions that matter for library science in the LLM era: how can systems align free text to controlled vocabularies while preserving provenance and authority control? What counts as “useful” assistance—top-k quality at the point of description, hierarchical coherence, explainable rationales, or cataloger effort saved? How can models cope with long-tail subjects, multilingual variation, and distribution shift across domains and time? Where do agents best fit in human-in-the-loop workflows (triage, suggestion, validation)?
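One way to make "top-k quality" measurable is with standard ranked-retrieval metrics. The sketch below implements precision@k, recall@k, and binary-relevance nDCG@k for a single record; it illustrates common practice and is not necessarily the evaluation protocol used with this dataset. The subject identifiers are placeholders.

```python
import math


def precision_at_k(ranked, gold, k):
    """Share of the top-k suggestions that are correct subjects."""
    if k <= 0:
        return 0.0
    return sum(1 for term in ranked[:k] if term in gold) / k


def recall_at_k(ranked, gold, k):
    """Share of the correct subjects found within the top-k suggestions."""
    if not gold:
        return 0.0
    return sum(1 for term in ranked[:k] if term in gold) / len(gold)


def ndcg_at_k(ranked, gold, k):
    """Binary-relevance nDCG: correct subjects count more near the top."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, term in enumerate(ranked[:k]) if term in gold)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(gold))))
    return dcg / ideal if ideal > 0 else 0.0


# Toy example: a system's ranked suggestions against a cataloger's subjects.
ranked = ["t1", "t2", "t3", "t4"]
gold = {"t1", "t3"}
p3 = precision_at_k(ranked, gold, 3)
r3 = recall_at_k(ranked, gold, 3)
n3 = ndcg_at_k(ranked, gold, 3)
```

These scores capture ranking quality only. Hierarchical coherence, explainable rationales, and cataloger effort saved would each need their own measures.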
By providing an operational taxonomy, the dataset enables studies of vocabulary grounding, cross-lingual consistency, polysemy and variant labels, and reliability under realistic label sparsity—questions that generic XMTC benchmarks only partially surface. At a high level, our contribution pairs real catalog records with stable links to authoritative subject concepts and packages them for reproducible evaluation. This enables ontology-aware multi-label classification, retrieval-augmented mapping from free text to authority terms, and agent workflows that combine retrieval, suggestion, and curator feedback—evaluated with protocols that reflect cataloging realities (e.g., usefulness and hierarchical consistency at the top of the record).
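As a simple illustration of mapping free text to authority terms through preferred and variant labels, the sketch below performs exact whole-phrase matching against a toy authority table. It is a lexical baseline written for this summary, not one of the systems analyzed in the paper; the table layout, identifiers, and labels are hypothetical.

```python
import re

# Toy authority table: a preferred label plus variant labels per concept.
TOY_AUTHORITY = {
    "toy:001": {"pref": "Maschinelles Lernen", "variants": ["machine learning"]},
    "toy:002": {"pref": "Bibliothekskatalog", "variants": ["library catalog"]},
    "toy:003": {"pref": "Chemie", "variants": ["chemistry"]},
}


def normalize(text):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    no_punct = re.sub(r"[^\w\s]", " ", text.lower())
    return re.sub(r"\s+", " ", no_punct).strip()


def match_terms(text, authority):
    """Return {concept_id: matched_label} for labels found as whole phrases."""
    padded = f" {normalize(text)} "
    hits = {}
    for concept_id, entry in authority.items():
        for label in [entry["pref"], *entry["variants"]]:
            if f" {normalize(label)} " in padded:
                hits[concept_id] = label  # keep the first label that matched
                break
    return hits


matches = match_terms("Machine learning methods for the library catalog.",
                      TOY_AUTHORITY)
```

Exact matching of this kind misses inflected forms such as plurals and cannot disambiguate polysemous labels, which is where embedding-based retrieval and curator feedback become relevant.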
We outline the resource, its construction and splits, and initial analyses and baselines. We position the paper as a statistical exploration that surfaces considerations for framing machine-learning solutions, and we conclude with qualitative error analyses of three systems developed on our data. We invite the LREC community to test, compare, and explore what successful, trustworthy AI assistance for subject indexing should look like.
Deep Dive
Abstract
Subject indexing is vital for discovery but hard to sustain at scale and across languages. We release a large bilingual (English/German) corpus of catalog records annotated with the Integrated Authority File (GND), plus a machine-actionable GND taxonomy. The resource enables ontology-aware multi-label classification, mapping text to authority terms, and agent-assisted cataloging with reproducible, authority-grounded evaluation. We provide a brief statistical profile and qualitative error analyses of three systems. We invite the community to assess not only accuracy but usefulness and transparency, toward authority-anchored AI co-pilots that amplify catalogers' work.