An Extreme Multi-label Text Classification (XMTC) Library Dataset: What if we took "Use of Practical AI in Digital Libraries" seriously?

TL;DR

We release a large bilingual library dataset for GND-based multi-label classification.

cs.CL, 2026-03-11

Jennifer D'Souza, Sameer Sadruddin, Maximilian Kähler, Andrea Salfinger, Luca Zaccagna, Francesca Incitti, Lauro Snidaro, Osma Suominen

multi-label classification, library science, GND, bilingual dataset, AI

Key Findings

Methodology

The study introduces a resource for GND-based multi-label classification built on a large bilingual (English/German) library catalog dataset. The resource has three parts: 1) catalog records annotated with GND subjects; 2) a machine-actionable version of the GND subject taxonomy; 3) predefined train/dev/test splits. Together these support ontology-aware multi-label classification, mapping free text to authority terms, and agent-assisted cataloging.
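To make the three parts concrete, the following is a minimal sketch of how a catalog record with GND subject annotations might be turned into a multi-hot target vector. The `Record` fields, the `gnd:` identifiers, and the helper names are assumptions made for illustration; they are not the dataset's actual schema.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """Hypothetical catalog record; field names are illustrative only."""
    title: str
    language: str        # "en" or "de"
    gnd_ids: list[str]   # made-up identifiers standing in for GND subjects


def build_label_index(records):
    """Map every subject seen in the given records to a column index."""
    labels = sorted({g for r in records for g in r.gnd_ids})
    return {g: i for i, g in enumerate(labels)}


def multi_hot(record, label_index):
    """Encode a record's subjects as a 0/1 vector over the label index."""
    vec = [0] * len(label_index)
    for g in record.gnd_ids:
        if g in label_index:  # subjects unseen in training are skipped
            vec[label_index[g]] = 1
    return vec


# Toy "train split" with two records, one per language.
train = [
    Record("Einführung in die Statistik", "de", ["gnd:A", "gnd:B"]),
    Record("Deep learning for text", "en", ["gnd:C"]),
]
label_index = build_label_index(train)
vectors = [multi_hot(r, label_index) for r in train]
```

Building the label index from the training split only is a deliberate choice in this sketch: subjects that first appear in dev or test then surface as missing columns, which is one way the long-tail problem becomes visible.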

Key Results

Result 1: The dataset comprises 136,569 records across multiple domains, providing rich subject annotations for multilingual consistency studies.
Result 2: The paper gives a brief statistical profile of the dataset together with initial analyses and baselines, and positions itself as a statistical exploration meant to surface considerations for framing machine-learning solutions.
Result 3: Qualitative error analyses of three systems reveal challenges in handling long-tail subjects, multilingual variations, and cross-domain distribution shifts.

Significance

This research provides a resource for library science that supports multilingual, multi-domain subject indexing. Because every prediction links to a GND concept, a controlled vocabulary that libraries already trust, system output can be evaluated for usefulness and transparency as well as accuracy. This matters for libraries that must process large, multilingual volumes of catalog records.

Technical Contribution

Technical contributions include: 1) a large-scale, bilingual library catalog dataset; 2) a machine-actionable version of the GND subject taxonomy; 3) predefined splits and authority-grounded evaluation, so that ontology-based multi-label classification can be assessed reproducibly. Together they are intended to support automated and agent-assisted subject indexing.

Novelty

The novelty lies in pairing real catalog records with stable links to authoritative GND subject concepts and packaging them for reproducible evaluation at large scale. This lets machine-learning methods for text-to-term mapping be studied against a controlled vocabulary that libraries already use.

Limitations

Limitation 1: The dataset primarily relies on German GND subjects, which may limit applicability for non-German users.
Limitation 2: The sparsity of long-tail subjects may affect classifier performance.
Limitation 3: Current systems may face challenges in handling multilingual variations.
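The long-tail limitation can be checked directly on any annotated split. The sketch below counts how often each subject occurs and reports the share of subjects seen at most `max_count` times; the function names and the toy labels are hypothetical, not taken from the paper.

```python
from collections import Counter


def label_frequencies(label_sets):
    """Count how many records each subject label is assigned to."""
    return Counter(label for labels in label_sets for label in labels)


def tail_share(freqs, max_count=1):
    """Fraction of distinct labels that occur at most `max_count` times."""
    if not freqs:
        return 0.0
    rare = sum(1 for count in freqs.values() if count <= max_count)
    return rare / len(freqs)


# Toy annotations: one frequent subject and three that occur once.
toy_label_sets = [["a", "b"], ["a"], ["a", "c"], ["d"]]
freqs = label_frequencies(toy_label_sets)
share = tail_share(freqs, max_count=1)
```

In this toy case three of the four distinct subjects occur only once, so a classifier has a single training example for most of its label space.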

Future Work

Future research directions include: 1) expanding the dataset to support more languages; 2) developing models that are more robust on long-tail subjects; 3) exploring methods that improve cross-lingual consistency and cross-domain adaptability.

AI Executive Summary

Subject indexing is crucial for discovery in libraries but challenging to maintain at scale and across languages. We release a large bilingual (English/German) library catalog dataset annotated with the German Integrated Authority File (GND), along with a machine-actionable GND taxonomy. This resource enables ontology-aware multi-label classification, mapping text to authority terms, and agent-assisted cataloging with reproducible, authority-grounded evaluation.

Libraries have long relied on expert subject indexing to make collections findable, interoperable, and durable. Yet the rapidly growing, multilingual volume of library catalog records increasingly strains purely manual indexing workflows. At the same time, large language models (LLMs) and emerging agentic pipelines promise support—but they must be grounded in authoritative vocabularies, auditable, and evaluated in library terms rather than by generic text-classification scores. We present a machine-learning-ready resource that directly addresses this gap: a bilingual (English/German), multi-domain corpus of catalog records indexed with subjects from the German Integrated Authority File (Gemeinsame Normdatei, GND), released together with a machine-actionable version of the GND subject taxonomy and predefined train/dev/test splits. The goal is not merely scale, but structured scale—where every prediction links to a controlled vocabulary that libraries already trust.

This resource is designed to help the community interrogate practical questions that matter for library science in the LLM era: how can systems align free text to controlled vocabularies while preserving provenance and authority control? What counts as “useful” assistance—top-k quality at the point of description, hierarchical coherence, explainable rationales, or cataloger effort saved? How can models cope with long-tail subjects, multilingual variation, and distribution shift across domains and time? Where do agents best fit in human-in-the-loop workflows (triage, suggestion, validation)?
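One of the questions above, top-k quality at the point of description, is commonly measured with precision@k, recall@k, and nDCG@k. The sketch below implements these with binary relevance. It assumes a system returns a ranked list of subject identifiers for one record; it is not the paper's evaluation protocol.

```python
import math


def precision_at_k(ranked, gold, k):
    """Share of the top-k suggestions that are correct subjects."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for s in ranked[:k] if s in gold) / k


def recall_at_k(ranked, gold, k):
    """Share of the gold subjects that appear in the top-k suggestions."""
    if not gold:
        return 0.0
    return sum(1 for s in ranked[:k] if s in gold) / len(gold)


def ndcg_at_k(ranked, gold, k):
    """Binary-relevance nDCG@k: hits are discounted by log2 of their rank."""
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, s in enumerate(ranked[:k])
        if s in gold
    )
    # Ideal ranking places all gold subjects (up to k) at the top.
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(gold))))
    return dcg / ideal if ideal else 0.0


# Hypothetical system output and gold annotation for one record.
ranked = ["s1", "s2", "s3", "s4"]
gold = {"s1", "s3", "s9"}
p3 = precision_at_k(ranked, gold, 3)
r3 = recall_at_k(ranked, gold, 3)
n3 = ndcg_at_k(ranked, gold, 3)
```

Here two of the top three suggestions are correct and two of the three gold subjects are found, so precision@3 and recall@3 are both 2/3, while nDCG@3 additionally penalizes the second hit for appearing at rank 3 instead of rank 2.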

By providing an operational taxonomy, the dataset enables studies of vocabulary grounding, cross-lingual consistency, polysemy and variant labels, and reliability under realistic label sparsity—questions that generic XMTC benchmarks only partially surface. At a high level, our contribution pairs real catalog records with stable links to authoritative subject concepts and packages them for reproducible evaluation. This enables ontology-aware multi-label classification, retrieval-augmented mapping from free text to authority terms, and agent workflows that combine retrieval, suggestion, and curator feedback—evaluated with protocols that reflect cataloging realities (e.g., usefulness and hierarchical consistency at the top of the record).
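As a rough illustration of what a machine-actionable taxonomy makes possible, the sketch below maps free-text terms to concepts through preferred and variant labels, and flags predictions that contain both a concept and one of its broader concepts. The dictionary layout, the concept identifiers `c1` to `c3`, and the single-parent `broader` link are simplifying assumptions; the released taxonomy may be structured differently.

```python
# Toy taxonomy: identifiers and structure are invented for illustration.
TAXONOMY = {
    "c1": {"label": "Informatik",
           "variants": ["Computer science"], "broader": None},
    "c2": {"label": "Maschinelles Lernen",
           "variants": ["Machine learning"], "broader": "c1"},
    "c3": {"label": "Neuronales Netz",
           "variants": ["Neural network", "Neural net"], "broader": "c2"},
}


def build_lookup(taxonomy):
    """Index preferred and variant labels (case-insensitive) by concept id."""
    lookup = {}
    for cid, concept in taxonomy.items():
        for name in [concept["label"], *concept["variants"]]:
            lookup[name.casefold()] = cid
    return lookup


def ancestors(taxonomy, cid):
    """Follow `broader` links upward and return the chain of ancestor ids."""
    chain = []
    parent = taxonomy[cid]["broader"]
    while parent is not None:
        chain.append(parent)
        parent = taxonomy[parent]["broader"]
    return chain


def redundant_ancestors(taxonomy, predicted):
    """Predicted concepts that are also ancestors of another prediction."""
    pred = set(predicted)
    return {a for cid in pred for a in ancestors(taxonomy, cid) if a in pred}


lookup = build_lookup(TAXONOMY)
mapped = lookup.get("machine learning".casefold())
chain = ancestors(TAXONOMY, "c3")
overlap = redundant_ancestors(TAXONOMY, ["c3", "c1"])
```

Exact label matching like this is only a baseline for text-to-term mapping. It does, however, show why variant labels in both languages matter: the English query resolves to the same concept as its German preferred label.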

We outline the resource, its construction and splits, and initial analyses and baselines. We position the paper as a statistical exploration that surfaces considerations for framing machine-learning solutions. We conclude with qualitative error analyses of three systems developed on our data, and we invite the LREC community to test, compare, and explore what successful, trustworthy AI assistance for subject indexing should look like.

Deep Dive

Abstract

Subject indexing is vital for discovery but hard to sustain at scale and across languages. We release a large bilingual (English/German) corpus of catalog records annotated with the Integrated Authority File (GND), plus a machine-actionable GND taxonomy. The resource enables ontology-aware multi-label classification, mapping text to authority terms, and agent-assisted cataloging with reproducible, authority-grounded evaluation. We provide a brief statistical profile and qualitative error analyses of three systems. We invite the community to assess not only accuracy but usefulness and transparency, toward authority-anchored AI co-pilots that amplify catalogers' work.

cs.CL cs.AI cs.DL cs.IR
