MADE: A Living Benchmark for Multi-Label Text Classification with Uncertainty Quantification of Medical Device Adverse Events

TL;DR

MADE is a living, continuously updated benchmark for multi-label text classification of medical device adverse event reports, paired with a systematic evaluation of uncertainty quantification methods.

cs.CL · Advanced · 2026-04-17
Raunak Agarwal Markus Wenzel Simon Baur Jonas Zimmer George Harvey Jackie Ma
multi-label classification uncertainty quantification medical devices machine learning dataset

Key Findings

Methodology

This paper introduces MADE, a dynamic benchmark for multi-label text classification, particularly for medical device adverse event reports. The core methodology involves using over 20 encoder and decoder models for fine-tuning and few-shot learning. Uncertainty quantification methods based on entropy and consistency are systematically evaluated. The MADE dataset features a long-tailed distribution of hierarchical labels and enables reproducible evaluation through strict temporal splits.
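The strict temporal splits can be sketched as follows. This is a minimal illustration of the contamination-avoidance idea, not the actual MADE pipeline; the field names (`date_received`, `text`, `labels`) and cutoff are assumptions for the example.

```python
from datetime import date

# Hypothetical report records; the schema here is illustrative only.
reports = [
    {"date_received": date(2022, 3, 1), "text": "Pump alarm failure", "labels": ["alarm", "pump"]},
    {"date_received": date(2023, 6, 9), "text": "Lead fracture", "labels": ["lead"]},
    {"date_received": date(2024, 1, 5), "text": "Battery depletion", "labels": ["battery"]},
]

def temporal_split(reports, cutoff):
    """Train on reports published before the cutoff, test on those at or after it.

    Reports newer than every pretraining corpus cannot have leaked into one,
    which is the contamination-avoidance rationale behind a living benchmark.
    """
    train = [r for r in reports if r["date_received"] < cutoff]
    test = [r for r in reports if r["date_received"] >= cutoff]
    return train, test

train, test = temporal_split(reports, cutoff=date(2023, 1, 1))
```

Because the split is by publication date rather than random sampling, re-running the split on an updated snapshot is reproducible given the same cutoff.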

Key Results

  • Result 1: Smaller discriminatively fine-tuned decoders achieve the strongest head-to-tail accuracy while maintaining competitive uncertainty quantification capabilities.
  • Result 2: Generative fine-tuning delivers the most reliable uncertainty quantification, especially excelling on rare labels.
  • Result 3: Large reasoning models improve performance on rare labels yet exhibit surprisingly weak uncertainty quantification.

Significance

This research provides an uncontaminated benchmark for multi-label text classification in the medical field, addressing the saturation and data contamination issues of existing benchmarks. By introducing a dynamically updated dataset, MADE offers a continuous evaluation platform for future research, enabling the testing of models' generalization capabilities on fresh data.

Technical Contribution

Technically, MADE overcomes the limitations of traditional static datasets by introducing a dynamically updated benchmark. It provides a long-tailed distribution of hierarchical labels and ensures reproducible evaluation through strict temporal splits. Additionally, it systematically evaluates various uncertainty quantification methods, offering practical guidance for future research.

Novelty

MADE is the first dynamic multi-label text classification benchmark focused on medical device adverse events. Unlike existing static datasets, it avoids data contamination through continuous updates and provides a more challenging evaluation environment.

Limitations

  • Limitation 1: Although MADE provides a dynamically updated dataset, its hierarchical label structure may make complex label dependencies difficult for models to handle.
  • Limitation 2: Uncertainty quantification on rare labels still needs further improvement.

Future Work

Future research can explore ways to further improve model generalization on the MADE dataset, especially for rare labels and complex label dependencies. Research can also focus on better combining entropy-based and consistency-based uncertainty quantification methods to improve model reliability.

AI Executive Summary

In high-stakes domains like healthcare, machine learning models require not only strong predictive performance but also reliable uncertainty quantification (UQ) to support human oversight. Multi-label text classification (MLTC) is a central task in this domain, yet remains challenging due to label imbalances, dependencies, and combinatorial complexity.

Existing MLTC benchmarks are increasingly saturated and may be affected by training data contamination, making it difficult to distinguish genuine reasoning capabilities from memorization. We introduce MADE, a living MLTC benchmark derived from medical device adverse event reports and continuously updated with newly published reports to prevent contamination. MADE features a long-tailed distribution of hierarchical labels and enables reproducible evaluation with strict temporal splits.

We establish baselines across more than 20 encoder- and decoder-only models under fine-tuning and few-shot settings (instruction-tuned/reasoning variants, local/API-accessible). We systematically assess entropy-/consistency-based and self-verbalized UQ methods. Results show clear trade-offs: smaller discriminatively fine-tuned decoders achieve the strongest head-to-tail accuracy while maintaining competitive UQ; generative fine-tuning delivers the most reliable UQ; large reasoning models improve performance on rare labels yet exhibit surprisingly weak UQ; and self-verbalized confidence is not a reliable proxy for uncertainty.

Our work is publicly available, providing an uncontaminated benchmark and comprehensive baselines for future research. Through MADE, researchers can test models' generalization capabilities on continuously updated data, avoiding potential leakage of test data into the pretraining corpora of future foundation models.

In summary, MADE not only provides a dynamic evaluation platform for multi-label text classification but also offers practical guidance on model selection and UQ strategies. Future research can build on this foundation to further explore improving model performance on complex label dependencies and rare labels.

Deep Analysis

Background

Multi-label text classification (MLTC) is crucial in the medical field for tasks such as patient categorization, clinical coding, and incident reporting. However, MLTC faces challenges like label imbalances, dependencies, and combinatorial complexity. Traditional MLTC benchmarks are increasingly saturated and affected by data contamination, making it difficult to evaluate the true capabilities of large language models (LLMs). Existing datasets are often static and may be included in LLM pre-training corpora, leading to data contamination. Additionally, the imbalance and interdependence of labels make models prone to bias toward frequent classes while ignoring rare but safety-critical conditions.

Core Problem

The core problem in MLTC is selecting multiple labels from a typically much larger set, leading to a combinatorial problem that scales exponentially with the label space size. Real-world MLTC data is characterized by severe inter- and intra-class imbalances: a few common conditions comprise the majority of examples, while safety-critical conditions reside in the long tail. Models must learn to disentangle correlated signatures without becoming biased toward frequent classes. Further, labels often co-occur and are hierarchically interdependent, violating the assumption of label independence.
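The scaling and imbalance described above can be made concrete with a toy example (not MADE data; the label names are invented for illustration):

```python
from collections import Counter

# With L candidate labels, the number of possible label sets is 2**L,
# so predicting the exact set scales exponentially in the label space.
L = 20
combinations = 2 ** L  # 1,048,576 possible label sets for just 20 labels

# Long-tailed label frequencies: a few head labels dominate the corpus,
# while safety-critical labels sit in the rare tail.
annotations = [
    ["infusion_pump"],
    ["infusion_pump", "alarm"],
    ["infusion_pump"],
    ["alarm"],
    ["lead_fracture"],  # rare but safety-critical tail label
]
counts = Counter(label for labels in annotations for label in labels)
```

A model that optimizes average accuracy on such data can score well while never predicting `lead_fracture`, which is why head-to-tail evaluation matters.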

Innovation

MADE's core innovations include:

1. Dynamically updated dataset: By continuously introducing new reports, it avoids data contamination issues.

2. Long-tailed distribution of hierarchical labels: Provides a more challenging evaluation environment.

3. Strict temporal splits: Ensures reproducible evaluation.

4. Systematic evaluation of uncertainty quantification methods: Offers practical guidance for future research.

Methodology

  • Dataset Construction: Extract data from the FDA's medical device adverse event reports to create a multi-label text classification dataset with hierarchical labels.
  • Model Selection: Use over 20 encoder and decoder models for fine-tuning and few-shot learning.
  • Uncertainty Quantification: Evaluate entropy- and consistency-based methods, as well as self-verbalized uncertainty quantification.
  • Experimental Design: Conduct systematic evaluations under fine-tuning and few-shot settings, comparing the performance of different models and uncertainty quantification methods.
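A minimal sketch of the two UQ families evaluated above, under the assumption of per-label sigmoid probabilities and repeated stochastic generations; the exact scoring rules used by the paper's baselines may differ.

```python
import math
from collections import Counter

def binary_entropy(p):
    """Entropy-based UQ for one label: the entropy of a Bernoulli
    prediction, maximal at p = 0.5 (the model is most uncertain)."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def consistency(sampled_label_sets):
    """Consistency-based UQ: the fraction of repeated stochastic
    generations that agree on the modal label set. Low agreement
    signals high uncertainty."""
    votes = Counter(frozenset(s) for s in sampled_label_sets)
    return votes.most_common(1)[0][1] / len(sampled_label_sets)
```

For example, `binary_entropy(0.5)` is 1.0 bit (maximally uncertain), while three generations that produce `{"alarm"}`, `{"alarm"}`, `{"alarm", "pump"}` give a consistency of 2/3.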

Experiments

The experimental design involves using the FDA's medical device adverse event report dataset for fine-tuning and few-shot learning. Baseline models include over 20 encoder and decoder models, with evaluation metrics such as macro F1, Jaccard index, and uncertainty quantification metrics (PRR, Spearman correlation, ECE+). The experiments also include ablation studies to evaluate the effectiveness of different uncertainty quantification methods.
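Two of the reported metrics can be sketched for the multi-label setting; the gold/predicted label sets below are invented for illustration.

```python
def macro_f1(gold, pred, labels):
    """Macro F1: per-label F1 averaged with equal weight, so rare (tail)
    labels count exactly as much as frequent (head) ones."""
    scores = []
    for lab in labels:
        tp = sum(1 for g, p in zip(gold, pred) if lab in g and lab in p)
        fp = sum(1 for g, p in zip(gold, pred) if lab not in g and lab in p)
        fn = sum(1 for g, p in zip(gold, pred) if lab in g and lab not in p)
        scores.append(2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 0.0)
    return sum(scores) / len(scores)

def jaccard(gold, pred):
    """Example-based Jaccard index: |intersection| / |union| of the gold
    and predicted label sets, averaged over examples."""
    return sum(len(g & p) / len(g | p) if g | p else 1.0
               for g, p in zip(gold, pred)) / len(gold)

gold = [{"a", "b"}, {"c"}]
pred = [{"a"}, {"c", "b"}]
```

On this toy pair, macro F1 is 2/3 (label "b" is never correctly predicted) and Jaccard is 0.5, showing how macro averaging punishes missed tail labels.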

Results

Experimental results show that smaller discriminatively fine-tuned decoders achieve the strongest head-to-tail accuracy while maintaining competitive uncertainty quantification capabilities. Generative fine-tuning delivers the most reliable uncertainty quantification, especially excelling on rare labels. Large reasoning models improve performance on rare labels yet exhibit surprisingly weak uncertainty quantification. Self-verbalized confidence is not a reliable proxy for uncertainty.

Applications

MADE can be applied directly to automated reporting and classification systems for medical device adverse events, helping medical institutions monitor and manage device safety issues more effectively. Its dynamically updated dataset and uncertainty quantification methods can also transfer to multi-label text classification tasks in other high-risk domains.

Limitations & Outlook

Although MADE provides a dynamically updated dataset, its hierarchical label structure may make complex label dependencies difficult for models to handle. In addition, uncertainty quantification on rare labels still needs further improvement. Future research can explore better ways to combine entropy-based and consistency-based uncertainty quantification methods to improve model reliability.

Plain Language (accessible to non-experts)

Imagine a library with many books, each having multiple tags like 'Science Fiction', 'Mystery', 'Bestseller', etc. Our task is to assign the right tags to each new book. The problem is, some tags are very common, like 'Bestseller', while others are rare, like 'Sci-Fi Mystery'. We need a smart system to help us automatically tag books and alert us when it's unsure.

MADE is like this smart assistant for the library. It can quickly assign the right tags to each book and tell us how sure it is about these tags. If it's not very sure, it will prompt us to double-check.

This system gets smarter by continuously learning from new books and tags. It can also handle complex tag relationships, like a book being both 'Science Fiction' and 'Mystery', and there might be some connection between these tags.

In short, MADE is a tool that helps us manage the library better, allowing us to work more efficiently with a large number of books while ensuring the accuracy of the tags.

ELI14 (explained like you're 14)

Hey there! Imagine you're playing a super complex game with lots of characters, each having different skills and attributes. Your task is to choose the right combination of skills for each character, which is like tagging them.

The problem is, some skills are common, like 'Attack Power', while others are rare, like 'Invisibility'. You need a smart assistant to help you quickly choose the right skill combinations and alert you when it's unsure.

MADE is like this assistant. It not only helps you quickly choose skill combinations but also tells you how sure it is about these choices. If it's not very sure, it will prompt you to double-check.

This assistant gets smarter by continuously learning from new characters and skills. It can also handle complex skill relationships, like a character needing both 'Attack Power' and 'Invisibility', and there might be some connection between these skills.

In short, MADE is a tool that helps you manage characters better in the game, allowing you to play more efficiently with a large number of characters while ensuring the accuracy of skill choices.

Glossary

Multi-label Text Classification

A machine learning task aimed at assigning multiple labels to each input sample.

Used in this paper for classifying medical device adverse events.

Uncertainty Quantification

Evaluates the degree of uncertainty in model predictions, helping to identify potential erroneous predictions.

Used to enhance model reliability in high-risk domains.

Long-tailed Distribution

A data distribution where a few categories occupy most samples, while most categories are rare.

Characteristic of the label distribution in the MADE dataset.

Entropy

A metric for measuring the uncertainty of a random variable; higher values indicate greater uncertainty.

One of the metrics used for uncertainty quantification.

Consistency

The stability of a model's output across multiple predictions, reflecting its reliability.

One of the metrics used for uncertainty quantification.

Discriminative Fine-tuning

A fine-tuning method focused on enhancing a model's discriminative ability for specific tasks.

Used in MADE to improve head-to-tail accuracy.

Generative Fine-tuning

A fine-tuning method focused on enhancing a model's ability to generate outputs.

Used in MADE to improve the reliability of uncertainty quantification.

Reasoning Model

A model capable of complex reasoning and decision-making, often used for handling rare labels.

Used in MADE to improve performance on rare labels.

Self-verbalization

A method where a model outputs a confidence score corresponding to its prediction.

Used in MADE to assess the model's confidence level.

FDA (Food and Drug Administration)

A U.S. government agency responsible for protecting and promoting public health.

Provides the medical device adverse event reports for the MADE dataset.

Open Questions (unanswered questions from this research)

  • Open Question 1: How can uncertainty quantification on rare labels be improved without increasing computational cost? Current methods often require more compute when dealing with rare labels.
  • Open Question 2: How can entropy-based and consistency-based uncertainty quantification methods be better combined to improve model reliability? Existing methods may disagree in certain situations.
  • Open Question 3: Can the hierarchical label structure be simplified without hurting model performance? Complex hierarchies make label dependencies harder for models to handle.
  • Open Question 4: How can models maintain generalization on a dynamically updated dataset? As the data is continuously refreshed, models may need frequent re-training.
  • Open Question 5: How should automation and human oversight be balanced in high-risk domains? Excessive human intervention may reduce system efficiency.

Applications

Immediate Applications

Medical Device Monitoring

Hospitals and medical institutions can use MADE to automate the monitoring and classification of medical device adverse events, improving the efficiency of device safety management.

Clinical Event Reporting

Clinical researchers can leverage MADE's dataset and uncertainty quantification methods to improve event reporting systems, ensuring accuracy and timeliness.

Drug Safety Monitoring

Regulatory agencies can use MADE's framework to develop similar systems for monitoring adverse drug reactions, enhancing drug safety.

Long-term Vision

Intelligent Healthcare Systems

In the future, MADE could become part of intelligent healthcare systems, helping doctors and medical institutions better manage and predict the safety issues of medical devices and drugs.

Cross-domain Applications

MADE's framework and methods can be extended to other high-risk domains, such as finance and aviation, to enhance risk management capabilities in these fields.

Abstract

Machine learning in high-stakes domains such as healthcare requires not only strong predictive performance but also reliable uncertainty quantification (UQ) to support human oversight. Multi-label text classification (MLTC) is a central task in this domain, yet remains challenging due to label imbalances, dependencies, and combinatorial complexity. Existing MLTC benchmarks are increasingly saturated and may be affected by training data contamination, making it difficult to distinguish genuine reasoning capabilities from memorization. We introduce MADE, a living MLTC benchmark derived from medical device adverse event reports and continuously updated with newly published reports to prevent contamination. MADE features a long-tailed distribution of hierarchical labels and enables reproducible evaluation with strict temporal splits. We establish baselines across more than 20 encoder- and decoder-only models under fine-tuning and few-shot settings (instruction-tuned/reasoning variants, local/API-accessible). We systematically assess entropy-/consistency-based and self-verbalized UQ methods. Results show clear trade-offs: smaller discriminatively fine-tuned decoders achieve the strongest head-to-tail accuracy while maintaining competitive UQ; generative fine-tuning delivers the most reliable UQ; large reasoning models improve performance on rare labels yet exhibit surprisingly weak UQ; and self-verbalized confidence is not a reliable proxy for uncertainty. Our work is publicly available at https://hhi.fraunhofer.de/aml-demonstrator/made-benchmark.

