MADE: A Living Benchmark for Multi-Label Text Classification with Uncertainty Quantification of Medical Device Adverse Events
The MADE benchmark provides a continuously updated testbed for multi-label text classification of medical device adverse event reports, with systematic evaluation of uncertainty quantification.
Key Findings
Methodology
This paper introduces MADE, a dynamic benchmark for multi-label text classification built from medical device adverse event reports. The core methodology evaluates more than 20 encoder-only and decoder-only models under fine-tuning and few-shot settings. Entropy-based, consistency-based, and self-verbalized uncertainty quantification methods are systematically assessed. The MADE dataset features a long-tailed distribution of hierarchical labels and enables reproducible evaluation through strict temporal splits.
Key Results
- Result 1: Smaller discriminatively fine-tuned decoders achieve the strongest head-to-tail accuracy while maintaining competitive uncertainty quantification capabilities.
- Result 2: Generative fine-tuning delivers the most reliable uncertainty quantification, especially excelling on rare labels.
- Result 3: Large reasoning models improve performance on rare labels yet exhibit surprisingly weak uncertainty quantification.
Significance
This research provides an uncontaminated benchmark for multi-label text classification in the medical field, addressing the saturation and data contamination issues of existing benchmarks. By introducing a dynamically updated dataset, MADE offers a continuous evaluation platform for future research, enabling the testing of models' generalization capabilities on fresh data.
Technical Contribution
Technically, MADE overcomes the limitations of traditional static datasets by introducing a dynamically updated benchmark. It provides a long-tailed distribution of hierarchical labels and ensures reproducible evaluation through strict temporal splits. Additionally, it systematically evaluates various uncertainty quantification methods, offering practical guidance for future research.
Novelty
MADE is the first dynamic multi-label text classification benchmark focused on medical device adverse events. Unlike existing static datasets, it avoids data contamination through continuous updates and provides a more challenging evaluation environment.
Limitations
- Limitation 1: Although MADE provides a dynamically updated dataset, its hierarchical label structure may pose difficulties for models in handling complex label dependencies.
- Limitation 2: The uncertainty quantification capabilities on rare labels still need further improvement.
Future Work
Future research can explore ways to further improve model generalization on the MADE dataset, especially for rare labels and complex label dependencies. Another direction is better integrating entropy- and consistency-based uncertainty quantification methods to improve model reliability.
AI Executive Summary
In high-stakes domains like healthcare, machine learning models require not only strong predictive performance but also reliable uncertainty quantification (UQ) to support human oversight. Multi-label text classification (MLTC) is a central task in this domain, yet remains challenging due to label imbalances, dependencies, and combinatorial complexity.
Existing MLTC benchmarks are increasingly saturated and may be affected by training data contamination, making it difficult to distinguish genuine reasoning capabilities from memorization. We introduce MADE, a living MLTC benchmark derived from medical device adverse event reports and continuously updated with newly published reports to prevent contamination. MADE features a long-tailed distribution of hierarchical labels and enables reproducible evaluation with strict temporal splits.
We establish baselines across more than 20 encoder- and decoder-only models under fine-tuning and few-shot settings (instruction-tuned/reasoning variants, local/API-accessible). We systematically assess entropy-/consistency-based and self-verbalized UQ methods. Results show clear trade-offs: smaller discriminatively fine-tuned decoders achieve the strongest head-to-tail accuracy while maintaining competitive UQ; generative fine-tuning delivers the most reliable UQ; large reasoning models improve performance on rare labels yet exhibit surprisingly weak UQ; and self-verbalized confidence is not a reliable proxy for uncertainty.
Our work is publicly available, providing an uncontaminated benchmark and comprehensive baselines for future research. Through MADE, researchers can test models' generalization capabilities on continuously updated data, avoiding potential leakage of test data into the pretraining corpora of future foundation models.
In summary, MADE not only provides a dynamic evaluation platform for multi-label text classification but also offers practical guidance on model selection and UQ strategies. Future research can build on this foundation to further explore improving model performance on complex label dependencies and rare labels.
Deep Analysis
Background
Multi-label text classification (MLTC) is crucial in the medical field for tasks such as patient categorization, clinical coding, and incident reporting. However, MLTC faces challenges such as label imbalances, label dependencies, and combinatorial complexity. Traditional MLTC benchmarks are increasingly saturated, and because existing datasets are static, they may be included in the pre-training corpora of large language models (LLMs), leading to data contamination and making it difficult to evaluate models' true capabilities. Additionally, label imbalance and interdependence make models prone to bias toward frequent classes while ignoring rare but safety-critical conditions.
Core Problem
The core problem in MLTC is selecting multiple labels from a typically much larger set, leading to a combinatorial problem that scales exponentially with the label space size. Real-world MLTC data is characterized by severe inter- and intra-class imbalances: a few common conditions comprise the majority of examples, while safety-critical conditions reside in the long tail. Models must learn to disentangle correlated signatures without becoming biased toward frequent classes. Further, labels often co-occur and are hierarchically interdependent, violating the assumption of label independence.
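The combinatorial scaling and long-tail imbalance described above can be made concrete with a small sketch (the label names and counts below are illustrative, not drawn from the MADE dataset):

```python
from collections import Counter

# With K candidate labels, each report may receive any subset, so there are
# 2**K possible label sets -- exhaustive subset scoring is infeasible even
# for modest K.
K = 25
print(2 ** K)  # 33554432 candidate label sets for just 25 labels

# Long-tailed label frequencies: a few head labels dominate while most
# labels are rare (toy counts, purely illustrative).
reports = [
    {"device malfunction"}, {"device malfunction"}, {"device malfunction"},
    {"device malfunction", "patient injury"}, {"patient injury"},
    {"patient injury", "battery failure"}, {"software error"},
]
counts = Counter(label for labels in reports for label in labels)
for label, n in counts.most_common():
    print(label, n)
```

A model trained on such data sees "device malfunction" four times as often as "battery failure", which is exactly the head-versus-tail bias the benchmark is designed to expose.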
Innovation
MADE's core innovations include:
1. Dynamically updated dataset: By continuously introducing new reports, it avoids data contamination issues.
2. Long-tailed distribution of hierarchical labels: Provides a more challenging evaluation environment.
3. Strict temporal splits: Ensures reproducible evaluation.
4. Systematic evaluation of uncertainty quantification methods: Offers practical guidance for future research.
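The strict temporal split in point 3 can be sketched as follows (field names and the cutoff date are hypothetical; the actual MADE split dates may differ):

```python
from datetime import date

def temporal_split(reports, cutoff):
    """Strict temporal split: everything received before the cutoff is
    training data, everything on or after it is test data, so no test
    report can predate any training example."""
    train = [r for r in reports if r["received"] < cutoff]
    test = [r for r in reports if r["received"] >= cutoff]
    return train, test

reports = [
    {"id": 1, "received": date(2022, 3, 1)},
    {"id": 2, "received": date(2023, 6, 15)},
    {"id": 3, "received": date(2024, 1, 10)},
]
train, test = temporal_split(reports, cutoff=date(2023, 1, 1))
print([r["id"] for r in train], [r["id"] for r in test])  # [1] [2, 3]
```

Because new reports keep arriving after any model's pre-training cutoff, refreshing the test pool over time is what keeps the benchmark uncontaminated.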
Methodology
- Dataset Construction: Extract data from the FDA's medical device adverse event reports to create a multi-label text classification dataset with hierarchical labels.
- Model Selection: Use over 20 encoder and decoder models for fine-tuning and few-shot learning.
- Uncertainty Quantification: Evaluate entropy- and consistency-based methods, as well as self-verbalized uncertainty quantification.
- Experimental Design: Conduct systematic evaluations under fine-tuning and few-shot settings, comparing the performance of different models and uncertainty quantification methods.
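The two main UQ families above can be sketched in a few lines (these are standard textbook formulations; the paper's exact estimators may differ):

```python
import math

def binary_entropy(p):
    """Shannon entropy of a Bernoulli variable, in nats."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

def entropy_uncertainty(label_probs):
    """Entropy-based UQ: mean per-label entropy of sigmoid outputs."""
    return sum(binary_entropy(p) for p in label_probs) / len(label_probs)

def consistency_uncertainty(sampled_label_sets):
    """Consistency-based UQ: 1 minus mean pairwise Jaccard similarity over
    repeated (e.g. temperature-sampled) predictions for the same input."""
    sets = sampled_label_sets
    pairs = [(a, b) for i, a in enumerate(sets) for b in sets[i + 1:]]
    sims = [len(a & b) / len(a | b) if a | b else 1.0 for a, b in pairs]
    return 1.0 - sum(sims) / len(sims)

# Confident case: probabilities near 0/1, identical repeated samples.
print(entropy_uncertainty([0.98, 0.01, 0.99]))    # close to 0
print(consistency_uncertainty([{"a", "b"}] * 3))  # 0.0
# Uncertain case: probabilities near 0.5, disagreeing samples.
print(entropy_uncertainty([0.5, 0.6, 0.45]))      # close to log(2)
print(consistency_uncertainty([{"a"}, {"b"}, {"a", "c"}]))
```

Self-verbalized UQ, by contrast, simply asks the model to state a confidence score in its output; the paper's finding is that this stated confidence is not a reliable proxy for the uncertainty measured by the methods above.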
Experiments
The experimental design involves using the FDA's medical device adverse event report dataset for fine-tuning and few-shot learning. Baseline models include over 20 encoder and decoder models, with evaluation metrics such as macro F1, Jaccard index, and uncertainty quantification metrics (PRR, Spearman correlation, ECE+). The experiments also include ablation studies to evaluate the effectiveness of different uncertainty quantification methods.
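Two of the accuracy metrics mentioned above can be sketched in pure Python (illustrative only; standard library implementations would normally be used):

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-label F1, so rare (tail) labels weigh as
    much as frequent (head) labels. y_true/y_pred are label sets."""
    f1s = []
    for lab in labels:
        tp = sum(lab in t and lab in p for t, p in zip(y_true, y_pred))
        fp = sum(lab not in t and lab in p for t, p in zip(y_true, y_pred))
        fn = sum(lab in t and lab not in p for t, p in zip(y_true, y_pred))
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(f1s) / len(f1s)

def mean_jaccard(y_true, y_pred):
    """Example-based Jaccard index: |T & P| / |T | P| per example, averaged."""
    scores = [len(t & p) / len(t | p) if t | p else 1.0
              for t, p in zip(y_true, y_pred)]
    return sum(scores) / len(scores)

y_true = [{"a", "b"}, {"a"}, {"c"}]
y_pred = [{"a"}, {"a"}, {"b", "c"}]
print(macro_f1(y_true, y_pred, labels=["a", "b", "c"]))  # 0.666...
print(mean_jaccard(y_true, y_pred))                      # 0.666...
```

Macro averaging is the natural choice for a long-tailed benchmark: a model that only gets the head labels right scores poorly, because every tail label contributes equally to the mean.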
Results
Experimental results show that smaller discriminatively fine-tuned decoders achieve the strongest head-to-tail accuracy while maintaining competitive uncertainty quantification capabilities. Generative fine-tuning delivers the most reliable uncertainty quantification, especially excelling on rare labels. Large reasoning models improve performance on rare labels yet exhibit surprisingly weak uncertainty quantification. Self-verbalized confidence is not a reliable proxy for uncertainty.
Applications
MADE can be directly applied to automated reporting and classification systems for medical device adverse events, helping medical institutions monitor and manage device safety issues more effectively. Its dynamically updated dataset and uncertainty quantification methods can also be transferred to multi-label text classification tasks in other high-risk domains.
Limitations & Outlook
Although MADE provides a dynamically updated dataset, its hierarchical label structure may make complex label dependencies difficult for models to handle. Additionally, uncertainty quantification on rare labels still needs further improvement. Future research can explore how to better integrate entropy- and consistency-based uncertainty quantification methods to improve model reliability.
Plain Language (Accessible to non-experts)
Imagine a library with many books, each having multiple tags like 'Science Fiction', 'Mystery', 'Bestseller', etc. Our task is to assign the right tags to each new book. The problem is, some tags are very common, like 'Bestseller', while others are rare, like 'Sci-Fi Mystery'. We need a smart system to help us automatically tag books and alert us when it's unsure.
MADE is like this smart assistant for the library. It can quickly assign the right tags to each book and tell us how sure it is about these tags. If it's not very sure, it will prompt us to double-check.
This system gets smarter by continuously learning from new books and tags. It can also handle complex tag relationships, like a book being both 'Science Fiction' and 'Mystery', and there might be some connection between these tags.
In short, MADE is a tool that helps us manage the library better, allowing us to work more efficiently with a large number of books while ensuring the accuracy of the tags.
ELI14 (Explained like you're 14)
Hey there! Imagine you're playing a super complex game with lots of characters, each having different skills and attributes. Your task is to choose the right combination of skills for each character, which is like tagging them.
The problem is, some skills are common, like 'Attack Power', while others are rare, like 'Invisibility'. You need a smart assistant to help you quickly choose the right skill combinations and alert you when it's unsure.
MADE is like this assistant. It not only helps you quickly choose skill combinations but also tells you how sure it is about these choices. If it's not very sure, it will prompt you to double-check.
This assistant gets smarter by continuously learning from new characters and skills. It can also handle complex skill relationships, like a character needing both 'Attack Power' and 'Invisibility', and there might be some connection between these skills.
In short, MADE is a tool that helps you manage characters better in the game, allowing you to play more efficiently with a large number of characters while ensuring the accuracy of skill choices.
Glossary
Multi-label Text Classification
A machine learning task aimed at assigning multiple labels to each input sample.
Used in this paper for classifying medical device adverse events.
Uncertainty Quantification
Evaluates the degree of uncertainty in model predictions, helping to identify potential erroneous predictions.
Used to enhance model reliability in high-risk domains.
Long-tailed Distribution
A data distribution where a few categories occupy most samples, while most categories are rare.
Characteristic of the label distribution in the MADE dataset.
Entropy
A metric for measuring the uncertainty of a random variable; higher values indicate greater uncertainty.
One of the metrics used for uncertainty quantification.
Consistency
The stability of a model's output across multiple predictions, reflecting its reliability.
One of the metrics used for uncertainty quantification.
Discriminative Fine-tuning
A fine-tuning method focused on enhancing a model's discriminative ability for specific tasks.
Used in MADE to improve head-to-tail accuracy.
Generative Fine-tuning
A fine-tuning method focused on enhancing a model's ability to generate outputs.
Used in MADE to improve the reliability of uncertainty quantification.
Reasoning Model
A model capable of complex reasoning and decision-making, often used for handling rare labels.
Used in MADE to improve performance on rare labels.
Self-verbalization
A method where a model outputs a confidence score corresponding to its prediction.
Used in MADE to assess the model's confidence level.
FDA (Food and Drug Administration)
A U.S. government agency responsible for protecting and promoting public health.
Provides the medical device adverse event reports for the MADE dataset.
Open Questions (Unanswered questions from this research)
- Open Question 1: How can uncertainty quantification on rare labels be improved without increasing computational complexity? Current methods often require more computational resources when dealing with rare labels.
- Open Question 2: How can entropy- and consistency-based uncertainty quantification methods be better integrated to improve model reliability? Existing methods may conflict in certain situations.
- Open Question 3: How can the hierarchical label structure be simplified without affecting model performance? Complex hierarchies may make label dependencies harder for models to handle.
- Open Question 4: How can models maintain generalization on a dynamically updated dataset? As the data is continuously updated, models may need frequent adjustment.
- Open Question 5: How should automation and human intervention be balanced in high-risk domains? Excessive human intervention may reduce system efficiency.
Applications
Immediate Applications
Medical Device Monitoring
Hospitals and medical institutions can use MADE to automate the monitoring and classification of medical device adverse events, improving the efficiency of device safety management.
Clinical Event Reporting
Clinical researchers can leverage MADE's dataset and uncertainty quantification methods to improve event reporting systems, ensuring accuracy and timeliness.
Drug Safety Monitoring
Regulatory agencies can use MADE's framework to develop similar systems for monitoring adverse drug reactions, enhancing drug safety.
Long-term Vision
Intelligent Healthcare Systems
In the future, MADE could become part of intelligent healthcare systems, helping doctors and medical institutions better manage and predict the safety issues of medical devices and drugs.
Cross-domain Applications
MADE's framework and methods can be extended to other high-risk domains, such as finance and aviation, to enhance risk management capabilities in these fields.
Abstract
Machine learning in high-stakes domains such as healthcare requires not only strong predictive performance but also reliable uncertainty quantification (UQ) to support human oversight. Multi-label text classification (MLTC) is a central task in this domain, yet remains challenging due to label imbalances, dependencies, and combinatorial complexity. Existing MLTC benchmarks are increasingly saturated and may be affected by training data contamination, making it difficult to distinguish genuine reasoning capabilities from memorization. We introduce MADE, a living MLTC benchmark derived from medical device adverse event reports and continuously updated with newly published reports to prevent contamination. MADE features a long-tailed distribution of hierarchical labels and enables reproducible evaluation with strict temporal splits. We establish baselines across more than 20 encoder- and decoder-only models under fine-tuning and few-shot settings (instruction-tuned/reasoning variants, local/API-accessible). We systematically assess entropy-/consistency-based and self-verbalized UQ methods. Results show clear trade-offs: smaller discriminatively fine-tuned decoders achieve the strongest head-to-tail accuracy while maintaining competitive UQ; generative fine-tuning delivers the most reliable UQ; large reasoning models improve performance on rare labels yet exhibit surprisingly weak UQ; and self-verbalized confidence is not a reliable proxy for uncertainty. Our work is publicly available at https://hhi.fraunhofer.de/aml-demonstrator/made-benchmark.