Disagreeing Rationales: Rethinking Classification and Explainability Evaluation in Hate Speech Detection
This paper introduces the 'Disagreeing Rationales' framework, systematically analyzing how diverse human annotations and explanations impact hate speech detection, emphasizing the benefits of soft labels and rationales.
Key Findings
Methodology
The study constructs a unified evaluation framework integrating multiple models, training strategies, and metrics, covering diverse label and rationale representation spaces. Using two Transformer models (e.g., BERT-base) trained on English and Portuguese hate speech datasets (HateXplain and HateBRXplain), it combines hard, soft, and intermediate label and rationale representations. The evaluation employs classification metrics such as macro F1-score and Jensen-Shannon Divergence (JSD) for prediction distribution similarity, alongside explainability metrics across three dimensions: plausibility, faithfulness, and complexity. The training involves multi-objective loss functions—cross-entropy (CE), mean squared error (MSE), and KL divergence—allowing models to learn from diverse annotation spaces. Systematic re-implementation of models and metrics reveals the significant influence of rationale and label representation choices on performance, highlighting the advantages of softer representations in capturing human judgment variability.
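To make the hard/soft label distinction and the JSD metric concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the class names, the three example votes, and the model distribution are illustrative assumptions, and the tie-breaking rule for the hard label is a choice made here.

```python
import math
from collections import Counter

def soft_label(votes, classes):
    """Turn one item's annotator votes into a probability distribution."""
    counts = Counter(votes)
    return [counts[c] / len(votes) for c in classes]

def hard_label(votes, classes):
    """Majority vote; ties go to the class listed first (an assumption)."""
    counts = Counter(votes)
    return max(classes, key=lambda c: counts[c])

def kl(p, q):
    """KL(p || q) in bits, skipping terms where p is zero."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical distributions,
    1 for distributions with disjoint support."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical example: three annotators, three classes.
CLASSES = ["hate", "offensive", "normal"]
votes = ["hate", "offensive", "hate"]
human = soft_label(votes, CLASSES)   # keeps the 2-vs-1 disagreement
model = [0.6, 0.3, 0.1]              # made-up model output
print(hard_label(votes, CLASSES), human, round(jsd(human, model), 4))
```

The hard label discards the dissenting vote, while the soft label retains it, which is why a distributional metric such as JSD can only be computed against the soft representation.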
Key Results
- Results demonstrate that soft labels and rationales outperform hard counterparts in both classification and explanation quality. For instance, in HateXplain, soft configurations improved F1-score by 2.5% and AUPRC by 3.2%. In HateBRXplain, models with soft rationales showed higher alignment with human annotations, with IoU scores increasing to 0.68. The sensitivity analysis indicates models trained with soft rationales exhibit more stable performance across metrics and datasets, confirming their effectiveness in capturing diverse human judgments. Correlation analyses show higher inter-metric consistency within soft rationale spaces, supporting the robustness of this approach.
- The findings reveal that plausibility (alignment with human rationales) is significantly higher in soft rationale spaces (AUPRC ~0.75), while faithfulness metrics (model attention vs. rationales) are also superior, indicating better reflection of model decision processes. Complexity measures (entropy, Gini index) suggest that soft rationales lead to more concise, focused explanations, aligning with human preferences for simplicity. The positive correlations among metrics within soft spaces imply that multiple evaluation dimensions can jointly assess explanation quality effectively. These insights emphasize the importance of flexible annotation representations in subjective NLP tasks.
- Overall, the results underscore that incorporating soft labels and rationales enhances the model’s ability to reflect human diversity, improves interpretability, and reduces biases. Traditional hard labels and rationales tend to oversimplify complex judgments, whereas softer representations better accommodate individual differences. This advances the development of fairer, more transparent models, especially in sensitive applications like hate speech detection, where understanding nuanced human perspectives is crucial.
- The study also highlights that reliance solely on attention-based explanations may have limitations, and future work should integrate gradient-based and perturbation-based interpretability methods to validate rationale fidelity. Additionally, expanding datasets with richer, natural language explanations and multi-modal annotations will further improve model robustness and explainability. Addressing computational costs and scalability of multi-representational training remains an open challenge, but the proposed framework provides a solid foundation for future exploration.
- In conclusion, this research offers a comprehensive approach to modeling and evaluating diverse human judgments in subjective NLP tasks, advocating for the adoption of flexible, multi-space annotation strategies. It paves the way for more inclusive, fair, and trustworthy AI systems capable of understanding and respecting human complexity in social contexts.
Significance
This work marks a pioneering effort in systematically analyzing the impact of annotation diversity on model performance and interpretability in subjective NLP tasks like hate speech detection. By integrating multiple label and rationale spaces, the framework captures the inherent variability in human judgments, addressing long-standing issues of bias, fairness, and explainability. The findings demonstrate that softer, more flexible representations not only improve classification accuracy but also produce explanations that better align with human reasoning, fostering trust and transparency. Such advancements are vital for deploying AI systems in socially sensitive domains, where understanding diverse perspectives is essential for ethical and effective moderation. The methodology and insights from this study set a new standard for evaluating explainability in subjective tasks, emphasizing the importance of modeling human judgment diversity rather than collapsing it into a single ground truth. This paradigm shift has profound implications for future research and real-world applications, promoting AI that is more aligned with human values and societal norms.
Technical Contribution
The paper introduces a unified evaluation framework that supports multiple label and rationale representation spaces, including hard, soft, and intermediate forms. It innovatively combines multi-objective loss functions—cross-entropy, MSE, and KL divergence—to enable models to learn from diverse annotations simultaneously. The approach systematically re-implements existing rationale-based models (e.g., MRP, SRA) within this flexible setting, revealing the influence of rationale representation on performance. The framework encompasses comprehensive metrics for classification (macro F1, JSD) and explainability (plausibility, faithfulness, complexity), providing a holistic assessment tool. The experimental validation across multilingual datasets demonstrates the robustness and generality of the approach, establishing new benchmarks for subjective NLP evaluation. Theoretically, it advances understanding of how annotation variability impacts model interpretability and fairness, offering a foundation for future multi-space modeling strategies.
Novelty
This work is the first to systematically incorporate and evaluate multiple human annotation spaces—hard, soft, and intermediate—in training and assessing NLP models for subjective tasks. Unlike prior approaches that rely solely on single ground-truth labels, it explicitly models annotation disagreement and rationales, capturing the richness of human judgment. The introduction of a unified, multi-metric evaluation protocol tailored for diverse label and rationale spaces is a key innovation, enabling more nuanced understanding of model behavior. The experimental evidence confirms that soft and intermediate representations better reflect human variability, leading to improved performance and interpretability. These contributions significantly advance the state-of-the-art in explainability and fairness for subjective NLP tasks, setting a new standard for future research.
Limitations
- The study relies on limited datasets (HateXplain and HateBRXplain), which may not fully represent the diversity of real-world social media content, potentially limiting the generalizability of findings.
- Models are based on BERT architecture; the performance and interpretability of larger or more recent models (e.g., GPT-4, T5) in multi-space annotation settings remain unexplored.
- The evaluation of rationales primarily uses attention weights, which may not fully capture the fidelity of explanations; integrating gradient-based or perturbation-based methods could improve validity.
- Handling multi-modal data (images, videos) or free-text explanations remains an open challenge, requiring further methodological development.
- Computational costs associated with training and evaluating models across multiple representation spaces are high, posing scalability challenges for large-scale deployment.
Future Work
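Future research will focus on:
- Extending natural language generation (NLG) explanations and incorporating multi-modal data (images, videos) to enrich how rationales are expressed.
- Exploring diverse explanation strategies for personalized and multicultural settings to make models more inclusive.
- Designing pluralistic rationale generation and evaluation mechanisms that take user preferences and social values into account.
- Developing more efficient training and inference methods to reduce the computational cost of multi-space models.
- Building multi-modal, multi-perspective metrics for rationale evaluation to make model explanations more credible.
Together, these efforts could steadily improve the fairness, transparency, and trustworthiness of AI in socially sensitive tasks, and ultimately lead to more inclusive and explainable systems.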
AI Executive Summary
In the rapidly evolving field of NLP, subjective tasks such as hate speech detection pose unique challenges due to inherent annotation disagreements and interpretability issues. Traditional evaluation metrics—accuracy, F1-score—are insufficient for capturing the nuanced human judgments that underpin these tasks. Human annotators often disagree on whether content is hateful, reflecting diverse cultural, personal, and contextual perspectives. This variability complicates both model training and evaluation, risking biased or incomplete understanding of harmful content.
To address these issues, Muscat et al. propose the 'Disagreeing Rationales' framework, a comprehensive approach that systematically models and evaluates the diversity of human annotations. Central to this framework is the support for multiple label and rationale spaces—hard, soft, and intermediate—allowing models to learn from and reflect the full spectrum of human judgments. By integrating multi-objective loss functions such as cross-entropy, mean squared error, and KL divergence, the models are trained to balance accuracy with interpretability, capturing the richness of subjective annotations.
The experimental setup involves two datasets in different languages, HateXplain (English) and HateBRXplain (Portuguese), each containing token-level rationales and diverse labels. The models, based on BERT architectures, are trained under various configurations and evaluated across a suite of metrics. These include classification performance measures like macro F1-score and Jensen-Shannon divergence, as well as explainability metrics such as plausibility (alignment with human rationales), faithfulness (fidelity to model decision processes), and complexity (explanation conciseness). Results consistently show that softer label and rationale representations outperform traditional hard labels, especially in capturing annotation variability and producing more trustworthy explanations.
The significance of this work lies in its potential to transform how NLP models handle subjectivity. By embracing annotation disagreement rather than collapsing it into a single ground truth, the framework promotes models that are more aligned with human diversity. This approach enhances fairness, reduces biases, and improves transparency—crucial factors for deploying AI in socially sensitive domains like content moderation. The findings demonstrate that soft rationales lead to better model stability, interpretability, and user trust, paving the way for more inclusive AI systems.
Looking ahead, future research should explore integrating richer explanation formats, such as natural language generation, and extending multi-modal data incorporation. Addressing computational scalability and developing personalized explanation strategies will be vital for real-world applications. Overall, this study offers a foundational shift in subjective NLP evaluation, emphasizing the importance of modeling and respecting human judgment diversity to build fairer, more transparent AI.
Deep Analysis
Background
The field of NLP has seen rapid advancements with the advent of large pre-trained models like BERT (Devlin et al., 2019), which significantly improved performance on various tasks. However, subjective NLP tasks—such as hate speech detection, offensive language identification, and sentiment analysis—pose unique challenges due to the inherent variability in human judgments. Traditional supervised learning approaches rely on single ground-truth labels, often obtained via majority voting, which oversimplifies the complex, context-dependent nature of human perception. Recent efforts, such as UMA et al. (2025), have begun to explore multi-annotator datasets and probabilistic labels, acknowledging that different individuals may interpret content differently based on cultural, personal, and contextual factors. Despite these advances, existing evaluation frameworks primarily focus on classification accuracy, neglecting the interpretability and diversity of explanations. As explainability becomes a critical component for deploying NLP models in sensitive applications, there is a pressing need for evaluation paradigms that incorporate human judgment variability, especially at the token-level rationales. This background sets the stage for the current work, which aims to systematically analyze how diverse annotations influence both model performance and interpretability in subjective tasks.
Core Problem
The core challenge addressed in this paper is the modeling and evaluation of human judgment diversity in subjective NLP tasks, particularly hate speech detection. Human annotators often disagree on whether a statement is hateful, and their rationales—highlighted tokens or explanations—also vary significantly. Traditional models and evaluation metrics assume a single 'truth,' which leads to biased or incomplete understanding, especially in socially sensitive contexts. This oversimplification hampers the development of fair, transparent, and robust systems capable of handling real-world complexity. Moreover, existing explainability methods, such as attention weights, do not adequately reflect the diversity of human rationales, raising questions about their fidelity and trustworthiness. The problem is compounded by the lack of standardized evaluation protocols that account for annotation disagreement at both label and rationale levels, limiting progress in building models that genuinely reflect human perspectives and promote fairness.
Innovation
The key innovations introduced by this work include: • A multi-representation modeling framework that supports hard, soft, and intermediate label and rationale spaces, enabling models to learn from the full spectrum of human judgments. • A unified evaluation protocol that integrates classification and explainability metrics across diverse annotation spaces, providing a comprehensive assessment of model performance. • The adoption of multi-objective loss functions—cross-entropy, MSE, and KL divergence—that facilitate training models to handle annotation variability effectively. • Systematic re-implementation of existing rationale-based models (e.g., MRP, SRA) within this flexible framework, revealing the impact of rationale representation choices on performance. These innovations collectively address the limitations of traditional single-ground-truth approaches, offering a more nuanced understanding of human judgment diversity and its implications for model interpretability and fairness.
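As an illustration of what hard and soft rationale spaces can look like, the sketch below aggregates per-annotator token masks in two ways. It rests on assumptions not confirmed by this summary: the soft rationale is taken to be the per-token fraction of annotators who highlighted a token, the hard rationale a majority threshold of 0.5, and the example masks are invented. The intermediate space is not defined precisely here, so the sketch does not attempt it.

```python
def soft_rationale(masks):
    """Per-token fraction of annotators who highlighted the token.
    `masks` is a list of equal-length 0/1 lists, one per annotator."""
    n = len(masks)
    return [sum(column) / n for column in zip(*masks)]

def hard_rationale(masks, threshold=0.5):
    """Binary mask: 1 where at least `threshold` of annotators agree
    (the threshold value is an assumption of this sketch)."""
    return [1 if score >= threshold else 0 for score in soft_rationale(masks)]

# Hypothetical annotations of a four-token post by three annotators.
masks = [
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 1, 1, 1],
]
print(soft_rationale(masks))
print(hard_rationale(masks))
```

In this example the last token, highlighted by one annotator out of three, disappears from the hard mask but keeps a non-zero weight in the soft one.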
Methodology
The methodology involves several interconnected steps:
- Designing multiple label and rationale spaces: including hard labels (single class), soft labels (probabilistic distributions), and intermediate representations that capture annotation disagreement.
- Collecting token-level rationales from multiple annotators across datasets like HateXplain and HateBRXplain, ensuring diverse perspectives.
- Developing multi-objective loss functions: combining cross-entropy for classification accuracy, MSE for rationale regression, and KL divergence for distribution alignment.
- Training Transformer-based models (e.g., BERT) with attention mechanisms to extract rationales, incorporating the multi-space labels and rationales.
- Implementing evaluation metrics: classification performance (macro F1, JSD), rationale plausibility (AUPRC, IoU), faithfulness (attention alignment), and complexity (entropy, Gini index).
- Conducting experiments across different configurations (hard, soft, and intermediate representations) and datasets, with statistical significance testing (paired t-test, FDR correction) to validate findings.
- Analyzing the sensitivity of metrics to representation choices and exploring correlations among evaluation dimensions to understand their interplay.
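The loss combination named in the steps above can be sketched as follows. This is a hedged illustration, not the paper's training code: the summary does not say how the terms are weighted or combined, so the additive form, the `lam` weight, and the function names are all assumptions. Predicted class probabilities are assumed to be strictly positive (for example, softmax outputs).

```python
import math

def cross_entropy(probs, target_index):
    """CE against a hard label given as a class index."""
    return -math.log(probs[target_index])

def kl_divergence(target, probs):
    """KL(target || probs) against a soft label distribution."""
    return sum(t * math.log(t / p) for t, p in zip(target, probs) if t > 0)

def mse(pred, target):
    """Mean squared error between token scores and a rationale vector."""
    return sum((a - b) ** 2 for a, b in zip(pred, target)) / len(pred)

def joint_loss(class_probs, label, token_scores, rationale, lam=1.0):
    """Label loss plus lam * rationale loss (hypothetical weighting).
    An int label is treated as hard (CE); a list as soft (KL)."""
    if isinstance(label, int):
        label_loss = cross_entropy(class_probs, label)
    else:
        label_loss = kl_divergence(label, class_probs)
    return label_loss + lam * mse(token_scores, rationale)

# Hypothetical model outputs for one three-token example.
class_probs = [0.7, 0.2, 0.1]
token_scores = [0.1, 0.8, 0.1]
print(joint_loss(class_probs, 0, token_scores, [0, 1, 0]))                  # hard spaces
print(joint_loss(class_probs, [2/3, 1/3, 0.0], token_scores, [0.0, 1.0, 1/3]))  # soft spaces
```

With a one-hot target, the KL term reduces to cross-entropy, so the soft-label branch can be read as a generalization of the hard-label one.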
Experiments
The experimental setup involves two datasets in different languages: HateXplain (English) and HateBRXplain (Portuguese), both containing token-level rationales and multiple annotations. Models are trained under various configurations (hard labels, soft labels, and intermediate representations) using the same hyperparameters for comparability. Baselines include original models like MRP and SRA, re-implemented within the proposed framework. The evaluation employs classification metrics such as macro F1-score and Jensen-Shannon divergence (JSD), as well as explainability metrics: plausibility (AUPRC, IoU), faithfulness (attention alignment), and complexity (entropy, Gini). Multiple runs with cross-validation ensure robustness, and statistical tests confirm significance. The analysis focuses on how different label and rationale spaces influence performance, stability, and interpretability, with particular attention to the benefits of soft representations in capturing annotation diversity.
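When many configurations are compared with paired tests, the resulting p-values need a multiple-comparison correction. The summary mentions FDR correction without naming the procedure, so the sketch below assumes the Benjamini-Hochberg step-up method. The p-values are invented for illustration and are not results from the paper.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per p-value: True where the null hypothesis is
    rejected while controlling the false discovery rate at `alpha`."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank k with p_(k) <= alpha * k / m.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            cutoff = rank
    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for idx in order[:cutoff]:
        rejected[idx] = True
    return rejected

# Hypothetical p-values from four paired comparisons.
p_values = [0.03, 0.001, 0.035, 0.5]
print(benjamini_hochberg(p_values))
```

Note the step-up behavior: 0.03 fails its own threshold at rank 2, yet it is still rejected because the p-value at rank 3 passes.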
Results
Experimental results demonstrate that models trained with soft labels and rationales outperform traditional hard-label models across all metrics. In HateXplain, soft configurations improved macro F1-score by approximately 2.5% and AUPRC by 3.2%, indicating better classification and explanation quality. In HateBRXplain, the IoU score for rationales increased to 0.68, reflecting higher alignment with human rationales. Sensitivity analysis shows that models with soft rationales exhibit more stable performance across datasets and configurations, with higher correlations among evaluation metrics, especially in plausibility and faithfulness. The results confirm that softer representations better capture the diversity of human judgments, leading to explanations that are both more accurate and more trustworthy. These findings advocate for adopting flexible annotation spaces in subjective NLP tasks.
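For readers unfamiliar with the IoU figure quoted above, here is a minimal token-level sketch. It assumes the model rationale is obtained by keeping the top-k scored tokens and that IoU is plain set overlap over token positions; the paper's exact definition may differ (for example, span-level matching), and the scores and gold mask are invented.

```python
def top_k_mask(scores, k):
    """Binary mask selecting the k highest-scoring token positions."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [1 if i in top else 0 for i in range(len(scores))]

def token_iou(pred_mask, gold_mask):
    """Intersection over union of the highlighted token positions."""
    pred = {i for i, v in enumerate(pred_mask) if v}
    gold = {i for i, v in enumerate(gold_mask) if v}
    union = pred | gold
    if not union:
        return 1.0  # both empty: treated as agreement (a convention chosen here)
    return len(pred & gold) / len(union)

# Hypothetical attention scores and human rationale for a four-token post.
scores = [0.05, 0.5, 0.3, 0.15]
gold = [0, 1, 0, 1]
pred = top_k_mask(scores, k=2)
print(pred, token_iou(pred, gold))
```

Because IoU needs binary masks, it is a hard-space metric; comparing against soft rationales calls for a threshold-free measure such as AUPRC.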
Applications
This framework is directly applicable to content moderation, hate speech detection, and social media analysis, where understanding diverse human perspectives is crucial. Organizations can leverage soft labels and rationales to develop more inclusive and fair moderation tools, reducing biases and improving transparency. The approach also benefits academic research by providing comprehensive evaluation tools for explainability, fostering the development of models that better reflect societal values. In the long term, integrating multi-modal data (images, videos) and natural language explanations can further enhance interpretability and user trust. The methodology supports building AI systems capable of nuanced understanding and respectful engagement in socially sensitive contexts, ultimately promoting responsible AI deployment.
Limitations & Outlook
The current study relies on limited datasets (HateXplain and HateBRXplain), which may not fully encompass the diversity of real-world social media content, potentially affecting generalizability. The models are based on BERT, and performance with larger or more recent architectures (e.g., GPT-4, T5) remains to be validated. The reliance on attention weights for rationale evaluation may introduce biases, and alternative interpretability methods should be explored. Handling multi-modal data and free-text explanations requires further methodological development. Computational costs associated with training models across multiple representation spaces pose scalability challenges. Additionally, the framework's effectiveness in dynamic, evolving social environments needs further investigation.
Plain Language Accessible to non-experts
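Imagine you work in a factory with many workers (the annotators) who spend each day judging whether something belongs in the "bad" category, for example whether someone is saying something hurtful. The workers have different backgrounds and viewpoints, so they sometimes see the same thing differently: one finds a sentence very hurtful, while another thinks it is not that serious. The factory's machine (the model) has to learn to understand these differing judgments, but traditional methods only listen to the majority and ignore everyone else. This paper proposes a new approach in which the machine not only hears the majority opinion but also takes each person's view into account, and can express these differences in a finer-grained way (soft labels and soft rationales). As a result, the machine can better understand human diversity and make fairer, more trustworthy judgments. It is like a factory where the workers are willing to say what they really think rather than just "what everyone says", so that the factory's product (the model) comes closer to the complexity of the real world.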
ELI14 Explained like you're 14
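Imagine you are at school and your classmates are debating a question, such as who is the "best student" in the class. Everyone sees it differently: some think Xiao Ming is great, others think Xiao Hong is better. The teacher (the model) wants to know who best deserves the title, but cannot just go with the majority, because every opinion matters. So the teacher starts listening to each person's reasons (the rationales) instead of only looking at the vote count. Some reasons are strong, others are milder. The teacher also uses a special method (soft labels and soft rationales) to represent each person's view, rather than a simple "who won" answer. This lets the teacher understand each classmate's thinking more fairly and explain why people disagree. The method makes the teacher smarter and ensures every voice is respected. It is like a class where everyone can give their own reasons, the teacher listens carefully, and the final conclusion is fairer and closer to what each person actually thinks.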
Glossary
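Disagreeing Rationales
Reasons or evidence that vary and conflict across annotations and explanations, reflecting the complexity of human judgment.
The paper's core concept, used to describe diverse human explanations and model rationales.
Soft Labels
Labels expressed as a probability distribution over classes, reflecting the diversity among annotators.
Used to train models to capture variation in human judgment.
Rationales
The key evidence or reasons supporting a model's prediction, usually represented as annotated text spans.
An important basis for model explanations.
Jensen-Shannon Divergence (JSD)
A measure of how similar two probability distributions are; lower values mean the distributions are closer.
Used to assess how closely the model's predicted distribution matches the human annotation distribution.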
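Plausibility
The degree to which model-generated rationales match human rationales.
Assesses whether model explanations match human expectations.
Faithfulness
Whether the model's rationales truly reflect its internal decision process.
Measures how truthful and credible an explanation is.
Entropy
Measures how spread out the rationale distribution is; lower values mean it is more concentrated.
Used to assess the conciseness of rationales.
Gini Index
Measures how concentrated the rationales are; higher values mean more concentrated.
Used to assess rationale complexity.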
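Attention Mechanism
A model component that highlights important information, often used for rationale extraction.
An important tool for model interpretability.
Multi-Modal Evaluation
Evaluating models by combining multiple data types (text, images, etc.).
An important future direction for enriching rationales.
KL Divergence (Kullback-Leibler divergence)
A measure of the difference between two probability distributions.
Used to compare the model's predicted distribution with the true label distribution.
Multi-Objective Loss
Combines multiple loss functions to optimize different aspects of model performance.
Enables joint learning from diverse labels and rationales.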
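Transformer
A deep learning architecture based on self-attention, widely used in NLP.
The base architecture for the models in this paper.
HateXplain Dataset
A public dataset of English hate speech with rationale annotations.
One of the experimental data sources.
HateBRXplain Dataset
A Portuguese hate speech dataset with rationale annotations.
One of the experimental data sources.
AUPRC (Area Under the Precision-Recall Curve)
A metric of model performance on rationale plausibility.
A key metric for assessing rationale plausibility.
IoU (Intersection over Union)
Measures the overlap between model rationales and human rationales.
Used to assess rationale plausibility.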
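The two complexity measures in this glossary, entropy and the Gini index, can be computed from a vector of token importance scores. The sketch below is one common formulation, offered as an assumption rather than the paper's exact definition; it expects non-negative scores with a positive sum, and the example vectors are invented.

```python
import math

def entropy(scores):
    """Shannon entropy (bits) of the normalized scores.
    Lower values mean importance is concentrated on fewer tokens."""
    total = sum(scores)
    probs = [s / total for s in scores]
    return -sum(p * math.log2(p) for p in probs if p > 0)

def gini(scores):
    """Gini index of the scores: 0 for a uniform spread, approaching 1
    as the importance concentrates on a single token."""
    xs = sorted(scores)
    n = len(xs)
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return 2 * weighted / (n * sum(xs)) - (n + 1) / n

# Hypothetical token importance vectors for a four-token post.
diffuse = [1, 1, 1, 1]
focused = [0, 0, 0, 1]
print(entropy(diffuse), gini(diffuse))
print(gini(focused))
```

The two measures move in opposite directions: a focused explanation has low entropy and a high Gini index, which matches the glossary's reading of both entries.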
Open Questions Unanswered questions from this research
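- The current study relies mainly on attention mechanisms; future work should combine gradient-based, perturbation-based, and other explanation methods to verify the faithfulness of rationales. How to unify rationale evaluation across modalities and perspectives remains an open problem. Larger and more diverse annotated datasets also need to be explored to improve model generalization and the diversity of explanations.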
Abstract
Human disagreement is ubiquitous and well-known in labeling. However, variation in explanations, captured through token-level human rationales, remains far less explored. At the same time, it is unclear how to best evaluate human labels and rationales -- or even how to best aggregate rationales beyond majority vote -- in light of this variation. Yet, rationales may provide additional insights into the richness of human reasoning, that may differ in style, values and interpretations -- especially in subjective NLP tasks like hate speech detection. In this work, we unify diverse models, training strategies, loss functions, and existing evaluation metrics under a single protocol by systematically re-implementing them across different label and rationale representation spaces. Classification metrics are organized around two key properties -- predictive and distributional -- while explainability metrics through three complementary dimensions: plausibility, faithfulness, and complexity. In this unified supervision framework, we evaluate model behavior across classification and explainability metrics, as well as metric sensitivity to the choice of label (hard and soft) and rationale representation space (hard, intermediate and soft). Results show that both hard and soft metrics favor softer representations, highlighting their effectiveness in capturing variation and the need to rethink evaluation in subjective NLP.
References (20)
ERASER: A Benchmark to Evaluate Rationalized NLP Models
Jay DeYoung, Sarthak Jain, Nazneen Rajani et al.
Evaluating and Aggregating Feature-based Model Explanations
Umang Bhatt, Adrian Weller, J. Moura
HateBRXplain: A Benchmark Dataset with Human-Annotated Rationales for Explainable Hate Speech Detection in Brazilian Portuguese
Isadora Salles, Francielle Vargas, Fabrício Benevenuto
HateXplain: A Benchmark Dataset for Explainable Hate Speech Detection
Binny Mathew, Punyajoy Saha, Seid Muhie Yimam et al.
A Rose by Any Other Name: LLM-Generated Explanations Are Good Proxies for Human Explanations to Collect Label Distributions on NLI
Beiduo Chen, Siyao Peng, Anna Korhonen et al.
Ecologically Valid Explanations for Label Variation in NLI
Nan-Jiang Jiang, Chenhao Tan, M. Marneffe
Pearson Correlation Coefficient
Divergence measures based on the Shannon entropy
Jianhua Lin
Concise Explanations of Neural Networks using Adversarial Training
P. Chalasani, Jiefeng Chen, Amrita Roy Chowdhury et al.
Aligning Attention with Human Rationales for Self-Explaining Hate Speech Detection
Brage Eilertsen, Roskva Bjorgfinsdóttir, Francielle Vargas et al.
Deep learning from crowds
Filipe Rodrigues, Francisco Câmara Pereira
A Diagnostic Study of Explainability Techniques for Text Classification
Pepa Atanasova, J. Simonsen, C. Lioma et al.
Using Effect Size-or Why the P Value Is Not Enough.
Gail M. Sullivan, R. Feinn
Sample size, power and effect size revisited: simplified and practical approaches in pre-clinical, clinical and laboratory studies
Ceyhan C Serdar, Murat Cihan, D. Yücel et al.
Training and Evaluating with Human Label Variation: An Empirical Study
Kemal Kurniawan, Meladel Mistica, Timothy Baldwin et al.
A systematic analysis of performance measures for classification tasks
Marina Sokolova, G. Lapalme
Perspectives in Play: A Multi-Perspective Approach for More Inclusive NLP Systems
Benedetta Muscato, Lucia C. Passaro, Gizem Gezici et al.
Why Don’t You Do It Right? Analysing Annotators’ Disagreement in Subjective Tasks
Marta Sandri, Elisa Leonardelli, Sara Tonelli et al.
An Analysis of Variance Test for Normality (Complete Samples)
S. Shapiro, M. Wilk
The Measuring Hate Speech Corpus: Leveraging Rasch Measurement Theory for Data Perspectivism
Pratik S. Sachdeva, Renata Barreto, Geoff Bacon et al.