Human Label Variation as Stable Signal: Learning Annotator-Specific Explanation Behavior via Cross-Annotator Preference Optimization

TL;DR

Proposes CAPO, a cross-annotator preference optimization method, enabling LLMs to learn and reproduce stable individual explanation behaviors, outperforming prompting and SFT.

cs.CL 2026-05-28

Beiduo Chen, Pingjun Hong, Ziyun Zhang, Benjamin Roth, Anna Korhonen, Barbara Plank


NLP, annotator behavior, model simulation, preference optimization, explanation behavior

Key Findings

Methodology

This study analyzes two sentence-pair tasks, natural language inference (NLI) and paraphrase judgment, each annotated by four annotators, and asks whether annotators show stable individual explanation patterns. Such patterns become detectable once input content is systematically reduced and annotations are aggregated at the annotator level. Baseline methods include prompting (in-context learning, profile prompting) and supervised fine-tuning (SFT). The core innovation, cross-annotator preference optimization (CAPO), contrasts a target annotator's response with other valid but less target-specific annotations for the same input, using a contrastive preference loss (e.g., DPO). Evaluation covers decision accuracy, explanation similarity (ROUGE, BERTScore), aggregation-aware imitation metrics, and human validation, and is used to compare how well each method models annotator-specific explanation behavior.
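For reference, the standard DPO objective can be written with the target annotator's response in the preferred slot. This is a sketch under assumptions: the conditioning on an annotator identifier $a$ and the mapping of $y^{+}$ and $y^{-}$ to the target and non-target annotators follow the description above, and the paper's exact loss may differ.

```latex
\mathcal{L}(\theta) =
  -\,\mathbb{E}_{(x,\,a,\,y^{+},\,y^{-})}
  \left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y^{+} \mid x, a)}{\pi_{\mathrm{ref}}(y^{+} \mid x, a)}
      \;-\;
      \beta \log \frac{\pi_\theta(y^{-} \mid x, a)}{\pi_{\mathrm{ref}}(y^{-} \mid x, a)}
    \right)
  \right]
```

Here $x$ is the sentence pair, $y^{+}$ is the target annotator's label and explanation, $y^{-}$ is another annotator's valid response to the same $x$, $\pi_{\mathrm{ref}}$ is a frozen reference model (plausibly the SFT model), $\sigma$ is the logistic function, and $\beta$ is the usual DPO temperature.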

Key Results

Prompting methods exhibit limited stability, with decision accuracy around 40%. SFT significantly improves behavior modeling, reaching approximately 55%. CAPO further enhances target annotator imitation, achieving over 96% accuracy at the group level. Feature KL divergence indicates CAPO reduces content bias, emphasizing stylistic features like length and lexical reuse. Human validation confirms explanations generated by CAPO align better with target annotator reasoning, with 82.8% agreement. These results demonstrate CAPO’s effectiveness in capturing stable, individual explanation patterns while maintaining decision performance.
Across multiple metrics, CAPO outperforms baseline methods in both automatic and human evaluations, especially in reproducing annotator-specific reasoning and explanation styles. The model's ability to generate personalized explanations without sacrificing accuracy highlights its potential for scalable, interpretable NLP applications.
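To make the feature KL divergence concrete, the sketch below compares the length distribution of human explanations with that of two sets of generated explanations. The bucket edges, the add-one smoothing, the KL direction (human relative to model), and the toy texts are all illustrative assumptions rather than the paper's setup.

```python
import math
from collections import Counter

def length_bucket(text, edges=(10, 20, 40)):
    """Map an explanation to a coarse length bucket (whitespace token count)."""
    n = len(text.split())
    for i, edge in enumerate(edges):
        if n < edge:
            return i
    return len(edges)

def feature_distribution(texts, n_buckets=4, smoothing=1.0):
    """Smoothed histogram over length buckets for a set of explanations."""
    counts = Counter(length_bucket(t) for t in texts)
    total = len(texts) + smoothing * n_buckets
    return [(counts.get(b, 0) + smoothing) / total for b in range(n_buckets)]

def kl_divergence(p, q):
    """KL(p || q) in nats; both inputs must be strictly positive and sum to 1."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Hypothetical toy data: one annotator's explanations and two model outputs.
human = ["short reason"] * 6 + ["a somewhat longer reason " * 4] * 2
model_close = ["brief note"] * 5 + ["a somewhat longer reason " * 4] * 3
model_far = ["a much longer explanation " * 12] * 8

p = feature_distribution(human)
kl_close = kl_divergence(p, feature_distribution(model_close))
kl_far = kl_divergence(p, feature_distribution(model_far))
print(f"KL close: {kl_close:.3f}  KL far: {kl_far:.3f}")
```

A lower value means the generated explanations match the annotator's feature distribution more closely. The same function applies to any discrete stylistic feature, such as bucketed lexical-reuse rates.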

Significance

This work advances the understanding of human label variation by modeling annotator-specific explanation behaviors, moving beyond traditional label distribution approaches. It provides a scalable framework for generating interpretable, personalized annotations, crucial for applications like explainable AI, human-in-the-loop systems, and knowledge base construction. By capturing individual reasoning patterns, the approach enhances model transparency and trustworthiness, addressing long-standing challenges in AI interpretability and human-AI collaboration. The methodology also opens avenues for personalized education, legal reasoning, and medical diagnostics, where understanding individual thought processes is vital.

Technical Contribution

The paper introduces the CAPO algorithm, which leverages contrastive learning with human annotation variation to encode individual explanation styles. It innovatively combines content reduction techniques (residual embeddings) with pairwise preference contrastive loss, enabling models to learn stable, person-specific behaviors. The evaluation framework integrates multiple metrics—decision accuracy, explanation overlap, and recognizability classifiers—providing a comprehensive assessment of behavior imitation. This approach departs from traditional fine-tuning by explicitly modeling behavioral differences rather than relying solely on aggregate data, thus offering a new paradigm for personalized model training.
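One simple way to picture content reduction is to subtract, from each explanation embedding, the mean embedding of all explanations written for the same input. This is only an illustrative assumption about what a residual embedding could look like. The summary does not define the paper's E4 variant, and the function names and toy vectors below are hypothetical.

```python
from collections import defaultdict

def mean_vector(vectors):
    """Component-wise mean of equally sized vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def residual_embeddings(records):
    """records: list of (item_id, annotator_id, embedding).
    Returns the same records with the per-item mean embedding subtracted,
    so that what remains is variation across annotators on the same input."""
    by_item = defaultdict(list)
    for item_id, _, emb in records:
        by_item[item_id].append(emb)
    item_mean = {item: mean_vector(embs) for item, embs in by_item.items()}
    return [
        (item_id, ann, [e - m for e, m in zip(emb, item_mean[item_id])])
        for item_id, ann, emb in records
    ]

# Toy 2-d embeddings: first axis ~ input content, second axis ~ style (hypothetical).
records = [
    ("item1", "A", [5.0, 1.0]),
    ("item1", "B", [5.0, -1.0]),
    ("item2", "A", [-3.0, 1.0]),
    ("item2", "B", [-3.0, -1.0]),
]
residuals = residual_embeddings(records)
for item_id, ann, vec in residuals:
    print(item_id, ann, vec)
```

In this toy case the content axis cancels out and annotator A keeps the same residual on both items, which is the kind of stable signal the content reduction step is meant to expose.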

Novelty

This research explicitly treats annotator explanation behaviors as learnable, stable signals grounded in human annotation histories. Unlike prior work focusing on label distributions or persona-based prompting, CAPO uses pairwise preference contrast derived from actual annotation variation, enabling models to learn individual reasoning styles. The combination of content reduction, stable behavior detection, and contrastive training forms a framework aimed at improving the interpretability and personalization of language models.

Limitations

The approach relies heavily on multiple annotations per input to identify stable behaviors, limiting applicability where data is sparse or annotations are inconsistent. In highly ambiguous or novel inputs, the learned behaviors may not generalize well.
Training with pairwise preferences increases computational complexity and requires careful construction of annotation pairs to avoid confounding label or score differences, which may not always be feasible in real-world datasets.
Current models may still struggle with highly diverse or context-dependent behaviors, and the method's effectiveness in multilingual or cross-cultural settings remains untested.

Future Work

Future research could explore integrating multimodal data (images, audio) to enrich behavior modeling, developing adaptive pair selection strategies for more efficient training, and extending the framework to low-resource languages. Additionally, investigating online learning paradigms where models continuously adapt to new annotation behaviors would be valuable. Applying this approach to real-world annotation platforms, such as crowdsourcing or expert systems, could validate its practical utility. Further, combining behavior modeling with active learning could optimize annotation efforts and improve model robustness across diverse domains.

AI Executive Summary

In recent years, natural language processing (NLP) has made remarkable progress with large language models (LLMs). However, a persistent challenge remains: human annotations, especially explanations, vary significantly across individuals. Traditional approaches either treat disagreement as noise or summarize it as label distributions, both of which obscure the underlying differences in reasoning. This simplification limits the interpretability and personalization of AI systems, which are increasingly demanded in sensitive applications like healthcare, legal analysis, and education.

Recognizing this gap, the authors of this study propose a novel framework that models annotator-specific explanation behaviors as stable signals. They introduce the cross-annotator preference optimization (CAPO) algorithm, designed to leverage the natural variation among human annotators as a form of contrastive supervision. By contrasting a target annotator’s responses with those of others on the same input, CAPO encourages models to learn and reproduce individual reasoning styles, rather than generic or averaged behaviors.

The research employs two well-established sentence-pair tasks—natural language inference (NLI) and paraphrase judgment—each annotated by four different individuals. Through detailed analysis, the authors find that single annotations are heavily content-dependent and thus unreliable for capturing stable behaviors. However, when annotations are aggregated at the annotator level, consistent stylistic patterns emerge. This insight underpins the design of CAPO, which uses pairwise preference contrast to reinforce target-specific explanation patterns during training.
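The effect of annotator-level aggregation can be illustrated with a small synthetic simulation. Each annotation is modeled as a fixed style vector plus large content noise, and an annotator is identified from the mean of m annotations. The style centroids, noise level, and nearest-centroid rule are assumptions chosen for illustration, and the numbers it prints are not results from the paper.

```python
import random

def simulate_accuracy(m, n_groups=300, noise_std=1.0, seed=0):
    """Nearest-centroid attribution accuracy when each decision uses the
    mean of m noisy feature vectors from a single annotator."""
    rng = random.Random(seed)
    # Hypothetical style centroids for four annotators (one axis each).
    centroids = [[0.6 if i == k else 0.0 for i in range(4)] for k in range(4)]
    correct = 0
    for g in range(n_groups):
        true_k = g % 4
        # Average m single-annotation vectors: stable style plus content noise.
        mean_vec = [
            sum(centroids[true_k][i] + rng.gauss(0.0, noise_std) for _ in range(m)) / m
            for i in range(4)
        ]
        # Attribute the group to the closest style centroid.
        pred = min(
            range(4),
            key=lambda k: sum((mean_vec[i] - centroids[k][i]) ** 2 for i in range(4)),
        )
        correct += pred == true_k
    return correct / n_groups

acc_single = simulate_accuracy(m=1)
acc_grouped = simulate_accuracy(m=25)
print(f"m=1: {acc_single:.2f}  m=25: {acc_grouped:.2f}")
```

Averaging shrinks the content noise by a factor of the square root of m while the style offset stays fixed, so attribution becomes much easier for larger groups.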

Experimental results demonstrate that prompting-based approaches are limited and unstable in modeling individual behaviors. Supervised fine-tuning (SFT) significantly improves performance, but CAPO surpasses it by further enhancing the recognizability and stability of individual explanation styles. Quantitative metrics such as decision accuracy, explanation overlap (ROUGE, BERTScore), and recognizability classifiers show consistent gains. Human validation confirms that explanations generated by CAPO are more aligned with target annotator reasoning, with 82.8% agreement.

This work points toward scalable, interpretable annotation. By grounding explanation behaviors in actual annotation histories, it supports personalized AI systems that adapt to diverse reasoning styles, which may improve trust and usability. The methodology also offers a machine-learning perspective on how individuals reason and explain their decisions. Future directions include extending behavior modeling to multimodal data, improving training efficiency, and deploying in real-world annotation platforms. Overall, this research is a step toward explainable AI systems that respect individual differences in reasoning and explanation.

Deep Analysis

Background

The evolution of NLP has increasingly emphasized interpretability and human-centered AI. Early efforts focused on probabilistic label models and uncertainty quantification, exemplified by works like Nie et al. (2020, 2026) and Aroyo and Welty (2019). These approaches treated disagreement as noise, aiming to produce a consensus label distribution. However, recent studies highlight that disagreement often encodes meaningful differences in interpretation, reasoning, and perspective, especially in ambiguous tasks like NLI and paraphrase detection. The advent of explainable AI (XAI) and rationales has further enriched this understanding, with works such as Jiang and de Marneffe (2022) demonstrating the value of free-text explanations. Despite these advances, most models still rely on aggregate annotations, neglecting individual behavioral nuances. This gap motivates the current research, which seeks to explicitly model and learn individual explanation styles grounded in human annotation histories.

Core Problem

The core challenge addressed in this paper is the difficulty in capturing and reproducing stable, individual explanation behaviors from human annotations. Existing models often conflate content-driven explanations with stylistic tendencies, making it hard to disentangle the two. Single annotations are content-dominated and highly variable, which hampers the ability to learn consistent behavioral patterns. Moreover, current approaches lack mechanisms to leverage the natural variation among annotators as a source of meaningful signals. This results in models that produce generic or averaged explanations, limiting their interpretability and personalization. The problem is compounded by the scarcity of annotated data that explicitly targets individual behaviors, necessitating new methods to extract and utilize such signals effectively.

Innovation

This paper introduces several key innovations:

- Content reduction via residual embeddings (E4) that diminish input influence, isolating stylistic features.
- Cross-annotator preference contrast: leveraging pairwise differences in responses from multiple annotators on the same input as supervision signals.
- The CAPO algorithm, which combines content reduction with preference contrastive learning, enabling models to learn stable, person-specific explanation behaviors.
- A comprehensive evaluation framework that includes automatic metrics (decision accuracy, explanation similarity, recognizability classifiers) and human validation, ensuring behavioral authenticity.
- Demonstrating that stable individual explanation behaviors can be learned and reproduced, opening new avenues for personalized NLP applications.

Methodology

- Data Collection: Utilized two datasets, VariErr (NLI) and R2 (paraphrase), each with four annotators providing labels and explanations.
- Data Preprocessing: Applied content reduction techniques, such as residual embeddings (E4), to minimize input content influence and highlight stylistic features.
- Baseline Methods: Implemented prompting strategies (in-context learning, profile prompting) and supervised fine-tuning (SFT) with independent adapters.
- CAPO Design:
  - Input: Sentence pairs and target annotator ID.
  - Pair Construction: For each input, select the target annotator’s response as positive; other annotators’ responses as negatives, forming preference pairs.
  - Loss Function: Used a contrastive loss inspired by DPO (Rafailov et al.) to optimize the model’s ability to prefer target responses.
  - Training: Initialized from SFT models, then fine-tuned with preference contrast, emphasizing target-specific behaviors.
- Evaluation: Employed decision accuracy, ROUGE/BERTScore for explanations, feature KL divergence, and a group classifier (GC) to assess recognizability; human validation was conducted for qualitative assessment.
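The pair construction and loss steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the data layout, the function names, and the default `beta` are assumptions, and the loss is the standard per-pair DPO form computed from sequence log-probabilities that a real training loop would obtain from the policy and reference models.

```python
import math

def build_preference_pairs(annotations, target):
    """annotations: {item_id: {annotator_id: response_text}}.
    For each item the target annotated, pair the target's response (chosen)
    with every other annotator's response to that item (rejected)."""
    pairs = []
    for item_id, by_annotator in annotations.items():
        if target not in by_annotator:
            continue
        for other, response in by_annotator.items():
            if other != target:
                pairs.append({
                    "item": item_id,
                    "chosen": by_annotator[target],
                    "rejected": response,
                })
    return pairs

def dpo_pair_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one pair, given sequence log-probabilities."""
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # -log(sigmoid(margin)), written in a numerically stable softplus form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Hypothetical toy annotations: labels with short free-text explanations.
annotations = {
    "ex1": {
        "A1": "entailment: the premise states it directly",
        "A2": "entailment: follows from the first clause",
        "A3": "neutral: the time frame is unclear",
    },
    "ex2": {"A2": "contradiction: dates differ", "A3": "contradiction: wrong year"},
}
pairs = build_preference_pairs(annotations, target="A1")
print(len(pairs), "pairs for A1")
print(round(dpo_pair_loss(-8.0, -12.0, -10.0, -10.0), 4))
```

Note that this sketch pairs the target with every other annotator regardless of label. A stricter construction might keep only pairs that agree on the label, so that the model learns explanation style rather than a label preference.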

Experiments

The experiments involved training models on the VariErr and R2 datasets, with 300/100/100 splits for training, validation, and testing. Baselines included prompting (ICL, VP, VP-ICL), SFT, and the proposed CAPO. The models were evaluated on multiple metrics: label accuracy, explanation overlap (ROUGE-L, BERTScore), feature KL divergence, and recognizability scores from a group classifier. The analysis focused on the impact of group size (m) for aggregation, demonstrating that larger groups better reveal stable behaviors. Human evaluators also assessed explanation quality and style consistency. Results consistently showed CAPO’s superiority in modeling individual behaviors while maintaining decision accuracy.
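As an illustration of the explanation-overlap metric, the sketch below computes a simplified ROUGE-L F1 from the longest common subsequence of tokens. It assumes lowercased whitespace tokenization and a plain F1, whereas standard ROUGE toolkits typically add their own tokenization and optional stemming, so scores will not match an official implementation exactly.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(candidate, reference):
    """Simplified ROUGE-L F1 between a generated and a human explanation."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

# Hypothetical generated vs. human explanation.
score = rouge_l_f1(
    "the premise implies the claim",
    "the premise supports the claim",
)
print(f"ROUGE-L F1: {score:.2f}")
```

Overlap metrics like this reward shared wording with a single reference, which is one reason the evaluation also relies on aggregation-aware and recognizability measures.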

Results

CAPO achieved over 96% accuracy in recognizing target annotator behaviors at the group level, far surpassing SFT (~55%) and prompting (~40%). Feature KL divergence decreased significantly, indicating reduced content bias and enhanced stylistic fidelity. Human validation confirmed explanations generated by CAPO aligned more closely with target annotator reasoning, with an 82.8% agreement rate. The model effectively balanced behavior imitation with decision performance, demonstrating that stable, individual explanation styles can be learned and reproduced reliably. These findings validate the core hypothesis that annotator-specific behaviors are learnable signals, especially when aggregated across multiple instances.

Applications

This methodology can be directly applied to improve the quality and interpretability of large-scale annotation systems, enabling personalized explanations in AI assistants, medical diagnostics, and legal reasoning. It facilitates the creation of models that adapt to individual reasoning styles, making AI outputs more trustworthy and user-aligned. Long-term, the approach could underpin adaptive learning environments, where AI systems evolve with user feedback, and support cross-cultural or multilingual annotation tasks by capturing diverse explanatory norms. It also offers a pathway toward more transparent AI, where understanding individual reasoning enhances user trust and system accountability.

Limitations & Outlook

The reliance on multiple annotations per input limits applicability in low-resource or real-time scenarios. The pairwise preference training increases computational overhead and complexity, requiring careful pair selection to avoid confounding label or score differences. The approach’s effectiveness diminishes when annotator behaviors are highly inconsistent or context-dependent. Additionally, the current focus on English datasets necessitates validation in multilingual settings. Future work should address these limitations by developing more efficient training strategies, exploring few-shot or unsupervised variants, and testing across diverse languages and domains.

Plain Language (accessible to non-experts)

Imagine you’re in a classroom where each student answers questions differently. Some write long explanations, others keep it short. The teacher wants to understand each student’s unique style so they can give personalized feedback and help each student learn better. This paper is like teaching a robot to do the same thing. Instead of just giving the right answer, the robot learns how each person explains things, capturing their individual habits.

The scientists found that when a person explains something many times, their style becomes clearer. For example, one person might always use lots of examples, while another prefers simple words. By collecting many explanations from different people, the robot can learn to mimic each one’s style.

To do this, they set up a system where the robot sees explanations from several people for the same question. It is told which explanation came from the target person and learns to prefer that one over the others, even though the others are also reasonable. Over time, the robot gets better at giving explanations that sound just like the target person, making it more personalized and trustworthy.

This is useful because it helps create AI that can talk and explain in ways that feel natural and familiar to different users. Whether it’s a teaching assistant, a medical chatbot, or a customer service bot, understanding individual explanation styles makes AI more helpful and human-like. It’s like having a conversation with a friend who always explains things just the way you like!

Abstract

Free-text explanations extend human label variation (HLV) beyond label disagreement by revealing the reasoning and preferences behind annotators' decisions. We study whether large language models (LLMs) can learn and reproduce such annotator-specific label-explanation behavior. Using two sentence-pair tasks with four annotators each -- natural language inference and paraphrase judgment -- we first analyze whether annotators exhibit stable individual patterns. We find that such patterns are weak at the single-annotation level due to strong input-content effects, but become detectable after input-content reduction and annotator-level aggregation. We then compare prompting and supervised fine-tuning (SFT) baselines and propose cross-annotator preference optimization (CAPO), which contrasts a target annotator's response with other valid but less target-specific annotations for the same input. Experiments show that prompting is limited and unstable, SFT better captures annotator-specific behavior, and CAPO further improves aggregation-aware imitation and judge-based attribution while preserving target-specific reasoning patterns under human validation. Overall, our results show that HLV can be learned as annotator-specific label-explanation behavior, suggesting a path toward scalable explanation-based annotation grounded in annotator histories rather than labels alone.



