Do Metrics for Counterfactual Explanations Align with User Perception?
The study finds that counterfactual explanation metrics do not align with user perception, necessitating more human-centered evaluation methods.
Key Findings
Methodology
This study empirically compares algorithmic evaluation metrics with human judgments. It uses three datasets: Mushroom, Obesity Levels, and Heart Disease. Participants rated counterfactual explanations across multiple perceived quality dimensions, which were then related to a comprehensive set of standard counterfactual metrics. The study analyzed both individual relationships and the extent to which combinations of metrics could predict human assessments.
Key Results
- Result 1: Correlations between algorithmic metrics and human ratings are generally weak and strongly dataset-dependent. For example, in the Mushroom dataset, sparsity and proximity showed moderate negative correlations with user ratings (r = -0.38 to -0.64).
- Result 2: Increasing the number of metrics used in predictive models does not lead to reliable improvements, indicating structural limitations in how current metrics capture criteria relevant for humans.
- Result 3: In the Heart Disease dataset, correlations between all metrics and user ratings were non-significant, indicating substantial differences in metric-user perception relationships across datasets.
Significance
The significance of this study lies in revealing that widely used counterfactual evaluation metrics fail to reflect key aspects of explanation quality as perceived by users. This underscores the need for more human-centered approaches to evaluating explainable AI. The findings challenge the common practice of treating automated counterfactual metrics as reliable proxies for human evaluation, emphasizing the necessity of assessments that better reflect human judgment in XAI systems.
Technical Contribution
Technical contributions include revealing a structural mismatch between algorithmic metrics for counterfactual explanations and user perception. The study demonstrates that existing automated metrics perform inconsistently across different datasets and fail to reliably predict user evaluations of explanation quality. This provides an empirical basis for developing more human-centered evaluation methods.
Novelty
This study is the first to systematically compare algorithmic metrics for counterfactual explanations with human perception. Unlike previous studies, it not only focuses on individual metric correlations but also analyzes the predictive power of metric combinations, revealing the limitations of existing metrics.
Limitations
- Limitation 1: The study uses only three datasets, which may not comprehensively represent all possible application scenarios.
- Limitation 2: Participants' backgrounds and experience may influence their ratings, so individual differences in judgment cannot be fully ruled out.
Future Work
Future research could extend to more datasets and application scenarios to verify the generalizability of current findings. Additionally, new evaluation metrics could be developed to better capture user-perceived explanation quality, advancing human-centered evaluation methods.
AI Executive Summary
In the field of artificial intelligence, explainability is a key factor in building trust. Counterfactual explanations, an important method in this space, show how minimal modifications to an input instance would change a model's prediction. However, the metrics currently used to evaluate counterfactual explanations are primarily algorithmic and rarely validated against human judgments. This raises a critical question: do these metrics truly reflect user perception?
This study addresses this question by empirically comparing algorithmic evaluation metrics with human judgments. It uses three datasets: Mushroom, Obesity Levels, and Heart Disease. Participants rated counterfactual explanations across multiple perceived quality dimensions, which were then related to a comprehensive set of standard counterfactual metrics. The study analyzed both individual relationships and the extent to which combinations of metrics could predict human assessments.
The results show that correlations between algorithmic metrics and human ratings are generally weak and strongly dataset-dependent. For example, in the Mushroom dataset, sparsity and proximity showed moderate negative correlations with user ratings, indicating a preference for counterfactuals involving fewer and smaller changes. In contrast, in the Obesity Levels dataset, users preferred more information-rich explanations. Moreover, increasing the number of metrics used in predictive models does not lead to reliable improvements and may even degrade performance, indicating structural limitations in how current metrics capture criteria relevant for humans.
These findings challenge the common practice of treating automated counterfactual metrics as reliable proxies for human evaluation, emphasizing the necessity of assessments that better reflect human judgment in XAI systems. The study reveals that widely used counterfactual evaluation metrics fail to reflect key aspects of explanation quality as perceived by users, underscoring the need for more human-centered approaches to evaluating explainable AI.
Future research could extend to more datasets and application scenarios to verify the generalizability of current findings. Additionally, new evaluation metrics could be developed to better capture user-perceived explanation quality, advancing human-centered evaluation methods.
Deep Analysis
Background
As machine learning systems are increasingly deployed across various fields, transparency and user understanding become crucial. Counterfactual explanations provide a way to show how minimal modifications to input instances can change model predictions. This method aligns with the natural human reasoning of 'what if' scenarios, making it prominent in explainable artificial intelligence (XAI) research. To evaluate the quality of counterfactual explanations, researchers have proposed a range of algorithmic metrics, such as sparsity and proximity. However, these metrics often lack empirical validation against human perception. This lack of validation is not uncommon in the XAI field, as similar issues have been observed with faithfulness metrics for feature attribution methods, where correlations are weak and can lead to contradictory rankings. Therefore, it is crucial to investigate whether automated metrics capture what users value in explanations.
Core Problem
The core problem is whether the algorithmic metrics currently used to evaluate counterfactual explanations truly reflect user perception. While these metrics are computationally feasible, it remains unknown if they capture the aspects of explanations that users find meaningful, useful, or trustworthy. Existing metrics are typically applied in isolation, lacking comparative validation against human judgments. This leads to a critical research question: can empirical studies reveal the relationship between these metrics and user perception, thereby advancing more human-centered evaluation methods?
Innovation
The core innovations of this study include systematically comparing algorithmic metrics for counterfactual explanations with human perception for the first time. Specific innovations include:
1. Conducting empirical studies using three different datasets (Mushroom, Obesity Levels, Heart Disease) to ensure diversity and generalizability of results.
2. Analyzing individual metric-user rating relationships and evaluating how well combinations of metrics predict human assessments.
3. Revealing that existing metrics perform inconsistently across datasets and fail to reliably predict user evaluations of explanation quality.
Methodology
Method details:
- Dataset selection: Three classification datasets from the UCI Machine Learning Repository were selected: Mushroom, Obesity Levels, and Heart Disease.
- Counterfactual generation: A prototype-based counterfactual generation method was used to ensure generated instances are close to the original and plausible with respect to the data distribution.
- User study design: Participants rated the generated counterfactual explanations across dimensions such as accuracy, understandability, and trustworthiness.
- Metric computation: Seven commonly used automated metrics were computed, including sparsity, proximity, and trust score.
- Data analysis: Pearson correlations were used to analyze the relationship between metrics and user ratings, and supervised learning models evaluated the predictive power of metric combinations (a minimal sketch of this analysis appears after this list).
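To make the data-analysis step concrete, the following is a minimal sketch of how per-explanation metric values could be related to mean user ratings. The variable names, the synthetic data, the ridge regression model, and the cross-validation setup are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Hypothetical inputs: one row per counterfactual explanation.
# `metrics` holds automated metrics (sparsity, proximity, ...);
# `ratings` holds the mean user rating for the same explanation.
rng = np.random.default_rng(0)
metrics = pd.DataFrame({
    "sparsity":  rng.random(100),
    "proximity": rng.random(100),
    "trust":     rng.random(100),
})
ratings = pd.Series(rng.random(100), name="mean_user_rating")

# Step 1: individual relationships via Pearson correlation.
for name in metrics.columns:
    r, p = pearsonr(metrics[name], ratings)
    print(f"{name:>10}: r = {r:+.2f} (p = {p:.3f})")

# Step 2: predictive power of metric combinations.
# Add metrics one at a time and check whether cross-validated R^2 improves.
for k in range(1, metrics.shape[1] + 1):
    subset = metrics.iloc[:, :k]
    scores = cross_val_score(Ridge(), subset, ratings, cv=5, scoring="r2")
    print(f"{k} metric(s): mean CV R^2 = {scores.mean():.2f}")
```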
Experiments
Experimental design:
- Datasets: Mushroom, Obesity Levels, and Heart Disease datasets were used, involving binary and multi-class classification tasks.
- Baseline model: XGBoost was used as the baseline model for classification tasks.
- Counterfactual generation: Counterfactual explanations were generated for each test set instance using a prototype-based method (see the sketch after this list).
- User study: Participants rated the generated counterfactual explanations across dimensions such as accuracy, understandability, and trustworthiness.
- Metric computation: Seven commonly used automated metrics were computed, including sparsity, proximity, and trust score.
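The sketch below pairs an XGBoost baseline with a deliberately simplified, prototype-inspired counterfactual search on synthetic data. It is not the paper's generation method, which uses a prototype-based generator on the UCI datasets; it only illustrates the idea of steering an instance toward a nearby example of the target class until the prediction flips, favoring sparse changes.

```python
import numpy as np
from xgboost import XGBClassifier

def nearest_prototype_counterfactual(model, x, X_train, y_train, target_class):
    """Copy features from the nearest training instance of `target_class`
    into x, one at a time, until the model's prediction flips."""
    candidates = X_train[y_train == target_class]
    proto = candidates[np.argmin(np.linalg.norm(candidates - x, axis=1))]
    cf = x.copy()
    # Change the features that differ most from the prototype first.
    for i in np.argsort(-np.abs(proto - x)):
        cf[i] = proto[i]
        if model.predict(cf.reshape(1, -1))[0] == target_class:
            return cf  # sparse counterfactual found
    return cf  # falls back to the prototype itself

# Illustrative usage on synthetic data (a stand-in for the UCI datasets).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 6))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)
model = XGBClassifier(n_estimators=50).fit(X_train, y_train)

x = X_train[0]
target = 1 - model.predict(x.reshape(1, -1))[0]
cf = nearest_prototype_counterfactual(model, x, X_train, y_train, target)
print("features changed:", int(np.sum(cf != x)))
```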
Results
Results analysis:
- In the Mushroom dataset, sparsity and proximity showed moderate negative correlations with user ratings, indicating a preference for counterfactuals involving fewer and smaller changes.
- In the Obesity Levels dataset, users preferred more information-rich explanations, with several metrics showing positive correlations with user ratings.
- In the Heart Disease dataset, correlations between all metrics and user ratings were non-significant, indicating substantial differences in metric-user perception relationships across datasets.
Applications
Application scenarios:
- The results of this study can guide improvements in counterfactual explanation methods, developing explanations that better meet user expectations.
- In fields such as healthcare and finance, counterfactual explanations can help users better understand model decisions and increase trust.
- The findings can also be used to evaluate other types of explanation methods, advancing the XAI field.
Limitations & Outlook
Limitations & outlook:
- The study uses only three datasets, which may not comprehensively represent all possible application scenarios.
- Participants' backgrounds and experience may influence their ratings, so individual differences in judgment cannot be fully ruled out.
- Future research could extend to more datasets and application scenarios to verify the generalizability of current findings. Additionally, new evaluation metrics could be developed to better capture user-perceived explanation quality, advancing human-centered evaluation methods.
Plain Language (accessible to non-experts)
Imagine you are in a kitchen cooking a meal. Counterfactual explanations are like trying different combinations of spices to see which makes the dish taste better. You might wonder, if I add less salt, will it taste better? Or if I add some chili, will it have more flavor? These small changes are like the 'minimal modifications' in counterfactual explanations, helping you understand how different factors affect the final outcome.
In artificial intelligence, counterfactual explanations help us understand how models make decisions. Just like in the kitchen, you can change certain inputs (like ingredients or spices) to see how the outcome changes. Through this process, you discover which factors have the most impact on the result and which changes are acceptable.
However, the metrics currently used to evaluate these explanations are like standardized scoring systems that may not always reflect your true feelings about the dish. Just as some people prefer strong flavors while others like mild, personal preferences may not be captured by a simple scoring system.
Therefore, researchers are working to develop more human-centered evaluation methods that better reflect users' true perceptions of explanations. It's like creating a customized scoring system for each diner that more accurately reflects their preferences for the dish.
ELI14 (explained like you're 14)
Hey there! Have you ever wondered what would happen if you made different choices in a game? That's what we call 'counterfactual explanations'! Imagine you're playing an adventure game, and your character is standing at a crossroads: one path leads to a mysterious forest, and the other to a dangerous cave. You might think, what if I chose the other path?
In artificial intelligence, counterfactual explanations help us understand how computers make these choices. Just like in the game, you can try different options to see what different outcomes you get. This way, you can better understand the rules and mechanics of the game.
But sometimes these explanations aren't always clear. Just like in the game, some puzzles might be hard to solve, and you need more information to make a decision. That's why researchers are working to develop better ways to explain these choices.
They hope these new methods will be like hints in the game, helping you better understand the reasons behind each choice. This way, you can make smarter decisions in the game and trust these explanations more.
Glossary
Counterfactual Explanation
A method that shows how minimal modifications to an input instance would change a model's prediction.
Used in the paper to analyze the transparency of model decisions.
Sparsity
Refers to the number of features modified in a counterfactual explanation. Fewer modifications are generally considered better.
Used to evaluate the conciseness of counterfactual explanations.
Proximity
Measures the distance between the counterfactual instance and the original input instance; smaller distances are generally preferred.
Used to evaluate how close a counterfactual explanation stays to the original input.
Plausibility
Measures how realistic the counterfactual instance is with respect to the observed data distribution.
Used to evaluate the realism of counterfactual explanations.
Diversity
Measures the independence of changes in different features within a counterfactual explanation.
Used to evaluate the richness of counterfactual explanations.
Oracle Score
Measures the consistency of predictions for a counterfactual instance across different models.
Used to evaluate the model consistency of counterfactual explanations.
Trust Score
Measures how close a counterfactual instance is to its predicted class.
Used to evaluate the trustworthiness of counterfactual explanations.
Completeness
Measures the importance of features changed in a counterfactual explanation.
Used to evaluate the completeness of counterfactual explanations.
XGBoost
A highly efficient gradient boosting decision tree algorithm commonly used for classification and regression tasks.
Used as the baseline model for classification tasks in the paper.
UCI Machine Learning Repository
A widely used collection of datasets for various machine learning tasks.
Used in the paper to select experimental datasets.
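To make the metric definitions above concrete, here is a minimal sketch of how sparsity, proximity, plausibility, and the trust score could be computed for a single counterfactual. The distance choices and normalizations are illustrative assumptions; the paper's exact formulations may differ.

```python
import numpy as np

def sparsity(x, cf):
    """Number of features changed (fewer is usually considered better)."""
    return int(np.sum(~np.isclose(x, cf)))

def proximity(x, cf):
    """L1 distance between the original instance and its counterfactual."""
    return float(np.sum(np.abs(x - cf)))

def plausibility(cf, X_train):
    """Distance to the nearest training instance as a rough realism proxy."""
    return float(np.min(np.linalg.norm(X_train - cf, axis=1)))

def trust_score(cf, X_train, y_train, predicted_class):
    """Ratio of the distance to the nearest other class over the distance
    to the predicted class; higher means the counterfactual sits closer
    to its predicted class than to any competing class."""
    d_pred = np.min(np.linalg.norm(X_train[y_train == predicted_class] - cf, axis=1))
    d_other = np.min(np.linalg.norm(X_train[y_train != predicted_class] - cf, axis=1))
    return float(d_other / (d_pred + 1e-12))

# Illustrative usage with toy arrays.
x  = np.array([1.0, 0.0, 2.0])
cf = np.array([1.0, 0.5, 2.0])
X_train = np.random.default_rng(0).normal(size=(50, 3))
y_train = (X_train[:, 0] > 0).astype(int)
print(sparsity(x, cf), proximity(x, cf),
      plausibility(cf, X_train), trust_score(cf, X_train, y_train, 1))
```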
Open Questions (unanswered questions from this research)
- Open question 1: Existing counterfactual evaluation metrics perform inconsistently across datasets. How can more universally applicable metrics be developed?
- Open question 2: How can the user-perceived quality of counterfactual explanations be improved without increasing computational complexity?
- Open question 3: In multi-class tasks, how can user preferences for counterfactual explanations be better captured?
- Open question 4: How can the effectiveness of counterfactual explanations be validated across different fields, especially in critical areas like healthcare and finance?
- Open question 5: How can insights from cognitive science be integrated to develop explanations that align more closely with human thinking?
- Open question 6: Are current user research methods sufficient to fully capture users' true perceptions of explanations?
- Open question 7: How can the balance between information richness and user comprehensibility be achieved in counterfactual explanations?
Applications
Immediate Applications
Medical Diagnosis
Counterfactual explanations can help doctors understand model diagnostic decisions, increasing transparency and trust.
Financial Decision-Making
In finance, counterfactual explanations can help users understand the decision process behind loan approvals or credit scoring.
Autonomous Driving
Counterfactual explanations can be used to analyze autonomous driving systems' decisions, helping engineers improve system safety and reliability.
Long-term Vision
Human-Computer Interaction
In the future, counterfactual explanations can be used to improve human-computer interaction, making AI systems more transparent and interpretable.
Education
Counterfactual explanations can be used in education to help students better understand complex concepts and problems.
Abstract
Explainability is widely regarded as essential for trustworthy artificial intelligence systems. However, the metrics commonly used to evaluate counterfactual explanations are algorithmic evaluation metrics that are rarely validated against human judgments of explanation quality. This raises the question of whether such metrics meaningfully reflect user perceptions. We address this question through an empirical study that directly compares algorithmic evaluation metrics with human judgments across three datasets. Participants rated counterfactual explanations along multiple dimensions of perceived quality, which we relate to a comprehensive set of standard counterfactual metrics. We analyze both individual relationships and the extent to which combinations of metrics can predict human assessments. Our results show that correlations between algorithmic metrics and human ratings are generally weak and strongly dataset-dependent. Moreover, increasing the number of metrics used in predictive models does not lead to reliable improvements, indicating structural limitations in how current metrics capture criteria relevant for humans. Overall, our findings suggest that widely used counterfactual evaluation metrics fail to reflect key aspects of explanation quality as perceived by users, underscoring the need for more human-centered approaches to evaluating explainable artificial intelligence.