Do Metrics for Counterfactual Explanations Align with User Perception?
The study finds that counterfactual explanation metrics do not align with user perception, necessitating more human-centered evaluation methods.
Key Findings
Methodology
This study empirically compares algorithmic evaluation metrics with human judgments. It uses three datasets: Mushroom, Obesity Levels, and Heart Disease. Participants rated counterfactual explanations across multiple perceived quality dimensions, which were then related to a comprehensive set of standard counterfactual metrics. The study analyzed both individual relationships and the extent to which combinations of metrics could predict human assessments.
Key Results
- Result 1: Correlations between algorithmic metrics and human ratings are generally weak and strongly dataset-dependent. For example, in the Mushroom dataset, sparsity and proximity showed moderate negative correlations with user ratings (r = -0.38 to -0.64).
- Result 2: Increasing the number of metrics used in predictive models does not lead to reliable improvements, indicating structural limitations in how current metrics capture criteria relevant for humans.
- Result 3: In the Heart Disease dataset, correlations between all metrics and user ratings were non-significant, indicating substantial differences in metric-user perception relationships across datasets.
Significance
The significance of this study lies in revealing that widely used counterfactual evaluation metrics fail to reflect key aspects of explanation quality as perceived by users. This underscores the need for more human-centered approaches to evaluating explainable AI. The findings challenge the common practice of treating automated counterfactual metrics as reliable proxies for human evaluation, emphasizing the necessity of assessments that better reflect human judgment in XAI systems.
Technical Contribution
Technical contributions include revealing a structural mismatch between algorithmic metrics for counterfactual explanations and user perception. The study demonstrates that existing automated metrics perform inconsistently across different datasets and fail to reliably predict user evaluations of explanation quality. This provides an empirical basis for developing more human-centered evaluation methods.
Novelty
This study is the first to systematically compare algorithmic metrics for counterfactual explanations with human perception. Unlike previous studies, it not only focuses on individual metric correlations but also analyzes the predictive power of metric combinations, revealing the limitations of existing metrics.
Limitations
- Limitation 1: The study uses only three datasets, which may not comprehensively represent all possible application scenarios.
- Limitation 2: Participants' backgrounds and experience may influence their ratings, so individual differences in judgment cannot be fully ruled out.
Future Work
Future research could extend to more datasets and application scenarios to verify the generalizability of current findings. Additionally, new evaluation metrics could be developed to better capture user-perceived explanation quality, advancing human-centered evaluation methods.
AI Executive Summary
In the field of artificial intelligence, explainability is a key factor in building trust. Counterfactual explanations, an important method in this space, show how minimal modifications to an input instance would change a model's prediction. However, the metrics currently used to evaluate counterfactual explanations are primarily algorithmic and rarely validated against human judgments. This raises a critical question: do these metrics truly reflect user perception?
This study addresses this question by empirically comparing algorithmic evaluation metrics with human judgments. It uses three datasets: Mushroom, Obesity Levels, and Heart Disease. Participants rated counterfactual explanations across multiple perceived quality dimensions, which were then related to a comprehensive set of standard counterfactual metrics. The study analyzed both individual relationships and the extent to which combinations of metrics could predict human assessments.
The results show that correlations between algorithmic metrics and human ratings are generally weak and strongly dataset-dependent. For example, in the Mushroom dataset, sparsity and proximity showed moderate negative correlations with user ratings, indicating a preference for counterfactuals involving fewer and smaller changes. In contrast, in the Obesity Levels dataset, users preferred more information-rich explanations. Moreover, increasing the number of metrics used in predictive models does not lead to reliable improvements and may even degrade performance, indicating structural limitations in how current metrics capture criteria relevant for humans.
These findings challenge the common practice of treating automated counterfactual metrics as reliable proxies for human evaluation, emphasizing the necessity of assessments that better reflect human judgment in XAI systems. The study reveals that widely used counterfactual evaluation metrics fail to reflect key aspects of explanation quality as perceived by users, underscoring the need for more human-centered approaches to evaluating explainable AI.
Future research could extend to more datasets and application scenarios to verify the generalizability of current findings. Additionally, new evaluation metrics could be developed to better capture user-perceived explanation quality, advancing human-centered evaluation methods.
Deep Analysis
Background
As machine learning systems are increasingly deployed across various fields, transparency and user understanding become crucial. Counterfactual explanations provide a way to show how minimal modifications to input instances can change model predictions. This method aligns with the natural human reasoning of 'what if' scenarios, making it prominent in explainable artificial intelligence (XAI) research. To evaluate the quality of counterfactual explanations, researchers have proposed a range of algorithmic metrics, such as sparsity and proximity. However, these metrics often lack empirical validation against human perception. This lack of validation is not uncommon in the XAI field, as similar issues have been observed with faithfulness metrics for feature attribution methods, where correlations are weak and can lead to contradictory rankings. Therefore, it is crucial to investigate whether automated metrics capture what users value in explanations.
Core Problem
The core problem is whether the algorithmic metrics currently used to evaluate counterfactual explanations truly reflect user perception. While these metrics are computationally feasible, it remains unknown if they capture the aspects of explanations that users find meaningful, useful, or trustworthy. Existing metrics are typically applied in isolation, lacking comparative validation against human judgments. This leads to a critical research question: can empirical studies reveal the relationship between these metrics and user perception, thereby advancing more human-centered evaluation methods?
Innovation
The core innovations of this study include systematically comparing algorithmic metrics for counterfactual explanations with human perception for the first time. Specific innovations include:
1. Conducting empirical studies using three different datasets (Mushroom, Obesity Levels, Heart Disease) to ensure diversity and generalizability of results.
2. Analyzing individual metric-user rating relationships and evaluating how well combinations of metrics predict human assessments.
3. Revealing that existing metrics perform inconsistently across datasets and fail to reliably predict user evaluations of explanation quality.
Methodology
Method details:
- Dataset selection: Three classification datasets from the UCI Machine Learning Repository were selected: Mushroom, Obesity Levels, and Heart Disease.
- Counterfactual generation: A prototype-based counterfactual generation method was used to ensure generated instances are close to the original and plausible with respect to the data distribution.
- User study design: Participants rated the generated counterfactual explanations across dimensions such as accuracy, understandability, and trustworthiness.
- Metric computation: Seven commonly used automated metrics were computed, including sparsity, proximity, and trust score.
- Data analysis: Pearson correlations were used to analyze the relationship between metrics and user ratings, and supervised learning models evaluated the predictive power of metric combinations (a minimal sketch of this analysis appears after this list).
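To make the data-analysis step concrete, the following is a minimal sketch of how per-explanation metric values could be related to mean user ratings. The variable names, the synthetic data, the ridge regression model, and the cross-validation setup are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Hypothetical inputs: one row per counterfactual explanation.
# `metrics` holds automated metrics (sparsity, proximity, ...);
# `ratings` holds the mean user rating for the same explanation.
rng = np.random.default_rng(0)
metrics = pd.DataFrame({
    "sparsity":  rng.random(100),
    "proximity": rng.random(100),
    "trust":     rng.random(100),
})
ratings = pd.Series(rng.random(100), name="mean_user_rating")

# Step 1: individual relationships via Pearson correlation.
for name in metrics.columns:
    r, p = pearsonr(metrics[name], ratings)
    print(f"{name:>10}: r = {r:+.2f} (p = {p:.3f})")

# Step 2: predictive power of metric combinations.
# Add metrics one at a time and check whether cross-validated R^2 improves.
for k in range(1, metrics.shape[1] + 1):
    subset = metrics.iloc[:, :k]
    scores = cross_val_score(Ridge(), subset, ratings, cv=5, scoring="r2")
    print(f"{k} metric(s): mean CV R^2 = {scores.mean():.2f}")
```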
Experiments
Experimental design:
- Datasets: Mushroom, Obesity Levels, and Heart Disease datasets were used, involving binary and multi-class classification tasks.
- Baseline model: XGBoost was used as the baseline model for classification tasks.
- Counterfactual generation: Counterfactual explanations were generated for each test set instance using a prototype-based method (see the sketch after this list).
- User study: Participants rated the generated counterfactual explanations across dimensions such as accuracy, understandability, and trustworthiness.
- Metric computation: Seven commonly used automated metrics were computed, including sparsity, proximity, and trust score.
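The sketch below pairs an XGBoost baseline with a deliberately simplified, prototype-inspired counterfactual search on synthetic data. It is not the paper's generation method, which uses a prototype-based generator on the UCI datasets; it only illustrates the idea of steering an instance toward a nearby example of the target class until the prediction flips, favoring sparse changes.

```python
import numpy as np
from xgboost import XGBClassifier

def nearest_prototype_counterfactual(model, x, X_train, y_train, target_class):
    """Copy features from the nearest training instance of `target_class`
    into x, one at a time, until the model's prediction flips."""
    candidates = X_train[y_train == target_class]
    proto = candidates[np.argmin(np.linalg.norm(candidates - x, axis=1))]
    cf = x.copy()
    # Change the features that differ most from the prototype first.
    for i in np.argsort(-np.abs(proto - x)):
        cf[i] = proto[i]
        if model.predict(cf.reshape(1, -1))[0] == target_class:
            return cf  # sparse counterfactual found
    return cf  # falls back to the prototype itself

# Illustrative usage on synthetic data (a stand-in for the UCI datasets).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 6))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)
model = XGBClassifier(n_estimators=50).fit(X_train, y_train)

x = X_train[0]
target = 1 - model.predict(x.reshape(1, -1))[0]
cf = nearest_prototype_counterfactual(model, x, X_train, y_train, target)
print("features changed:", int(np.sum(cf != x)))
```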
Results
Results analysis:
- In the Mushroom dataset, sparsity and proximity showed moderate negative correlations with user ratings, indicating a preference for counterfactuals involving fewer and smaller changes.
- In the Obesity Levels dataset, users preferred more information-rich explanations, with several metrics showing positive correlations with user ratings.
- In the Heart Disease dataset, correlations between all metrics and user ratings were non-significant, indicating substantial differences in metric-user perception relationships across datasets.
Applications
Application scenarios:
- The results of this study can guide improvements in counterfactual explanation methods, developing explanations that better meet user expectations.
- In fields such as healthcare and finance, counterfactual explanations can help users better understand model decisions and increase trust.
- The findings can also be used to evaluate other types of explanation methods, advancing the XAI field.
Limitations & Outlook
Limitations & outlook:
- The study uses only three datasets, which may not comprehensively represent all possible application scenarios.
- Participants' backgrounds and experience may influence their ratings, so individual differences in judgment cannot be fully ruled out.
- Future research could extend to more datasets and application scenarios to verify the generalizability of current findings. Additionally, new evaluation metrics could be developed to better capture user-perceived explanation quality, advancing human-centered evaluation methods.
Plain Language (accessible to non-experts)
Imagine you are in a kitchen cooking a meal. Counterfactual explanations are like trying different combinations of spices to see which makes the dish taste better. You might wonder, if I add less salt, will it taste better? Or if I add some chili, will it have more flavor? These small changes are like the 'minimal modifications' in counterfactual explanations, helping you understand how different factors affect the final outcome.
In artificial intelligence, counterfactual explanations help us understand how models make decisions. Just like in the kitchen, you can change certain inputs (like ingredients or spices) to see how the outcome changes. Through this process, you discover which factors have the most impact on the result and which changes are acceptable.
However, the metrics currently used to evaluate these explanations are like standardized scoring systems that may not always reflect your true feelings about the dish. Just as some people prefer strong flavors while others like mild, personal preferences may not be captured by a simple scoring system.
Therefore, researchers are working to develop more human-centered evaluation methods that better reflect users' true perceptions of explanations. It's like creating a customized scoring system for each diner that more accurately reflects their preferences for the dish.
ELI14 (explained like you're 14)
Hey there! Have you ever wondered what would happen if you made different choices in a game? That's what we call 'counterfactual explanations'! Imagine you're playing an adventure game, and your character is standing at a crossroads: one path leads to a mysterious forest, and the other to a dangerous cave. You might think, what if I chose the other path?
In artificial intelligence, counterfactual explanations help us understand how computers make these choices. Just like in the game, you can try different options to see what different outcomes you get. This way, you can better understand the rules and mechanics of the game.
But sometimes these explanations aren't always clear. Just like in the game, some puzzles might be hard to solve, and you need more information to make a decision. That's why researchers are working to develop better ways to explain these choices.
They hope these new methods will be like hints in the game, helping you better understand the reasons behind each choice. This way, you can make smarter decisions in the game and trust these explanations more.
Glossary
Counterfactual Explanation
A method that shows how minimal modifications to an input instance would change a model's prediction.
Used in the paper to analyze the transparency of model decisions.
Sparsity
Refers to the number of features modified in a counterfactual explanation. Fewer modifications are generally considered better.
Used to evaluate the conciseness of counterfactual explanations.
Proximity
Measures the distance between the counterfactual instance and the original input instance; smaller distances are generally preferred.
Used to evaluate how close a counterfactual explanation stays to the original input.
Plausibility
Measures how realistic the counterfactual instance is with respect to the observed data distribution.
Used to evaluate the realism of counterfactual explanations.
Diversity
Measures the independence of changes in different features within a counterfactual explanation.
Used to evaluate the richness of counterfactual explanations.
Oracle Score
Measures the consistency of predictions for a counterfactual instance across different models.
Used to evaluate the model consistency of counterfactual explanations.
Trust Score
Measures how close a counterfactual instance is to its predicted class.
Used to evaluate the trustworthiness of counterfactual explanations.
Completeness
Measures the importance of features changed in a counterfactual explanation.
Used to evaluate the completeness of counterfactual explanations.
XGBoost
A highly efficient gradient boosting decision tree algorithm commonly used for classification and regression tasks.
Used as the baseline model for classification tasks in the paper.
UCI Machine Learning Repository
A widely used collection of datasets for various machine learning tasks.
Used in the paper to select experimental datasets.
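To make the metric definitions above concrete, here is a minimal sketch of how sparsity, proximity, plausibility, and the trust score could be computed for a single counterfactual. The distance choices and normalizations are illustrative assumptions; the paper's exact formulations may differ.

```python
import numpy as np

def sparsity(x, cf):
    """Number of features changed (fewer is usually considered better)."""
    return int(np.sum(~np.isclose(x, cf)))

def proximity(x, cf):
    """L1 distance between the original instance and its counterfactual."""
    return float(np.sum(np.abs(x - cf)))

def plausibility(cf, X_train):
    """Distance to the nearest training instance as a rough realism proxy."""
    return float(np.min(np.linalg.norm(X_train - cf, axis=1)))

def trust_score(cf, X_train, y_train, predicted_class):
    """Ratio of the distance to the nearest other class over the distance
    to the predicted class; higher means the counterfactual sits closer
    to its predicted class than to any competing class."""
    d_pred = np.min(np.linalg.norm(X_train[y_train == predicted_class] - cf, axis=1))
    d_other = np.min(np.linalg.norm(X_train[y_train != predicted_class] - cf, axis=1))
    return float(d_other / (d_pred + 1e-12))

# Illustrative usage with toy arrays.
x  = np.array([1.0, 0.0, 2.0])
cf = np.array([1.0, 0.5, 2.0])
X_train = np.random.default_rng(0).normal(size=(50, 3))
y_train = (X_train[:, 0] > 0).astype(int)
print(sparsity(x, cf), proximity(x, cf),
      plausibility(cf, X_train), trust_score(cf, X_train, y_train, 1))
```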
Open Questions (unanswered questions from this research)
- Open question 1: Existing counterfactual evaluation metrics perform inconsistently across datasets. How can more universally applicable metrics be developed?
- Open question 2: How can the user-perceived quality of counterfactual explanations be improved without increasing computational complexity?
- Open question 3: In multi-class tasks, how can user preferences for counterfactual explanations be better captured?
- Open question 4: How can the effectiveness of counterfactual explanations be validated across different fields, especially in critical areas like healthcare and finance?
- Open question 5: How can insights from cognitive science be integrated to develop explanations that align more closely with human thinking?
- Open question 6: Are current user research methods sufficient to fully capture users' true perceptions of explanations?
- Open question 7: How can the balance between information richness and user comprehensibility be achieved in counterfactual explanations?
Applications
Immediate Applications
Medical Diagnosis
Counterfactual explanations can help doctors understand model diagnostic decisions, increasing transparency and trust.
Financial Decision-Making
In finance, counterfactual explanations can help users understand the decision process behind loan approvals or credit scoring.
Autonomous Driving
Counterfactual explanations can be used to analyze autonomous driving systems' decisions, helping engineers improve system safety and reliability.
Long-term Vision
Human-Computer Interaction
In the future, counterfactual explanations can be used to improve human-computer interaction, making AI systems more transparent and interpretable.
Education
Counterfactual explanations can be used in education to help students better understand complex concepts and problems.
Abstract
Explainability is widely regarded as essential for trustworthy artificial intelligence systems. However, the metrics commonly used to evaluate counterfactual explanations are algorithmic evaluation metrics that are rarely validated against human judgments of explanation quality. This raises the question of whether such metrics meaningfully reflect user perceptions. We address this question through an empirical study that directly compares algorithmic evaluation metrics with human judgments across three datasets. Participants rated counterfactual explanations along multiple dimensions of perceived quality, which we relate to a comprehensive set of standard counterfactual metrics. We analyze both individual relationships and the extent to which combinations of metrics can predict human assessments. Our results show that correlations between algorithmic metrics and human ratings are generally weak and strongly dataset-dependent. Moreover, increasing the number of metrics used in predictive models does not lead to reliable improvements, indicating structural limitations in how current metrics capture criteria relevant for humans. Overall, our findings suggest that widely used counterfactual evaluation metrics fail to reflect key aspects of explanation quality as perceived by users, underscoring the need for more human-centered approaches to evaluating explainable artificial intelligence.