Rethinking XAI Evaluation: A Human-Centered Audit of Shapley Benchmarks in High-Stakes Settings
Evaluates the human utility of eight Shapley value variants in high-stakes settings, revealing a misalignment between standard quantitative metrics and human perception.
Key Findings
Methodology
The study employs a unified amortized framework to eliminate implementation confounders, enabling fair comparison of eight Shapley value variants. The research is conducted across four risk datasets and a realistic fraud detection environment involving 37 professional analysts and 3,735 case reviews. This approach reveals a fundamental misalignment between quantitative metrics and human perception of clarity and decision utility.
Key Results
- Result 1: Standard quantitative metrics such as sparsity and faithfulness are decoupled from human-perceived clarity and decision utility. While explanations did not improve objective analyst performance, they consistently increased decision confidence, indicating a critical risk of automation bias in high-stakes settings.
- Result 2: Comparison of eight Shapley variants reveals no single variant dominates across all metrics. Fixed baseline variants perform well on Deletion AUC and Recall@3 but poorly on sparsity and contrastivity.
- Result 3: Empirical variants show balanced performance, while Conditional Shapley deviates from this pattern, producing dense, sensitive attributions that reflect correlations rather than model behavior.
Significance
This research is significant in the field of explainable AI, particularly in high-risk decision systems. By revealing the misalignment between current evaluation metrics and human utility, the study provides evidence-based guidance for selecting appropriate Shapley variants and evaluation metrics. This not only enhances the transparency and interpretability of AI systems but also mitigates the risk of automation bias.
Technical Contribution
Technical contributions include: 1) Proposing a unified amortized framework that eliminates implementation confounders for fair comparison of different Shapley variants; 2) Providing large-scale empirical analysis revealing the fundamental misalignment between quantitative metrics and human perception; 3) Offering evidence-based guidance for selecting Shapley values and evaluation metrics in high-risk decision systems.
Novelty
This paper is the first to systematically evaluate the human utility of different Shapley value variants in high-stakes settings, highlighting the inadequacies of current evaluation metrics. Unlike previous studies, it grounds evaluation in human perception through large-scale empirical analysis.
Limitations
- Limitation 1: The study primarily focuses on financial and fraud detection domains, and results may not generalize to vision or language domains where feature semantics may have different dynamics.
- Limitation 2: Experiments were conducted in controlled settings and therefore could not capture long-term effects such as learning, adaptation, or changes in institutional decision norms.
- Limitation 3: The application of certain Shapley variants on high-dimensional datasets may be limited due to computational complexity.
Future Work
Future research could extend to other domains such as vision and natural language processing to validate the utility of Shapley value variants in different applications. Additionally, developing new evaluation metrics to better predict human perception and decision utility is an important direction.
AI Executive Summary
In high-stakes domains like fraud detection and credit assessment, machine learning model predictions often require human decision-maker review. Explainable AI (XAI) methods, such as Shapley values, aim to enhance transparency by decomposing model predictions into feature-level contributions. However, the proliferation of Shapley value variants has created a fragmented landscape with little consensus on practical deployment.
This paper employs a unified amortized framework to eliminate implementation confounders, enabling fair comparison of eight Shapley value variants. The research is conducted across four risk datasets and a realistic fraud detection environment involving 37 professional analysts and 3,735 case reviews. Results show that standard quantitative metrics such as sparsity and faithfulness are decoupled from human-perceived clarity and decision utility. While explanations did not improve objective analyst performance, they consistently increased decision confidence, indicating a critical risk of automation bias in high-stakes settings.
By comparing eight Shapley variants, the study finds no single variant dominates across all metrics. Fixed baseline variants perform well on Deletion AUC and Recall@3 but poorly on sparsity and contrastivity. Empirical variants show balanced performance, while Conditional Shapley deviates from this pattern, producing dense, sensitive attributions that reflect correlations rather than model behavior.
This research is significant in the field of explainable AI, particularly in high-risk decision systems. By revealing the misalignment between current evaluation metrics and human utility, the study provides evidence-based guidance for selecting appropriate Shapley variants and evaluation metrics. This not only enhances the transparency and interpretability of AI systems but also mitigates the risk of automation bias.
However, the study has some limitations. First, it primarily focuses on financial and fraud detection domains, and results may not generalize to vision or language domains where feature semantics may have different dynamics. Second, experiments were conducted in controlled settings and therefore could not capture long-term effects such as learning, adaptation, or changes in institutional decision norms. Future research could extend to other domains such as vision and natural language processing to validate the utility of Shapley value variants across applications, and could develop new evaluation metrics that better predict human perception and decision utility.
Deep Analysis
Background
In high-stakes domains such as fraud detection, credit assessment, and healthcare, machine learning model predictions often require human decision-makers' review. In these settings, model outputs rarely constitute final decisions. Instead, predictions are reviewed by human decision-makers operating under time, attention, and regulatory constraints. As a result, explanations are viewed as indispensable for accountability and oversight and have become a core requirement in operational ML deployments. Despite widespread adoption, the practical value of explanations in human-in-the-loop workflows remains poorly understood, often assumed rather than empirically established. Among explanation methods, local approaches grounded in cooperative game theory, most notably Shapley values, have emerged as a cornerstone by providing an axiomatic decomposition of model predictions into feature-level contributions. However, the framework has fragmented into competing formulations based on divergent assumptions about the semantics of feature absence, realized in popular implementations such as KernelSHAP, TreeSHAP, and related tools. This raises a critical evaluation question for practitioners: does the choice of formulation matter to the end-user, and do standard evaluation procedures anticipate the impact?
Core Problem
Shapley values are a cornerstone of explainable AI, yet their proliferation into competing formulations has created a fragmented landscape with little consensus on practical deployment. While theoretical differences are well-documented, evaluation remains reliant on quantitative proxies whose alignment with human utility is unverified. Modern XAI evaluation relies on theoretical analysis and quantitative proxies, as well as mathematical distinctions between 'faithfulness' to the model and 'truthfulness' to the data. Yet systematic evidence regarding how explanation methods perform against human-centered benchmarks remains scarce. Existing evaluations often focus on isolated properties and rarely stress-test these metrics against human behavior under realistic operational constraints. Furthermore, comparisons are often confounded by implementation choices, which mask the true semantic differences between the definitions themselves.
Innovation
The core innovations of this paper include: 1) Proposing a unified amortized framework that eliminates implementation confounders for fair comparison of different Shapley variants; 2) Providing large-scale empirical analysis revealing the fundamental misalignment between quantitative metrics and human perception; 3) Offering evidence-based guidance for selecting Shapley values and evaluation metrics in high-risk decision systems. These innovations not only enhance the transparency and interpretability of AI systems but also mitigate the risk of automation bias.
Methodology
- Employ a unified amortized framework to eliminate implementation confounders, enabling fair comparison of eight Shapley value variants.
- Conduct the study across four risk datasets and a realistic fraud detection environment involving 37 professional analysts and 3,735 case reviews.
- This approach reveals a fundamental misalignment between quantitative metrics and human perception of clarity and decision utility.
- Evaluate Shapley attributions along two complementary axes: quantitative evaluation and a human-in-the-loop study.
- Use a compact set of metrics to capture functional properties, cross-formulation agreement, and downstream analyst behavior (a simple alignment check between the two axes is sketched after this list).
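To make the two-axis idea concrete, below is a minimal sketch (not taken from the paper) of how one might test whether a quantitative proxy such as sparsity tracks analysts' perceived clarity across variants. The variant names, metric scores, and clarity ratings are hypothetical placeholders.

```python
# Hypothetical check: does a quantitative proxy track human-perceived clarity?
import numpy as np
from scipy.stats import spearmanr

variants = ["fixed_zero", "fixed_mean", "marginal", "conditional"]  # illustrative subset
metric_by_variant = {"fixed_zero": 0.42, "fixed_mean": 0.47, "marginal": 0.61, "conditional": 0.18}
clarity_by_variant = {"fixed_zero": 3.1, "fixed_mean": 3.4, "marginal": 3.6, "conditional": 3.5}  # mean Likert ratings

scores = np.array([metric_by_variant[v] for v in variants])
ratings = np.array([clarity_by_variant[v] for v in variants])
rho, p = spearmanr(scores, ratings)
print(f"Spearman rho between sparsity and perceived clarity: {rho:.2f} (p={p:.2f})")
# A weak or unstable correlation is the kind of decoupling the paper reports.
```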
Experiments
The experimental design includes four risk datasets and a realistic fraud detection environment involving 37 professional analysts and 3,735 case reviews. Baselines include popular implementations like KernelSHAP and TreeSHAP. Evaluation metrics include sparsity, faithfulness, contrastivity, Deletion AUC, and Recall@3. The experiments also include ablation studies to analyze performance differences across different Shapley variants.
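For readers unfamiliar with the metrics named above, the following is a hedged sketch of how Deletion AUC and Recall@3 are commonly computed. The masking value, the `predict_proba` interface, and the notion of a reference feature set are assumptions rather than the paper's exact protocol.

```python
import numpy as np

def deletion_auc(model, x, attribution, baseline=0.0):
    """Mask features in decreasing order of |attribution| and track the positive-class score.

    A lower (normalized) area under the resulting curve usually indicates a more
    faithful feature ranking.
    """
    order = np.argsort(-np.abs(attribution))
    x_masked = x.astype(float).copy()
    curve = [model.predict_proba(x_masked[None, :])[0, 1]]
    for j in order:
        x_masked[j] = baseline                      # replace feature j with a reference value
        curve.append(model.predict_proba(x_masked[None, :])[0, 1])
    return float(np.mean(curve))                    # mean of the curve ~ normalized area

def recall_at_3(attribution, reference_features):
    """Fraction of reference 'important' features recovered among the top-3 attributions."""
    top3 = set(np.argsort(-np.abs(attribution))[:3].tolist())
    return len(top3 & set(reference_features)) / min(3, len(reference_features))
```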
Results
Experimental results show that standard quantitative metrics such as sparsity and faithfulness are decoupled from human-perceived clarity and decision utility. While explanations did not improve objective analyst performance, they consistently increased decision confidence, indicating a critical risk of automation bias in high-stakes settings. By comparing eight Shapley variants, the study finds no single variant dominates across all metrics. Fixed baseline variants perform well on Deletion AUC and Recall@3 but poorly on sparsity and contrastivity. Empirical variants show balanced performance, while Conditional Shapley deviates from this pattern, producing dense, sensitive attributions that reflect correlations rather than model behavior.
Applications
The application scenarios of this research include high-risk decision systems in financial and fraud detection domains. By revealing the misalignment between current evaluation metrics and human utility, the study provides evidence-based guidance for selecting appropriate Shapley variants and evaluation metrics. This not only enhances the transparency and interpretability of AI systems but also mitigates the risk of automation bias.
Limitations & Outlook
The study primarily focuses on financial and fraud detection domains, and results may not generalize to vision or language domains where feature semantics may have different dynamics. Additionally, experiments were conducted in controlled settings and therefore could not capture long-term effects such as learning, adaptation, or changes in institutional decision norms. Future research could extend to other domains such as vision and natural language processing to validate the utility of Shapley value variants across applications, and could develop new evaluation metrics that better predict human perception and decision utility.
Plain Language (Accessible to non-experts)
Imagine you're in a kitchen cooking a meal. You have a bunch of ingredients like carrots, potatoes, and chicken. You want to know how much each ingredient contributes to the final taste of the dish. Shapley values are like a chef's assistant that can tell you the importance of each ingredient in the dish. It considers combinations of all ingredients and tells you what the dish would taste like without carrots or if it would be better without potatoes.
In high-stakes environments like banks' fraud detection, Shapley values help analysts understand AI models' decisions. It's like having a transparent kitchen where analysts can see the impact of each feature (like transaction amount, location) on the model's judgment.
However, the problem is that different Shapley variants are like different chefs who might have different opinions on the ingredients. Some chefs might think carrots are crucial, while others might prioritize potatoes. This leads to situations where analysts might have varying confidence in the model's explanations in practical applications.
This study is like a cooking contest, evaluating the performance of different chefs (Shapley variants) to see which explanation aligns better with human intuition and needs.
ELI14 (Explained like you're 14)
Hey there! Did you know that in places like banks or hospitals, AI often helps make important decisions, like figuring out if a transaction is fraudulent or if a patient needs special care? To make these AI decisions more transparent, we need to know how they think.
Imagine you're playing a game, and AI is your teammate, telling you how dangerous each enemy is. Shapley values are like the AI's translator, explaining why it thinks a particular enemy is especially dangerous. It considers all possible combinations, just like in a game where you consider each teammate's role.
But different Shapley values are like different translators; some might make things clearer, while others might leave you more confused. This study compares these translators to see which explanations help you make better decisions in the game.
So next time you're making decisions in a game, think about these AIs and Shapley values—they're like your invisible assistants, helping you better understand the game world!
Glossary
Shapley Values
Shapley values are a cooperative game theory-based explanation method that decomposes model predictions into feature-level contributions.
Used in this paper to evaluate different variants' utility in high-risk settings.
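As a minimal illustration of the definition above, the sketch below computes an exact Shapley value by enumerating coalitions. Here `value_fn` is a placeholder for the payoff of a feature subset; how that payoff treats "absent" features is precisely what distinguishes the variants compared in this paper.

```python
from itertools import combinations
from math import comb

def shapley_value(i, features, value_fn):
    """Exact Shapley value of feature i: weighted average marginal contribution over coalitions."""
    others = [j for j in features if j != i]
    n = len(features)
    phi = 0.0
    for k in range(len(others) + 1):
        for S in combinations(others, k):
            weight = 1.0 / (n * comb(n - 1, k))     # equals |S|!(n-|S|-1)!/n!
            phi += weight * (value_fn(frozenset(S) | {i}) - value_fn(frozenset(S)))
    return phi
```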
KernelSHAP
KernelSHAP is a popular Shapley value implementation that estimates feature contributions using weighted least squares regression.
Used as one of the baselines to compare different Shapley variants.
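The weighting scheme below is the standard Shapley kernel used by KernelSHAP-style estimators; this is a simplified sketch showing only the coalition weight, not the full sampling and weighted least-squares fit.

```python
from math import comb

def shapley_kernel_weight(num_features, coalition_size):
    """pi(z) = (M - 1) / (C(M, |z|) * |z| * (M - |z|)); infinite at the two extremes."""
    M, k = num_features, coalition_size
    if k == 0 or k == M:
        return float("inf")   # the empty and full coalitions are enforced as constraints
    return (M - 1) / (comb(M, k) * k * (M - k))
```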
TreeSHAP
TreeSHAP is a Shapley value implementation specialized for tree-based models (decision trees and tree ensembles), exploiting the tree structure for efficient computation.
Used to evaluate the performance of different Shapley variants.
Faithfulness
Faithfulness refers to the consistency of an explanation method with the model's predictions, i.e., whether the explanation accurately reflects the model's decision process.
Used as one of the evaluation metrics to assess different Shapley variants' performance.
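One common way to operationalize faithfulness (a sketch, not necessarily the paper's definition) is to correlate attributions with the prediction change observed when each feature is replaced by a reference value. The model interface and the baseline value are assumptions.

```python
import numpy as np
from scipy.stats import pearsonr

def faithfulness_correlation(model, x, attribution, baseline=0.0):
    """Correlation between attributions and per-feature prediction drops under masking."""
    drops = []
    for j in range(len(x)):
        x_pert = x.astype(float).copy()
        x_pert[j] = baseline
        drops.append(model.predict_proba(x[None, :])[0, 1]
                     - model.predict_proba(x_pert[None, :])[0, 1])
    r, _ = pearsonr(attribution, np.array(drops))
    return r
```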
Sparsity
Sparsity measures how concentrated an explanation is on a few features; a sparser attribution has fewer non-negligible feature contributions and is often easier to read.
Used to evaluate the simplicity of different Shapley variants.
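A simple illustrative sparsity score, under the assumption that concentration of attribution mass is what matters, is the share of total mass carried by the top-k features (the choice of k here is arbitrary, not the paper's).

```python
import numpy as np

def top_k_mass(attribution, k=3):
    """Share of total |attribution| mass carried by the k largest features."""
    mags = np.abs(attribution)
    if mags.sum() == 0:
        return 0.0
    return float(np.sort(mags)[::-1][:k].sum() / mags.sum())
```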
Automation Bias
Automation bias refers to the risk of humans over-relying on automated systems, even when the system's decisions may not be accurate.
The study reveals the potential for automation bias due to Shapley value explanations.
Amortized Framework
An amortized framework replaces repeated per-instance estimation with a single shared explainer that produces attributions in one pass, eliminating implementation confounders so that different formulations can be compared on equal footing under low-latency constraints.
Used to eliminate implementation confounders for different Shapley variants.
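The sketch below illustrates the amortization idea as we understand it, not the paper's implementation: a surrogate regressor is fit to precomputed attributions so that a new explanation costs one forward pass. The data, architecture, and training setup are placeholders.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Placeholder data standing in for (inputs, precomputed attributions); in practice
# phi_train would come from a slow, high-quality Shapley estimator run offline.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 12))
phi_train = rng.normal(size=(1000, 12))

explainer = MLPRegressor(hidden_layer_sizes=(128, 128), max_iter=200, random_state=0)
explainer.fit(X_train, phi_train)          # learn an input -> attribution-vector mapping
phi_fast = explainer.predict(X_train[:5])  # one forward pass per explanation at deployment
print(phi_fast.shape)                      # (5, 12): one attribution vector per case
```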
Contrastivity
Contrastivity measures how much explanations differ across inputs or predicted outcomes; higher contrastivity indicates explanations that discriminate more strongly between cases.
Used to assess the sensitivity of different Shapley variants.
Empirical Variants
Empirical variants are Shapley value formulations that impute absent features from the empirical data distribution, preserving feature marginals rather than conditional dependencies.
Used as one of the Shapley variants for comparison.
Conditional Shapley
Conditional Shapley is a Shapley value formulation whose value function samples absent features conditionally on the observed ones, preserving empirical dependencies; it is challenging to estimate in high-dimensional settings.
Used to evaluate the performance of different Shapley variants.
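To make the contrast with the empirical (marginal) variants concrete, here is a hedged sketch of the two value functions. `sample_conditional` is a hypothetical helper; estimating it reliably is exactly what makes Conditional Shapley difficult in practice.

```python
import numpy as np

def marginal_value(model, x, present, X_background):
    """Absent features are drawn from background data, ignoring dependencies."""
    X_imputed = X_background.copy()
    X_imputed[:, list(present)] = x[list(present)]
    return model.predict_proba(X_imputed)[:, 1].mean()

def conditional_value(model, x, present, sample_conditional, n_samples=100):
    """Absent features are drawn conditionally on the observed ones (hypothetical sampler)."""
    X_imputed = sample_conditional(x, present, n_samples)
    return model.predict_proba(X_imputed)[:, 1].mean()
```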
Open Questions (Unanswered questions from this research)
- The application of current Shapley value variants on high-dimensional datasets is limited by computational complexity. Future research needs to develop more efficient algorithms to support applications on larger-scale datasets.
- Existing evaluation metrics, such as faithfulness and sparsity, have not been sufficiently validated for alignment with human perception. New metrics need to be developed to better predict human perception and decision utility.
- The utility of different Shapley variants in vision and natural language processing domains has not been fully validated. Future research could extend to these domains to verify their applicability.
- The long-term effects of automation bias remain unclear. Longitudinal studies are needed to assess the impact of automation bias on decision quality and human trust.
- How to improve the interpretability and transparency of Shapley value explanations without increasing computational complexity is a pressing issue.
Applications
Immediate Applications
Financial Fraud Detection
Shapley values can help analysts understand model decisions in fraud detection, enhancing decision transparency and accountability.
Credit Assessment
In credit assessment, Shapley values can explain model scores for different applicants, helping credit officers make more informed decisions.
Medical Diagnosis
In healthcare, Shapley values can explain model predictions of patient risk, aiding doctors in making more accurate diagnostic decisions.
Long-term Vision
Fully Transparent AI Systems
Improving Shapley values' explanatory capabilities could move AI systems toward full transparency, enhancing human trust in AI decisions.
Cross-Domain Explainable AI
In the future, Shapley values and their variants could expand to more domains like autonomous driving and smart homes, achieving broader applications.
Abstract
Shapley values are a cornerstone of explainable AI, yet their proliferation into competing formulations has created a fragmented landscape with little consensus on practical deployment. While theoretical differences are well-documented, evaluation remains reliant on quantitative proxies whose alignment with human utility is unverified. In this work, we use a unified amortized framework to isolate semantic differences between eight Shapley variants under the low-latency constraints of operational risk workflows. We conduct a large-scale empirical evaluation across four risk datasets and a realistic fraud-detection environment involving professional analysts and 3,735 case reviews. Our results reveal a fundamental misalignment: standard quantitative metrics, such as sparsity and faithfulness, are decoupled from human-perceived clarity and decision utility. Furthermore, while no formulation improved objective analyst performance, explanations consistently increased decision confidence, signaling a critical risk of automation bias in high-stakes settings. These findings suggest that current evaluation proxies are insufficient for predicting downstream human impact, and we provide evidence-based guidance for selecting formulations and metrics in operational decision systems.
References (20)
Explaining machine learning classifiers through diverse counterfactual explanations
Ramaravind Kommiya Mothilal, Amit Sharma, Chenhao Tan
Consistent Individualized Feature Attribution for Tree Ensembles
Scott M. Lundberg, Gabriel G. Erion, Su-In Lee
Stabilizing Estimates of Shapley Values with Control Variates
Jeremy Goldwasser, Giles Hooker
Transparency, auditability, and explainability of machine learning models in credit scoring
Michael Bücker, G. Szepannek, Alicja Gosiewska et al.
Beyond TreeSHAP: Efficient Computation of Any-Order Shapley Interactions for Tree Ensembles
Maximilian Muschalik, Fabian Fumagalli, Barbara Hammer et al.
How can I choose an explainer?: An Application-grounded Evaluation of Post-hoc Explanations
Sérgio Jesus, Catarina Belém, Vladimir Balayan et al.
Algorithmic Transparency via Quantitative Input Influence: Theory and Experiments with Learning Systems
Anupam Datta, S. Sen, Yair Zick
Fast TreeSHAP: Accelerating SHAP Value Computation for Trees
Jilei Yang
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
C. Rudin
Interpretable Machine Learning - A Brief History, State-of-the-Art and Challenges
Christoph Molnar, Giuseppe Casalicchio, B. Bischl
The Explanation Game: Explaining Machine Learning Models Using Shapley Values
Luke Merrick, Ankur Taly
Algorithms to estimate Shapley value feature attributions
Hugh Chen, Ian Covert, Scott M. Lundberg et al.
Interventionally Consistent Surrogates for Complex Simulation Models
Joel Dyer, Nicholas Bishop, Yorgos Felekis et al.
Ignore, Trust, or Negotiate: Understanding Clinician Acceptance of AI-Based Treatment Recommendations in Health Care
Venkatesh Sivaraman, L. Bukowski, J. Levin et al.
Causal Shapley Values: Exploiting Causal Knowledge to Explain Individual Predictions of Complex Models
T. Heskes, E. Sijben, I. G. Bucur et al.
Does the Whole Exceed its Parts? The Effect of AI Explanations on Complementary Team Performance
Gagan Bansal, Tongshuang Sherry Wu, Joyce Zhou et al.
Generalized Linear Models
E. Ziegel
Notions of explainability and evaluation approaches for explainable artificial intelligence
Giulia Vilone, Luca Longo
Why Tabular Foundation Models Should Be a Research Priority
B. V. Breugel, M. Schaar
The many Shapley values for model explanation
Mukund Sundararajan, A. Najmi