Rethinking XAI Evaluation: A Human-Centered Audit of Shapley Benchmarks in High-Stakes Settings
Evaluates the human utility of eight Shapley value variants in high-stakes settings, revealing a misalignment between standard quantitative metrics and human perception.
Key Findings
Methodology
The study employs a unified amortized framework to eliminate implementation confounders, enabling fair comparison of eight Shapley value variants. The research is conducted across four risk datasets and a realistic fraud detection environment involving 37 professional analysts and 3,735 case reviews. This approach reveals a fundamental misalignment between quantitative metrics and human perception of clarity and decision utility.
Key Results
- Result 1: Standard quantitative metrics such as sparsity and faithfulness are decoupled from human-perceived clarity and decision utility. While explanations did not improve objective analyst performance, they consistently increased decision confidence, indicating a critical risk of automation bias in high-stakes settings.
- Result 2: Comparison of eight Shapley variants reveals no single variant dominates across all metrics. Fixed baseline variants perform well on Deletion AUC and Recall@3 but poorly on sparsity and contrastivity.
- Result 3: Empirical variants show balanced performance, while Conditional Shapley deviates from this pattern, producing dense, sensitive attributions that reflect correlations rather than model behavior.
Significance
This research is significant in the field of explainable AI, particularly in high-risk decision systems. By revealing the misalignment between current evaluation metrics and human utility, the study provides evidence-based guidance for selecting appropriate Shapley variants and evaluation metrics. This not only enhances the transparency and interpretability of AI systems but also mitigates the risk of automation bias.
Technical Contribution
Technical contributions include: 1) Proposing a unified amortized framework that eliminates implementation confounders for fair comparison of different Shapley variants; 2) Providing large-scale empirical analysis revealing the fundamental misalignment between quantitative metrics and human perception; 3) Offering evidence-based guidance for selecting Shapley values and evaluation metrics in high-risk decision systems.
Novelty
This paper is the first to systematically evaluate the human utility of different Shapley value variants in high-stakes settings, highlighting the inadequacies of current evaluation metrics. Unlike previous studies, it grounds evaluation in human perception through large-scale empirical analysis.
Limitations
- Limitation 1: The study primarily focuses on financial and fraud detection domains, and results may not generalize to vision or language domains where feature semantics may have different dynamics.
- Limitation 2: Experiments were conducted in controlled settings and therefore could not capture long-term effects such as learning, adaptation, or changes in institutional decision norms.
- Limitation 3: The application of certain Shapley variants on high-dimensional datasets may be limited due to computational complexity.
Future Work
Future research could extend to other domains such as vision and natural language processing to validate the utility of Shapley value variants in different applications. Additionally, developing new evaluation metrics to better predict human perception and decision utility is an important direction.
AI Executive Summary
In high-stakes domains like fraud detection and credit assessment, machine learning model predictions often require human decision-maker review. Explainable AI (XAI) methods, such as Shapley values, aim to enhance transparency by decomposing model predictions into feature-level contributions. However, the proliferation of Shapley value variants has created a fragmented landscape with little consensus on practical deployment.
This paper employs a unified amortized framework to eliminate implementation confounders, enabling fair comparison of eight Shapley value variants. The research is conducted across four risk datasets and a realistic fraud detection environment involving 37 professional analysts and 3,735 case reviews. Results show that standard quantitative metrics such as sparsity and faithfulness are decoupled from human-perceived clarity and decision utility. While explanations did not improve objective analyst performance, they consistently increased decision confidence, indicating a critical risk of automation bias in high-stakes settings.
By comparing eight Shapley variants, the study finds no single variant dominates across all metrics. Fixed baseline variants perform well on Deletion AUC and Recall@3 but poorly on sparsity and contrastivity. Empirical variants show balanced performance, while Conditional Shapley deviates from this pattern, producing dense, sensitive attributions that reflect correlations rather than model behavior.
This research is significant in the field of explainable AI, particularly in high-risk decision systems. By revealing the misalignment between current evaluation metrics and human utility, the study provides evidence-based guidance for selecting appropriate Shapley variants and evaluation metrics. This not only enhances the transparency and interpretability of AI systems but also mitigates the risk of automation bias.
However, the study has some limitations. First, it primarily focuses on financial and fraud detection domains, and results may not generalize to vision or language domains where feature semantics may have different dynamics. Second, experiments were conducted in controlled settings and therefore could not capture long-term effects such as learning, adaptation, or changes in institutional decision norms. Future research could extend to other domains such as vision and natural language processing to validate the utility of Shapley value variants across applications, and could develop new evaluation metrics that better predict human perception and decision utility.
Deep Analysis
Background
In high-stakes domains such as fraud detection, credit assessment, and healthcare, machine learning model predictions often require human decision-makers' review. In these settings, model outputs rarely constitute final decisions. Instead, predictions are reviewed by human decision-makers operating under time, attention, and regulatory constraints. As a result, explanations are viewed as indispensable for accountability and oversight and have become a core requirement in operational ML deployments. Despite widespread adoption, the practical value of explanations in human-in-the-loop workflows remains poorly understood, often assumed rather than empirically established. Among explanation methods, local approaches grounded in cooperative game theory, most notably Shapley values, have emerged as a cornerstone by providing an axiomatic decomposition of model predictions into feature-level contributions. However, the framework has fragmented into competing formulations based on divergent assumptions about the semantics of feature absence, realized in popular implementations such as KernelSHAP, TreeSHAP, and related tools. This raises a critical evaluation question for practitioners: does the choice of formulation matter to the end-user, and do standard evaluation procedures anticipate the impact?
Core Problem
Shapley values are a cornerstone of explainable AI, yet their proliferation into competing formulations has created a fragmented landscape with little consensus on practical deployment. While theoretical differences are well-documented, evaluation remains reliant on quantitative proxies whose alignment with human utility is unverified. Modern XAI evaluation relies on theoretical analysis and quantitative proxies, as well as mathematical distinctions between 'faithfulness' to the model and 'truthfulness' to the data. Yet systematic evidence regarding how explanation methods perform against human-centered benchmarks remains scarce. Existing evaluations often focus on isolated properties and rarely stress-test these metrics against human behavior under realistic operational constraints. Furthermore, comparisons are often confounded by implementation choices, which mask the true semantic differences between the definitions themselves.
Innovation
The core innovations of this paper include: 1) Proposing a unified amortized framework that eliminates implementation confounders for fair comparison of different Shapley variants; 2) Providing large-scale empirical analysis revealing the fundamental misalignment between quantitative metrics and human perception; 3) Offering evidence-based guidance for selecting Shapley values and evaluation metrics in high-risk decision systems. These innovations not only enhance the transparency and interpretability of AI systems but also mitigate the risk of automation bias.
Methodology
- Employ a unified amortized framework to eliminate implementation confounders, enabling fair comparison of eight Shapley value variants.
- Conduct the study across four risk datasets and a realistic fraud detection environment involving 37 professional analysts and 3,735 case reviews.
- This approach reveals a fundamental misalignment between quantitative metrics and human perception of clarity and decision utility.
- Evaluate Shapley attributions along two complementary axes: quantitative evaluation and a human-in-the-loop study.
- Use a compact set of metrics to capture functional properties, cross-formulation agreement, and downstream analyst behavior (a simple alignment check between the two axes is sketched after this list).
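To make the two-axis idea concrete, below is a minimal sketch (not taken from the paper) of how one might test whether a quantitative proxy such as sparsity tracks analysts' perceived clarity across variants. The variant names, metric scores, and clarity ratings are hypothetical placeholders.

```python
# Hypothetical check: does a quantitative proxy track human-perceived clarity?
import numpy as np
from scipy.stats import spearmanr

variants = ["fixed_zero", "fixed_mean", "marginal", "conditional"]  # illustrative subset
metric_by_variant = {"fixed_zero": 0.42, "fixed_mean": 0.47, "marginal": 0.61, "conditional": 0.18}
clarity_by_variant = {"fixed_zero": 3.1, "fixed_mean": 3.4, "marginal": 3.6, "conditional": 3.5}  # mean Likert ratings

scores = np.array([metric_by_variant[v] for v in variants])
ratings = np.array([clarity_by_variant[v] for v in variants])
rho, p = spearmanr(scores, ratings)
print(f"Spearman rho between sparsity and perceived clarity: {rho:.2f} (p={p:.2f})")
# A weak or unstable correlation is the kind of decoupling the paper reports.
```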
Experiments
The experimental design includes four risk datasets and a realistic fraud detection environment involving 37 professional analysts and 3,735 case reviews. Baselines include popular implementations like KernelSHAP and TreeSHAP. Evaluation metrics include sparsity, faithfulness, contrastivity, Deletion AUC, and Recall@3. The experiments also include ablation studies to analyze performance differences across different Shapley variants.
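For readers unfamiliar with the metrics named above, the following is a hedged sketch of how Deletion AUC and Recall@3 are commonly computed. The masking value, the `predict_proba` interface, and the notion of a reference feature set are assumptions rather than the paper's exact protocol.

```python
import numpy as np

def deletion_auc(model, x, attribution, baseline=0.0):
    """Mask features in decreasing order of |attribution| and track the positive-class score.

    A lower (normalized) area under the resulting curve usually indicates a more
    faithful feature ranking.
    """
    order = np.argsort(-np.abs(attribution))
    x_masked = x.astype(float).copy()
    curve = [model.predict_proba(x_masked[None, :])[0, 1]]
    for j in order:
        x_masked[j] = baseline                      # replace feature j with a reference value
        curve.append(model.predict_proba(x_masked[None, :])[0, 1])
    return float(np.mean(curve))                    # mean of the curve ~ normalized area

def recall_at_3(attribution, reference_features):
    """Fraction of reference 'important' features recovered among the top-3 attributions."""
    top3 = set(np.argsort(-np.abs(attribution))[:3].tolist())
    return len(top3 & set(reference_features)) / min(3, len(reference_features))
```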
Results
Experimental results show that standard quantitative metrics such as sparsity and faithfulness are decoupled from human-perceived clarity and decision utility. While explanations did not improve objective analyst performance, they consistently increased decision confidence, indicating a critical risk of automation bias in high-stakes settings. By comparing eight Shapley variants, the study finds no single variant dominates across all metrics. Fixed baseline variants perform well on Deletion AUC and Recall@3 but poorly on sparsity and contrastivity. Empirical variants show balanced performance, while Conditional Shapley deviates from this pattern, producing dense, sensitive attributions that reflect correlations rather than model behavior.
Applications
The application scenarios of this research include high-risk decision systems in financial and fraud detection domains. By revealing the misalignment between current evaluation metrics and human utility, the study provides evidence-based guidance for selecting appropriate Shapley variants and evaluation metrics. This not only enhances the transparency and interpretability of AI systems but also mitigates the risk of automation bias.
Limitations & Outlook
The study primarily focuses on financial and fraud detection domains, and results may not generalize to vision or language domains where feature semantics may have different dynamics. Additionally, experiments were conducted in controlled settings and therefore could not capture long-term effects such as learning, adaptation, or changes in institutional decision norms. Future research could extend to other domains such as vision and natural language processing to validate the utility of Shapley value variants across applications, and could develop new evaluation metrics that better predict human perception and decision utility.
Plain Language (Accessible to non-experts)
Imagine you're in a kitchen cooking a meal. You have a bunch of ingredients like carrots, potatoes, and chicken. You want to know how much each ingredient contributes to the final taste of the dish. Shapley values are like a chef's assistant that can tell you the importance of each ingredient in the dish. It considers combinations of all ingredients and tells you what the dish would taste like without carrots or if it would be better without potatoes.
In high-stakes environments like banks' fraud detection, Shapley values help analysts understand AI models' decisions. It's like having a transparent kitchen where analysts can see the impact of each feature (like transaction amount, location) on the model's judgment.
However, the problem is that different Shapley variants are like different chefs who might have different opinions on the ingredients. Some chefs might think carrots are crucial, while others might prioritize potatoes. This leads to situations where analysts might have varying confidence in the model's explanations in practical applications.
This study is like a cooking contest, evaluating the performance of different chefs (Shapley variants) to see which explanation aligns better with human intuition and needs.
ELI14 (Explained like you're 14)
Hey there! Did you know that in places like banks or hospitals, AI often helps make important decisions, like figuring out if a transaction is fraudulent or if a patient needs special care? To make these AI decisions more transparent, we need to know how they think.
Imagine you're playing a game, and AI is your teammate, telling you how dangerous each enemy is. Shapley values are like the AI's translator, explaining why it thinks a particular enemy is especially dangerous. It considers all possible combinations, just like in a game where you consider each teammate's role.
But different Shapley values are like different translators; some might make things clearer, while others might leave you more confused. This study compares these translators to see which explanations help you make better decisions in the game.
So next time you're making decisions in a game, think about these AIs and Shapley values—they're like your invisible assistants, helping you better understand the game world!
Glossary
Shapley Values
Shapley values are a cooperative game theory-based explanation method that decomposes model predictions into feature-level contributions.
Used in this paper to evaluate different variants' utility in high-risk settings.
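As a minimal illustration of the definition above, the sketch below computes an exact Shapley value by enumerating coalitions. Here `value_fn` is a placeholder for the payoff of a feature subset; how that payoff treats "absent" features is precisely what distinguishes the variants compared in this paper.

```python
from itertools import combinations
from math import comb

def shapley_value(i, features, value_fn):
    """Exact Shapley value of feature i: weighted average marginal contribution over coalitions."""
    others = [j for j in features if j != i]
    n = len(features)
    phi = 0.0
    for k in range(len(others) + 1):
        for S in combinations(others, k):
            weight = 1.0 / (n * comb(n - 1, k))     # equals |S|!(n-|S|-1)!/n!
            phi += weight * (value_fn(frozenset(S) | {i}) - value_fn(frozenset(S)))
    return phi
```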
KernelSHAP
KernelSHAP is a popular Shapley value implementation that estimates feature contributions using weighted least squares regression.
Used as one of the baselines to compare different Shapley variants.
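The weighting scheme below is the standard Shapley kernel used by KernelSHAP-style estimators; this is a simplified sketch showing only the coalition weight, not the full sampling and weighted least-squares fit.

```python
from math import comb

def shapley_kernel_weight(num_features, coalition_size):
    """pi(z) = (M - 1) / (C(M, |z|) * |z| * (M - |z|)); infinite at the two extremes."""
    M, k = num_features, coalition_size
    if k == 0 or k == M:
        return float("inf")   # the empty and full coalitions are enforced as constraints
    return (M - 1) / (comb(M, k) * k * (M - k))
```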
TreeSHAP
TreeSHAP is a Shapley value implementation specialized for tree-based models (decision trees and tree ensembles), exploiting the tree structure for efficient computation.
Used to evaluate the performance of different Shapley variants.
Faithfulness
Faithfulness refers to the consistency of an explanation method with the model's predictions, i.e., whether the explanation accurately reflects the model's decision process.
Used as one of the evaluation metrics to assess different Shapley variants' performance.
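One common way to operationalize faithfulness (a sketch, not necessarily the paper's definition) is to correlate attributions with the prediction change observed when each feature is replaced by a reference value. The model interface and the baseline value are assumptions.

```python
import numpy as np
from scipy.stats import pearsonr

def faithfulness_correlation(model, x, attribution, baseline=0.0):
    """Correlation between attributions and per-feature prediction drops under masking."""
    drops = []
    for j in range(len(x)):
        x_pert = x.astype(float).copy()
        x_pert[j] = baseline
        drops.append(model.predict_proba(x[None, :])[0, 1]
                     - model.predict_proba(x_pert[None, :])[0, 1])
    r, _ = pearsonr(attribution, np.array(drops))
    return r
```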
Sparsity
Sparsity measures how concentrated an explanation is on a few features; a sparser attribution has fewer non-negligible feature contributions and is often easier to read.
Used to evaluate the simplicity of different Shapley variants.
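A simple illustrative sparsity score, under the assumption that concentration of attribution mass is what matters, is the share of total mass carried by the top-k features (the choice of k here is arbitrary, not the paper's).

```python
import numpy as np

def top_k_mass(attribution, k=3):
    """Share of total |attribution| mass carried by the k largest features."""
    mags = np.abs(attribution)
    if mags.sum() == 0:
        return 0.0
    return float(np.sort(mags)[::-1][:k].sum() / mags.sum())
```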
Automation Bias
Automation bias refers to the risk of humans over-relying on automated systems, even when the system's decisions may not be accurate.
The study reveals the potential for automation bias due to Shapley value explanations.
Amortized Framework
An amortized framework replaces repeated per-instance estimation with a single shared explainer that produces attributions in one pass, eliminating implementation confounders so that different formulations can be compared on equal footing under low-latency constraints.
Used to eliminate implementation confounders for different Shapley variants.
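The sketch below illustrates the amortization idea as we understand it, not the paper's implementation: a surrogate regressor is fit to precomputed attributions so that a new explanation costs one forward pass. The data, architecture, and training setup are placeholders.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Placeholder data standing in for (inputs, precomputed attributions); in practice
# phi_train would come from a slow, high-quality Shapley estimator run offline.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 12))
phi_train = rng.normal(size=(1000, 12))

explainer = MLPRegressor(hidden_layer_sizes=(128, 128), max_iter=200, random_state=0)
explainer.fit(X_train, phi_train)          # learn an input -> attribution-vector mapping
phi_fast = explainer.predict(X_train[:5])  # one forward pass per explanation at deployment
print(phi_fast.shape)                      # (5, 12): one attribution vector per case
```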
Contrastivity
Contrastivity measures how much explanations differ across inputs or predicted outcomes; higher contrastivity indicates explanations that discriminate more strongly between cases.
Used to assess the sensitivity of different Shapley variants.
Empirical Variants
Empirical variants are Shapley value formulations that impute absent features from the empirical data distribution, preserving feature marginals rather than conditional dependencies.
Used as one of the Shapley variants for comparison.
Conditional Shapley
Conditional Shapley is a Shapley value formulation whose value function samples absent features conditionally on the observed ones, preserving empirical dependencies; it is challenging to estimate in high-dimensional settings.
Used to evaluate the performance of different Shapley variants.
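To make the contrast with the empirical (marginal) variants concrete, here is a hedged sketch of the two value functions. `sample_conditional` is a hypothetical helper; estimating it reliably is exactly what makes Conditional Shapley difficult in practice.

```python
import numpy as np

def marginal_value(model, x, present, X_background):
    """Absent features are drawn from background data, ignoring dependencies."""
    X_imputed = X_background.copy()
    X_imputed[:, list(present)] = x[list(present)]
    return model.predict_proba(X_imputed)[:, 1].mean()

def conditional_value(model, x, present, sample_conditional, n_samples=100):
    """Absent features are drawn conditionally on the observed ones (hypothetical sampler)."""
    X_imputed = sample_conditional(x, present, n_samples)
    return model.predict_proba(X_imputed)[:, 1].mean()
```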
Open Questions (Unanswered questions from this research)
- The application of current Shapley value variants on high-dimensional datasets is limited by computational complexity. Future research needs to develop more efficient algorithms to support applications on larger-scale datasets.
- Existing evaluation metrics, such as faithfulness and sparsity, have not been sufficiently validated for alignment with human perception. New metrics need to be developed to better predict human perception and decision utility.
- The utility of different Shapley variants in vision and natural language processing domains has not been fully validated. Future research could extend to these domains to verify their applicability.
- The long-term effects of automation bias remain unclear. Longitudinal studies are needed to assess the impact of automation bias on decision quality and human trust.
- How to improve the interpretability and transparency of Shapley value explanations without increasing computational complexity is a pressing issue.
Applications
Immediate Applications
Financial Fraud Detection
Shapley values can help analysts understand model decisions in fraud detection, enhancing decision transparency and accountability.
Credit Assessment
In credit assessment, Shapley values can explain model scores for different applicants, helping credit officers make more informed decisions.
Medical Diagnosis
In healthcare, Shapley values can explain model predictions of patient risk, aiding doctors in making more accurate diagnostic decisions.
Long-term Vision
Fully Transparent AI Systems
Improving Shapley values' explanatory capabilities could move AI systems toward full transparency, enhancing human trust in AI decisions.
Cross-Domain Explainable AI
In the future, Shapley values and their variants could expand to more domains like autonomous driving and smart homes, achieving broader applications.
Abstract
Shapley values are a cornerstone of explainable AI, yet their proliferation into competing formulations has created a fragmented landscape with little consensus on practical deployment. While theoretical differences are well-documented, evaluation remains reliant on quantitative proxies whose alignment with human utility is unverified. In this work, we use a unified amortized framework to isolate semantic differences between eight Shapley variants under the low-latency constraints of operational risk workflows. We conduct a large-scale empirical evaluation across four risk datasets and a realistic fraud-detection environment involving professional analysts and 3,735 case reviews. Our results reveal a fundamental misalignment: standard quantitative metrics, such as sparsity and faithfulness, are decoupled from human-perceived clarity and decision utility. Furthermore, while no formulation improved objective analyst performance, explanations consistently increased decision confidence, signaling a critical risk of automation bias in high-stakes settings. These findings suggest that current evaluation proxies are insufficient for predicting downstream human impact, and we provide evidence-based guidance for selecting formulations and metrics in operational decision systems.
References (20)
Explaining machine learning classifiers through diverse counterfactual explanations
Ramaravind Kommiya Mothilal, Amit Sharma, Chenhao Tan
Consistent Individualized Feature Attribution for Tree Ensembles
Scott M. Lundberg, Gabriel G. Erion, Su-In Lee
Stabilizing Estimates of Shapley Values with Control Variates
Jeremy Goldwasser, Giles Hooker
Transparency, auditability, and explainability of machine learning models in credit scoring
Michael Bücker, G. Szepannek, Alicja Gosiewska et al.
Beyond TreeSHAP: Efficient Computation of Any-Order Shapley Interactions for Tree Ensembles
Maximilian Muschalik, Fabian Fumagalli, Barbara Hammer et al.
How can I choose an explainer?: An Application-grounded Evaluation of Post-hoc Explanations
Sérgio Jesus, Catarina Belém, Vladimir Balayan et al.
Algorithmic Transparency via Quantitative Input Influence: Theory and Experiments with Learning Systems
Anupam Datta, S. Sen, Yair Zick
Fast TreeSHAP: Accelerating SHAP Value Computation for Trees
Jilei Yang
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
C. Rudin
Interpretable Machine Learning - A Brief History, State-of-the-Art and Challenges
Christoph Molnar, Giuseppe Casalicchio, B. Bischl
The Explanation Game: Explaining Machine Learning Models Using Shapley Values
Luke Merrick, Ankur Taly
Algorithms to estimate Shapley value feature attributions
Hugh Chen, Ian Covert, Scott M. Lundberg et al.
Interventionally Consistent Surrogates for Complex Simulation Models
Joel Dyer, Nicholas Bishop, Yorgos Felekis et al.
Ignore, Trust, or Negotiate: Understanding Clinician Acceptance of AI-Based Treatment Recommendations in Health Care
Venkatesh Sivaraman, L. Bukowski, J. Levin et al.
Causal Shapley Values: Exploiting Causal Knowledge to Explain Individual Predictions of Complex Models
T. Heskes, E. Sijben, I. G. Bucur et al.
Does the Whole Exceed its Parts? The Effect of AI Explanations on Complementary Team Performance
Gagan Bansal, Tongshuang Sherry Wu, Joyce Zhou et al.
Generalized Linear Models
E. Ziegel
Notions of explainability and evaluation approaches for explainable artificial intelligence
Giulia Vilone, Luca Longo
Why Tabular Foundation Models Should Be a Research Priority
B. V. Breugel, M. Schaar
The many Shapley values for model explanation
Mukund Sundararajan, A. Najmi