SHAP-Based Supervised Clustering for Sample Classification and the Generalized Waterfall Plot

Lin, Fukuyama

In this growing age of data and technology, large black-box models are becoming the norm due to their ability to handle vast amounts of data and learn incredibly complex input-output relationships. The deficiency of these methods, however, is their inability to explain the prediction process, making them untrustworthy and their use precarious in high-stakes situations. SHapley Additive exPlanations (SHAP) analysis is an explainable AI method growing in popularity for its ability to explain model predictions in terms of the original features. For each sample and feature in the data set, we associate a SHAP value that quantifies the contribution of that feature to the prediction of that sample. Clustering these SHAP values can provide insight into the data by grouping samples that not only received the same prediction, but received the same prediction for similar reasons. In doing so, we map the various pathways through which distinct samples arrive at the same prediction. To showcase this methodology, we present a simulated experiment in addition to a case study in Alzheimer's disease using data from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database. We also present a novel generalization of the waterfall plot for multi-classification.

Basic Information

Paper ID: 2510.08737
Title: SHAP-Based Supervised Clustering for Sample Classification and the Generalized Waterfall Plot
Authors: Justin Lin (Indiana University Mathematics Department), Julia Fukuyama (Indiana University Statistics Department)
Classification: cs.LG, stat.ME, stat.ML
Publication Date: October 9, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2510.08737v1

Abstract

In an era of rapid data and technological advancement, large-scale black-box models have become mainstream due to their capacity to process massive datasets and learn complex input-output relationships. However, these methods suffer from a critical limitation: the inability to interpret the prediction process, rendering their application in high-risk scenarios unreliable and potentially dangerous. SHAP (SHapley Additive exPlanations) analysis, as an interpretable AI methodology, has gained increasing popularity for its ability to explain model predictions using original features. This paper proposes clustering analysis on SHAP values, which not only groups samples receiving identical predictions but, more importantly, groups samples obtaining the same predictions for similar reasons. The method's effectiveness is demonstrated through simulation experiments and an Alzheimer's disease case study (using the ADNI database), along with a proposed generalization of waterfall plots for multi-classification problems.

Research Background and Motivation

Problem Definition

As machine learning model complexity continues to increase, black-box models demonstrate superior predictive accuracy; however, their lack of interpretability creates barriers to application in high-risk domains such as healthcare. Traditional clustering analysis, based solely on original data features, cannot reveal the different pathways through which samples arrive at identical prediction outcomes.

Research Significance

Medical Application Needs: In heterogeneous diseases such as Alzheimer's disease, different patients may reach identical diagnostic conclusions through completely different pathological mechanisms
Precision Medicine: Understanding disease heterogeneity facilitates the development of personalized treatment plans
Model Interpretability: In high-risk decision scenarios, understanding the rationale behind model predictions is paramount

Limitations of Existing Methods

Traditional Clustering Methods: Based solely on original data features, unable to capture complex input-output relationships learned by models
Scarcity of SHAP Value Clustering Research: Existing literature contains extremely limited research on SHAP value clustering
Insufficient Visualization Tools: Lack of effective SHAP value visualization methods for multi-classification problems

Core Contributions

Proposed SHAP-based Supervised Clustering Method: Performs clustering based on SHAP values rather than raw data, revealing different pathways through which samples arrive at identical predictions
Developed High-Dimensional Waterfall Plot: Generalizes traditional waterfall plots to multi-classification problems, supporting visualization of k-dimensional SHAP vectors
Provided Complete Analysis Pipeline: Encompasses a five-step workflow including predictive modeling, SHAP analysis, visualization, clustering analysis, and cluster interpretation
Validated Method Effectiveness: Verified practical utility through simulation experiments and real-world Alzheimer's disease case studies

Methodology Details

Task Definition

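Given a training dataset X' ⊂ X ⊂ R^p and a trained model f: X → R, compute SHAP values φ(f;x)₁, ..., φ(f;x)ₚ for each sample x ∈ X such that the values sum to the difference between the sample's prediction and the average prediction over the training data: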

$\sum_{i=1}^{p} \phi(f;x)_i = f(x) - E[f(X')]$

The objective is to cluster the SHAP value matrix to discover sample groups with similar model explanations.
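
The additivity property above can be checked numerically on a toy model. The following is a minimal sketch, not the paper's implementation: it computes exact Shapley values by enumerating all feature coalitions, using an interventional value function over a small background set. The function name `shapley_values`, the toy model `f`, and the background rows are hypothetical and chosen only for illustration. Brute-force enumeration is feasible only for a handful of features, which is why tree-specific algorithms such as TreeSHAP are used in practice.

```python
from itertools import combinations
from math import factorial


def shapley_values(f, x, background):
    """Exact Shapley values of f at x (cost is exponential in len(x)).

    Value function (an assumption of this sketch): v(S) is the average of f
    over the background rows after the features in S are set to x's values.
    """
    p = len(x)

    def value(subset):
        total = 0.0
        for row in background:
            mixed = [x[i] if i in subset else row[i] for i in range(p)]
            total += f(mixed)
        return total / len(background)

    phi = []
    for i in range(p):
        others = [j for j in range(p) if j != i]
        contrib = 0.0
        for size in range(p):
            # Shapley weight for coalitions of this size that exclude feature i.
            weight = factorial(size) * factorial(p - size - 1) / factorial(p)
            for subset in combinations(others, size):
                s = set(subset)
                contrib += weight * (value(s | {i}) - value(s))
        phi.append(contrib)
    return phi


# Hypothetical toy model with an interaction between the first two features.
def f(z):
    return 4 * z[0] * z[1] + 4 * z[0] + 4 * z[1] + 0.5 * z[2]


background = [[0.0, 0.0, 0.0], [1.0, -1.0, 2.0], [-1.0, 1.0, 0.5], [0.5, 0.5, -1.0]]
x = [1.0, 2.0, -1.0]

phi = shapley_values(f, x, background)
baseline = sum(f(row) for row in background) / len(background)
# Additivity: the SHAP values sum to f(x) minus the average background prediction.
gap = sum(phi) - (f(x) - baseline)
```

Here `gap` is zero up to floating-point error, which is the additivity identity with the background set playing the role of X'.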

Supervised Clustering Workflow

1. Predictive Modeling

Construct predictive models using XGBoost
Ensure model generalization performance through repeated cross-validation

2. SHAP Analysis

Binary Classification: Each feature corresponds to one SHAP value
Multi-Classification: Each feature corresponds to a k-dimensional SHAP vector (k = number of classes)
Employ TreeSHAP algorithm to compute SHAP values for tree models
Compute SHAP values under (repeated) cross-validation to limit overfitting of the explanations

3. Visualization

Apply UMAP for dimensionality reduction and visualization
UMAP preserves local structure, which makes it suitable for detecting clusters

4. Clustering Analysis

Employ HDBSCAN for hierarchical density-based clustering
Handles noise points and clusters of varying density

5. Cluster Interpretation

Analyze original data using heatmaps
Explain clusters using high-dimensional waterfall plots
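
The first four steps of the workflow can be sketched as follows. This is a hedged outline under several assumptions rather than the authors' code: it uses a single round of stratified k-fold instead of repeated cross-validation, flattens each sample's SHAP values into one row of length p·k, and clusters the SHAP matrix directly (clustering the UMAP embedding is an equally plausible reading). The function names `flatten_shap` and `shap_cluster_pipeline` and all parameter defaults are hypothetical. `X` and `y` are assumed to be NumPy arrays, with `y` integer-encoded as 0..k-1.

```python
import numpy as np


def flatten_shap(values):
    """Return an (n, p*k) matrix from multi-class SHAP output.

    Accepts either a list of k arrays of shape (n, p) or one array of shape
    (n, p, k); which form is returned depends on the SHAP release.
    """
    arr = np.stack(values, axis=-1) if isinstance(values, list) else np.asarray(values)
    if arr.ndim == 2:  # binary case: already one value per feature
        return arr
    return arr.reshape(arr.shape[0], -1)


def shap_cluster_pipeline(X, y, n_splits=5, min_cluster_size=20, seed=0):
    # Third-party imports are local so flatten_shap stays usable without them.
    import hdbscan
    import shap
    import umap
    import xgboost
    from sklearn.model_selection import StratifiedKFold

    shap_matrix = None
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in folds.split(X, y):
        # Step 1: predictive model fitted on the training fold.
        model = xgboost.XGBClassifier(random_state=seed).fit(X[train_idx], y[train_idx])
        # Step 2: TreeSHAP values for the held-out rows only (out-of-fold).
        fold_vals = flatten_shap(shap.TreeExplainer(model).shap_values(X[test_idx]))
        if shap_matrix is None:
            shap_matrix = np.zeros((len(X), fold_vals.shape[1]))
        shap_matrix[test_idx] = fold_vals

    # Step 3: two-dimensional UMAP embedding for visualization.
    embedding = umap.UMAP(n_components=2, random_state=seed).fit_transform(shap_matrix)
    # Step 4: HDBSCAN cluster labels; -1 marks points treated as noise.
    labels = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(shap_matrix)
    return shap_matrix, embedding, labels
```

Step 5 (interpretation) would then compare the returned `labels` against the original features and the per-cluster SHAP values.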

High-Dimensional Waterfall Plot Innovation

Limitations of Traditional Waterfall Plots

Traditional waterfall plots are applicable only to one-dimensional SHAP values and cannot handle k-dimensional SHAP vectors in multi-classification scenarios.

Solution

Projection to Class Subspace: Select two classes and ignore SHAP values of other classes, suitable for pairwise class comparisons
PCA Projection: Project onto the two-dimensional subspace that retains the most information; this preserves all k classes, but the resulting axes are harder to interpret

Mathematical Representation

View SHAP vector sequences as paths in k-dimensional space, where each path segment corresponds to a feature's contribution, originating from the average prediction point and reaching the sample's specific prediction point.
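
The path construction and the two projections can be illustrated with a short sketch. The function names below are hypothetical, and fitting the PCA directions on the SHAP vectors passed in as `shap_matrix` is an assumption of this sketch; the paper may fit the projection on a different set of vectors. The random SHAP vectors stand in for real model output.

```python
import numpy as np


def waterfall_path(base, shap_vectors):
    """Path vertices: base, base + phi_1, ..., base + sum of all phi_i.

    base has shape (k,), shap_vectors has shape (p, k); result is (p + 1, k).
    """
    steps = np.vstack([np.zeros_like(base), shap_vectors])
    return base + np.cumsum(steps, axis=0)


def project_to_classes(path, a, b):
    """Class-subspace view: keep only the coordinates of classes a and b."""
    return path[:, [a, b]]


def project_pca(path, shap_matrix):
    """PCA view: project the path onto the top two principal directions
    of the SHAP vectors in shap_matrix, which has shape (m, k)."""
    centered = shap_matrix - shap_matrix.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return path @ vt[:2].T


# Placeholder inputs: k = 3 classes, p = 4 features.
rng = np.random.default_rng(0)
shap_vectors = rng.normal(size=(4, 3))
base = np.array([0.2, -0.1, -0.1])  # stands in for the average prediction

path = waterfall_path(base, shap_vectors)
pair_view = project_to_classes(path, 0, 2)
pca_view = project_pca(path, shap_vectors)
```

Each consecutive pair of rows in `pair_view` or `pca_view` is one segment of the plot, that is, one feature's contribution drawn in two dimensions.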

Experimental Setup

Datasets

Simulation Data

Generative Model: Multinomial logistic regression
Sample Size: 1,500 samples, 10-dimensional features
Design Rationale: Create different pathways to reach identical target classes
Function Definition:
- f₁(x) = 4x₁x₂ + 4x₁ + 4x₂ + Σβ₁,ᵢxᵢ
- f₂(x) = 4x₁x₂ - 4x₁ - 4x₂ + Σβ₂,ᵢxᵢ
- where βⱼ,ᵢ ~ N(0,1)
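
A data-generating process of this form might be simulated as sketched below. Several details are assumptions not stated above: features are drawn from a standard normal distribution, the sums run over all 10 features, and a third reference class has logit 0. The class indices produced here are arbitrary and need not match the class numbering used in the results.

```python
import numpy as np


def simulate(n=1500, p=10, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, p))      # assumed feature distribution
    beta = rng.normal(size=(2, p))   # beta_{j,i} ~ N(0, 1)
    x1, x2 = X[:, 0], X[:, 1]
    f1 = 4 * x1 * x2 + 4 * x1 + 4 * x2 + X @ beta[0]
    f2 = 4 * x1 * x2 - 4 * x1 - 4 * x2 + X @ beta[1]

    # Multinomial logistic model; the zero logit is an assumed reference class.
    logits = np.column_stack([np.zeros(n), f1, f2])
    logits -= logits.max(axis=1, keepdims=True)  # numerically stable softmax
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)

    # Draw one label per row by inverse-CDF sampling.
    u = rng.random(n)
    y = (u[:, None] > np.cumsum(probs, axis=1)).sum(axis=1)
    y = np.minimum(y, 2)  # guard against round-off in the cumulative sum
    return X, y, probs


X, y, probs = simulate()
```

Because the sign pattern of x₁ and x₂ dominates the logits, samples with different sign patterns can end up in the same class, which is the kind of "multiple pathways" structure the simulation is designed to create.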

ADNI Data

Data Source: Alzheimer's Disease Neuroimaging Initiative database
Sample Size: 2,422 patients, 39 features
Target Classes: Cognitively Normal (CN), Mild Cognitive Impairment (MCI), Alzheimer's Disease/Dementia (AD)
Preprocessing: Remove visit data and device information; linearly scale features to the [0, 1] interval
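
Linear scaling to the unit interval is commonly done per feature as (x - min) / (max - min). The sketch below assumes per-column scaling and no missing values, neither of which is stated above; `min_max_scale` is a hypothetical helper name.

```python
import numpy as np


def min_max_scale(X):
    """Scale each column linearly so its minimum maps to 0 and maximum to 1."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # constant columns map to 0
    return (X - lo) / span


scaled = min_max_scale([[1, 10, 5], [3, 20, 5], [2, 40, 5]])
```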

Evaluation Metrics

Classification Performance: Precision, recall, F1 score
Clustering Quality: Validated through visualization and domain knowledge
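
For reference, the per-class classification metrics can be computed from true and predicted labels as in this small sketch. The helper name and the six example labels are hypothetical and are not taken from the paper's data.

```python
def per_class_metrics(y_true, y_pred):
    """Return {class: (precision, recall, f1)} computed one-vs-rest."""
    report = {}
    for c in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == c and q == c for t, q in pairs)
        fp = sum(t != c and q == c for t, q in pairs)
        fn = sum(t == c and q != c for t, q in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[c] = (precision, recall, f1)
    return report


# Made-up labels for illustration only.
y_true = ["CN", "CN", "MCI", "MCI", "AD", "AD"]
y_pred = ["CN", "CN", "MCI", "AD", "AD", "AD"]
report = per_class_metrics(y_true, y_pred)
```

In this toy example one MCI sample is predicted as AD, which lowers recall for MCI and precision for AD while leaving CN unaffected.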

Implementation Details

Predictive Model: XGBoost
Dimensionality Reduction: UMAP
Clustering Algorithm: HDBSCAN
Cross-Validation: Repeated cross-validation for SHAP value computation

Experimental Results

Simulation Experiment Results

Model Performance

The XGBoost model performs well on the test set:

Overall Accuracy: 90%
Per-Class F1 Scores: 0.88-0.92
This level of accuracy supports the reliability of the model's explanations

Clustering Discoveries

No Clustering Structure in Raw Data: UMAP visualization reveals no apparent clustering patterns in original data
SHAP Values Reveal 4 Clusters:
- Cluster 0: x₁ < 0, x₂ < 0 → Class 0
- Cluster 3: x₁ > 0, x₂ > 0 → Class 1
- Clusters 1 and 2: x₁, x₂ with opposite signs → Class 2 (two different pathways)

High-Dimensional Waterfall Plot Validation

Successfully identified two distinct pathways to Class 2
Cluster 1: x₁ > 0, x₂ < 0
Cluster 2: x₁ < 0, x₂ > 0

Finer-Grained Clustering

Further analysis shows that Cluster 3 can be subdivided into two sub-clusters that differ primarily in the contribution of Feature 8, indicating that the method can also resolve finer-grained structure.

ADNI Case Study Results

Model Performance

Overall Accuracy: 93%
Per-Class Performance: CN (F1=0.96), MCI (F1=0.92), AD (F1=0.86)

Key Feature Identification

CDRSB (Clinical Dementia Rating Scale Sum of Boxes): Most important predictive factor
LDELTOTAL: Significant role in CN and MCI differentiation
mPACCdigit and MMSE: Important in MCI and AD differentiation

Clustering Discoveries

CN Patients: Clusters 0 and 4, with similar SHAP patterns despite different APOE4 genotypes
MCI Patients: Clusters 3 and 6
- Cluster 3: CDRSB contribution to AD = -1.50 (protective)
- Cluster 6: CDRSB contribution to AD = -0.50 (less protective, hence relatively higher risk)
AD Patients: Clusters 1, 2, 5, exhibiting different disease pathways
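
Per-cluster summaries of this kind (one feature's contribution toward one class) could be obtained by averaging SHAP values within each cluster. The sketch below assumes that averaging is the summary used, which is not stated above, and runs on made-up numbers; `cluster_mean_contribution` is a hypothetical helper name.

```python
import numpy as np


def cluster_mean_contribution(shap_values, labels, feature, cls):
    """Mean SHAP value of one feature toward one class, per cluster.

    shap_values has shape (n, p, k); labels has shape (n,), where -1 marks
    noise points (the HDBSCAN convention), which are skipped.
    """
    labels = np.asarray(labels)
    summary = {}
    for c in np.unique(labels):
        if c == -1:
            continue
        summary[int(c)] = float(shap_values[labels == c, feature, cls].mean())
    return summary


# Made-up values: 5 samples, 2 features, 3 classes.
shap_values = np.zeros((5, 2, 3))
shap_values[:, 0, 2] = [0.2, 0.4, -1.0, -3.0, 9.0]
labels = [0, 0, 1, 1, -1]
summary = cluster_mean_contribution(shap_values, labels, feature=0, cls=2)
```

Comparing the entries of `summary` across clusters that share a predicted class shows whether the same feature pushes those clusters toward the class with different strength.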

Clinical Significance

Reveals heterogeneity within identical diagnostic categories
CDRSB assessment can be used for risk stratification in MCI patients
Different AD clusters may require different therapeutic strategies

Related Work

SHAP Analysis Development

Theoretical Foundation: Based on Shapley values (Lloyd Shapley, 1953)
Modern Development: Application to machine learning by Lundberg and Lee (2017)
TreeSHAP Algorithm: Specialized SHAP value computation for tree models

Clustering Method Evolution

Traditional Methods: K-means, hierarchical clustering based on original features
Density-Based Clustering: DBSCAN and its improved variant HDBSCAN
Supervised Clustering: Clustering methods incorporating supervised learning information

SHAP Value Clustering Research

Prior work on clustering SHAP values is very limited; this paper addresses that gap and provides a basis for subsequent research.

Conclusions and Discussion

Main Conclusions

Effectiveness of SHAP-based Clustering: Successfully discovers meaningful groupings unobservable in raw data
Practicality of High-Dimensional Waterfall Plot: Successfully addresses SHAP value visualization challenges in multi-classification
Medical Application Value: Demonstrates practical application potential in Alzheimer's disease research
Disease Heterogeneity Insights: Reveals different pathological pathways within identical diagnostic categories

Limitations

Computational Complexity: Requires computation of numerous SHAP values with high computational cost
Model Dependency: Clustering results depend on quality of underlying predictive model
Parameter Sensitivity: Parameter selection for algorithms such as HDBSCAN may influence results
Class Number Constraints: High-dimensional waterfall plot visualization remains limited by number of classes

Future Directions

Visualization Method Extension: Develop high-dimensional versions of other SHAP plots (bar plots, heatmaps, beeswarm plots, etc.)
Algorithm Optimization: Improve computational efficiency for large-scale data
Theoretical Analysis: Establish theoretical foundations for SHAP-based clustering
Application Expansion: Validate method universality across additional domains

In-Depth Evaluation

Strengths

Strong Innovation: First systematic proposal of SHAP-based supervised clustering method
High Practical Value: Possesses important application value in high-risk domains such as medicine
Complete Methodology: Provides comprehensive workflow from modeling to interpretation
Sufficient Validation: Dual verification through simulation and real-world case studies
Visualization Innovation: High-dimensional waterfall plot addresses multi-classification interpretability challenges

Weaknesses

Weak Theoretical Foundation: Lacks theoretical analysis of SHAP-based clustering
Computational Efficiency: Computational complexity issues for large-scale applications insufficiently discussed
Parameter Selection: Insufficient guidance principles for clustering algorithm parameter selection
Statistical Significance: Lacks statistical significance testing of clustering results
Insufficient Comparative Experiments: Limited comparison with other interpretable clustering methods

Impact

Academic Contribution: Provides novel perspectives for interpretable AI and supervised clustering fields
Practical Value: Possesses direct application potential in precision medicine and related domains
Method Generalization: Workflow generalizable to other fields and problems
Subsequent Research: Opens new directions for deep application of SHAP values

Applicable Scenarios

Medical Diagnosis: Disease heterogeneity analysis and personalized treatment
Financial Risk Control: Customer risk stratification and differentiated strategies
Recommendation Systems: User behavior pattern analysis
Quality Control: Analysis of different causes of product defects

References

The paper cites 23 references covering SHAP theory, clustering algorithms, visualization methods, and Alzheimer's disease research.

Overall Assessment: This is a high-quality interdisciplinary research paper making important contributions at the intersection of interpretable AI and supervised clustering. The methodology demonstrates strong innovation, comprehensive experimental validation, and significant value for high-risk application domains such as healthcare. While improvements remain possible in theoretical analysis and computational efficiency, the work establishes a solid foundation for subsequent research.