In this growing age of data and technology, large black-box models are becoming the norm due to their ability to handle vast amounts of data and learn incredibly complex input-output relationships. The deficiency of these methods, however, is their inability to explain the prediction process, making them untrustworthy and their use precarious in high-stakes situations. SHapley Additive exPlanations (SHAP) analysis is an explainable AI method growing in popularity for its ability to explain model predictions in terms of the original features. For each sample and feature in the data set, we associate a SHAP value that quantifies the contribution of that feature to the prediction of that sample. Clustering these SHAP values can provide insight into the data by grouping samples that not only received the same prediction, but received the same prediction for similar reasons. In doing so, we map the various pathways through which distinct samples arrive at the same prediction. To showcase this methodology, we present a simulated experiment in addition to a case study in Alzheimer's disease using data from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database. We also present a novel generalization of the waterfall plot for multi-classification.
SHAP-Based Supervised Clustering for Sample Classification and the Generalized Waterfall Plot
- Paper ID: 2510.08737
- Title: SHAP-Based Supervised Clustering for Sample Classification and the Generalized Waterfall Plot
- Authors: Justin Lin (Indiana University Mathematics Department), Julia Fukuyama (Indiana University Statistics Department)
- Classification: cs.LG, stat.ME, stat.ML
- Publication Date: October 9, 2025 (arXiv preprint)
- Paper Link: https://arxiv.org/abs/2510.08737v1
In an era of rapid data and technological advancement, large-scale black-box models have become mainstream due to their capacity to process massive datasets and learn complex input-output relationships. However, these methods suffer from a critical limitation: the inability to interpret the prediction process, rendering their application in high-risk scenarios unreliable and potentially dangerous. SHAP (SHapley Additive exPlanations) analysis, as an interpretable AI methodology, has gained increasing popularity for its ability to explain model predictions using original features. This paper proposes clustering analysis on SHAP values, which not only groups samples receiving identical predictions but, more importantly, groups samples obtaining the same predictions for similar reasons. The method's effectiveness is demonstrated through simulation experiments and an Alzheimer's disease case study (using the ADNI database), along with a proposed generalization of waterfall plots for multi-classification problems.
As machine learning model complexity continues to increase, black-box models demonstrate superior predictive accuracy; however, their lack of interpretability creates barriers to application in high-risk domains such as healthcare. Traditional clustering analysis, based solely on original data features, cannot reveal the different pathways through which samples arrive at identical prediction outcomes.
- Medical Application Needs: In heterogeneous diseases such as Alzheimer's disease, different patients may reach identical diagnostic conclusions through completely different pathological mechanisms
- Precision Medicine: Understanding disease heterogeneity facilitates the development of personalized treatment plans
- Model Interpretability: In high-risk decision scenarios, understanding the rationale behind model predictions is paramount
- Traditional Clustering Methods: Based solely on original data features, unable to capture complex input-output relationships learned by models
- Scarcity of SHAP Value Clustering Research: Existing literature contains extremely limited research on SHAP value clustering
- Insufficient Visualization Tools: Lack of effective SHAP value visualization methods for multi-classification problems
- Proposed SHAP-based Supervised Clustering Method: Performs clustering based on SHAP values rather than raw data, revealing different pathways through which samples arrive at identical predictions
- Developed High-Dimensional Waterfall Plot: Generalizes traditional waterfall plots to multi-classification problems, supporting visualization of k-dimensional SHAP vectors
- Provided Complete Analysis Pipeline: Encompasses a five-step workflow including predictive modeling, SHAP analysis, visualization, clustering analysis, and cluster interpretation
- Validated Method Effectiveness: Verified practical utility through simulation experiments and real-world Alzheimer's disease case studies
Given training dataset X' ⊂ X ⊂ R^p and trained model f: X → R, compute SHAP values φ(f;x)₁, ..., φ(f;x)ₚ for each sample x ∈ X such that:
$$\sum_{i=1}^{p} \phi(f;x)_i = f(x) - \mathbb{E}[f(X')]$$
The objective is to cluster the SHAP value matrix to discover sample groups with similar model explanations.
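The additivity property above can be checked numerically on a toy problem. The sketch below is a brute-force Shapley computation, not the TreeSHAP algorithm used in the paper. It assumes an interventional value function (features outside the coalition are filled in from a background set), and `toy_model`, `background`, and `x` are hypothetical.

```python
from itertools import combinations
from math import factorial

def value(f, x, background, subset):
    """Average model output when features in `subset` are fixed to x
    and the remaining features are taken from each background row."""
    total = 0.0
    for row in background:
        mixed = [x[j] if j in subset else row[j] for j in range(len(x))]
        total += f(mixed)
    return total / len(background)

def exact_shap(f, x, background):
    """Brute-force Shapley values; cost grows as 2^p, so only for tiny p."""
    p = len(x)
    phi = [0.0] * p
    for i in range(p):
        others = [j for j in range(p) if j != i]
        for size in range(p):
            # Shapley weight for coalitions of this size
            weight = factorial(size) * factorial(p - size - 1) / factorial(p)
            for s in combinations(others, size):
                s = set(s)
                gain = value(f, x, background, s | {i}) - value(f, x, background, s)
                phi[i] += weight * gain
    return phi

def toy_model(z):
    # Hypothetical model with one interaction term and one linear term
    return 4 * z[0] * z[1] + 2 * z[2]

background = [[0.0, 0.0, 0.0], [1.0, -1.0, 0.5], [-1.0, 1.0, 1.0], [0.5, 0.5, -0.5]]
x = [1.0, 1.0, 1.0]
phi = exact_shap(toy_model, x, background)
baseline = sum(toy_model(row) for row in background) / len(background)

# The SHAP values sum to the prediction minus the average prediction
print(sum(phi), toy_model(x) - baseline)
```

Stacking one such `phi` row per sample gives the SHAP value matrix that is then clustered.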
- Construct predictive models using XGBoost
- Ensure model generalization performance through repeated cross-validation
- Binary Classification: For each sample, each feature receives a single SHAP value
- Multi-Classification: For each sample, each feature receives a k-dimensional SHAP vector (k = number of classes)
- Employ TreeSHAP algorithm to compute SHAP values for tree models
- Utilize cross-validation to prevent overfitting
- Apply UMAP for dimensionality reduction and visualization
- Preserve local structure, suitable for cluster detection
- Employ HDBSCAN for hierarchical density-based clustering
- Handles noise points and clusters of varying density
- Analyze original data using heatmaps
- Explain clusters using high-dimensional waterfall plots
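To make the density-based clustering step concrete, here is a minimal sketch that groups rows of a small SHAP-like matrix. It uses plain DBSCAN as a simplified stand-in for HDBSCAN and skips the UMAP embedding. The data, `eps`, and `min_pts` are hypothetical values chosen for illustration, not settings from the paper.

```python
from math import dist

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: returns one label per point, with -1 meaning noise."""
    n = len(points)
    # Neighborhoods include the point itself
    neighbors = [[j for j in range(n) if dist(points[i], points[j]) <= eps]
                 for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # only core points expand the cluster
    return labels

# Hypothetical SHAP rows (one row per sample, one column per feature)
shap_rows = [
    [1.0, 0.9], [1.1, 1.0], [0.9, 1.1], [1.0, 1.0],      # one explanation pattern
    [-1.0, 1.0], [-1.1, 0.9], [-0.9, 1.1], [-1.0, 1.1],  # a second pattern
    [5.0, -5.0],                                         # isolated sample
]
labels = dbscan(shap_rows, eps=0.3, min_pts=3)
print(labels)
```

Samples sharing a label have similar feature contributions. The isolated sample is marked as noise instead of being forced into a cluster, which is the behavior that motivates a density-based method here.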
Traditional waterfall plots are applicable only to one-dimensional SHAP values and cannot handle k-dimensional SHAP vectors in multi-classification scenarios.
- Projection to Class Subspace: Select two classes and ignore SHAP values of other classes, suitable for pairwise class comparisons
- PCA Projection: Project onto the two-dimensional subspace that retains the most variance; this keeps all k classes in view but makes the axes harder to interpret
View SHAP vector sequences as paths in k-dimensional space, where each path segment corresponds to a feature's contribution, originating from the average prediction point and reaching the sample's specific prediction point.
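The path construction can be sketched as a cumulative sum of SHAP vectors, followed by the class-subspace projection. All numbers below are hypothetical, and the order in which features are added to the path is an assumption, since this summary does not specify one.

```python
def waterfall_path(base, shap_vectors):
    """Cumulative path: start at the base (average) prediction and add
    one feature's k-dimensional SHAP vector per step."""
    path = [list(base)]
    for vec in shap_vectors:
        path.append([a + b for a, b in zip(path[-1], vec)])
    return path

def project_to_classes(path, class_a, class_b):
    """Keep only two class coordinates so the path can be drawn in 2-D."""
    return [(pt[class_a], pt[class_b]) for pt in path]

# Hypothetical values for k = 3 classes and p = 3 features
base = [0.25, 0.5, 0.25]
shap_vectors = [
    [0.5, -0.25, -0.25],     # feature 1
    [-0.125, 0.25, -0.125],  # feature 2
    [0.0, -0.25, 0.25],      # feature 3
]
path = waterfall_path(base, shap_vectors)
path_2d = project_to_classes(path, 0, 2)
print(path[-1])   # the sample's prediction point
print(path_2d)    # points to draw for the class 0 vs class 2 comparison
```

Each consecutive pair of points in `path_2d` is one segment of the plot, so the segment for a feature shows how it moves the prediction with respect to the two selected classes.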
- Generative Model: Multinomial logistic regression
- Sample Size: 1,500 samples, 10-dimensional features
- Design Rationale: Create different pathways to reach identical target classes
- Function Definition:
  - f₁(x) = 4x₁x₂ + 4x₁ + 4x₂ + Σβ₁,ᵢxᵢ
  - f₂(x) = 4x₁x₂ - 4x₁ - 4x₂ + Σβ₂,ᵢxᵢ
  - where βⱼ,ᵢ ~ N(0,1)
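A data-generating process of this form can be sketched as follows. Several details are assumptions and are not stated in this summary: a third reference class with logit 0, standard normal features, and a sum that runs over all 10 features. The class indices are arbitrary and need not match the class numbering reported in the results.

```python
import math
import random

def logits(x, beta1, beta2):
    """Logits for three classes; the reference logit of 0 is an assumption."""
    f1 = 4 * x[0] * x[1] + 4 * x[0] + 4 * x[1] + sum(b * v for b, v in zip(beta1, x))
    f2 = 4 * x[0] * x[1] - 4 * x[0] - 4 * x[1] + sum(b * v for b, v in zip(beta2, x))
    return [0.0, f1, f2]

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def simulate(n=1500, p=10, seed=0):
    """Draw features and sample a class label from the multinomial logit model."""
    rng = random.Random(seed)
    beta1 = [rng.gauss(0, 1) for _ in range(p)]
    beta2 = [rng.gauss(0, 1) for _ in range(p)]
    X, y = [], []
    for _ in range(n):
        x = [rng.gauss(0, 1) for _ in range(p)]  # assumed feature distribution
        probs = softmax(logits(x, beta1, beta2))
        y.append(rng.choices([0, 1, 2], weights=probs)[0])
        X.append(x)
    return X, y

X, y = simulate()
print(len(X), len(X[0]), sorted(set(y)))
```

With the noise coefficients set to zero, the reference class wins whenever x₁ and x₂ have opposite signs, whichever of the two is positive. Those are the two distinct routes to one class that the design is meant to create.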
- Data Source: Alzheimer's Disease Neuroimaging Initiative database
- Sample Size: 2,422 patients, 39 features
- Target Classes: Cognitively Normal (CN), Mild Cognitive Impairment (MCI), Alzheimer's Disease/Dementia (AD)
- Preprocessing: Remove visit data and device information; linearly scale features to the [0, 1] interval
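Linear scaling to [0, 1] is commonly done per feature with min-max scaling. The sketch below assumes that reading; the input values are hypothetical, and mapping constant columns to 0 is a choice made here for illustration.

```python
def min_max_scale(column):
    """Linearly rescale a numeric column to [0, 1]; constant columns map to 0."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

scores = [18.0, 24.0, 30.0, 21.0]  # hypothetical feature values
scaled = min_max_scale(scores)
print(scaled)
```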
- Classification Performance: Precision, recall, F1 score
- Clustering Quality: Validated through visualization and domain knowledge
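For reference, per-class precision, recall, and F1 can be computed from true and predicted labels as in the sketch below. The label sequences are hypothetical and are not ADNI results.

```python
def per_class_metrics(y_true, y_pred, labels):
    """Return {label: (precision, recall, f1)} from two label sequences."""
    out = {}
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        out[c] = (precision, recall, f1)
    return out

# Hypothetical labels using the three diagnostic classes
y_true = ["CN", "CN", "MCI", "MCI", "AD", "AD"]
y_pred = ["CN", "CN", "MCI", "AD", "AD", "AD"]
metrics = per_class_metrics(y_true, y_pred, ["CN", "MCI", "AD"])
for label, (prec, rec, f1) in metrics.items():
    print(label, round(prec, 2), round(rec, 2), round(f1, 2))
```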
- Predictive Model: XGBoost
- Dimensionality Reduction: UMAP
- Clustering Algorithm: HDBSCAN
- Cross-Validation: Repeated cross-validation for SHAP value computation
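One plausible reading of the repeated cross-validation step is that SHAP values are computed for each sample only while it sits in a held-out fold, then combined across repeats. The sketch below shows only the index bookkeeping for that reading. The fold count, repeat count, and sample size are hypothetical, and the combination rule is an assumption.

```python
import random

def repeated_kfold(n_samples, n_splits=5, n_repeats=3, seed=0):
    """Yield (train_idx, test_idx) pairs; indices are reshuffled each repeat."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        folds = [idx[i::n_splits] for i in range(n_splits)]
        for k in range(n_splits):
            test = folds[k]
            train = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield train, test

# Count how often each sample lands in a held-out fold
held_out = [0] * 23
for train, test in repeated_kfold(23, n_splits=5, n_repeats=3):
    assert not set(train) & set(test)  # a sample is never in both parts
    for i in test:
        held_out[i] += 1
print(held_out)
```

Every sample is held out exactly once per repeat, so each one ends up with `n_repeats` out-of-fold explanations from models that never saw it during training.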
The XGBoost model performs well on the test set:
- Overall Accuracy: 90%
- Per-Class F1 Scores: 0.88-0.92
- This level of accuracy supports the reliability of the resulting model explanations
- No Clustering Structure in Raw Data: UMAP visualization reveals no apparent clustering patterns in original data
- SHAP Values Reveal 4 Clusters:
  - Cluster 0: x₁ < 0, x₂ < 0 → Class 0
  - Cluster 3: x₁ > 0, x₂ > 0 → Class 1
  - Clusters 1 and 2: x₁, x₂ with opposite signs → Class 2 (two different pathways)
- Successfully identified two distinct pathways to Class 2:
  - Cluster 1: x₁ > 0, x₂ < 0
  - Cluster 2: x₁ < 0, x₂ > 0
Further analysis reveals that Cluster 3 can be subdivided into two sub-clusters distinguished primarily by the contribution of Feature 8, which the analysis treats as evidence of the method's stability.
- Overall Accuracy: 93%
- Per-Class Performance: CN (F1=0.96), MCI (F1=0.92), AD (F1=0.86)
- CDRSB (Clinical Dementia Rating Scale Sum of Boxes): Most important predictive factor
- LDELTOTAL: Significant role in CN and MCI differentiation
- mPACCdigit and MMSE: Important in MCI and AD differentiation
- CN Patients: Clusters 0 and 4, with similar SHAP patterns despite different APOE4 genotypes
- MCI Patients: Clusters 3 and 6
  - Cluster 3: CDRSB contribution to AD = -1.50 (protective)
  - Cluster 6: CDRSB contribution to AD = -0.50 (less protective, i.e., higher risk relative to Cluster 3)
- AD Patients: Clusters 1, 2, 5, exhibiting different disease pathways
- Reveals heterogeneity within identical diagnostic categories
- CDRSB assessment can be used for risk stratification in MCI patients
- Different AD clusters may require different therapeutic strategies
- Theoretical Foundation: Based on Shapley values (Lloyd Shapley, 1953)
- Modern Development: Application to machine learning by Lundberg and Lee (2017)
- TreeSHAP Algorithm: Specialized SHAP value computation for tree models
- Traditional Methods: K-means, hierarchical clustering based on original features
- Density-Based Clustering: DBSCAN and its improved variant HDBSCAN
- Supervised Clustering: Clustering methods incorporating supervised learning information
Existing research on clustering SHAP values is extremely limited; this paper helps fill that gap and lays groundwork for subsequent research.
- Effectiveness of SHAP-based Clustering: Successfully discovers meaningful groupings unobservable in raw data
- Practicality of High-Dimensional Waterfall Plot: Successfully addresses SHAP value visualization challenges in multi-classification
- Medical Application Value: Demonstrates practical application potential in Alzheimer's disease research
- Disease Heterogeneity Insights: Reveals different pathological pathways within identical diagnostic categories
- Computational Complexity: Requires computation of numerous SHAP values with high computational cost
- Model Dependency: Clustering results depend on quality of underlying predictive model
- Parameter Sensitivity: Parameter selection for algorithms such as HDBSCAN may influence results
- Class Number Constraints: High-dimensional waterfall plot visualization remains limited by number of classes
- Visualization Method Extension: Develop high-dimensional versions of other SHAP plots (bar plots, heatmaps, beeswarm plots, etc.)
- Algorithm Optimization: Improve computational efficiency for large-scale data
- Theoretical Analysis: Establish theoretical foundations for SHAP-based clustering
- Application Expansion: Validate method universality across additional domains
- Strong Innovation: Among the first systematic proposals of a SHAP-based supervised clustering method
- High Practical Value: Possesses important application value in high-risk domains such as medicine
- Complete Methodology: Provides comprehensive workflow from modeling to interpretation
- Sufficient Validation: Dual verification through simulation and real-world case studies
- Visualization Innovation: High-dimensional waterfall plot addresses multi-classification interpretability challenges
- Weak Theoretical Foundation: Lacks theoretical analysis of SHAP-based clustering
- Computational Efficiency: Computational complexity issues for large-scale applications insufficiently discussed
- Parameter Selection: Insufficient guidance principles for clustering algorithm parameter selection
- Statistical Significance: Lacks statistical significance testing of clustering results
- Insufficient Comparative Experiments: Limited comparison with other interpretable clustering methods
- Academic Contribution: Provides novel perspectives for interpretable AI and supervised clustering fields
- Practical Value: Possesses direct application potential in precision medicine and related domains
- Method Generalization: Workflow generalizable to other fields and problems
- Subsequent Research: Opens new directions for deep application of SHAP values
- Medical Diagnosis: Disease heterogeneity analysis and personalized treatment
- Financial Risk Control: Customer risk stratification and differentiated strategies
- Recommendation Systems: User behavior pattern analysis
- Quality Control: Analysis of different causes of product defects
The paper cites 23 important references covering SHAP theory, clustering algorithms, visualization methods, and Alzheimer's disease research, providing solid theoretical support for interdisciplinary research.
Overall Assessment: This is a high-quality interdisciplinary research paper making important contributions at the intersection of interpretable AI and supervised clustering. The methodology demonstrates strong innovation, comprehensive experimental validation, and significant value for high-risk application domains such as healthcare. While improvements remain possible in theoretical analysis and computational efficiency, the work establishes a solid foundation for subsequent research.