
SHAP-Based Supervised Clustering for Sample Classification and the Generalized Waterfall Plot

Lin, Fukuyama

In this growing age of data and technology, large black-box models are becoming the norm due to their ability to handle vast amounts of data and learn incredibly complex input-output relationships. The deficiency of these methods, however, is their inability to explain the prediction process, making them untrustworthy and their use precarious in high-stakes situations. SHapley Additive exPlanations (SHAP) analysis is an explainable AI method growing in popularity for its ability to explain model predictions in terms of the original features. For each sample and feature in the data set, we associate a SHAP value that quantifies the contribution of that feature to the prediction of that sample. Clustering these SHAP values can provide insight into the data by grouping samples that not only received the same prediction, but received the same prediction for similar reasons. In doing so, we map the various pathways through which distinct samples arrive at the same prediction. To showcase this methodology, we present a simulated experiment in addition to a case study in Alzheimer's disease using data from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database. We also present a novel generalization of the waterfall plot for multi-classification.

Basic Information

Paper ID: 2510.08737
Title: SHAP-Based Supervised Clustering for Sample Classification and the Generalized Waterfall Plot
Authors: Justin Lin (Indiana University Mathematics Department), Julia Fukuyama (Indiana University Statistics Department)
Categories: cs.LG, stat.ME, stat.ML
Published: October 9, 2025 (arXiv preprint)
Paper link: https://arxiv.org/abs/2510.08737v1

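Summary

In an era of rapidly growing data and technology, large black-box models have become mainstream because they can handle massive amounts of data and learn complex input-output relationships. Their weakness is that they cannot explain the prediction process, which makes their use in high-stakes settings untrustworthy and risky. SHAP (SHapley Additive exPlanations) analysis is an explainable AI method that is growing in popularity because it explains model predictions in terms of the original features. The paper proposes clustering SHAP values, which groups samples that received the same prediction and, more importantly, samples that received it for similar reasons. A simulated experiment and an Alzheimer's disease case study (using the ADNI database) demonstrate the method, and the paper also proposes a generalization of the waterfall plot to multi-class problems.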

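Background and Motivation

Problem Definition

As machine learning models grow more complex, black-box models achieve strong predictive accuracy, but their lack of interpretability is a barrier to adoption in high-stakes fields such as medicine. Traditional cluster analysis operates only on the raw data features and cannot reveal the different paths by which samples arrive at the same prediction.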

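Why the Problem Matters

Medical application needs: in heterogeneous diseases such as Alzheimer's disease, different patients may arrive at the same diagnosis through entirely different pathological mechanisms
Precision medicine: understanding disease heterogeneity helps in designing personalized treatment plans
Model interpretability: in high-stakes decision settings, understanding why a model makes a prediction is essential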

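Limitations of Existing Methods

Traditional clustering methods: operate only on the raw data features and cannot capture the complex input-output relationships the model has learned
Little work on SHAP value clustering: the existing literature on clustering SHAP values is very limited
Insufficient visualization tools: multi-class problems lack an effective way to visualize SHAP values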

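Core Contributions

A SHAP-based supervised clustering method: clustering is performed on SHAP values rather than the raw data, revealing the different paths by which samples arrive at the same prediction
A high-dimensional waterfall plot: the traditional waterfall plot is generalized to multi-class problems and supports visualizing k-dimensional SHAP vectors
A complete analysis pipeline: a five-step workflow covering predictive modeling, SHAP analysis, visualization, cluster analysis, and cluster interpretation
Validation of the method: the method's practical value is demonstrated with a simulated experiment and a real-world Alzheimer's disease case study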

Method Details

Task Definition

Given a training data set X' ⊂ X ⊂ R^p and a trained model f: X → R, compute for each sample x ∈ X the SHAP values φ(f;x)₁, ..., φ(f;x)ₚ such that:

$\sum_{i=1}^{p} \phi(f;x)_i = f(x) - E[f(X')]$

The goal is to cluster the matrix of SHAP values in order to find groups of samples with similar model explanations.
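
The additivity constraint above can be checked on a toy example. The sketch below is not the TreeSHAP algorithm used in the workflow; it computes Shapley values by brute-force enumeration of feature coalitions, filling in absent features from a small background set that plays the role of X'. The model `f`, the background rows, and the helper name `exact_shap` are all hypothetical and chosen only for illustration.

```python
from itertools import combinations
from math import factorial


def exact_shap(f, x, background):
    """Brute-force Shapley values of f at x (hypothetical helper).

    Features outside a coalition are replaced by the values found in each
    background row, and the resulting predictions are averaged.
    """
    p = len(x)

    def value(subset):
        # Mean prediction with the features in `subset` fixed to x.
        total = 0.0
        for row in background:
            z = [x[j] if j in subset else row[j] for j in range(p)]
            total += f(z)
        return total / len(background)

    phi = []
    for i in range(p):
        others = [j for j in range(p) if j != i]
        contrib = 0.0
        for size in range(p):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(p - size - 1) / factorial(p)
            for coalition in combinations(others, size):
                s = set(coalition)
                contrib += weight * (value(s | {i}) - value(s))
        phi.append(contrib)
    return phi


def f(z):
    # Toy model with an interaction between the first two features.
    return 4 * z[0] * z[1] + 4 * z[0] + 4 * z[1] + 0.5 * z[2]


background = [[0.0, 0.0, 0.0], [1.0, -1.0, 2.0], [-1.0, 0.5, 1.0], [2.0, 1.0, -1.0]]
x = [1.0, 2.0, 3.0]

phi = exact_shap(f, x, background)
base = sum(f(row) for row in background) / len(background)  # stands in for E[f(X')]
print(sum(phi), f(x) - base)  # the two numbers agree up to rounding
```

Enumeration costs on the order of 2^p model evaluations per feature, so it is only practical for a handful of features; it is shown here because it makes the sum constraint easy to verify by hand.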

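Supervised Clustering Workflow

1. Predictive Modeling

Build the predictive model with XGBoost
Use repeated cross-validation to check that the model generalizes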

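2. SHAP Analysis

Binary classification: each feature has a single SHAP value
Multi-class classification: each feature has a k-dimensional SHAP vector (k is the number of classes)
Use the TreeSHAP algorithm to compute SHAP values for tree models
Use cross-validation to avoid overfitting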

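3. Visualization

Use UMAP for dimensionality reduction and visualization
UMAP preserves local structure, which makes it well suited to detecting clusters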

4. Cluster Analysis

Use HDBSCAN for hierarchical density-based clustering
HDBSCAN can handle noise and clusters of varying density

5. Cluster Interpretation

Use heatmaps to examine the raw data
Use high-dimensional waterfall plots to interpret the clusters
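
Steps 1, 2, and 4 of the workflow can be sketched for the binary case as follows. This is a minimal sketch under several assumptions that do not come from the source: it uses XGBoost's built-in `pred_contribs=True` option to obtain TreeSHAP contributions, takes `HDBSCAN` from scikit-learn (version 1.3 or later), clusters the SHAP matrix directly, and picks arbitrary hyperparameters. The function name `shap_supervised_clustering` is hypothetical, and the UMAP and heatmap steps are left out.

```python
def shap_supervised_clustering(X, y, n_splits=5, min_cluster_size=20, seed=0):
    """Hypothetical helper: out-of-fold SHAP values followed by HDBSCAN."""
    # Imported lazily so the optional dependencies are only needed on call.
    import numpy as np
    import xgboost as xgb
    from sklearn.cluster import HDBSCAN
    from sklearn.model_selection import StratifiedKFold

    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    shap_values = np.zeros_like(X)

    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in folds.split(X, y):
        # Step 1: fit the predictive model on the training folds only.
        model = xgb.XGBClassifier(n_estimators=100, max_depth=3, random_state=seed)
        model.fit(X[train_idx], y[train_idx])
        # Step 2: TreeSHAP contributions for the held-out fold (log-odds scale).
        # The last column is the bias term, so it is dropped.
        contribs = model.get_booster().predict(
            xgb.DMatrix(X[test_idx]), pred_contribs=True
        )
        shap_values[test_idx] = contribs[:, :-1]

    # Step 4: density-based clustering on the SHAP matrix; label -1 marks noise.
    labels = HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(shap_values)
    return shap_values, labels
```

Calling `shap_values, labels = shap_supervised_clustering(X, y)` on a feature matrix with 0/1 labels returns one row of SHAP values per sample and one cluster label per sample. Computing each sample's SHAP values from a model that never saw that sample is one way to read the cross-validation item in step 2.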

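The High-Dimensional Waterfall Plot

Limitations of the Traditional Waterfall Plot

The traditional waterfall plot only works for one-dimensional SHAP values and cannot handle the k-dimensional SHAP vectors that arise in multi-class classification.

Proposed Solutions

Projection onto a class subspace: select two classes and ignore the SHAP values of the others; suited to pairwise comparisons between classes
PCA projection: project onto the two-dimensional subspace that retains the most information; keeps information from all k classes, but the axes are harder to interpret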

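Mathematical Representation

The sequence of SHAP vectors is treated as a path in k-dimensional space. Each segment of the path corresponds to the contribution of one feature, and the path runs from the average prediction to the sample's own prediction.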
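
The path construction and the two projections can be sketched with NumPy. The SHAP vectors below are random placeholders, not results from the paper, and the function names are hypothetical. Fitting PCA on the points of a single path is also an assumption made for this sketch; the source does not say which points the projection is fitted on.

```python
import numpy as np


def waterfall_path(shap_vectors, base_value):
    """Row 0 is the base point; row j adds the contributions of the first j features."""
    steps = np.cumsum(shap_vectors, axis=0)
    return np.vstack([base_value, base_value + steps])


def project_to_classes(path, class_a, class_b):
    """Keep only the coordinates of two chosen classes."""
    return path[:, [class_a, class_b]]


def project_with_pca(path):
    """Project the path onto its first two principal components."""
    centered = path - path.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


rng = np.random.default_rng(0)
p, k = 10, 3                               # 10 features, 3 classes (illustrative)
shap_vectors = rng.normal(size=(p, k))     # placeholder SHAP vectors
base_value = np.zeros(k)                   # placeholder average prediction

path = waterfall_path(shap_vectors, base_value)
pairwise_view = project_to_classes(path, 0, 1)
pca_view = project_with_pca(path)
print(path.shape, pairwise_view.shape, pca_view.shape)
```

Either two-dimensional view can then be drawn as a connected line plot, one segment per feature. The class projection keeps axes that read directly as per-class contributions, while the PCA view keeps more of the path's variance at the cost of mixed axes.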

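Experimental Setup

Data Sets

Simulated Data

Generating model: multinomial logistic regression
Sample size: 1,500 samples with 10 features
Design idea: create different paths that lead to the same target class
Function definitions:
- f₁(x) = 4x₁x₂ + 4x₁ + 4x₂ + Σβ₁,ᵢxᵢ
- f₂(x) = 4x₁x₂ - 4x₁ - 4x₂ + Σβ₂,ᵢxᵢ
- where βⱼ,ᵢ ~ N(0,1)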
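
One way to see how these score functions can create distinct paths to the same class is to factor the terms involving x₁ and x₂. The identities below are plain algebra on the two definitions above. Reading them as the mechanism behind the "different paths" design is an interpretation, not a statement from the source.

$f_1(x) = 4(x_1 + 1)(x_2 + 1) - 4 + \sum_i \beta_{1,i} x_i$

$f_2(x) = 4(x_1 - 1)(x_2 - 1) - 4 + \sum_i \beta_{2,i} x_i$

The product 4(x₁ + 1)(x₂ + 1) is large when x₁ and x₂ are both well above −1 or both well below −1, so f₁ can be pushed up from two separate regions of the (x₁, x₂) plane. The same holds for f₂ around the point (1, 1).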