
Regularized Sparse Optimal Discriminant Clustering

Hiraishi, Tanioka, Yadohisa

We propose a new method based on sparse optimal discriminant clustering (SODC) that incorporates a convex-clustering penalty on the scoring matrix. This penalty is expected to improve the accuracy of cluster identification by pulling points within the same cluster closer together and pushing points from different clusters further apart, so that the clustering structure is depicted more clearly when the estimation results are visualized. We also develop a novel algorithm that derives the update formula for the scoring matrix using a majorizing function. The scoring matrix is updated with the alternating direction method of multipliers (ADMM), which is often used to estimate the parameters of the convex clustering objective. As in conventional SODC, the scoring matrix in the proposed method is subject to an orthogonality constraint, so the update must satisfy this constraint while maintaining the clustering structure. Using a majorizing function, we address the challenge of enforcing both the orthogonality constraint and the clustering structure within the scoring matrix. We present numerical simulations and an application to real data to assess the performance of the proposed method.

Basic Information

Paper ID: 2501.10147
Title: Regularized Sparse Optimal Discriminant Clustering
Authors: Mayu Hiraishi, Kensuke Tanioka, Hiroshi Yadohisa (Doshisha University)
Category: stat.ME (Statistics, Methodology)
Published: October 15, 2025
Paper link: https://arxiv.org/abs/2501.10147

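Abstract

The paper proposes a new method based on sparse optimal discriminant clustering (SODC) that adds a convex-clustering penalty term on the scoring matrix. The penalty is expected to improve the accuracy of cluster identification by pulling points within the same cluster closer together and pushing points from different clusters apart, so that the clustering structure appears more clearly when the estimates are visualized. The authors also develop a novel algorithm that uses a majorizing function to derive the update formula for the scoring matrix. The scoring matrix is updated with the alternating direction method of multipliers (ADMM), which is commonly used to estimate the parameters of the convex clustering objective.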

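Background and Motivation

Problem Definition

Clustering with dimension reduction is widely used to interpret the characteristics of large, complex data. It estimates a low-dimensional space in which clusters are identified, allowing efficient processing while retaining the important features of the high-dimensional data. The existing optimal discriminant clustering (ODC) and sparse optimal discriminant clustering (SODC) methods describe clusters more clearly than principal component analysis, but they have the following problems:

Scoring matrix structure: the scoring matrix in SODC does not preserve the cluster-identifying structure that the optimal scores have in LDA
Lack of a separate cluster-information matrix: ODC and SODC do not include a separate matrix carrying cluster information, which may reduce the accuracy of cluster estimation
Poor visualization: when the data are reduced to a low-dimensional space and the results are visualized, SODC may fail to produce a well-separated cluster structure

Research Motivation

To address these problems, the authors add a convex-clustering penalty term to SODC so that the scoring matrix exhibits a clearer clustering structure than in conventional SODC, by pulling data points from the same cluster closer together and separating data points from different clusters.

Core Contributions

RSODC method: adds a convex-clustering regularization term to SODC to improve the accuracy of cluster identification
New algorithm: uses a majorizing function to derive an update formula for the scoring matrix that satisfies both the orthogonality constraint and the clustering-structure requirement
ADMM optimization framework: updates the scoring matrix with the alternating direction method of multipliers, which handles the coupled constraints
Theoretical and empirical validation: assesses the method through numerical simulations and an application to real data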

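Constraints: $Y^{\dagger\top}Y^{\dagger} = I_{k-1}$ and $Y^{\dagger\top}1 = 0$

where, in the objective function:

The first three terms are the same as in SODC
The fourth term is the convex-clustering penalty, which encourages similar samples to lie closer together
$\alpha_{i,j}$ is a weight, computed as $\alpha_{i,j} = \iota_{\delta_{i,j}}\exp(-\tau\|x_i - x_j\|_2^2)$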
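
The weight formula can be sketched in NumPy as follows. This is only an illustration, not the authors' code: the function name `convex_clustering_weights` and the default values are hypothetical, and the indicator $\iota_{\delta_{i,j}}$ is read here as a k-nearest-neighbour mask, which is an assumption because the indicator is not defined above.

```python
import numpy as np

def convex_clustering_weights(X, tau=0.5, n_neighbors=5):
    """Return a symmetric (n, n) matrix of weights alpha[i, j].

    alpha[i, j] = indicator(i, j) * exp(-tau * ||x_i - x_j||_2^2)
    ASSUMPTION: indicator(i, j) is 1 when j is among the n_neighbors
    nearest neighbours of i (or i among those of j), and 0 otherwise.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # Squared Euclidean distances between all pairs of rows.
    diff = X[:, None, :] - X[None, :, :]
    sq_dist = np.einsum("ijk,ijk->ij", diff, diff)
    # Exclude each point from its own neighbour list before ranking.
    masked = sq_dist.copy()
    np.fill_diagonal(masked, np.inf)
    order = np.argsort(masked, axis=1)[:, :n_neighbors]
    indicator = np.zeros((n, n), dtype=bool)
    indicator[np.repeat(np.arange(n), order.shape[1]), order.ravel()] = True
    indicator |= indicator.T  # symmetrise the neighbour relation
    alpha = np.where(indicator, np.exp(-tau * sq_dist), 0.0)
    np.fill_diagonal(alpha, 0.0)  # no self-weights
    return alpha
```

Larger values of `tau` make the weights decay faster with distance, so only very close pairs are fused strongly; the neighbour mask keeps the number of active pairs small.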

ADMM Decomposition

To apply the ADMM algorithm, the problem is rewritten as:

$\min_{B,Y,V,\Lambda} \frac{1}{2}\|Y - H_nXB\|_F^2 + \eta_2\|B\|_F^2 + \eta_1\sum_{j=1}^p\|\beta_j\|_2 + \gamma\sum_{l \in \varepsilon}\alpha_l\|v_l\|_2$

Constraints:

$y_i - y_j = v_l$
$Y^{\top}Y = I_{k-1}$
$Y^{\top}1 = 0$
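
Substituting $v_l = y_i - y_j$ into the fourth term shows that the rewritten problem penalizes weighted distances between pairs of rows of $Y$. The sketch below evaluates the four-term objective for given $B$ and $Y$. It rests on assumptions not stated above: $H_n$ is taken to be the centering matrix $I_n - \frac{1}{n}11^{\top}$, $\beta_j$ is taken to be the $j$-th row of $B$, and the function name `admm_objective` and its argument layout are hypothetical.

```python
import numpy as np

def admm_objective(X, B, Y, edges, alpha, eta1, eta2, gamma):
    """Evaluate the objective with v_l = y_i - y_j substituted in.

    X: (n, p) data, B: (p, k-1) coefficients, Y: (n, k-1) scoring matrix,
    edges: list of (i, j) index pairs, alpha: one weight per edge.
    """
    n = X.shape[0]
    # ASSUMPTION: H_n is the centering matrix I - 11^T / n.
    H = np.eye(n) - np.ones((n, n)) / n
    fit = 0.5 * np.linalg.norm(Y - H @ X @ B, "fro") ** 2
    ridge = eta2 * np.linalg.norm(B, "fro") ** 2
    # ASSUMPTION: beta_j is row j of B, giving a group penalty per variable.
    group = eta1 * np.linalg.norm(B, axis=1).sum()
    i_idx, j_idx = np.array(edges).T
    V = Y[i_idx] - Y[j_idx]  # rows are v_l = y_i - y_j
    fusion = gamma * np.sum(np.asarray(alpha) * np.linalg.norm(V, axis=1))
    return fit + ridge + group + fusion
```

Tracking this value across iterations is a simple way to check that an implementation of the updates does not increase the objective.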

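Technical Innovations

Majorizing Function Approach

The key innovation is the use of a majorizing function to handle the quadratic term in the scoring matrix update. For the quadratic form $\text{tr}(Y^{\top}CY)$, the following majorizing function is constructed:

$\text{tr}(Y^{\top}CY) \leq 2\omega(k-1) - 2\text{tr}(Y^{\top}(\omega I - C)Q) - \text{tr}(Q^{\top}CQ)$

where $Q$ denotes the current iterate of $Y$, both $Y$ and $Q$ satisfy the orthogonality constraint, and $\omega$ is the largest eigenvalue of $C = \frac{\rho}{2}\sum_{l \in \varepsilon}g_lg_l^{\top}$. The constant term $2\omega(k-1)$ comes from $\text{tr}(Y^{\top}Y) = \text{tr}(Q^{\top}Q) = k-1$ and does not affect the minimizer; the bound is linear in $Y$, which is what makes the constrained update tractable.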
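
The bound follows from $\text{tr}((Y-Q)^{\top}(C-\omega I)(Y-Q)) \leq 0$; expanding with $Y^{\top}Y = Q^{\top}Q = I_{k-1}$ gives a constant of $2\omega(k-1)$, which plays no role in the minimization. The sketch below checks the inequality numerically on random matrices with orthonormal columns. It assumes $g_l = e_i - e_j$ for an edge $l = (i, j)$, which is not stated above; the sizes and helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, rho = 8, 3, 1.0
edges = [(i, j) for i in range(n) for j in range(i + 1, n)]

# ASSUMPTION: g_l = e_i - e_j for edge l = (i, j), so g_l^T Y = y_i - y_j.
G = np.zeros((len(edges), n))
for l, (i, j) in enumerate(edges):
    G[l, i], G[l, j] = 1.0, -1.0
C = 0.5 * rho * G.T @ G              # C = (rho / 2) * sum_l g_l g_l^T
omega = np.linalg.eigvalsh(C)[-1]    # largest eigenvalue of C

def random_orthonormal(rows, cols):
    """Random rows x cols matrix with orthonormal columns (QR of a Gaussian)."""
    Q, _ = np.linalg.qr(rng.standard_normal((rows, cols)))
    return Q

def majorizer(Y, Q):
    """Upper bound on tr(Y^T C Y), built at the current iterate Q."""
    return (2 * omega * (k - 1)
            - 2 * np.trace(Y.T @ (omega * np.eye(n) - C) @ Q)
            - np.trace(Q.T @ C @ Q))

Q = random_orthonormal(n, k - 1)
gaps = []
for _ in range(200):
    Y = random_orthonormal(n, k - 1)
    gaps.append(majorizer(Y, Q) - np.trace(Y.T @ C @ Y))

# The bound should touch the original function at Y = Q.
tight = majorizer(Q, Q) - np.trace(Q.T @ C @ Q)
```

Every entry of `gaps` should be non-negative up to rounding, and `tight` should be zero, which are the two defining properties of a majorizing function.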

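Orthogonal Procrustes Analysis

With the majorizing function, the update of $Y$ reduces to an orthogonal Procrustes problem:

$\min_Y \|Y - D\|_F^2, \quad \text{s.t. } Y^{\top}Y = I$

The solution is $Y \leftarrow LR^{\top}$, where $D = L\Sigma R^{\top}$ is the singular value decomposition of $D$.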
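
A minimal NumPy sketch of this closed-form step is shown below. The function name `procrustes_update` is hypothetical, and the matrix `D` is random test data rather than the matrix the algorithm would assemble. The columns of `D` are centred here as an assumption, so that the solution also satisfies $Y^{\top}1 = 0$ when `D` has full column rank.

```python
import numpy as np

def procrustes_update(D):
    """Solve min ||Y - D||_F^2 s.t. Y^T Y = I via the thin SVD D = L S R^T."""
    L, _, Rt = np.linalg.svd(D, full_matrices=False)
    return L @ Rt

rng = np.random.default_rng(1)
D = rng.standard_normal((10, 2))
D -= D.mean(axis=0)  # centre the columns so that D^T 1 = 0

Y = procrustes_update(D)

# Compare against random feasible points: none should fit D better than Y.
best_random = np.inf
for _ in range(500):
    Z, _ = np.linalg.qr(rng.standard_normal((10, 2)))
    best_random = min(best_random, np.linalg.norm(Z - D, "fro"))
```

Because the columns of $L$ span the column space of a centred `D`, the product $LR^{\top}$ stays orthogonal to the vector of ones without any extra projection.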

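Experimental Setup

Datasets

Simulated data:
- Number of samples $n = 60, 96, 156$
- Number of variables $p = 20, 50, 80, 100$
- Number of clusters $k = 3, 4$
- Number of informative variables $q = 2$
Real data: breast cancer proteomics data (breast TCGA)
- 150 samples, 142 proteins
- 3 cancer subtypes: Basal, Her2, LumA
- 10 informative variables and 70 non-informative variables selected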