2025-11-10T02:57:56.733881

Regularized Sparse Optimal Discriminant Clustering

Hiraishi, Tanioka, Yadohisa

We propose a new method based on sparse optimal discriminant clustering (SODC), incorporating a penalty term into the scoring matrix based on convex clustering. With the addition of this penalty term, it is expected to improve the accuracy of cluster identification by pulling points within the same cluster closer together and points from different clusters further apart. When the estimation results are visualized, the clustering structure can be depicted more clearly. Moreover, we develop a novel algorithm to derive the updated formula of this scoring matrix using a majorizing function. The scoring matrix is updated using the alternating direction method of multipliers (ADMM), which is often employed to calculate the parameters of the objective function in the convex clustering. In the proposed method, as in the conventional SODC, the scoring matrix is subject to an orthogonal constraint. Therefore, it is necessary to satisfy the orthogonal constraint on the scoring matrix while maintaining the clustering structure. Using a majorizing function, we adress the challenge of enforcing both orthogonal constraint and the clustering structure within the scoring matrix. We demonstrate numerical simulations and an application to real data to assess the performance of the proposed method.

academic

Regularized Sparse Optimal Discriminant Clustering

Basic Information

Paper ID: 2501.10147
Title: Regularized Sparse Optimal Discriminant Clustering
Authors: Mayu Hiraishi, Kensuke Tanioka, Hiroshi Yadohisa (Doshisha University)
Classification: stat.ME (Statistical Methods)
Publication Date: October 15, 2025
Paper Link: https://arxiv.org/abs/2501.10147

Abstract

This paper proposes a novel method based on Sparse Optimal Discriminant Clustering (SODC) by incorporating a penalty term derived from convex clustering into the scoring matrix. By adding this penalty term, the authors aim to improve clustering identification accuracy by pulling points within the same cluster closer together and pushing points from different clusters farther apart. When visualization of estimation results is performed, the clustering structure can be depicted more clearly. Furthermore, the authors develop a novel algorithm that uses majorization functions to derive update formulas for the scoring matrix. The scoring matrix is updated using the Alternating Direction Method of Multipliers (ADMM), a method commonly employed for computing parameters in convex clustering objective functions.

Research Background and Motivation

Problem Definition

Dimensionality reduction clustering is widely used in interpreting features of large complex datasets. It estimates a low-dimensional space to identify clusters while maintaining important features of high-dimensional data and achieving efficient processing. Although existing Optimal Discriminant Clustering (ODC) and Sparse Optimal Discriminant Clustering (SODC) methods describe clusters more clearly than Principal Component Analysis, they suffer from the following issues:

Scoring Matrix Structure Problem: The scoring matrix in SODC does not maintain the same cluster identification structure as the optimal scoring in LDA
Lack of Independent Cluster Information Matrix: ODC and SODC do not include an independent matrix containing cluster information, which may affect the accuracy of cluster estimation
Poor Visualization Effects: When SODC reduces data to low-dimensional space and visualizes results, it may fail to produce well-separated cluster structures

Research Motivation

To address these issues, the authors propose adding a penalty term based on convex clustering to SODC, enabling the scoring matrix to provide clearer cluster structures than traditional SODC by pulling data points from the same cluster closer together and separating data points from different clusters.

Core Contributions

Proposes RSODC Method: Adds a regularization term based on convex clustering to SODC, improving cluster identification accuracy
Develops Novel Algorithm: Uses majorization functions to derive update formulas for the scoring matrix while satisfying orthogonality constraints and cluster structure requirements
ADMM Optimization Framework: Employs the Alternating Direction Method of Multipliers to update the scoring matrix, effectively handling complex constraints
Theoretical and Empirical Validation: Verifies method effectiveness through numerical simulations and real data applications

Methodology Details

Task Definition

Given data matrix $X \in \mathbb{R}^{n \times p}$ , the objective is to identify $k$ clusters in low-dimensional space while performing variable selection and dimensionality reduction.

Model Architecture

RSODC Objective Function

The optimization problem for RSODC is defined as:

$\min_{B,Y^{\dagger}} \frac{1}{2}\|Y^{\dagger} - H_nXB\|_F^2 + \eta_2\|B\|_F^2 + \eta_1\sum_{j=1}^p\|\beta_j\|_2 + \gamma\sum_{i<j}\alpha_{i,j}\|y_i^{\dagger} - y_j^{\dagger}\|_2$

Subject to constraints: $Y^{\dagger\top}Y^{\dagger} = I_{k-1}$ and $Y^{\dagger\top}1 = 0$

Where:

The first three terms are identical to SODC
The fourth term is a penalty term based on convex clustering, encouraging similar samples to be closer
$\alpha_{i,j}$ is a weight calculated as: $\alpha_{i,j} = \iota_{\delta_{i,j}}\exp(-\tau\|x_i - x_j\|_2^2)$

ADMM Decomposition

To apply the ADMM algorithm, the problem is reformulated as:

$\min_{B,Y,V,\Lambda} \frac{1}{2}\|Y - H_nXB\|_F^2 + \eta_2\|B\|_F^2 + \eta_1\sum_{j=1}^p\|\beta_j\|_2 + \gamma\sum_{l \in \varepsilon}\alpha_l\|v_l\|_2$

Subject to constraints:

$y_i - y_j = v_l$
$Y^{\top}Y = I_{k-1}$
$Y^{\top}1 = 0$

Technical Innovations

Majorization Function Method

The key innovation is using majorization functions to handle quadratic terms in scoring matrix updates. For the quadratic form $\text{tr}(Y^{\top}CY)$ , a majorization function is constructed:

$\text{tr}(Y^{\top}CY) \leq 2\omega - 2\text{tr}(Y^{\top}(\omega I - C)Q) - \text{tr}(Q^{\top}CQ)$

Where $\omega$ is the largest eigenvalue of $C = \frac{\rho}{2}\sum_{l \in \varepsilon}g_lg_l^{\top}$ .

Orthogonal Procrustes Analysis

Through the majorization function, the Y update is transformed into an orthogonal Procrustes problem:

$\min_Y \|Y - D\|_F^2, \quad \text{s.t. } Y^{\top}Y = I$

The solution is $Y \leftarrow LR^{\top}$ , where $D = L\Sigma R^{\top}$ is the singular value decomposition.

Experimental Setup

Datasets

Simulated Data:
- Sample size $n = 60, 96, 156$
- Number of variables $p = 20, 50, 80, 100$
- Number of clusters $k = 3, 4$
- Number of informative variables $q = 2$
Real Data: Breast cancer proteomics data (breast TCGA)
- 150 samples, 142 proteins
- 3 cancer subtypes: Basal, Her2, LumA
- 10 informative variables and 70 non-informative variables selected

Evaluation Metrics

Adjusted Rand Index (ARI): Assesses clustering accuracy
Variance Ratio: Ratio of within-cluster variance to between-cluster variance
Sensitivity and Specificity: Evaluate variable selection effectiveness

Comparison Methods

Sparse Optimal Discriminant Clustering (SODC)
Tandem Clustering
Reduced k-means
Factorial k-means
t-SNE

Implementation Details

Parameter selection: Cross-validation based on kappa coefficient
$\eta_2 = 0$ , $\tau = 0.1$ , $\delta = 25$
Convergence threshold: $\epsilon > 0$

Experimental Results

Main Results

Simulation Experiments

RSODC outperforms comparison methods on the ARI metric across all simulation settings:

When k=3: RSODC performs best in nearly all scenarios
When k=4: RSODC outperforms all comparison methods
Stability: As $p$ increases, RSODC remains stable while SODC becomes unstable
Clustering Quality: As cluster center distance $\vartheta$ increases, RSODC's ARI values increase more noticeably

Real Data Application

Results on breast cancer data:

Method	ARI( $X\hat{B}$ )	ARI( $\hat{Y}$ )	Variance Ratio( $X\hat{B}$ )	Variance Ratio( $\hat{Y}$ )
RSODC	0.406	0.441	3.038	3.056
SODC	0.401	0.363	2.909	2.660

Ablation Studies

Parameter Sensitivity Analysis

Weight Parameters: ARI is highest when $\tau = 0$ and $\delta = 0.01$
Tuning Parameters: Different combinations of $\eta_1, \gamma, \rho$ have minimal impact on results
Cluster Number Selection: Gap statistic effectively determines the true number of clusters

Computational Complexity

RSODC requires more computation time than SODC, particularly when $n$ is large, but provides better clustering quality.

Case Analysis

Visualization results demonstrate that RSODC can:

Aggregate points within the same cluster more tightly
Separate points from different clusters more distinctly
Provide clearer cluster boundaries

Dimensionality Reduction Clustering Methods

Optimal Discriminant Clustering (ODC): Zhang and Dai (2009) based on optimal scoring from Fisher's Linear Discriminant Analysis
Sparse ODC (SODC): Wang et al. (2016) adding group lasso penalty to ODC
Convex Clustering: Pelckmans et al. (2005), Hocking et al. (2011) using convex optimization for stable clustering

Innovations in This Work

Compared to existing methods, RSODC's main advantages are:

Approximates the model in the reduced-dimensional space rather than the original space
Uses majorization functions to simultaneously handle orthogonality constraints and cluster structure
Provides clearer cluster visualization effects

Conclusions and Discussion

Main Conclusions

RSODC significantly improves cluster identification accuracy by adding a convex clustering penalty term
The majorization function method effectively addresses the simultaneous satisfaction of orthogonality constraints and cluster structure
Experiments validate method effectiveness on both simulated and real data

Limitations

Computational Complexity: Requires more computation time than SODC
Parameter Selection: Cross-validation is computationally expensive
Weight Calculation: Computing weights in the original space may not be optimal
Data Distribution Assumptions: Implicitly assumes errors follow a normal distribution

Future Directions

Develop more efficient parameter selection methods
Compute weights in the low-dimensional space
Derive information criteria to reduce cross-validation costs
Consider extensions for non-normal distributions

In-Depth Evaluation

Strengths

Strong Methodological Innovation: Cleverly combines convex clustering and sparse discriminant analysis
Solid Theoretical Foundation: Majorization function method is theoretically rigorous
Comprehensive Experiments: Includes five simulation experiments and real data validation
Elegant Algorithm Design: ADMM framework effectively handles complex constraints

Weaknesses

Computational Efficiency: Significantly higher computational cost compared to SODC
Parameter Sensitivity: Multiple hyperparameters require tuning
Applicable Scope: Primarily suitable for small-scale clustering problems
Theoretical Analysis: Lacks convergence and complexity theoretical analysis

Impact

Academic Contribution: Provides new perspectives for dimensionality reduction clustering
Practical Value: Applicable to scenarios requiring clear cluster visualization
Reproducibility: Algorithm description is detailed and easy to implement

Applicable Scenarios

Clustering analysis of high-dimensional data
Clustering tasks requiring variable selection
Exploratory data analysis requiring clear visualization
Bioinformatics and genomics data analysis

References

Main references include:

Zhang & Dai (2009): Original work on optimal discriminant clustering
Wang et al. (2016): Sparse optimal discriminant clustering
Chi & Lange (2015): ADMM algorithm for convex clustering
Hunter & Lange (2004): MM algorithm and majorization functions

Overall Assessment: This is a high-quality statistical methodology paper that proposes the RSODC method with theoretical innovations and comprehensive experimental validation. Although computational complexity is higher, it demonstrates significant improvements in clustering quality and visualization effects, making important contributions to the dimensionality reduction clustering field.