We propose a new method based on sparse optimal discriminant clustering (SODC), incorporating a penalty term into the scoring matrix based on convex clustering. With the addition of this penalty term, it is expected to improve the accuracy of cluster identification by pulling points within the same cluster closer together and points from different clusters further apart. When the estimation results are visualized, the clustering structure can be depicted more clearly. Moreover, we develop a novel algorithm to derive the updated formula of this scoring matrix using a majorizing function. The scoring matrix is updated using the alternating direction method of multipliers (ADMM), which is often employed to calculate the parameters of the objective function in the convex clustering. In the proposed method, as in the conventional SODC, the scoring matrix is subject to an orthogonal constraint. Therefore, it is necessary to satisfy the orthogonal constraint on the scoring matrix while maintaining the clustering structure. Using a majorizing function, we adress the challenge of enforcing both orthogonal constraint and the clustering structure within the scoring matrix. We demonstrate numerical simulations and an application to real data to assess the performance of the proposed method.
- Paper ID: 2501.10147
- Title: Regularized Sparse Optimal Discriminant Clustering
- Authors: Mayu Hiraishi, Kensuke Tanioka, Hiroshi Yadohisa (Doshisha University)
- Classification: stat.ME (Statistical Methods)
- Publication Date: October 15, 2025
- Paper Link: https://arxiv.org/abs/2501.10147
This paper proposes a novel method based on Sparse Optimal Discriminant Clustering (SODC) by incorporating a penalty term derived from convex clustering into the scoring matrix. By adding this penalty term, the authors aim to improve clustering identification accuracy by pulling points within the same cluster closer together and pushing points from different clusters farther apart. When visualization of estimation results is performed, the clustering structure can be depicted more clearly. Furthermore, the authors develop a novel algorithm that uses majorization functions to derive update formulas for the scoring matrix. The scoring matrix is updated using the Alternating Direction Method of Multipliers (ADMM), a method commonly employed for computing parameters in convex clustering objective functions.
Dimensionality reduction clustering is widely used in interpreting features of large complex datasets. It estimates a low-dimensional space to identify clusters while maintaining important features of high-dimensional data and achieving efficient processing. Although existing Optimal Discriminant Clustering (ODC) and Sparse Optimal Discriminant Clustering (SODC) methods describe clusters more clearly than Principal Component Analysis, they suffer from the following issues:
- Scoring Matrix Structure Problem: The scoring matrix in SODC does not maintain the same cluster identification structure as the optimal scoring in LDA
- Lack of Independent Cluster Information Matrix: ODC and SODC do not include an independent matrix containing cluster information, which may affect the accuracy of cluster estimation
- Poor Visualization Effects: When SODC reduces data to low-dimensional space and visualizes results, it may fail to produce well-separated cluster structures
To address these issues, the authors propose adding a penalty term based on convex clustering to SODC, enabling the scoring matrix to provide clearer cluster structures than traditional SODC by pulling data points from the same cluster closer together and separating data points from different clusters.
- Proposes RSODC Method: Adds a regularization term based on convex clustering to SODC, improving cluster identification accuracy
- Develops Novel Algorithm: Uses majorization functions to derive update formulas for the scoring matrix while satisfying orthogonality constraints and cluster structure requirements
- ADMM Optimization Framework: Employs the Alternating Direction Method of Multipliers to update the scoring matrix, effectively handling complex constraints
- Theoretical and Empirical Validation: Verifies method effectiveness through numerical simulations and real data applications
Given data matrix X∈Rn×p, the objective is to identify k clusters in low-dimensional space while performing variable selection and dimensionality reduction.
The optimization problem for RSODC is defined as:
minB,Y†21∥Y†−HnXB∥F2+η2∥B∥F2+η1∑j=1p∥βj∥2+γ∑i<jαi,j∥yi†−yj†∥2
Subject to constraints: Y†⊤Y†=Ik−1 and Y†⊤1=0
Where:
- The first three terms are identical to SODC
- The fourth term is a penalty term based on convex clustering, encouraging similar samples to be closer
- αi,j is a weight calculated as: αi,j=ιδi,jexp(−τ∥xi−xj∥22)
To apply the ADMM algorithm, the problem is reformulated as:
minB,Y,V,Λ21∥Y−HnXB∥F2+η2∥B∥F2+η1∑j=1p∥βj∥2+γ∑l∈εαl∥vl∥2
Subject to constraints:
- yi−yj=vl
- Y⊤Y=Ik−1
- Y⊤1=0
The key innovation is using majorization functions to handle quadratic terms in scoring matrix updates. For the quadratic form tr(Y⊤CY), a majorization function is constructed:
tr(Y⊤CY)≤2ω−2tr(Y⊤(ωI−C)Q)−tr(Q⊤CQ)
Where ω is the largest eigenvalue of C=2ρ∑l∈εglgl⊤.
Through the majorization function, the Y update is transformed into an orthogonal Procrustes problem:
minY∥Y−D∥F2,s.t. Y⊤Y=I
The solution is Y←LR⊤, where D=LΣR⊤ is the singular value decomposition.
- Simulated Data:
- Sample size n=60,96,156
- Number of variables p=20,50,80,100
- Number of clusters k=3,4
- Number of informative variables q=2
- Real Data: Breast cancer proteomics data (breast TCGA)
- 150 samples, 142 proteins
- 3 cancer subtypes: Basal, Her2, LumA
- 10 informative variables and 70 non-informative variables selected
- Adjusted Rand Index (ARI): Assesses clustering accuracy
- Variance Ratio: Ratio of within-cluster variance to between-cluster variance
- Sensitivity and Specificity: Evaluate variable selection effectiveness
- Sparse Optimal Discriminant Clustering (SODC)
- Tandem Clustering
- Reduced k-means
- Factorial k-means
- t-SNE
- Parameter selection: Cross-validation based on kappa coefficient
- η2=0, τ=0.1, δ=25
- Convergence threshold: ϵ>0
RSODC outperforms comparison methods on the ARI metric across all simulation settings:
- When k=3: RSODC performs best in nearly all scenarios
- When k=4: RSODC outperforms all comparison methods
- Stability: As p increases, RSODC remains stable while SODC becomes unstable
- Clustering Quality: As cluster center distance ϑ increases, RSODC's ARI values increase more noticeably
Results on breast cancer data:
| Method | ARI(XB^) | ARI(Y^) | Variance Ratio(XB^) | Variance Ratio(Y^) |
|---|
| RSODC | 0.406 | 0.441 | 3.038 | 3.056 |
| SODC | 0.401 | 0.363 | 2.909 | 2.660 |
- Weight Parameters: ARI is highest when τ=0 and δ=0.01
- Tuning Parameters: Different combinations of η1,γ,ρ have minimal impact on results
- Cluster Number Selection: Gap statistic effectively determines the true number of clusters
RSODC requires more computation time than SODC, particularly when n is large, but provides better clustering quality.
Visualization results demonstrate that RSODC can:
- Aggregate points within the same cluster more tightly
- Separate points from different clusters more distinctly
- Provide clearer cluster boundaries
- Optimal Discriminant Clustering (ODC): Zhang and Dai (2009) based on optimal scoring from Fisher's Linear Discriminant Analysis
- Sparse ODC (SODC): Wang et al. (2016) adding group lasso penalty to ODC
- Convex Clustering: Pelckmans et al. (2005), Hocking et al. (2011) using convex optimization for stable clustering
Compared to existing methods, RSODC's main advantages are:
- Approximates the model in the reduced-dimensional space rather than the original space
- Uses majorization functions to simultaneously handle orthogonality constraints and cluster structure
- Provides clearer cluster visualization effects
- RSODC significantly improves cluster identification accuracy by adding a convex clustering penalty term
- The majorization function method effectively addresses the simultaneous satisfaction of orthogonality constraints and cluster structure
- Experiments validate method effectiveness on both simulated and real data
- Computational Complexity: Requires more computation time than SODC
- Parameter Selection: Cross-validation is computationally expensive
- Weight Calculation: Computing weights in the original space may not be optimal
- Data Distribution Assumptions: Implicitly assumes errors follow a normal distribution
- Develop more efficient parameter selection methods
- Compute weights in the low-dimensional space
- Derive information criteria to reduce cross-validation costs
- Consider extensions for non-normal distributions
- Strong Methodological Innovation: Cleverly combines convex clustering and sparse discriminant analysis
- Solid Theoretical Foundation: Majorization function method is theoretically rigorous
- Comprehensive Experiments: Includes five simulation experiments and real data validation
- Elegant Algorithm Design: ADMM framework effectively handles complex constraints
- Computational Efficiency: Significantly higher computational cost compared to SODC
- Parameter Sensitivity: Multiple hyperparameters require tuning
- Applicable Scope: Primarily suitable for small-scale clustering problems
- Theoretical Analysis: Lacks convergence and complexity theoretical analysis
- Academic Contribution: Provides new perspectives for dimensionality reduction clustering
- Practical Value: Applicable to scenarios requiring clear cluster visualization
- Reproducibility: Algorithm description is detailed and easy to implement
- Clustering analysis of high-dimensional data
- Clustering tasks requiring variable selection
- Exploratory data analysis requiring clear visualization
- Bioinformatics and genomics data analysis
Main references include:
- Zhang & Dai (2009): Original work on optimal discriminant clustering
- Wang et al. (2016): Sparse optimal discriminant clustering
- Chi & Lange (2015): ADMM algorithm for convex clustering
- Hunter & Lange (2004): MM algorithm and majorization functions
Overall Assessment: This is a high-quality statistical methodology paper that proposes the RSODC method with theoretical innovations and comprehensive experimental validation. Although computational complexity is higher, it demonstrates significant improvements in clustering quality and visualization effects, making important contributions to the dimensionality reduction clustering field.