2025-11-10T02:57:56.733881

Regularized Sparse Optimal Discriminant Clustering

Hiraishi, Tanioka, Yadohisa
We propose a new method based on sparse optimal discriminant clustering (SODC), incorporating a penalty term into the scoring matrix based on convex clustering. With the addition of this penalty term, it is expected to improve the accuracy of cluster identification by pulling points within the same cluster closer together and points from different clusters further apart. When the estimation results are visualized, the clustering structure can be depicted more clearly. Moreover, we develop a novel algorithm to derive the updated formula of this scoring matrix using a majorizing function. The scoring matrix is updated using the alternating direction method of multipliers (ADMM), which is often employed to calculate the parameters of the objective function in the convex clustering. In the proposed method, as in the conventional SODC, the scoring matrix is subject to an orthogonal constraint. Therefore, it is necessary to satisfy the orthogonal constraint on the scoring matrix while maintaining the clustering structure. Using a majorizing function, we adress the challenge of enforcing both orthogonal constraint and the clustering structure within the scoring matrix. We demonstrate numerical simulations and an application to real data to assess the performance of the proposed method.
academic

Regularized Sparse Optimal Discriminant Clustering

Basic Information

  • Paper ID: 2501.10147
  • Title: Regularized Sparse Optimal Discriminant Clustering
  • Authors: Mayu Hiraishi, Kensuke Tanioka, Hiroshi Yadohisa (Doshisha University)
  • Classification: stat.ME (Statistical Methods)
  • Publication Date: October 15, 2025
  • Paper Link: https://arxiv.org/abs/2501.10147

Abstract

This paper proposes a novel method based on Sparse Optimal Discriminant Clustering (SODC) by incorporating a penalty term derived from convex clustering into the scoring matrix. By adding this penalty term, the authors aim to improve clustering identification accuracy by pulling points within the same cluster closer together and pushing points from different clusters farther apart. When visualization of estimation results is performed, the clustering structure can be depicted more clearly. Furthermore, the authors develop a novel algorithm that uses majorization functions to derive update formulas for the scoring matrix. The scoring matrix is updated using the Alternating Direction Method of Multipliers (ADMM), a method commonly employed for computing parameters in convex clustering objective functions.

Research Background and Motivation

Problem Definition

Dimensionality reduction clustering is widely used in interpreting features of large complex datasets. It estimates a low-dimensional space to identify clusters while maintaining important features of high-dimensional data and achieving efficient processing. Although existing Optimal Discriminant Clustering (ODC) and Sparse Optimal Discriminant Clustering (SODC) methods describe clusters more clearly than Principal Component Analysis, they suffer from the following issues:

  1. Scoring Matrix Structure Problem: The scoring matrix in SODC does not maintain the same cluster identification structure as the optimal scoring in LDA
  2. Lack of Independent Cluster Information Matrix: ODC and SODC do not include an independent matrix containing cluster information, which may affect the accuracy of cluster estimation
  3. Poor Visualization Effects: When SODC reduces data to low-dimensional space and visualizes results, it may fail to produce well-separated cluster structures

Research Motivation

To address these issues, the authors propose adding a penalty term based on convex clustering to SODC, enabling the scoring matrix to provide clearer cluster structures than traditional SODC by pulling data points from the same cluster closer together and separating data points from different clusters.

Core Contributions

  1. Proposes RSODC Method: Adds a regularization term based on convex clustering to SODC, improving cluster identification accuracy
  2. Develops Novel Algorithm: Uses majorization functions to derive update formulas for the scoring matrix while satisfying orthogonality constraints and cluster structure requirements
  3. ADMM Optimization Framework: Employs the Alternating Direction Method of Multipliers to update the scoring matrix, effectively handling complex constraints
  4. Theoretical and Empirical Validation: Verifies method effectiveness through numerical simulations and real data applications

Methodology Details

Task Definition

Given data matrix XRn×pX \in \mathbb{R}^{n \times p}, the objective is to identify kk clusters in low-dimensional space while performing variable selection and dimensionality reduction.

Model Architecture

RSODC Objective Function

The optimization problem for RSODC is defined as:

minB,Y12YHnXBF2+η2BF2+η1j=1pβj2+γi<jαi,jyiyj2\min_{B,Y^{\dagger}} \frac{1}{2}\|Y^{\dagger} - H_nXB\|_F^2 + \eta_2\|B\|_F^2 + \eta_1\sum_{j=1}^p\|\beta_j\|_2 + \gamma\sum_{i<j}\alpha_{i,j}\|y_i^{\dagger} - y_j^{\dagger}\|_2

Subject to constraints: YY=Ik1Y^{\dagger\top}Y^{\dagger} = I_{k-1} and Y1=0Y^{\dagger\top}1 = 0

Where:

  • The first three terms are identical to SODC
  • The fourth term is a penalty term based on convex clustering, encouraging similar samples to be closer
  • αi,j\alpha_{i,j} is a weight calculated as: αi,j=ιδi,jexp(τxixj22)\alpha_{i,j} = \iota_{\delta_{i,j}}\exp(-\tau\|x_i - x_j\|_2^2)

ADMM Decomposition

To apply the ADMM algorithm, the problem is reformulated as:

minB,Y,V,Λ12YHnXBF2+η2BF2+η1j=1pβj2+γlεαlvl2\min_{B,Y,V,\Lambda} \frac{1}{2}\|Y - H_nXB\|_F^2 + \eta_2\|B\|_F^2 + \eta_1\sum_{j=1}^p\|\beta_j\|_2 + \gamma\sum_{l \in \varepsilon}\alpha_l\|v_l\|_2

Subject to constraints:

  • yiyj=vly_i - y_j = v_l
  • YY=Ik1Y^{\top}Y = I_{k-1}
  • Y1=0Y^{\top}1 = 0

Technical Innovations

Majorization Function Method

The key innovation is using majorization functions to handle quadratic terms in scoring matrix updates. For the quadratic form tr(YCY)\text{tr}(Y^{\top}CY), a majorization function is constructed:

tr(YCY)2ω2tr(Y(ωIC)Q)tr(QCQ)\text{tr}(Y^{\top}CY) \leq 2\omega - 2\text{tr}(Y^{\top}(\omega I - C)Q) - \text{tr}(Q^{\top}CQ)

Where ω\omega is the largest eigenvalue of C=ρ2lεglglC = \frac{\rho}{2}\sum_{l \in \varepsilon}g_lg_l^{\top}.

Orthogonal Procrustes Analysis

Through the majorization function, the Y update is transformed into an orthogonal Procrustes problem:

minYYDF2,s.t. YY=I\min_Y \|Y - D\|_F^2, \quad \text{s.t. } Y^{\top}Y = I

The solution is YLRY \leftarrow LR^{\top}, where D=LΣRD = L\Sigma R^{\top} is the singular value decomposition.

Experimental Setup

Datasets

  1. Simulated Data:
    • Sample size n=60,96,156n = 60, 96, 156
    • Number of variables p=20,50,80,100p = 20, 50, 80, 100
    • Number of clusters k=3,4k = 3, 4
    • Number of informative variables q=2q = 2
  2. Real Data: Breast cancer proteomics data (breast TCGA)
    • 150 samples, 142 proteins
    • 3 cancer subtypes: Basal, Her2, LumA
    • 10 informative variables and 70 non-informative variables selected

Evaluation Metrics

  • Adjusted Rand Index (ARI): Assesses clustering accuracy
  • Variance Ratio: Ratio of within-cluster variance to between-cluster variance
  • Sensitivity and Specificity: Evaluate variable selection effectiveness

Comparison Methods

  • Sparse Optimal Discriminant Clustering (SODC)
  • Tandem Clustering
  • Reduced k-means
  • Factorial k-means
  • t-SNE

Implementation Details

  • Parameter selection: Cross-validation based on kappa coefficient
  • η2=0\eta_2 = 0, τ=0.1\tau = 0.1, δ=25\delta = 25
  • Convergence threshold: ϵ>0\epsilon > 0

Experimental Results

Main Results

Simulation Experiments

RSODC outperforms comparison methods on the ARI metric across all simulation settings:

  • When k=3: RSODC performs best in nearly all scenarios
  • When k=4: RSODC outperforms all comparison methods
  • Stability: As pp increases, RSODC remains stable while SODC becomes unstable
  • Clustering Quality: As cluster center distance ϑ\vartheta increases, RSODC's ARI values increase more noticeably

Real Data Application

Results on breast cancer data:

MethodARI(XB^X\hat{B})ARI(Y^\hat{Y})Variance Ratio(XB^X\hat{B})Variance Ratio(Y^\hat{Y})
RSODC0.4060.4413.0383.056
SODC0.4010.3632.9092.660

Ablation Studies

Parameter Sensitivity Analysis

  • Weight Parameters: ARI is highest when τ=0\tau = 0 and δ=0.01\delta = 0.01
  • Tuning Parameters: Different combinations of η1,γ,ρ\eta_1, \gamma, \rho have minimal impact on results
  • Cluster Number Selection: Gap statistic effectively determines the true number of clusters

Computational Complexity

RSODC requires more computation time than SODC, particularly when nn is large, but provides better clustering quality.

Case Analysis

Visualization results demonstrate that RSODC can:

  • Aggregate points within the same cluster more tightly
  • Separate points from different clusters more distinctly
  • Provide clearer cluster boundaries

Dimensionality Reduction Clustering Methods

  • Optimal Discriminant Clustering (ODC): Zhang and Dai (2009) based on optimal scoring from Fisher's Linear Discriminant Analysis
  • Sparse ODC (SODC): Wang et al. (2016) adding group lasso penalty to ODC
  • Convex Clustering: Pelckmans et al. (2005), Hocking et al. (2011) using convex optimization for stable clustering

Innovations in This Work

Compared to existing methods, RSODC's main advantages are:

  1. Approximates the model in the reduced-dimensional space rather than the original space
  2. Uses majorization functions to simultaneously handle orthogonality constraints and cluster structure
  3. Provides clearer cluster visualization effects

Conclusions and Discussion

Main Conclusions

  1. RSODC significantly improves cluster identification accuracy by adding a convex clustering penalty term
  2. The majorization function method effectively addresses the simultaneous satisfaction of orthogonality constraints and cluster structure
  3. Experiments validate method effectiveness on both simulated and real data

Limitations

  1. Computational Complexity: Requires more computation time than SODC
  2. Parameter Selection: Cross-validation is computationally expensive
  3. Weight Calculation: Computing weights in the original space may not be optimal
  4. Data Distribution Assumptions: Implicitly assumes errors follow a normal distribution

Future Directions

  1. Develop more efficient parameter selection methods
  2. Compute weights in the low-dimensional space
  3. Derive information criteria to reduce cross-validation costs
  4. Consider extensions for non-normal distributions

In-Depth Evaluation

Strengths

  1. Strong Methodological Innovation: Cleverly combines convex clustering and sparse discriminant analysis
  2. Solid Theoretical Foundation: Majorization function method is theoretically rigorous
  3. Comprehensive Experiments: Includes five simulation experiments and real data validation
  4. Elegant Algorithm Design: ADMM framework effectively handles complex constraints

Weaknesses

  1. Computational Efficiency: Significantly higher computational cost compared to SODC
  2. Parameter Sensitivity: Multiple hyperparameters require tuning
  3. Applicable Scope: Primarily suitable for small-scale clustering problems
  4. Theoretical Analysis: Lacks convergence and complexity theoretical analysis

Impact

  1. Academic Contribution: Provides new perspectives for dimensionality reduction clustering
  2. Practical Value: Applicable to scenarios requiring clear cluster visualization
  3. Reproducibility: Algorithm description is detailed and easy to implement

Applicable Scenarios

  • Clustering analysis of high-dimensional data
  • Clustering tasks requiring variable selection
  • Exploratory data analysis requiring clear visualization
  • Bioinformatics and genomics data analysis

References

Main references include:

  • Zhang & Dai (2009): Original work on optimal discriminant clustering
  • Wang et al. (2016): Sparse optimal discriminant clustering
  • Chi & Lange (2015): ADMM algorithm for convex clustering
  • Hunter & Lange (2004): MM algorithm and majorization functions

Overall Assessment: This is a high-quality statistical methodology paper that proposes the RSODC method with theoretical innovations and comprehensive experimental validation. Although computational complexity is higher, it demonstrates significant improvements in clustering quality and visualization effects, making important contributions to the dimensionality reduction clustering field.