A Graphical Method for Identifying Gene Clusters from RNA Sequencing Data

Patock, Ratnapriya, Barman

The identification of disease-gene associations is instrumental in understanding the mechanisms of diseases and developing novel treatments. Besides identifying genes from RNA-Seq datasets, it is often necessary to identify gene clusters that have relationships with a disease. In this work, we propose a graph-based method for using an RNA-Seq dataset with known genes related to a disease and perform a robust clustering analysis to identify clusters of genes. Our method involves the construction of a gene co-expression network, followed by the computation of gene embeddings leveraging Node2Vec+, an algorithm applying weighted biased random walks and skipgram with negative sampling to compute node embeddings from undirected graphs with weighted edges. Finally, we perform spectral clustering to identify clusters of genes. All processes in our entire method are jointly optimized for stability, robustness, and optimality by applying Tree-structured Parzen Estimator. Our method was applied to an RNA-Seq dataset of known genes that have associations with Age-related Macular Degeneration (AMD). We also performed tests to validate and verify the robustness and statistical significance of our methods due to the stochastic nature of the involved processes. Our results show that our method is capable of generating consistent and robust clustering results. Our method can be seamlessly applied to other RNA-Seq datasets due to our process of joint optimization, ensuring the stability and optimality of the several steps in our method, including the construction of a gene co-expression network, computation of gene embeddings, and clustering of genes. Our work will aid in the discovery of natural structures in the RNA-Seq data, and understanding gene regulation and gene functions not just for AMD but for any disease in general.

Basic Information

Paper ID: 2511.09590
Title: A Graphical Method for Identifying Gene Clusters from RNA Sequencing Data
Authors: Jake R. Patock (Rice University), Rinki Ratnapriya (Baylor College of Medicine), Arko Barman (Rice University)
Classification: q-bio.GN (Genomics)
Submission Date: November 12, 2025 (arXiv submission)
Paper Link: https://arxiv.org/abs/2511.09590

Abstract

This study proposes a graph-based method for identifying disease-associated gene clusters from RNA sequencing data. The method first constructs a gene co-expression network, then computes gene embeddings using the Node2Vec+ algorithm, and finally identifies gene clusters through spectral clustering. The entire pipeline is jointly optimized using Tree-structured Parzen Estimator (TPE) to ensure stability, robustness, and optimality. The method is applied to an RNA-Seq dataset of 81 known age-related macular degeneration (AMD)-related genes, and validation experiments demonstrate that the method generates consistent and robust clustering results.

Research Background and Motivation

1. Research Problem

Gene expression regulation has emerged as a key mechanism through which genetic variation mediates human disease risk. Identifying individual disease-related genes from RNA-Seq datasets is important, but identifying gene clusters with disease associations is equally necessary because it helps to:

Understand shared biological pathways or processes
Identify potentially undiscovered genes
Target disease mechanisms rather than individual genes for therapeutic intervention

2. Importance of the Problem

Precision Medicine Needs: Findings from gene expression studies have tremendous potential for translation into precision medicine
AMD Research Gap: Although some AMD-related genes have been identified, most of the genetic heritability remains unexplained
Clinical Application Value: Discovery of new gene relationships can lead to novel drug targets, patient risk testing, and improved diagnostics

3. Limitations of Existing Methods

Traditional Statistical Methods: Hypothesis testing and similar approaches tend to produce noisy results and false positives in large-scale datasets
Step-wise Optimization Problem: Existing methods typically optimize individual steps (network construction, embedding computation, clustering) separately, which cannot guarantee overall pipeline optimality
Insufficient Robustness: Lack of systematic validation of stochastic processes

4. Research Motivation

Develop an end-to-end, jointly optimized gene clustering pipeline that can:

Handle high noise in transcriptomic data
Ensure overall pipeline optimality rather than local optima
Provide statistical significance and robustness guarantees
Be easily transferable to other diseases and datasets

Core Contributions

Innovative Pipeline Design: Proposes a complete gene clustering pipeline including gene co-expression network construction, Node2Vec+ embedding computation, and spectral clustering
Joint Optimization Strategy: For the first time, jointly optimizes all pipeline steps rather than traditional step-wise optimization, using TPE to optimize 9 hyperparameters to maximize the DBCVI clustering metric
Robustness Verification Framework: Designs a comprehensive testing scheme including:
- 100 repeated experiments to verify consistency
- Statistical significance testing against random gene sets
- Adjusted Mutual Information (AMI) assessment of clustering stability
Practicality and Scalability:
- No need for expensive computational resources like GPUs
- Seamlessly applicable to other RNA-Seq datasets
- Provides visualization results for medical professionals

Detailed Methodology

Task Definition

Input: Bulk mRNA-seq dataset containing n_c = 105 control samples and n_s = 61 late-stage AMD samples, restricted to 81 known AMD-related genes

Output: A partition of the 81 genes into k* clusters of functionally similar genes

Constraints:

Need to handle sequencing depth differences
Consider uncertainty from stochastic processes
Ensure statistical significance

Model Architecture

The overall pipeline consists of three main stages, whose hyperparameters are jointly optimized with TPE:

1. Gene Co-expression Network Construction

CS-CORE Method: Uses the CS-CORE statistical method to compute co-expression matrices, which corrects for sequencing depth differences and is more accurate than Pearson correlation coefficients
Graph Construction:
- Nodes: 81 genes
- Edges: Undirected weighted edges are added when the absolute value of CS-CORE co-expression exceeds threshold τ
- Edge Weights: CS-CORE co-expression coefficients
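The thresholding rule above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `build_coexpression_graph` and the toy matrix are hypothetical, the co-expression values are assumed to be already computed (CS-CORE itself is not reproduced), and keeping the signed coefficient as the edge weight is an assumption.

```python
def build_coexpression_graph(coexp, tau):
    """Build an undirected weighted graph as {node: {neighbor: weight}}.

    coexp is a symmetric matrix of co-expression estimates (assumed to come
    from a method such as CS-CORE); an edge is added when |coexp[i][j]| > tau.
    """
    n = len(coexp)
    graph = {i: {} for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            weight = coexp[i][j]
            if abs(weight) > tau:
                # Store the edge in both directions because the graph is undirected.
                graph[i][j] = weight
                graph[j][i] = weight
    return graph


# Toy 3-gene matrix with made-up values, for illustration only.
coexp = [
    [1.00, 0.62, 0.10],
    [0.62, 1.00, -0.45],
    [0.10, -0.45, 1.00],
]
graph = build_coexpression_graph(coexp, tau=0.4)
```

With tau = 0.4, genes 0 and 1 are connected, genes 1 and 2 are connected through a negative coefficient, and genes 0 and 2 stay unconnected.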

2. Node2Vec+ Gene Embedding

Node2Vec+ is an improvement over classical Node2Vec, better handling weighted graphs:

Stage One: Weighted Biased Random Walk

Select anchor nodes
Execute weighted biased random walks considering three hyperparameters:
- Return hyperparameter p: Controls tendency to return to visited nodes
- In-out hyperparameter q: Controls tendency to explore new regions
- Relaxation hyperparameter γ: Set to 0 to ensure robustness
Record visited node sequences
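To make the roles of p and q concrete, the sketch below implements the classical node2vec second-order bias on a weighted graph. It is a simplification under stated assumptions: Node2Vec+ additionally uses edge weights (and the relaxation γ) when deciding whether a candidate node counts as "in" or "out", which is not reproduced here. The function name `biased_walk` and the toy graph are hypothetical, and weights are assumed positive.

```python
import random


def biased_walk(graph, start, walk_length, p, q, rng):
    """Classical node2vec-style second-order walk on {node: {neighbor: weight}}."""
    walk = [start]
    while len(walk) < walk_length:
        current = walk[-1]
        neighbors = list(graph[current])
        if not neighbors:
            break  # dead end: stop the walk early
        if len(walk) == 1:
            # First step has no previous node, so only edge weights matter.
            weights = [graph[current][nb] for nb in neighbors]
        else:
            previous = walk[-2]
            weights = []
            for nb in neighbors:
                w = graph[current][nb]
                if nb == previous:
                    weights.append(w / p)   # returning to the previous node
                elif nb in graph[previous]:
                    weights.append(w)       # staying close to the previous node
                else:
                    weights.append(w / q)   # moving outward to a new region
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk


# Toy graph with positive, made-up weights.
toy_graph = {
    0: {1: 0.6, 2: 0.5},
    1: {0: 0.6, 2: 0.7},
    2: {0: 0.5, 1: 0.7, 3: 0.4},
    3: {2: 0.4},
}
walk = biased_walk(toy_graph, start=0, walk_length=5, p=0.35, q=11.66, rng=random.Random(0))
```

A small p makes immediate backtracking more likely, while a large q discourages steps to nodes that are not adjacent to the previous node, keeping walks local.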

Stage Two: Skip-Gram with Negative Sampling (SGNS)

Input: Anchor nodes
Labels: Neighbor nodes
Training: 100 epochs
Execute 32,768 random walks to generate training data
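The anchor/neighbor pairing can be illustrated as follows. This sketch only shows how positive training pairs might be derived from one recorded walk with a given window size; the helper name `skipgram_pairs` is hypothetical, and drawing the negative samples and training the embedding itself are not shown.

```python
def skipgram_pairs(walk, window_size):
    """Return (anchor, neighbor) pairs for nodes within window_size steps in a walk."""
    pairs = []
    for i, anchor in enumerate(walk):
        start = max(0, i - window_size)
        stop = min(len(walk), i + window_size + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((anchor, walk[j]))
    return pairs


pairs = skipgram_pairs([0, 1, 2, 3], window_size=1)
# pairs == [(0, 1), (1, 0), (1, 2), (2, 1), (2, 3), (3, 2)]
```

In SGNS, each such positive pair would be contrasted with a number of randomly drawn negative nodes, which corresponds to the hyperparameter Ns.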

Optimized Hyperparameters:

p, q: Random walk behavior
WL: Walk length per iteration
E: Embedding dimension
WS: Window size
Ns: Number of negative samples per positive sample

3. Spectral Clustering

Employs the Spectrum method, specifically designed for multi-omics data:

Adaptive Density-Aware Kernel: Affinity matrix is defined as:

A_ij = exp(-d²(s_i, s_j) / (σ_i σ_j (CNN(s_i, s_j) + 1)))

Where:

d(s_i, s_j): Euclidean distance between nodes s_i and s_j
σ_i, σ_j: Local scale parameters (distance to the P-th nearest neighbor of s_i and s_j, respectively)
CNN(s_i, s_j): Size of the intersection of the S nearest neighbors of s_i and s_j
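The kernel above can be computed directly from its definition. The following is a minimal sketch, not the Spectrum implementation: the function name `adaptive_affinity` and the toy points are hypothetical, points are assumed distinct (so every local scale is positive), and setting the diagonal to zero is an assumption.

```python
import math


def adaptive_affinity(points, P, S):
    """Density-aware affinity: exp(-d^2 / (sigma_i * sigma_j * (CNN + 1)))."""
    n = len(points)
    dist = [[math.dist(a, b) for b in points] for a in points]
    # For each point, the other points ordered from nearest to farthest.
    order = [sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
             for i in range(n)]
    sigma = [dist[i][order[i][P - 1]] for i in range(n)]  # distance to P-th neighbor
    knn = [set(order[i][:S]) for i in range(n)]           # S nearest neighbors
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # diagonal left at zero (assumption)
            common = len(knn[i] & knn[j])
            A[i][j] = math.exp(-dist[i][j] ** 2 / (sigma[i] * sigma[j] * (common + 1)))
    return A


# Two well-separated toy groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
A = adaptive_affinity(points, P=2, S=3)
```

Points that share many nearest neighbors get a larger denominator and therefore a higher affinity, so pairs inside the same dense region are linked more strongly than pairs across regions.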

Clustering Number Estimation:

Construct diagonal matrix D and normalized graph Laplacian: L = D^(-1/2)AD^(-1/2)
Eigendecomposition yields eigenvectors V and eigenvalues Λ
Compute the dip test statistic z_i for each eigenvector v_i
Calculate the multimodality gap: d_i = z_i - z_(i-1)
Use the last significant multimodality gap to determine optimal cluster number k*

Final Clustering:

Stack the first k* eigenvectors to form matrix X
Row-normalize to obtain Y
Use Gaussian Mixture Model (GMM) to cluster rows of Y
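The linear-algebra part of these steps can be sketched with NumPy. This is an illustrative outline under assumptions: the function name `spectral_embedding` and the toy affinity matrix are hypothetical, k is passed in by hand rather than estimated from the dip-test gaps, and the final GMM step is left out.

```python
import numpy as np


def spectral_embedding(A, k):
    """Return the top-k eigenvalues of D^(-1/2) A D^(-1/2) and the row-normalized embedding Y."""
    A = np.asarray(A, dtype=float)
    degree = A.sum(axis=1)                     # diagonal of D
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    L = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L)       # L is symmetric, so eigh applies
    top = np.argsort(eigvals)[::-1][:k]        # indices of the k largest eigenvalues
    X = eigvecs[:, top]                        # stack the first k eigenvectors
    Y = X / np.linalg.norm(X, axis=1, keepdims=True)  # row-normalize
    return eigvals[top], Y


# Toy affinity matrix with two strongly connected pairs and weak links between them.
A = [
    [0.00, 1.00, 0.01, 0.01],
    [1.00, 0.00, 0.01, 0.01],
    [0.01, 0.01, 0.00, 1.00],
    [0.01, 0.01, 1.00, 0.00],
]
eigvals, Y = spectral_embedding(A, k=2)
```

In this toy case the rows of Y for nodes 0 and 1 coincide and are orthogonal to the rows for nodes 2 and 3, which is what makes the subsequent mixture-model clustering of the rows straightforward. A Gaussian mixture implementation such as scikit-learn's `GaussianMixture` could then be fitted on Y.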

Technical Innovations

1. Joint Optimization vs. Step-wise Optimization

Traditional Approach:

Separately optimize network construction → separately optimize embedding → separately optimize clustering
Each step is locally optimal, but overall optimality is not guaranteed

This Paper's Approach:

Define a single objective function: maximize DBCVI (Density-Based Clustering Validation Index)
Simultaneously optimize 9 hyperparameters
Use TPE for Bayesian optimization with 256 samples
Repeat each configuration 8 times and average to handle stochasticity
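The structure of this joint search can be sketched as a single loop over whole-pipeline configurations. Important caveats: the sketch uses plain random search as a stand-in for TPE (a TPE sampler from a library such as Optuna or Hyperopt would replace `sample_params`), `toy_pipeline` is a hypothetical placeholder for the real network, embedding, and clustering steps, only four of the nine hyperparameters are shown, and the sampling scales are assumptions.

```python
import random
import statistics


def sample_params(rng):
    """Draw one configuration; ranges follow the search-space table, scales are assumed."""
    P = rng.randint(3, 7)
    return {
        "tau": rng.uniform(0.3, 0.5),
        "E": rng.randint(2, 16),
        "P": P,
        "S": P + rng.randint(2, 4),  # S is tied to P, as in the search space
    }


def toy_pipeline(params, rng):
    """Hypothetical stand-in for network -> embedding -> clustering -> DBCVI."""
    noise = rng.gauss(0.0, 0.02)  # mimics run-to-run stochasticity
    return 1.0 - abs(params["tau"] - 0.4) + noise


def averaged_score(params, repeats, seed):
    """Average the noisy objective over several repetitions of one configuration."""
    scores = [toy_pipeline(params, random.Random(seed + r)) for r in range(repeats)]
    return statistics.mean(scores)


def joint_search(n_trials=256, repeats=8, seed=0):
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for trial in range(n_trials):
        params = sample_params(rng)
        score = averaged_score(params, repeats, seed=trial * repeats)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


best_params, best_score = joint_search()
```

The key design point is that every hyperparameter, from the graph threshold to the clustering neighborhood sizes, is scored through one averaged objective, so no stage is tuned in isolation.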

2. Choice of Node2Vec+

Compared to classical Node2Vec:

Considers edge weights in second-order random walks
Better performance on biological networks and datasets
More suitable for characteristics of gene co-expression networks

3. Robustness Assurance Mechanism

Handling Stochasticity: Repeat each hyperparameter configuration 8 times
Consistency Verification: 100 complete pipeline repetitions
Statistical Testing: Compare against 100 random gene sets

Experimental Setup

Dataset

Source: Bulk mRNA-seq data from AMD patients

Control Group: 105 samples (Minnesota grading system level 1)
Case Group: 61 late-stage AMD patients (Minnesota grading system level 4)
Analyzed Genes: 81 known AMD-related genes (pre-identified and validated through ML methods and SHAP interpretability analysis)

Evaluation Metrics

1. DBCVI (Density-Based Clustering Validation Index)

Suitable for non-convex clustering algorithms (e.g., spectral clustering)
Range: -1 to 1; higher is better
Serves as the objective function for joint optimization

2. AMI (Adjusted Mutual Information)

Evaluates consistency between clustering results
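Range: At most 1 (identical clusterings); values near 0 indicate chance-level agreement and can be slightly negative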
Suitable for small clusters and imbalanced cluster sizes

3. Statistical Testing

Kolmogorov-Smirnov (K-S) Test: Detects distribution differences
k-sample Anderson-Darling Test: Non-parametric testing
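As a reminder of what the K-S test measures, the sketch below computes the two-sample K-S statistic, i.e. the largest gap between two empirical distribution functions. The function name `ks_statistic` is hypothetical, the p-value computation is omitted, and the score lists are made-up numbers for illustration, not values from the paper.

```python
from bisect import bisect_right


def ks_statistic(xs, ys):
    """Two-sample Kolmogorov-Smirnov statistic: max |F_x(v) - F_y(v)| over observed values."""
    xs, ys = sorted(xs), sorted(ys)
    largest_gap = 0.0
    for v in sorted(set(xs) | set(ys)):
        fx = bisect_right(xs, v) / len(xs)  # empirical CDF of xs at v
        fy = bisect_right(ys, v) / len(ys)  # empirical CDF of ys at v
        largest_gap = max(largest_gap, abs(fx - fy))
    return largest_gap


# Illustrative, made-up score samples.
scores_a = [0.93, 0.95, 0.96, 0.97]
scores_b = [0.80, 0.84, 0.85, 0.88]
d = ks_statistic(scores_a, scores_b)
```

When the two samples do not overlap at all, as in this toy case, the statistic reaches its maximum of 1.0; identical samples give 0.0.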

Comparison Methods

Random Gene Sets: Randomly select 81 genes from all genes, repeated 100 times
Purpose: Verify that AMD-related genes cluster significantly better than random genes

Implementation Details

Hyperparameter Search Space (Table I):

Method	Hyperparameter	Search Space (range)	Optimal Value
Graph Construction	τ	[0.3, 0.5]	0.4
Node2Vec+	p	[0.01, 100.0]	0.35
Node2Vec+	q	[0.01, 100.0]	11.66
Node2Vec+	WL	[10, 30]	20
Node2Vec+	E	[2, 16]	10
Node2Vec+	WS	[4, 10]	10
Node2Vec+	Ns	[5, 15]	7
Spectral Clustering	P	[3, 7]	7
Spectral Clustering	S	[P+2, P+4]	11

Training Configuration:

TPE sampling iterations: 256
Repetitions per configuration: 8
SGNS training epochs: 100
Random walk iterations: 32,768
γ fixed at 0

Experimental Results

Main Results

1. Optimization Performance

Optimization Phase DBCVI: 0.99 (average of 8 trials)
100-Repetition Average DBCVI: 0.95
Optimal Embedding Dimension: E = 10

2. Robustness Verification

AMI Mean: 0.49
AMI Variance: 0.022
Interpretation: Clustering results show moderate to high consistency, performing well for small-scale datasets with potential noise

3. Statistical Significance

AMD Genes vs. Random Genes:

AMD genes average DBCVI: 0.95
Random genes average DBCVI: 0.84
K-S test: p = 2.68 × 10^(-25)
Anderson-Darling test: p < 0.001

Conclusion: Clustering quality of AMD-related genes is significantly superior to random gene sets, with extremely high statistical significance

Visualization Results

Use UMAP to reduce 10-dimensional embeddings to 3D for visualization (Figure 2)
Provide interactive HTML visualization (code repository)
Clustering structure is clearly discernible, facilitating interpretation by medical professionals

Experimental Findings

1. Advantages of Joint Optimization

Compared to step-wise optimization, joint optimization produces more consistent, robust, and optimal clustering results
A single cost function targets optimality of the pipeline as a whole rather than separate per-step optima

2. Impact of Random Walk Iterations

More random walks lead to higher AMI
When computational resources are sufficient, increasing random walk iterations can further improve consistency

3. Role of CS-CORE

Compared to Pearson correlation coefficients, CS-CORE generates more refined co-expression networks
Corrects for sequencing depth differences, reducing false positives

4. Impact of Dataset Scale

Current dataset has limited sample size (166 samples)
Larger datasets are expected to produce more consistent results and higher AMI

Related Work

1. Machine Learning Applications in RNA-Seq Data

Breast Cancer: Multi-class logistic regression for molecular subtype stratification [5]
Colorectal Cancer: Identification of diagnostic biomarkers [15]
AMD: ML identification of differentially expressed genes and independent regulatory gene sets [14, 24, 29]

2. Classical ML Algorithms

Supervised Learning: SVM, XGBoost
Unsupervised Learning: SOM, k-means, hierarchical clustering
Dimensionality Reduction: t-SNE, PCA

3. Graph-Based Deep Learning

Knowledge Graphs: Application to transcriptomics [28]
Node2Vec: Application to melanoma and other diseases [30]
GNN: Capturing complex inter-gene dependencies [2]

Advantages of This Paper

End-to-End Optimization: First to propose joint optimization of the entire pipeline
Robustness Guarantees: Systematic statistical validation framework
Practicality: No GPU required, easily applicable to other datasets
Interpretability: Provides visualization results for clinical use

Conclusions and Discussion

Main Conclusions

Method Effectiveness: The proposed graph-based method can identify robust and statistically significant gene clusters from RNA-Seq data
Importance of Joint Optimization: Joint optimization of all pipeline steps produces superior overall results compared to step-wise optimization
Statistical Verification: Clustering quality of AMD-related genes is significantly superior to random gene sets (p < 10^-20)
Robustness: Despite involving multiple stochastic processes, 100 repeated experiments show moderate to high consistency (AMI = 0.49)
Scalability: The method can be seamlessly applied to other diseases and RNA-Seq datasets

Limitations

1. Dataset Scale

Relatively limited sample size (166 samples)
Analysis of only 81 pre-identified genes
Larger-scale datasets may produce more stable results

2. Validation Methods

Lack of validation on synthetic datasets with known ground truth labels
No experimental biological validation

3. Computational Cost

While not requiring GPU, 256 TPE samples × 8 repetitions still require considerable time
Increasing random walk iterations significantly increases computational cost

4. Method Assumptions

Assumes CS-CORE is applicable to bulk RNA-seq data (originally designed for single-cell data)
Assumes gene relationships can be adequately captured through co-expression networks

Future Directions

1. Synthetic Data Validation

Use synthetic datasets with known ground truth to perform more rigorous evaluation and independently validate the method's ability to recover information structure

2. Extension to More Diseases

Apply the method to RNA-Seq datasets from other diseases to verify generalizability

3. Experimental Validation

Collaborate with molecular geneticists to experimentally validate identified gene clusters

4. Method Improvements

Explore more efficient optimization algorithms
Investigate strategies for adaptive adjustment of random walk iterations
Integrate other omics data (proteomics, metabolomics)

5. Clinical Applications

Develop user-friendly tools for clinical researchers
Integrate into disease diagnosis and drug target discovery pipelines

In-Depth Evaluation

Strengths

1. Method Innovation (★★★★★)

Joint Optimization Strategy: First to implement end-to-end joint optimization in gene clustering pipeline, breaking through limitations of traditional step-wise optimization
Technical Integration: Skillfully combines CS-CORE, Node2Vec+, and spectral clustering, with each component having sufficient theoretical support
Optimization Algorithm Choice: TPE as a Bayesian optimization method is more efficient than grid search

2. Experimental Sufficiency (★★★★☆)

Robustness Verification: 100 repeated experiments systematically evaluate consistency
Statistical Significance: Dual testing using K-S and Anderson-Darling tests
Control Design: Comparison with 100 random gene sets proves method specificity
Limitation: Lacks direct comparison with other gene clustering methods

3. Result Convincingness (★★★★☆)

High DBCVI Scores: Average score of 0.95 indicates excellent clustering quality
Highly Significant p-values: p < 10^-20 indicates that the DBCVI distribution for AMD genes differs from that of random gene sets
Moderate AMI: AMI of 0.49 is reasonable for noisy data
Visualization: UMAP dimensionality reduction visualization enhances interpretability

4. Writing Clarity (★★★★★)

Clear pipeline diagram (Figure 1)
Well-formatted algorithm pseudocode (Algorithm 1)
Complete hyperparameter table (Table I)
Detailed method description facilitates reproducibility

5. Practical Value (★★★★★)

No Expensive Hardware Required: Does not depend on GPU, lowering usage barriers
Open Source Code: Provides GitHub repository
Strong Transferability: Joint optimization ensures applicability to new datasets
Clinical Relevance: Directly addresses AMD, an important ophthalmological disease

Weaknesses

1. Method Limitations

CS-CORE Assumption: Originally designed for single-cell data; applicability to bulk data not fully verified
Shallow Embedding: Node2Vec+ is a shallow embedding method and may fail to capture highly non-linear gene relationships
Static Network: Does not consider time or condition-specific dynamic networks

2. Experimental Design Flaws

Lack of Method Comparison: No quantitative comparison with other gene clustering methods (e.g., WGCNA, hierarchical clustering)
Single Dataset: Validation only on AMD dataset; generalization ability not fully demonstrated
No Ground Truth: Lacks validation set with known cluster labels

3. Insufficient Analysis

Biological Interpretation: No functional enrichment or pathway analysis of identified gene clusters
Cluster Quantity: Does not discuss specific identified cluster number k* and its biological significance
Hyperparameter Sensitivity: No analysis of how hyperparameter changes affect results

4. Computational Efficiency

Optimization Cost: 256 TPE samples × 8 repetitions = 2048 model trainings, relatively high computational cost
Scalability: For large-scale analysis of thousands of genes, computational complexity may become a bottleneck

Impact Assessment

1. Contribution to Field (★★★★☆)

Methodological Contribution: Joint optimization paradigm may inspire design of other bioinformatics pipelines
AMD Research: Provides new tools for AMD gene function research
General Framework: Generalizable to other diseases and omics data

2. Practical Value (★★★★★)

Drug Target Discovery: Gene clusters can guide identification of novel drug targets
Patient Stratification: May be used for AMD patient subtype classification
Hypothesis Generation: Provides testable hypotheses for experimental biologists

3. Reproducibility (★★★★★)

Open Source Code: Complete GitHub repository
Detailed Description: Sufficient method and hyperparameter description
Available Data: Uses publicly available AMD dataset
Interactive Visualization: Provides HTML visualization files

4. Citation Potential (★★★★☆)

Method Innovation: Joint optimization strategy likely to be widely cited
Application Value: May be adopted by AMD and other disease researchers
Limitation: Single dataset validation may limit early citations

Applicable Scenarios

1. Ideal Application Scenarios

Functional Grouping of Known Disease-Related Genes: When a set of disease-related genes is available and functional classification is needed
Small to Medium-Scale Gene Sets: Clustering analysis of tens to hundreds of genes
Exploratory Research: Discovering potential relationships and structures among genes
Multi-Disease Comparison: Comparing gene cluster patterns across different diseases

2. Less Suitable Scenarios

Genome-Scale Analysis: Analysis of tens of thousands of genes may face computational bottlenecks
Time Series Data: Current method does not consider temporal dynamics
Single-Cell Data: Although it uses CS-CORE, the overall pipeline is designed for bulk data
Causal Inference: Method identifies correlations rather than causal relationships

3. Extended Applications

Protein Interaction Networks: Can be adapted for protein network analysis
Metabolic Pathway Analysis: Can be applied to metabolite networks
Multi-Omics Integration: Can be extended to integrate multiple omics data types

Key References

[10] Grover & Leskovec (2016): Original Node2Vec paper, proposing a random walk-based graph embedding method
[13] Liu et al. (2023): Node2Vec+, an improved version that considers edge weights in biological network embedding
[12] John et al. (2020): Spectrum spectral clustering method, proposing the adaptive density-aware kernel and multimodality gap
[26] Su et al. (2023): CS-CORE method, correcting co-expression estimation in single-cell RNA-seq
[14] Ma et al. (2025): Original AMD gene identification study, providing the 81 genes analyzed in this paper
[18] Moulavi et al. (2014): DBCVI clustering validation metric, suitable for non-convex clusters
[3] Bergstra et al. (2013): TPE hyperparameter optimization method

Summary

This is a bioinformatics paper with strong methodological innovation and reasonable experimental design. The greatest highlight is the joint optimization strategy, which breaks through the limitations of traditional step-wise optimization and provides a new paradigm for gene clustering pipeline design. Robustness verification is thorough, statistical significance is evident, and practical value is high.

Main weaknesses include: (1) lack of direct comparison with other methods; (2) validation on only a single dataset; (3) absence of biological function analysis. Future work should validate on multiple datasets and conduct systematic comparison with traditional methods (e.g., WGCNA), while adding functional annotation and experimental validation of gene clusters.

Overall, this is a high-quality computational biology paper with important reference value for RNA-Seq data analysis and disease gene research. Recommendation Score: 8.5/10