A Graphical Method for Identifying Gene Clusters from RNA Sequencing Data

Patock, Ratnapriya, Barman
The identification of disease-gene associations is instrumental in understanding the mechanisms of diseases and developing novel treatments. Besides identifying genes from RNA-Seq datasets, it is often necessary to identify gene clusters that have relationships with a disease. In this work, we propose a graph-based method for using an RNA-Seq dataset with known genes related to a disease and perform a robust clustering analysis to identify clusters of genes. Our method involves the construction of a gene co-expression network, followed by the computation of gene embeddings leveraging Node2Vec+, an algorithm applying weighted biased random walks and skipgram with negative sampling to compute node embeddings from undirected graphs with weighted edges. Finally, we perform spectral clustering to identify clusters of genes. All processes in our entire method are jointly optimized for stability, robustness, and optimality by applying Tree-structured Parzen Estimator. Our method was applied to an RNA-Seq dataset of known genes that have associations with Age-related Macular Degeneration (AMD). We also performed tests to validate and verify the robustness and statistical significance of our methods due to the stochastic nature of the involved processes. Our results show that our method is capable of generating consistent and robust clustering results. Our method can be seamlessly applied to other RNA-Seq datasets due to our process of joint optimization, ensuring the stability and optimality of the several steps in our method, including the construction of a gene co-expression network, computation of gene embeddings, and clustering of genes. Our work will aid in the discovery of natural structures in the RNA-Seq data, and understanding gene regulation and gene functions not just for AMD but for any disease in general.

Basic Information

  • Paper ID: 2511.09590
  • Title: A Graphical Method for Identifying Gene Clusters from RNA Sequencing Data
  • Authors: Jake R. Patock (Rice University), Rinki Ratnapriya (Baylor College of Medicine), Arko Barman (Rice University)
  • Classification: q-bio.GN (Genomics)
  • Submission Date: November 12, 2025 (arXiv submission)
  • Paper Link: https://arxiv.org/abs/2511.09590

Abstract

This study proposes a graph-based method for identifying disease-associated gene clusters from RNA sequencing data. The method first constructs a gene co-expression network, then computes gene embeddings using the Node2Vec+ algorithm, and finally identifies gene clusters through spectral clustering. The entire pipeline is jointly optimized using Tree-structured Parzen Estimator (TPE) to ensure stability, robustness, and optimality. The method is applied to an RNA-Seq dataset of 81 known age-related macular degeneration (AMD)-related genes, and validation experiments demonstrate that the method generates consistent and robust clustering results.

Research Background and Motivation

1. Research Problem

Gene expression regulation is a key mechanism through which genetic variation mediates human disease risk. Identifying individual disease-related genes from RNA-Seq datasets is important, but identifying gene clusters associated with a disease is equally necessary, because such clusters help to:

  • Understand shared biological pathways or processes
  • Identify potentially undiscovered genes
  • Target disease mechanisms rather than individual genes for therapeutic intervention

2. Importance of the Problem

  • Precision Medicine Needs: Findings from gene expression studies have tremendous potential for translation into precision medicine
  • AMD Research Gap: Although some AMD-related genes have been identified, most of the genetic heritability remains unexplained
  • Clinical Application Value: Discovery of new gene relationships can lead to novel drug targets, patient risk testing, and improved diagnostics

3. Limitations of Existing Methods

  • Traditional Statistical Methods: Hypothesis testing and similar approaches tend to produce noisy results and false positives in large-scale datasets
  • Step-wise Optimization Problem: Existing methods typically optimize individual steps (network construction, embedding computation, clustering) separately, which cannot guarantee overall pipeline optimality
  • Insufficient Robustness: Lack of systematic validation of stochastic processes

4. Research Motivation

Develop an end-to-end, jointly optimized gene clustering pipeline that can:

  • Handle high noise in transcriptomic data
  • Optimize the pipeline as a whole rather than each step in isolation
  • Provide statistical significance and robustness guarantees
  • Be easily transferable to other diseases and datasets

Core Contributions

  1. Innovative Pipeline Design: Proposes a complete gene clustering pipeline including gene co-expression network construction, Node2Vec+ embedding computation, and spectral clustering
  2. Joint Optimization Strategy: For the first time, jointly optimizes all pipeline steps rather than traditional step-wise optimization, using TPE to optimize 9 hyperparameters to maximize the DBCVI clustering metric
  3. Robustness Verification Framework: Designs a comprehensive testing scheme including:
    • 100 repeated experiments to verify consistency
    • Statistical significance testing against random gene sets
    • Adjusted Mutual Information (AMI) assessment of clustering stability
  4. Practicality and Scalability:
    • No need for expensive computational resources like GPUs
    • Seamlessly applicable to other RNA-Seq datasets
    • Provides visualization results for medical professionals

Detailed Methodology

Task Definition

Input: Bulk mRNA-seq dataset containing n_c = 105 control samples and n_s = 61 late-stage AMD samples, restricted to 81 known AMD-related genes

Output: A partition of the 81 genes into k* clusters of functionally similar genes

Constraints:

  • Need to handle sequencing depth differences
  • Consider uncertainty from stochastic processes
  • Ensure statistical significance

Model Architecture

The overall pipeline consists of three main stages, whose hyperparameters are tuned jointly with TPE:

1. Gene Co-expression Network Construction

  • CS-CORE Method: Uses the CS-CORE statistical method to compute co-expression matrices, which corrects for sequencing depth differences and is more accurate than Pearson correlation coefficients
  • Graph Construction:
    • Nodes: 81 genes
    • Edges: Undirected weighted edges are added when the absolute value of CS-CORE co-expression exceeds threshold τ
    • Edge Weights: CS-CORE co-expression coefficients
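
The graph construction rule above can be sketched in a few lines. This is a minimal illustration, assuming the CS-CORE co-expression matrix has already been computed and is passed in as a plain nested list; the function name `build_coexpression_graph` and the toy gene names are hypothetical and do not come from the paper.

```python
from typing import Dict, List

Graph = Dict[str, Dict[str, float]]


def build_coexpression_graph(
    genes: List[str], coexpr: List[List[float]], tau: float
) -> Graph:
    """Build an undirected weighted graph as a nested adjacency dict.

    An edge is added between two genes when the absolute value of their
    co-expression exceeds tau; the coefficient itself is the edge weight.
    """
    graph: Graph = {gene: {} for gene in genes}
    for i in range(len(genes)):
        for j in range(i + 1, len(genes)):
            weight = coexpr[i][j]
            if abs(weight) > tau:
                # Store both directions so the graph is undirected.
                graph[genes[i]][genes[j]] = weight
                graph[genes[j]][genes[i]] = weight
    return graph


# Toy 3-gene matrix (illustrative values only, not data from the paper).
toy_genes = ["G1", "G2", "G3"]
toy_coexpr = [
    [1.0, 0.6, 0.1],
    [0.6, 1.0, -0.45],
    [0.1, -0.45, 1.0],
]
toy_graph = build_coexpression_graph(toy_genes, toy_coexpr, tau=0.4)
```

With τ = 0.4, the toy graph keeps the G1–G2 and G2–G3 edges and drops G1–G3, whose coefficient has absolute value below the threshold.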

2. Node2Vec+ Gene Embedding

Node2Vec+ extends classical Node2Vec to handle weighted graphs better. It proceeds in two stages:

Stage One: Weighted Biased Random Walk

  • Select anchor nodes
  • Execute weighted biased random walks considering three hyperparameters:
    • Return hyperparameter p: Controls tendency to return to visited nodes
    • In-out hyperparameter q: Controls tendency to explore new regions
    • Relaxation hyperparameter γ: Set to 0 to ensure robustness
  • Record visited node sequences
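
To make the roles of p and q concrete, here is a simplified sketch of a second-order biased walk. Note the substitution: it applies the classical Node2Vec bias (1/p for returning, 1 for a shared neighbor, 1/q for moving outward) multiplied by the absolute edge weight, and it does not implement the Node2Vec+ edge-weight-aware rule or the relaxation parameter γ. The function name `biased_walk` and the toy graph are hypothetical.

```python
import random
from typing import Dict, List, Optional

Graph = Dict[str, Dict[str, float]]


def biased_walk(
    graph: Graph, start: str, walk_length: int, p: float, q: float,
    rng: random.Random,
) -> List[str]:
    """Return a walk of up to walk_length nodes starting at `start`."""
    walk = [start]
    prev: Optional[str] = None
    while len(walk) < walk_length:
        cur = walk[-1]
        neighbors = sorted(graph[cur])
        if not neighbors:
            break  # isolated node: the walk cannot continue
        scores = []
        for nxt in neighbors:
            if prev is None:
                bias = 1.0        # first step: no second-order information
            elif nxt == prev:
                bias = 1.0 / p    # return to the previously visited node
            elif nxt in graph[prev]:
                bias = 1.0        # stay close to the previous node
            else:
                bias = 1.0 / q    # move outward to a new region
            # Assumption: use |weight| so negative co-expression still
            # yields a valid (positive) sampling weight.
            scores.append(bias * abs(graph[cur][nxt]))
        chosen = rng.choices(neighbors, weights=scores, k=1)[0]
        prev = cur
        walk.append(chosen)
    return walk


toy_graph = {
    "A": {"B": 0.6, "C": 0.5},
    "B": {"A": 0.6, "C": 0.7},
    "C": {"A": 0.5, "B": 0.7, "D": 0.45},
    "D": {"C": 0.45},
    "E": {},
}
example_walk = biased_walk(toy_graph, "A", 6, p=0.35, q=11.66,
                           rng=random.Random(0))
```

A small p (here 0.35) makes returning to the previous node more likely, while a large q (here 11.66) discourages steps away from the previous node's neighborhood, so walks stay local.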

Stage Two: Skip-Gram with Negative Sampling (SGNS)

  • Input: Anchor nodes
  • Labels: Neighbor nodes
  • Training: 100 epochs
  • Execute 32,768 random walks to generate training data
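
The way random walks become SGNS training data can be illustrated as follows. This is a sketch under assumptions: pairs are formed from every node within the window on either side of the anchor, and negatives are drawn uniformly from the other nodes (word2vec-style implementations typically use a smoothed frequency distribution instead). The names `skipgram_pairs` and `negative_samples` are hypothetical.

```python
import random
from typing import List, Tuple


def skipgram_pairs(walk: List[str], window_size: int) -> List[Tuple[str, str]]:
    """Return (anchor, neighbor) pairs for nodes within the window."""
    pairs = []
    for i, anchor in enumerate(walk):
        lo = max(0, i - window_size)
        hi = min(len(walk), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((anchor, walk[j]))
    return pairs


def negative_samples(
    vocab: List[str], positive: str, n_neg: int, rng: random.Random
) -> List[str]:
    """Draw n_neg nodes uniformly (with replacement), excluding `positive`."""
    candidates = [node for node in vocab if node != positive]
    return [rng.choice(candidates) for _ in range(n_neg)]


pairs = skipgram_pairs(["A", "B", "C"], window_size=1)
negatives = negative_samples(["A", "B", "C", "D"], positive="B", n_neg=7,
                             rng=random.Random(0))
```

Here `window_size` corresponds to WS and `n_neg` to Ns in the hyperparameter list.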

Optimized Hyperparameters:

  • p, q: Random walk behavior
  • WL: Walk length per iteration
  • E: Embedding dimension
  • WS: Window size
  • Ns: Number of negative samples per positive sample

3. Spectral Clustering

Employs the Spectrum method, specifically designed for multi-omics data:

Adaptive Density-Aware Kernel: Affinity matrix is defined as:

A_ij = exp( -d^2(s_i, s_j) / (σ_i σ_j (CNN(s_i, s_j) + 1)) )

Where:

  • d(s_i, s_j): Euclidean distance between nodes s_i and s_j
  • σ_i, σ_j: Local scale parameters (distance to the P-th nearest neighbor)
  • CNN(s_i, s_j): Size of the intersection of the S nearest neighbors of s_i and s_j
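
The kernel can be written out directly from these definitions. The sketch below is a plain-Python illustration, assuming that a point is not counted among its own nearest neighbors, that the diagonal of the affinity matrix is left at zero, and that no two points coincide (otherwise a local scale would be zero). The function name `density_aware_affinity` is hypothetical.

```python
import math
from typing import List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def density_aware_affinity(
    points: List[Sequence[float]], P: int, S: int
) -> List[List[float]]:
    """Affinity exp(-d^2 / (sigma_i * sigma_j * (CNN + 1))) for all pairs."""
    n = len(points)
    dist = [[euclidean(points[i], points[j]) for j in range(n)]
            for i in range(n)]
    sigma = []   # distance from each point to its P-th nearest neighbor
    knn = []     # set of the S nearest neighbors of each point
    for i in range(n):
        others = [j for j in range(n) if j != i]
        others.sort(key=lambda j: dist[i][j])
        sigma.append(dist[i][others[P - 1]])
        knn.append(set(others[:S]))
    affinity = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                cnn = len(knn[i] & knn[j])  # shared nearest neighbors
                affinity[i][j] = math.exp(
                    -dist[i][j] ** 2 / (sigma[i] * sigma[j] * (cnn + 1))
                )
    return affinity


# Two well-separated toy groups in 2D (illustrative values only).
toy_points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
A = density_aware_affinity(toy_points, P=1, S=2)
```

Points that share many neighbors get a larger denominator and therefore a higher affinity, which is what makes the kernel density-aware.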

Clustering Number Estimation:

  • Construct the diagonal degree matrix D and the normalized graph Laplacian: L = D^(-1/2) A D^(-1/2)
  • Eigendecomposition yields eigenvectors V and eigenvalues Λ
  • Compute the dip test statistic z_i for each eigenvector
  • Calculate the multimodality gap: d_i = z_i - z_(i-1)
  • Use the last significant multimodality gap to determine the optimal cluster number k*
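
Two of these steps are simple enough to show directly: the symmetric normalization and the gap computation. The sketch assumes every node has positive degree and that the dip statistics are already available as a list; the eigendecomposition and the dip test are omitted, and because this summary does not state how a gap is judged "significant", the selection of k* is left out. Function names are hypothetical.

```python
import math
from typing import List


def normalize_affinity(A: List[List[float]]) -> List[List[float]]:
    """Compute D^(-1/2) A D^(-1/2), where D holds the row sums of A."""
    degrees = [sum(row) for row in A]
    n = len(A)
    return [
        [A[i][j] / math.sqrt(degrees[i] * degrees[j]) for j in range(n)]
        for i in range(n)
    ]


def multimodality_gaps(z: List[float]) -> List[float]:
    """Return the gaps z_i - z_(i-1) between consecutive dip statistics."""
    return [z[i] - z[i - 1] for i in range(1, len(z))]


L_norm = normalize_affinity([[0.0, 2.0], [2.0, 0.0]])
gaps = multimodality_gaps([0.5, 0.25, 1.0])  # toy dip statistics
```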

Final Clustering:

  • Stack the first k* eigenvectors to form matrix X
  • Row-normalize to obtain Y
  • Use Gaussian Mixture Model (GMM) to cluster rows of Y
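
The row-normalization step can be sketched as below, assuming (as in standard normalized spectral clustering) that each row of X is scaled to unit Euclidean length; the summary does not name the norm, so this is an assumption. The GMM step is not shown.

```python
import math
from typing import List


def row_normalize(X: List[List[float]]) -> List[List[float]]:
    """Scale each row of X to unit L2 norm; all-zero rows are left as-is."""
    Y = []
    for row in X:
        norm = math.sqrt(sum(value * value for value in row))
        Y.append([value / norm for value in row] if norm > 0 else list(row))
    return Y


Y = row_normalize([[3.0, 4.0], [0.0, 0.0]])
```

After this step every non-zero row lies on the unit sphere, so the subsequent clustering depends on the direction of each row rather than its magnitude.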

Technical Innovations

1. Joint Optimization vs. Step-wise Optimization

Traditional Approach:

  • Separately optimize network construction → separately optimize embedding → separately optimize clustering
  • Each step is locally optimal, but overall optimality is not guaranteed

This Paper's Approach:

  • Define a single objective function: maximize DBCVI (Density-Based Clustering Validation Index)
  • Simultaneously optimize 9 hyperparameters
  • Use TPE for Bayesian optimization with 256 samples
  • Repeat each configuration 8 times and average to handle stochasticity
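
The structure of this joint search can be sketched as a single loop over whole-pipeline configurations. To stay self-contained, the sketch swaps in plain random search where the paper uses TPE, and `run_pipeline` is a hypothetical callable that would build the graph, compute embeddings, cluster, and return the DBCVI for one seed. The ranges follow Table I; sampling p and q log-uniformly is an assumption.

```python
import random
from statistics import mean
from typing import Callable, Dict, Tuple

Config = Dict[str, float]
Pipeline = Callable[[Config, int], float]  # (config, seed) -> DBCVI


def sample_config(rng: random.Random) -> Config:
    """Draw one value for each of the 9 hyperparameters."""
    P = rng.randint(3, 7)
    return {
        "tau": rng.uniform(0.3, 0.5),
        "p": 10 ** rng.uniform(-2, 2),   # [0.01, 100], log-uniform (assumed)
        "q": 10 ** rng.uniform(-2, 2),
        "WL": rng.randint(10, 30),
        "E": rng.randint(2, 16),
        "WS": rng.randint(4, 10),
        "Ns": rng.randint(5, 15),
        "P": P,
        "S": rng.randint(P + 2, P + 4),  # S is conditioned on P
    }


def averaged_score(config: Config, run_pipeline: Pipeline,
                   repetitions: int = 8) -> float:
    """Average the objective over repeated runs to damp stochasticity."""
    return mean(run_pipeline(config, seed) for seed in range(repetitions))


def joint_search(run_pipeline: Pipeline, n_trials: int = 256,
                 repetitions: int = 8, seed: int = 0) -> Tuple[Config, float]:
    """Random-search stand-in for TPE over the whole pipeline."""
    rng = random.Random(seed)
    best_config: Config = {}
    best_score = float("-inf")
    for _ in range(n_trials):
        config = sample_config(rng)
        score = averaged_score(config, run_pipeline, repetitions)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score
```

With the defaults, the loop performs 256 × 8 = 2048 full pipeline runs. A TPE sampler would replace `sample_config` with proposals informed by earlier trials, but the single objective and the per-configuration averaging stay the same.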

2. Choice of Node2Vec+

Compared to classical Node2Vec:

  • Considers edge weights in second-order random walks
  • Better performance on biological networks and datasets
  • More suitable for characteristics of gene co-expression networks

3. Robustness Assurance Mechanism

  • Handling Stochasticity: Repeat each hyperparameter configuration 8 times
  • Consistency Verification: 100 complete pipeline repetitions
  • Statistical Testing: Compare against 100 random gene sets

Experimental Setup

Dataset

Source: Bulk mRNA-seq data from AMD patients

  • Control Group: 105 samples (Minnesota grading system level 1)
  • Case Group: 61 late-stage AMD patients (Minnesota grading system level 4)
  • Analyzed Genes: 81 known AMD-related genes (pre-identified and validated through ML methods and SHAP interpretability analysis)

Evaluation Metrics

1. DBCVI (Density-Based Clustering Validation Index)

  • Suitable for non-convex clustering algorithms (e.g., spectral clustering)
  • Range: -1 to 1; higher is better
  • Serves as the objective function for joint optimization

2. AMI (Adjusted Mutual Information)

  • Evaluates consistency between clustering results
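  • Range: At most 1 (identical clusterings); values near 0 indicate chance-level agreement, and negative values are possible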
  • Suitable for small clusters and imbalanced cluster sizes
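
For readers unfamiliar with AMI, the following sketch computes it from two label lists using only the standard library. It assumes the common definition that normalizes by the arithmetic mean of the two entropies and uses the hypergeometric model for the expected mutual information; whether the paper uses exactly this variant is not stated in this summary.

```python
import math
from collections import Counter
from typing import Hashable, Iterable, Sequence


def _entropy(counts: Iterable[int], n: int) -> float:
    return -sum((c / n) * math.log(c / n) for c in counts if c > 0)


def adjusted_mutual_info(u: Sequence[Hashable], v: Sequence[Hashable]) -> float:
    """AMI = (MI - E[MI]) / (mean(H(u), H(v)) - E[MI])."""
    n = len(u)
    a, b = Counter(u), Counter(v)
    joint = Counter(zip(u, v))
    mi = sum(
        (nij / n) * math.log(n * nij / (a[x] * b[y]))
        for (x, y), nij in joint.items()
    )
    # Expected MI under random labelings with the same cluster sizes.
    emi = 0.0
    for ai in a.values():
        for bj in b.values():
            for nij in range(max(1, ai + bj - n), min(ai, bj) + 1):
                log_prob = (
                    math.lgamma(ai + 1) + math.lgamma(bj + 1)
                    + math.lgamma(n - ai + 1) + math.lgamma(n - bj + 1)
                    - math.lgamma(n + 1) - math.lgamma(nij + 1)
                    - math.lgamma(ai - nij + 1) - math.lgamma(bj - nij + 1)
                    - math.lgamma(n - ai - bj + nij + 1)
                )
                emi += ((nij / n) * math.log(n * nij / (ai * bj))
                        * math.exp(log_prob))
    denom = (_entropy(a.values(), n) + _entropy(b.values(), n)) / 2 - emi
    if abs(denom) < 1e-15:
        return 1.0  # degenerate case, e.g. both labelings have one cluster
    return (mi - emi) / denom


same = adjusted_mutual_info([0, 0, 1, 1, 2, 2], ["b", "b", "c", "c", "a", "a"])
unrelated = adjusted_mutual_info([0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1])
```

The first call compares a clustering with a relabeled copy of itself and gives an AMI of 1; the second compares two labelings that share no information and gives a negative value, because the observed mutual information is below what chance alone would produce.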

3. Statistical Testing

  • Kolmogorov-Smirnov (K-S) Test: Detects distribution differences
  • k-sample Anderson-Darling Test: Non-parametric testing
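
As an illustration of what the K-S test measures, the sketch below computes the two-sample K-S statistic, the largest vertical distance between the two empirical distribution functions. It does not compute a p-value, and the function name `ks_statistic` is hypothetical; in practice a statistics library would normally be used for both.

```python
from typing import Sequence


def ks_statistic(x: Sequence[float], y: Sequence[float]) -> float:
    """Maximum absolute difference between the empirical CDFs of x and y."""
    xs, ys = sorted(x), sorted(y)
    values = sorted(set(xs) | set(ys))
    i = j = 0
    d = 0.0
    for value in values:
        # Advance each pointer past all observations <= value.
        while i < len(xs) and xs[i] <= value:
            i += 1
        while j < len(ys) and ys[j] <= value:
            j += 1
        d = max(d, abs(i / len(xs) - j / len(ys)))
    return d


identical = ks_statistic([1, 2, 3], [1, 2, 3])
disjoint = ks_statistic([1, 2, 3], [4, 5, 6])
```

The statistic is 0 when the two samples have the same empirical distribution and 1 when they do not overlap at all. In the paper's setting, the two samples would be the DBCVI scores from the AMD gene runs and from the random gene sets.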

Comparison Methods

  • Random Gene Sets: Randomly select 81 genes from all genes, repeated 100 times
  • Purpose: Verify that AMD-related genes cluster significantly better than random genes

Implementation Details

Hyperparameter Search Space (Table I):

Method               Hyperparameter   Search Space     Optimal Value
Graph Construction   τ                [0.3, 0.5]       0.4
Node2Vec+            p                [0.01, 100.0]    0.35
                     q                [0.01, 100.0]    11.66
                     WL               [10, 30]         20
                     E                [2, 16]          10
                     WS               [4, 10]          10
                     Ns               [5, 15]          7
Spectral Clustering  P                [3, 7]           7
                     S                [P+2, P+4]       11

Training Configuration:

  • TPE sampling iterations: 256
  • Repetitions per configuration: 8
  • SGNS training epochs: 100
  • Random walk iterations: 32,768
  • γ fixed at 0

Experimental Results

Main Results

1. Optimization Performance

  • Optimization Phase DBCVI: 0.99 (average of 8 trials)
  • 100-Repetition Average DBCVI: 0.95
  • Optimal Embedding Dimension: E = 10

2. Robustness Verification

  • AMI Mean: 0.49
  • AMI Variance: 0.022
  • Interpretation: Clustering results show moderate to high consistency, performing well for small-scale datasets with potential noise

3. Statistical Significance

AMD Genes vs. Random Genes:

  • AMD genes average DBCVI: 0.95
  • Random genes average DBCVI: 0.84
  • K-S test: p = 2.68 × 10^(-25)
  • Anderson-Darling test: p < 0.001

Conclusion: Clustering quality of AMD-related genes is significantly superior to random gene sets, with extremely high statistical significance

Visualization Results

  • Use UMAP to reduce 10-dimensional embeddings to 3D for visualization (Figure 2)
  • Provide interactive HTML visualization (code repository)
  • Clustering structure is clearly discernible, facilitating interpretation by medical professionals

Experimental Findings

1. Advantages of Joint Optimization

  • Compared to step-wise optimization, joint optimization produces more consistent, robust, and optimal clustering results
  • A single cost function targets the quality of the whole pipeline rather than separate per-step optima

2. Impact of Random Walk Iterations

  • More random walks lead to higher AMI
  • When computational resources are sufficient, increasing random walk iterations can further improve consistency

3. Role of CS-CORE

  • Compared to Pearson correlation coefficients, CS-CORE generates more refined co-expression networks
  • Corrects for sequencing depth differences, reducing false positives

4. Impact of Dataset Scale

  • Current dataset has limited sample size (166 samples)
  • Larger datasets are expected to produce more consistent results and higher AMI

Related Work

1. Machine Learning Applications in RNA-Seq Data

  • Breast Cancer: Multi-class logistic regression for molecular subtype stratification [5]
  • Colorectal Cancer: Identification of diagnostic biomarkers [15]
  • AMD: ML identification of differentially expressed genes and independent regulatory gene sets [14, 24, 29]

2. Classical ML Algorithms

  • Supervised Learning: SVM, XGBoost
  • Unsupervised Learning: SOM, k-means, hierarchical clustering
  • Dimensionality Reduction: t-SNE, PCA

3. Graph-Based Deep Learning

  • Knowledge Graphs: Application to transcriptomics [28]
  • Node2Vec: Application to melanoma and other diseases [30]
  • GNN: Capturing complex inter-gene dependencies [2]

Advantages of This Paper

  • End-to-End Optimization: First to propose joint optimization of the entire pipeline
  • Robustness Guarantees: Systematic statistical validation framework
  • Practicality: No GPU required, easily applicable to other datasets
  • Interpretability: Provides visualization results for clinical use

Conclusions and Discussion

Main Conclusions

  1. Method Effectiveness: The proposed graph-based method can identify robust and statistically significant gene clusters from RNA-Seq data
  2. Importance of Joint Optimization: Joint optimization of all pipeline steps produces superior overall results compared to step-wise optimization
  3. Statistical Verification: Clustering quality of AMD-related genes is significantly superior to random gene sets (p < 10^-20)
  4. Robustness: Despite involving multiple stochastic processes, 100 repeated experiments show moderate to high consistency (AMI = 0.49)
  5. Scalability: The method can be seamlessly applied to other diseases and RNA-Seq datasets

Limitations

1. Dataset Scale

  • Relatively limited sample size (166 samples)
  • Analysis of only 81 pre-identified genes
  • Larger-scale datasets may produce more stable results

2. Validation Methods

  • Lack of validation on synthetic datasets with known ground truth labels
  • No experimental biological validation

3. Computational Cost

  • While not requiring GPU, 256 TPE samples × 8 repetitions still require considerable time
  • Increasing random walk iterations significantly increases computational cost

4. Method Assumptions

  • Assumes CS-CORE is applicable to bulk RNA-seq data (originally designed for single-cell data)
  • Assumes gene relationships can be adequately captured through co-expression networks

Future Directions

1. Synthetic Data Validation

Use synthetic datasets with known ground truth to evaluate the method more rigorously and to validate independently that it can recover known cluster structure.

2. Extension to More Diseases

Apply the method to RNA-Seq datasets from other diseases to verify generalizability

3. Experimental Validation

Collaborate with molecular geneticists to experimentally validate identified gene clusters

4. Method Improvements

  • Explore more efficient optimization algorithms
  • Investigate strategies for adaptive adjustment of random walk iterations
  • Integrate other omics data (proteomics, metabolomics)

5. Clinical Applications

  • Develop user-friendly tools for clinical researchers
  • Integrate into disease diagnosis and drug target discovery pipelines

In-Depth Evaluation

Strengths

1. Method Innovation (★★★★★)

  • Joint Optimization Strategy: First to implement end-to-end joint optimization in gene clustering pipeline, breaking through limitations of traditional step-wise optimization
  • Technical Integration: Skillfully combines CS-CORE, Node2Vec+, and spectral clustering, with each component having sufficient theoretical support
  • Optimization Algorithm Choice: TPE as a Bayesian optimization method is more efficient than grid search

2. Experimental Sufficiency (★★★★☆)

  • Robustness Verification: 100 repeated experiments systematically evaluate consistency
  • Statistical Significance: Dual testing using K-S and Anderson-Darling tests
  • Control Design: Comparison with 100 random gene sets proves method specificity
  • Limitation: Lacks direct comparison with other gene clustering methods

3. Result Convincingness (★★★★☆)

  • High DBCVI Scores: Average score of 0.95 indicates excellent clustering quality
  • Highly Significant p-values: p < 10^-20 indicates that the difference from random gene sets is very unlikely to be due to chance
  • Moderate AMI: AMI of 0.49 is reasonable for noisy data
  • Visualization: UMAP dimensionality reduction visualization enhances interpretability

4. Writing Clarity (★★★★★)

  • Clear pipeline diagram (Figure 1)
  • Well-formatted algorithm pseudocode (Algorithm 1)
  • Complete hyperparameter table (Table I)
  • Detailed method description facilitates reproducibility

5. Practical Value (★★★★★)

  • No Expensive Hardware Required: Does not depend on GPU, lowering usage barriers
  • Open Source Code: Provides GitHub repository
  • Strong Transferability: Joint optimization ensures applicability to new datasets
  • Clinical Relevance: Directly addresses AMD, an important ophthalmological disease

Weaknesses

1. Method Limitations

  • CS-CORE Assumption: Originally designed for single-cell data; applicability to bulk data not fully verified
  • Shallow Embedding: Node2Vec+ learns shallow embeddings, which may fail to capture highly non-linear gene relationships
  • Static Network: Does not consider time or condition-specific dynamic networks

2. Experimental Design Flaws

  • Lack of Method Comparison: No quantitative comparison with other gene clustering methods (e.g., WGCNA, hierarchical clustering)
  • Single Dataset: Validation only on AMD dataset; generalization ability not fully demonstrated
  • No Ground Truth: Lacks validation set with known cluster labels

3. Insufficient Analysis

  • Biological Interpretation: No functional enrichment or pathway analysis of identified gene clusters
  • Cluster Quantity: Does not discuss specific identified cluster number k* and its biological significance
  • Hyperparameter Sensitivity: No analysis of how hyperparameter changes affect results

4. Computational Efficiency

  • Optimization Cost: 256 TPE samples × 8 repetitions = 2048 model trainings, relatively high computational cost
  • Scalability: For large-scale analysis of thousands of genes, computational complexity may become a bottleneck

Impact Assessment

1. Contribution to Field (★★★★☆)

  • Methodological Contribution: Joint optimization paradigm may inspire design of other bioinformatics pipelines
  • AMD Research: Provides new tools for AMD gene function research
  • General Framework: Generalizable to other diseases and omics data

2. Practical Value (★★★★★)

  • Drug Target Discovery: Gene clusters can guide identification of novel drug targets
  • Patient Stratification: May be used for AMD patient subtype classification
  • Hypothesis Generation: Provides testable hypotheses for experimental biologists

3. Reproducibility (★★★★★)

  • Open Source Code: Complete GitHub repository
  • Detailed Description: Sufficient method and hyperparameter description
  • Available Data: Uses publicly available AMD dataset
  • Interactive Visualization: Provides HTML visualization files

4. Citation Potential (★★★★☆)

  • Method Innovation: Joint optimization strategy likely to be widely cited
  • Application Value: May be adopted by AMD and other disease researchers
  • Limitation: Single dataset validation may limit early citations

Applicable Scenarios

1. Ideal Application Scenarios

  • Functional Grouping of Known Disease-Related Genes: When a set of disease-related genes is available and functional classification is needed
  • Small to Medium-Scale Gene Sets: Clustering analysis of tens to hundreds of genes
  • Exploratory Research: Discovering potential relationships and structures among genes
  • Multi-Disease Comparison: Comparing gene cluster patterns across different diseases

2. Less Suitable Scenarios

  • Genome-Scale Analysis: Analysis of tens of thousands of genes may face computational bottlenecks
  • Time Series Data: Current method does not consider temporal dynamics
  • Single-Cell Data: Although it uses CS-CORE, the overall pipeline is designed for bulk data
  • Causal Inference: Method identifies correlations rather than causal relationships

3. Extended Applications

  • Protein Interaction Networks: Can be adapted for protein network analysis
  • Metabolic Pathway Analysis: Can be applied to metabolite networks
  • Multi-Omics Integration: Can be extended to integrate multiple omics data types

Key References

  1. [10] Grover & Leskovec (2016): Original Node2Vec paper, proposing a random walk-based graph embedding method
  2. [13] Liu et al. (2023): Node2Vec+, an improved version that considers edge weights in biological network embedding
  3. [12] John et al. (2020): Spectrum spectral clustering method, proposing the adaptive density-aware kernel and multimodality gap
  4. [26] Su et al. (2023): CS-CORE method, correcting co-expression estimation in single-cell RNA-seq
  5. [14] Ma et al. (2025): Original AMD gene identification study, providing the 81 genes analyzed in this paper
  6. [18] Moulavi et al. (2014): DBCVI clustering validation metric, suitable for non-convex clustering
  7. [3] Bergstra et al. (2013): TPE hyperparameter optimization method

Summary

This is a bioinformatics paper with strong methodological innovation and reasonable experimental design. The greatest highlight is the joint optimization strategy, which breaks through the limitations of traditional step-wise optimization and provides a new paradigm for gene clustering pipeline design. Robustness verification is thorough, statistical significance is evident, and practical value is high.

Main weaknesses include: (1) lack of direct comparison with other methods; (2) validation on only a single dataset; (3) absence of biological function analysis. Future work should validate on multiple datasets and conduct systematic comparison with traditional methods (e.g., WGCNA), while adding functional annotation and experimental validation of gene clusters.

Overall, this is a high-quality computational biology paper with important reference value for RNA-Seq data analysis and disease gene research. Recommendation Score: 8.5/10