A Graphical Method for Identifying Gene Clusters from RNA Sequencing Data
Patock, Ratnapriya, Barman
The identification of disease-gene associations is instrumental in understanding the mechanisms of diseases and developing novel treatments. Besides identifying genes from RNA-Seq datasets, it is often necessary to identify gene clusters that have relationships with a disease. In this work, we propose a graph-based method that takes an RNA-Seq dataset of genes known to be related to a disease and performs a robust clustering analysis to identify clusters of genes. Our method involves the construction of a gene co-expression network, followed by the computation of gene embeddings using Node2Vec+, an algorithm that applies weighted biased random walks and skip-gram with negative sampling to compute node embeddings from undirected graphs with weighted edges. Finally, we perform spectral clustering to identify clusters of genes. All steps of our method are jointly optimized for stability, robustness, and optimality by applying the Tree-structured Parzen Estimator. Our method was applied to an RNA-Seq dataset of genes known to be associated with Age-related Macular Degeneration (AMD). Because the processes involved are stochastic, we also performed tests to verify the robustness and statistical significance of our method. Our results show that our method is capable of generating consistent and robust clustering results. Our method can be seamlessly applied to other RNA-Seq datasets because the joint optimization ensures the stability and optimality of its several steps, including the construction of a gene co-expression network, the computation of gene embeddings, and the clustering of genes. Our work will aid in the discovery of natural structures in RNA-Seq data and in understanding gene regulation and gene functions, not just for AMD but for diseases in general.
This study proposes a graph-based method for identifying disease-associated gene clusters from RNA sequencing data. The method first constructs a gene co-expression network, then computes gene embeddings using the Node2Vec+ algorithm, and finally identifies gene clusters through spectral clustering. The entire pipeline is jointly optimized using the Tree-structured Parzen Estimator (TPE) to ensure stability, robustness, and optimality. The method is applied to an RNA-Seq dataset of 81 genes known to be associated with age-related macular degeneration (AMD), and validation experiments show that it generates consistent and robust clustering results.
Gene expression regulation is recognized as a key mechanism through which genetic variation mediates human disease risk. While identifying individual disease-related genes from RNA-Seq datasets is important, identifying gene clusters with disease associations is equally necessary, because it helps to:
Understand shared biological pathways or processes
Identify potentially undiscovered genes
Target disease mechanisms rather than individual genes for therapeutic intervention
Innovative Pipeline Design: Proposes a complete gene clustering pipeline including gene co-expression network construction, Node2Vec+ embedding computation, and spectral clustering
Joint Optimization Strategy: For the first time, jointly optimizes all pipeline steps instead of using traditional step-wise optimization; TPE tunes 9 hyperparameters to maximize the Density-Based Clustering Validation Index (DBCVI) clustering metric
Robustness Verification Framework: Designs a comprehensive testing scheme including:
100 repeated experiments to verify consistency
Statistical significance testing against random gene sets
Adjusted Mutual Information (AMI) assessment of clustering stability
Practicality and Scalability:
No need for expensive computational resources like GPUs
Seamlessly applicable to other RNA-Seq datasets
Provides visualization results for medical professionals
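For reference, the Adjusted Mutual Information score named in the robustness checks above is commonly defined as follows. This is a sketch of the standard definition, not a formula taken from the paper: the arithmetic-mean normalization in the denominator is an assumption (other normalizations, such as the maximum of the two entropies, are also in use), and the symbols U, V, I, and H are introduced here for illustration.

```latex
\mathrm{AMI}(U, V) =
  \frac{I(U; V) - \mathbb{E}\left[ I(U; V) \right]}
       {\tfrac{1}{2}\bigl( H(U) + H(V) \bigr) - \mathbb{E}\left[ I(U; V) \right]}
```

Here U and V are two clusterings of the same genes (for example, from two repeated runs), I(U; V) is their mutual information, H is the entropy of a clustering, and the expectation is taken over random clusterings with the same cluster sizes. Identical clusterings score 1, and agreement at chance level scores about 0, which is what makes the score suitable for comparing runs of a stochastic pipeline.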
CS-CORE Method: Uses the CS-CORE statistical method to compute co-expression matrices, which corrects for sequencing depth differences and is more accurate than Pearson correlation coefficients
Graph Construction:
Nodes: 81 genes
Edges: An undirected weighted edge is added between two genes when the absolute value of their CS-CORE co-expression exceeds a threshold τ
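The graph construction rule above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `build_coexpression_graph`, the placeholder gene names, and the toy matrix are hypothetical, and using the absolute co-expression value as the edge weight is an assumption, since the summary does not say how the weights are set.

```python
from itertools import combinations


def build_coexpression_graph(genes, coexpr, tau):
    """Return undirected weighted edges as {(gene_a, gene_b): weight}.

    An edge is kept only when |co-expression| > tau. The weight is the
    absolute co-expression value (an assumption made for this sketch).
    """
    edges = {}
    # combinations() visits each unordered pair once, so the graph is undirected
    for i, j in combinations(range(len(genes)), 2):
        weight = abs(coexpr[i][j])
        if weight > tau:
            edges[(genes[i], genes[j])] = weight
    return edges


# Toy symmetric co-expression matrix for three placeholder genes
genes = ["GENE_A", "GENE_B", "GENE_C"]
coexpr = [
    [1.0, 0.8, -0.1],
    [0.8, 1.0, -0.6],
    [-0.1, -0.6, 1.0],
]

edges = build_coexpression_graph(genes, coexpr, tau=0.5)
for (a, b), w in edges.items():
    print(f"{a} -- {b}  weight={w}")
```

In this toy example the strongly negative pair is kept because the rule thresholds the absolute value, while the weak pair is dropped. In the pipeline described here, τ would be one of the jointly optimized hyperparameters rather than a fixed constant.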
Use synthetic datasets with known ground truth to evaluate the method more rigorously and to independently validate its ability to recover the underlying cluster structure
Joint Optimization Strategy: First to implement end-to-end joint optimization in a gene clustering pipeline, overcoming the limitations of traditional step-wise optimization
Technical Integration: Skillfully combines CS-CORE, Node2Vec+, and spectral clustering, with each component having sufficient theoretical support
Optimization Algorithm Choice: TPE, a Bayesian optimization method, is more efficient than grid search because it concentrates trials in promising regions of the hyperparameter space
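The efficiency argument for TPE can be made concrete with its standard formulation. The block below sketches the general algorithm rather than anything specific to this paper; treating the negative DBCVI as the loss y is an assumption made here so that minimizing y corresponds to maximizing the clustering metric.

```latex
p(x \mid y) =
\begin{cases}
  \ell(x), & y < y^{*} \\
  g(x),    & y \ge y^{*}
\end{cases}
\qquad
\mathrm{EI}_{y^{*}}(x) \propto
  \left( \gamma + \frac{g(x)}{\ell(x)} \, (1 - \gamma) \right)^{-1}
```

Here x is a hyperparameter configuration, y* is a threshold chosen so that a fraction γ of the observed losses fall below it, ℓ(x) is a density fitted to the better-performing configurations, and g(x) is a density fitted to the rest. Expected improvement is largest where ℓ(x)/g(x) is largest, so each new trial is drawn from regions that resemble past good configurations, whereas grid search spends the same budget on every cell of the grid.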
This is a bioinformatics paper with strong methodological innovation and reasonable experimental design. The greatest highlight is the joint optimization strategy, which breaks through the limitations of traditional step-wise optimization and provides a new paradigm for gene clustering pipeline design. Robustness verification is thorough, statistical significance is evident, and practical value is high.
Main weaknesses include: (1) lack of direct comparison with other methods; (2) validation on only a single dataset; (3) absence of biological function analysis. Future work should validate on multiple datasets and conduct systematic comparison with traditional methods (e.g., WGCNA), while adding functional annotation and experimental validation of gene clusters.
Overall, this is a high-quality computational biology paper with important reference value for RNA-Seq data analysis and disease gene research. Recommendation Score: 8.5/10