Decoding Positive Selection in Mycobacterium tuberculosis with Phylogeny-Guided Graph Attention Models
Wang, Campino, Clark et al.
Positive selection drives the emergence of adaptive mutations in Mycobacterium tuberculosis, shaping drug resistance, transmissibility, and virulence. Phylogenetic trees capture evolutionary relationships among isolates and provide a natural framework for detecting such adaptive signals. We present a phylogeny-guided graph attention network (GAT) approach, introducing a method for converting SNP-annotated phylogenetic trees into graph structures suitable for neural network analysis. Using 500 M. tuberculosis isolates from four major lineages and 249 single-nucleotide variants (84 resistance-associated and 165 neutral) across 61 drug-resistance genes, we constructed graphs where nodes represented isolates and edges reflected phylogenetic distances. Edges between isolates separated by more than seven internal nodes were pruned to emphasise local evolutionary structure. Node features encoded SNP presence or absence, and the GAT architecture included two attention layers, a residual connection, global attention pooling, and a multilayer perceptron classifier. The model achieved an accuracy of 0.88 on a held-out test set and, when applied to 146 WHO-classified "uncertain" variants, identified 41 candidates with convergent emergence across multiple lineages, consistent with adaptive evolution. This work demonstrates the feasibility of transforming phylogenies into GNN-compatible structures and highlights attention-based models as effective tools for detecting positive selection, aiding genomic surveillance and variant prioritisation.
This study proposes a phylogeny-guided graph attention network (GAT) approach for detecting positive selection signals in Mycobacterium tuberculosis. The method converts SNP-annotated phylogenetic trees into graph structures suitable for neural network analysis. Using 500 M. tuberculosis isolates and 249 single-nucleotide variants, the model reaches an accuracy of 0.88 on a held-out test set and, among 146 WHO-classified "uncertain" variants, flags 41 candidates whose convergent emergence across lineages is consistent with adaptive evolution.
Tuberculosis (TB) remains one of the leading infectious causes of death worldwide, with 1.09 million deaths in 2024. Drug resistance exacerbates the epidemic: 400,000 newly diagnosed TB cases were resistant to at least rifampicin, a first-line drug. Positive selection is a key driver of M. tuberculosis evolution, promoting the emergence of adaptive mutations that affect drug resistance, transmissibility, and virulence.
Clinical Relevance: Identifying positively selected mutations is crucial for understanding resistance mechanisms and guiding treatment strategies
Evolutionary Biology Value: The strictly clonal population structure and lack of recombination in M. tuberculosis make it an ideal model for studying adaptive evolution
Public Health Need: Genomic surveillance requires rapid and accurate identification of variants with adaptive advantages
Methodological Innovation: Proposes a method for converting phylogenetic trees into graph neural network-compatible structures, presented as the first of its kind
Architecture Design: Develops a graph attention network architecture that integrates edge length information and simultaneously processes topological structure and mutation patterns
Practical Application: Identifies 41 candidate adaptive variants, among 146 variants that the WHO classifies as "uncertain", that emerge convergently across multiple lineages
Tool Development: Provides complete open-source code and data processing pipeline
Input: SNP-annotated phylogenetic tree, where nodes represent M. tuberculosis isolates and edges reflect phylogenetic distances
Output: Binary classification prediction determining whether a specific SNP is under positive selection
Constraints: Maintain integrity of phylogenetic relationships while adapting to graph neural network input requirements
Graph Construction: Converts the phylogenetic tree into an undirected graph in which nodes represent isolates and each edge is weighted by the number of internal tree nodes separating the two isolates
Edge Pruning: Removes edges between isolates separated by more than 7 internal nodes, emphasizing local evolutionary structure
Node Features: Uses binary indicators to encode SNP presence/absence status
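The three construction steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the tree is given as a hypothetical child-to-parent dictionary, the distance between two isolates is taken to be the number of internal nodes on the path joining them (including their lowest common ancestor), and the names `tree_to_graph`, `internal_nodes_between`, and `max_internal` are invented for this example.

```python
from itertools import combinations


def ancestors(node, parent):
    """Internal nodes on the path from `node` up to the root, nearest first."""
    chain = []
    while node in parent:
        node = parent[node]
        chain.append(node)
    return chain


def internal_nodes_between(a, b, parent):
    """Count internal nodes on the tree path joining leaves `a` and `b`."""
    up_a = ancestors(a, parent)
    steps_from_b = {n: i for i, n in enumerate(ancestors(b, parent))}
    for steps_from_a, n in enumerate(up_a):
        if n in steps_from_b:  # first shared ancestor = lowest common ancestor
            # nodes strictly below the LCA on each side, plus the LCA itself
            return steps_from_a + steps_from_b[n] + 1
    raise ValueError("leaves do not share a root")


def tree_to_graph(leaves, parent, snp_carriers, max_internal=7):
    """Build binary node features and a pruned, weighted, undirected edge set."""
    # Node feature: 1 if the isolate carries the SNP, else 0.
    features = {leaf: int(leaf in snp_carriers) for leaf in leaves}
    edges = {}
    for a, b in combinations(sorted(leaves), 2):
        d = internal_nodes_between(a, b, parent)
        if d <= max_internal:  # prune pairs separated by too many internal nodes
            edges[(a, b)] = d
    return features, edges


# Toy tree (assumed for illustration): root R has children N1 and N2;
# N1 has leaves A and B, N2 has leaves C and D.
toy_parent = {"A": "N1", "B": "N1", "C": "N2", "D": "N2", "N1": "R", "N2": "R"}
toy_features, toy_edges = tree_to_graph(
    leaves=["A", "B", "C", "D"],
    parent=toy_parent,
    snp_carriers={"A", "C"},
    max_internal=2,  # small cutoff so pruning is visible on a toy tree
)
```

On the toy tree, only the sibling pairs A-B and C-D survive pruning, because every cross-clade pair is separated by three internal nodes. With the cutoff of 7 described in the summary, the same function would keep all pairs in a tree this small.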
Stage 1: Dual-layer Graph Attention Network
- First layer: 8 attention heads, 32 output features per head
- Second layer: Single attention head, 256-dimensional output
- Residual connections: Linking outputs of both layers
Stage 2: Global Pooling and Classification
- Global attention pooling
- Multi-layer perceptron classifier (256→32→2)
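To make the attention step concrete, the sketch below implements a single attention head in plain Python, following the standard GAT formulation (scores passed through a LeakyReLU and normalised with a softmax over each node's neighbourhood, self included). It is an assumption-laden illustration rather than the paper's model: it ignores edge weights, uses hand-picked toy parameters instead of learned ones, and the names `gat_head`, `a_src`, and `a_dst` are hypothetical. The final lines check one inference from the listed layer sizes: if the eight first-layer heads are concatenated, their combined width equals the second layer's output width, which is what a residual addition between the two layers would require.

```python
import math


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def gat_head(features, neighbors, W, a_src, a_dst):
    """One attention head: out_i = sum_j alpha_ij * (W h_j), j in N(i) plus i."""
    # Linear projection of every node's feature vector.
    z = {n: [dot(row, x) for row in W] for n, x in features.items()}
    out, alphas = {}, {}
    for i in features:
        hood = [i] + [j for j in neighbors.get(i, []) if j != i]
        # Unnormalised score e_ij = LeakyReLU(a_src . z_i + a_dst . z_j).
        scores = [leaky_relu(dot(a_src, z[i]) + dot(a_dst, z[j])) for j in hood]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        alpha = [w / total for w in weights]
        alphas[i] = dict(zip(hood, alpha))
        # Attention-weighted sum of the projected neighbour features.
        out[i] = [sum(a * z[j][k] for a, j in zip(alpha, hood)) for k in range(len(W))]
    return out, alphas


# Toy graph (assumed): a path A - B - C where A and C carry the SNP.
features = {"A": [1.0], "B": [0.0], "C": [1.0]}
neighbors = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
W = [[1.0], [0.5]]  # projects the 1-d SNP indicator to 2 features
out, alphas = gat_head(features, neighbors, W, a_src=[0.0, 0.0], a_dst=[1.0, 0.0])

# Width check for the residual connection, assuming head concatenation.
heads, per_head, second_layer = 8, 32, 256
widths_match = heads * per_head == second_layer
```

With these toy parameters, node B attends more strongly to its SNP-carrying neighbours A and C than to itself, which illustrates how attention can weight phylogenetic neighbours by their mutation status. In practice a library implementation with learned parameters would replace this hand-rolled head.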
The paper does not compare the approach directly with traditional methods for detecting selection; instead, it assesses the model on a held-out test set and checks its predictions for consistency with WHO variant classifications.
The paper cites 26 references spanning TB epidemiology, phylogenetic analysis, and graph neural networks.
Overall Assessment: This interdisciplinary paper applies deep learning to the evolutionary genomics of an infectious disease and offers a new technical approach to TB drug-resistance surveillance. Its limitations, notably the absence of direct comparisons with traditional methods, do not negate its methodological contribution and practical value.