Decoding Positive Selection in Mycobacterium tuberculosis with Phylogeny-Guided Graph Attention Models

Wang, Campino, Clark et al.
Positive selection drives the emergence of adaptive mutations in Mycobacterium tuberculosis, shaping drug resistance, transmissibility, and virulence. Phylogenetic trees capture evolutionary relationships among isolates and provide a natural framework for detecting such adaptive signals. We present a phylogeny-guided graph attention network (GAT) approach, introducing a method for converting SNP-annotated phylogenetic trees into graph structures suitable for neural network analysis. Using 500 M. tuberculosis isolates from four major lineages and 249 single-nucleotide variants (84 resistance-associated and 165 neutral) across 61 drug-resistance genes, we constructed graphs where nodes represented isolates and edges reflected phylogenetic distances. Edges between isolates separated by more than seven internal nodes were pruned to emphasise local evolutionary structure. Node features encoded SNP presence or absence, and the GAT architecture included two attention layers, a residual connection, global attention pooling, and a multilayer perceptron classifier. The model achieved an accuracy of 0.88 on a held-out test set and, when applied to 146 WHO-classified "uncertain" variants, identified 41 candidates with convergent emergence across multiple lineages, consistent with adaptive evolution. This work demonstrates the feasibility of transforming phylogenies into GNN-compatible structures and highlights attention-based models as effective tools for detecting positive selection, aiding genomic surveillance and variant prioritisation.

Basic Information

  • Paper ID: 2510.08703
  • Title: Decoding Positive Selection in Mycobacterium tuberculosis with Phylogeny-Guided Graph Attention Models
  • Authors: Linfeng Wang, Susana Campino, Taane G. Clark, Jody E. Phelan
  • Classification: q-bio.PE (Populations and Evolution), cs.LG (Machine Learning)
  • Institution: London School of Hygiene & Tropical Medicine
  • Paper Link: https://arxiv.org/abs/2510.08703

Abstract

This study proposes a phylogeny-guided graph attention network (GAT) approach for detecting positive selection signals in Mycobacterium tuberculosis. The method converts SNP-annotated phylogenetic trees into graph structures suitable for neural network analysis. Trained on data from 500 M. tuberculosis isolates and 249 single-nucleotide variants, it reaches an accuracy of 0.88 on a held-out test set. Applied to 146 WHO-classified "uncertain" variants, it identifies 41 candidates whose convergent emergence across multiple lineages is consistent with adaptive evolution.

Research Background and Motivation

Problem Definition

Tuberculosis (TB) remains one of the leading infectious disease causes of death globally, causing 1.09 million deaths in 2024. The development of drug resistance exacerbates this epidemic, with 400,000 newly diagnosed TB cases showing resistance to at least rifampicin, a first-line drug. Positive selection is a key driver of M. tuberculosis evolution, promoting the emergence of adaptive mutations that affect drug resistance, transmissibility, and virulence.

Research Significance

  1. Clinical Relevance: Identifying positively selected mutations is crucial for understanding resistance mechanisms and guiding treatment strategies
  2. Evolutionary Biology Value: The strictly clonal population structure and lack of recombination in M. tuberculosis make it an ideal model for studying adaptive evolution
  3. Public Health Need: Genomic surveillance requires rapid and accurate identification of variants with adaptive advantages

Limitations of Existing Methods

  1. Traditional Phylogenetic Analysis: Relies on manual interpretation and struggles with large-scale data
  2. Standard GNN Methods: Cannot effectively integrate phylogenetic information and mutation patterns
  3. Existing Classification Approaches: Lack consideration of evolutionary context, potentially missing important adaptive signals

Core Contributions

  1. Methodological Innovation: First to propose a method for converting phylogenetic trees into graph neural network-compatible structures
  2. Architecture Design: Develops a graph attention network architecture that integrates edge length information and simultaneously processes topological structure and mutation patterns
  3. Practical Application: Identifies 41 candidate adaptive variants with convergent appearance patterns in WHO "uncertain" variant classifications
  4. Tool Development: Provides complete open-source code and data processing pipeline

Methodology Details

Task Definition

Input: SNP-annotated phylogenetic tree, where nodes represent M. tuberculosis isolates and edges reflect phylogenetic distances

Output: Binary classification prediction determining whether a specific SNP is under positive selection

Constraints: Maintain integrity of phylogenetic relationships while adapting to graph neural network input requirements

Model Architecture

Data Structure Transformation

  1. Graph Construction: Converts the phylogenetic tree into an undirected graph in which nodes represent isolates and each edge weight is the number of internal tree nodes separating the two isolates
  2. Edge Pruning: Removes edges between samples separated by more than 7 internal nodes, emphasizing local evolutionary structure
  3. Node Features: Uses binary indicators to encode SNP presence/absence status
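
The three steps above can be illustrated with a minimal sketch. It is not the authors' pipeline: the toy tree, the `parent` dictionary representation, and the function names are assumptions made for illustration, and it assumes that every pair of isolates within the threshold is connected.

```python
from itertools import combinations

def ancestors(node, parent):
    """Return the ancestors of `node`, nearest first, ending at the root."""
    path = []
    while parent[node] is not None:
        node = parent[node]
        path.append(node)
    return path

def internal_nodes_between(a, b, parent):
    """Count the internal nodes on the tree path between leaves a and b."""
    anc_a = ancestors(a, parent)
    anc_b = ancestors(b, parent)
    in_b = set(anc_b)
    for i, node in enumerate(anc_a):
        if node in in_b:  # first shared ancestor = lowest common ancestor
            j = anc_b.index(node)
            return i + j + 1  # nodes below the LCA on each side, plus the LCA
    raise ValueError("leaves do not share a root")

def tree_to_graph(parent, leaves, snp_carriers, max_internal=7):
    """Build binary node features and distance-weighted, pruned edges."""
    features = {leaf: int(leaf in snp_carriers) for leaf in leaves}
    edges = {}
    for a, b in combinations(sorted(leaves), 2):
        distance = internal_nodes_between(a, b, parent)
        if distance <= max_internal:  # prune pairs that are too far apart
            edges[(a, b)] = distance
    return features, edges

# Hypothetical toy tree: R is the root, N1..N3 are internal nodes, A..E are isolates.
parent = {"R": None, "N1": "R", "N2": "R", "A": "N1", "B": "N1",
          "C": "N2", "N3": "N2", "D": "N3", "E": "N3"}
leaves = ["A", "B", "C", "D", "E"]

# A threshold of 3 is used here only so that pruning is visible on a tiny tree.
features, edges = tree_to_graph(parent, leaves, snp_carriers={"A", "D"}, max_internal=3)
```

On this toy tree the pair A and D is separated by four internal nodes (N1, R, N2, N3), so it is dropped at a threshold of 3, while A and B (one internal node) stay connected. With the threshold of 7 described in the summary, the same function would keep all pairs of this small example.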

GAT Architecture Design

Stage 1: Dual-layer Graph Attention Network
- First layer: 8 attention heads, 32 output features per head
- Second layer: Single attention head, 256-dimensional output
- Residual connections: Linking outputs of both layers

Stage 2: Global Pooling and Classification
- Global attention pooling
- Multi-layer perceptron classifier (256→32→2)
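
As a sanity check on the dimensions listed above, the following sketch walks the tensor shapes through both stages. It assumes that the eight heads of the first layer are concatenated (8 x 32 = 256), that the residual connection is an element-wise sum, and that each node carries a single binary SNP feature. None of these details is confirmed by this summary, and `gat_shape_walk` is a hypothetical helper, not part of the authors' code.

```python
def gat_shape_walk(num_nodes, in_features=1):
    """Return (stage, shape) pairs for one graph passing through the model."""
    heads1, per_head = 8, 32
    layer1_dim = heads1 * per_head   # assumed: head outputs are concatenated
    layer2_dim = 256                 # single head, 256-dimensional output
    if layer1_dim != layer2_dim:
        # An element-wise residual sum only works if both widths agree.
        raise ValueError("residual sum needs matching layer widths")
    mlp = [256, 32, 2]
    return [
        ("node features", (num_nodes, in_features)),
        ("GAT layer 1, 8 heads x 32", (num_nodes, layer1_dim)),
        ("GAT layer 2, 1 head", (num_nodes, layer2_dim)),
        ("residual sum of both layers", (num_nodes, layer2_dim)),
        ("global attention pooling", (1, mlp[0])),
        ("MLP hidden layer", (1, mlp[1])),
        ("class logits", (1, mlp[2])),
    ]

# One graph with 500 isolates as nodes.
for stage, shape in gat_shape_walk(num_nodes=500):
    print(f"{stage:32s} {shape}")
```

Under these assumptions the layer widths line up without any projection: the concatenated first layer and the second layer are both 256-dimensional, which is what lets the two outputs be linked directly.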

Attention Mechanism

The key innovation lies in edge-aware attention computation:

$$h_i^{(l+1)} = \sigma\left(\sum_{j \in N(i)} \alpha_{ij} W h_j^{(l)}\right)$$

where the attention weights $\alpha_{ij}$ simultaneously consider node features and edge length information:

$$\alpha_{ij} = \text{softmax}\left(\sigma\left(\mathbf{a}^T [W h_i \| W h_j] + b \cdot \text{edge}_{ij}\right)\right)$$
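
A minimal, dependency-free sketch of this computation for a single node is given below. It assumes that the inner nonlinearity is a LeakyReLU (as in the standard GAT formulation) and that the softmax runs over the neighbours of node i. The function names and all numeric values are illustrative rather than taken from the paper, and the outer activation of the update rule is left out.

```python
import math

def leaky_relu(x, slope=0.2):
    """Assumed inner nonlinearity; the summary only writes sigma."""
    return x if x > 0 else slope * x

def edge_aware_attention(wh_i, wh_neighbors, edge_lengths, a, b):
    """Attention weights alpha_ij over the neighbours j of node i.

    wh_i          transformed features W h_i (list of floats)
    wh_neighbors  dict: neighbour id -> W h_j
    edge_lengths  dict: neighbour id -> edge_ij
    a             attention vector of length 2 * len(wh_i)
    b             scalar weight on the edge length
    """
    scores = {}
    for j, wh_j in wh_neighbors.items():
        concat = wh_i + wh_j  # [W h_i || W h_j]
        raw = sum(ak * xk for ak, xk in zip(a, concat)) + b * edge_lengths[j]
        scores[j] = leaky_relu(raw)
    top = max(scores.values())  # subtract the maximum for numerical stability
    exps = {j: math.exp(s - top) for j, s in scores.items()}
    total = sum(exps.values())
    return {j: e / total for j, e in exps.items()}

def aggregate(wh_neighbors, alpha):
    """Attention-weighted sum of neighbour features (before the outer activation)."""
    dim = len(next(iter(wh_neighbors.values())))
    return [sum(alpha[j] * wh_neighbors[j][k] for j in wh_neighbors)
            for k in range(dim)]

# Two neighbours with identical features but different edge lengths.
wh_i = [1.0, 0.0]
neighbors = {"near": [1.0, 0.0], "far": [1.0, 0.0]}
lengths = {"near": 1, "far": 5}
alpha = edge_aware_attention(wh_i, neighbors, lengths, a=[0.1, 0.1, 0.1, 0.1], b=-0.5)
updated = aggregate(neighbors, alpha)
```

With a negative `b`, the closer neighbour receives the larger weight even though both neighbours carry the same features. In a trained model `b` is learned, so its sign and size are not known from this summary.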

Technical Innovations

  1. Phylogeny-Aware Design: First to introduce internal node counts as edge weights in graph neural networks
  2. Adaptive Pruning: Preserves local neighborhood structure through distance thresholds, reducing noise
  3. Multi-Scale Attention: Combines node-level and edge-level information in attention mechanisms
  4. Residual Design: Ensures training stability in deeper networks

Experimental Setup

Dataset

  • Sample Scale: 500 clinical M. tuberculosis samples
  • Lineage Coverage: Four major lineages (L1-L4) with distribution L1:8, L2:175, L3:109, L4:223
  • Variant Data: 249 SNP variants spanning 61 drug resistance genes
  • Label Distribution: 84 WHO-confirmed drug resistance-related mutations, 165 neutral variants

Data Processing Pipeline

  1. Sequence Processing: Quality control and alignment using Trimmomatic and BWA-mem
  2. Variant Calling: BCF/VCF toolkit with >10-fold coverage
  3. Phylogenetic Reconstruction: Maximum likelihood tree construction using RAxML
  4. Data Splitting: Training set 149, validation set 50, test set 50

Evaluation Metrics

  • Accuracy: 0.88
  • AUC: 0.89
  • F1 Score: 0.81
  • Sensitivity: 0.76
  • Specificity: 0.94
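
To show how these figures relate to one another, the sketch below computes them from a confusion matrix. The counts are hypothetical: they are one combination for a 50-item test set that reproduces the reported accuracy, F1 score, sensitivity, and specificity after rounding, and they are not taken from the paper. AUC depends on the ranking of predicted scores and cannot be recovered from counts alone.

```python
def classification_metrics(tp, fn, tn, fp):
    """Standard binary classification metrics from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": tp / (tp + fn),        # recall on the positive class
        "specificity": tn / (tn + fp),        # recall on the negative class
        "f1": 2 * tp / (2 * tp + fp + fn),    # harmonic mean of precision and recall
    }

# Hypothetical counts: 17 positives and 33 negatives among 50 test variants.
example = classification_metrics(tp=13, fn=4, tn=31, fp=2)
for name, value in example.items():
    print(f"{name}: {value:.2f}")
```

These counts imply roughly one third positives in the test set, which is in line with the overall label distribution of 84 resistance-associated variants out of 249.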

Comparative Analysis

Although the paper does not provide direct comparisons with traditional methods, it validates the approach's effectiveness through consistency verification with WHO classifications.

Experimental Results

Main Results

On the held-out test set of 50 variants:

  • Overall Performance: Accuracy of 0.88, demonstrating good generalization capability
  • Class Balance: High specificity (0.94) and moderate sensitivity (0.76), suitable for screening applications
  • Biological Plausibility: The model almost never predicts synonymous mutations as positively selected, consistent with functional expectations

Attention Analysis

Through Top-k Attention Mass (TAM) analysis, which measures the share of total attention carried by the highest-weighted edges:

  • Attention Concentration: Top 10% of edges capture 44.1% of total attention
  • Biological Significance: High-attention edges primarily connect central nodes with rich mutation diversity
  • Structural Understanding: Model successfully identifies and focuses on evolutionarily important graph regions
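
The concentration figure above can be computed along the following lines. This is a sketch under an assumed definition (the summed attention of the top fraction of edges divided by the total attention); the paper's exact formulation may differ, and `top_k_attention_mass` is a hypothetical name.

```python
import math

def top_k_attention_mass(weights, fraction=0.10):
    """Share of total attention carried by the top `fraction` of edges."""
    if not weights:
        raise ValueError("need at least one attention weight")
    k = max(1, math.ceil(fraction * len(weights)))  # always keep at least one edge
    ranked = sorted(weights, reverse=True)
    return sum(ranked[:k]) / sum(weights)

# Uniform attention: the top 10% of edges hold exactly 10% of the mass.
uniform = top_k_attention_mass([1.0] * 10)

# One dominant edge: the top 10% of edges hold 5 / 14 of the mass.
skewed = top_k_attention_mass([5.0] + [1.0] * 9)
```

Under this definition a value of 44.1% for the top 10% of edges sits well above the 10% expected from uniform attention, which is the sense in which attention is described as concentrated.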

Practical Application Validation

Among 146 WHO "uncertain" variants:

  • Prediction Results: 27 (18.5%) predicted as under positive selection
  • Convergent Patterns: 41 candidate variants show convergent appearance across multiple lineages
  • Functional Relevance: Successfully identifies known resistance mutations and compensatory mutations

Key Findings

  1. embA c.-43G>C: Appears in 43 sub-lineages with 47.48% frequency in MDR+ strains
  2. rpoC Series Mutations: Multiple compensatory mutations successfully identified
  3. ubiA Variants: Novel candidate mutations related to ethambutol resistance

Related Work

Traditional Phylogenetic Methods

  • dN/dS Ratio Analysis: Classical approach for detecting selection pressure
  • Phylogenetic Convergence Analysis: Manual identification of independent origin events
  • Molecular Clock Analysis: Estimation of mutation occurrence timing

Graph Neural Network Applications

  • Biological Network Analysis: GNN applications in protein interaction networks
  • Phylogenetic Inference: Deep learning-based tree reconstruction methods
  • Genomic Analysis: Sequence classification and functional prediction

Advantages of This Work

  1. Originality: First systematic conversion of phylogenetic trees to GNN inputs
  2. Integration: Simultaneously considers topological and feature information
  3. Practicality: Direct application to actual drug resistance surveillance needs

Conclusions and Discussion

Main Conclusions

  1. Technical Feasibility: Demonstrates that phylogenetic trees can be converted into graph structures compatible with graph neural networks
  2. Predictive Capability: GAT model effectively identifies positive selection signals
  3. Application Value: Discovers multiple valuable candidates in WHO uncertain variant classifications

Limitations

  1. Sample Scale: Relatively small dataset (249 variants) may limit model generalization
  2. Label Noise: Using drug resistance as a proxy for positive selection may introduce classification errors
  3. Method Dependency: Requires high-quality phylogenetic trees as input
  4. Computational Complexity: Processing efficiency for large-scale datasets requires verification

Future Directions

  1. Extended Applications: Application to adaptive evolution research in other pathogens
  2. Method Improvements: Development of graph-independent learning architectures
  3. Multimodal Integration: Combining phenotypic and genotypic data
  4. Real-Time Monitoring: Construction of online drug resistance surveillance systems

In-Depth Evaluation

Strengths

  1. Strong Innovation: First systematic integration of phylogenetic information into deep learning framework
  2. Reasonable Methodology: Edge pruning strategy and attention mechanism design align with biological intuition
  3. Practical Value: Directly serves actual TB drug resistance surveillance needs
  4. Open-Source Contribution: Provides complete code and data, promoting field development

Weaknesses

  1. Insufficient Comparison: Lacks quantitative comparison with traditional phylogenetic methods
  2. Limited Validation: Experimental validation of predictions requires further research
  3. Unknown Generalization: Applicability to other pathogens remains unverified
  4. Theoretical Foundation: Lacks theoretical analysis of why GAT is particularly suitable for this task

Impact

  1. Methodological Contribution: Provides new analytical tools for phylogenetic genomics
  2. Application Prospects: Broad application potential in infectious disease surveillance and evolutionary biology
  3. Interdisciplinary Value: Bridges evolutionary biology, machine learning, and public health fields

Applicable Scenarios

  1. Pathogen Surveillance: Real-time identification of emerging drug resistance mutations
  2. Evolutionary Research: Large-scale adaptive evolution signal detection
  3. Drug Development: Prediction of potential resistance targets
  4. Epidemiology: Tracking transmission patterns of resistant strains

References

The paper cites 26 important references covering multiple domains including TB epidemiology, phylogenetic analysis, and graph neural networks, providing a solid theoretical foundation for the research.


Overall Assessment: This is an innovative interdisciplinary research paper that successfully applies deep learning techniques to infectious disease evolutionary genomics, providing new technical approaches for TB drug resistance surveillance. Despite certain limitations, its methodological contributions and practical application value are noteworthy.