
Inhomogeneous continuous-time Markov chains to infer flexible time-varying evolutionary rates

Datta, Lemey, Suchard
Reconstructing evolutionary histories and estimating the rate of evolution from molecular sequence data is of central importance in evolutionary biology and infectious disease research. We introduce a flexible Bayesian phylogenetic inference framework that accommodates changing evolutionary rates over time by modeling sequence character substitution processes as inhomogeneous continuous-time Markov chains (ICTMCs) acting along the unknown phylogeny, where the rate remains as an unknown, positive and integrable function of time. The integral of the rate function appears in the finite-time transition probabilities of the ICTMCs that must be efficiently computed for all branches of the phylogeny to evaluate the observed data likelihood. Circumventing computational challenges that arise from a fully nonparametric function, we successfully parameterize the rate function as piecewise constant with a large number of epochs that we call the polyepoch clock model. This makes the transition probability computation relatively inexpensive and continues to flexibly capture rate change over time. We employ a Gaussian Markov random field prior to achieve temporal smoothing of the estimated rate function. Hamiltonian Monte Carlo sampling enabled by scalable gradient evaluation under this model makes our framework computationally efficient. We assess the performance of the polyepoch clock model in recovering the true timescales and rates through simulations under two different evolutionary scenarios. We then apply the polyepoch clock model to examine the rates of West Nile virus, Dengue virus and influenza A/H3N2 evolution, and estimate the time-varying rate of SARS-CoV-2 spread in Europe in 2020.

Basic Information

  • Paper ID: 2510.11982
  • Title: Inhomogeneous continuous-time Markov chains to infer flexible time-varying evolutionary rates
  • Authors: Pratyusa Datta (UCLA), Philippe Lemey (KU Leuven), Marc A. Suchard (UCLA)
  • Classification: stat.ME (Statistics - Methodology), q-bio.PE (Quantitative Biology - Populations and Evolution)
  • Publication Date: October 13, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2510.11982

Abstract

This paper proposes a flexible Bayesian phylogenetic inference framework that accommodates time-varying evolutionary rates by modeling sequence character substitution processes as inhomogeneous continuous-time Markov chains (ICTMCs). The method parameterizes the evolutionary rate as a piecewise constant function over a large number of epochs (the polyepoch clock model), which keeps transition probability calculations relatively inexpensive while still capturing rate variation flexibly. Temporal smoothing of the estimated rate function is achieved through a Gaussian Markov random field prior, and Hamiltonian Monte Carlo sampling with scalable gradient evaluation makes the framework computationally efficient.

Research Background and Motivation

Problem Definition

The central problem in phylogenetics is reconstructing evolutionary history from molecular sequence data and estimating evolutionary rates. Traditional methods assume evolutionary rates remain constant over time, but this assumption does not hold for rapidly evolving organisms such as viruses.

Significance

  1. Evolutionary biology relevance: Accurate estimation of time-varying evolutionary rates is crucial for understanding mechanisms of biological diversification
  2. Infectious disease research value: Viral genome sequences accumulate significant genetic changes over short timescales, requiring real-time analytical capabilities
  3. Timescale dependence: Research demonstrates that viral evolutionary rate estimates are heavily dependent on the sampling time framework

Limitations of Existing Methods

  1. Homogeneous CTMC assumption: Traditional methods assume substitution processes on branches follow homogeneous continuous-time Markov chains
  2. Fixed rate variation patterns: Existing relaxed clock models make fixed assumptions about rate variation patterns
  3. Computational complexity: Fully nonparametric functional approaches face computational challenges

Research Motivation

Develop a flexible framework that models the evolutionary rate directly as a function of time, overcoming the limitations of the homogeneous CTMC assumption and providing more accurate rate estimates for rapidly evolving viruses and similar organisms.

Core Contributions

  1. Theoretical innovation: First systematic introduction of inhomogeneous continuous-time Markov chains (ICTMCs) to phylogenetic inference
  2. Methodological breakthrough: Proposes the polyepoch clock model, which parameterizes the rate function as piecewise constant over a large number of epochs
  3. Computational optimization: Develops linear-time-complexity gradient evaluation algorithm combined with HMC for efficient sampling
  4. Prior design: Employs appropriate Gaussian Markov random field priors to ensure propriety of posterior distributions
  5. Empirical validation: Validates method effectiveness on multiple viral datasets, including SARS-CoV-2 transmission analysis

Methodology Details

Task Definition

Input: $N$ aligned molecular sequences with sampling time information
Output: Phylogenetic tree, time-varying evolutionary rate trajectory, divergence time estimates
Constraints: Rate function must be positive and integrable

Model Architecture

1. ICTMC Foundational Framework

For an inhomogeneous CTMC, the infinitesimal generator matrix is a function of time: $Q(t) = f(t)Q$, where:

  • $Q$: Time-independent base infinitesimal generator matrix
  • $f(t)$: Unknown positive integrable rate function

Finite-time transition probability matrix: $P(t_0, t) = \exp\left[\int_{t_0}^{t} f(\tau)\,d\tau \cdot Q\right]$
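
The transition probability formula above can be sketched numerically. The example below is only an illustration under stated assumptions: it uses a Jukes-Cantor generator for $Q$ (symmetric, so an eigendecomposition gives the matrix exponential), a hypothetical rate function `example_rate`, and a midpoint rule for the integral. None of these choices come from the paper.

```python
import numpy as np

def jc69_generator():
    """Jukes-Cantor generator scaled to one expected substitution per unit time
    (an assumed stand-in for the base generator Q)."""
    Q = np.full((4, 4), 1.0 / 3.0)
    np.fill_diagonal(Q, -1.0)
    return Q

def transition_matrix(Q, integrated_rate):
    """P = exp(integrated_rate * Q), computed by eigendecomposition.
    Valid here because the assumed Q is symmetric."""
    eigvals, eigvecs = np.linalg.eigh(Q)
    return (eigvecs * np.exp(integrated_rate * eigvals)) @ eigvecs.T

def integrate_rate(f, t0, t, n=10_000):
    """Midpoint-rule approximation of the integral of f over [t0, t]."""
    edges = np.linspace(t0, t, n + 1)
    mids = 0.5 * (edges[:-1] + edges[1:])
    return float(np.sum(f(mids)) * (t - t0) / n)

def example_rate(t):
    """Hypothetical positive, integrable rate function f(t)."""
    return 0.1 * np.exp(-0.5 * t)

Q = jc69_generator()
b = integrate_rate(example_rate, 0.0, 2.0)   # integral of f over [0, 2]
P = transition_matrix(Q, b)                  # P(0, 2)
```

Because only the integral of $f$ enters the exponent, two consecutive intervals compose by adding their integrated rates, which is why a cheap-to-integrate parameterization of $f$ matters.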

2. Polyepoch Clock Model

Parameterizes the rate function as piecewise constant: $f(t) = \theta_m$ for $w_m \leq t < w_{m-1}$, $m = 1,\ldots,M+1$,

where $w_M < \cdots < w_1$ are the $M$ time grid points, $w_0 = +\infty$ and $w_{M+1} = -\infty$ by convention so that the two outermost epochs are unbounded, and $\theta = (\theta_1,\ldots,\theta_{M+1})$ is the rate parameter vector with one rate per epoch.
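
A minimal sketch of evaluating such a piecewise constant rate is given below. It assumes the two outermost epochs are unbounded, so that $M$ grid points delimit $M+1$ epochs; the function name `polyepoch_rate` and the grid and rate values are hypothetical.

```python
import numpy as np

def polyepoch_rate(t, w, theta):
    """Evaluate f(t) = theta_m for w_m <= t < w_{m-1}.

    w     : grid points in the document's order, w_1 > w_2 > ... > w_M
    theta : rates (theta_1, ..., theta_{M+1}); theta_1 applies for t >= w_1
            and theta_{M+1} for t < w_M (assumed unbounded outer epochs)
    """
    w = np.asarray(w, dtype=float)
    theta = np.asarray(theta, dtype=float)
    M = w.size
    if theta.size != M + 1:
        raise ValueError("need one rate per epoch: len(theta) == len(w) + 1")
    # k = number of grid points that are <= t, found on the ascending grid
    k = np.searchsorted(w[::-1], t, side="right")
    # k grid points at or below t means t lies in epoch m = M - k + 1
    return theta[M - k]

# Hypothetical grid (calendar years) and rates (subst./site/yr)
w = [2010.0, 2000.0, 1990.0]
theta = [4e-4, 3e-4, 2e-4, 1e-4]
rates = polyepoch_rate(np.array([2015.0, 2005.0, 1995.0, 1980.0]), w, theta)
```

With these values the four query times fall in epochs 1 through 4 in turn, so `rates` reproduces `theta` in order.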

3. Branch Length Calculation

For a branch connecting node $i$ to its parent $pa(i)$, the expected number of substitutions is: $b_i = \theta_{q+1}(w_q - t_{pa(i)}) + \sum_{m=p}^{q-1}\theta_{m+1}(w_m - w_{m+1}) + \theta_p(t_i - w_p)$, where $w_p \leq t_i < w_{p-1}$ and $w_{q+1} \leq t_{pa(i)} < w_q$ with $q \geq p$, i.e. the branch crosses the grid points $w_q, \ldots, w_p$.
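
The branch length is the integral of the piecewise constant rate between the parent and child node times, so it can be sketched as a dot product between the rates and the time the branch spends in each epoch. The code below is an illustrative sketch, not the BEAST X implementation; it assumes unbounded outer epochs, and the helper names and numeric values are hypothetical.

```python
import numpy as np

def epoch_overlaps(t_parent, t_child, w):
    """Time a branch spends in each of the M+1 epochs.

    w is in the document's order, w_1 > ... > w_M. Epoch m covers
    [w_m, w_{m-1}) with w_0 = +inf and w_{M+1} = -inf (assumed convention).
    """
    w = np.asarray(w, dtype=float)
    upper = np.concatenate(([np.inf], w))    # w_0, w_1, ..., w_M
    lower = np.concatenate((w, [-np.inf]))   # w_1, ..., w_M, w_{M+1}
    overlap = np.minimum(upper, t_child) - np.maximum(lower, t_parent)
    return np.clip(overlap, 0.0, None)

def branch_length(t_parent, t_child, w, theta):
    """Expected substitutions per site: integral of f from t_parent to t_child."""
    return float(epoch_overlaps(t_parent, t_child, w) @ np.asarray(theta, dtype=float))

# Hypothetical grid (calendar years) and rates (subst./site/yr)
w = [2010.0, 2000.0, 1990.0]
theta = [4e-4, 3e-4, 2e-4, 1e-4]
b = branch_length(1985.0, 2004.0, w, theta)
```

For these values the child lies in epoch $p = 2$ and the parent in epoch $q + 1 = 4$, and the three-term formula gives $10^{-4}\cdot 5 + 2\times10^{-4}\cdot 10 + 3\times10^{-4}\cdot 4 = 3.7\times10^{-3}$, which is what `b` evaluates to. The overlap form also covers a branch that stays inside a single epoch.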

4. Bayesian Inference Framework

Prior Design:

  • Gaussian Markov random field prior on $\zeta_m = \log\theta_m$
  • First-order differences: $\zeta_{m+1} - \zeta_m \mid \tau \sim N(0, d_m/\tau)$
  • Proper prior: $P(\zeta \mid \tau) \propto \tau^{M/2}\exp\left[-\frac{\tau}{2}\zeta'(D_w - \rho W)\zeta\right]$
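
The role of $\rho$ in the proper prior can be illustrated with a small numerical sketch. It assumes, purely for illustration, that $W$ is the unit-weight adjacency matrix of a chain linking neighboring epochs and that $D_w$ holds its row sums; the paper's weights may differ (for example, they may depend on $d_m$). Only the $\zeta$-dependent part of the log density is computed.

```python
import numpy as np

def car_precision(n, rho):
    """D_w - rho * W for a chain of n epochs with unit neighbor weights
    (an assumed form of W, chosen for illustration)."""
    W = np.zeros((n, n))
    idx = np.arange(n - 1)
    W[idx, idx + 1] = 1.0
    W[idx + 1, idx] = 1.0
    D = np.diag(W.sum(axis=1))
    return D - rho * W

def log_prior_kernel(zeta, tau, rho):
    """The zeta-dependent term -(tau / 2) * zeta' (D_w - rho W) zeta."""
    zeta = np.asarray(zeta, dtype=float)
    K = car_precision(zeta.size, rho)
    return -0.5 * tau * float(zeta @ K @ zeta)

zeta = np.log([4e-4, 3e-4, 2e-4, 1e-4])    # hypothetical log-rates
K_proper = car_precision(zeta.size, 0.9)   # rho < 1
K_improper = car_precision(zeta.size, 1.0) # rho = 1: intrinsic (improper) GMRF
min_eig_proper = np.linalg.eigvalsh(K_proper).min()
min_eig_improper = np.linalg.eigvalsh(K_improper).min()
```

Under this assumed $W$, the matrix at $\rho = 1$ is a graph Laplacian with a zero eigenvalue, so the density is flat along constant shifts of $\zeta$ and cannot be normalized. For any $\rho < 1$ the matrix is strictly diagonally dominant and hence positive definite.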

Posterior Sampling: Uses Hamiltonian Monte Carlo with gradient computation via the chain rule: $\frac{\partial}{\partial\theta_m}\log P(\theta,\tau,\rho,Q,\alpha,F \mid Y) = \sum_{i=1}^{2N-2}\frac{\partial\log P}{\partial b_i}\frac{\partial b_i}{\partial\theta_m}$
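
The chain rule above has a simple structure under a piecewise constant rate: each $b_i$ is linear in $\theta$, so $\partial b_i / \partial\theta_m$ is the time branch $i$ spends in epoch $m$. The sketch below illustrates this with a toy Poisson-style function of the branch lengths standing in for the tree likelihood. That stand-in, the helper names, and all numeric values are hypothetical; the real branch-length derivatives would come from the phylogenetic likelihood.

```python
import numpy as np

def overlap_matrix(t_parent, t_child, w):
    """A[i, m-1] = time branch i spends in epoch m = d b_i / d theta_m.
    Assumes w_1 > ... > w_M and unbounded outer epochs."""
    w = np.asarray(w, dtype=float)
    upper = np.concatenate(([np.inf], w))
    lower = np.concatenate((w, [-np.inf]))
    t_parent = np.asarray(t_parent, dtype=float)[:, None]
    t_child = np.asarray(t_child, dtype=float)[:, None]
    return np.clip(np.minimum(upper, t_child) - np.maximum(lower, t_parent), 0.0, None)

def toy_log_lik(b, counts):
    """Hypothetical stand-in for log P as a function of branch lengths."""
    return float(np.sum(counts * np.log(b) - b))

def toy_dlog_lik_db(b, counts):
    """Derivative of the toy log-likelihood with respect to each b_i."""
    return counts / b - 1.0

def grad_theta(theta, A, counts):
    """Chain rule: sum_i (dlogP/db_i) * (db_i/dtheta_m) = A' (dlogP/db)."""
    b = A @ theta
    return A.T @ toy_dlog_lik_db(b, counts)

w = [2010.0, 2000.0, 1990.0]
theta = np.array([4e-4, 3e-4, 2e-4, 1e-4])
A = overlap_matrix([1985.0, 1995.0, 2001.0], [2004.0, 2012.0, 2003.0], w)
counts = np.array([3.0, 5.0, 1.0])
g = grad_theta(theta, A, counts)
```

Once the branch-length derivatives are available, the remaining work is a single matrix-vector product over branches and epochs, which is where an $NM$-type term in the cost would arise.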

Technical Innovations

  1. Propriety guarantee: Ensures propriety of GMRF prior by introducing parameter $\rho < 1$
  2. Gradient optimization: Develops gradient computation with $O(NCS^2 + NM)$ complexity, significantly better than traditional $O(N^2CS^2)$ approach
  3. Flexible grid design: Supports equally-spaced or adaptive grid point configurations
  4. Multi-scale modeling: Handles different timescales from weeks to centuries

Experimental Setup

Datasets

  1. Simulated Data:
    • Strict clock model simulation
    • Log-linear clock model simulation ($f(t) = e^{-4.5-0.05t}$)
  2. Real Viral Datasets:
    • West Nile Virus: 104 complete genomes (1999-2007)
    • Dengue Virus Type 3: 352 sequences (1972-2010)
    • Seasonal Influenza A/H3N2: 402 sequences (1968-2010)
    • SARS-CoV-2: 3,959 genomes (2020 European data)
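
For the log-linear simulation scenario listed above, the rate function integrates in closed form, which gives an exact reference for the expected number of substitutions over any interval. The worked integral below follows from elementary calculus rather than from the paper, and $t_0 < t_1$ are hypothetical interval endpoints.

```latex
\int_{t_0}^{t_1} e^{-4.5 - 0.05\tau}\, d\tau
  = \frac{e^{-4.5}}{0.05}\left(e^{-0.05 t_0} - e^{-0.05 t_1}\right)
  = 20\, e^{-4.5}\left(e^{-0.05 t_0} - e^{-0.05 t_1}\right),
\qquad f(0) = e^{-4.5} \approx 0.0111 .
```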

Evaluation Metrics

  • Posterior median and 95% Bayesian credible intervals of evolutionary rate trajectories
  • Accuracy of time to most recent common ancestor (tMRCA) estimates
  • Log marginal likelihood (model comparison)
  • Effective sample size (ESS)
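
As a rough illustration of the first metric, the sketch below summarizes posterior draws of a rate trajectory by their per-epoch median and an equal-tailed 95% interval. The equal-tailed choice is an assumption (the analyses may report highest posterior density intervals instead), and the draws here are synthetic placeholders rather than output from any real analysis.

```python
import numpy as np

def summarize_trajectory(samples):
    """Per-epoch posterior median and equal-tailed 95% interval.

    samples : array of shape (n_draws, n_epochs) holding rate draws
    returns : (median, lower, upper), each of shape (n_epochs,)
    """
    lower, median, upper = np.percentile(samples, [2.5, 50.0, 97.5], axis=0)
    return median, lower, upper

# Synthetic placeholder draws: log-normal scatter around 5e-4 subst./site/yr
rng = np.random.default_rng(0)
samples = np.exp(rng.normal(np.log(5e-4), 0.1, size=(4000, 3)))
median, lower, upper = summarize_trajectory(samples)
```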

Comparison Methods

  • Strict clock model
  • Random local clock model
  • Log-linear clock model

Implementation Details

  • BEAST X software package implementation
  • MCMC iterations: 3-40 million
  • Number of grid points: 60-360 time periods
  • GMRF precision prior: Gamma(0.001, 0.001)

Experimental Results

Main Results

Simulation Validation

  1. Strict clock scenario: Polyepoch model accurately recovers constant rates with precise tMRCA estimates
  2. Log-linear scenario: Accurately recovers true rate trajectories in data-rich regions with slight overestimation at root

Real Data Analysis

West Nile Virus:

  • Relatively constant rate trajectory ($\approx 5 \times 10^{-4}$ subst./site/yr)
  • tMRCA: 1998 [1997, 1999]
  • Strict clock model fits better (log marginal likelihood difference $\approx 27$)

Dengue Virus:

  • Strong time-varying pattern: 10-fold rate decrease 1995-2000, 10-fold increase 2003-2009
  • Polyepoch model outperforms random local clock (log marginal likelihood improvement $\approx 220$)
  • tMRCA: 1972 [1963, 1973]

Seasonal Influenza A/H3N2:

  • Pronounced seasonal pattern: peak December-February
  • Increased peak heights post-2001
  • Posterior $\rho = 0.26$ [0.07, 0.58], avoiding over-smoothing

SARS-CoV-2 European Transmission:

  • 90% reduction in spatial spread rate during March 2020 lockdown
  • 9-fold rate increase after summer reopening
  • Negative correlation with effective population size

Ablation Studies

  • Grid density impact: More periods provide higher temporal resolution
  • Prior sensitivity: GMRF precision prior selection has limited impact on results
  • Propriety parameter $\rho$: Critical for detecting seasonal patterns

Experimental Findings

  1. Timescale dependence confirmation: Multiple viruses show significant time-varying rate patterns
  2. Epidemiological associations: Rate changes highly consistent with real-world intervention measures
  3. Computational efficiency: Gradient optimization enables large-scale data analysis

Main Research Directions

  1. Relaxed clock models: Random effects, local clocks, etc.
  2. Time-dependent models: Power-law decay, change-point models
  3. Nonparametric methods: Gaussian processes, spline functions

Advantages of This Work

  1. Theoretical rigor: Solid mathematical foundation based on ICTMC
  2. Computational feasibility: Avoids computational difficulties of Gaussian process integration
  3. Flexibility: Handles arbitrarily complex rate variation patterns
  4. Scalability: Linear time complexity supports large-scale data

Conclusions and Discussion

Main Conclusions

  1. Method effectiveness: Polyepoch clock model successfully captures time-varying evolutionary rates
  2. Biological significance: Reveals complex temporal dynamics of viral evolutionary rates
  3. Practical value: Provides real-time analytical tools for infectious disease surveillance

Limitations

  1. Root uncertainty: Lack of calibration points leads to large uncertainty in root rate estimates
  2. Computational complexity: Despite optimization, still requires substantial MCMC iterations
  3. Grid selection: Requires prior knowledge to guide grid point configuration
  4. Model selection: Lacks automatic method for determining optimal number of periods

Future Directions

  1. Bivariate CAR models: Joint modeling of rates and effective population size
  2. Adaptive grids: Develop data-driven grid selection methods
  3. Multi-locus extension: Handle heterogeneity in whole-genome data
  4. Real-time inference: Develop online update algorithms

In-Depth Evaluation

Strengths

  1. Theoretical innovation: First systematic introduction of ICTMC to phylogenetics with solid theoretical foundation
  2. Clever methodology: Piecewise constant parameterization cleverly balances flexibility and computational feasibility
  3. Computational optimization: Linear-time gradient algorithm is important technical contribution
  4. Comprehensive validation: Thorough verification across simulations and multiple real datasets
  5. Biological insights: Reveals important temporal dynamics characteristics of viral evolution

Weaknesses

  1. Prior sensitivity: GMRF prior propriety requires careful tuning of $\rho$ parameter
  2. Model complexity: High-dimensional parameter space may cause convergence issues
  3. Interpretation challenges: Biological interpretation of complex time-varying patterns requires further investigation
  4. Computational resources: Large-scale data analysis still requires substantial computational resources

Impact

  1. Methodological contribution: Provides new theoretical framework for phylogenetic clock models
  2. Software implementation: BEAST X integration ensures broad applicability
  3. Interdisciplinary value: Successful application of statistical methods to biological problems
  4. Real-time monitoring: Provides important tools for infectious disease outbreak response

Applicable Scenarios

  1. Rapidly evolving viruses: RNA viruses, influenza viruses, etc.
  2. Epidemic monitoring: Real-time tracking of pathogen transmission dynamics
  3. Evolutionary biology: Studying temporal patterns of adaptive evolution
  4. Paleontology: Analyzing evolutionary rate changes over long timescales

References

The paper cites important literature from phylogenetics, Bayesian inference, and Markov processes, including Felsenstein's classic pruning algorithm, Drummond et al.'s relaxed clock models, and Rue & Held's Gaussian Markov random field theory and other foundational works.


Overall Assessment: This is a high-quality methodological paper with significant contributions in theoretical innovation, technical implementation, and practical application. The polyepoch clock model provides new tools for phylogenetic inference and is particularly suited to studying rapidly evolving organisms. The paper features rigorous mathematical derivations, well-designed experiments, and convincing results, and is expected to have an important impact on phylogenetics and infectious disease research.