Likelihood-free inference of phylogenetic tree posterior distributions

Blassel, Boussau, Lartillot et al.

Phylogenetic inference, the task of reconstructing how related sequences evolved from common ancestors, is a central task in evolutionary genomics. The current state-of-the-art methods exploit probabilistic models of sequence evolution along phylogenetic trees, by searching for the tree maximizing the likelihood of observed sequences, or by estimating the posterior of the tree given the sequences in a Bayesian framework. Both approaches typically require computing likelihoods, which is only feasible under simplifying assumptions such as independence of the evolution at the different positions of the sequence, and even then remains a costly operation. Here we present Phyloformer 2, the first likelihood-free inference method for posterior distributions over phylogenies. Phyloformer 2 exploits a novel encoding for pairs of sequences that makes it more scalable than previous approaches, and a parameterized probability distribution factorized over a succession of subtree merges. The resulting network provides accurate estimates of the posterior distribution, and outperforms both state-of-the-art maximum likelihood methods and a previous likelihood-free method for point estimation. It opens the way to fast and accurate phylogenetic inference under realistic models of sequence evolution.

Basic Information

Paper ID: 2510.12976
Title: Likelihood-free inference of phylogenetic tree posterior distributions
Authors: Luc Blassel, Bastien Boussau, Nicolas Lartillot, Laurent Jacob
Classification: q-bio.PE (Populations and Evolution), q-bio.QM (Quantitative Methods)
Publication Date: October 14, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2510.12976v1

Abstract

Phylogenetic inference is a central task in evolutionary genomics, aiming to reconstruct how related sequences have evolved from a common ancestor. Current state-of-the-art methods leverage probabilistic models of sequence evolution along phylogenetic trees, either by finding the tree that maximizes the likelihood of the observed sequences or by estimating the posterior distribution of trees given the sequences within a Bayesian framework. Both approaches typically require computing the likelihood, which is only feasible under simplifying assumptions (such as independent evolution across sequence positions) and remains computationally expensive even then. This paper introduces Phyloformer 2, the first likelihood-free inference method for posterior distributions over phylogenies. Phyloformer 2 uses a novel encoding of sequence pairs that scales better than previous methods, together with a parameterized probability distribution factorized over a succession of subtree merges. The network provides accurate posterior distribution estimates and, for point estimation, outperforms both state-of-the-art maximum likelihood methods and a previous likelihood-free approach.

Research Background and Motivation

Problem Definition

Phylogenetic inference is the task of reconstructing the evolutionary history of a set of extant sequences, requiring determination of the binary tree structure describing how they diverged from a common ancestor. This task is significant in multiple domains:

Evolutionary Biology: Understanding how extant species evolved from common ancestors
Disease Transmission: Tracking the emergence and spread of bacterial antibiotic resistance
Epidemiology: Monitoring disease transmission patterns

Limitations of Existing Methods

Traditional phylogenetic inference methods primarily rely on probabilistic models and face the following key challenges:

Computational Complexity: Computing the likelihood requires the costly pruning algorithm (Felsenstein, 1981)
Massive Search Space: The number of unrooted binary tree topologies on n leaves is (2n-5)!!, which makes exhaustive search infeasible
Simplified Model Assumptions: To keep computation feasible, sequence positions are assumed to evolve independently and identically, which ignores the effects of natural selection
Unrealistic Simulation Results: Under these simplifying assumptions, simulated sequence sets are unrealistic and phylogenetic reconstruction is prone to artifacts
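
To give a sense of scale for the search-space item above, here is a minimal sketch that evaluates the double factorial (2n-5)!! for unrooted binary trees on n labeled leaves. The function name is illustrative and does not come from the paper.

```python
def num_unrooted_topologies(n: int) -> int:
    """Number of unrooted binary tree topologies on n labeled leaves: (2n-5)!!."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    # (2n-5)!! = 1 * 3 * 5 * ... * (2n-5)
    for odd in range(1, 2 * n - 4, 2):
        count *= odd
    return count


for n in (4, 5, 10, 50):
    print(n, num_unrooted_topologies(n))
```

Ten leaves already give about two million topologies, and the count grows super-exponentially from there, which is why likelihood-based methods rely on heuristic search.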

Research Motivation

Likelihood-free inference (simulation-based inference) provides a new paradigm for addressing these issues:

Effective estimation when likelihood evaluation is infeasible but sampling is inexpensive
Leveraging deep learning to train neural networks on simulated data to approximate posterior distributions
Amortized inference: costly training but extremely fast inference
Capability to handle more complex and realistic evolutionary models

Core Contributions

First End-to-End Likelihood-Free Posterior Estimation Method: Proposes the first likelihood-free posterior estimation method directly from sequences to phylogenies, surpassing previous work limited to quartets
Novel Network Architecture EvoPF: Inspired by AlphaFold 2's EvoFormer, designs a more scalable and expressive sequence encoder capable of handling up to roughly 200 sequences
BayesNJ Probability Distribution Decomposition: Introduces a parametrization of probability distributions over phylogenies, factorized over a succession of subtree merges, that yields a properly defined distribution
Significant Performance Improvements: Outperforms state-of-the-art likelihood-based methods in topological accuracy with 1-2 orders of magnitude faster inference
Applicability to Complex Models: Can be trained under intractable likelihood models, with performance gaps further widening compared to misspecified likelihood-based estimators

Methodology Details

Task Definition

Input: A set of aligned sequences $x = \{x_1, \ldots, x_N\}$, where each sequence contains L characters
Output: Phylogeny $\theta = (\tau, \ell)$, including topology $\tau$ and branch lengths $\ell$
Objective: Learn an approximation $q_\psi(\theta|x)$ of the posterior distribution $p(\theta|x)$

Model Architecture

Phyloformer 2 comprises two core modules:

1. EvoPF Encoder

EvoPF is a transposed version of EvoFormer, maintaining two representations:

MSA Stack: Embeddings for each position in each sequence
Pairing Stack: Embeddings for each pair of sequences

Key Design Features:

Axial attention: Alternating column-wise (across sequences within positions) and row-wise (across positions within sequences) self-attention in the MSA stack
Flat self-attention across pairs: Replaces EvoFormer's triangular attention with plain self-attention over sequence pairs
Information interaction: The MSA and pairing stacks exchange information through an outer product mean (MSA to pairs) and pair biases (pairs to MSA attention)
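
The axial attention pattern listed above can be illustrated with a small NumPy sketch. This is not the EvoPF implementation: it uses a single attention head, random projection matrices, and residual connections, all of which are assumptions made for illustration, and it omits the pairing stack entirely.

```python
import numpy as np


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x, wq, wk, wv):
    """Single-head attention along the second-to-last axis of x, shape (..., T, D)."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def axial_block(msa, params_col, params_row):
    """msa has shape (N sequences, L positions, D channels)."""
    # Column-wise: within each position, attend across the N sequences.
    cols = np.swapaxes(msa, 0, 1)                       # (L, N, D)
    cols = cols + self_attention(cols, *params_col)     # residual (assumed)
    msa = np.swapaxes(cols, 0, 1)                       # (N, L, D)
    # Row-wise: within each sequence, attend across the L positions.
    return msa + self_attention(msa, *params_row)       # residual (assumed)


rng = np.random.default_rng(0)
N, L, D = 5, 12, 8
msa = rng.normal(size=(N, L, D))
params_col = tuple(rng.normal(scale=D ** -0.5, size=(D, D)) for _ in range(3))
params_row = tuple(rng.normal(scale=D ** -0.5, size=(D, D)) for _ in range(3))
out = axial_block(msa, params_col, params_row)
print(out.shape)
```

Each pass attends along one axis only, so the cost is on the order of L·N² + N·L² rather than (N·L)² for full attention over every residue of every sequence. Because this sketch has no positional encoding over sequences, permuting the input sequences simply permutes the output rows.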

2. BayesNJ Probability Distribution

BayesNJ defines a probability distribution over phylogenies, factorized over a succession of merges:

$q_{\psi(x)}(\theta = (\tau, \ell)|x) = \prod_{k=1}^{2N-3} q_m(m^{(k)}|m^{(<k)}) q_\ell(\ell^{(k)}|m^{(k)}, m^{(<k)})$

Key Innovations:

Canonical Merging Order: Ensures each phylogeny has only one valid merging sequence
Constraint Handling: Ensures consistency between sampling and evaluation through distance constraints
Branch Length Parametrization: Reparametrizes the branch lengths introduced at each merge as a sum ($s^{(k)}$) and a ratio ($r^{(k)}$), modeled with Gamma and Beta distributions respectively
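
A minimal sketch of the sum/ratio idea follows, using only the Python standard library. It assumes that the sum s is the total length of the two branches created by a merge and that the ratio r is the share assigned to the first branch; the exact definitions in the paper may differ, and the parameter values below are arbitrary placeholders rather than network outputs.

```python
import math
import random


def sample_branch_pair(shape, scale, alpha, beta, rng):
    """Draw (l1, l2) from a Gamma-distributed sum and a Beta-distributed ratio."""
    s = rng.gammavariate(shape, scale)   # sum of the two branch lengths, s > 0
    r = rng.betavariate(alpha, beta)     # share of s given to the first branch, 0 < r < 1
    return r * s, (1.0 - r) * s


def log_density_sum_ratio(s, r, shape, scale, alpha, beta):
    """log Gamma(s; shape, scale) + log Beta(r; alpha, beta), a density over (s, r)."""
    log_gamma = ((shape - 1) * math.log(s) - s / scale
                 - math.lgamma(shape) - shape * math.log(scale))
    log_beta = ((alpha - 1) * math.log(r) + (beta - 1) * math.log(1 - r)
                + math.lgamma(alpha + beta) - math.lgamma(alpha) - math.lgamma(beta))
    return log_gamma + log_beta


rng = random.Random(0)
l1, l2 = sample_branch_pair(shape=2.0, scale=0.1, alpha=3.0, beta=3.0, rng=rng)
s, r = l1 + l2, l1 / (l1 + l2)
print(l1, l2, log_density_sum_ratio(s, r, 2.0, 0.1, 3.0, 3.0))
```

The appeal of this parametrization is that the Gamma keeps the sum positive and the Beta keeps the ratio in (0, 1), so both branch lengths are positive by construction and sampling and density evaluation stay in closed form.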

Technical Innovations

Scalable Encoding Scheme: Compared to Phyloformer's sequence pair representation, EvoPF significantly improves scalability while maintaining expressiveness
Correct Probability Distribution Definition: Uses a canonical merging order to resolve the problem that the same phylogeny can otherwise be generated by several different merge sequences
End-to-End Training: Directly optimizes posterior probability, avoiding intermediate distance prediction steps
Constraint Satisfaction: Ensures sampled phylogenies conform to canonical order through dynamic constraint matrices

Experimental Setup

Datasets

Primary Training Set: 1.3 million tree/MSA pairs with 50 taxa, based on LG+G8 model
Multi-Size Dataset: 10-170 taxa for fine-tuning to avoid overfitting to taxon count
Complex Model Dataset: Cherry model (position-dependent) and SelReg model (position-heterogeneous)
MCMC Comparison Dataset: Generated using RevBayes priors for posterior distribution quality assessment

Evaluation Metrics

Topological Accuracy: Normalized Robinson-Foulds distance
Branch Length Accuracy: Kuhner-Felsenstein distance
Posterior Quality: Split frequency comparison with MCMC samples
Computational Efficiency: Runtime and memory usage
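
For readers unfamiliar with the topological metric above, the sketch below computes a normalized Robinson-Foulds distance between two trees given as nested tuples of leaf names. The tree encoding and function names are illustrative choices, and the normalization by 2(n-3) assumes fully resolved unrooted trees on the same n leaves.

```python
def leaf_set(tree):
    """All leaf names under a nested-tuple tree; a leaf is a plain string."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def splits(tree):
    """Non-trivial bipartitions, each stored as the side not containing a fixed anchor leaf."""
    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - clade if anchor in clade else clade
        if 2 <= len(side) <= len(all_leaves) - 2:
            found.add(side)
        return clade

    visit(tree)
    return found


def normalized_rf(tree_a, tree_b):
    """Symmetric difference of split sets divided by 2(n-3)."""
    n = len(leaf_set(tree_a))
    if n < 4 or leaf_set(tree_a) != leaf_set(tree_b):
        raise ValueError("trees must share the same set of at least 4 leaves")
    return len(splits(tree_a) ^ splits(tree_b)) / (2 * (n - 3))


t1 = (("a", "b"), "c", ("d", "e"))
t2 = (("a", "c"), "b", ("d", "e"))
print(normalized_rf(t1, t1), normalized_rf(t1, t2))
```

A value of 0 means identical topologies and 1 means the two trees share no internal split.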

Comparison Methods

Likelihood-Based: IQTree, FastTree, FastME
Likelihood-Free: Original Phyloformer (PF)
Variants: PF2topo (topology only), PF2ℓ1 (L1 loss)

Experimental Results

Main Results

Topological Accuracy Improvements

Substantial improvements over the original PF at all tree sizes
Outperforms state-of-the-art maximum likelihood methods like IQTree and FastTree for trees with 10-175 leaves
Performance advantages primarily stem from posterior distribution estimation using correct priors

Dramatic Computational Efficiency Gains

Speed: 1 order of magnitude faster than FastTree, 2 orders of magnitude faster than IQTree
Scalability: Although memory-intensive, PF2 scales better than PF and can handle larger trees
PF2topo: Topology-only version is nearly 1 order of magnitude faster than original PF

Advantages Under Complex Models

Under intractable likelihood models (Cherry and SelReg):

PF2 significantly outperforms equivalent PF models
Performance gaps further widen compared to misspecified likelihood-based methods
Demonstrates advantages of likelihood-free methods under complex models

Ablation Studies

Training the PF2ℓ1 variant with an L1 loss shows:

The EvoPF encoder alone contributes a modest gain in topological accuracy
Most of the improvement in topological accuracy comes from the BayesNJ loss function
End-to-end posterior estimation therefore has an advantage over predicting distances as an intermediate step

Posterior Distribution Quality Assessment

Comparison with RevBayes MCMC samples shows:

RevBayes produces sharply peaked posterior distributions (most splits appear either in all sampled trees or in none)
PF2 provides softer posterior distributions that agree substantially with RevBayes
Splits appearing in all RevBayes trees have frequency >0.6 in PF2
Splits never sampled by RevBayes have frequency <0.3 in PF2
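
The split frequency comparison described above can be sketched as follows. Each sampled tree is represented directly by its set of non-trivial splits, written as strings; the two samples are made-up toy data and are not results from the paper.

```python
from collections import Counter


def split_frequencies(sampled_trees):
    """Fraction of sampled trees that contain each split."""
    counts = Counter(split for tree in sampled_trees for split in tree)
    return {split: n / len(sampled_trees) for split, n in counts.items()}


# Made-up toy samples over leaves a..e (not data from the paper).
mcmc_sample = [{"ab|cde", "abc|de"}] * 4
pf2_sample = [{"ab|cde", "abc|de"}] * 3 + [{"ac|bde", "abc|de"}]

freq_mcmc = split_frequencies(mcmc_sample)
freq_pf2 = split_frequencies(pf2_sample)
for split in sorted(set(freq_mcmc) | set(freq_pf2)):
    print(split, freq_mcmc.get(split, 0.0), freq_pf2.get(split, 0.0))
```

Plotting one sampler's frequencies against the other's, split by split, gives the kind of agreement check reported here: points near the diagonal indicate that the two posteriors assign similar support to each split.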

Related Work

Traditional Phylogenetic Inference

Maximum Likelihood Methods: IQTree, FastTree, etc., requiring heuristic tree space search
Bayesian Methods: MCMC sampling of posterior distributions with high computational cost
Variational Inference: Approximating posterior distributions, but still requiring likelihood computation

Likelihood-Free Phylogenetic Inference

Quartet Methods: Reduce inference to a 3-class classification over the three possible unrooted topologies of four sequences, which does not scale to larger trees
Distance Prediction Methods: Phyloformer predicts evolutionary distances, then reconstructs trees using NJ
This Work's Contribution: First end-to-end full phylogenetic posterior estimation method

Neural Posterior Estimation (NPE)

Learning neural network approximations of posterior distributions by minimizing KL divergence
Amortized inference: extremely fast inference after training
Key challenge: Designing appropriate parametric distribution families for phylogenies
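
The KL-minimization objective mentioned above can be written out in its standard form. The derivation below is the generic neural posterior estimation objective rather than a formula quoted from the paper; the number of simulated pairs M is notation introduced here.

```latex
\begin{aligned}
\psi^\star
 &= \arg\min_\psi \; \mathbb{E}_{p(x)}\!\left[ \mathrm{KL}\!\left( p(\theta \mid x) \,\|\, q_\psi(\theta \mid x) \right) \right] \\
 &= \arg\min_\psi \; \mathbb{E}_{p(\theta, x)}\!\left[ -\log q_\psi(\theta \mid x) \right] \\
 &\approx \arg\min_\psi \; -\frac{1}{M} \sum_{i=1}^{M} \log q_\psi(\theta_i \mid x_i),
 \qquad \theta_i \sim p(\theta),\; x_i \sim p(x \mid \theta_i).
\end{aligned}
```

The second line follows because the term $\log p(\theta \mid x)$ in the KL divergence does not depend on $\psi$. Training therefore needs only samples from the simulator and an evaluable $q_\psi$, never the likelihood itself.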

Conclusions and Discussion

Main Conclusions

Method Effectiveness: Phyloformer 2 successfully achieves likelihood-free posterior estimation for phylogenies
Performance Advantages: Outperforms existing methods in both accuracy and speed
Scalability: Handles larger-scale problems than previous methods
Practical Value: Opens new avenues for inference under complex evolutionary models

Limitations

Scalability Constraints: Currently handles at most 200 sequences, limiting applications on larger datasets
Out-of-Distribution Generalization: May produce inaccurate estimates without warning for inputs outside training data
Expressiveness Limitations:
- Embeddings not updated within recursive processes
- Branch length posteriors restricted to specific parametric distributions (Gamma and Beta)
Calibration Quality: Calibration quality of posterior distributions requires further investigation

Future Directions

More Efficient Encoders: Exploring more efficient architectures for handling larger-scale problems
Hierarchical Methods: Combining with existing heuristics to build larger trees
Uncertainty Assessment: Providing uncertainty estimates for predictions
Unaligned Sequences: Handling unaligned sequence inputs
More Complex Models: Inference under broader evolutionary models incorporating population dynamics and coevolution

In-Depth Evaluation

Strengths

Major Technical Breakthrough: First end-to-end phylogenetic posterior estimation, breaking through quartet limitations
Theoretical Rigor: Elegantly solves technical challenges in probability distribution definition through canonical merging order
Comprehensive Experiments: Multiple datasets, evaluation metrics, and comparison methods with thorough ablation studies
High Practical Value: Significant speed improvements and accuracy gains have important application value
Clear Presentation: Technical details clearly described with intuitive architecture diagrams

Weaknesses

Limited Scalability: 200-sequence limitation remains insufficient in the genomic era
Model Expressiveness: Limitations such as non-updated embeddings in recursive processes and fixed parametric distribution forms constrain model expressiveness
Insufficient Calibration Assessment: Posterior distribution calibration quality evaluation is relatively simple, requiring deeper analysis
Cherry Dataset Issues: The authors acknowledge an error in the Cherry dataset they used, which weakens the credibility of the related conclusions

Impact

Academic Contribution: Introduces entirely new likelihood-free paradigm to phylogenetic inference
Methodological Value: BayesNJ decomposition ideas may inspire probabilistic modeling of other structured objects
Application Prospects: Fast and accurate inference capabilities will promote large-scale evolutionary research
Reproducibility: Detailed implementation details and training parameters facilitate reproduction and improvement

Applicable Scenarios

Medium-Scale Phylogenetics: Phylogenetic inference for 50-200 sequences
Complex Evolutionary Models: Scenarios requiring consideration of position-dependent effects or selection pressure
Fast Inference Needs: Applications requiring numerous repeated inferences
Bayesian Analysis: Research requiring posterior distributions rather than point estimates

References

Felsenstein, J. (1981). Evolutionary trees from DNA sequences: a maximum likelihood approach.
Minh, B. Q., et al. (2020). IQ-TREE 2: New models and efficient methods for phylogenetic inference.
Nesterenko, L., et al. (2025). Phyloformer: Fast, accurate, and versatile phylogenetic reconstruction.
Lueckmann, J.-M., et al. (2021). Benchmarking simulation-based inference.
Jumper, J., et al. (2021). Highly accurate protein structure prediction with AlphaFold.
