Phylogenetic inference, the task of reconstructing how related sequences evolved from common ancestors, is a central task in evolutionary genomics. The current state-of-the-art methods exploit probabilistic models of sequence evolution along phylogenetic trees, by searching for the tree maximizing the likelihood of observed sequences, or by estimating the posterior of the tree given the sequences in a Bayesian framework. Both approaches typically require to compute likelihoods, which is only feasible under simplifying assumptions such as independence of the evolution at the different positions of the sequence, and even then remains a costly operation. Here we present Phyloformer 2, the first likelihood-free inference method for posterior distributions over phylogenies. Phyloformer 2 exploits a novel encoding for pairs of sequences that makes it more scalable than previous approaches, and a parameterized probability distribution factorized over a succession of subtree merges. The resulting network provides accurate estimates of the posterior distribution, and outperforms both state-of-the-art maximum likelihood methods and a previous likelihood-free method for point estimation. It opens the way to fast and accurate phylogenetic inference under realistic models of sequence evolution.
- Paper ID: 2510.12976
- Title: Likelihood-free inference of phylogenetic tree posterior distributions
- Authors: Luc Blassel, Bastien Boussau, Nicolas Lartillot, Laurent Jacob
- Classification: q-bio.PE (Populations and Evolution), q-bio.QM (Quantitative Methods)
- Publication Date: October 14, 2025 (arXiv preprint)
- Paper Link: https://arxiv.org/abs/2510.12976v1
Phylogenetic inference is a central task in evolutionary genomics, aiming to reconstruct how related sequences have evolved from a common ancestor. Current state-of-the-art methods leverage probabilistic models of sequence evolution along phylogenetic trees, either by finding the tree that maximizes the likelihood of the observed sequences or by estimating the posterior distribution of trees given the sequences within a Bayesian framework. Both approaches typically require computing the likelihood function, which is only feasible under simplifying assumptions (such as independence of evolution across sequence positions) and remains computationally expensive even then. This paper introduces Phyloformer 2, the first likelihood-free inference method for phylogenetic posterior distributions. Phyloformer 2 employs a novel sequence pair encoding that scales better than previous methods, and a parameterized probability distribution factorized over a succession of subtree merges. The network provides accurate posterior distribution estimates and, for point estimation, outperforms both state-of-the-art maximum likelihood methods and a previous likelihood-free approach.
Phylogenetic inference is the task of reconstructing the evolutionary history of a set of extant sequences, requiring determination of the binary tree structure describing how they diverged from a common ancestor. This task is significant in multiple domains:
- Evolutionary Biology: Understanding how extant species evolved from common ancestors
- Disease Transmission: Tracking the emergence and spread of bacterial antibiotic resistance
- Epidemiology: Monitoring disease transmission patterns
Traditional phylogenetic inference methods primarily rely on probabilistic models and face the following key challenges:
- Computational Complexity: Likelihood function computation requires expensive pruning algorithms (Felsenstein, 1981)
- Massive Search Space: The number of unrooted binary tree topologies on n leaves is (2n-5)!!, which grows super-exponentially and makes exhaustive search infeasible
- Simplified Model Assumptions: Keeping the likelihood tractable typically requires assuming that sequence positions evolve independently and identically, which ignores effects such as natural selection
- Unrealistic Simulation Results: Sequences simulated under these simplified assumptions are unrealistic, and the resulting model misspecification can introduce artifacts in phylogenetic reconstruction
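To give a sense of the size of the search space, here is a minimal sketch that evaluates the double factorial (2n-5)!! for unrooted binary trees. The function name `num_unrooted_topologies` is introduced here for illustration only.

```python
def num_unrooted_topologies(n_leaves: int) -> int:
    """Number of unrooted binary tree topologies on n labelled leaves, (2n-5)!!."""
    if n_leaves < 3:
        raise ValueError("an unrooted binary tree needs at least 3 leaves")
    count = 1
    # (2n-5)!! = 1 * 3 * 5 * ... * (2n-5); the loop multiplies the odd factors >= 3.
    for odd in range(3, 2 * n_leaves - 4, 2):
        count *= odd
    return count


if __name__ == "__main__":
    for n in (4, 5, 10, 20, 50):
        print(n, num_unrooted_topologies(n))
```

Four leaves already give 3 topologies and ten leaves give 2,027,025, which is why practical methods rely on heuristic search or sampling rather than enumeration.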
Likelihood-free inference (simulation-based inference) provides a new paradigm for addressing these issues:
- Effective estimation when likelihood evaluation is infeasible but sampling is inexpensive
- Leveraging deep learning to train neural networks on simulated data to approximate posterior distributions
- Amortized inference: costly training but extremely fast inference
- Capability to handle more complex and realistic evolutionary models
- First End-to-End Likelihood-Free Posterior Estimation Method: Proposes the first likelihood-free posterior estimation method directly from sequences to phylogenies, surpassing previous work limited to quartets
- Novel Network Architecture EvoPF: Inspired by AlphaFold 2's EvoFormer, designs a more scalable and expressive sequence encoder capable of handling around 200 sequences
- BayesNJ Probability Distribution Decomposition: Introduces a parametrization of probability distributions over phylogenies, factorized over a succession of subtree merges, that ensures the distribution is well defined
- Significant Performance Improvements: Outperforms state-of-the-art likelihood-based methods in topological accuracy with 1-2 orders of magnitude faster inference
- Applicability to Complex Models: Can be trained under models with intractable likelihoods, where its advantage over misspecified likelihood-based estimators widens further
Input: A set of aligned sequences $x = \{x_1, \dots, x_N\}$, where each sequence contains $L$ characters
Output: Phylogeny $\theta = (\tau, \ell)$, including topology $\tau$ and branch lengths $\ell$
Objective: Learn an approximation $q_\psi(\theta \mid x)$ of the posterior distribution $p(\theta \mid x)$
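As an illustration of the input side of this problem statement, the sketch below turns N aligned sequences of length L into an N x L x 21 one-hot array. The 20-letter amino acid alphabet plus a gap symbol, and the name `one_hot_msa`, are assumptions made for this example; the summary does not specify how Phyloformer 2 encodes characters.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # assumed alphabet, not taken from the paper
GAP = "-"
SYMBOLS = AMINO_ACIDS + GAP


def one_hot_msa(sequences):
    """Encode N aligned sequences of length L as nested lists of shape N x L x 21."""
    lengths = {len(seq) for seq in sequences}
    if len(lengths) != 1:
        raise ValueError("aligned sequences must all have the same length L")
    index = {symbol: i for i, symbol in enumerate(SYMBOLS)}
    encoded = []
    for seq in sequences:
        rows = []
        for char in seq:
            one_hot = [0.0] * len(SYMBOLS)
            one_hot[index[char]] = 1.0  # raises KeyError on unknown characters
            rows.append(one_hot)
        encoded.append(rows)
    return encoded


if __name__ == "__main__":
    x = one_hot_msa(["ARN-", "ARND", "CRND"])
    print(len(x), len(x[0]), len(x[0][0]))  # N, L, alphabet size
```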
Phyloformer 2 comprises two core modules:
EvoPF is a transposed version of EvoFormer, maintaining two representations:
- MSA Stack: Embeddings for each position in each sequence
- Pairing Stack: Embeddings for each pair of sequences
Key Design Features:
- Axial attention: Alternating column-wise (across sequences within positions) and row-wise (across positions within sequences) self-attention in the MSA stack
- Flat self-attention across pairs: Replaces EvoFormer's triangular attention with a simpler flat self-attention over sequence pairs
- Information interaction: The MSA and pairing stacks exchange information through an outer product mean (MSA to pairs) and a pair bias (pairs to MSA)
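The axial attention pattern listed above can be sketched in a few lines. This is a deliberately stripped-down illustration, not the EvoPF implementation: it uses a single head with identity query/key/value projections and omits residual connections, layer normalization, and the pair bias. All function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Single-head dot-product self-attention with identity Q/K/V projections."""
    dim = len(vectors[0])
    outputs = []
    for query in vectors:
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, vectors))
                        for j in range(dim)])
    return outputs


def column_attention(msa):
    """Attend across sequences, independently at each alignment position."""
    n_seq, length = len(msa), len(msa[0])
    out = [[None] * length for _ in range(n_seq)]
    for pos in range(length):
        column = self_attention([msa[i][pos] for i in range(n_seq)])
        for i in range(n_seq):
            out[i][pos] = column[i]
    return out


def row_attention(msa):
    """Attend across positions, independently within each sequence."""
    return [self_attention(row) for row in msa]


def axial_block(msa):
    """One column-wise pass followed by one row-wise pass over an N x L x d array."""
    return row_attention(column_attention(msa))
```

Each pass attends over a single axis, so the cost is on the order of N^2 L + N L^2 score evaluations instead of the (N L)^2 needed for full attention over every sequence and position jointly.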
BayesNJ defines a probability distribution over phylogenies, factorized over a succession of subtree merges:
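$$q_{\psi(x)}\big(\theta = (\tau, \ell) \mid x\big) = \prod_{k=1}^{2N-3} q_m\big(m^{(k)} \mid m^{(<k)}\big)\, q_\ell\big(\ell^{(k)} \mid m^{(k)}, m^{(<k)}\big)$$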
Key Innovations:
- Canonical Merging Order: Ensures each phylogeny has only one valid merging sequence
- Constraint Handling: Ensures consistency between sampling and evaluation through distance constraints
- Branch Length Parametrization: Reparametrizes branch lengths through a sum $s^{(k)}$ and a ratio $r^{(k)}$, modeled with Gamma and Beta distributions
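The sum/ratio reparametrization can be illustrated with a small sketch. It assumes, as the most natural reading of the item above, that the sum of the two branch lengths created by a merge follows a Gamma distribution and that the ratio is the first branch's share of that sum, following a Beta distribution. The function names and the fixed distribution parameters are hypothetical; in the method itself the parameters would be produced by the network.

```python
import random


def to_branch_lengths(total, ratio):
    """Map (sum, ratio) to the two branch lengths of a merge."""
    return ratio * total, (1.0 - ratio) * total


def to_sum_ratio(left, right):
    """Inverse map: recover (sum, ratio) from two branch lengths."""
    total = left + right
    if total <= 0.0:
        raise ValueError("branch lengths must have a positive sum")
    return total, left / total


def sample_branch_lengths(gamma_shape, gamma_scale, beta_a, beta_b, rng):
    """Draw sum ~ Gamma and ratio ~ Beta, then convert to two branch lengths."""
    total = rng.gammavariate(gamma_shape, gamma_scale)  # positive by construction
    ratio = rng.betavariate(beta_a, beta_b)             # lies in [0, 1]
    return to_branch_lengths(total, ratio)


if __name__ == "__main__":
    rng = random.Random(0)
    print(sample_branch_lengths(2.0, 0.05, 3.0, 3.0, rng))
```

Under this reading both branch lengths are non-negative by construction, because the Gamma distribution has positive support and the ratio lies in the unit interval.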
- Scalable Encoding Scheme: Compared to Phyloformer's sequence pair representation, EvoPF significantly improves scalability while maintaining expressiveness
- Correct Probability Distribution Definition: Resolves the issue that the same phylogeny can be generated by multiple merging sequences through canonical merging order
- End-to-End Training: Directly optimizes posterior probability, avoiding intermediate distance prediction steps
- Constraint Satisfaction: Ensures sampled phylogenies conform to canonical order through dynamic constraint matrices
- Primary Training Set: 1.3 million tree/MSA pairs with 50 taxa, based on LG+G8 model
- Multi-Size Dataset: 10-170 taxa for fine-tuning to avoid overfitting to taxon count
- Complex Model Dataset: Cherry model (position-dependent) and SelReg model (position-heterogeneous)
- MCMC Comparison Dataset: Generated using RevBayes priors for posterior distribution quality assessment
- Topological Accuracy: Normalized Robinson-Foulds distance
- Branch Length Accuracy: Kuhner-Felsenstein distance
- Posterior Quality: Split frequency comparison with MCMC samples
- Computational Efficiency: Runtime and memory usage
- Likelihood-Based: IQTree, FastTree, FastME
- Likelihood-Free: Original Phyloformer (PF)
- Variants: PF2topo (topology only), PF2ℓ1 (L1 loss)
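For reference, the normalized Robinson-Foulds distance used as the topological accuracy metric can be sketched as follows. Trees are written as nested tuples of leaf names and compared as unrooted trees. The representation, the function names, and the normalization by 2(n-3), the maximum for two fully resolved trees on n leaves, are choices made for this example and are not taken from the paper's evaluation code.

```python
def leaf_set(tree):
    """All leaf names under a nested-tuple tree; a leaf is a plain string."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def splits(tree):
    """Non-trivial bipartitions of the leaf set, each stored as the side
    that does not contain a fixed reference leaf."""
    all_leaves = leaf_set(tree)
    reference = min(all_leaves)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - clade if reference in clade else clade
        if 1 < len(side) < len(all_leaves) - 1:  # skip trivial splits
            found.add(side)
        return clade

    visit(tree)
    return found


def normalized_rf(tree_a, tree_b):
    """Symmetric difference of split sets divided by 2(n-3)."""
    leaves = leaf_set(tree_a)
    if leaves != leaf_set(tree_b):
        raise ValueError("trees must be defined on the same leaves")
    max_rf = 2 * (len(leaves) - 3)
    if max_rf <= 0:
        return 0.0
    return len(splits(tree_a) ^ splits(tree_b)) / max_rf


if __name__ == "__main__":
    t1 = ((("A", "B"), "C"), ("D", "E"))
    t2 = ((("A", "C"), "B"), ("D", "E"))
    print(normalized_rf(t1, t2))
```

The two example trees share the split separating D and E from the rest and differ on the other one, so half of the possible splits disagree.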
In tests on trees with 10 to 200 taxa, Phyloformer 2 is more topologically accurate than the comparison methods over most of the range:
- Substantial improvements over original PF across all sizes
- Outperforms state-of-the-art maximum likelihood methods like IQTree and FastTree for trees with 10-175 leaves
- Performance advantages primarily stem from posterior distribution estimation using correct priors
- Speed: 1 order of magnitude faster than FastTree, 2 orders of magnitude faster than IQTree
- Scalability: While memory-intensive, better scalability than PF, capable of handling larger trees
- PF2topo: Topology-only version is nearly 1 order of magnitude faster than original PF
Under intractable likelihood models (Cherry and SelReg):
- PF2 significantly outperforms equivalent PF models
- Performance gaps further widen compared to misspecified likelihood-based methods
- Demonstrates advantages of likelihood-free methods under complex models
Training PF2ℓ1 variant using L1 loss reveals:
- EvoPF encoder provides some assistance to topology prediction
- Most topological accuracy improvements derive from BayesNJ loss function
- Demonstrates advantages of end-to-end posterior estimation over distance prediction
Comparison with RevBayes MCMC samples shows:
- RevBayes produces sharply peaked posterior distributions (most branches appear either in all sampled trees or in none)
- PF2 provides softer posterior distributions with substantial agreement with RevBayes
- Branches appearing in all RevBayes trees have frequency >0.6 in PF2
- Unsampled branches have frequency <0.3 in PF2
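A split frequency comparison of this kind can be sketched as below. Each sampled tree is represented directly by its set of splits, with each split a frozenset of leaf names. The function names and the toy samples are illustrative only and do not reproduce any result from the paper.

```python
from collections import Counter


def split_frequencies(tree_sample):
    """Fraction of sampled trees that contain each split."""
    counts = Counter()
    for tree_splits in tree_sample:
        counts.update(set(tree_splits))  # count each split once per tree
    return {split: count / len(tree_sample) for split, count in counts.items()}


def compare_split_frequencies(sample_a, sample_b):
    """Pair up the frequency of every split seen in either sample."""
    freq_a = split_frequencies(sample_a)
    freq_b = split_frequencies(sample_b)
    return {split: (freq_a.get(split, 0.0), freq_b.get(split, 0.0))
            for split in set(freq_a) | set(freq_b)}


if __name__ == "__main__":
    ab, ac = frozenset("AB"), frozenset("AC")
    mcmc_like = [{ab}, {ab}, {ab}, {ab}]     # toy sample: one split in every tree
    network_like = [{ab}, {ab}, {ab}, {ac}]  # toy sample: a softer distribution
    for split, pair in compare_split_frequencies(mcmc_like, network_like).items():
        print(sorted(split), pair)
```

Plotting the resulting pairs against each other gives the kind of agreement plot that the bullet points above summarize with frequency thresholds.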
- Maximum Likelihood Methods: IQTree, FastTree, etc., requiring heuristic tree space search
- Bayesian Methods: MCMC sampling of posterior distributions with high computational cost
- Variational Inference: Approximating posterior distributions, but still requiring likelihood computation
- Quartet Methods: Reduce inference to a 3-class classification over the three possible quartet topologies, which does not scale to larger trees
- Distance Prediction Methods: Phyloformer predicts evolutionary distances, then reconstructs trees using NJ
- This Work's Contribution: First end-to-end full phylogenetic posterior estimation method
- Learning neural network approximations of posterior distributions by minimizing KL divergence
- Amortized inference: extremely fast inference after training
- Key challenge: Designing appropriate parametric distribution families for phylogenies
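The link between KL minimization and training on simulated data can be made explicit with a short derivation. It assumes the forward KL divergence averaged over simulated alignments, which is the usual choice in neural posterior estimation; the summary itself only states that a KL divergence is minimized. The sample size $M$ is notation introduced here.

```latex
\begin{aligned}
\mathbb{E}_{p(x)}\!\left[\mathrm{KL}\!\left(p(\theta \mid x)\,\|\,q_\psi(\theta \mid x)\right)\right]
  &= \mathbb{E}_{p(\theta, x)}\!\left[\log p(\theta \mid x)\right]
   - \mathbb{E}_{p(\theta, x)}\!\left[\log q_\psi(\theta \mid x)\right], \\
\arg\min_\psi \; \mathbb{E}_{p(x)}\!\left[\mathrm{KL}\!\left(p(\theta \mid x)\,\|\,q_\psi(\theta \mid x)\right)\right]
  &= \arg\min_\psi \; -\,\mathbb{E}_{p(\theta, x)}\!\left[\log q_\psi(\theta \mid x)\right] \\
  &\approx \arg\min_\psi \; -\frac{1}{M} \sum_{i=1}^{M} \log q_\psi(\theta_i \mid x_i),
  \qquad (\theta_i, x_i) \sim p(\theta)\, p(x \mid \theta).
\end{aligned}
```

The first term on the right-hand side of the first line does not depend on $\psi$, so the intractable true posterior never has to be evaluated: drawing a tree from the prior and simulating an alignment along it is enough to form the loss.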
- Method Effectiveness: Phyloformer 2 successfully achieves likelihood-free posterior estimation for phylogenies
- Performance Advantages: Outperforms existing methods in both accuracy and speed
- Scalability: Handles larger-scale problems than previous methods
- Practical Value: Opens new avenues for inference under complex evolutionary models
- Scalability Constraints: Currently handles at most 200 sequences, limiting applications on larger datasets
- Out-of-Distribution Generalization: May produce inaccurate estimates without warning for inputs outside training data
- Expressiveness Limitations:
  - Embeddings not updated within recursive processes
  - Branch length posteriors restricted to specific parametric distributions (Gamma and Beta)
- Calibration Quality: Calibration quality of posterior distributions requires further investigation
- More Efficient Encoders: Exploring more efficient architectures for handling larger-scale problems
- Hierarchical Methods: Combining with existing heuristics to build larger trees
- Uncertainty Assessment: Providing uncertainty estimates for predictions
- Unaligned Sequences: Handling unaligned sequence inputs
- More Complex Models: Inference under broader evolutionary models incorporating population dynamics and coevolution
- Major Technical Breakthrough: First end-to-end phylogenetic posterior estimation, breaking through quartet limitations
- Theoretical Rigor: Elegantly solves technical challenges in probability distribution definition through canonical merging order
- Comprehensive Experiments: Multiple datasets, evaluation metrics, and comparison methods with thorough ablation studies
- High Practical Value: Significant speed improvements and accuracy gains have important application value
- Clear Presentation: Technical details clearly described with intuitive architecture diagrams
- Limited Scalability: 200-sequence limitation remains insufficient in the genomic era
- Model Expressiveness: Limitations such as non-updated embeddings in recursive processes and fixed parametric distribution forms constrain model expressiveness
- Insufficient Calibration Assessment: Posterior distribution calibration quality evaluation is relatively simple, requiring deeper analysis
- Cherry Dataset Issues: The authors acknowledge using an erroneous Cherry dataset, which weakens the credibility of the related conclusions
- Academic Contribution: Introduces entirely new likelihood-free paradigm to phylogenetic inference
- Methodological Value: BayesNJ decomposition ideas may inspire probabilistic modeling of other structured objects
- Application Prospects: Fast and accurate inference capabilities will promote large-scale evolutionary research
- Reproducibility: Detailed implementation details and training parameters facilitate reproduction and improvement
- Medium-Scale Phylogenetics: Phylogenetic inference for 50-200 sequences
- Complex Evolutionary Models: Scenarios requiring consideration of position-dependent effects or selection pressure
- Fast Inference Needs: Applications requiring numerous repeated inferences
- Bayesian Analysis: Research requiring posterior distributions rather than point estimates
- Felsenstein, J. (1981). Evolutionary trees from DNA sequences: a maximum likelihood approach.
- Minh, B. Q., et al. (2020). IQ-TREE 2: New models and efficient methods for phylogenetic inference.
- Nesterenko, L., et al. (2025). Phyloformer: Fast, accurate, and versatile phylogenetic reconstruction.
- Lueckmann, J.-M., et al. (2021). Benchmarking simulation-based inference.
- Jumper, J., et al. (2021). Highly accurate protein structure prediction with AlphaFold.