Antibodies are essential proteins responsible for immune responses in organisms, capable of specifically recognizing antigen molecules of pathogens. Recent advances in generative models have significantly enhanced rational antibody design. However, existing methods mainly create antibodies from scratch without template constraints, leading to model optimization challenges and unnatural sequences. To address these issues, we propose a retrieval-augmented diffusion framework, termed RADAb, for efficient antibody design. Our method leverages a set of structural homologous motifs that align with query structural constraints to guide the generative model in inversely optimizing antibodies according to desired design criteria. Specifically, we introduce a structure-informed retrieval mechanism that integrates these exemplar motifs with the input backbone through a novel dual-branch denoising module, utilizing both structural and evolutionary information. Additionally, we develop a conditional diffusion model that iteratively refines the optimization process by incorporating both global context and local evolutionary conditions. Our approach is agnostic to the choice of generative models. Empirical experiments demonstrate that our method achieves state-of-the-art performance in multiple antibody inverse folding and optimization tasks, offering a new perspective on biomolecular generative models.
Retrieval Augmented Diffusion Model for Structure-informed Antibody Design and Optimization
- Paper ID: 2410.15040
- Title: Retrieval Augmented Diffusion Model for Structure-informed Antibody Design and Optimization
- Authors: Zichen Wang, Yaokun Ji, Jianing Tian, Shuangjia Zheng
- Classification: cs.AI
- Conference: ICLR 2025
- Paper Link: https://arxiv.org/abs/2410.15040
Antibodies are essential proteins responsible for immune responses in organisms, capable of specifically recognizing pathogenic antigens. Although recent advances in generative models have significantly enhanced rational antibody design capabilities, existing methods primarily create antibodies de novo while lacking template constraints, leading to optimization difficulties and unnatural sequences. To address these challenges, this paper proposes RADAb, a retrieval-augmented diffusion framework for efficient antibody design. The method leverages a set of structurally homologous motifs aligned with query structure constraints to guide the generative model in reverse-optimizing antibodies according to desired design criteria. Specifically, a structure-informed retrieval mechanism is introduced, which integrates exemplar motifs with input scaffolds through a novel dual-branch denoising module while leveraging structural and evolutionary information. Additionally, a conditional diffusion model is developed to iteratively optimize the process by combining global context and local evolutionary conditions. The approach is model-agnostic, and experiments demonstrate state-of-the-art performance on multiple antibody inverse folding and optimization tasks.
The core challenge in antibody design is generating functional antibody sequences with predefined biochemical properties. Traditional antibody development relies on labor-intensive experimental methods such as animal immunization or screening large antibody libraries, which often fail to effectively produce antibodies targeting therapeutically relevant epitopes.
- Data Scarcity: Primarily dependent on the SAbDab database, containing fewer than 10,000 antigen-antibody complex structures, limiting the model's ability to capture high-order interaction information
- De Novo Design Difficulty: Existing methods attempt to design antibody sequences from scratch, lacking template-based guidance and requiring substantial data and extensive training
- Missing Structural Constraints: Current generative models struggle to design antibodies that comply with structural constraints while possessing desired biological properties
Inspired by template- and fragment-based antibody design, this work aims to:
- Enhance model generation capabilities by leveraging template-aware local and global protein geometric information
- Integrate motif evolutionary signals to prevent overfitting
- Require minimal training or fine-tuning for practical applications
- First Retrieval-Augmented Generation Framework: Proposes the first retrieval-augmented generation framework for rational antibody design, using a set of functional CDR-like fragments satisfying desired scaffold structures and properties to guide generation
- Novel Retrieval Mechanism: Introduces a structure-informed retrieval mechanism that integrates exemplar motifs with input scaffolds through a dual-branch denoising module, leveraging both structural and evolutionary information
- Significant Performance Improvements: Improves on state-of-the-art methods across multiple antibody inverse folding tasks, for example an 8.08% AAR improvement for long CDRH3 inverse folding and a 7 cal/mol average absolute ΔΔG improvement in functional optimization tasks
Given an antibody framework complex $C_{ab}$, an antigen $C_{ag}$, and retrieved CDR-like fragments $A$, the objective is to predict the sequence distribution of the CDR region $R=\{s_j \mid j \in \{a+1,\dots,a+m\}\}$, where $m$ is the CDR length and $a$ is the starting position.
Uses the MASTER algorithm for structure retrieval:
- Input: Set of CDR scaffold atom coordinates $X=\{x_k \mid k \in \{1,\dots,m\}\}$
- Similarity Metric: Root mean square deviation (RMSD) of scaffold atoms
- Output: Set of structurally similar CDR-like fragments $A=\{A_i \mid i \in \{1,\dots,k\}\}$
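The retrieval step above can be illustrated with a brute-force sketch. This is not the MASTER algorithm itself, which uses a far more efficient search; it only shows the underlying idea of ranking candidate backbone windows by RMSD to the query scaffold after superposition. The function names `kabsch_rmsd` and `retrieve_fragments`, the one-point-per-residue representation, and the toy data are assumptions made for illustration.

```python
import numpy as np

def kabsch_rmsd(P: np.ndarray, Q: np.ndarray) -> float:
    """RMSD between two (m, 3) coordinate sets after optimal rigid superposition."""
    P = P - P.mean(axis=0)           # remove translation
    Q = Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(P.T @ Q)
    # Flip the last axis if needed so the result is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    diff = P @ R.T - Q
    return float(np.sqrt((diff ** 2).sum(axis=1).mean()))

def retrieve_fragments(query: np.ndarray, chains: dict, k: int = 3) -> list:
    """Return the k windows (rmsd, chain_name, start) closest to the query scaffold."""
    m = len(query)
    hits = []
    for name, coords in chains.items():
        for start in range(len(coords) - m + 1):
            window = coords[start:start + m]
            hits.append((kabsch_rmsd(query, window), name, start))
    hits.sort(key=lambda h: h[0])
    return hits[:k]

# Toy data (hypothetical): chain_b hides a rotated, shifted copy of the query at index 5.
rng = np.random.default_rng(0)
query = rng.normal(scale=3.0, size=(8, 3))
theta = 0.7
rot_z = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                  [np.sin(theta),  np.cos(theta), 0.0],
                  [0.0, 0.0, 1.0]])
chain_b = rng.normal(scale=3.0, size=(20, 3))
chain_b[5:13] = query @ rot_z.T + np.array([10.0, -4.0, 2.0])
chains = {"chain_a": rng.normal(scale=3.0, size=(15, 3)), "chain_b": chain_b}
top_hits = retrieve_fragments(query, chains, k=3)
```

Because only coordinates enter the ranking, the sequences of the retrieved windows play no role in selection, which is the property the summary describes as avoiding sequence information leakage.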
Global Geometric Context Branch:
- Context Encoder: Extracts single-residue features $z_i$ and residue-pair features $y_{ij}$
- Evolutionary Encoder: Uses ESM2 to extract evolutionary embeddings $e^t$ of antibody sequences
- Structure Information Network: Processes through stacked IPA layers, outputting the global probability representation $r_{global}$
Local CDR-Focused Branch:
- CDR-Focused Axial Attention: Constructs the pseudo-MSA matrix $P = \mathrm{concat}\big((S_{ab} \cup R_{gt}),\, E\big)$, where $E$ is the CDR-like sequence matrix
- Tied Row Attention Mechanism: Simultaneously considers multiple row attention scores, leveraging structural similarity
- Information Fusion: Fuses $r_{local}$ and $r_{global}$ through skip connections
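Tied row attention can be sketched in a few lines of NumPy. The version below follows the general idea of summing query-key scores over all rows of the pseudo-MSA so that every row shares a single attention map. It is a single-head illustration under assumed shapes and random weights, not the module used in RADAb, and the names `tied_row_attention`, `Wq`, `Wk`, `Wv` are hypothetical.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def tied_row_attention(X, Wq, Wk, Wv):
    """X: (rows, length, dim) pseudo-MSA embeddings; row 0 is assumed to be the query.

    Returns the attended values (rows, length, d) and the shared (length, length) map.
    """
    n_rows = X.shape[0]
    Q, K, V = X @ Wq, X @ Wk, X @ Wv              # each (rows, length, d)
    d = Q.shape[-1]
    # Sum scores over rows so all rows agree on which positions attend to which.
    scores = np.einsum("rid,rjd->ij", Q, K) / np.sqrt(n_rows * d)
    attn = softmax(scores, axis=-1)               # one map shared by every row
    out = np.einsum("ij,rjd->rid", attn, V)
    return out, attn

# Toy input: 1 query row + 3 retrieved rows, 10 positions, 8-dim embeddings.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 10, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out, attn = tied_row_attention(X, Wq, Wk, Wv)
```

Sharing one map across rows reflects the assumption that structurally similar fragments have the same position-to-position relationships, so pooling their scores gives a less noisy estimate than attending within each row separately.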
Forward process noise addition:
$$q(s_j^t \mid s_j^{t-1}) = \mathrm{Multinomial}\Big((1-\beta^t)\cdot \mathrm{onehot}(s_j^{t-1}) + \beta^t \cdot \tfrac{1}{20}\cdot \mathbf{1}\Big)$$
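As a minimal sketch, one forward noising step can be written as follows, assuming 20 amino acid types so that the uniform term contributes 1/20 per type. The function name `forward_step`, the integer encoding of residues, and the example values of the noise level are illustrative assumptions.

```python
import numpy as np

NUM_AA = 20  # assumed number of amino acid types

def forward_step(seq_prev: np.ndarray, beta_t: float, rng) -> np.ndarray:
    """Sample s^t from s^{t-1}: keep each residue with weight (1 - beta_t),
    otherwise redraw it uniformly over the amino acid types."""
    onehot = np.eye(NUM_AA)[seq_prev]                   # (m, 20)
    probs = (1.0 - beta_t) * onehot + beta_t / NUM_AA   # each row sums to 1
    return np.array([rng.choice(NUM_AA, p=p) for p in probs])

rng = np.random.default_rng(0)
seq = np.array([0, 5, 19, 3])               # toy CDR encoded as residue indices
unchanged = forward_step(seq, 0.0, rng)     # beta = 0: no noise is added
noised = forward_step(seq, 1.0, rng)        # beta = 1: fully uniform resampling
```

With `beta_t` increasing over the timesteps, repeated application drives the sequence toward the uniform distribution, which is the starting point for the reverse process.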
Reverse denoising process:
$$p(s_j^{t-1} \mid R^t, C_{ab}, C_{ag}, A) = \mathrm{Multinomial}\Big(\big[F(R^t, C_{ab}, C_{ag}, e^t) + G\big(F(R^t, C_{ab}, C_{ag}, e^t), A\big)\big][j]\Big)$$
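The reverse step combines the output of the global branch F with the output of the retrieval-conditioned local branch G before sampling each CDR position. A minimal sketch follows, in which `global_out` and `local_out` are stand-in arrays for the two branch outputs rather than real network predictions. The explicit renormalization is an assumption added so that the sum is a valid distribution; the summary does not state how the paper normalizes it.

```python
import numpy as np

def reverse_step(global_out: np.ndarray, local_out: np.ndarray, rng):
    """Sample s^{t-1} for every CDR position from the summed branch outputs.

    global_out, local_out: (m, 20) non-negative arrays standing in for F and G.
    Returns the sampled residue indices and the per-position probabilities.
    """
    combined = global_out + local_out                        # skip-connection style sum
    probs = combined / combined.sum(axis=1, keepdims=True)   # assumed renormalization
    sample = np.array([rng.choice(probs.shape[1], p=p) for p in probs])
    return sample, probs

rng = np.random.default_rng(0)
# Toy case: both branches put all their mass on the same residue at each position.
global_out = np.eye(20)[[2, 7, 7]]
local_out = np.eye(20)[[2, 7, 7]]
sample, probs = reverse_step(global_out, local_out, rng)
```

In a full sampler this step would be repeated from the last timestep down to the first, with both branches recomputed from the current noisy sequence each time.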
- Structure-Informed Retrieval: Utilizes the MASTER algorithm to retrieve CDR-like fragments based on scaffold structure, avoiding sequence information leakage
- Dual-Branch Architecture: Global branch captures antigen-antibody complex context, local branch learns homologous evolutionary information
- Tied Row Attention: Specially designed attention mechanism fully exploits structural similarity
- Model Agnosticism: Framework can be integrated with any diffusion generative model
- Training Set: SAbDab database, excluding structures with resolution worse than 4 Å, clustered based on 50% sequence similarity in the CDRH3 region
- Test Set: 50 PDB files containing 63 antibody-antigen complex structures
- CDR-Like Fragment Database: Constructed from non-redundant PDB, containing structurally compatible CDR-like linear functional motifs
- Amino Acid Recovery Rate (AAR): Proportion of positions where designed sequences match true CDR sequences
- Self-Consistency RMSD (scRMSD): RMSD of Cα atoms in CDR regions after refolding antibody structures
- Plausibility: Pseudo-log-likelihood computed using AntiBERTy
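The first two metrics are simple enough to sketch directly. The helper names `amino_acid_recovery` and `ca_rmsd` are hypothetical, and the RMSD helper assumes the refolded and reference Cα coordinates are already in the same frame; the summary does not say how the structures are superposed before scRMSD is computed.

```python
import numpy as np

def amino_acid_recovery(designed: str, native: str) -> float:
    """Fraction of CDR positions where the designed residue equals the native one."""
    if len(designed) != len(native):
        raise ValueError("sequences must have the same length")
    matches = sum(a == b for a, b in zip(designed, native))
    return matches / len(native)

def ca_rmsd(coords_a, coords_b) -> float:
    """RMSD in the input units between two (m, 3) sets of C-alpha coordinates."""
    a = np.asarray(coords_a, dtype=float)
    b = np.asarray(coords_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("coordinate arrays must have the same shape")
    return float(np.sqrt(((a - b) ** 2).sum(axis=1).mean()))

aar = amino_acid_recovery("ARDYW", "ARGYW")    # 4 of 5 positions match
reference = np.zeros((5, 3))
refolded = reference + np.array([1.0, 0.0, 0.0])   # every atom shifted by 1 along x
rmsd = ca_rmsd(refolded, reference)
```

Plausibility is omitted here because it depends on the AntiBERTy language model rather than on a closed-form calculation.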
- Traditional Methods: Grafting (direct transplantation of top-1 retrieved fragments)
- Deep Learning Methods: ProteinMPNN, ESM-IF1, Diffab-fix, AbMPNN
- Optimizer: Adam, learning rate 0.0001
- Batch size: 8
- CDRH3 trained separately for 100,000 iterations, other CDR regions jointly trained for 250,000 iterations
- Diffusion timesteps: 100
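For reproduction purposes, the hyperparameters listed above could be collected in one configuration object. The values come from the list; the class and field names are hypothetical and do not reflect the authors' code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainingConfig:
    """Hyperparameters as reported in the summary; field names are illustrative."""
    optimizer: str = "adam"
    learning_rate: float = 1e-4
    batch_size: int = 8
    cdrh3_iterations: int = 100_000       # CDRH3 is trained separately
    other_cdr_iterations: int = 250_000   # remaining CDRs are trained jointly
    diffusion_steps: int = 100

config = TrainingConfig()
```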
Antibody CDR Sequence Inverse Folding Results:
| Method | CDRH3 AAR (%) | CDRH3 scRMSD (Å) | CDRH3 Plausibility |
|---|---|---|---|
| Grafting | 19.63 | 3.20 | -0.591 |
| ProteinMPNN | 41.77 | 2.27 | -0.605 |
| Diffab-fix | 49.17 | 2.24 | -0.541 |
| AbMPNN | 52.99 | 2.80 | -0.675 |
| RADAb | 57.02 | 2.23 | -0.530 |
Long CDRH3 Sequence Design Results (length > 14):
| Method | AAR (%) | scRMSD (Å) | Plausibility |
|---|---|---|---|
| Diffab-fix | 42.26 | 3.02 | -0.740 |
| RADAb | 51.35 | 2.52 | -0.747 |
Binding Energy Optimization Results:
| Method | ΔΔG↓ | ΔΔG-seq↓ | IMP-seq(%)↑ |
|---|---|---|---|
| Grafting | 135.17 | 40.22 | 32.69 |
| ProteinMPNN | 127.14 | 24.72 | 35.51 |
| Diffab-fix | 116.36 | 14.05 | 34.52 |
| RADAb | 109.16 | 7.06 | 37.30 |
Ablation Study Results (CDRH3):
| Component | AAR (%) | scRMSD (Å) | Plausibility |
|---|---|---|---|
| Complete Model | 57.02 | 2.23 | -0.530 |
| Without Retrieval Augmentation | 52.15 | 2.39 | -0.529 |
| Without Evolutionary Embedding | 51.36 | 2.23 | -0.538 |
| Baseline Diffab | 49.17 | 2.24 | -0.541 |
Using SARS-CoV-2 neutralizing antibody (PDB: 7d6i) as an example, 68% of 50 generated CDRH3 sequences exhibited lower ΔG values compared to the original complex, demonstrating functional optimization effectiveness.
- Traditional Methods: Energy function optimization and sequence similarity-based approaches
- Machine Learning Methods:
  - Antibody sequence design: Language models and inverse folding models
  - Antigen-specific sequence-structure co-design: Graph neural network approaches
Diffusion models have been applied to protein design, building on the forward noising and reverse generation processes of DDPM.
Retrieval-augmented generation (RAG) has been extended from NLP to computer vision and molecular generation; this work presents itself as the first application to antibody design.
- RADAb achieves state-of-the-art performance on multiple antibody design tasks
- The retrieval augmentation mechanism significantly improves generation quality and functionality
- The dual-branch architecture effectively integrates global context and local evolutionary information
- Insufficient Experimental Validation: Lacks comprehensive wet lab verification
- Computational Overhead: Structure retrieval and ESM2 encoding require additional computational resources
- Data Leakage Risk: Applying current retrieval mechanisms in sequence-structure co-design tasks poses data leakage risks
- Wet lab validation will be a primary task
- Extend the model to various protein motif design applications
- Explore PPI retrieval to avoid data leakage issues
- Strong Innovation: First application of retrieval augmentation technology to antibody design with a novel dual-branch architecture
- Solid Technical Foundation: Well-designed structure-informed retrieval mechanism that avoids sequence information leakage
- Comprehensive Experiments: Thorough evaluation across multiple tasks and metrics, including ablation studies
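- Outstanding Performance: Achieves the best results on nearly all reported metrics; the exception is plausibility for long CDRH3, where Diffab-fix is marginally higher (-0.740 vs. -0.747)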
- Practical Validation Pending: Lacks wet lab verification; actual application effectiveness remains unknown
- High Computational Complexity: Retrieval process and dual-branch network increase computational burden
- Limited Applicability Scope: Primarily targets inverse folding tasks with limitations in full-atom design
- Academic Contribution: Provides new perspectives for biomolecular generative models, advancing retrieval augmentation technology in protein design
- Practical Value: Promises to accelerate antibody drug design processes and reduce experimental costs
- Reproducibility: Provides detailed implementation details and open-source code
- CDR optimization design based on known antibody templates
- Antibody sequence improvement requiring structural constraint preservation
- Antibody affinity maturation and functional optimization
This paper cites important works in antibody design, diffusion models, and retrieval-augmented generation, providing solid theoretical foundation and technical support for the RADAb framework.
Overall Assessment: This is a high-quality research paper that proposes an innovative retrieval-augmented diffusion framework for antibody design. The technical approach is well-designed, experimental evaluation is comprehensive, and results are convincing. Although practical application validation requires further strengthening, the work opens new research directions in protein design with significant academic value and application prospects.