2025-11-11T13:16:09.695232

Retrieval Augmented Diffusion Model for Structure-informed Antibody Design and Optimization

Wang, Ji, Tian et al.
Antibodies are essential proteins responsible for immune responses in organisms, capable of specifically recognizing antigen molecules of pathogens. Recent advances in generative models have significantly enhanced rational antibody design. However, existing methods mainly create antibodies from scratch without template constraints, leading to model optimization challenges and unnatural sequences. To address these issues, we propose a retrieval-augmented diffusion framework, termed RADAb, for efficient antibody design. Our method leverages a set of structural homologous motifs that align with query structural constraints to guide the generative model in inversely optimizing antibodies according to desired design criteria. Specifically, we introduce a structure-informed retrieval mechanism that integrates these exemplar motifs with the input backbone through a novel dual-branch denoising module, utilizing both structural and evolutionary information. Additionally, we develop a conditional diffusion model that iteratively refines the optimization process by incorporating both global context and local evolutionary conditions. Our approach is agnostic to the choice of generative models. Empirical experiments demonstrate that our method achieves state-of-the-art performance in multiple antibody inverse folding and optimization tasks, offering a new perspective on biomolecular generative models.

Basic Information

  • Paper ID: 2410.15040
  • Title: Retrieval Augmented Diffusion Model for Structure-informed Antibody Design and Optimization
  • Authors: Zichen Wang, Yaokun Ji, Jianing Tian, Shuangjia Zheng
  • Classification: cs.AI
  • Conference: ICLR 2025
  • Paper Link: https://arxiv.org/abs/2410.15040

Abstract

Antibodies are essential proteins responsible for immune responses in organisms, capable of specifically recognizing pathogenic antigens. Although recent advances in generative models have significantly enhanced rational antibody design capabilities, existing methods primarily create antibodies de novo while lacking template constraints, leading to optimization difficulties and unnatural sequences. To address these challenges, this paper proposes RADAb, a retrieval-augmented diffusion framework for efficient antibody design. The method leverages a set of structurally homologous motifs aligned with query structure constraints to guide the generative model in reverse-optimizing antibodies according to desired design criteria. Specifically, a structure-informed retrieval mechanism is introduced, which integrates exemplar motifs with input scaffolds through a novel dual-branch denoising module while leveraging structural and evolutionary information. Additionally, a conditional diffusion model is developed to iteratively optimize the process by combining global context and local evolutionary conditions. The approach is model-agnostic, and experiments demonstrate state-of-the-art performance on multiple antibody inverse folding and optimization tasks.

Research Background and Motivation

Problem Definition

The core challenge in antibody design is generating functional antibody sequences with predefined biochemical properties. Traditional antibody development relies on labor-intensive experimental methods such as animal immunization or screening large antibody libraries, which often fail to effectively produce antibodies targeting therapeutically relevant epitopes.

Limitations of Existing Methods

  1. Data Scarcity: Primarily dependent on the SAbDab database, containing fewer than 10,000 antigen-antibody complex structures, limiting the model's ability to capture high-order interaction information
  2. De Novo Design Difficulty: Existing methods attempt to design antibody sequences from scratch, lacking template-based guidance and requiring substantial data and extensive training
  3. Missing Structural Constraints: Current generative models struggle to design antibodies that comply with structural constraints while possessing desired biological properties

Research Motivation

Inspired by template- and fragment-based antibody design, this work aims to:

  1. Enhance model generation capabilities by leveraging template-aware local and global protein geometric information
  2. Integrate motif evolutionary signals to prevent overfitting
  3. Require minimal training or fine-tuning for practical applications

Core Contributions

  1. First Retrieval-Augmented Generation Framework: Proposes the first retrieval-augmented generation framework for rational antibody design, using a set of functional CDR-like fragments satisfying desired scaffold structures and properties to guide generation
  2. Novel Retrieval Mechanism: Introduces a structure-informed retrieval mechanism that integrates exemplar motifs with input scaffolds through a dual-branch denoising module, leveraging both structural and evolutionary information
  3. Significant Performance Improvements: Improves upon state-of-the-art methods on multiple antibody inverse folding and optimization tasks, including an 8.08% AAR improvement for long CDRH3 inverse folding and a 7 cal/mol average absolute ΔΔG improvement in functional optimization

Methodology Details

Task Definition

Given an antibody framework complex $C_{ab}$, antigen $C_{ag}$, and retrieved CDR-like fragments $A$, the objective is to predict the sequence distribution of the CDR region $R = \{s_j \mid j \in \{a+1, \ldots, a+m\}\}$, where $m$ is the CDR length and $a$ marks the starting position (the CDR spans residues $a+1$ through $a+m$).

Model Architecture

1. Structure Retrieval Module

Uses the MASTER algorithm for structure retrieval:

  • Input: Set of CDR scaffold atom coordinates $X = \{x_k \mid k \in \{1, \ldots, m\}\}$
  • Similarity Metric: Root mean square deviation (RMSD) of scaffold atoms
  • Output: Set of structurally similar CDR-like fragments $A = \{A_i \mid i \in \{1, \ldots, k\}\}$
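The retrieval step ranks candidate fragments by RMSD against the query coordinates. The sketch below illustrates that idea only: it is not the MASTER algorithm, it uses a plain Kabsch superposition as a stand-in, and it assumes every candidate has the same number of atoms as the query. The names `kabsch_rmsd` and `retrieve_top_k` are hypothetical.

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between two (m, 3) coordinate sets after optimal superposition."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    if P.shape != Q.shape:
        raise ValueError("coordinate sets must have the same shape")
    # Remove translation by centering both sets on their centroids.
    Pc = P - P.mean(axis=0)
    Qc = Q - Q.mean(axis=0)
    # Kabsch: the SVD of the covariance matrix gives the optimal rotation.
    U, _, Vt = np.linalg.svd(Pc.T @ Qc)
    # Flip the last axis if needed so the result is a proper rotation.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    diff = Pc @ R.T - Qc
    return float(np.sqrt((diff ** 2).sum() / len(P)))

def retrieve_top_k(query, database, k):
    """Return the k (name, rmsd) pairs with the lowest RMSD to the query.

    `database` is an iterable of (name, coords) pairs (hypothetical layout).
    """
    scored = [(name, kabsch_rmsd(coords, query)) for name, coords in database]
    scored.sort(key=lambda item: item[1])
    return scored[:k]
```

A rotated and translated copy of the query scores an RMSD near zero and is ranked first, which is the behaviour a structure-only similarity metric should have.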

2. Dual-Branch Denoising Network

Global Geometric Context Branch:

  • Context Encoder: Extracts single-residue features $z_i$ and residue-pair features $y_{ij}$
  • Evolutionary Encoder: Uses ESM2 to extract evolutionary embeddings $e^t$ of antibody sequences
  • Structure Information Network: Processes through stacked IPA layers, outputting global probability representation $r_{global}$

Local CDR-Focused Branch:

  • CDR-Focused Axial Attention: Constructs pseudo-MSA matrix $P = \text{concat}((S_{ab} \cup R^t_g), E)$, where $E$ is the CDR-like sequence matrix
  • Tied Row Attention Mechanism: Simultaneously considers multiple row attention scores, leveraging structural similarity
  • Information Fusion: Fuses $r_{local}$ and $r_{global}$ through skip connections
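The summary does not spell out the exact form of RADAb's tied row attention, so the sketch below follows the tied row attention introduced with the MSA Transformer as an assumption: per-row attention logits are summed so that all rows of the pseudo-MSA share one position-to-position attention map. Single-head attention and the weight matrices `Wq`, `Wk`, `Wv` are simplifications chosen for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def tied_row_attention(P_emb, Wq, Wk, Wv):
    """Single-head tied row attention over an embedded pseudo-MSA.

    P_emb: (rows, length, d_model) array, one row per aligned sequence.
    Returns (out, attn), where attn is the shared (length, length) map.
    """
    n_rows = P_emb.shape[0]
    Q = P_emb @ Wq  # (rows, length, d_head)
    K = P_emb @ Wk
    V = P_emb @ Wv
    d_head = Q.shape[-1]
    # Sum logits over rows: every row contributes to one shared attention map.
    logits = np.einsum("rid,rjd->ij", Q, K) / np.sqrt(n_rows * d_head)
    attn = softmax(logits, axis=-1)
    # Apply the shared map to each row's values.
    out = np.einsum("ij,rjd->rid", attn, V)
    return out, attn
```

Sharing one map across rows fits the setting here because the retrieved fragments are structurally similar to the query, so the same position pairs are expected to interact in every row.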

3. Conditional Diffusion Process

Forward process noise addition: $q(s^t_j \mid s^{t-1}_j) = \text{Multinomial}\left((1-\beta_t) \cdot \text{onehot}(s^{t-1}_j) + \beta_t \cdot \frac{1}{20} \cdot \mathbf{1}\right)$
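The forward transition above keeps the current residue with weight $1-\beta_t$ and mixes in a uniform distribution over the 20 amino acids with weight $\beta_t$. A minimal sketch of one step, together with the closed-form marginal that results from composing steps, is given below. The function names and the integer encoding of residues (0 to 19) are assumptions made for illustration.

```python
import numpy as np

NUM_AA = 20  # size of the amino acid alphabet

def forward_step_probs(prev_probs, beta_t):
    """Apply one forward step to an (m, 20) matrix of per-position probabilities."""
    return (1.0 - beta_t) * prev_probs + beta_t / NUM_AA

def sample_forward_step(prev_seq, beta_t, rng):
    """Sample s^t from q(s^t | s^{t-1}) for an integer-encoded sequence."""
    onehot = np.eye(NUM_AA)[prev_seq]
    probs = forward_step_probs(onehot, beta_t)
    return np.array([rng.choice(NUM_AA, p=p) for p in probs])

def marginal_probs(seq0, betas):
    """Closed-form q(s^t | s^0) after applying every beta in `betas` in order."""
    alpha_bar = np.prod(1.0 - np.asarray(betas, dtype=float))
    return alpha_bar * np.eye(NUM_AA)[seq0] + (1.0 - alpha_bar) / NUM_AA
```

Because each step is linear in the probabilities, iterating `forward_step_probs` over a schedule gives the same result as `marginal_probs`, so a noisy sequence at any timestep can be drawn directly from the clean sequence during training.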

Reverse denoising process: $p(s^{t-1}_j \mid R^t, C_{ab}, C_{ag}, A) = \text{Multinomial}\left[F(R^t, C_{ab}, C_{ag}, e^t) + G(F(R^t, C_{ab}, C_{ag}, e^t), A)\right][j]$

Technical Innovations

  1. Structure-Informed Retrieval: Utilizes the MASTER algorithm to retrieve CDR-like fragments based on scaffold structure, avoiding sequence information leakage
  2. Dual-Branch Architecture: Global branch captures antigen-antibody complex context, local branch learns homologous evolutionary information
  3. Tied Row Attention: Specially designed attention mechanism fully exploits structural similarity
  4. Model Agnosticism: Framework can be integrated with any diffusion generative model

Experimental Setup

Datasets

  • Training Set: SAbDab database, excluding structures with resolution worse than 4 Å, clustered based on 50% sequence similarity in the CDRH3 region
  • Test Set: 50 PDB files containing 63 antibody-antigen complex structures
  • CDR-Like Fragment Database: Constructed from non-redundant PDB, containing structurally compatible CDR-like linear functional motifs

Evaluation Metrics

  1. Amino Acid Recovery Rate (AAR): Proportion of positions where designed sequences match true CDR sequences
  2. Self-Consistency RMSD (scRMSD): RMSD of Cα atoms in CDR regions after refolding antibody structures
  3. Plausibility: Pseudo-log-likelihood computed using AntiBERTy
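As a concrete reading of the AAR definition above, the following sketch computes the fraction of positions at which a designed CDR matches the native one. The function name and the example sequences used with it are hypothetical, and it assumes both sequences have the same length.

```python
def amino_acid_recovery(designed, native):
    """Fraction of positions where the designed sequence matches the native one."""
    if len(designed) != len(native):
        raise ValueError("sequences must have the same length")
    if not native:
        raise ValueError("sequences must be non-empty")
    matches = sum(d == n for d, n in zip(designed, native))
    return matches / len(native)
```

For example, `amino_acid_recovery("ARDYW", "ARGYW")` compares five positions, of which four agree, giving an AAR of 0.8 (80%).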

Comparison Methods

  • Traditional Methods: Grafting (direct transplantation of top-1 retrieved fragments)
  • Deep Learning Methods: ProteinMPNN, ESM-IF1, Diffab-fix, AbMPNN

Implementation Details

  • Optimizer: Adam, learning rate 0.0001
  • Batch size: 8
  • CDRH3 trained separately for 100,000 iterations, other CDR regions jointly trained for 250,000 iterations
  • Diffusion timesteps: 100

Experimental Results

Main Results

Antibody CDR Sequence Inverse Folding Results:

| Method | CDRH3 AAR (%) | CDRH3 scRMSD | CDRH3 Plausibility |
|---|---|---|---|
| Grafting | 19.63 | 3.20 | -0.591 |
| ProteinMPNN | 41.77 | 2.27 | -0.605 |
| Diffab-fix | 49.17 | 2.24 | -0.541 |
| AbMPNN | 52.99 | 2.80 | -0.675 |
| RADAb | 57.02 | 2.23 | -0.530 |

Long CDRH3 Sequence Design Results (length > 14):

| Method | AAR (%) | scRMSD | Plausibility |
|---|---|---|---|
| Diffab-fix | 42.26 | 3.02 | -0.740 |
| RADAb | 51.35 | 2.52 | -0.747 |

Functional Optimization Results

Binding Energy Optimization Results:

| Method | ΔΔG ↓ | ΔΔG-seq ↓ | IMP-seq (%) ↑ |
|---|---|---|---|
| Grafting | 135.17 | 40.22 | 32.69 |
| ProteinMPNN | 127.14 | 24.72 | 35.51 |
| Diffab-fix | 116.36 | 14.05 | 34.52 |
| RADAb | 109.16 | 7.06 | 37.30 |

Ablation Study

| Component | AAR (%) | scRMSD | Plausibility |
|---|---|---|---|
| Complete Model | 57.02 | 2.23 | -0.530 |
| Without Retrieval Augmentation | 52.15 | 2.39 | -0.529 |
| Without Evolutionary Embedding | 51.36 | 2.23 | -0.538 |
| Baseline Diffab | 49.17 | 2.24 | -0.541 |

Case Study

Using SARS-CoV-2 neutralizing antibody (PDB: 7d6i) as an example, 68% of 50 generated CDRH3 sequences exhibited lower ΔG values compared to the original complex, demonstrating functional optimization effectiveness.

Related Work

Antibody Design Methods

  1. Traditional Methods: Energy function optimization and sequence similarity-based approaches
  2. Machine Learning Methods:
    • Antibody sequence design: Language models and inverse folding models
    • Antigen-specific sequence-structure co-design: Graph neural network approaches

Diffusion Generative Models

The paper reviews applications of diffusion models in protein design, which build on the forward noising and reverse generation processes of DDPM.

Retrieval-Augmented Generation

Retrieval-augmented generation (RAG) has been extended from NLP to computer vision and molecular generation; this work presents itself as the first application of RAG to antibody design.

Conclusions and Discussion

Main Conclusions

  1. RADAb achieves state-of-the-art performance on multiple antibody design tasks
  2. The retrieval augmentation mechanism significantly improves generation quality and functionality
  3. The dual-branch architecture effectively integrates global context and local evolutionary information

Limitations

  1. Insufficient Experimental Validation: Lacks comprehensive wet lab verification
  2. Computational Overhead: Structure retrieval and ESM2 encoding require additional computational resources
  3. Data Leakage Risk: Applying current retrieval mechanisms in sequence-structure co-design tasks poses data leakage risks

Future Directions

  1. Wet lab validation will be a primary task
  2. Extend the model to various protein motif design applications
  3. Explore PPI retrieval to avoid data leakage issues

In-Depth Evaluation

Strengths

  1. Strong Innovation: First application of retrieval augmentation technology to antibody design with a novel dual-branch architecture
  2. Solid Technical Foundation: Well-designed structure-informed retrieval mechanism that avoids sequence information leakage
  3. Comprehensive Experiments: Thorough evaluation across multiple tasks and metrics, including ablation studies
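  4. Outstanding Performance: Achieves the best AAR, scRMSD, and ΔΔG among the compared methods in every reported table; plausibility is on par with the strongest baseline, though slightly lower than Diffab-fix on long CDRH3 (-0.747 vs. -0.740)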

Weaknesses

  1. Practical Validation Pending: Lacks wet lab verification; actual application effectiveness remains unknown
  2. High Computational Complexity: Retrieval process and dual-branch network increase computational burden
  3. Limited Applicability Scope: Primarily targets inverse folding tasks with limitations in full-atom design

Impact

  1. Academic Contribution: Provides new perspectives for biomolecular generative models, advancing retrieval augmentation technology in protein design
  2. Practical Value: Promises to accelerate antibody drug design processes and reduce experimental costs
  3. Reproducibility: Provides detailed implementation details and open-source code

Applicable Scenarios

  1. CDR optimization design based on known antibody templates
  2. Antibody sequence improvement requiring structural constraint preservation
  3. Antibody affinity maturation and functional optimization

References

This paper cites important works in antibody design, diffusion models, and retrieval-augmented generation, providing solid theoretical foundation and technical support for the RADAb framework.


Overall Assessment: This is a high-quality research paper that proposes an innovative retrieval-augmented diffusion framework for antibody design. The technical approach is well-designed, experimental evaluation is comprehensive, and results are convincing. Although practical application validation requires further strengthening, the work opens new research directions in protein design with significant academic value and application prospects.