
Latent Retrieval Augmented Generation of Cross-Domain Protein Binders

Zhang, Kong, Huang et al.
Designing protein binders targeting specific sites, which requires generating realistic and functional interaction patterns, is a fundamental challenge in drug discovery. Current structure-based generative models are limited in generating interfaces with sufficient rationality and interpretability. In this paper, we propose Retrieval-Augmented Diffusion for Aligned interface (RADiAnce), a new framework that leverages known interfaces to guide the design of novel binders. By unifying retrieval and generation in a shared contrastive latent space, our model efficiently identifies relevant interfaces for a given binding site and seamlessly integrates them through a conditional latent diffusion generator, enabling cross-domain interface transfer. Extensive experiments show that RADiAnce significantly outperforms baseline models across multiple metrics, including binding affinity and recovery of geometries and interactions. Additional experimental results validate cross-domain generalization, demonstrating that retrieving interfaces from diverse domains, such as peptides, antibodies, and protein fragments, enhances the generation performance of binders for other domains. Our work establishes a new paradigm for protein binder design that successfully bridges retrieval-based knowledge and generative AI, opening new possibilities for drug discovery.

Basic Information

  • Paper ID: 2510.10480
  • Title: Latent Retrieval Augmented Generation of Cross-Domain Protein Binders
  • Authors: Zishen Zhang, Xiangzhe Kong, Wenbing Huang, Yang Liu
  • Classification: cs.LG cs.AI
  • Publication Date/Venue: Preprint. Under review (October 2025)
  • Paper Link: https://arxiv.org/abs/2510.10480

Abstract

Designing protein binders targeting specific binding sites is a fundamental challenge in drug discovery, requiring the generation of realistic and functional interaction patterns. Current structure-based generative models struggle to generate interfaces with sufficient plausibility and interpretability. This paper proposes RADiAnce (Retrieval-Augmented Diffusion for Aligned interface), which guides the design of novel binders by leveraging known interfaces. By unifying retrieval and generation in a shared contrastive latent space, the model efficiently identifies relevant interfaces for a given binding site and seamlessly integrates them through a conditional latent diffusion generator, enabling cross-domain interface transfer.

Research Background and Motivation

Core Problems

  1. Protein Binder Design Challenge: Designing binders that target specific protein sites requires generating realistic and functional molecular interface interaction patterns
  2. Limitations of Existing Methods: Current structure generation models lack plausibility and interpretability, failing to effectively utilize known structural information

Significance

  • Broad application value in drug discovery, structural biology, and related fields
  • Traditional methods rely on sampling and optimization over physical or statistical energy landscapes, which is inefficient
  • While deep generative models have made progress, they still struggle to generate plausible molecular interfaces

Limitations of Existing Approaches

  1. Neglect of Prior Knowledge: Most methods generate based solely on target binding sites, ignoring the abundant reusable interaction patterns in existing protein complexes
  2. Lack of Cross-Domain Generalization: Inability to effectively leverage common interaction motifs across different types of binders (e.g., peptides, antibodies, protein fragments)
  3. Insufficient Interpretability: The generation process lacks explicit biological guiding principles

Core Contributions

  1. Proposes RADiAnce Framework: The first method applying retrieval-augmented generation to protein binder sequence-structure co-design
  2. Constructs Contrastive Latent Space: Designs a unified latent representation supporting both retrieval and generation, enabling cross-domain interface similarity measurement
  3. Enables Cross-Domain Interface Transfer: Validates that retrieving interfaces from different binder types enhances generation performance for other domains
  4. Significant Performance Improvement: Substantially outperforms baseline methods across multiple evaluation metrics, including binding affinity, geometry, and interaction recovery

Methodology Details

Task Definition

  • Input: Binding site Y of target protein (residues within 10Å distance)
  • Output: Molecular binder X capable of specific binding to the site
  • Objective: Model conditional distribution p_θ(X | Y, T(Y|D)), where T(Y|D) represents relevant interfaces retrieved from database D
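
To make the input definition concrete, the sketch below selects a binding site Y from a known complex. The summary only states a 10 Å cutoff, so measuring distances between Cα atoms of the target and the binder is an assumption made here, and `binding_site`, `target_ca`, and `binder_ca` are hypothetical names.

```python
import math

def binding_site(target_ca, binder_ca, cutoff=10.0):
    """Return indices of target residues whose C-alpha lies within `cutoff`
    angstroms of any binder C-alpha (assumed distance definition)."""
    site = []
    for i, t in enumerate(target_ca):
        if any(math.dist(t, b) <= cutoff for b in binder_ca):
            site.append(i)
    return site

# Toy coordinates in angstroms (made up for illustration).
target = [(0.0, 0.0, 0.0), (8.0, 0.0, 0.0), (30.0, 0.0, 0.0)]
binder = [(0.0, 5.0, 0.0)]
print(binding_site(target, binder))  # [0, 1]
```

An all-atom distance would likely select a slightly different residue set; the cutoff logic stays the same.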

Model Architecture

1. Contrastive Variational Autoencoder (Contrastive VAE)

Encoder: $Z_x = E_\phi(X)$, $Z_y = E_\phi(Y)$
Decoder: $\hat{X} = D_\xi(Z_x, Z_y, Y)$

Key Design:

  • Independently encodes binding site Y and binder X into latent point clouds
  • Latent variables contain scalar embeddings $z_i$ and 3D coordinates $\vec{z}_i$
  • Aligns positive sample pairs through contrastive learning while repelling negative pairs

Loss Function:

$L(D) = \sum \left( L_{\mathrm{rec}} + L_{\mathrm{KL}} + L_{\mathrm{retrieval}} \right)$

Where:

  • $L_{\mathrm{rec}}$: Reconstruction loss (cross-entropy + MSE)
  • $L_{\mathrm{KL}}$: KL divergence regularization
  • $L_{\mathrm{retrieval}}$: Bidirectional contrastive loss
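
One common way to realize a bidirectional contrastive loss is a symmetric InfoNCE objective, sketched below. This is an illustration rather than the paper's exact formulation: it assumes each binder and each binding site has been pooled into a single vector, uses a dot-product similarity, and introduces a hypothetical `temperature` parameter.

```python
import math

def log_softmax(row):
    # Numerically stable log-softmax over one list of scores.
    m = max(row)
    lse = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - lse for v in row]

def bidirectional_contrastive_loss(zx, zy, temperature=0.1):
    """Symmetric InfoNCE over a batch: zx[i] and zy[i] form a positive pair,
    every other combination in the batch is treated as a negative."""
    n = len(zx)
    sim = [[sum(a * b for a, b in zip(zx[i], zy[j])) / temperature
            for j in range(n)] for i in range(n)]
    # Binder -> site direction: softmax over each row.
    x_to_y = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    # Site -> binder direction: softmax over each column.
    cols = [[sim[i][j] for i in range(n)] for j in range(n)]
    y_to_x = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (x_to_y + y_to_x)

# Matching pairs give a low loss, mismatched pairs a high one.
aligned = bidirectional_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                         [[1.0, 0.0], [0.0, 1.0]])
shuffled = bidirectional_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                          [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)  # True
```

Averaging the two directions penalizes both a binder that ranks the wrong site first and a site that ranks the wrong binder first, which is what retrieval from a site query relies on.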

2. Retrieval-Augmented Latent Diffusion

Forward Process:

$q(\vec{u}_i^{t} \mid \vec{u}_i^{t-1}) = \mathcal{N}\left(\vec{u}_i^{t};\ \sqrt{1-\beta_t}\,\vec{u}_i^{t-1},\ \beta_t I\right)$
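
The forward step can be sampled directly from its definition: scale the previous latent by sqrt(1 - β_t) and add Gaussian noise with variance β_t. The sketch below does this for a single 3D latent coordinate; the linear schedule in `betas` is a hypothetical choice, not a value reported in the paper.

```python
import math
import random

def forward_step(u_prev, beta_t, rng):
    """Sample u^t ~ N(sqrt(1 - beta_t) * u^{t-1}, beta_t * I) for one latent coordinate."""
    scale = math.sqrt(1.0 - beta_t)
    sigma = math.sqrt(beta_t)
    return [scale * c + sigma * rng.gauss(0.0, 1.0) for c in u_prev]

rng = random.Random(0)
# Hypothetical linear noise schedule over 100 steps.
betas = [0.01 + 0.19 * t / 99 for t in range(100)]

u = [3.0, -2.0, 1.0]
for beta_t in betas:
    u = forward_step(u, beta_t, rng)

# Fraction of the original signal that survives all steps: prod sqrt(1 - beta_t).
signal = math.prod(math.sqrt(1.0 - b) for b in betas)
print(signal < 0.01)  # True
```

With this schedule almost none of the starting coordinate remains after the last step, so the final latent is close to a standard Gaussian sample.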

Reverse Process:

$p_\theta(\vec{u}_i^{t-1} \mid Z_x^{t}, Z_y, T_v) = \mathcal{N}\left(\vec{u}_i^{t-1};\ \vec{\mu}_\theta(Z_x^{t}, Z_y, T_v),\ \beta_t I\right)$

Template Integration Mechanism:

  • Employs E(3)-equivariant Transformer as denoising core
  • Integrates retrieved template information through cross-attention mechanism
  • Query-Key-Value computation: $Q = H W_Q$, $K = T W_K$, $V = T W_V$
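
The Query-Key-Value computation above can be sketched as a single-head cross-attention in which binder latents H attend to retrieved template embeddings T. The sketch covers scalar features only and omits the E(3)-equivariant coordinate updates; the scaled dot-product softmax and the names `cross_attention`, `w_q`, `w_k`, and `w_v` are assumptions for illustration.

```python
import math

def matmul(a, b):
    # Plain list-of-lists matrix product.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    e = [math.exp(v - m) for v in row]
    s = sum(e)
    return [v / s for v in e]

def cross_attention(h, t, w_q, w_k, w_v):
    """Single-head cross-attention on scalar features.
    h: binder latent features (n x d), t: retrieved template features (m x d)."""
    q, k, v = matmul(h, w_q), matmul(t, w_k), matmul(t, w_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

identity = [[1.0, 0.0], [0.0, 1.0]]
h = [[1.0, 0.0]]              # one binder latent node
t = [[5.0, 0.0], [0.0, 5.0]]  # two retrieved template nodes
out, weights = cross_attention(h, t, identity, identity, identity)
print(weights[0][0] > weights[0][1])  # True
```

The binder node puts most of its attention on the template node it is aligned with, so the output is dominated by that template's value vector.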

Technical Innovations

  1. Unified Latent Space: First to achieve unified retrieval and generation in the same latent space, ensuring retrieved results directly guide the generation process
  2. Cross-Domain Similarity Measurement: Latent representations learned through contrastive learning capture common interaction motifs across different binder types
  3. Conditional Diffusion Integration: Innovatively integrates retrieved interface embeddings into the diffusion process through cross-attention and residual MLPs
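
Because binding sites and binder interfaces share one latent space, retrieval reduces to a nearest-neighbor search from the site embedding. The sketch below assumes each interface has been pooled into a single vector and ranked by cosine similarity; the database entries, their `id` and `z` fields, and the pooling step are all hypothetical.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def retrieve(site_embedding, database, k=2):
    """Return the k database entries whose embeddings are most similar to the site."""
    ranked = sorted(database, key=lambda e: cosine(site_embedding, e["z"]), reverse=True)
    return ranked[:k]

# Toy cross-domain database (embeddings made up for illustration).
database = [
    {"id": "peptide_A", "z": [0.9, 0.1, 0.0]},
    {"id": "antibody_B", "z": [0.0, 1.0, 0.0]},
    {"id": "fragment_C", "z": [0.8, 0.0, 0.6]},
]
hits = retrieve([1.0, 0.0, 0.0], database, k=2)
print([h["id"] for h in hits])  # ['peptide_A', 'fragment_C']
```

The hits come from different binder domains, which is the situation cross-domain transfer depends on: the ranking uses only latent similarity, not the binder type.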

Experimental Setup

Datasets

  1. Peptide Design: PepBench Dataset
    • Training: 4,157 complexes
    • Validation: 114 complexes
    • Testing: 93 LNR benchmark cases
  2. Antibody Design: SAbDab Dataset
    • Training: 9,473 entries
    • Validation: 400 entries
    • Testing: 60 RAbD benchmark cases
  3. Protein Fragments: ProtFrag Dataset
    • 70,498 monomer-derived protein fragments

Evaluation Metrics

  • AAR (Amino Acid Recovery Rate): Proportion of generated sequence matching reference sequence
  • RMSD: Root mean square deviation of Cα coordinates
  • ISM (Interaction Site Matching): Recovery degree of key physicochemical interactions
  • ∆∆G: Binding free energy change
  • IMP: Proportion of generated binders superior to natural ligands
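
The two simplest metrics above can be computed as follows. This is a minimal sketch: it assumes the generated and reference binders have equal length and that the coordinates are already in a common frame, so no superposition is performed; `aar` and `ca_rmsd` are hypothetical helper names.

```python
import math

def aar(generated, reference):
    """Fraction of positions where the generated residue matches the reference."""
    if len(generated) != len(reference):
        raise ValueError("sequences must have equal length")
    return sum(g == r for g, r in zip(generated, reference)) / len(reference)

def ca_rmsd(coords_a, coords_b):
    """RMSD over paired C-alpha coordinates, without any superposition."""
    if len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must have equal length")
    sq = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(sq / len(coords_a))

print(aar("ACDEF", "ACDKF"))  # 0.8 (4 of 5 positions match)
print(ca_rmsd([(0, 0, 0), (1, 0, 0)], [(0, 0, 3), (1, 0, 3)]))  # 3.0
```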

Baseline Methods

  • Peptide Design: RFDiffusion, PepFlow, PepGLAD, UniMoMo
  • Antibody Design: MEAN, DyMEAN, DiffAb, GeoAB, UniMoMo

Experimental Results

Main Results

Peptide Sequence-Structure Co-Design

Model        AAR (%)  RMSD (Å)  ∆∆G (kJ/mol)  IMP (%)  ISM (%)
RFDiffusion  34.68    4.69      24.78         5.38     28.38
PepFlow      35.47    2.87      15.71         14.13    27.83
PepGLAD      38.62    2.74      15.26         16.13    32.63
UniMoMo      38.69    2.31      2.409         40.86    49.13
RADiAnce     39.42    2.29      1.963         41.94    52.15

Antibody CDR Design

RADiAnce significantly outperforms baseline methods across all CDR regions (H1, H2, H3, L1, L2, L3):

  • H1 Region: AAR improved to 90.83%, ∆∆G improved to -8.221 kJ/mol
  • H3 Region (most challenging): AAR reaches 54.66%, significantly superior to other methods

Retrieval Reliability Verification

Model Configuration       ITO (%)  RC-0.1%  RC-0.5%  RC-5%
Antibody CVAE (Complete)  43.93    66.67    96.67    100.0
Peptide CVAE (Complete)   61.41    11.58    22.58    67.74

Ablation Studies

  1. Cross-Domain Training Effect: Including multi-domain data significantly improves retrieval and generation performance
  2. Joint Training Necessity: Simultaneously optimizing VAE and contrastive loss is critical
  3. Retrieval Quantity Impact: Optimal performance achieved with moderate retrieval (10-20 samples)

Case Analysis

Using the GPIIb/IIIa complex (PDB ID: 3NID) as an example:

  • Without retrieval guidance: Difficult to reconstruct characteristic multi-hydrogen bond interactions
  • With retrieval enhancement: Successfully inherits key interaction motifs, recovering arginine and tyrosine-mediated hydrogen bonding patterns

Related Work

Peptide Design

  • Transition from classical energy sampling to deep generative modeling
  • PepFlow/PPFlow employ multimodal flow matching
  • PepGLAD applies geometric latent diffusion

Antibody Design

  • Evolution from traditional physical sampling to deep learning frameworks
  • DiffAb and others introduce antigen-conditioned generation
  • Language model approaches such as PALM-H3 gain attention

Retrieval-Augmented Generation

  • Initially applied to NLP tasks
  • Later extended to molecular design by methods such as f-RAG and IRDiff
  • First application to protein binder co-design in this work

Conclusions and Discussion

Main Conclusions

  1. RADiAnce successfully establishes a new paradigm for retrieval-augmented protein binder design
  2. Cross-domain interface transfer significantly enhances generation performance, validating the existence of common interaction motifs
  3. Achieves substantial performance improvements across multiple benchmarks

Limitations

  1. Performance Dependent on Retrieval Quality: Relevance of retrieval results directly impacts generation effectiveness
  2. Limited Structural Descriptors: Current similarity measurements may not fully capture complex structural relationships
  3. Computational Complexity: Requires maintaining large-scale interface databases and performing real-time retrieval

Future Directions

  1. Improve structural descriptors and similarity measurements
  2. Explore more robust structure-aware conditional integration strategies
  3. Extend to additional molecular types and interaction patterns

In-Depth Evaluation

Strengths

  1. Strong Innovation: First to introduce RAG paradigm to protein binder design with novel technical approach
  2. Comprehensive Experiments: Thorough evaluation across multiple datasets and metrics, including detailed ablation studies
  3. Cross-Domain Generalization: Validates feasibility of knowledge transfer across different binder types
  4. High Practical Value: Demonstrates potential in real applications such as HIV-1 CD4 receptor antibody design

Weaknesses

  1. Insufficient Theoretical Analysis: Lacks theoretical explanation for cross-domain similarity measurement effectiveness
  2. Computational Efficiency: Insufficient analysis of computational overhead and storage requirements for large-scale retrieval
  3. Missing Biological Validation: Lacks laboratory verification of actual functionality of generated binders

Impact

  1. Academic Contribution: Provides new methodological framework for computational structural biology
  2. Practical Value: Promises to accelerate drug discovery and protein engineering applications
  3. Reproducibility: Provides implementation details and code for reproduction and extension

Applicable Scenarios

  • Lead compound design in drug discovery
  • Computer-aided antibody drug design
  • Protein interaction research
  • Protein engineering in synthetic biology

References

The paper cites 54 relevant references covering multiple domains including protein design, deep generative models, and retrieval-augmented generation, providing a solid theoretical foundation for the research.