Latent Retrieval Augmented Generation of Cross-Domain Protein Binders

Zhang, Kong, Huang et al.

Designing protein binders targeting specific sites, which requires generating realistic and functional interaction patterns, is a fundamental challenge in drug discovery. Current structure-based generative models are limited in generating interfaces with sufficient rationality and interpretability. In this paper, we propose Retrieval-Augmented Diffusion for Aligned interface (RADiAnce), a new framework that leverages known interfaces to guide the design of novel binders. By unifying retrieval and generation in a shared contrastive latent space, our model efficiently identifies relevant interfaces for a given binding site and seamlessly integrates them through a conditional latent diffusion generator, enabling cross-domain interface transfer. Extensive experiments show that RADiAnce significantly outperforms baseline models across multiple metrics, including binding affinity and recovery of geometries and interactions. Additional experimental results validate cross-domain generalization, demonstrating that retrieving interfaces from diverse domains, such as peptides, antibodies, and protein fragments, enhances the generation performance of binders for other domains. Our work establishes a new paradigm for protein binder design that successfully bridges retrieval-based knowledge and generative AI, opening new possibilities for drug discovery.

Basic Information

Paper ID: 2510.10480
Title: Latent Retrieval Augmented Generation of Cross-Domain Protein Binders
Authors: Zishen Zhang, Xiangzhe Kong, Wenbing Huang, Yang Liu
Classification: cs.LG cs.AI
Publication Date/Venue: Preprint. Under review (October 2025)
Paper Link: https://arxiv.org/abs/2510.10480

Abstract

Designing protein binders targeting specific binding sites is a fundamental challenge in drug discovery, requiring the generation of realistic and functional interaction patterns. Current structure-based generative models struggle to generate interfaces with sufficient plausibility and interpretability. This paper proposes RADiAnce (Retrieval-Augmented Diffusion for Aligned interface), which guides the design of novel binders by leveraging known interfaces. By unifying retrieval and generation in a shared contrastive latent space, the model efficiently identifies relevant interfaces for a given binding site and seamlessly integrates them through a conditional latent diffusion generator, enabling cross-domain interface transfer.

Research Background and Motivation

Core Problems

Protein Binder Design Challenge: Designing binders that target specific protein sites requires generating realistic and functional molecular interface interaction patterns
Limitations of Existing Methods: Current structure-based generative models produce interfaces that lack plausibility and interpretability, and they fail to make effective use of known structural information

Significance

Broad application value in drug discovery, structural biology, and related fields
Traditional methods rely on sampling and optimization over physical or statistical energy landscapes, which is inefficient
While deep generative models have made progress, they still struggle to generate plausible molecular interfaces

Limitations of Existing Approaches

Neglect of Prior Knowledge: Most methods generate based solely on target binding sites, ignoring the abundant reusable interaction patterns in existing protein complexes
Lack of Cross-Domain Generalization: Inability to effectively leverage common interaction motifs across different types of binders (e.g., peptides, antibodies, protein fragments)
Insufficient Interpretability: The generation process lacks explicit biological guiding principles

Core Contributions

Proposes RADiAnce Framework: The first method applying retrieval-augmented generation to protein binder sequence-structure co-design
Constructs Contrastive Latent Space: Designs a unified latent representation supporting both retrieval and generation, enabling cross-domain interface similarity measurement
Enables Cross-Domain Interface Transfer: Validates that retrieving interfaces from different binder types enhances generation performance for other domains
Significant Performance Improvement: Substantially outperforms baseline methods across multiple evaluation metrics, including binding affinity, geometry, and interaction recovery

Methodology Details

Task Definition

Input: Binding site Y of target protein (residues within 10Å distance)
Output: Molecular binder X capable of specific binding to the site
Objective: Model conditional distribution p_θ(X | Y, T(Y|D)), where T(Y|D) represents relevant interfaces retrieved from database D
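
As a rough illustration of the input definition, the sketch below selects a binding site by a distance cutoff. It assumes the 10 Å cutoff is measured from the binder in a reference complex and that each residue is reduced to one representative coordinate; the function name `binding_site` and the toy coordinates are hypothetical and do not come from the paper.

```python
import math

def binding_site(target_residues, binder_residues, cutoff=10.0):
    """Return indices of target residues within `cutoff` Å of any binder residue.

    Assumption of this sketch: every residue is represented by a single
    (x, y, z) coordinate, e.g. its C-alpha atom.
    """
    site = []
    for i, t in enumerate(target_residues):
        if any(math.dist(t, b) <= cutoff for b in binder_residues):
            site.append(i)
    return site

# Toy coordinates (hypothetical): two target residues near the binder, one far away.
target = [(0.0, 0.0, 0.0), (8.0, 0.0, 0.0), (30.0, 0.0, 0.0)]
binder = [(0.0, 5.0, 0.0)]
print(binding_site(target, binder))  # prints [0, 1]
```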

Model Architecture

1. Contrastive Variational Autoencoder (Contrastive VAE)

Encoder: Z_x = E_φ(X), Z_y = E_φ(Y)
Decoder: X̂ = D_ξ(Z_x, Z_y, Y)

Key Design:

Independently encodes binding site Y and binder X into latent point clouds
Latent variables contain scalar embeddings z_i and 3D coordinates z⃗_i
Aligns positive sample pairs through contrastive learning while repelling negative pairs

Loss Function:

L(D) = Σ (L_rec + L_KL + L_retrieval)

Where:

L_rec: Reconstruction loss (cross-entropy + MSE)
L_KL: KL divergence regularization
L_retrieval: Bidirectional contrastive loss
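
The summary does not give the exact form of the bidirectional contrastive loss. One common choice is a symmetric InfoNCE objective, sketched below under that assumption. It further assumes each interface side has been pooled to a single vector, and the temperature value and function names are illustrative only.

```python
import math

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _normalize(v):
    norm = math.sqrt(_dot(v, v))
    return [x / norm for x in v]

def info_nce(queries, keys, temperature=0.1):
    """Mean cross-entropy of matching queries[i] to keys[i] against all keys."""
    q = [_normalize(v) for v in queries]
    k = [_normalize(v) for v in keys]
    loss = 0.0
    for i, qi in enumerate(q):
        logits = [_dot(qi, kj) / temperature for kj in k]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_z - logits[i]
    return loss / len(q)

def bidirectional_contrastive_loss(site_emb, binder_emb, temperature=0.1):
    """Average of the site-to-binder and binder-to-site InfoNCE terms."""
    return 0.5 * (info_nce(site_emb, binder_emb, temperature)
                  + info_nce(binder_emb, site_emb, temperature))

# Toy pooled embeddings (hypothetical): row i of each list forms a positive pair.
sites = [[1.0, 0.0], [0.0, 1.0]]
binders = [[1.0, 0.1], [0.1, 1.0]]
print(bidirectional_contrastive_loss(sites, binders))
```

With correctly paired rows the loss is close to zero; swapping the binder rows makes every positive pair look like a negative and the loss grows sharply.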

2. Retrieval-Augmented Latent Diffusion

Forward Process:

q(u⃗_i^t | u⃗_i^{t-1}) = N(u⃗_i^t; √(1-β_t)·u⃗_i^{t-1}, β_t I)
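
The forward process above adds Gaussian noise to each latent coordinate one step at a time. A minimal sketch of that single-step transition follows. The noise schedule `betas` and the helper names are assumptions for illustration, since the summary does not state the schedule used.

```python
import math
import random

def forward_step(u_prev, beta_t, rng):
    """Sample u^t ~ N(sqrt(1 - beta_t) * u^{t-1}, beta_t * I) for one 3D latent coordinate."""
    scale = math.sqrt(1.0 - beta_t)
    sigma = math.sqrt(beta_t)
    return [scale * x + sigma * rng.gauss(0.0, 1.0) for x in u_prev]

def forward_trajectory(u0, betas, seed=0):
    """Apply the forward step once per entry in `betas`, returning all intermediate states."""
    rng = random.Random(seed)
    trajectory = [list(u0)]
    for beta_t in betas:
        trajectory.append(forward_step(trajectory[-1], beta_t, rng))
    return trajectory

# Hypothetical linear schedule with five steps.
betas = [0.01 * (i + 1) for i in range(5)]
for state in forward_trajectory([1.0, 2.0, 3.0], betas):
    print([round(x, 3) for x in state])
```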

Reverse Process:

p_θ(u⃗_i^{t-1} | Z_x^t, Z_y, T_v) = N(u⃗_i^{t-1}; μ⃗_θ(Z_x^t, Z_y, T_v), β_t I)

Template Integration Mechanism:

Employs E(3)-equivariant Transformer as denoising core
Integrates retrieved template information through cross-attention mechanism
Query-Key-Value computation: Q = H W_Q, K = T W_K, V = T W_V
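
The projections above can be illustrated with a plain cross-attention layer. The sketch below assumes H holds the hidden states of the latent being denoised and T holds the retrieved template embeddings, and it uses standard scaled dot-product attention. It covers only the invariant scalar features; the E(3)-equivariant handling of coordinates is omitted, and all names are illustrative.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(H, T, W_Q, W_K, W_V):
    """Attend from hidden states H (queries) to templates T (keys and values)."""
    Q = matmul(H, W_Q)
    K = matmul(T, W_K)
    V = matmul(T, W_V)
    d_k = len(K[0])
    K_t = [list(col) for col in zip(*K)]           # transpose of K
    scores = matmul(Q, K_t)                        # one row of scores per query
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V), weights

# Toy example (hypothetical): 2 latent nodes, 3 retrieved templates, identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
H = [[1.0, 0.0], [0.0, 1.0]]
T = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = cross_attention(H, T, identity, identity, identity)
print(out)
```

Each output row is a weighted average of template values, so templates whose keys align with a query contribute more to that latent node.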

Technical Innovations

Unified Latent Space: First to achieve unified retrieval and generation in the same latent space, ensuring retrieved results directly guide the generation process
Cross-Domain Similarity Measurement: Latent representations learned through contrastive learning capture common interaction motifs across different binder types
Conditional Diffusion Integration: Innovatively integrates retrieved interface embeddings into the diffusion process through cross-attention and residual MLPs

Experimental Setup

Datasets

Peptide Design: PepBench Dataset
- Training: 4,157 complexes
- Validation: 114 complexes
- Testing: 93 LNR benchmark cases
Antibody Design: SAbDab Dataset
- Training: 9,473 entries
- Validation: 400 entries
- Testing: 60 RAbD benchmark cases
Protein Fragments: ProtFrag Dataset
- 70,498 monomer-derived protein fragments

Evaluation Metrics

AAR (Amino Acid Recovery Rate): Proportion of generated sequence matching reference sequence
RMSD: Root mean square deviation of Cα coordinates
ISM (Interaction Site Matching): Recovery degree of key physicochemical interactions
∆∆G: Binding free energy change
IMP: Proportion of generated binders superior to natural ligands

Baseline Methods

Peptide Design: RFDiffusion, PepFlow, PepGLAD, UniMoMo
Antibody Design: MEAN, DyMEAN, DiffAb, GeoAB, UniMoMo

Experimental Results

Main Results

Peptide Sequence-Structure Co-Design

Model	AAR (%)	RMSD (Å)	∆∆G (kJ/mol)	IMP (%)	ISM (%)
RFDiffusion	34.68	4.69	24.78	5.38	28.38
PepFlow	35.47	2.87	15.71	14.13	27.83
PepGLAD	38.62	2.74	15.26	16.13	32.63
UniMoMo	38.69	2.31	2.409	40.86	49.13
RADiAnce	39.42	2.29	1.963	41.94	52.15

Antibody CDR Design

RADiAnce significantly outperforms baseline methods across all CDR regions (H1, H2, H3, L1, L2, L3):

H1 Region: AAR improved to 90.83%, ∆∆G improved to -8.221 kJ/mol
H3 Region (most challenging): AAR reaches 54.66%, significantly superior to other methods

Retrieval Reliability Verification

Model Configuration	ITO(%)	RC-0.1%	RC-0.5%	RC-5%
Antibody CVAE (Complete)	43.93	66.67	96.67	100.0
Peptide CVAE (Complete)	61.41	11.58	22.58	67.74

Ablation Studies

Cross-Domain Training Effect: Including multi-domain data significantly improves retrieval and generation performance
Joint Training Necessity: Simultaneously optimizing VAE and contrastive loss is critical
Retrieval Quantity Impact: Optimal performance achieved with moderate retrieval (10-20 samples)
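
To make the retrieval-quantity setting concrete, the sketch below retrieves the top-k interfaces for a query embedding. Cosine similarity over pooled embeddings is an assumption of this example, as the summary does not specify the similarity measure, and the database entries and names are hypothetical.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve_top_k(query, database, k=10):
    """Return the names of the k database entries most similar to `query`."""
    ranked = sorted(database.items(),
                    key=lambda item: cosine(query, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]

# Hypothetical cross-domain database: a peptide, an antibody, and a protein fragment.
database = {
    "pep_a": [1.0, 0.0],
    "ab_b": [0.9, 0.1],
    "frag_c": [0.0, 1.0],
}
print(retrieve_top_k([1.0, 0.0], database, k=2))  # prints ['pep_a', 'ab_b']
```

Because all entries live in one latent space, a peptide query can return antibody or fragment interfaces, which is the mechanism behind cross-domain transfer.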

Case Analysis

Using GPIIb/IIIa complex (PDB ID: 3NID) as example:

Without retrieval guidance: Difficult to reconstruct characteristic multi-hydrogen bond interactions
With retrieval enhancement: Successfully inherits key interaction motifs, recovering arginine and tyrosine-mediated hydrogen bonding patterns

Related Work

Peptide Design

Transition from classical energy sampling to deep generative modeling
PepFlow/PPFlow employ multimodal flow matching
PepGLAD applies geometric latent diffusion

Antibody Design

Evolution from traditional physical sampling to deep learning frameworks
DiffAb and others introduce antigen-conditioned generation
Language model approaches such as PALM-H3 gain attention

Retrieval-Augmented Generation

Initially applied to NLP tasks
Later extended to molecular design by methods such as f-RAG and IRDiff
First application to protein binder co-design in this work

Conclusions and Discussion

Main Conclusions

RADiAnce successfully establishes a new paradigm for retrieval-augmented protein binder design
Cross-domain interface transfer significantly enhances generation performance, validating the existence of common interaction motifs
Achieves substantial performance improvements across multiple benchmarks

Limitations

Performance Dependent on Retrieval Quality: Relevance of retrieval results directly impacts generation effectiveness
Limited Structural Descriptors: Current similarity measurements may not fully capture complex structural relationships
Computational Complexity: Requires maintaining large-scale interface databases and performing real-time retrieval

Future Directions

Improve structural descriptors and similarity measurements
Explore more robust structure-aware conditional integration strategies
Extend to additional molecular types and interaction patterns

In-Depth Evaluation

Strengths

Strong Innovation: First to introduce RAG paradigm to protein binder design with novel technical approach
Comprehensive Experiments: Thorough evaluation across multiple datasets and metrics, including detailed ablation studies
Cross-Domain Generalization: Validates feasibility of knowledge transfer across different binder types
High Practical Value: Demonstrates potential in real applications such as HIV-1 CD4 receptor antibody design

Weaknesses

Insufficient Theoretical Analysis: Lacks theoretical explanation for cross-domain similarity measurement effectiveness
Computational Efficiency: Insufficient analysis of computational overhead and storage requirements for large-scale retrieval
Missing Biological Validation: Lacks laboratory verification of actual functionality of generated binders

Impact

Academic Contribution: Provides new methodological framework for computational structural biology
Practical Value: Promises to accelerate drug discovery and protein engineering applications
Reproducibility: Provides implementation details and code for reproduction and extension

Applicable Scenarios

Lead compound design in drug discovery
Computer-aided antibody drug design
Protein interaction research
Protein engineering in synthetic biology

References

The paper cites 54 relevant references covering multiple domains including protein design, deep generative models, and retrieval-augmented generation, providing a solid theoretical foundation for the research.