Large pretrained language models have transformed natural language processing, and their adaptation to protein sequences (viewed as strings of amino acid characters) has advanced protein analysis. However, the distinct properties of proteins, such as variable sequence lengths and lack of word-sentence analogs, necessitate a deeper understanding of protein language models (LMs). We investigate the isotropy of protein LM embedding spaces using average pairwise cosine similarity and the IsoScore method, revealing that models like ProtBERT and ProtXLNet are highly anisotropic, utilizing only 2-14 dimensions for global and local representations. In contrast, multi-modal training in ProteinBERT, which integrates sequence and gene ontology data, enhances isotropy, suggesting that diverse biological inputs improve representational efficiency. We also find that embedding distances weakly correlate with alignment-based similarity scores, particularly at low similarity.
Isotropy and Geometry of Pretrained Protein Language Models
- Paper ID: 2510.10655
- Title: A Look at the Isotropy of Pretrained Protein Language Models
- Authors: Sheikh Azizul Hakim, Kowshic Roy, M Saifur Rahman
- Classification: q-bio.OT (Quantitative Biology - Other)
- Conference: ICML 2025 Workshop on Multi-modal Foundation Models and Large Language Models for Life Sciences
- Paper Link: https://arxiv.org/abs/2510.10655
Large-scale pretrained language models have transformed natural language processing, and their adaptation to protein sequences, treated as strings of amino acids, has advanced protein analysis. However, the distinct properties of proteins, such as variable sequence lengths and the absence of word-sentence analogs, call for a deeper understanding of protein language models (LMs). This study investigates the isotropy of protein LM embedding spaces using average pairwise cosine similarity and IsoScore, and finds that models such as ProtBERT and ProtXLNet are highly anisotropic: their global and local representations use only 2-14 dimensions. In contrast, ProteinBERT's multimodal training, which integrates sequence and gene ontology data, improves isotropy, suggesting that diverse biological inputs improve representational efficiency. The study also finds that embedding distances correlate only weakly with alignment-based similarity scores, particularly at low similarity.
This research addresses the limited understanding of the geometric properties of protein language model embedding spaces. Specifically, it targets the following gaps:
- Missing Isotropy Analysis: While extensive research on embedding space isotropy exists in NLP, such analysis is nearly absent in the protein domain
- Embedding Space Efficiency: Understanding whether high-dimensional protein embeddings effectively utilize all dimensions
- Biological Relevance Verification: The relationship between distance metrics in embedding space and traditional biological similarity measures remains unclear
- Theoretical Value: Provides theoretical foundations for understanding representation learning mechanisms in protein language models and guides model improvements
- Practical Value: Isotropy analysis can guide dimensionality reduction and model compression, improving computational efficiency
- Generative Model Applications: Diverse and information-rich latent spaces are crucial for generative tasks such as protein design and variant prediction
- Direct Transfer Issues: Most protein language models directly adopt NLP architectures without fully considering the unique properties of protein sequences
- Unimodal Constraints: Most models are trained solely on sequence information, lacking functional and structural biological priors
- Neglect of Geometric Properties: Lack of systematic analysis of embedding space geometric structures
- First Systematic Analysis: Provides the first comprehensive analysis of isotropy in protein language model embedding spaces
- Multi-dimensional Evaluation Methods: Employs two complementary isotropy metrics: average pairwise cosine similarity and IsoScore
- Verification of Multimodal Training Advantages: Demonstrates the effectiveness of multimodal training (sequence + gene ontology) in improving representation isotropy
- Biological Relevance Analysis: Examines in depth the relationship between embedding distances and traditional alignment similarity, revealing limitations of existing methods
- Local Representation Analysis: Extends analysis to amino acid-level local embeddings, discovering similar anisotropy patterns
The core task of this research is to analyze the geometric properties of protein language model embedding spaces. The setup is as follows:
- Input: Protein sequence datasets and pretrained protein language models
- Output: Isotropy metrics (IsoScore, average pairwise cosine similarity), effective dimensionality, correlation analysis between embedding distances and biological similarity
- Constraints: Using standard protein datasets and published pretrained models to ensure reproducibility
Cosine similarity is defined as the normalized dot product of two vectors x and y:
$$\text{cosine similarity}(x, y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}$$
Isotropy is assessed by computing the average cosine similarity across all vector pairs in the embedding space.
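The averaging step can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name `average_pairwise_cosine` and the synthetic data are assumptions made for the example. A value near 0 suggests directions are spread evenly, while a value near 1 suggests that all vectors point into a narrow cone.

```python
import numpy as np


def average_pairwise_cosine(embeddings: np.ndarray) -> float:
    """Mean cosine similarity over all distinct pairs of rows."""
    # Scale each row to unit length so that dot products are cosine similarities.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / norms
    sims = unit @ unit.T
    # Strict upper triangle: every unordered pair once, no self-pairs.
    i, j = np.triu_indices(len(unit), k=1)
    return float(sims[i, j].mean())


# Synthetic stand-ins for embeddings (hypothetical data, for illustration only).
rng = np.random.default_rng(0)
isotropic = rng.standard_normal((500, 64))
shifted = isotropic + 5.0  # a shared offset pushes all vectors into one cone

iso_avg = average_pairwise_cosine(isotropic)
aniso_avg = average_pairwise_cosine(shifted)
```

Because this metric depends on the mean of the data, a shared offset alone is enough to drive it toward 1, which is one reason a second, mean-independent metric is useful alongside it.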
The study adopts the IsoScore method proposed by Rudman et al., which has the following properties:
- Mean Independence: Unaffected by data mean
- Global Stability: Stable across data subsets
- Rotation Invariance: Unaffected by coordinate system rotation
IsoScore is computed from the covariance matrix of the embeddings after they are reoriented along their principal components. The effective dimensionality is then calculated as:
$$\text{effective dim}(X) = i(X) \times (n - 1) + 1$$
where i(X) is the IsoScore and n is the original dimensionality.
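For readers who want to see the computation end to end, the following is a minimal NumPy sketch that follows the steps of the IsoScore definition in Rudman et al. (2022) as we understand them: principal-component variances, rescaling, isotropy defect, fraction of dimensions used, then the score. It is not the reference implementation, the function names are chosen for this example, and it assumes at least two dimensions and more samples than dimensions.

```python
import numpy as np


def isoscore(points: np.ndarray) -> float:
    """IsoScore of an (n_samples, n_dims) point cloud; assumes n_dims >= 2."""
    n = points.shape[1]
    # Variances along the principal components are the eigenvalues
    # of the covariance matrix.
    variances = np.linalg.eigvalsh(np.cov(points, rowvar=False))
    variances = np.clip(variances, 0.0, None)  # guard against round-off
    # Rescale the variance vector to length sqrt(n), the length of the
    # all-ones vector that a perfectly isotropic cloud would produce.
    normalized = np.sqrt(n) * variances / np.linalg.norm(variances)
    # Isotropy defect: normalized distance from the all-ones vector.
    defect = np.linalg.norm(normalized - 1.0) / np.sqrt(2.0 * (n - np.sqrt(n)))
    # Fraction of dimensions that are uniformly utilized.
    fraction = (n - defect**2 * (n - np.sqrt(n))) ** 2 / n**2
    return float((n * fraction - 1.0) / (n - 1.0))


def effective_dim(score: float, n: int) -> float:
    """Effective dimensionality: i(X) * (n - 1) + 1."""
    return score * (n - 1) + 1


# Synthetic checks (hypothetical data, for illustration only).
rng = np.random.default_rng(0)
uniform_cloud = rng.standard_normal((20000, 8))  # variance in every direction
line_cloud = np.zeros((1000, 8))
line_cloud[:, 0] = rng.standard_normal(1000)     # variance in one direction only

score_uniform = isoscore(uniform_cloud)
score_line = isoscore(line_cloud)
```

A cloud that varies along a single axis scores 0, which the formula maps to an effective dimensionality of 1; a cloud with equal variance in every direction scores close to 1, which maps to n.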
- ProtBERT/ProtBERT-BFD: BERT-based architecture, 1024-dimensional embeddings
- ProtXLNet: XLNet-based architecture, 1024-dimensional embeddings
- ProteinBERT: Specially designed multimodal architecture, 512-dimensional embeddings
- Global Embeddings: Generated through average pooling of local embeddings (ProtBERT series) or direct generation (ProteinBERT)
- Local Embeddings: Per-residue representations corresponding to each amino acid residue
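Average pooling of local embeddings into a global embedding can be sketched as below. This is an illustration under assumptions: the array layout `(batch, length, dim)`, the function name `mean_pool`, and the use of a padding mask are choices made for the example, and the summary does not say whether padding or special tokens were excluded in the study.

```python
import numpy as np


def mean_pool(residue_embeddings: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Average per-residue vectors over real residues only.

    residue_embeddings: (batch, length, dim) local embeddings.
    mask: (batch, length), 1 for real residues and 0 for padding.
    """
    weights = mask[..., None].astype(residue_embeddings.dtype)
    summed = (residue_embeddings * weights).sum(axis=1)
    counts = weights.sum(axis=1)  # shape (batch, 1)
    # Clip so that an all-padding row does not divide by zero.
    return summed / np.clip(counts, 1.0, None)


# Toy input: one sequence of two residues plus one padding position.
local = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])
pooled = mean_pool(local, mask)
```

Masking matters because proteins vary widely in length: without it, padding vectors would be averaged into the global embedding of every short sequence in a batch.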
Traditional alignment similarity is computed using Biopython with the PAM-250 substitution matrix, and is compared against embedding-based distances:
- Alignment Score: Sequence alignment score based on substitution matrices
- Similarity Score: Proportion of identical residues in optimal alignment
- Embedding Distance: Squared Euclidean distance and cosine similarity
- SwissProt Subset: From UniProt database, approximately 570,000 protein sequences
- Data Characteristics: Manually curated with experimentally validated annotations and high-quality functional and structural information
- Sampling Strategy: For correlation analysis, randomly sample 1% of proteins, yielding 6.4×10^6 protein pairs
- IsoScore: Isotropy metric, range [0, 1], where 0 indicates high anisotropy and 1 indicates perfect isotropy
- Effective Dimensionality: Number of actually utilized dimensions calculated from IsoScore
- Correlation Coefficient: Pearson correlation coefficient measuring linear relationships between different distance metrics
- Hugging Face pretrained weights for ProtBERT series
- ProteinBERT weights from official GitHub repository
- Standard average pooling strategy for generating global representations
| Model | Embedding Dimension | IsoScore | Effective Dimensions Used |
|---|---|---|---|
| ProtBERT | 1024 | 0.001658 | 3 |
| ProtBERT-BFD | 1024 | 0.003968 | 6 |
| ProtXLNet | 1024 | 0.001502 | 3 |
| ProteinBERT | 512 | 0.231228 | 120 |
Key Findings:
- Traditional architecture models (ProtBERT, ProtXLNet) exhibit high anisotropy, utilizing only 2-6 effective dimensions
- ProteinBERT is significantly more isotropic (IsoScore=0.23), utilizing 120 effective dimensions
- In comparison, natural language BERT and GPT have IsoScores of 0.11 and 0.18 respectively
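The effective dimensions in the table can be reproduced from the reported IsoScores with the formula effective dim(X) = i(X) × (n − 1) + 1. The short check below assumes the result is rounded up to the next integer; that rounding rule is inferred from the table values and is not stated in this summary.

```python
import math

# (embedding dimension, IsoScore, effective dimensions reported in the table)
table = {
    "ProtBERT": (1024, 0.001658, 3),
    "ProtBERT-BFD": (1024, 0.003968, 6),
    "ProtXLNet": (1024, 0.001502, 3),
    "ProteinBERT": (512, 0.231228, 120),
}


def effective_dimensions(iso_score: float, n: int) -> float:
    """Effective dimensionality: i(X) * (n - 1) + 1."""
    return iso_score * (n - 1) + 1


# Assumption: fractional values are rounded up (ceiling).
computed = {
    name: math.ceil(effective_dimensions(score, n))
    for name, (n, score, _reported) in table.items()
}

for name, (n, score, reported) in table.items():
    raw = effective_dimensions(score, n)
    print(f"{name}: raw={raw:.3f}, rounded up={computed[name]}, reported={reported}")
```

For example, ProtBERT gives 0.001658 × 1023 + 1 ≈ 2.70 and ProteinBERT gives 0.231228 × 511 + 1 ≈ 119.16 before rounding.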
ProtBERT Correlation Matrix:
| Metric | Cosine Similarity | Squared Euclidean Distance | Alignment Score | Similarity Score |
|---|---|---|---|---|
| Cosine Similarity | 1.000 | 0.791 | 0.014 | -0.011 |
| Squared Euclidean Distance | - | 1.000 | -0.103 | -0.146 |
| Alignment Score | - | - | 1.000 | 0.847 |
| Similarity Score | - | - | - | 1.000 |
Important Observations:
- Strong correlation between embedding metrics (0.791)
- Strong correlation between traditional biological metrics (0.847)
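- Weak cross-domain correlation between embedding metrics and alignment-based metrics (absolute values of at most 0.146), with some coefficients negative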
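A correlation matrix of this kind can be assembled as in the sketch below. This is a hedged illustration rather than the study's pipeline: the embeddings are random stand-ins, `pair_metrics` is a name chosen for the example, and the `alignment_score` array is a random placeholder where real per-pair alignment scores would go.

```python
import numpy as np


def pair_metrics(embeddings: np.ndarray):
    """Cosine similarity and squared Euclidean distance for every unordered pair."""
    i, j = np.triu_indices(len(embeddings), k=1)
    a, b = embeddings[i], embeddings[j]
    cosine = (a * b).sum(axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    )
    sq_euclid = ((a - b) ** 2).sum(axis=1)
    return cosine, sq_euclid


rng = np.random.default_rng(0)
embeddings = rng.standard_normal((50, 16))  # stand-in for global embeddings
cosine, sq_euclid = pair_metrics(embeddings)

# Placeholder only: real values would come from pairwise sequence alignment.
alignment_score = rng.standard_normal(cosine.shape)

# Rows are variables; entry (r, c) is the Pearson correlation of metric r with metric c.
corr = np.corrcoef(np.vstack([cosine, sq_euclid, alignment_score]))
```

Each metric is computed over the same ordered list of protein pairs, so the rows passed to `np.corrcoef` line up pair by pair.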
For the 1024-dimensional local embeddings, each amino acid uses only about 14 effective dimensions on average, an anisotropy pattern similar to that of the global embeddings.
Through scatter plot analysis:
- Low Similarity Region: Large variance in embedding distances, poor predictive power
- High Similarity Region: Embedding distances converge, Euclidean distance tends toward low values, cosine similarity approaches 1.0
- This asymmetric behavior indicates embeddings are more reliable at high biological similarity but unreliable at low similarity
- Ethayarajh (2019) first discovered high anisotropy in models like BERT
- Rogers et al. suggested increasing isotropy to improve BERT performance
- Rajaee & Pilehvar (2021) found that post-processing to increase isotropy may damage performance
- Rudman et al. proposed IsoScore method to address deficiencies in existing metrics
- ProtTrans series (Elnaggar et al.): Direct application of NLP architectures to proteins
- ProteinBERT (Brandes et al.): Specially designed multimodal architecture
- Existing research primarily focuses on downstream task performance, lacking analysis of representation space geometric properties
- High Anisotropy: Sequence-only (unimodal) protein language models exhibit extreme anisotropy with substantial dimensional redundancy
- Multimodal Advantages: Multimodal training integrating sequence and gene ontology information significantly improves isotropy
- Biological Relevance Limitations: Weak correlation between embedding distances and traditional biological similarity metrics, particularly in low-similarity regions
- Universal Dimensional Redundancy: Severe dimensional redundancy exists in both global and local representations
- Dataset Constraints: Only SwissProt dataset used, may not fully represent protein diversity
- Limited Model Scope: Limited number of evaluated models, not covering latest large-scale protein language models
- Missing Biological Validation: Lack of direct association analysis with protein structure and function
- Absent Dynamic Analysis: No analysis of isotropy changes during training
- Geometric Optimization Training: Develop training methods that explicitly optimize geometric richness and isotropy
- Biologically Supervised Learning: Contrastive pretraining based on biological priors
- Isotropy Regularization: Incorporate isotropy-promoting regularization during training
- Functionally Constrained Embeddings: Functional embedding constraints based on ontology or structural data
- Pioneering Research: First systematic analysis of geometric properties in protein language models, filling an important research gap
- Scientific Methodology: Multiple complementary isotropy metrics employed, reliable results
- High Practical Value: Provides theoretical basis for model compression and dimensionality reduction
- Multimodal Insights: Demonstrates importance of multimodal training in improving representation quality
- Comprehensive Analysis: Holistic analysis from global to local, from isotropy to biological relevance
- Missing Mechanistic Explanation: Lacks deep explanation of why multimodal training improves isotropy
- Absent Downstream Task Validation: Lacks verification of isotropy improvement's impact on specific biological task performance
- Limited Model Coverage: Does not include more recent protein language models
- Missing Optimization Solutions: While problems are identified, specific improvement proposals are lacking
- Theoretical Contribution: Provides important foundation for theoretical understanding of protein language models
- Methodological Value: Establishes standard methods for analyzing protein embedding spaces
- Engineering Guidance: Provides clear direction for model design and optimization
- Cross-domain Significance: Methods generalizable to other biological sequence analysis domains
- Model Design: Guides new protein language model architecture design
- Model Compression: Provides theoretical basis for compression and acceleration of large-scale protein models
- Generative Models: Provides better representation learning foundation for protein design and engineering
- Multimodal Fusion: Guides design of protein multimodal models
- Ethayarajh, K. (2019). How Contextual are Contextualized Word Representations?
- Rudman, W. et al. (2022). IsoScore: Measuring the uniformity of embedding space utilization
- Elnaggar, A. et al. (2022). ProtTrans: Toward Understanding the Language of Life
- Brandes, N. et al. (2022). ProteinBERT: a universal deep-learning model of protein sequence and function