Isotropy and Geometry of Pretrained Protein LMs

Hakim, Roy, Rahman
Large pretrained language models have transformed natural language processing, and their adaptation to protein sequences -- viewed as strings of amino acid characters -- has advanced protein analysis. However, the distinct properties of proteins, such as variable sequence lengths and lack of word-sentence analogs, necessitate a deeper understanding of protein language models (LMs). We investigate the isotropy of protein LM embedding spaces using average pairwise cosine similarity and the IsoScore method, revealing that models like ProtBERT and ProtXLNet are highly anisotropic, utilizing only 2--14 dimensions for global and local representations. In contrast, multi-modal training in ProteinBERT, which integrates sequence and gene ontology data, enhances isotropy, suggesting that diverse biological inputs improve representational efficiency. We also find that embedding distances weakly correlate with alignment-based similarity scores, particularly at low similarity.

Isotropy and Geometry of Pretrained Protein Language Models

Basic Information

  • Paper ID: 2510.10655
  • Title: A Look at the Isotropy of Pretrained Protein Language Models
  • Authors: Sheikh Azizul Hakim, Kowshic Roy, M Saifur Rahman
  • Classification: q-bio.OT (Quantitative Biology - Other)
  • Conference: ICML 2025 Workshop on Multi-modal Foundation Models and Large Language Models for Life Sciences
  • Paper Link: https://arxiv.org/abs/2510.10655

Abstract

Large-scale pretrained language models have transformed natural language processing, and their adaptation to protein sequences (treating proteins as strings of amino acids) has advanced protein analysis. However, the unique properties of proteins, such as variable sequence lengths and the absence of word-sentence analogs, call for a deeper understanding of protein language models (LMs). This study investigates the isotropy of protein LM embedding spaces using average pairwise cosine similarity and the IsoScore method, finding that models such as ProtBERT and ProtXLNet are highly anisotropic, with global and local representations utilizing only 2-14 dimensions. In contrast, ProteinBERT's multimodal training, which integrates sequence and gene ontology data, enhances isotropy, suggesting that diverse biological inputs improve representational efficiency. The study also finds that embedding distances correlate only weakly with alignment-based similarity scores, particularly at low similarity.

Research Background and Motivation

Problem Definition

This research addresses the limited understanding of the geometric properties of protein language model embedding spaces. It identifies three gaps:

  1. Missing Isotropy Analysis: While extensive research on embedding space isotropy exists in NLP, such analysis is nearly absent in the protein domain
  2. Embedding Space Efficiency: Understanding whether high-dimensional protein embeddings effectively utilize all dimensions
  3. Biological Relevance Verification: The relationship between distance metrics in embedding space and traditional biological similarity measures remains unclear

Significance

  1. Theoretical Value: Provides theoretical foundations for understanding representation learning mechanisms in protein language models and guides model improvements
  2. Practical Value: Isotropy analysis can guide dimensionality reduction and model compression, improving computational efficiency
  3. Generative Model Applications: Diverse and information-rich latent spaces are crucial for generative tasks such as protein design and variant prediction

Limitations of Existing Approaches

  1. Direct Transfer Issues: Most protein language models directly adopt NLP architectures without fully considering the unique properties of protein sequences
  2. Unimodal Constraints: Most models are trained solely on sequence information, lacking functional and structural biological priors
  3. Neglect of Geometric Properties: Lack of systematic analysis of embedding space geometric structures

Core Contributions

  1. First Systematic Analysis: Provides the first comprehensive analysis of isotropy in protein language model embedding spaces
  2. Multi-dimensional Evaluation Methods: Employs two complementary isotropy metrics: average pairwise cosine similarity and IsoScore
  3. Verification of Multimodal Training Advantages: Demonstrates the effectiveness of multimodal training (sequence + gene ontology) in improving representation isotropy
  4. Biological Relevance Analysis: Examines the relationship between embedding distances and traditional alignment-based similarity, revealing limitations of existing methods
  5. Local Representation Analysis: Extends analysis to amino acid-level local embeddings, discovering similar anisotropy patterns

Methodology Details

Task Definition

The core task is to analyze the geometric properties of protein language model embedding spaces:

  • Input: Protein sequence datasets and pretrained protein language models
  • Output: Isotropy metrics (IsoScore, average pairwise cosine similarity), effective dimensionality, correlation analysis between embedding distances and biological similarity
  • Constraints: Using standard protein datasets and published pretrained models to ensure reproducibility

Isotropy Measurement Methods

1. Average Pairwise Cosine Similarity

Cosine similarity is defined as the normalized dot product of two vectors x and y: $\text{cosine similarity}(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|}$

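Isotropy is assessed by computing the average cosine similarity across all vector pairs in the embedding space. An average near 0 indicates that directions are spread roughly uniformly (isotropy), while an average near 1 indicates that the vectors occupy a narrow cone (anisotropy).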
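
The averaging step can be sketched with NumPy as follows. This is a minimal illustration, not the study's code: the function name `average_pairwise_cosine` is hypothetical, and the sketch assumes no embedding is the zero vector.

```python
import numpy as np

def average_pairwise_cosine(embeddings) -> float:
    """Mean cosine similarity over all unordered pairs of distinct rows."""
    X = np.asarray(embeddings, dtype=float)
    n = X.shape[0]
    if n < 2:
        raise ValueError("need at least two vectors")
    # Normalize each row to unit length so that dot products are cosines.
    # Assumption: no row is the zero vector.
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = unit @ unit.T
    # Drop the diagonal (self-similarity); each unordered pair appears twice
    # in the remaining sum, which the n * (n - 1) denominator accounts for.
    off_diag_sum = sims.sum() - np.trace(sims)
    return float(off_diag_sum / (n * (n - 1)))
```

The full similarity matrix grows quadratically with the number of embeddings, so for a large collection one would probably compute this on a random sample of vectors.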

2. IsoScore Method

The study adopts the IsoScore method proposed by Rudman et al., which has the following properties:

  • Mean Independence: Unaffected by data mean
  • Global Stability: Stable across data subsets
  • Rotation Invariance: Unaffected by coordinate system rotation

IsoScore is computed from the covariance matrix of principal components, with effective dimensionality calculated as: $\text{effective dim}(X) = i(X) \times (n - 1) + 1$

where i(X) is the IsoScore and n is the original dimensionality.
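
A minimal NumPy sketch of how IsoScore and the effective dimensionality could be computed. The steps follow the published IsoScore description as understood here (variances along the principal components, rescaling, distance from the all-ones vector). The function names `isoscore` and `effective_dim` are illustrative, and this is neither the study's code nor the reference implementation.

```python
import numpy as np

def isoscore(points) -> float:
    """IsoScore of a point cloud with shape (num_points, n), assuming n >= 2."""
    X = np.asarray(points, dtype=float)
    n = X.shape[1]
    # The variances along the principal components are the eigenvalues of the
    # covariance matrix, so no explicit PCA projection is needed.
    variances = np.linalg.eigvalsh(np.cov(X, rowvar=False))
    variances = np.clip(variances, 0.0, None)  # guard against tiny negative round-off
    # Rescale the variance vector to the norm of the all-ones vector.
    # Assumption: the data are not all identical (total variance > 0).
    normalized = np.sqrt(n) * variances / np.linalg.norm(variances)
    # Isotropy defect: distance from the all-ones vector, scaled into [0, 1].
    defect = np.linalg.norm(normalized - 1.0) / np.sqrt(2.0 * (n - np.sqrt(n)))
    # Fraction of dimensions uniformly utilized, then rescaled to start at 0.
    k = (n - defect**2 * (n - np.sqrt(n))) ** 2 / n**2
    return float((n * k - 1.0) / (n - 1.0))

def effective_dim(score: float, n: int) -> float:
    """Effective dimensionality: i(X) * (n - 1) + 1."""
    return score * (n - 1) + 1.0
```

Under these definitions, a cloud with equal variance in every direction scores 1 and a cloud lying on a single line scores 0, which matches the stated range of the metric.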

Model Architecture Analysis

Evaluated Models

  1. ProtBERT/ProtBERT-BFD: BERT-based architecture, 1024-dimensional embeddings
  2. ProtXLNet: XLNet-based architecture, 1024-dimensional embeddings
  3. ProteinBERT: Specially designed multimodal architecture, 512-dimensional embeddings

Embedding Generation Strategies

  • Global Embeddings: Generated through average pooling of local embeddings (ProtBERT series) or direct generation (ProteinBERT)
  • Local Embeddings: Per-residue representations corresponding to each amino acid residue
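
Average pooling of local embeddings into a global embedding can be sketched as below. The function name `mean_pool` and the `residue_mask` argument are hypothetical; whether padding and special tokens were excluded from the average in the study is an assumption of this sketch.

```python
import numpy as np

def mean_pool(local_embeddings, residue_mask) -> np.ndarray:
    """Average per-residue vectors into a single global vector.

    local_embeddings: (sequence_length, hidden_dim) per-residue embeddings.
    residue_mask: (sequence_length,) with 1 for positions to include and 0 for
        positions to skip (assumed here to be padding or special tokens).
    """
    E = np.asarray(local_embeddings, dtype=float)
    mask = np.asarray(residue_mask, dtype=float)[:, None]
    count = mask.sum()
    if count == 0:
        raise ValueError("mask excludes every position")
    # Zero out excluded positions, then divide by the number of kept positions.
    return (E * mask).sum(axis=0) / count
```

For example, `mean_pool([[1, 2], [3, 4], [100, 100]], [1, 1, 0])` averages only the first two rows and ignores the masked third row.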

Biological Similarity Analysis

Traditional alignment similarity is computed using Biopython with the PAM-250 substitution matrix:

  • Alignment Score: Sequence alignment score based on substitution matrices
  • Similarity Score: Proportion of identical residues in optimal alignment
  • Embedding Distance: Squared Euclidean distance and cosine similarity
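
The per-pair quantities above can be sketched as follows, assuming the optimal alignment has already been produced (for instance with Biopython's `PairwiseAligner`, not shown here) and is available as two gapped strings. All function names are illustrative, and using the alignment length as the denominator of the similarity score is an assumption rather than something the source specifies.

```python
import numpy as np

def identity_fraction(aligned_a: str, aligned_b: str) -> float:
    """Share of alignment columns that hold the same residue in both rows."""
    if len(aligned_a) != len(aligned_b) or not aligned_a:
        raise ValueError("aligned sequences must be non-empty and of equal length")
    # A column counts as a match only if both rows carry the same residue
    # (a gap character "-" never counts).
    matches = sum(a == b and a != "-" for a, b in zip(aligned_a, aligned_b))
    return matches / len(aligned_a)  # assumption: normalize by alignment length

def squared_euclidean(u, v) -> float:
    """Squared Euclidean distance between two embedding vectors."""
    diff = np.asarray(u, dtype=float) - np.asarray(v, dtype=float)
    return float(diff @ diff)

def cosine_similarity(u, v) -> float:
    """Cosine similarity between two non-zero embedding vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```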

Experimental Setup

Datasets

  • SwissProt Subset: From UniProt database, approximately 570,000 protein sequences
  • Data Characteristics: Manually curated with experimentally validated annotations and high-quality functional and structural information
  • Sampling Strategy: For correlation analysis, randomly sample 1% of proteins, yielding 6.4×10^6 protein pairs

Evaluation Metrics

  1. IsoScore: Isotropy metric, range [0, 1], where 0 indicates high anisotropy and 1 indicates perfect isotropy
  2. Effective Dimensionality: Number of actually utilized dimensions calculated from IsoScore
  3. Correlation Coefficient: Pearson correlation coefficient measuring linear relationships between different distance metrics

Implementation Details

  • Hugging Face pretrained weights for ProtBERT series
  • ProteinBERT weights from official GitHub repository
  • Standard average pooling strategy for generating global representations

Experimental Results

Main Results

Global Embedding Isotropy Analysis

| Model | Embedding Dimension | IsoScore | Effective Dimensions Used |
| --- | --- | --- | --- |
| ProtBERT | 1024 | 0.001658 | 3 |
| ProtBERT-BFD | 1024 | 0.003968 | 6 |
| ProtXLNet | 1024 | 0.001502 | 3 |
| ProteinBERT | 512 | 0.231228 | 120 |

Key Findings:

  • Traditional architecture models (ProtBERT, ProtXLNet) exhibit high anisotropy, utilizing only 2-6 effective dimensions
  • ProteinBERT is significantly more isotropic (IsoScore=0.23), utilizing 120 effective dimensions
  • In comparison, natural language BERT and GPT have IsoScores of 0.11 and 0.18 respectively
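
As a worked check, the reported IsoScores can be substituted into the effective dimensionality formula i(X) × (n - 1) + 1. The arithmetic below uses only the values reported above:

```latex
\begin{align*}
\text{ProtBERT:}     &\quad 0.001658 \times (1024 - 1) + 1 \approx 2.70 \\
\text{ProtBERT-BFD:} &\quad 0.003968 \times (1024 - 1) + 1 \approx 5.06 \\
\text{ProtXLNet:}    &\quad 0.001502 \times (1024 - 1) + 1 \approx 2.54 \\
\text{ProteinBERT:}  &\quad 0.231228 \times (512 - 1) + 1 \approx 119.16
\end{align*}
```

The reported integer counts (3, 6, 3, and 120) are consistent with rounding these raw values up. That rounding convention is inferred here and is not stated in the source.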

Correlation Between Embedding Distance and Biological Similarity

ProtBERT Correlation Matrix:

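| Metric | Cosine Similarity | Squared Euclidean Distance | Alignment Score | Similarity Score |
| --- | --- | --- | --- | --- |
| Cosine Similarity | 1.000 | 0.791 | 0.014 | -0.011 |
| Squared Euclidean Distance | - | 1.000 | -0.103 | -0.146 |
| Alignment Score | - | - | 1.000 | 0.847 |
| Similarity Score | - | - | - | 1.000 |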

Important Observations:

  • Strong correlation between embedding metrics (0.791)
  • Strong correlation between traditional biological metrics (0.847)
  • Weak correlation between embedding metrics and alignment-based metrics (magnitudes of at most 0.146), with some values negative
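
A correlation matrix of this kind could be assembled as sketched below, given one value of each metric per protein pair. The `METRICS` ordering and the `pearson_matrix` helper are illustrative names, not taken from the study.

```python
import numpy as np

# Row and column order of the resulting matrix (illustrative names).
METRICS = ("cosine_similarity", "squared_euclidean",
           "alignment_score", "similarity_score")

def pearson_matrix(per_pair_metrics: dict) -> np.ndarray:
    """Pearson correlations between metrics, each measured on the same pairs.

    per_pair_metrics maps each name in METRICS to a sequence holding one
    value per protein pair, with all sequences in the same pair order.
    """
    stacked = np.vstack([np.asarray(per_pair_metrics[name], dtype=float)
                         for name in METRICS])
    # np.corrcoef treats each row as one variable by default.
    return np.corrcoef(stacked)
```

Pearson correlation only captures linear association, so a weak coefficient does not by itself rule out a nonlinear relationship between embedding distance and alignment-based similarity.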

Local Embedding Isotropy

For 1024-dimensional local embeddings, each amino acid utilizes only about 14 effective dimensions on average, a pattern of anisotropy similar to that of the global embeddings.

Nonlinear Relationship Findings

Through scatter plot analysis:

  • Low Similarity Region: Large variance in embedding distances, poor predictive power
  • High Similarity Region: Embedding distances converge, Euclidean distance tends toward low values, cosine similarity approaches 1.0
  • This asymmetric behavior indicates embeddings are more reliable at high biological similarity but unreliable at low similarity
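
One way to quantify this asymmetry without a scatter plot is to bin protein pairs by similarity score and compare the spread of embedding distances per bin. The sketch below is an illustrative alternative, not the analysis used in the study; it assumes similarity scores lie in [0, 1], and the function name `distance_spread_by_similarity` is hypothetical.

```python
import numpy as np

def distance_spread_by_similarity(similarity, distance, num_bins: int = 5):
    """Standard deviation of embedding distance within equal-width similarity bins.

    Returns (edges, spread); spread[b] is NaN for bins with fewer than two pairs.
    """
    s = np.asarray(similarity, dtype=float)
    d = np.asarray(distance, dtype=float)
    edges = np.linspace(0.0, 1.0, num_bins + 1)
    # Bin index per pair; clip so that similarity == 1.0 lands in the last bin.
    idx = np.clip(np.digitize(s, edges) - 1, 0, num_bins - 1)
    spread = np.full(num_bins, np.nan)
    for b in range(num_bins):
        in_bin = d[idx == b]
        if in_bin.size >= 2:
            spread[b] = in_bin.std()
    return edges, spread
```

A larger spread in the low-similarity bins than in the high-similarity bins would correspond to the pattern described above.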

Related Work

Isotropy Research in Natural Language Processing

  • Ethayarajh (2019) first discovered high anisotropy in models like BERT
  • Rogers et al. suggested increasing isotropy to improve BERT performance
  • Rajaee & Pilehvar (2021) found that post-processing to increase isotropy may damage performance
  • Rudman et al. proposed IsoScore method to address deficiencies in existing metrics

Protein Language Model Development

  • ProtTrans series (Elnaggar et al.): Direct application of NLP architectures to proteins
  • ProteinBERT (Brandes et al.): Specially designed multimodal architecture
  • Existing research primarily focuses on downstream task performance, lacking analysis of representation space geometric properties

Conclusions and Discussion

Main Conclusions

  1. High Anisotropy: Sequence-only (unimodal) protein language models exhibit extreme anisotropy with substantial dimensional redundancy
  2. Multimodal Advantages: Multimodal training integrating sequence and gene ontology information significantly improves isotropy
  3. Biological Relevance Limitations: Weak correlation between embedding distances and traditional biological similarity metrics, particularly in low-similarity regions
  4. Universal Dimensional Redundancy: Severe dimensional redundancy exists in both global and local representations

Limitations

  1. Dataset Constraints: Only SwissProt dataset used, may not fully represent protein diversity
  2. Limited Model Scope: Limited number of evaluated models, not covering latest large-scale protein language models
  3. Missing Biological Validation: Lack of direct association analysis with protein structure and function
  4. Absent Dynamic Analysis: No analysis of isotropy changes during training

Future Directions

  1. Geometric Optimization Training: Develop training methods that explicitly optimize geometric richness and isotropy
  2. Biologically Supervised Learning: Contrastive pretraining based on biological priors
  3. Isotropy Regularization: Incorporate isotropy-promoting regularization during training
  4. Functionally Constrained Embeddings: Functional embedding constraints based on ontology or structural data

In-Depth Evaluation

Strengths

  1. Pioneering Research: First systematic analysis of geometric properties in protein language models, filling an important research gap
  2. Sound Methodology: Employs two complementary isotropy metrics, which strengthens confidence in the results
  3. High Practical Value: Provides theoretical basis for model compression and dimensionality reduction
  4. Multimodal Insights: Demonstrates importance of multimodal training in improving representation quality
  5. Comprehensive Analysis: Holistic analysis from global to local, from isotropy to biological relevance

Weaknesses

  1. Missing Mechanistic Explanation: Lacks deep explanation of why multimodal training improves isotropy
  2. Absent Downstream Task Validation: Lacks verification of isotropy improvement's impact on specific biological task performance
  3. Limited Model Coverage: Does not include more recent protein language models
  4. Missing Optimization Solutions: While problems are identified, specific improvement proposals are lacking

Impact

  1. Theoretical Contribution: Provides important foundation for theoretical understanding of protein language models
  2. Methodological Value: Establishes standard methods for analyzing protein embedding spaces
  3. Engineering Guidance: Provides clear direction for model design and optimization
  4. Cross-domain Significance: Methods generalizable to other biological sequence analysis domains

Applicable Scenarios

  1. Model Design: Guides new protein language model architecture design
  2. Model Compression: Provides theoretical basis for compression and acceleration of large-scale protein models
  3. Generative Models: Provides better representation learning foundation for protein design and engineering
  4. Multimodal Fusion: Guides design of protein multimodal models

References

  1. Ethayarajh, K. (2019). How Contextual are Contextualized Word Representations?
  2. Rudman, W. et al. (2022). IsoScore: Measuring the uniformity of embedding space utilization
  3. Elnaggar, A. et al. (2022). ProtTrans: Toward Understanding the Language of Life
  4. Brandes, N. et al. (2022). ProteinBERT: a universal deep-learning model of protein sequence and function
