Position Information Emerges in Causal Transformers Without Positional Encodings via Similarity of Nearby Embeddings

Zuo, Guerzhoy, Guerzhoy
Transformers with causal attention can solve tasks that require positional information without using positional encodings. In this work, we propose and investigate a new hypothesis about how positional information can be stored without using explicit positional encoding. We observe that nearby embeddings are more similar to each other than faraway embeddings, allowing the transformer to potentially reconstruct the positions of tokens. We show that this pattern can occur in both the trained and the randomly initialized Transformer models with causal attention and no positional encodings over a common range of hyperparameters.

Basic Information

  • Paper ID: 2501.00073
  • Title: Position Information Emerges in Causal Transformers Without Positional Encodings via Similarity of Nearby Embeddings
  • Authors: Chunsheng Zuo (Johns Hopkins University), Pavel Guerzhoy (University of Hawai'i at Mānoa), Michael Guerzhoy (University of Toronto)
  • Categories: cs.CL (Computational Linguistics), cs.LG (Machine Learning)
  • Publication Date: December 30, 2024
  • Paper Link: https://arxiv.org/abs/2501.00073

Abstract

This study investigates how Transformers with causal attention can solve position-dependent tasks without explicit positional encodings. The authors propose and validate a novel hypothesis: position information can be stored through the similarity between adjacent embedding vectors. The research demonstrates that adjacent embeddings exhibit higher similarity than distant embeddings, enabling Transformers to reconstruct token positions. This pattern is observed both in trained and randomly initialized causal Transformer models.

Research Background and Motivation

Problem Definition

Conventional wisdom suggests that Transformers require explicit positional encodings to process token positions in sequences. However, recent research (Haviv et al. 2022; Kazemnejad et al. 2024; Chi et al. 2023) indicates that decoder-only Transformers using only causal attention can learn position information without positional encodings.

Research Motivation

  1. Theoretical Gap: Existing research lacks deep understanding of how causal Transformers store position information
  2. Mechanism Exploration: Chi et al. (2023) proposed that position information is stored in embedding variance, but this explanation may be insufficient
  3. Need for New Perspectives: A new angle is needed to understand the representation mechanism of position information

Limitations of Existing Approaches

  • Non-causal attention without positional encodings is permutation-equivariant: permuting the input tokens only permutes the outputs, so it cannot represent token order
  • Chi et al.'s variance theory performs poorly in certain experiments and cannot fully explain observed phenomena
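
The first limitation can be checked numerically. The following is a minimal NumPy sketch, not the authors' code: the single-head `attention` function, the random weights, and the toy sizes are all assumptions made for illustration. Without a causal mask, permuting the input rows simply permutes the output rows; with a causal mask that property no longer holds, which is what gives the model a handle on token order.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(X, Wq, Wk, Wv, causal):
    """Single-head self-attention without positional encodings (toy version)."""
    scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(Wq.shape[1])
    if causal:
        # Mask out future positions (strict upper triangle)
        mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(mask, -np.inf, scores)
    return softmax(scores) @ (X @ Wv)

rng = np.random.default_rng(0)
n, d = 5, 8                                  # hypothetical toy sizes
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
perm = np.array([4, 3, 2, 1, 0])             # reverse the token order

# Equivariance check: f(X[perm]) == f(X)[perm] ?
full_equivariant = np.allclose(
    attention(X[perm], Wq, Wk, Wv, causal=False),
    attention(X, Wq, Wk, Wv, causal=False)[perm],
)
causal_equivariant = np.allclose(
    attention(X[perm], Wq, Wk, Wv, causal=True),
    attention(X, Wq, Wk, Wv, causal=True)[perm],
)
print(full_equivariant, causal_equivariant)
```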

Core Contributions

  1. Adjacency Pattern Hypothesis: Discovers that embeddings at adjacent positions exhibit higher cosine similarity, forming an "adjacency pattern"
  2. Theoretical Analysis: Provides a mathematical explanation for why the adjacency pattern emerges in the first layer of causal attention
  3. Comprehensive Experimental Validation: Verifies the existence of adjacency patterns across multiple tasks, model configurations, and initialization schemes
  4. Quantitative Assessment Method: Proposes the adjacency probability score to quantify the strength of position information
  5. Comparative Analysis: Demonstrates through probing experiments that cosine similarity more effectively encodes position information than embedding variance

Methodology Details

Task Definition

The study focuses on how causal Transformers represent and utilize position information without explicit positional encodings, with emphasis on similarity patterns between embedding vectors.

Core Concepts

Self-Cosine Similarity Matrix

For a token embedding sequence X ∈ R^(n×d) of length n and dimension d, the self-cosine similarity matrix C is defined as:

C_ij = cos(X_i, X_j) = (X_i · X_j) / (||X_i|| ||X_j||), where X_i is the i-th row of X
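
The definition above can be sketched in a few lines of NumPy. The function name `self_cosine_similarity`, the `eps` guard against zero-length vectors, and the toy input are assumptions added for illustration.

```python
import numpy as np

def self_cosine_similarity(X: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Return the n x n matrix C with C[i, j] = cos(X[i], X[j])."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X_unit = X / np.maximum(norms, eps)      # guard against zero vectors
    return X_unit @ X_unit.T

# Toy input: n = 3 embeddings of dimension d = 2
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 2.0]])
C = self_cosine_similarity(X)
print(np.round(C, 3))
```

The result is symmetric with ones on the diagonal, so only the entries below the diagonal carry information for a causal model.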

Adjacency Pattern

The adjacency pattern refers to the characteristic of the self-cosine similarity matrix where values near the diagonal are higher and values far from the diagonal are lower, indicating that embeddings at adjacent positions are more similar.

Adjacency Probability Score

To quantify the strength of the adjacency pattern, the authors propose the adjacency probability score:

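For row k (1-indexed), the row-level adjacency probability score is the fraction of index pairs i < j ≤ k for which the entry closer to the diagonal is the larger one:

P_Adjacency(k) = P(C_ki < C_kj | i < j ≤ k) = (1 / binom(k, 2)) * Σ_{1 ≤ i < j ≤ k} I(C_ki < C_kj)

Here binom(k, 2) = k(k-1)/2 is the number of such pairs (a binomial coefficient, not an entry of the matrix C), and I(·) is the indicator function.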

The adjacency probability score for the entire matrix is the average across all rows.
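
A minimal sketch of the score, assuming pairs are taken from the causal part of each row (columns up to and including the diagonal) and that rows with fewer than two entries are skipped. The function names and the two synthetic test matrices are illustrative, not taken from the paper.

```python
import numpy as np
from itertools import combinations

def row_adjacency_score(C: np.ndarray, k: int) -> float:
    """Row k (0-indexed): fraction of pairs i < j <= k with C[k, i] < C[k, j]."""
    pairs = list(combinations(range(k + 1), 2))
    hits = sum(bool(C[k, i] < C[k, j]) for i, j in pairs)
    return hits / len(pairs)

def adjacency_score(C: np.ndarray) -> float:
    """Average the row-level score over all rows that contain at least one pair."""
    n = C.shape[0]
    return float(np.mean([row_adjacency_score(C, k) for k in range(1, n)]))

idx = np.arange(6)
dist = np.abs(idx[:, None] - idx[None, :])
C_near = 1.0 / (1.0 + dist)      # similarity decays with distance
C_far = dist / dist.max()        # similarity grows with distance

score_near = adjacency_score(C_near)   # 1.0: perfect adjacency pattern
score_far = adjacency_score(C_far)     # 0.0: pattern fully reversed
print(score_near, score_far)
```

A matrix with no positional structure would be expected to score around 0.5, which matches the value reported for the token embedding layer.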

Theoretical Analysis

Averaging Effect

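In the first layer, causal attention computes the output at each position as a weighted combination of the input embeddings at that position and all positions before it. With e_i denoting the input embedding at position i:

  • Embedding at position k+t: Σ(i=1 to k+t) α_i * e_i
  • Embedding at position k+t+1: Σ(i=1 to k+t+1) β_i * e_i

Adjacent positions share all but one of their input embeddings, whereas more distant positions share proportionally fewer. As a result, the dot product of an embedding with an adjacent embedding exceeds its dot product with a later, more distant one. Writing β'_i for the attention weights of such a more distant position:

(Σ α_i * e_i) · (Σ β_i * e_i) - (Σ α_i * e_i) · (Σ β'_i * e_i) > 0

This averaging effect accounts for the emergence of the adjacency pattern after the first causal attention layer.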
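
The averaging effect can be illustrated with a small simulation. This is a sketch under a simplifying assumption that is not stated in the source: attention weights are exactly uniform, so the output at position k is the running mean of the first k embeddings. The sizes `n` and `d` and the Gaussian embeddings are likewise hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 16, 384                                   # hypothetical toy sizes
E = rng.normal(0.0, 0.02, size=(n, d))           # random input embeddings e_1..e_n

# Assumption: uniform causal attention, so position k averages e_1..e_k
H = np.cumsum(E, axis=0) / np.arange(1, n + 1)[:, None]

def cosine_matrix(X: np.ndarray) -> np.ndarray:
    U = X / np.linalg.norm(X, axis=1, keepdims=True)
    return U @ U.T

C_in = cosine_matrix(E)     # before averaging: no positional structure
C_out = cosine_matrix(H)    # after averaging: similarity decays with distance

last = n - 1
near = C_out[last, last - 1]   # last position vs. its immediate neighbor
far = C_out[last, 0]           # last position vs. the first position
print(f"input neighbors: {C_in[last, last - 1]:.3f}")
print(f"after averaging: near={near:.3f}, far={far:.3f}")
```

For independent inputs of equal norm that are close to orthogonal, the cosine between the running means at positions k and m works out to approximately sqrt(min(k, m) / max(k, m)), which is why nearby positions end up more similar than distant ones in this toy setting.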

Experimental Setup

Datasets and Tasks

The authors designed four synthetic tasks requiring position information:

  1. Addition Task: Generate answers for "123+456=", with maximum input length of 9
  2. Reversal Task: Generate "4321" for "rev(1234)=", with maximum input length of 22
  3. Indexing Task: Output the (0-indexed) position of the first occurrence, e.g. "2" for "wherex(134504392,4)=", with maximum input length of 20
  4. Ordering Task: Given original and reordered sequences, output the new index order, with maximum input length of 18
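
To make the prompt formats concrete, the sketch below builds prompt/answer pairs for the reversal and indexing tasks, using only the string formats quoted in the list above. The helper names and the random sampler are hypothetical; the addition and ordering tasks are omitted because their exact formats are not given here.

```python
import random

def reversal_example(digits: str) -> tuple[str, str]:
    """Prompt/answer pair for the reversal task."""
    return f"rev({digits})=", digits[::-1]

def indexing_example(digits: str, query: str) -> tuple[str, str]:
    """Prompt/answer pair for the indexing task (0-indexed first occurrence)."""
    return f"wherex({digits},{query})=", str(digits.index(query))

def random_digits(rng: random.Random, length: int) -> str:
    return "".join(rng.choice("0123456789") for _ in range(length))

print(reversal_example("1234"))              # ('rev(1234)=', '4321')
print(indexing_example("134504392", "4"))    # ('wherex(134504392,4)=', '2')

# Sampling a fresh example; the query is drawn from the digits so it always occurs
rng = random.Random(0)
digits = random_digits(rng, 9)
prompt, answer = indexing_example(digits, rng.choice(digits))
```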

Model Configuration

  • Base Model: 6-layer NanoGPT with 10.6 million parameters
  • Variant Configurations: 6/12/24 layers, 192/384/768 hidden dimensions
  • Initialization: Default N(μ = 0, σ = 0.02), testing different means and standard deviations
  • Training Settings: 20,000 training and 20,000 test samples per task, 5 random seeds
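
As a rough sanity check on the parameter count, the common estimate of 12 * L * d^2 weights for L Transformer blocks of width d can be evaluated over the listed configurations. This is a back-of-the-envelope sketch: it ignores embeddings, biases, and layer norms, assumes a 4x MLP expansion, and assumes the base model uses a hidden dimension of 384, which the list above does not state explicitly.

```python
def approx_block_params(n_layer: int, d_model: int) -> int:
    """Approximate weight count of the Transformer blocks only."""
    attn = 4 * d_model * d_model   # Q, K, V and output projections
    mlp = 8 * d_model * d_model    # two linear maps with 4x expansion (assumed)
    return n_layer * (attn + mlp)

for n_layer in (6, 12, 24):
    for d_model in (192, 384, 768):
        millions = approx_block_params(n_layer, d_model) / 1e6
        print(f"{n_layer:>2} layers, d={d_model}: {millions:.1f}M")
```

Under these assumptions, 6 layers with d = 384 gives 10,616,832 weights, consistent with the 10.6 million figure quoted for the base model.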

Evaluation Metrics

  1. Adjacency Probability Score: Quantifies adjacency pattern strength
  2. Task Accuracy: Model performance on various tasks
  3. Probing Experiments: A 4-layer MLP is trained to predict position from features, evaluated by normalized root-mean-square error (NRMSE) and Pearson correlation (Pearson-R)
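
The two probing metrics can be sketched as follows. The normalization used for NRMSE is not specified here, so dividing by the range of the true positions is an assumption; the function names and the toy predictions are illustrative.

```python
import numpy as np

def pearson_r(y_true, y_pred) -> float:
    """Pearson correlation between true and predicted positions."""
    return float(np.corrcoef(y_true, y_pred)[0, 1])

def nrmse(y_true, y_pred) -> float:
    """RMSE divided by the range of the true values (assumed normalization)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return float(rmse / (y_true.max() - y_true.min()))

positions = np.arange(10, dtype=float)   # true positions 0..9
predicted = positions + 1.0              # a probe that is off by one everywhere

r = pearson_r(positions, predicted)      # perfectly correlated despite the offset
err = nrmse(positions, predicted)        # RMSE 1 over a range of 9
print(r, err)
```

The toy case shows why both metrics are reported together: a constant offset leaves the correlation at 1 but still registers as error in NRMSE.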

Experimental Results

Main Findings

1. Universal Existence of Adjacency Pattern

  • At the token embedding layer, adjacency probability score is approximately 0.5 (random level)
  • After the first causal attention layer, the score jumps to 0.8-1.0
  • This pattern remains stable before and after training, across different tasks and model configurations

2. Layer-wise Analysis Results

Layer       Initialized Model   Trained Model
Embedding   0.48                0.54
Layer 1     0.98                0.89
Layer 2     0.99                0.97
Layer 3     0.99                0.98
Layer 6     0.99                0.82

3. Hyperparameter Sensitivity

  • Layer Count Impact: Models with 6-24 layers all display adjacency patterns
  • Dimension Impact: Configurations with 192-768 dimensions maintain the pattern
  • Initialization Impact: Pattern is stable under standard initialization schemes (σ ≤ 0.02)

Ablation Studies

Initialization Scheme Testing

Tested different means (μ ∈ {0,4,8}) and standard deviations (σ ∈ {0.002,0.02,0.2}):

  • Small standard deviation (σ ≤ 0.02): Adjacency pattern is stable
  • Large standard deviation (σ = 0.2): Pattern disappears
  • A large mean has minimal impact on the pattern

Comparison with Variance Theory

Compared cosine similarity and embedding variance as position features through probing experiments:

Feature Type         Pearson-R   NRMSE
Embedding Vector     0.71        0.20
Embedding Variance   0.49        0.23
Cosine Similarity    0.93        0.11

Case Analysis

Figure 1 visualizes the self-cosine similarity matrix for the reversal task:

  • Initialized Model: Clear diagonal pattern emerges from Layer 1
  • Trained Model: Strong adjacency pattern maintained in early layers, gradually weakens in later layers

Related Work

Positional Encoding Research

  • Traditional Methods: Absolute positional encodings, relative positional encodings
  • Recent Discoveries: Haviv et al. (2022) showed empirically that causal Transformers can be trained without positional encodings

Causal Attention Mechanisms

  • Permutation Equivariance: Tsai et al. (2019) showed that non-causal attention is permutation-equivariant, so on its own it carries no information about token order
  • Position Information Storage: Chi et al. (2023) proposed the variance-decaying hypothesis

Contributions of This Work

Compared to Chi et al.'s variance theory, this paper's adjacency pattern hypothesis:

  1. Provides a more intuitive geometric explanation
  2. Performs better in probing experiments
  3. Applies to broader model configurations

Conclusions and Discussion

Main Conclusions

  1. Universal Adjacency Pattern: Causal Transformers naturally form adjacency patterns after the first attention layer
  2. Position Information Encoding: High similarity of adjacent embeddings enables position reconstruction
  3. Mechanism Explanation: The averaging effect mathematically explains the pattern's emergence
  4. Practical Value: Cosine similarity is more suitable than embedding variance as a position feature

Limitations

  1. Dataset Constraints: Validation primarily on synthetic tasks; generalization to real datasets requires further investigation
  2. Architecture Dependency: Conclusions based on specific Transformer architecture; applicability to other variants unknown
  3. Completeness Issues: Neither the adjacency pattern nor embedding variance fully accounts for task performance

Future Directions

  1. Large-scale Validation: Verify adjacency patterns in real language modeling tasks
  2. Mechanism Integration: Explore combining adjacency patterns with other positional encoding mechanisms
  3. Theory Refinement: Establish a more complete theoretical framework for position information representation

In-Depth Evaluation

Strengths

  1. Novel Perspective: Provides new theoretical insights by understanding position information from a geometric similarity angle
  2. Rigorous Validation: Comprehensively verifies hypotheses through multiple tasks, configurations, and analytical methods
  3. Mathematical Foundation: Provides theoretical explanation for the adjacency pattern's emergence
  4. Practical Tool: The adjacency probability score provides an effective method for quantifying position information

Weaknesses

  1. Task Limitations: Synthetic tasks may not fully reflect the complexity of real-world application scenarios
  2. Incomplete Mechanism: Acknowledges that existing theory cannot fully explain model performance
  3. Computational Cost: Self-cosine similarity matrix computation may be expensive for long sequences

Impact

  1. Theoretical Contribution: Provides a new perspective for understanding how Transformers represent position
  2. Practical Guidance: Provides theoretical support for designing models without positional encodings
  3. Research Inspiration: Opens a new direction for analyzing Transformer internals from a geometric perspective

Applicable Scenarios

  1. Lightweight Models: Designing models that omit learned positional-encoding parameters
  2. Long Sequence Processing: Sequence modeling that avoids the length constraints of positional encodings
  3. Model Analysis: Understanding and debugging Transformer internal representations

References

This paper primarily references the following important works:

  • Haviv et al. (2022): Showed empirically that training without positional encodings is feasible
  • Chi et al. (2023): Proposed variance-decaying hypothesis for position information
  • Tsai et al. (2019): Analyzed permutation properties of attention mechanisms
  • Vaswani et al. (2017): Original Transformer paper

This research provides important new perspectives for understanding how Transformers handle position information. While it has limitations in completeness, its theoretical insights and experimental findings establish a solid foundation for further development in this field.