Position Information Emerges in Causal Transformers Without Positional Encodings via Similarity of Nearby Embeddings

Zuo, Guerzhoy, Guerzhoy

Transformers with causal attention can solve tasks that require positional information without using positional encodings. In this work, we propose and investigate a new hypothesis about how positional information can be stored without using explicit positional encoding. We observe that nearby embeddings are more similar to each other than faraway embeddings, allowing the transformer to potentially reconstruct the positions of tokens. We show that this pattern can occur in both the trained and the randomly initialized Transformer models with causal attention and no positional encodings over a common range of hyperparameters.

Basic Information

Paper ID: 2501.00073
Title: Position Information Emerges in Causal Transformers Without Positional Encodings via Similarity of Nearby Embeddings
Authors: Chunsheng Zuo (Johns Hopkins University), Pavel Guerzhoy (University of Hawai'i at Mānoa), Michael Guerzhoy (University of Toronto)
Categories: cs.CL (Computation and Language), cs.LG (Machine Learning)
Publication Date: December 30, 2024
Paper Link: https://arxiv.org/abs/2501.00073

Abstract

This study investigates how Transformers with causal attention can solve position-dependent tasks without explicit positional encodings. The authors propose and validate a novel hypothesis: position information can be stored through the similarity between adjacent embedding vectors. The research demonstrates that adjacent embeddings exhibit higher similarity than distant embeddings, enabling Transformers to reconstruct token positions. This pattern is observed both in trained and randomly initialized causal Transformer models.

Research Background and Motivation

Problem Definition

Conventional wisdom suggests that Transformers require explicit positional encodings to process token positions in sequences. However, recent research (Haviv et al. 2022; Kazemnejad et al. 2024; Chi et al. 2023) indicates that decoder-only Transformers using only causal attention can learn position information without positional encodings.

Research Motivation

Theoretical Gap: Existing research lacks deep understanding of how causal Transformers store position information
Mechanism Exploration: Chi et al. (2023) proposed that position information is stored in embedding variance, but this explanation may be insufficient
Need for New Perspectives: A new angle is needed to understand the representation mechanism of position information

Limitations of Existing Approaches

Non-causal attention mechanisms are permutation-equivariant: reordering the input tokens merely reorders the outputs, so they cannot capture position information without positional encodings
Chi et al.'s variance theory performs poorly in certain experiments and cannot fully explain observed phenomena
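
The contrast between non-causal and causal attention can be checked numerically. The sketch below is a toy single-head softmax attention with identity query, key, and value projections (a simplification; the function names and the example embeddings are hypothetical). It tests whether permuting the input and then applying attention gives the same result as applying attention and then permuting the output.

```python
import math

def attention(X, causal):
    """Single-head softmax attention with identity Q/K/V projections (a simplification)."""
    n, d = len(X), len(X[0])
    out = []
    for i in range(n):
        # Causal attention sees positions 0..i; non-causal attention sees all positions.
        visible = range(i + 1) if causal else range(n)
        logits = [sum(a * b for a, b in zip(X[i], X[j])) for j in visible]
        m = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(z - m) for z in logits]
        total = sum(weights)
        out.append([sum(w * X[j][c] for w, j in zip(weights, visible)) / total
                    for c in range(d)])
    return out

def is_equivariant(X, perm, causal, tol=1e-9):
    """True if attention(permuted X) equals the permuted attention(X)."""
    base = attention(X, causal)
    out_then_perm = [base[p] for p in perm]
    perm_then_out = attention([X[p] for p in perm], causal)
    return all(abs(a - b) < tol
               for u, v in zip(out_then_perm, perm_then_out)
               for a, b in zip(u, v))

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical token embeddings
perm = [2, 1, 0]                          # reverse the sequence
print(is_equivariant(X, perm, causal=False))  # True
print(is_equivariant(X, perm, causal=True))   # False
```

With the causal mask, each output depends on which tokens came earlier, so the order of the sequence leaves a trace in the outputs. That is the opening that lets a causal model recover position without positional encodings.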

Core Contributions

Adjacency Pattern Hypothesis: Discovers that embeddings at adjacent positions exhibit higher cosine similarity, forming an "adjacency pattern"
Theoretical Analysis: Provides mathematical explanation for why the adjacency pattern emerges in the first layer of causal attention
Comprehensive Experimental Validation: Verifies the existence of adjacency patterns across multiple tasks, model configurations, and initialization schemes
Quantitative Assessment Method: Proposes the adjacency probability score to quantify the strength of position information
Comparative Analysis: Demonstrates through probing experiments that cosine similarity more effectively encodes position information than embedding variance

Methodology Details

Task Definition

The study focuses on how causal Transformers represent and utilize position information without explicit positional encodings, with emphasis on similarity patterns between embedding vectors.

Core Concepts

Self-Cosine Similarity Matrix

For a token embedding sequence X ∈ R^(n×d) of length n and dimension d, the self-cosine similarity matrix C is defined as:

C_ij = cos θ(X_i, X_j) = (X_i · X_j) / (||X_i|| ||X_j||)
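
As a minimal sketch of this definition, the matrix can be computed directly from a list of embedding vectors. The function name and the toy vectors are hypothetical, and the code assumes no embedding is the zero vector.

```python
import math

def self_cosine_similarity(X):
    """Return the n x n matrix C with C[i][j] = cos(X[i], X[j]).

    X is a list of n embedding vectors, each a list of d floats.
    Assumes every vector has nonzero norm.
    """
    norms = [math.sqrt(sum(v * v for v in row)) for row in X]
    n = len(X)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(X[i], X[j]))
            C[i][j] = dot / (norms[i] * norms[j])
    return C

# Three toy 2-dimensional "embeddings" (hypothetical values).
X = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
C = self_cosine_similarity(X)
print([round(c, 3) for c in C[0]])  # [1.0, 0.707, 0.0]
```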

Adjacency Pattern

The adjacency pattern refers to the characteristic of the self-cosine similarity matrix where values near the diagonal are higher and values far from the diagonal are lower, indicating that embeddings at adjacent positions are more similar.

Adjacency Probability Score

To quantify the strength of the adjacency pattern, the authors propose the adjacency probability score:

For row k, the row-level adjacency probability score is defined as:

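P_Adjacency(k) = P(C_ki < C_kj | i < j) = (1 / binom(k, 2)) * Σ_{i<j} I(C_ki < C_kj)

where I(·) is the indicator function, the sum runs over the index pairs i < j considered in row k, and binom(k, 2) = k(k-1)/2 is the binomial coefficient counting those pairs.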

The adjacency probability score for the entire matrix is the average across all rows.
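
The score can be sketched in a few lines. This version makes two assumptions that the text does not spell out: rows are indexed from 0, and only entries strictly before the diagonal are compared, which yields exactly k(k-1)/2 pairs in row k. Rows with fewer than two such entries are skipped. The function names and test matrices are hypothetical.

```python
def row_adjacency_score(C, k):
    """Fraction of pairs i < j < k with C[k][i] < C[k][j] (row k, 0-based)."""
    pairs = [(i, j) for j in range(k) for i in range(j)]
    hits = sum(1 for i, j in pairs if C[k][i] < C[k][j])
    return hits / len(pairs)

def adjacency_probability_score(C):
    """Average the row-level score over every row that has at least one pair."""
    rows = range(2, len(C))
    return sum(row_adjacency_score(C, k) for k in rows) / len(rows)

n = 6
# Perfect pattern: similarity falls off with distance from the diagonal.
perfect = [[-abs(k - j) for j in range(n)] for k in range(n)]
# Inverted pattern: similarity grows with distance from the diagonal.
inverted = [[abs(k - j) for j in range(n)] for k in range(n)]
print(adjacency_probability_score(perfect))   # 1.0
print(adjacency_probability_score(inverted))  # 0.0
```

A matrix with no positional structure would be expected to score near 0.5, since each comparison is then roughly a coin flip.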

Theoretical Analysis

Averaging Effect

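In the first layer, causal attention computes the embedding at each position as a linear combination of the input embeddings at that position and all earlier ones. For two neighboring positions k+t and k+t+1:

Embedding at position k+t: Σ(i=1 to k+t) α_i * e_i
Embedding at position k+t+1: Σ(i=1 to k+t+1) β_i * e_i

The two sums run over the same input embeddings except for e_(k+t+1), so adjacent positions share almost all of their inputs, while positions farther apart share proportionally fewer. Consequently, the dot product of an embedding with a nearby embedding exceeds its dot product with a more distant one:

(Σ α_i * e_i) · (Σ β_i * e_i) - (Σ α_i * e_i) · (Σ β'_i * e_i) > 0

Here α_i and β_i are the attention weights at positions k+t and k+t+1 as above, and β'_i denotes the weights at a more distant position.

This averaging effect is the paper's explanation for the emergence of the adjacency pattern.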
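
The averaging effect can be illustrated with a small simulation. The sketch below is not the paper's experiment: it replaces learned attention with exactly uniform causal weights, draws random Gaussian vectors as stand-ins for the inputs, and scores only entries strictly before the diagonal. All names, the seed, and the sizes n and d are hypothetical choices.

```python
import math
import random

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def adjacency_score(X):
    """Adjacency probability score of the self-cosine similarity matrix of X,
    using only entries strictly before the diagonal (an assumed convention)."""
    row_scores = []
    for k in range(2, len(X)):
        sims = [cosine(X[k], X[j]) for j in range(k)]
        pairs = [(i, j) for j in range(k) for i in range(j)]
        row_scores.append(sum(sims[i] < sims[j] for i, j in pairs) / len(pairs))
    return sum(row_scores) / len(row_scores)

def causal_uniform_average(E):
    """Position k becomes the mean of E[0..k]: causal attention with equal weights."""
    out, running = [], [0.0] * len(E[0])
    for count, e in enumerate(E, start=1):
        running = [r + x for r, x in zip(running, e)]
        out.append([r / count for r in running])
    return out

rng = random.Random(0)
n, d = 24, 256  # hypothetical sequence length and embedding width
E = [[rng.gauss(0.0, 0.02) for _ in range(d)] for _ in range(n)]
before = adjacency_score(E)
after = adjacency_score(causal_uniform_average(E))
print(f"before averaging: {before:.2f}, after averaging: {after:.2f}")
```

Under these simplifying assumptions the score sits near chance before averaging and close to 1 afterwards. For independent zero-mean inputs, the expected cosine between the running means over j and k inputs (j ≤ k) is roughly sqrt(j/k), which increases as j approaches k.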

Experimental Setup

Datasets and Tasks

The authors designed four synthetic tasks requiring position information:

Addition Task: Generate answers for "123+456=", with maximum input length of 9
Reversal Task: Generate "4321" for "rev(1234)=", with maximum input length of 22
Indexing Task: Output the zero-based position of the first occurrence, e.g. "2" for "wherex(134504392,4)=", with maximum input length of 20
Ordering Task: Given original and reordered sequences, output the new index order, with maximum input length of 18
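
To make the prompt formats concrete, here is a minimal sketch of sample generators for the first three tasks, built from the examples above. The function names are hypothetical, and the addition target format (the plain decimal sum) is an assumption, since only the prompt is shown. The ordering task is omitted because its exact format is not given.

```python
def addition_sample(a: int, b: int):
    """Prompt "a+b=" with the decimal sum as the target (assumed target format)."""
    return f"{a}+{b}=", str(a + b)

def reversal_sample(s: str):
    """Prompt "rev(s)=" with the reversed string as the target."""
    return f"rev({s})=", s[::-1]

def indexing_sample(s: str, ch: str):
    """Prompt "wherex(s,ch)=" with the zero-based index of the first ch in s."""
    return f"wherex({s},{ch})=", str(s.index(ch))

print(addition_sample(123, 456))          # ('123+456=', '579')
print(reversal_sample("1234"))            # ('rev(1234)=', '4321')
print(indexing_sample("134504392", "4"))  # ('wherex(134504392,4)=', '2')
```

Each task can only be solved if the model knows where tokens sit in the sequence, which is why they serve as tests of position information.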

Model Configuration

Base Model: 6-layer NanoGPT with 10.6 million parameters
Variant Configurations: 6/12/24 layers, 192/384/768 hidden dimensions
Initialization: Default N(μ = 0, σ = 0.02), plus tests with different means and standard deviations
Training Settings: 20,000 training and 20,000 test samples per task, 5 random seeds
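
As a rough sanity check on the stated model size, the common estimate of 12 * layers * width^2 weights in the Transformer blocks can be evaluated for the listed configurations. This is a back-of-the-envelope sketch: it ignores biases, LayerNorm, and embedding tables, and it assumes the 6-layer base model uses a hidden dimension of 384, which the text does not state. The pairings in the loop are illustrative only.

```python
def approx_block_params(n_layer: int, d_model: int) -> int:
    """Weights in the Transformer blocks, ignoring biases, LayerNorm, and embeddings.

    Per block: 4*d^2 for attention (Q, K, V, output) + 8*d^2 for a 4x-wide MLP.
    """
    return 12 * n_layer * d_model ** 2

for n_layer in (6, 12, 24):
    for d_model in (192, 384, 768):
        millions = approx_block_params(n_layer, d_model) / 1e6
        print(f"{n_layer:>2} layers, d={d_model}: ~{millions:.1f}M")
```

For 6 layers and width 384 the estimate is 10,616,832, or about 10.6M, which agrees with the reported parameter count under the stated assumption.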

Evaluation Metrics

Adjacency Probability Score: Quantifies adjacency pattern strength
Task Accuracy: Model performance on various tasks
Probing Experiments: A 4-layer MLP probe predicts position information, evaluated by normalized root-mean-square error (NRMSE) and Pearson correlation (Pearson-R)
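
The two probing metrics can be sketched as follows. NRMSE has several conventions. This version divides the RMSE by the range of the true values, which is an assumption, as the normalization used in the paper is not stated here. The probe outputs are made-up numbers for illustration.

```python
import math

def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between targets and predictions."""
    n = len(y_true)
    mt, mp = sum(y_true) / n, sum(y_pred) / n
    cov = sum((t - mt) * (p - mp) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mt) ** 2 for t in y_true)
    var_p = sum((p - mp) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)

def nrmse(y_true, y_pred):
    """RMSE divided by the range of the true values (one common convention)."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse) / (max(y_true) - min(y_true))

positions = [0, 1, 2, 3, 4]            # true token positions
predicted = [0.2, 0.9, 2.1, 3.2, 3.8]  # hypothetical probe outputs
print(f"Pearson-R = {pearson_r(positions, predicted):.3f}")
print(f"NRMSE     = {nrmse(positions, predicted):.3f}")
```

A better probe pushes Pearson-R toward 1 and NRMSE toward 0.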

Experimental Results

Main Findings

1. Universal Existence of Adjacency Pattern

At the token embedding layer, adjacency probability score is approximately 0.5 (random level)
After the first causal attention layer, the score jumps to 0.8-1.0
This pattern is present both before and after training, across different tasks and model configurations, although in trained models it weakens in later layers

2. Layer-wise Analysis Results (Adjacency Probability Score)

Layer	Initialized Model	Trained Model
Embedding	0.48	0.54
Layer 1	0.98	0.89
Layer 2	0.99	0.97
Layer 3	0.99	0.98
Layer 6	0.99	0.82

3. Hyperparameter Sensitivity

Layer Count Impact: Models with 6-24 layers all display adjacency patterns
Dimension Impact: Configurations with 192-768 dimensions maintain the pattern
Initialization Impact: Pattern is stable under standard initialization schemes (σ ≤ 0.02)

Ablation Studies

Initialization Scheme Testing

Tested different means (μ ∈ {0,4,8}) and standard deviations (σ ∈ {0.002,0.02,0.2}):

Small standard deviation (σ ≤ 0.02): Adjacency pattern is stable
Large standard deviation (σ = 0.2): Pattern disappears
Large mean: Minimal impact on the pattern

Comparison with Variance Theory

Compared cosine similarity and embedding variance as position features through probing experiments:

Feature Type	Pearson-R	NRMSE
Embedding Vector	0.71	0.20
Embedding Variance	0.49	0.23
Cosine Similarity	0.93	0.11

Case Analysis

Figure 1 presents a visualization of the self-cosine similarity matrix in the reversal task:

Initialized Model: Clear diagonal pattern emerges from Layer 1
Trained Model: Strong adjacency pattern maintained in early layers, gradually weakens in later layers

Related Work

Positional Encoding Research

Traditional Methods: Absolute positional encodings, relative positional encodings
Recent Discoveries: Haviv et al. (2022) first showed empirically that causal Transformers can be trained without positional encodings

Causal Attention Mechanisms

Permutation Equivariance: Tsai et al. (2019) proved non-causal attention is permutation-equivariant
Position Information Storage: Chi et al. (2023) proposed the variance-decaying hypothesis

Contributions of This Work

Compared to Chi et al.'s variance theory, this paper's adjacency pattern hypothesis:

Provides more intuitive geometric explanation
Performs better in probing experiments
Applies to broader model configurations

Conclusions and Discussion

Main Conclusions

Universal Adjacency Pattern: Causal Transformers naturally form adjacency patterns after the first attention layer
Position Information Encoding: High similarity of adjacent embeddings enables position reconstruction
Mechanism Explanation: The averaging effect mathematically explains the pattern's emergence
Practical Value: Cosine similarity is more suitable than embedding variance as a position feature

Limitations

Dataset Constraints: Validation primarily on synthetic tasks; generalization to real datasets requires further investigation
Architecture Dependency: Conclusions based on specific Transformer architecture; applicability to other variants unknown
Completeness Issues: Neither the adjacency pattern nor embedding variance fully explains task performance

Future Directions

Large-scale Validation: Verify adjacency patterns in real language modeling tasks
Mechanism Integration: Explore combining adjacency patterns with other positional encoding mechanisms
Theory Refinement: Establish a more complete theoretical framework for position information representation

In-Depth Evaluation

Strengths

Novel Perspective: Provides new theoretical insights by understanding position information from a geometric similarity angle
Rigorous Validation: Comprehensively verifies hypotheses through multiple tasks, configurations, and analytical methods
Mathematical Foundation: Provides theoretical explanation for the adjacency pattern's emergence
Practical Tool: The adjacency probability score provides an effective method for quantifying position information

Weaknesses

Task Limitations: Synthetic tasks may not fully reflect the complexity of real-world application scenarios
Incomplete Mechanism: Acknowledges that existing theory cannot fully explain model performance
Computational Cost: Computing the self-cosine similarity matrix takes O(n²d) time and O(n²) memory for a sequence of length n, which may be expensive for long sequences

Impact

Theoretical Contribution: Provides new perspective for understanding Transformer position representation
Practical Guidance: Provides theoretical support for designing models without positional encodings
Research Inspiration: Opens new direction for analyzing Transformer internal mechanisms from geometric perspective

Applicable Scenarios

Lightweight Models: Model design reducing positional encoding parameters
Long Sequence Processing: Sequence modeling avoiding positional encoding constraints
Model Analysis: Understanding and debugging Transformer internal representations

References

This paper primarily references the following important works:

Haviv et al. (2022): First showed the feasibility of training without positional encodings
Chi et al. (2023): Proposed variance-decaying hypothesis for position information
Tsai et al. (2019): Analyzed permutation properties of attention mechanisms
Vaswani et al. (2017): Original Transformer paper

This research provides important new perspectives for understanding how Transformers handle position information. While it has limitations in completeness, its theoretical insights and experimental findings establish a solid foundation for further development in this field.