Position Information Emerges in Causal Transformers Without Positional Encodings via Similarity of Nearby Embeddings
Zuo, Guerzhoy, Guerzhoy
Transformers with causal attention can solve tasks that require positional information without using positional encodings. In this work, we propose and investigate a new hypothesis about how positional information can be stored without using explicit positional encoding. We observe that nearby embeddings are more similar to each other than faraway embeddings, allowing the transformer to potentially reconstruct the positions of tokens. We show that this pattern can occur in both the trained and the randomly initialized Transformer models with causal attention and no positional encodings over a common range of hyperparameters.
This study investigates how Transformers with causal attention can solve position-dependent tasks without explicit positional encodings. The authors propose and investigate a new hypothesis: position information can be stored in the similarity between nearby embedding vectors. They observe that nearby embeddings are more similar to each other than distant embeddings, which could allow a Transformer to reconstruct token positions. This pattern appears in both trained and randomly initialized causal Transformer models without positional encodings, over a common range of hyperparameters.
Conventional wisdom suggests that Transformers require explicit positional encodings to process token positions in sequences. However, recent research (Haviv et al. 2022; Kazemnejad et al. 2024; Chi et al. 2023) indicates that decoder-only Transformers using only causal attention can learn position information without positional encodings.
Adjacency Pattern Hypothesis: Proposes that embeddings at adjacent positions have higher cosine similarity than embeddings at distant positions, forming an "adjacency pattern"
Theoretical Analysis: Provides a mathematical explanation for why the adjacency pattern emerges in the first layer of causal attention
Comprehensive Experimental Validation: Verifies the existence of adjacency patterns across multiple tasks, model configurations, and initialization schemes
Quantitative Assessment Method: Proposes the adjacency probability score to quantify the strength of position information
Comparative Analysis: Demonstrates through probing experiments that cosine similarity more effectively encodes position information than embedding variance
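The intuition behind the first-layer analysis listed above can be illustrated with a simplified calculation. This is a sketch under assumptions that are not taken from the paper: attention weights are uniform over the causal window, and the value vectors v_j in R^d are i.i.d. with zero mean and covariance sigma^2 I. The symbols y_i, v_j, d and sigma are introduced here for illustration only, and the last line replaces the expected cosine by a ratio of expectations, so it is a heuristic rather than an exact result.

```latex
% Illustrative assumptions: uniform causal attention, i.i.d. zero-mean
% value vectors v_j \in \mathbb{R}^d with covariance \sigma^2 I.
\begin{align*}
y_i &= \frac{1}{i} \sum_{j=1}^{i} v_j \\
\mathbb{E}\!\left[ y_i^\top y_k \right]
  &= \frac{1}{ik} \sum_{j=1}^{i} \sum_{l=1}^{k} \mathbb{E}\!\left[ v_j^\top v_l \right]
   = \frac{\min(i,k)\, \sigma^2 d}{ik}
   = \frac{\sigma^2 d}{\max(i,k)} \\
\mathbb{E}\,\lVert y_i \rVert^2 &= \frac{\sigma^2 d}{i} \\
\cos(y_i, y_k)
  &\approx \frac{\sigma^2 d / k}{\sqrt{(\sigma^2 d / i)(\sigma^2 d / k)}}
   = \sqrt{\frac{i}{k}}, \qquad i \le k
\end{align*}
```

Under these assumptions the approximate similarity depends only on the ratio i/k: it is close to 1 when the two positions are near each other and decreases as they move apart, which matches the qualitative shape of the adjacency pattern.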
The study focuses on how causal Transformers represent and utilize position information without explicit positional encodings, with emphasis on similarity patterns between embedding vectors.
The adjacency pattern refers to a property of the self-cosine similarity matrix, that is, the matrix of cosine similarities between every pair of embeddings in a sequence: values near the diagonal are higher and values far from the diagonal are lower, indicating that embeddings at adjacent positions are more similar.
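A self-cosine similarity matrix can be computed and inspected with a few lines of Python. The sketch below is a toy illustration, not the authors' experimental setup: the embeddings come from a running mean over random Gaussian vectors, used here as an assumed stand-in for one causal attention layer with uniform weights. The helper names (`causal_running_mean`, `mean_similarity_at_offset`) and the sizes are hypothetical.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def self_cosine_similarity(embeddings):
    """Matrix S with S[i][j] = cos(embeddings[i], embeddings[j])."""
    return [[cosine_similarity(u, v) for v in embeddings] for u in embeddings]


def causal_running_mean(vectors):
    """Assumed toy model of uniform causal attention:
    position i receives the mean of vectors 0..i."""
    total = [0.0] * len(vectors[0])
    out = []
    for count, vec in enumerate(vectors, start=1):
        total = [t + x for t, x in zip(total, vec)]
        out.append([t / count for t in total])
    return out


def mean_similarity_at_offset(sim, offset):
    """Average of sim[i][i + offset] over all valid i."""
    vals = [sim[i][i + offset] for i in range(len(sim) - offset)]
    return sum(vals) / len(vals)


# Hypothetical sizes, chosen only to keep the example fast.
rng = random.Random(0)
seq_len, dim = 32, 64
tokens = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(seq_len)]

embeddings = causal_running_mean(tokens)
sim = self_cosine_similarity(embeddings)

near = mean_similarity_at_offset(sim, 1)   # adjacent positions
far = mean_similarity_at_offset(sim, 16)   # positions 16 apart
print(f"mean similarity at offset 1:  {near:.3f}")
print(f"mean similarity at offset 16: {far:.3f}")
```

In this toy setting the average similarity between adjacent positions should be noticeably higher than between positions 16 apart, which is the near-diagonal versus far-from-diagonal contrast described above. For real models the same matrix would be computed from hidden states rather than from a running mean.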
This paper primarily references the following important works:
Haviv et al. (2022): First demonstrated that training without positional encodings is feasible
Chi et al. (2023): Proposed the variance-decaying hypothesis for position information
Tsai et al. (2019): Analyzed permutation properties of attention mechanisms
Vaswani et al. (2017): Original Transformer paper
This research offers a new perspective on how Transformers handle position information without positional encodings. While it has limitations in completeness, its theoretical insights and experimental findings provide a basis for further work in this area.