Position Information Emerges in Causal Transformers Without Positional Encodings via Similarity of Nearby Embeddings
Zuo, Guerzhoy, Guerzhoy
Transformers with causal attention can solve tasks that require positional information without using positional encodings. In this work, we propose and investigate a new hypothesis about how positional information can be stored without explicit positional encodings. We observe that nearby embeddings are more similar to each other than faraway embeddings, which potentially allows the Transformer to reconstruct the positions of tokens. We show that this pattern can occur in both trained and randomly initialized Transformer models with causal attention and no positional encodings, over a common range of hyperparameters.
academic
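Conventional wisdom holds that Transformers need explicit positional encodings to handle the positions of tokens in a sequence. Recent work (Haviv et al. 2022; Kazemnejad et al. 2024; Chi et al. 2023), however, shows that decoder-only Transformers using only causal attention can learn positional information without positional encodings.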
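The similarity pattern described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' experimental setup: it assumes a single causal attention head with no residual connection, random Gaussian inputs in place of token embeddings, and weights drawn with standard deviation 0.02. The names `causal_attention` and `mean_similarity_by_offset` and all sizes are hypothetical choices made for illustration. Under these assumptions, attention is close to uniform, so the output at position i is roughly a running average of the first i value vectors, and running averages at nearby positions share most of their terms.

```python
import numpy as np

def causal_attention(x, w_q, w_k, w_v):
    """Single-head causal self-attention with no positional encodings."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(x.shape[-1])
    # Mask out future positions so token i attends only to tokens 0..i.
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Row-wise softmax; the diagonal is never masked, so each row max is finite.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def mean_similarity_by_offset(h):
    """Average cosine similarity between positions i and i + offset."""
    unit = h / np.linalg.norm(h, axis=-1, keepdims=True)
    sim = unit @ unit.T
    return np.array([np.diagonal(sim, offset=o).mean() for o in range(1, len(h))])

# Assumed sizes and initialization scale (illustrative, not from the paper).
rng = np.random.default_rng(0)
seq_len, d_model, n_seqs = 32, 64, 16
w_q, w_k, w_v = (rng.normal(0.0, 0.02, (d_model, d_model)) for _ in range(3))

# Average the per-offset similarity over several random input sequences.
by_offset = np.mean(
    [
        mean_similarity_by_offset(
            causal_attention(rng.normal(size=(seq_len, d_model)), w_q, w_k, w_v)
        )
        for _ in range(n_seqs)
    ],
    axis=0,
)

near, far = by_offset[0], by_offset[15]  # offsets 1 and 16
print(f"offset 1: {near:.3f}, offset 16: {far:.3f}")
```

In this toy setting the average cosine similarity at offset 1 comes out higher than at offset 16, which matches the qualitative claim that nearby embeddings are more similar than faraway ones. Whether the same holds in a full model with residual connections, multiple layers, and trained weights is what the paper investigates; this sketch does not establish it.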