Translution: Unifying Self-attention and Convolution for Adaptive and Relative Modeling
Fan, Yang, Kankanhalli et al.
When modeling a given type of data, we consider it to involve two key aspects: 1) identifying relevant elements (e.g., image pixels or textual words) to a central element, as in a convolutional receptive field, or to a query element, as in self-attention, and 2) encoding these elements effectively. Self-attention can adaptively identify these elements but relies on absolute positional embedding for structural representation learning. In contrast, convolution encodes elements in a relative manner, yet its fixed kernel size limits its ability to adaptively select the relevant elements. In this paper, we introduce Translution, an operation that unifies the adaptive identification capability of self-attention and the relative encoding advantage of convolution. However, this integration leads to a substantial increase in the number of parameters, exceeding most currently available computational resources. Therefore, we propose a lightweight variant of Translution, named α-Translution. Experiments on computer vision and natural language processing tasks show that Translution (including α-Translution) achieves superior accuracy compared to self-attention. The code is available at https://github.com/hehefan/Translution.
Title: Translution: Unifying Self-attention and Convolution for Adaptive and Relative Modeling
Authors: Hehe Fan (Zhejiang University), Yi Yang (Zhejiang University), Mohan Kankanhalli (National University of Singapore), Fei Wu (Zhejiang University)
Classification: cs.LG cs.AI cs.CL cs.CV
Publication Date: October 11, 2025 (arXiv preprint)
The authors argue that data modeling involves two key aspects: 1) identifying the elements relevant to a central element (as in a convolutional receptive field) or to a query element (as in self-attention); 2) effectively encoding these elements. Self-attention can adaptively identify these elements but relies on absolute positional embeddings for structural representation learning. In contrast, convolution encodes elements in a relative manner, but its fixed kernel size limits its ability to adaptively select relevant elements. The paper proposes the Translution operation, which unifies the adaptive identification capability of self-attention with the relative encoding advantage of convolution. However, this integration substantially increases the number of parameters, exceeding most currently available computational resources. The authors therefore propose a lightweight variant, α-Translution. Experiments on computer vision and natural language processing tasks show that Translution (including α-Translution) achieves higher accuracy than self-attention.
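The contrast drawn above can be made concrete with a small sketch. This summary does not give the paper's actual formulation, so the code below rests on an assumption: "relative encoding" is taken to mean that the query, key, and value projections are selected by the offset between the two positions, much as a convolution kernel holds one weight per offset, while the softmax weights remain content-dependent as in self-attention. The names `offset_attention`, `make_offset_weights`, and `count_parameters` are hypothetical and do not come from the paper or its repository.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (nested lists, rows x cols) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_offset_weights(rng, n, dim):
    """One (dim x dim) matrix per relative offset in [-(n-1), n-1]."""
    return {
        d: [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]
        for d in range(-(n - 1), n)
    }


def offset_attention(xs, Wq, Wk, Wv):
    """Attention over a 1-D sequence with offset-indexed projections.

    ASSUMPTION: this form is illustrative only; it is not taken from the paper.
    Wq, Wk, Wv map a relative offset (j - i) to a projection matrix.
    """
    n, dim = len(xs), len(xs[0])
    out = []
    for i in range(n):
        scores, values = [], []
        for j in range(n):
            d = j - i  # relative position, as in a convolution kernel index
            q = matvec(Wq[d], xs[i])
            k = matvec(Wk[d], xs[j])
            scores.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(dim))
            values.append(matvec(Wv[d], xs[j]))
        weights = softmax(scores)  # adaptive: depends on the content of xs
        out.append(
            [sum(w * v[c] for w, v in zip(weights, values)) for c in range(dim)]
        )
    return out


def count_parameters(n, dim, per_offset):
    """Q, K, V projection parameters for a length-n sequence (no biases)."""
    num_matrices = (2 * n - 1) if per_offset else 1
    return 3 * num_matrices * dim * dim


rng = random.Random(0)
n, dim = 5, 4
xs = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n)]
Wq, Wk, Wv = (make_offset_weights(rng, n, dim) for _ in range(3))
ys = offset_attention(xs, Wq, Wk, Wv)

print(len(ys), len(ys[0]))                        # 5 4
print(count_parameters(n, dim, per_offset=False))  # 48
print(count_parameters(n, dim, per_offset=True))   # 432
```

Under this assumption the projection parameters grow with the number of distinct offsets (here 2n - 1 = 9 times more than shared projections), which is one way to picture the parameter increase that the summary says motivates α-Translution. How α-Translution reduces that cost is not described here, so the sketch does not attempt it.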
As direct extensions of models such as Transformers encounter diminishing returns, AI laboratories have observed that the improvement rates of next-generation models fall short of expectations. With training data approaching saturation and current scaling laws reaching their limits, designing innovative neural network architectures has become critical.
Experiments show that simply increasing the number of parameters while retaining absolute encoding does not improve performance, which indicates that the gains come from the relative encoding method itself rather than from the additional parameters.
The paper cites important works in the deep learning field, including:
Original Transformer paper (Vaswani et al., 2017)
Vision Transformer (Dosovitskiy et al., 2021)
Related work on relative position encoding (Shaw et al., 2018; Liu et al., 2021, etc.)
Classical convolutional neural network works (LeCun et al., 1998; He et al., 2016, etc.)
Overall Assessment: This is a high-quality paper with contributions in both theory and practice. Although the full Translution operation is limited by its computational resource requirements, the proposed α-Translution variant balances performance and efficiency. The paper offers a novel perspective on unifying two fundamental operations in deep learning and is of both academic and practical value.