Translution: Unifying Self-attention and Convolution for Adaptive and Relative Modeling

Fan, Yang, Kankanhalli et al.

When modeling a given type of data, we consider it to involve two key aspects: 1) identifying relevant elements (e.g., image pixels or textual words) to a central element, as in a convolutional receptive field, or to a query element, as in self-attention, and 2) encoding these elements effectively. Self-attention can adaptively identify these elements but relies on absolute positional embedding for structural representation learning. In contrast, convolution encodes elements in a relative manner, yet its fixed kernel size limits its ability to adaptively select the relevant elements. In this paper, we introduce Translution, an operation that unifies the adaptive identification capability of self-attention and the relative encoding advantage of convolution. However, this integration leads to a substantial increase in the number of parameters, exceeding most currently available computational resources. Therefore, we propose a lightweight variant of Translution, named α-Translution. Experiments on computer vision and natural language processing tasks show that Translution (including α-Translution) achieves superior accuracy compared to self-attention. The code is available at https://github.com/hehefan/Translution.

Basic Information

Paper ID: 2510.10060
Title: Translution: Unifying Self-attention and Convolution for Adaptive and Relative Modeling
Authors: Hehe Fan (Zhejiang University), Yi Yang (Zhejiang University), Mohan Kankanhalli (National University of Singapore), Fei Wu (Zhejiang University)
Classification: cs.LG cs.AI cs.CL cs.CV
Publication Date: October 11, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2510.10060v1

Abstract

The authors argue that data modeling involves two key aspects: 1) identifying the elements related to a central element (as in a convolutional receptive field) or to a query element (as in self-attention); 2) effectively encoding these elements. Self-attention can adaptively identify these elements but relies on absolute positional embeddings for structural representation learning. In contrast, convolution encodes elements in a relative manner, but its fixed kernel size limits its ability to adaptively select relevant elements. This paper proposes the Translution operation, which unifies the adaptive identification capability of self-attention with the relative encoding advantage of convolution. However, this integration results in a substantial increase in parameters, exceeding most currently available computational resources. Therefore, the authors propose a lightweight variant, α-Translution. Experiments show that Translution (including α-Translution) achieves higher accuracy than self-attention on both computer vision and natural language processing tasks.

Research Background and Motivation

Problem Definition

The core challenge facing current deep learning is how to effectively model data. The authors decompose data modeling into two key aspects:

Relevant Element Identification: Determining which data elements are relevant to the currently processed element
Effective Encoding: Encoding these relevant elements into effective representations

Limitations of Existing Methods

Limitations of Convolutional Neural Networks:

Use fixed-size kernels to define local receptive fields
Cannot avoid including irrelevant pixels, particularly at object boundaries or background regions
Encode local structure in a relative manner, but cannot adapt the receptive field to the input

Limitations of Self-attention Mechanisms:

Can adaptively identify relevant regions without predefined locality constraints
Rely on absolute positional embeddings to capture structural information
May struggle to recognize identical objects when they move to different positions

Research Motivation

As direct extensions of Transformer-style models encounter diminishing returns, AI laboratories have observed that the improvements of next-generation models fall short of expectations. With training data saturating and current scaling laws reaching their limits, designing new neural network architectures has become critical.

Core Contributions

Proposes Translution Operation: Unifies the adaptive identification capability of self-attention with the relative encoding advantages of convolution
Designs α-Translution Lightweight Variant: Significantly reduces parameter count, making the approach feasible with current computational resources
Theoretical Unification: Proves that convolution and self-attention can be viewed as special cases of Translution
Experimental Validation: Validates the method's effectiveness on computer vision and natural language processing tasks
Open-source Implementation: Provides complete code implementation for community use

Method Details

Task Definition

Given input data (image patches or text tokens), the objective is to learn an operation that can:

Adaptively identify elements relevant to the query element
Encode structural relationships of these elements in a relative manner
Generate effective output representations

Model Architecture

Translution Operation

Translution adopts a convolution-style approach, assigning different parameter matrices for each distance and direction:

Relative query encoding: q_{i,j} = f_i · W^q_{δx,δy}, where δx = x_i - x_j and δy = y_i - y_j
Relative key encoding: k_{j,i} = f_j · W^k_{-δx,-δy}
Relative attention: a_{i,j} = (q_{i,j} · k_{j,i}^T) / √C', α_{i,j} = e^{a_{i,j}} / ∑_n e^{a_{i,n}}
Relative value encoding: v_{i,j} = f_j · W^v_{δx,δy}
Weighted summation: f'_i = ∑_j α_{i,j} · v_{i,j}

where W^q_{δx,δy}, W^k_{δx,δy}, W^v_{δx,δy} ∈ R^{C×C'} are learnable parameter matrices corresponding to displacement (δx,δy).
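
The five steps above can be sketched in NumPy for a small H×W grid. This is a minimal, unoptimized sketch and not the authors' implementation: the function name `translution`, the row-major token layout, the single attention head, and the `(2H-1, 2W-1, C, C')` parameter layout are all assumptions made for illustration. It materializes every N×N relative matrix, so it is only practical for tiny inputs.

```python
import numpy as np

def translution(f, H, W, Wq, Wk, Wv):
    """Single-head Translution on an H x W grid (illustrative sketch).

    f          : (H*W, C) tokens in row-major order (assumed layout).
    Wq, Wk, Wv : (2H-1, 2W-1, C, C'), one matrix per offset (dy, dx).
    """
    N = H * W
    Cp = Wq.shape[-1]
    ys, xs = np.divmod(np.arange(N), W)        # (y_i, x_i) of every token
    # Offsets delta(i, j) = pos_i - pos_j, shifted so they index from 0.
    dy = ys[:, None] - ys[None, :] + (H - 1)   # (N, N)
    dx = xs[:, None] - xs[None, :] + (W - 1)   # (N, N)

    # q[i, j] = f_i @ Wq[delta(i, j)]
    q = np.einsum("ic,ijcd->ijd", f, Wq[dy, dx])
    # k[i, j] stores k_{j,i} = f_j @ Wk[-delta(i, j)], and -delta(i, j) = delta(j, i)
    k = np.einsum("jc,ijcd->ijd", f, Wk[dy.T, dx.T])
    # v[i, j] = f_j @ Wv[delta(i, j)]
    v = np.einsum("jc,ijcd->ijd", f, Wv[dy, dx])

    a = (q * k).sum(-1) / np.sqrt(Cp)          # (N, N) relative attention logits
    a = a - a.max(axis=1, keepdims=True)       # subtract row max for stability
    alpha = np.exp(a)
    alpha /= alpha.sum(axis=1, keepdims=True)  # softmax over j
    return np.einsum("ij,ijd->id", alpha, v)   # f'_i = sum_j alpha_ij * v_ij
```

If every offset is given the same matrix, the dependence on (δx, δy) disappears and the function reduces to ordinary single-head self-attention without positional information.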

α-Translution Lightweight Variant

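Since each Translution projection requires (2H-1)×(2W-1)×C×C' parameters (one C×C' matrix for every possible offset on an H×W grid), α-Translution reduces the parameter count by shrinking the input and output dimensions of the relative matrices: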

W^q_{δx,δy} ⇒ W^q_1 · W^q_{δx,δy}
W^k_{δx,δy} ⇒ W^k_1 · W^k_{δx,δy}
W^v_{δx,δy} ⇒ W^v_1 · W^v_{δx,δy} · W^v_2

where C1 ≪ C and C2 ≪ C' are the compressed input and output dimensions.
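
To get a feel for the savings, the following sketch counts relative parameters before and after the factorization. The matrix shapes are an assumption inferred from the formulas above, not taken from the paper: W_1 is C×C1, the compressed query and key matrices are C1×C', the compressed value matrix is C1×C2, and W^v_2 is C2×C'. The grid size and channel widths in the example are hypothetical, and the counts are not meant to reproduce the totals reported in the experiments.

```python
def full_params(H, W, C, Cp):
    """Parameters of ONE full relative projection (query, key, or value)."""
    offsets = (2 * H - 1) * (2 * W - 1)   # number of distinct (dy, dx) offsets
    return offsets * C * Cp

def alpha_params(H, W, C, Cp, C1, C2):
    """Relative parameters of alpha-Translution under the ASSUMED shapes:
    W_1: C x C1, query/key per offset: C1 x C', value per offset: C1 x C2,
    W^v_2: C2 x C'."""
    offsets = (2 * H - 1) * (2 * W - 1)
    query = C * C1 + offsets * C1 * Cp
    key = C * C1 + offsets * C1 * Cp
    value = C * C1 + offsets * C1 * C2 + C2 * Cp
    return query + key + value

# Hypothetical setting: 14 x 14 grid, C = C' = 192, C1 = C2 = 8.
full = 3 * full_params(14, 14, 192, 192)          # query + key + value
alpha = alpha_params(14, 14, 192, 192, 8, 8)
print(f"full: {full:,}  alpha: {alpha:,}  ratio: {full / alpha:.1f}x")
```

Under these assumptions the offset-dependent part shrinks roughly in proportion to C1/C for queries and keys, and to (C1×C2)/(C×C') for values.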

Technical Innovations

1. Theoretical Unification

The authors prove that convolution and self-attention are special cases of Translution:

Convolution: Attention weights are 1 within the receptive field and 0 outside
Self-attention: Uses shared W^q, W^k, W^v parameters, ignoring directional and distance encoding
Translution: Combines the advantages of both
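
The convolution special case can be checked numerically in one dimension. The sketch below fixes the attention weights to 1 inside a window of radius r and 0 outside, then compares the result with a zero-padded convolution whose kernel is built from the same offset matrices. The function names, the 1D setting, and the `(2N-1, C, C')` layout of `Wv` are assumptions chosen for illustration.

```python
import numpy as np

def translution_fixed_window(f, Wv, r):
    """1D Translution with attention fixed to 1 for |i - j| <= r, else 0.

    f  : (N, C) sequence of tokens.
    Wv : (2N-1, C, C'), Wv[d + N - 1] is the value matrix for offset d = i - j.
    """
    N = f.shape[0]
    out = np.zeros((N, Wv.shape[-1]))
    for i in range(N):
        for j in range(N):
            d = i - j
            if abs(d) <= r:
                out[i] += f[j] @ Wv[d + N - 1]
    return out

def kernel_from_offsets(Wv, N, r):
    """Kernel tap t acts on f[i + t - r], i.e. on offset d = i - j = r - t."""
    return np.stack([Wv[N - 1 + r - t] for t in range(2 * r + 1)])

def conv1d_zero_pad(f, kernel):
    """Zero-padded 1D convolution with a (2r+1, C, C') kernel."""
    r = (kernel.shape[0] - 1) // 2
    N = f.shape[0]
    padded = np.pad(f, ((r, r), (0, 0)))   # zeros outside the sequence
    return sum(padded[t:t + N] @ kernel[t] for t in range(2 * r + 1))
```

With random inputs, `translution_fixed_window(f, Wv, r)` and `conv1d_zero_pad(f, kernel_from_offsets(Wv, N, r))` agree up to floating-point error, which is the sense in which convolution is a special case.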

2. Relative Position Encoding

Unlike existing methods (scalar bias or vector addition), Translution uses offset-based matrices for relative encoding, better capturing directional and distance information.
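
The difference between the three styles shows up in how a single attention logit is computed. The sketch below contrasts a scalar bias (in the style of Swin Transformer), vector addition (in the style of Shaw et al.), and offset-specific matrices as in Translution. Function and argument names are hypothetical, and each function is a simplified single-head form rather than any library's exact implementation.

```python
import numpy as np

def logit_scalar_bias(f_i, f_j, Wq, Wk, b_delta):
    """Shared projections; one learned scalar per offset is added to the logit."""
    q, k = f_i @ Wq, f_j @ Wk
    return q @ k / np.sqrt(q.shape[0]) + b_delta

def logit_vector(f_i, f_j, Wq, Wk, r_delta):
    """Shared projections; one learned C'-vector per offset is added to the key."""
    q, k = f_i @ Wq, f_j @ Wk
    return q @ (k + r_delta) / np.sqrt(q.shape[0])

def logit_matrix(f_i, f_j, Wq_delta, Wk_neg_delta):
    """Translution-style: the projection matrices themselves depend on the offset."""
    q, k = f_i @ Wq_delta, f_j @ Wk_neg_delta
    return q @ k / np.sqrt(q.shape[0])
```

Per offset, the first form learns one number, the second learns C' numbers, and the third learns two C×C' matrices. This is where both the added expressiveness and the added parameter cost come from.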

3. Memory-Optimized Implementation

The authors design a memory-efficient implementation of α-Translution that reduces peak memory usage from N×N×C' to N×C' + N×N×C2, where N is the number of tokens.
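
As a worked example of these two expressions, the snippet below evaluates them for a hypothetical configuration (N = 196 tokens, C' = 192, C2 = 8). The numbers count tensor elements, not bytes, and the configuration is an assumption chosen only to make the arithmetic concrete.

```python
def peak_elements_naive(N, Cp):
    """One C'-dimensional relative value for every (query, key) pair."""
    return N * N * Cp

def peak_elements_optimized(N, Cp, C2):
    """Per-token C'-dimensional tensor plus per-pair C2-dimensional tensor."""
    return N * Cp + N * N * C2

# Hypothetical configuration: 196 tokens, C' = 192, C2 = 8.
naive = peak_elements_naive(196, 192)
optimized = peak_elements_optimized(196, 192, 8)
print(f"naive: {naive:,}  optimized: {optimized:,}  saving: {naive / optimized:.1f}x")
```

Because C2 ≪ C', the N×N term dominates both expressions and the saving approaches C'/C2 as N grows.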

Experimental Setup

Datasets

Computer Vision Tasks:

Dynamic MNIST: Synthetic dataset with digits moving within an 84×84 pixel region
Static MNIST: Control dataset with digits fixed at image center
ImageNet-1K: Large-scale image classification dataset with 1000 classes
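
The two MNIST variants can be pictured with a short sketch that pastes a digit onto an 84×84 canvas, either at the center (static) or at a random location (dynamic). The 28×28 digit size is the standard MNIST resolution; the uniform random placement and the function name `place_digit` are assumptions, since the exact generation procedure is not described here.

```python
import numpy as np

def place_digit(digit, canvas_size=84, centered=False, rng=None):
    """Paste a 2D digit image onto a blank square canvas.

    centered=True mimics the static setting, centered=False the dynamic one
    (uniform random top-left corner, an assumed sampling scheme).
    """
    h, w = digit.shape
    canvas = np.zeros((canvas_size, canvas_size), dtype=digit.dtype)
    if centered:
        top, left = (canvas_size - h) // 2, (canvas_size - w) // 2
    else:
        rng = np.random.default_rng() if rng is None else rng
        top = int(rng.integers(0, canvas_size - h + 1))
        left = int(rng.integers(0, canvas_size - w + 1))
    canvas[top:top + h, left:left + w] = digit
    return canvas, (top, left)
```

A model trained only on centered canvases and tested on randomly placed ones corresponds to the Static→Dynamic setting, which probes how well the learned representation transfers across positions.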

Natural Language Processing Tasks:

OpenWebText: 9 billion training tokens, 4 million validation tokens, 50K vocabulary

Evaluation Metrics

Image Classification: Top-1 and Top-5 accuracy
Language Modeling: Perplexity
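
For reference, both metrics follow their standard definitions: top-k accuracy is the fraction of samples whose true label is among the k highest-scoring classes, and perplexity is the exponential of the mean negative log-likelihood per token. The helper names below are illustrative and are not taken from the paper's code.

```python
import math

def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        hits += label in ranked[:k]
    return hits / len(labels)

def perplexity(token_log_probs):
    """exp(mean negative log-likelihood), with natural-log token probabilities."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))
```

Lower perplexity is better: a model that assigns probability 1/4 to every correct token has a perplexity of 4.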

Comparison Methods

Standard self-attention (Transformer baseline)
Relative position encoding variants (Shaw et al., Swin Transformer, ConViT, RoFormer, etc.)
Absolute encoding variants (for ablation studies)

Implementation Details

Architecture configuration: depth of 6-12 layers, embedding dimension of 192-384, 3-6 attention heads
α-Translution default compression dimensions: C1 = C2 = 8
Batch size: 256 (ImageNet), 8 (OpenWebText)
All models are trained from scratch without external pretraining

Experimental Results

Main Results

Dynamic MNIST Experiments

Method	Parameters	Static→Static	Dynamic→Dynamic	Static→Dynamic
Self-attention	2.7M	98.48%	92.64%	18.18%
α-Translution	4.6M	98.48%	97.31%	34.90%
Translution	116.2M	98.60%	97.35%	36.40%

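Key Finding: Translution and α-Translution clearly outperform self-attention when digit positions vary (97.35% and 97.31% vs. 92.64% on Dynamic→Dynamic; 36.40% and 34.90% vs. 18.18% on Static→Dynamic), demonstrating the advantage of relative encoding.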

ImageNet-1K Experiments

Using ViT-A/56 as an example:

Method	Parameters	Top-1	Top-5
Self-attention	4.7M	46.28%	71.17%
α-Translution	5.3M	48.36%	73.31%
Translution	38.5M	52.41%	76.50%

Natural Language Modeling Experiments

Method	Parameters	Perplexity
Self-attention	22.0M	60.40
α-Translution	23.7M	57.97
Translution	127.5M	56.26

Ablation Studies

1. Impact of Parameter Increase vs. Relative Encoding

Experiments show that simply increasing the parameter count while keeping absolute encoding does not improve performance, which indicates that the gains come from relative encoding itself rather than from the added parameters.

2. Impact of Relative Encoding Dimensions

As C1 and C2 increase, α-Translution performance improves, but parameter count also increases, presenting an efficiency-effectiveness trade-off.

3. Comparison of Position Encoding Methods

Method	Parameters	Top-1	Top-5
No positional embedding	4.69M	42.49%	67.39%
Standard positional embedding	4.69M	46.28%	71.17%
Swin Transformer	4.69M	46.36%	71.31%
RoFormer	4.69M	46.65%	71.51%
α-Translution	5.33M	48.36%	73.31%
Translution	38.53M	52.41%	76.50%

Experimental Findings

Importance of Relative Encoding: In position-variation scenarios, relative encoding significantly outperforms absolute encoding
Parameter Efficiency: α-Translution achieves significant performance improvements with modest parameter increases
Cross-modal Effectiveness: The method is effective on both vision and language tasks
Memory Constraints: Current GPU memory limits restrict large-scale experiments; larger-scale evaluations would require 2-3 TB of memory
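
The parameter-efficiency finding can be made concrete with the ImageNet-1K numbers from the position-encoding comparison table. The snippet computes the Top-1 gain over the standard self-attention baseline per additional million parameters. The metric itself is an illustrative choice, not one reported by the authors.

```python
# (parameters in millions, Top-1 accuracy in %) from the ImageNet-1K comparison table
RESULTS = {
    "Self-attention": (4.69, 46.28),
    "alpha-Translution": (5.33, 48.36),
    "Translution": (38.53, 52.41),
}

def gain_per_million(name, baseline="Self-attention"):
    """Top-1 gain over the baseline divided by the extra parameters (in millions)."""
    base_params, base_top1 = RESULTS[baseline]
    params, top1 = RESULTS[name]
    return (top1 - base_top1) / (params - base_params)

for name in ("alpha-Translution", "Translution"):
    print(f"{name}: {gain_per_million(name):.2f} Top-1 points per extra million parameters")
```

By this measure α-Translution gains about 3.25 points per extra million parameters, against roughly 0.18 for full Translution, although full Translution reaches the higher absolute accuracy.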

Related Work

Position Encoding Research

The authors categorize related work into three types:

Relative Position Vectors: Shaw et al., BoTNet, HaloNet, etc.
Relative Position Scalars: Swin Transformer, CoAtNet, ConViT, etc.
Rotary Position Embeddings: RoFormer, etc.

Combining Convolution and Attention

Architecture-level Combination: Conformer, CeiT, etc. use convolution and attention in different layers
Module-level Combination: Translution unifies both at the fundamental operation level

Conclusions and Discussion

Main Conclusions

Translution successfully unifies the adaptive identification capability of self-attention with the relative encoding advantages of convolution
α-Translution provides a good balance between parameter efficiency and performance
Relative encoding significantly outperforms absolute encoding when handling position variations
The method demonstrates improvements across multiple tasks and modalities

Limitations

Computational Resource Requirements: Full Translution requires substantial parameters and memory
Evaluation Scale Limitations: Due to resource constraints, evaluation is primarily on small-to-medium scale architectures
Scenario-specific Optimization: Certain relative positions may share parameters, particularly at greater distances

Future Directions

Optimized Variant Exploration: Design more efficient Translution variants
Multi-modal Extension: Extend to other modalities such as 3D, video, and molecules
Architecture Design: Design more effective specialized architectures for Translution
Large-scale Evaluation: Validate on larger-scale frameworks and datasets

In-depth Evaluation

Strengths

Theoretical Contribution: Provides a unified and theoretically elegant perspective on convolution and self-attention
Practical Value: α-Translution delivers performance improvements even under resource constraints
Comprehensive Experiments: Covers multiple tasks, datasets, and ablation studies
Clear Problem Identification: Clearly identifies and addresses core limitations of existing methods
Open-source Contribution: Provides complete implementation, promoting community research

Weaknesses

Resource Requirements: Computational demands of the full method may limit practical applications
Evaluation Scale: Lacks evaluation on large-scale models due to resource constraints
Theoretical Analysis: Lacks in-depth theoretical analysis of convergence and optimization properties
Comparison Fairness: Significant parameter differences with baselines may affect comparison fairness

Impact

Academic Value: Provides new insights into combining attention mechanisms and convolution
Practical Prospects: The practicality of α-Translution makes it likely to be adopted in real applications
Inspirational Significance: May inspire further research on fundamental operation unification

Applicable Scenarios

Position-sensitive Tasks: Particularly suitable for tasks requiring position-variation handling
Structured Data: Effective on data with spatial or sequential structure such as images and text
Resource-rich Environments: Full Translution suitable for scenarios with abundant computational resources
Research Exploration: Provides new directions for fundamental architecture research

References

The paper cites important works in the deep learning field, including:

Original Transformer paper (Vaswani et al., 2017)
Vision Transformer (Dosovitskiy et al., 2021)
Related work on relative position encoding (Shaw et al., 2018; Liu et al., 2021, etc.)
Classical convolutional neural network works (LeCun et al., 1998; He et al., 2016, etc.)

Overall Assessment: This is a high-quality paper with contributions in both theory and practice. While it faces challenges regarding computational resource requirements, the proposed α-Translution variant effectively balances performance and efficiency. The paper provides a novel perspective on unifying fundamental operations in deep learning, possessing significant academic value and practical significance.